MODULE 4:
SAMPLING DISTRIBUTION
Course overview
The course introduces the students to various methods of statistical analyses as applied in various industries and
enterprises. Through the use of primary statistical techniques, the students attain a meaningful understanding of
statistical reasoning within the context of management decision-making. Topics focus on statistical
description, statistical induction, and the analysis of statistical relationships.
Objectives
After successful completion of this module, the student should be able to:
• Identify their learning outcomes and expectations for the course;
• Recognize their capacity to create new understandings from reflecting on the course;
• Know the capabilities of Descriptive Statistics.
Module Content:
Sampling Distribution
o Mean and Standard Deviation of Sample Mean
o The Sampling Distribution of the Sample Mean
Statistical Relationship
o Overview of Statistical Relationship
o Line graph and Scatter Plot
o Correlation
o The Correlation Coefficient and Cohen’s D
Supplemental Videos
Sampling Distributions
A statistic, such as the sample mean or the sample standard deviation, is a number computed from a sample. Since a sample
is random, every statistic is a random variable: it varies from sample to sample in a way that cannot be predicted with
certainty. As a random variable it has a mean, a standard deviation, and a probability distribution. The probability distribution
of a statistic is called its sampling distribution.
This module introduces the concepts of the mean, the standard deviation, and the sampling distribution of a sample statistic,
with an emphasis on the sample mean X̄.
The Mean and Standard Deviation of the Sample Mean
Suppose we wish to estimate the mean μ of a population. In actual practice we would typically take just one sample.
Imagine, however, that we take sample after sample, all of the same size n, and compute the sample mean x̄ of each
one. We will likely get a different value of x̄ each time. The sample mean is a random variable: it varies from sample to
sample in a way that cannot be predicted with certainty. We will write X̄ when the sample mean is thought of as a
random variable, and write x̄ for the values that it takes. The random variable X̄ has a mean, denoted μX̄,
and a standard deviation, denoted σX̄. Here is an example with such a small population and small sample size that we
can actually write down every single sample.
Now we apply the formulas from Section 4.2.2, "The Mean and Standard Deviation of a Discrete Random Variable" in
"Discrete Random Variables", for the mean and standard deviation of a discrete random variable, and we obtain

μX̄ = 158 and σX̄ = √10.
The mean and standard deviation of the population {152, 156, 160, 164} in the example are μ = 158 and
σ = √20. The mean of the sample mean X̄ that we have just computed is exactly the mean of the population. The
standard deviation of the sample mean X̄ that we have just computed is the standard deviation of the population
divided by the square root of the sample size: √10 = √20/√2. These relationships are not
coincidences, but are illustrations of the following formulas.
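
μX̄ = μ and σX̄ = σ/√n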
The first formula says that if we could take every possible sample from the population and compute the corresponding
sample mean, then those numbers would center at the number we wish to estimate, the population mean μ.
The second formula says that averages computed from samples vary less than individual measurements on the
population do, and quantifies the relationship.
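
As a quick check of these two formulas, here is a short Python sketch (an added illustration, not part of the source text) that enumerates all 16 ordered samples of size n = 2 drawn with replacement from the four-rower population {152, 156, 160, 164} and computes μX̄ and σX̄ directly:

```python
# Verify mu_xbar = mu and sigma_xbar = sigma / sqrt(n) by brute force
# for the four-rower population and samples of size n = 2.
from itertools import product
from math import sqrt

population = [152, 156, 160, 164]
N = len(population)
mu = sum(population) / N
sigma = sqrt(sum((x - mu) ** 2 for x in population) / N)

# Every possible ordered sample of size 2 (with replacement) and its mean.
sample_means = [(a + b) / 2 for a, b in product(population, repeat=2)]

mu_xbar = sum(sample_means) / len(sample_means)
sigma_xbar = sqrt(sum((m - mu_xbar) ** 2 for m in sample_means) / len(sample_means))

print(mu, sigma)            # 158.0, 4.472... (sigma = sqrt(20))
print(mu_xbar, sigma_xbar)  # 158.0, 3.162... (= sqrt(10) = sigma / sqrt(2))
```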
The Sampling Distribution of the Sample Mean
In "Example 1" in "The Mean and Standard Deviation of the Sample Mean" we constructed the probability distribution
of the sample mean for samples of size two drawn from the population of four rowers. The probability distribution is:
Figure 2.1 "Distribution of a Population and a Sample Mean" shows a side-by-side comparison of a histogram for the
original population and a histogram for this distribution. Whereas the distribution of the population is uniform, the
sampling distribution of the mean has a shape approaching the shape of the familiar bell curve. This phenomenon of
the sampling distribution of the mean taking on a bell shape even though the population distribution is not bell-
shaped happens in general. Here is a somewhat more realistic example.
Histograms illustrating these distributions are shown in Figure 2.2 "Distributions of the Sample Mean".
The Central Limit Theorem is illustrated for several common population distributions in Figure 2.3 "Distribution of
Populations and Sample Means".
The dashed vertical lines in the figures locate the population mean. Regardless of the distribution of the population,
as the sample size is increased the shape of the sampling distribution of the sample mean becomes increasingly bell-
shaped, centered on the population mean. Typically by the time the sample size is 30 the distribution of the sample
mean is practically the same as a normal distribution.
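
This tendency can be seen in a small simulation. The sketch below (an added illustration, not from the source text) draws 10,000 samples of each size from a strongly skewed exponential population, for which μ = σ = 1, and shows that the sample means center on μ, have standard deviation close to σ/√n, and become increasingly symmetric as n grows:

```python
# Simulation: sample means from a skewed population become increasingly
# bell-shaped (skewness -> 0) as the sample size n grows.
import random
from statistics import mean, stdev

random.seed(1)

def draw():
    return random.expovariate(1.0)   # skewed population with mu = sigma = 1

def skewness(xs):
    m, s = mean(xs), stdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)

for n in (1, 5, 30):
    xbars = [mean(draw() for _ in range(n)) for _ in range(10_000)]
    print(f"n={n:2d}: mean(x-bar)={mean(xbars):.3f}, "
          f"std(x-bar)={stdev(xbars):.3f} (theory {1 / n ** 0.5:.3f}), "
          f"skewness={skewness(xbars):+.2f}")   # skewness approaches 0
```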
The importance of the Central Limit Theorem is that it allows us to make probability statements about the sample
mean, specifically in relation to its value in comparison to the population mean, as we will see in the examples. But
to use the result properly we must first realize that there are two separate random variables (and therefore two
probability distributions) at play:
1. X, the measurement of a single element selected at random from the population; the distribution of X is the
distribution of the population, with mean the population mean μ and standard deviation the population
standard deviation σ;
2. X̄, the mean of the measurements in a sample of size n; the distribution of X̄ is its sampling
distribution, with mean μX̄ = μ and standard deviation σX̄ = σ/√n.
Note that if in Note 2.11 "Example 3" we had been asked to compute the probability that the value of a single randomly
selected element of the population exceeds 113, that is, to compute the number P(X > 113), we would not have been able
to do so, since we do not know the distribution of X, but only that its mean is 112 and its standard deviation is 40. By
contrast we could compute P(X̄ > 113) even without complete knowledge of the distribution of X, because the
Central Limit Theorem guarantees that X̄ is approximately normal.
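
As a concrete sketch of such a computation: the sample size used in Example 3 is not reproduced in this excerpt, so n = 50 below is an assumed value for illustration only.

```python
# Hedged sketch: mu and sigma come from the text above; n = 50 is an
# assumed sample size, since Example 3 itself is not shown here.
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 112, 40, 50
sigma_xbar = sigma / sqrt(n)   # standard deviation of the sample mean
# By the CLT, X-bar is approximately Normal(mu, sigma_xbar).
p = 1 - NormalDist(mu=mu, sigma=sigma_xbar).cdf(113)
print(f"P(x-bar > 113) ≈ {p:.4f}")
```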
Statistical Relationship Definition
Statistical relationship can be defined as a relationship used for determining the relationship between two or more
variables in a statistical manner or by conducting a survey. This is a mixture of deterministic and random
relationships.
Deterministic Relationship: This is an exact relationship between two variables. Example: if Paul
earns $20 per hour, then for every hour he works, he earns $20 more.
Random Relationship: This is a less exact relationship: there appears to be a relationship between two
variables, but it cannot be predicted precisely. Example: Joe spent $100 on advertising to get sales worth $300, but
there is no guarantee of sales worth $300 again if he spends another $100. A statistical relationship is the
combination of both deterministic and random relationships: the values of one variable are associated with the
values of another variable, but individual outcomes cannot be predicted with certainty.
· Alcohol consumption and blood alcohol content represent a positive relationship.
· Driving speed and gasoline mileage establish a negative relationship.
These two examples illustrate statistical relationships.
Statistical relationships can be positive or negative and strong or weak. Correlation and Cohen’s D are two
measures of relationship strength.
Correlation
Correlation is a measure of how variables are related to each other, and it is used to test the relationship
between the variables in a statistical relationship. The study of the relationship between variables in a
statistical relationship is called correlation analysis.
There can be a high correlation between variables or a low correlation between variables.
A dog's name and the pedigree it likes are an example of low correlation, or no relationship at all. Knowing these
correlation statistics is useful because it lets you make predictions about future behavior, and predicting what the
future holds is very useful in areas like healthcare, business, etc.
Correlation analysis is very important in the field of education and research, where it is necessary for
the following:
· Ascertaining features of psychological and educational tests (for instance, establishing their validity)
· Testing whether the data are consistent with a hypothesis
· Predicting trends and anticipating the value of one variable using knowledge of other variables
· Constructing psychological and educational models and theories
· Isolating the influence of individual variables
Correlation coefficients are used to assign a value to the relationship. Correlation coefficients take values between
-1 and 1: 0 indicates no relationship, values near 1 indicate a strong positive relationship, and values near -1
indicate a strong negative relationship.
There are several types of correlation coefficient formulas. A few of them are the sample correlation coefficient, the
population correlation coefficient, the Pearson correlation coefficient, and Goodman and Kruskal's lambda coefficient.
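
As an added illustration (not part of the source text), here is a short Python sketch that computes the sample Pearson correlation coefficient r directly from its definition, using made-up driving-speed and gasoline-mileage figures to echo the negative relationship mentioned earlier:

```python
from math import sqrt

def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient, straight from the definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    ss_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (ss_x * ss_y)

# Made-up numbers for illustration only:
speed   = [30, 40, 50, 60, 70, 80]   # driving speed
mileage = [34, 33, 31, 29, 27, 24]   # gasoline mileage

print(round(pearson_r(speed, mileage), 3))   # close to -1: strong negative
```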
· Cohen’s D:
Cohen’s D is one of the most popular ways to measure relationship strength, also known as effect size, for example,
to determine whether Medication A has a better effect than Medication B.
D = (M₁ − M₂) / Spooled

where M₁ is the mean of the first group, M₂ is the mean of the second group, and Spooled is the pooled standard
deviation of the two groups:

Spooled = √((S₁² + S₂²) / 2)

where S₁ is the standard deviation of the first group and S₂ is the standard deviation of the second group.
In the above formula, it does not matter which group mean is M₁ and which is M₂. Usually the larger
mean is taken as M₁ and the smaller as M₂, so that Cohen’s D turns out to be positive; in other words, Cohen’s D
is the absolute value of the difference between the means M₁ and M₂, divided by the pooled standard deviation. The
pooled standard deviation in this formula is a kind of average of the two group standard deviations, called the
pooled within-groups standard deviation.
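
A minimal Python sketch of this computation (using made-up scores for the two medication groups, and pooling the standard deviations as √((S₁² + S₂²)/2) as above):

```python
# Cohen's D on made-up illustration data for two medication groups.
from math import sqrt
from statistics import mean, stdev

group_a = [14, 15, 13, 16, 15, 14]   # e.g., response to Medication A
group_b = [11, 12, 10, 13, 12, 11]   # e.g., response to Medication B

m1, m2 = mean(group_a), mean(group_b)
s1, s2 = stdev(group_a), stdev(group_b)
s_pooled = sqrt((s1 ** 2 + s2 ** 2) / 2)   # pooled within-groups SD

d = abs(m1 - m2) / s_pooled                # absolute value keeps D positive
print(f"Cohen's D = {d:.2f}")
```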
Assessment:
Sampling Distributions
Supplemental Video:
Sampling Distribution
https://youtu.be/z0Ry_3_qhDw
https://youtu.be/p24UTvbKZog
Correlation Coefficient
https://youtu.be/11c9cs6WpJU
https://youtu.be/lVOzlHx_15s
https://youtu.be/MR9M2zN0HFU
Cohen’s D
https://youtu.be/IetVSlrndpI
https://youtu.be/GDe4M0xEghs
https://youtu.be/5rXOy1S5bVk
https://youtu.be/lTlEDQK0vQg
Resources:
https://hmhub.me/quantitative-methods-of-forecasting/
https://brilliant.org/wiki/bayes-theorem/
https://www.wikilectures.eu/w/Statistical_Induction_Principle
https://saylordotorg.github.io/text_introductory-statistics/s06-descriptive-statistics.html
https://docs.dart.ucar.edu/en/latest/theory/conditional-probability-bayes-theorem.html
https://www.analyticsvidhya.com/blog/2017/03/conditional-probability-bayes-theorem/
https://eli.thegreenplace.net/2018/conditional-probability-and-bayes-theorem/
https://dataz4s.com/statistics/sample-space-events-probabilities/