UNIT III

Inferential Statistics
Dr. Valliappan
3.1 Population
• A population is any entire collection of people, animals, plants
or things from which we may collect data.
• A population is a collection of objects. It may be finite or infinite
according to the number of objects in it.
• It is the entire group we are interested in, which we wish to
describe or draw conclusions about.
• In order to make any generalizations about a population, a
sample, that is meant to be representative of the population, is
often studied. For each population there are many possible
samples.
A sample statistic gives information about a corresponding
population parameter.
Example : The population for a study of infant health might be
all children born in the UK in the 1980’s. The sample might be
all babies born on 7th May in any of the years.
When measures such as the mean, median, mode, variance
and standard deviation of a population distribution are
computed, they are referred to as parameters. A parameter
can simply be defined as a summary characteristic of a
population distribution.
3.1.1 Sample
• A sample is a group of units selected from a
larger group(the population). By studying the
sample it is hoped to draw valid conclusions
about the larger group.
• A sample is a subset of a population.
• A sample is “ a smaller collection of units from a
population used to determine truths about that
population”.
• A sample is generally selected for study because
the population is too large to study in its entirety.
The sample should be representative of the general population.
This is often best achieved by random sampling.
Example : The population for a study of infant health might be
all children born in the UK in the 1980's. The sample might be
all babies born on 7th May in any of the years.
Symbols for population and sample descriptive measures:
Parameter             Population   Sample
Mean                  M            X
Variance              σ2           var
Standard deviation    σ            sd
3.2 Random Sampling
A simple random sample is a randomly selected subset of a
population. In this sampling method, each member of the
population has an exactly equal chance of being selected.
Random sampling occurs if, at each stage of sampling, the
selection process guarantees that all potential observations in
the population have an equal chance of being included in the sample.
A casual or haphazard sample does not qualify as a random sample.
Since the selection process is based on probability and random
selection, the resulting sample is more likely to be
representative of the total population and free from researcher
bias. This method is also called the method of chances.
Simple random sampling is one of the four probability sampling
techniques : Simple random sampling, systematic sampling,
stratified sampling and cluster sampling.
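To make the idea concrete, here is a minimal Python sketch of simple random sampling using only the standard library; the sampling frame and sample size are invented for illustration.

```python
import random

# Hypothetical sampling frame: a list of population members (illustrative only).
population_frame = [f"household_{i}" for i in range(1, 101)]   # N = 100
n = 10                                                         # desired sample size

# random.sample() selects n distinct elements without replacement,
# giving every possible sample of size n the same chance of selection.
random.seed(42)                     # fixed seed so the example is reproducible
simple_random_sample = random.sample(population_frame, n)
print(simple_random_sample)
```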
3.2.1 Types of Sampling
Two general approaches to sampling are used.

Probability (Random) Samples


a) Simple random sample
b) Systematic random sample
c) Stratified random sample
d) Multistage sample
e) Multiphase sample
f) Cluster sample
Non-probability Samples
1. Convenience sample
2. Purposive sample
3. Quota
With probability sampling, all elements (e.g., persons, households) in the
population have some opportunity of being included in the sample and the
mathematical probability that any one of them will be selected can be
calculated.
With nonprobability sampling, in contrast, population elements are selected on
the basis of their availability or because of the researcher’s personal judgment
that they are representative. The consequence is that an unknown portion of
the population is excluded. One of the most common types of non-probability
sample is called a convenience sample.
Non-probability sampling is any sampling method where some elements of
the population have no chance of selection or where the
probability of selection cannot be accurately determined.
It involves the selection of elements based on
assumptions regarding the population of interest,
which form the criteria for selection. Because
the selection of elements is non-random, non-
probability sampling does not allow the estimation of
sampling error.
1. Random sampling :
Applicable when the population is small, homogeneous
and readily available.
All subsets of the frame are given an equal
probability. Each element of the frame thus has an
equal probability of selection.
Estimates are easy to calculate.
Simple random sampling is always an EPS (equal probability
of selection) design, but not all EPS designs are simple
random sampling.
Disadvantages
If the sampling frame is large, this method is impracticable.
Minority subgroups of interest in the population may
not be present in the sample in sufficient numbers for
study.
2. Stratified sampling
Where the population embraces a number of distinct
categories, the frame can be organized into separate
"strata".
Each stratum is then sampled as an independent
sub-population, out of which individual elements
can be randomly selected.
Every unit in a stratum has the same chance of being
selected.
Using the same sampling fraction for all strata ensures
proportionate representation in the sample.
Finally, since each stratum is treated as an
independent population, different sampling
approaches can be applied to different strata (a small
sketch of proportionate stratified sampling is given below).
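Here is a small Python sketch of proportionate stratified sampling; the strata labels, population and overall sample size are assumed values for illustration, not taken from the text.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical population: (unit_id, stratum) pairs -- illustrative values only.
population = [(i, "urban" if i % 3 else "rural") for i in range(1, 301)]
total_n = 30                                   # overall sample size we want

# Group the frame by stratum.
strata = defaultdict(list)
for unit, label in population:
    strata[label].append(unit)

# Apply the same sampling fraction to every stratum (proportionate allocation).
fraction = total_n / len(population)
sample = []
for label, units in strata.items():
    k = round(len(units) * fraction)           # this stratum's share of the sample
    sample.extend(random.sample(units, k))     # simple random sample within the stratum

print(len(sample), sample[:10])
```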
Drawbacks to using stratified sampling:
First, a sampling frame of the entire population has to
be prepared separately for each stratum.
Second, when examining multiple criteria,
stratifying variables may be related to some, but
not to others, further complicating the design
and potentially reducing the utility of the strata.
Finally, in some cases (such as designs with a
large number of strata or those with a specified
minimum sample size per group), stratified sampling can
require a larger sample than would other methods.
Some terms used in sampling
1. Sampled population – The population from which the sample is drawn.
2. Frame – The list of elements from which the sample is selected, e.g. a telephone book or
city business directory. It may be possible to construct a frame.
3. Parameter – A characteristic of a population, e.g. a total (annual GDP or
exports) or the proportion p of the population that votes Liberal in a federal election.
The μ or σ of a probability distribution are also termed parameters.
4. Statistic – A numerical characteristic of a sample, e.g. the monthly
unemployment rate or pre-election polls.
5. Sampling distribution of a statistic – the probability distribution of the
statistic.
Selecting a sample
• N is the symbol given for the size of the population.
• n is the symbol given for the size of the sample or the
number of elements in the sample.
• A simple random sample is a sample of size n selected in
a manner that each possible sample of size n has the
same probability of being selected.
• In the case of a random sample
The sampling process comprises
several stages :
• Defining the population of concern.
• Specifying a sampling frame, a set of
items or events possible to measure.
• Specifying a sampling method for
selecting items or events from the
frame.
• Determining the sample size.
• Implementing the sampling plan.
• Sampling and data collecting.
• Reviewing the sampling process.
Selecting a simple random sample
Sample with replacement – Any element
randomly selected is replaced, and another element is
randomly selected. This could lead to the same
element being selected more than once.
It is more common to sample without replacement:
make sure that at each stage, each element
remaining in the population has the same
probability of being selected.
Use a random number table or a computer-
generated random selection process, or use a
coin, die, bingo ball popper, etc.
Simple random sample of size 2 from a population of
4 elements – without replacement
1. Population elements are A, B, C, D, so N = 4 and n = 2.
2. Without replacement: the 1st element selected could be any one of the 4
elements and this leaves 3, so there are 4 x 3 = 12 possible
ordered samples, each equally likely: AB, AC, AD, BA, BC, BD, CA, CB,
CD, DA, DB, DC.
3. If the order of selection does not
matter (i.e. we are interested only in
what elements are selected), then this
reduces to 6 combinations. If {AB} is AB
or BA, etc., then the equally likely
random samples are {AB}, {AC}, {AD},
{BC}, {BD}, {CD}. This is the number of
combinations (see the sketch below).
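The counts in this example can be verified with Python's itertools, using the population {A, B, C, D} from the text:

```python
from itertools import permutations, combinations

population = ["A", "B", "C", "D"]

# Ordered samples of size 2 without replacement: 4 x 3 = 12 permutations.
ordered = list(permutations(population, 2))
# Unordered samples of size 2: C(4, 2) = 6 combinations.
unordered = list(combinations(population, 2))

print(len(ordered), len(unordered))   # 12 6
print(unordered)   # [('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D')]
```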
3.3 Sampling Distribution
A sampling distribution is the probability distribution of a statistic obtained from a
large number of samples drawn from a specific population.
Its primary purpose is to establish representative results from small samples of a
comparatively larger population. Since the population is too large to analyze, we
can select a smaller group and repeatedly sample and analyze it.
Using a sampling distribution simplifies the process of making inferences, or
conclusions, about large amounts of data.
The standard deviation of the sampling distribution is called the standard
error.
The sampling error is the difference between the point estimate (value of
the estimator) and the value of the parameter.
An important example is the sampling distribution of the sample means.
When all of the possible sample means are computed, then
the following properties are true:
1. The mean of the sample means will be the mean of the
population.
2. The variance of the sample means will be the variance of
the population divided by the sample size.
3. If the population has a normal distribution, the sample
means will have a normal distribution.
The formula for a Z-score when working with sample
means is:
z = (x̄ – μ) / (σ / √n)
Finite population correction factor
If the sample size is more than 5% of the population size and
the sampling is done without replacement, then a correction
needs to be made to the standard error of the mean.
In the following, N is the population size and n is the sample
size. The adjustment is to multiply the standard error by the
square root of the quotient of the difference between the
population and sample sizes and one less than the population size:
σx̄ = (σ / √n) · √((N – n) / (N – 1))
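A minimal sketch of applying this correction in Python, with assumed values of σ, n and N:

```python
import math

sigma = 12.0    # assumed population standard deviation
n = 50          # sample size
N = 400         # population size (n/N = 12.5% > 5%, so the correction applies)

se = sigma / math.sqrt(n)                 # uncorrected standard error of the mean
fpc = math.sqrt((N - n) / (N - 1))        # finite population correction factor
corrected_se = se * fpc

print(round(se, 4), round(fpc, 4), round(corrected_se, 4))
```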
Random sample from normally distributed population
Classification of samples
Samples are classified into two types : large samples and small
samples.
1. Large sample : The sample is said to be large if the sample size n ≥ 30.
2. Small sample : The sample is said to be small if the sample size n < 30.
                      Normally distributed   Sampling distribution
                      population             when sample is random
Number of elements    N                      n
Mean                  μ                      μ
Standard deviation    σ                      σ / √n
3.3.1 Standard Error of the Mean
The standard error of the mean (SEM) is used to
determine the differences between more than one
sample of data.
It helps to estimate how well sample data
represent the whole population by measuring the
accuracy with which the sample data represent the
population, using the standard deviation.
The standard error of the mean is given by:
SEM = σ / √n  (or s / √n when the population standard deviation is unknown)
The standard error of the mean serves as a special
type of standard deviation that measures
variability in the sampling distribution. The "error" in
standard error refers not to computational errors but to
variability due to chance in sampling.
A high standard error shows that sample means
are widely spread around the population mean, so
the sample may not closely represent the population. A
low standard error shows that sample means are
closely distributed around the population mean,
which means that the sample is representative of the
population. So the standard error can be decreased by
increasing the sample size.
Standard error calculation procedure:
Step 1: Calculate the mean (total of all samples
divided by the number of samples).
Step 2: Calculate each measurement's deviation
from the mean (i.e. the mean minus the individual
measurement).
Step 3: Square each deviation from the mean. Squared
negatives become positive.
Step 4: Sum the squared deviations.
Step 5: Divide the sum from step 4 by one less than
the sample size (n – 1).
Step 6: Take the square root of the number in step
5. That gives you the standard deviation (S.D.).
Step 7: Divide the standard deviation by the square
root of the sample size (n). That gives you the
standard error.
Step 8: Subtract the standard error from the mean
and record that number. Then add the standard
error to the mean and record that number. You
have plotted mean ± 1 standard error, the distance
from 1 standard error below the mean to 1
standard error above the mean.
Worked example (heights to the nearest cm):
Name       Height   (Step 2) Deviation (m – i)   (Step 3) Squared deviation (m – i)²
Rupali     150      9.6                          92.16
Rakshita   170      -10.4                        108.16
Sangita    165      -5.4                         29.16
Rutuja     155      4.6                          21.16
Rushi      158      1.6                          2.56
n = 5      Total = 798
(Step 1) Mean m = 159.6
(Step 4) Sum of squared deviations Σ(m – i)² = 253.2
Calculation:
Step 5: Divide by the number of measurements minus 1:
Σ(m – i)² / (n – 1) = 253.2 / (5 – 1) = 63.3
Step 6: Standard deviation
S.D. = √(Σ(m – i)² / (n – 1)) = √63.3 ≈ 7.956
Step 7: Standard error = Standard deviation / √n = 7.956 / √5 ≈ 3.558
Step 8: m ± 1 SE = 159.6 ± 3.558
= 159.6 + 3.558 or 159.6 – 3.558
= 163.158 or 156.042
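The same procedure can be checked with a short Python snippet using the height data from the table above:

```python
import math

heights = [150, 170, 165, 155, 158]          # Rupali, Rakshita, Sangita, Rutuja, Rushi
n = len(heights)

mean = sum(heights) / n                       # Step 1: 159.6
sq_dev = [(mean - h) ** 2 for h in heights]   # Steps 2-3: squared deviations from the mean
ss = sum(sq_dev)                              # Step 4: 253.2
variance = ss / (n - 1)                       # Step 5: 63.3
sd = math.sqrt(variance)                      # Step 6: sample standard deviation
se = sd / math.sqrt(n)                        # Step 7: standard error of the mean

print(round(sd, 3), round(se, 3))             # approx 7.956 and 3.558
print(round(mean - se, 3), round(mean + se, 3))  # Step 8: mean ± 1 SE
```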
3.3.2 Difference between Standard Error of the Mean and Standard Deviation

Sr. No  Standard error of the mean                            Standard deviation
1       Estimates the variability across multiple             Describes variability within one
        samples of a population.                              sample.
2       An inferential statistic that can be estimated.       A descriptive statistic that can be
                                                              calculated.
3       Measures how far the sample mean is likely to         The degree to which people within
        be from the actual population mean.                   the sample differ from the actual
                                                              mean.
4       Standard error is the standard deviation              Standard deviation is the square
        divided by the square root of the sample size.        root of the variance.
3.3.3 Central Limit Theorem
The sampling distribution of the sample mean, x̄, is
approximated by a normal distribution when the
sample is a simple random sample and the sample
size, n, is large.
In this case, the mean of the sampling distribution is
the population mean, μ, and the standard deviation
of the sampling distribution is the population
standard deviation, σ, divided by the square root of
the sample size. The latter is referred to as the
standard error of the mean.
The central limit theorem states that the mean of
the sampling distribution of the mean will be the
unknown population mean. The standard deviation
of the sampling distribution of the mean is called
the standard error.
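A small simulation sketch of the theorem; the (non-normal) population and the sample size are arbitrary choices made for demonstration:

```python
import random
import statistics

random.seed(1)
n = 40            # sample size
trials = 5000     # number of repeated samples

# Draw repeated samples from a decidedly non-normal population (exponential, mean 2.0)
# and record each sample mean.
sample_means = [
    statistics.mean(random.expovariate(0.5) for _ in range(n))
    for _ in range(trials)
]

# The means cluster around the population mean (2.0),
# with spread close to sigma/sqrt(n) = 2/sqrt(40).
print(round(statistics.mean(sample_means), 3))
print(round(statistics.stdev(sample_means), 3), round(2 / n ** 0.5, 3))
```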
3.4 Hypothesis Testing
A statistical hypothesis test is a procedure for deciding between
two possible statements about a population. The phrase
significance test means the same thing as the phrase
“hypothesis test.”
A hypothesis test is a statistical method that uses sample data
to evaluate a hypothesis about a population.
The goal in hypothesis testing is to analyze a sample in an
attempt to distinguish between population characteristics that
are likely to occur and population characteristics that are
unlikely to occur.
Basic assumption of hypothesis testing
If the treatment has any effect, it simply adds or subtracts a constant amount to
each individual's score.
Remember that adding or subtracting a constant changes the mean, but not the shape of the
distribution for the population or its standard deviation.
The population after treatment therefore has the same shape and standard deviation as the population
prior to treatment.
The purpose of the hypothesis test is to decide between two explanations:
1. The difference between the sample and the population can be explained by sampling error.
2. The difference between the sample and the population is too large to be explained
by sampling error.
Steps in hypothesis testing
1. Specify the null hypothesis.
2. Specify the alternative hypothesis.
3. Set the significance level (α).
4. Calculate the test statistic and corresponding P-value.
5. Display the conclusion.
Step 1: Formulate the hypothesis
• A null hypothesis is a statement of the status quo, one of no difference or
no effect. If the null hypothesis is not rejected, no changes will be made.
• An alternative hypothesis is one in which some difference or effect is
expected.

Step 2: Select an appropriate test
• The test statistic measures how close the sample has come to the null
hypothesis.
• The test statistic often follows a well-known distribution (e.g., normal, t or
chi-square).
• Calculate the Z statistic.

Step 3: Choose level of significance
Type I Error
• Occurs if the null hypothesis is rejected when it is in fact true.
• The probability of a type I error (α) is also called the level of significance.
Type II Error
• Occurs if the null hypothesis is not rejected when it is in fact false.
• The probability of a type II error is denoted by β.
• Unlike α, which is specified by the researcher, the magnitude of β depends on the
actual value of the population parameter (proportion).
• It is necessary to balance the two types of errors.
Step 4: Collect data and calculate test statistic
The required data are collected and the value of the test statistic computed. The test
statistic Z can be calculated as follows:
Z = (x̄ – μ0) / (σ / √n)
Step 5: Determine probability value/critical value
• Using standard normal tables.
• Note, in determining the critical value of the test statistic, the area to the right of
the critical value is either α or α/2. It is α for a one-tail test and α/2 for a two-tail test.
• Alternatively, if the calculated value of the test statistic is greater than the critical
value of the test statistic (Zα), the null hypothesis is rejected.

Two-tailed alternative : If the alternative states that a population parameter is
different from a specific value, the corresponding test is called a two-tailed test.
Right-tailed alternative : If the alternative states that a population parameter is
greater than a specific value, the corresponding test is called a right-tailed test.
Left-tailed alternative : If the alternative states that a population parameter is less
than a specific value, the corresponding test is called a left-tailed test.
Decide the rejection region of the test
Based on the test statistic and a given confidence level, we can determine the
rejection region, the acceptance region and the critical value of the test.
The rejection region is the region such that we reject the null hypothesis when the
test statistic falls in it. The acceptance region is simply the complement of the
rejection region. (A sketch of computing critical values and the rejection decision is given below.)
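A minimal sketch of finding the normal critical value and the rejection decision for a two-tailed test at α = 0.05; the observed z value is an illustrative number, not from the text:

```python
from statistics import NormalDist

alpha = 0.05
z_observed = 2.31          # illustrative test statistic value

std_normal = NormalDist()  # standard normal, mean 0, sd 1

# Two-tailed critical value: the area in each tail is alpha/2.
z_critical = std_normal.inv_cdf(1 - alpha / 2)     # about 1.96

# Rejection region: |z| > z_critical.
reject = abs(z_observed) > z_critical

# Two-tailed P-value: probability of a value at least this extreme under H0.
p_value = 2 * (1 - std_normal.cdf(abs(z_observed)))

print(round(z_critical, 3), reject, round(p_value, 4))
```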
P-value and hypothesis testing
• As an alternative to the rejection/acceptance-region approach, we can
calculate a probability related to the test statistic, called the P-value, and base our
decision to reject or accept on the magnitude of the P-value.
• The P-value is the probability of observing a value of the test statistic as extreme as the
one observed, if the null hypothesis is true. So a small P-value indicates that the null
hypothesis is unlikely to be true and hence should be rejected.
In a hypothesis testing problem:
a) The null hypothesis will not be rejected unless the data are unusual (given
that the hypothesis is true).
b) The null hypothesis will be rejected if the P-value indicates the data are very
unusual (given that the hypothesis is true).
c) The null hypothesis will be rejected only if the probability of observing the
data provides convincing evidence against it.
d) The alternative hypothesis is also called the research hypothesis; the null
hypothesis often represents the status quo.
e) The alternative hypothesis is the hypothesis that we would like to find evidence
for, which is why it is also called the research hypothesis.
3.4.1 Null and Alternative Hypothesis
The null and alternate hypothesis statements are important parts of the analytical methods collectively
known as inferential statistics.
Inferential statistics are methods used to determine something about a population, based on the observation
of a sample.
Information about a population will be presented in one of two forms: as a mean (μ) or as a proportion (p).
1. Null hypothesis (H0)
The null hypothesis states that there is no change in the general population before and after an intervention.
In the context of an experiment, H0 predicts that the independent variable had no effect on the dependent
variable.
The null hypothesis is the stated or assumed value of a population parameter. When trying to identify the
population parameter needed for your solution, look for the following phrases:
i. “It is known that…”
ii. “Previous research shows…”
2. Alternate hypothesis (H1)
The alternative hypothesis states that there is a change in the general population following an intervention.
The alternate hypothesis is the stated or assumed value of a population parameter if the null hypothesis is
rejected. When trying to identify the information needed for alternate hypothesis statement, look for the
following phrases:
i. “Is it reasonable to conclude…”
ii. “Is there enough evidence to substantiate…”
3.4.2 Difference between Null and Alternative Hypothesis

Sr. No  Null hypothesis                                      Alternative hypothesis
1       Represented by H0.                                   Represented by H1.
2       Statement about the value of a population            Statement about the value of a population
        parameter.                                           parameter that must be true if the null
                                                             hypothesis is false.
3       Always stated as an equality.                        Stated in one of three forms: >, <, ≠.
4       This is the hypothesis or claim that is              This is the hypothesis or claim which we
        initially assumed to be true.                        initially assume to be false but which we
                                                             may decide to accept if there is sufficient
                                                             evidence.
5       Independent variable had no effect on the            Independent variable did have an effect on
        dependent variable.                                  the dependent variable.
3.4.3 Z-Test
• The Z-test is a hypothesis test to determine if a single
observed mean is significantly different from (or greater or less
than) the mean under the null hypothesis, when we know
the standard deviation of the population.
• The Z-test may also be used to work out whether two
population means are different when the variances are known
and the sample size is large.
• Z-tests are closely related to t-tests, but t-tests are best
performed when the experiment has a small sample size. The
Z ratio for a single population mean is given below:
Z = (x̄ – μ) / (σ / √n)
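A sketch of this one-sample Z-test in Python; the sample data, hypothesized mean μ0 and known σ are invented for illustration:

```python
import math
from statistics import NormalDist, mean

# Illustrative inputs (not from the text): sample of scores,
# hypothesized population mean mu0, and known population standard deviation sigma.
sample = [72, 88, 64, 90, 81, 77, 69, 85, 79, 83]
mu0 = 75.0
sigma = 8.0
alpha = 0.05

n = len(sample)
x_bar = mean(sample)

# Z ratio for a single population mean.
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Two-tailed P-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(x_bar, 2), round(z, 3), round(p_value, 4))
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```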
3.4.4 Critical Region
The critical region of a hypothesis test is the set of all outcomes which, if
they occur, will lead us to decide that there is a difference. That is, cause
the null hypothesis to be rejected in favor of the alternative hypothesis.
The critical region is usually denoted by the letter C .
The rejection region is the set of possible values for which the null
hypothesis will be rejected. This region depends on α.
In specifying the rejection region for a hypothesis, the value at the
boundary of the rejection region is called the critical value.

3.4.5 Level of Confidence
The level of confidence c is the probability that the interval estimate
contains the population parameter.
The c is the area beneath the normal curve between the critical values.
The remaining area in the tails is 1 – c.
The level of confidence of a confidence interval is a probability that
represents the percentage of intervals that will contain the parameter if a large
number of repeated samples are obtained.
The construction of a confidence interval for the population mean depends
upon three factors:
1. The point estimate of the population mean.
2. The level of confidence.
3. The standard deviation of the sample mean.
3.4.6 Confidence Level for One Mean
Estimating the mean of a normally distributed population entails drawing a sample of
size n and computing x̄, which is used as a point estimate of μ.
1. The 90% confidence interval
If the level of confidence is 90 %, this means that we are 90 % confident that the interval
contains the population mean, μ. Fig. 3.4.7 shows the 90 % level of confidence.
2. The 95% confidence interval
If the level of confidence is 95 %, this means that we are 95 % confident that the interval
contains the population mean, μ. 95 % of the values of x̄ making up the
distribution will lie within two standard deviations of the mean; the actual value is
1.96.
The interval is noted by the two points μ – 1.96 σx̄ and μ + 1.96 σx̄, so that 95 % of the
values are in the interval μ ± 1.96 σx̄.
Margin of error
The difference between the point estimate and the actual population
parameter value is called the sampling error.
Given a level of confidence, the margin of error (sometimes called the maximum
error of estimate or error tolerance) E is the greatest possible distance
between the point estimate and the value of the parameter it is estimating:
E = Zc σx̄ = Zc σ / √n
When n ≥ 30, the sample standard deviation s can be used for σ.
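A short sketch computing E and the resulting confidence interval; the sample mean, σ, n and the confidence level are assumed values for illustration:

```python
import math
from statistics import NormalDist

# Assumed inputs for illustration only.
x_bar = 159.6    # sample mean
sigma = 8.0      # population standard deviation (or s when n >= 30)
n = 36           # sample size
c = 0.95         # level of confidence

# Critical value Zc leaves (1 - c)/2 in each tail of the standard normal curve.
z_c = NormalDist().inv_cdf(1 - (1 - c) / 2)     # about 1.96 for c = 0.95

E = z_c * sigma / math.sqrt(n)                  # margin of error
print(round(E, 3), (round(x_bar - E, 3), round(x_bar + E, 3)))  # interval x_bar ± E
```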

3.4.7 Large Sample Test for Single Proportion
If n is large (n > 30) then, by the central limit theorem, x̄ has an approximately
normal distribution with mean μ and standard deviation σ/√n. Then
Z = (x̄ – μ) / (σ / √n)
has an approximate standard normal distribution. (A sketch of the corresponding
large-sample test for a single proportion is given below.)
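The section title refers to a single proportion; a hedged sketch of the standard large-sample z-test for a proportion, with illustrative counts not taken from the text, is:

```python
import math
from statistics import NormalDist

# Illustrative inputs: x successes out of n trials,
# tested against a hypothesized population proportion p0.
x, n = 68, 120
p0 = 0.5
alpha = 0.05

p_hat = x / n
# Large-sample z statistic for a single proportion.
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-tailed P-value.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p_hat, 3), round(z, 3), round(p_value, 4))
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```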


3.5 Point Estimate
A point estimate of a parameter is the value of a statistic that
estimates the value of the parameter. The sample mean x̄ is the
best point estimate of the population mean μ.
• A point estimate is a single value estimate for a population
parameter. The most unbiased point estimate of the population
mean μ is the sample mean.
The value obtained from the sample is known as a sample constant or
sample statistic. The unknown population constant is known as a
population parameter.
The procedure or rule used to determine an unknown population
parameter is called an estimator.
Properties of a point estimator
• Qualities desirable in estimators include unbiasedness, consistency and relative
efficiency:
Unbiasedness :
• The mean of the sampling distribution of a statistic is called the expectation of this
statistic.
• If the expectation of a statistic is equal to the population parameter this statistic
is intended to estimate, then the statistic is called an unbiased estimator of this
population parameter. If the expectation of a statistic is not equal to the population
parameter, the statistic is a biased estimator.
Consistency :
An unbiased estimator is said to be consistent if the difference between the
estimator and the parameter grows smaller as the sample size grows larger.
Efficiency :
If there are two unbiased estimators of a parameter, the one whose variance is
smaller is said to be relatively efficient.
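As an illustration of bias, the following simulation sketch compares the variance estimator with divisor n (biased) against divisor n – 1 (unbiased) for a known population variance; the population and sample size are arbitrary choices for demonstration.

```python
import random
import statistics

random.seed(7)
n = 10            # sample size
trials = 20000    # number of simulated samples
sigma2 = 4.0      # true population variance (normal population, sd = 2)

biased, unbiased = [], []
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / n)           # divisor n: expectation below sigma^2
    unbiased.append(ss / (n - 1))   # divisor n - 1: expectation equal to sigma^2

# The unbiased estimator averages close to 4.0; the biased one close to 3.6.
print(round(statistics.mean(biased), 3), round(statistics.mean(unbiased), 3))
```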
3.5.1 Interval Estimate
An interval estimate is an interval, or range of values, used to estimate a population
parameter.
A confidence interval estimate of a parameter consists of an interval of numbers
along with a probability that the interval contains the unknown parameter.
An interval estimate consists of two numerical values that, with a specified degree of
confidence, we feel include the parameter being estimated.
The sampled population is the population from which we actually draw the sample.
The target population is the population about which we wish to make an inference.
The strict validity of statistical procedures depends on the assumption of random
samples.
3.5.2 Biased and Unbiased Estimates
• An estimator θ̂ = θ̂(X1, X2, …, Xn) is said to be unbiased if its expected value is equal to
the population parameter θ.
• The sample mean x̄ is an unbiased estimator of the population mean μ.
• Suppose X1, X2, …, Xn is a random sample drawn from a given population with
mean μ and variance σ². Fig. 3.5.2 shows biased and unbiased estimators.
