Stat For Comp (7-9)
The sampling distribution of a statistic is the distribution of values taken by the statistic in all
possible samples of the same size. If the original population has a normal distribution, then the
sample mean also has a normal distribution.
The sample mean will also be approximately normally distributed if the sample size n is fairly
large (at least 30), even if the original population does not have a normal distribution. The
larger the sample size, the closer the distribution of the sample mean will be to a normal
distribution.
Definitions:
Parameter: Characteristic or measure obtained from a population.
Statistic: Characteristic or measure obtained from a sample.
Sampling: The process or method of sample selection from the population.
Sampling unit: the ultimate unit to be sampled or elements of the population to be sampled.
Examples:
If somebody studies the socio-economic status of households, households are the sampling unit.
If one studies performance of freshman students in college, the student is the sampling unit.
Sampling frame: the list of all elements in a population.
Examples:
List of households.
List of students in the registrar office.
Errors in sample surveys: There are two types of errors.
Sampling error:
Is the discrepancy between the population value and the sample value.
May arise due to inappropriate sampling techniques.
Non-sampling errors: are errors due to procedural bias, such as:
Incorrect responses
Measurement errors
Errors at different stages in processing the data.
The Need for Sampling
Reduced cost
Greater speed
Greater accuracy
Greater scope
More detailed information can be obtained.
- There are two types of sampling.
1. Random sampling or probability sampling.
- It is a method of sampling in which all elements in the population have a pre-assigned
non-zero probability of being included in the sample.
Examples:
Simple random sampling
Stratified random sampling
Cluster sampling
Systematic sampling
a) Simple Random Sampling:
- Is a method of selecting items from a population such that every possible sample of a specific
size has an equal chance of being selected. In this case, sampling may be with or without
replacement. Equivalently, all elements in the population have the same pre-assigned non-zero
probability of being included in the sample.
- Simple random sampling can be done using either the lottery method or a table of random
numbers.
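The lottery method above can be imitated with a pseudo-random number generator. The following is a minimal sketch, not part of the original notes: the frame of 100 household IDs, the function name `simple_random_sample`, and the seed are assumptions made only for illustration.

```python
import random

def simple_random_sample(population, n, replace=False, seed=None):
    """Draw a simple random sample of size n from a list of units."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    if replace:
        # With replacement: every draw is made from the full population.
        return [rng.choice(population) for _ in range(n)]
    # Without replacement: no unit can be selected twice.
    return rng.sample(population, n)

# Hypothetical sampling frame: household IDs 1..100.
households = list(range(1, 101))
without = simple_random_sample(households, 10, replace=False, seed=1)
with_repl = simple_random_sample(households, 10, replace=True, seed=1)
print(without)
print(with_repl)
```

Sampling without replacement never repeats a unit, while sampling with replacement may.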
b) Stratified Random Sampling:
- The population is divided into non-overlapping but exhaustive groups called strata.
- Simple random samples are chosen from each stratum.
- Elements in the same stratum should be more or less homogeneous, while elements in different
strata should differ.
- It is applied if the population is heterogeneous.
- Some of the criteria for dividing a population into strata are: Sex (male, female); Age (under
18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).
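As a rough illustration of the stratified procedure, the sketch below draws a simple random sample inside each stratum. The strata sizes, unit labels, and the use of proportional allocation (the same sampling fraction in every stratum) are assumptions, since the notes do not say how the stratum sample sizes are chosen.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """strata maps a stratum name to the list of its units."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        # Proportional allocation (an assumption): same fraction per stratum.
        size = max(1, round(fraction * len(units)))
        sample[name] = rng.sample(units, size)  # SRS within the stratum
    return sample

# Hypothetical population split by age group.
strata = {
    "under 18": [f"u{i}" for i in range(40)],
    "18 to 28": [f"a{i}" for i in range(100)],
    "29 to 39": [f"b{i}" for i in range(60)],
}
picked = stratified_sample(strata, fraction=0.1, seed=7)
for name, units in picked.items():
    print(name, len(units))
```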
c) Cluster Sampling:
- The population is divided into non-overlapping groups called clusters.
- A simple random sample of clusters is chosen, and all the sampling units
in the selected clusters are surveyed.
- Clusters are formed in such a way that elements within a cluster are heterogeneous, i.e.
observations in each cluster should be more or less dissimilar.
- Cluster sampling is useful when it is difficult or costly to generate a simple random sample.
For example, to estimate the average annual household income in a large city we use cluster
sampling, because to use simple random sampling we need a complete list of households in
the city from which to sample. To use stratified random sampling, we would again need the
list of households. A less expensive way is to let each block within the city represent a
cluster. A sample of clusters could then be randomly selected, and every household within
these clusters could be interviewed to find the average annual household income.
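The city-block example can be sketched as follows. The eight blocks of five households each are invented for illustration. Only the list of blocks is needed to draw the sample, not a list of every household in the city.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and survey every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # SRS of cluster names
    units = [u for name in chosen for u in clusters[name]]
    return chosen, units

# Hypothetical city: 8 blocks with 5 households each.
blocks = {
    f"block{b}": [f"block{b}-house{h}" for h in range(1, 6)]
    for b in range(1, 9)
}
chosen, households = cluster_sample(blocks, 3, seed=3)
print(chosen)
print(len(households))
```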
d) Systematic Sampling:
- A complete list of all elements within the population (sampling frame) is required.
- The procedure starts by determining the first element to be included in the sample.
- Then every kth item is taken from the sampling frame.
- Let $N$ = population size, $n$ = sample size, and $k = N/n$ = sampling interval.
- Choose any number between 1 and $k$. Suppose it is $j$ ($1 \le j \le k$).
- The $j$th unit is selected first, and then the $(j+k)$th, $(j+2k)$th, ... units, until the required
sample size is reached.
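The systematic selection rule can be sketched in a few lines. The frame of 100 numbered units and the fixed start j = 3 are assumptions for illustration. The sketch also assumes the interval is taken as the integer part of N/n.

```python
import random

def systematic_sample(frame, n, start=None, seed=None):
    """Take the j-th unit and then every k-th unit after it."""
    N = len(frame)
    k = N // n  # sampling interval (integer part of N/n, an assumption)
    if start is None:
        start = random.Random(seed).randint(1, k)  # random j with 1 <= j <= k
    # Positions j, j+k, j+2k, ... are 1-based, so subtract 1 for list indexing.
    return [frame[start - 1 + i * k] for i in range(n)]

frame = list(range(1, 101))  # hypothetical frame: units numbered 1..100
sample = systematic_sample(frame, n=10, start=3)
print(sample)
```

With N = 100 and n = 10 the interval is k = 10, so starting at j = 3 selects units 3, 13, 23 and so on.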
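$$\mu_{\bar X} = \frac{\sum \bar X_i f_i}{\sum f_i} = \frac{250}{25} = 10$$
b) Find the variance of $\bar X$, say $\sigma_{\bar X}^2$.
$$\sigma_{\bar X}^2 = \frac{\sum (\bar X_i - \mu_{\bar X})^2 f_i}{\sum f_i} = \frac{100}{25} = 4 \neq \sigma^2$$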
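Remark:
1. In general, if sampling is with replacement,
$$\sigma_{\bar X}^2 = \frac{\sigma^2}{n}$$
2. If sampling is without replacement,
$$\sigma_{\bar X}^2 = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$$
3. In any case the sample mean is an unbiased estimator of the population mean, i.e.
$$\mu_{\bar X} = E(\bar X) = \mu \quad \text{(Show!)}$$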
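The three remarks can be checked by brute force on a small population. The sketch below enumerates every ordered sample of size 2, with and without replacement. The population values are an assumption chosen for illustration, and exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import permutations, product

population = [6, 8, 10, 12, 14]  # hypothetical population, N = 5
N, n = len(population), 2

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

def variance(values, center):
    """Population-style variance: average squared deviation from center."""
    values = list(values)
    return sum((v - center) ** 2 for v in values) / len(values)

mu = mean(population)
sigma2 = variance(population, mu)

# Sample means over all ordered samples of size n.
means_with = [mean(s) for s in product(population, repeat=n)]
means_without = [mean(s) for s in permutations(population, n)]

var_with = variance(means_with, mean(means_with))
var_without = variance(means_without, mean(means_without))

print(mu, sigma2)             # population mean and variance
print(var_with, sigma2 / n)   # remark 1
print(var_without, sigma2 / n * Fraction(N - n, N - 1))  # remark 2
```

In both schemes the mean of the sample means equals the population mean, which is remark 3.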
- Sampling may be from a normally distributed population or from a non-normally distributed
population.
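- When sampling is from a normally distributed population, the distribution of $\bar X$ will possess
the following properties.
1. The distribution of $\bar X$ will be normal.
2. The mean of $\bar X$ is equal to the population mean, i.e. $\mu_{\bar X} = \mu$.
3. The variance of $\bar X$ is equal to the population variance divided by the sample size, i.e.
$$\sigma_{\bar X}^2 = \frac{\sigma^2}{n}$$
$$\Rightarrow \bar X \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
$$\Rightarrow Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$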
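- When sampling is from a non-normally distributed population, the distribution of $\bar X$, computed from samples of size $n$ from the population, will be approximately normal when $n$ is large.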
STATISTICAL ESTIMATION
Inference is the process of making interpretations or conclusions from sample data for the
totality of the population.
It is only the sample data that is ready for inference.
In statistics there are two ways through which inference can be made.
Statistical estimation
Statistical hypothesis testing.
Statistical Estimation
This is one way of making inference about the population parameter where the investigator does
not have any prior notion about values or characteristics of the population parameter.
There are two types of estimation.
1) Point Estimation
It is a procedure that results in a single value as an estimate for a parameter.
2) Interval estimation
It is a procedure that results in an interval of values as an estimate for a parameter,
that is, an interval that contains the likely values of the parameter. It deals with identifying
the upper and lower limits of a parameter. The limits themselves are random variables.
Definitions
Confidence Interval: An interval estimate with a specific level of confidence
Confidence Level: The percent of the time the true value will lie in the interval estimate
given.
Degrees of Freedom: The number of data values which are allowed to vary once a
statistic has been determined.
Estimator: A sample statistic which is used to estimate a population parameter. It must
be unbiased, consistent, and relatively efficient.
Estimate: One of the possible values that an estimator can assume.
Interval Estimate: A range of values used to estimate a parameter.
Point Estimate: A single value used to estimate a parameter.
Properties of best estimator
Unbiased Estimator: An estimator whose expected value is the value of the parameter
being estimated.
Consistent Estimator: An estimator which gets closer to the value of the parameter as
the sample size increases.
Relatively Efficient Estimator: The estimator with the smallest variance among the estimators being compared for a parameter.
Point estimation of the population mean: µ
A point estimator is the mathematical rule by which we compute the point estimate. A statistic is
also called a point estimate, since it estimates the parameter value. For instance, the sum of $x_i$
over $n$ is the point estimator used to compute the estimate of the population mean, $\mu$. That is,
$$\bar X = \frac{\sum x_i}{n}$$
is a point estimator of the population mean.
Confidence interval estimation of the population mean
Although X possesses nearly all the qualities of a good estimator, because of sampling error, we
know that it's not likely that our sample statistic will be equal to the population parameter, but
instead will fall into an interval of values. We will have to be satisfied knowing that the statistic
is "close to" the parameter.
Case 1: If sample size is large or if the population is normal with known variance
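Consider samples of size $n$ drawn from a population whose mean is $\mu$ and standard deviation is
$\sigma$, with replacement and order important. The population can have any frequency distribution.
The sampling distribution of $\bar X$ will have mean $\mu_{\bar X} = \mu$ and standard deviation $\sigma_{\bar X} = \sigma/\sqrt{n}$,
and approaches a normal distribution as $n$ gets large. This allows us to use the normal
distribution curve for computing confidence intervals.
$$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \text{ has a normal distribution with mean 0 and variance 1}$$
$$\Rightarrow \mu = \bar X \pm Z\,\sigma/\sqrt{n} = \bar X \pm \varepsilon, \text{ where } \varepsilon \text{ is a measure of error}$$
$$\Rightarrow \varepsilon = Z\,\sigma/\sqrt{n}$$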
- For the interval estimator to be good, the error should be small. How can it be made small?
By making $n$ large, by having small variability, and by taking $Z$ small.
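- To obtain the value of $Z$, we have to attach it to an area of size $1-\alpha$ such that
$$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$$
where $\alpha$ is the probability that the parameter lies outside the interval, and
$Z_{\alpha/2}$ stands for the standard normal value to the right of which
$\alpha/2$ probability lies, i.e. $P(Z > Z_{\alpha/2}) = \alpha/2$.
$$\Rightarrow P\left(-Z_{\alpha/2} < \frac{\bar X - \mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$$
$$\Rightarrow P\left(\bar X - Z_{\alpha/2}\,\sigma/\sqrt{n} < \mu < \bar X + Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha$$
$$\Rightarrow \left(\bar X - Z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar X + Z_{\alpha/2}\,\sigma/\sqrt{n}\right) \text{ is a } 100(1-\alpha)\% \text{ confidence interval for } \mu$$
But usually $\sigma^2$ is not known; in that case we estimate it by its point estimator $S^2$.
$$\Rightarrow \left(\bar X - Z_{\alpha/2}\,S/\sqrt{n},\ \bar X + Z_{\alpha/2}\,S/\sqrt{n}\right) \text{ is a } 100(1-\alpha)\% \text{ confidence interval for } \mu$$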
Here are the z values corresponding to the most commonly used confidence levels.
$100(1-\alpha)\%$    $\alpha$    $\alpha/2$    $Z_{\alpha/2}$
90                   0.10        0.05          1.645
95                   0.05        0.025         1.96
99                   0.01        0.005         2.58
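The tabulated values can be reproduced from the inverse of the standard normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_critical` is an assumption for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return Z_{alpha/2}, the point with alpha/2 probability to its right."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

To two decimal places the 99% value agrees with the 2.58 used in the table.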
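Case 2: If the sample size is small and the population variance, $\sigma^2$, is not known.
$$t = \frac{\bar X - \mu}{S/\sqrt{n}} \text{ has a } t \text{ distribution with } n-1 \text{ degrees of freedom}$$
$$\Rightarrow \left(\bar X - t_{\alpha/2}\,S/\sqrt{n},\ \bar X + t_{\alpha/2}\,S/\sqrt{n}\right) \text{ is a } 100(1-\alpha)\% \text{ confidence interval for } \mu$$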
The unit of measurement of the confidence interval is the standard error. This is just the standard
deviation of the sampling distribution of the statistic.
Examples:
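1. From a normal population, a sample of size 25 gave a mean of 32. Given that the population standard
deviation is 4.2, find
a) A 95% confidence interval for the population mean.
b) A 99% confidence interval for the population mean.
Solution:
a)
$\bar X = 32$, $\sigma = 4.2$, $1 - \alpha = 0.95 \Rightarrow \alpha = 0.05$, $\alpha/2 = 0.025$
$\Rightarrow Z_{\alpha/2} = 1.96$ from table.
The required interval will be $\bar X \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$
$= 32 \pm 1.96 \times 4.2/\sqrt{25}$
$= 32 \pm 1.65$
$= (30.35, 33.65)$
b)
$\bar X = 32$, $\sigma = 4.2$, $1 - \alpha = 0.99 \Rightarrow \alpha = 0.01$, $\alpha/2 = 0.005$
$\Rightarrow Z_{\alpha/2} = 2.58$ from table.
The required interval will be $\bar X \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$
$= 32 \pm 2.58 \times 4.2/\sqrt{25}$
$= 32 \pm 2.17$
$= (29.83, 34.17)$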
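The arithmetic of Example 1 can be checked with a short sketch. The function name `z_interval` is an assumption, and the z values are the tabulated ones used in the solution rather than more precise quantiles.

```python
from math import sqrt

def z_interval(mean, sigma, n, z):
    """Confidence interval mean +/- z * sigma / sqrt(n) for known sigma."""
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin

ci95 = z_interval(32, 4.2, 25, z=1.96)
ci99 = z_interval(32, 4.2, 25, z=2.58)
print(tuple(round(v, 2) for v in ci95))
print(tuple(round(v, 2) for v in ci99))
```

The 99% interval is wider than the 95% interval, which is the price paid for the higher confidence level.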
2. A drug company is testing a new drug which is supposed to reduce blood pressure. From the
six people who are used as subjects, it is found that the average drop in blood pressure is 2.28
points, with a standard deviation of .95 points. What is the 95% confidence interval for the
mean change in pressure?
Solution: (exercise)
Hypothesis Testing
- This is also one way of making inference about population parameter, where the investigator
has prior notion about the value of the parameter.
Definitions:
- Statistical hypothesis: is an assertion or statement about the population whose plausibility is
to be evaluated on the basis of the sample data.
- Test statistic: is a statistic whose value serves to determine whether or not to reject the
hypothesis being tested. It is a random variable.
- Statistical test: is a test or procedure used to evaluate a statistical hypothesis; its value
depends on the sample data.
There are two types of hypothesis:
Null hypothesis:
- It is the hypothesis to be tested.
- It is the hypothesis of equality or the hypothesis of no difference.
- Usually denoted by H0.
Alternative hypothesis:
- It is the hypothesis available when the null hypothesis has to be rejected.
- It is the hypothesis of difference.
- Usually denoted by H1 or Ha.
Types and size of errors:
- Testing hypothesis is based on sample data which may involve sampling and non sampling
errors.
- The following table gives a summary of possible results of any hypothesis test:
                    Decision
Truth       Reject H0         Don't reject H0
H0          Type I Error      Right Decision
H1          Right Decision    Type II Error
- Type I error: Rejecting the null hypothesis when it is true.
- Type II error: Failing to reject the null hypothesis when it is false.
NOTE:
1. These are errors that are prevalent in any two-choice decision-making problem.
2. There is always a possibility of committing one or the other error.
3. Type I error ($\alpha$) and type II error ($\beta$) have an inverse relationship and therefore cannot be
minimized at the same time.
In practice we set $\alpha$ at some value and design a test that minimizes $\beta$. This is because a type
I error is often considered to be more serious, and therefore more important to avoid, than a type
II error.
General steps in hypothesis testing:
1. Specify the null hypothesis (H0) and the alternative hypothesis (H1).
2. Select a significance level, $\alpha$.
3. Identify the sampling distribution of the estimator.
4. Calculate a statistic analogous to the parameter specified by the null hypothesis.
5. Identify the critical region.
6. Make a decision.
7. Summarize the result.
Suppose the assumed or hypothesized value of $\mu$ is denoted by $\mu_0$; then one can formulate
two-sided (1) and one-sided (2 and 3) hypotheses as follows:
1. $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$
2. $H_0: \mu = \mu_0$ vs $H_1: \mu > \mu_0$
3. $H_0: \mu = \mu_0$ vs $H_1: \mu < \mu_0$
CASES:
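Case 1: When sampling is from a normal distribution with $\sigma^2$ known
- The relevant test statistic is
$$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}}$$
- After specifying $\alpha$ we have the following regions (critical and acceptance) on the standard
normal distribution corresponding to the above three hypotheses.
$H_1: \mu \neq \mu_0$: reject $H_0$ if $|Z_{cal}| > Z_{\alpha/2}$; do not reject $H_0$ if $|Z_{cal}| < Z_{\alpha/2}$
$H_1: \mu < \mu_0$: reject $H_0$ if $Z_{cal} < -Z_{\alpha}$; do not reject $H_0$ if $Z_{cal} > -Z_{\alpha}$
$H_1: \mu > \mu_0$: reject $H_0$ if $Z_{cal} > Z_{\alpha}$; do not reject $H_0$ if $Z_{cal} < Z_{\alpha}$
Where: $Z_{cal} = \dfrac{\bar X - \mu_0}{\sigma/\sqrt{n}}$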
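A minimal sketch of the Case 1 test is given below. The function name `z_test`, the labels for the alternatives, and the hypothesized mean of 30 are assumptions for illustration; the sample figures are borrowed from the confidence interval example (mean 32, standard deviation 4.2, n = 25).

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Return (Z_cal, reject) for a test of H0: mu = mu0 with sigma known."""
    z_cal = (xbar - mu0) / (sigma / sqrt(n))
    std = NormalDist()
    if alternative == "two-sided":   # H1: mu != mu0
        reject = abs(z_cal) > std.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":   # H1: mu > mu0
        reject = z_cal > std.inv_cdf(1 - alpha)
    elif alternative == "less":      # H1: mu < mu0
        reject = z_cal < -std.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z_cal, reject

# Hypothetical null value mu0 = 30.
z_cal, reject = z_test(xbar=32, mu0=30, sigma=4.2, n=25)
print(round(z_cal, 3), reject)
```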
Case 2: When sampling is from a normal distribution with $\sigma^2$ unknown and a small sample size
$$\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2_{(r-1)(c-1)}$$
where $O_{ij}$ = the number of units that belong to category $i$ of A and $j$ of B,
$e_{ij}$ = the expected frequency that belongs to category $i$ of A and $j$ of B.
Examples:
1. A geneticist took a random sample of 300 men to study whether there is an association between
father and son regarding boldness. He obtained the following results.
              Son
Father     Bold    Not
Bold       85      59
Not        65      91
Using $\alpha = 5\%$, test whether there is an association between father and son regarding boldness.
Solution:
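$H_0$: There is no association between father and son regarding boldness.
$H_1$: not $H_0$
- First calculate the row and column totals: $R_1 = 144$, $R_2 = 156$, $C_1 = 150$, $C_2 = 150$
- Then calculate the expected frequencies ($e_{ij}$'s) as $e_{ij} = \dfrac{R_i \times C_j}{n}$
$$e_{11} = \frac{R_1 \times C_1}{n} = \frac{144 \times 150}{300} = 72, \quad e_{12} = \frac{R_1 \times C_2}{n} = \frac{144 \times 150}{300} = 72$$
$$e_{21} = \frac{R_2 \times C_1}{n} = \frac{156 \times 150}{300} = 78, \quad e_{22} = \frac{R_2 \times C_2}{n} = \frac{156 \times 150}{300} = 78$$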
- Obtain the calculated value of the chi-square.
$$\chi^2_{cal} = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(O_{ij} - e_{ij})^2}{e_{ij}} = \frac{(85-72)^2}{72} + \frac{(59-72)^2}{72} + \frac{(65-78)^2}{78} + \frac{(91-78)^2}{78} = 9.028$$
- Obtain the tabulated value of chi-square.
$\alpha = 0.05$
Degrees of freedom $= (r-1)(c-1) = 1 \times 1 = 1$
$\chi^2_{0.05}(1) = 3.841$ from table.
- The decision is to reject $H_0$ since $\chi^2_{cal} > \chi^2_{0.05}(1)$.
Conclusion: At 5% level of significance we have evidence to say there is association between
father and son regarding boldness, based on this sample data.
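The computation in this example can be reproduced with a short sketch. The function name `chi_square_independence` is an assumption, and the critical value 3.841 is taken from the table lookup in the solution rather than computed.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom, expected table)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # e_ij = R_i * C_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

observed = [[85, 59], [65, 91]]  # rows: father, columns: son
chi2, df, expected = chi_square_independence(observed)
critical = 3.841                 # tabulated value for alpha = 0.05, df = 1
reject = chi2 > critical
print(expected, round(chi2, 3), df, reject)
```

The same function accepts larger tables, such as the 2 by 3 table in the exercise that follows, provided the matching critical value is looked up separately.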
2. A random sample of 200 men, all retired, was classified according to education and number
of children, as shown below.
Education level          Number of children
                         0-1     2-3     Over 3
Elementary               14      37      32
Secondary and above      31      59      27
Test the hypothesis that the size of the family is independent of the level of education attained by
fathers. (Use 5% level of significance) (exercise)
CHAPTER 9
SIMPLE LINEAR REGRESSION AND CORRELATION
Linear regression and correlation analysis is the study and measurement of the linear relationship
among two or more variables. When only two variables are involved, the analysis is referred to as simple
correlation and simple linear regression analysis, and when there are more than two variables the
terms multiple regression and partial correlation are used.
Regression Analysis: is a statistical technique that can be used to develop a mathematical
equation showing how variables are related.
Correlation Analysis: deals with the measurement of the closeness of the relationship which is
described in the regression equation.
We say there is correlation when the two series of items vary together directly or inversely.
Simple Correlation
Suppose we have two variables $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$.
When higher values of X are associated with higher values of Y and lower values of X are
associated with lower values of Y, then the correlation is said to be positive or direct.
Examples:
- Income and expenditure
- Height and weight
- Distance covered and fuel consumed by car.
When higher values of X are associated with lower values of Y and lower values of X are
associated with higher values of Y, then the correlation is said to be negative or inverse.
Examples:
- Demand and supply
- Income and the proportion of income spent on food.
The correlation between X and Y may be one of the following
1. Perfect positive (r=1)
2. Positive (r is between 0 and 1)
3. No correlation (r=0)
4. Negative (r is between -1 and 0)
5. Perfect negative (r=-1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called “subject” or “independent”
variable, while the effect is called “dependent” variable.
2. Both variables being the result of a common cause. That is, the correlation that exists
between two variables is due to their being related to some third force.
Example: Let X1 be the ESLCE result, Y1 be the rate of surviving in the University, and Y2 be
the rate of getting a scholarship.
Both X1 & Y1 and X1 & Y2 have high positive correlation. Likewise Y1 & Y2 have positive
correlation, but they are not directly related; they are related to each other via X1.
3. Chance: The correlation that arises by chance is called spurious correlation.
Examples:
Price of teff in Addis Ababa and grade of students in USA.
Weight of individuals in Ethiopia and income of individuals in Kenya.
Therefore, while interpreting correlation coefficient, it is necessary to see if there is any
likelihood of any relationship existing between variables under study.
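The correlation coefficient between X and Y, denoted by $r$, is given by
$$r = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2 \sum (Y_i - \bar Y)^2}}$$
and the short-cut formula is
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2]\,[n\sum Y^2 - (\sum Y)^2]}}$$
$$r = \frac{\sum XY - n\bar X\bar Y}{\sqrt{[\sum X^2 - n\bar X^2]\,[\sum Y^2 - n\bar Y^2]}}$$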
Remark: $r$ always lies between -1 and 1 inclusive, and it is symmetric.
Interpretation of r
Perfect positive linear relationship (if $r = 1$)
Some positive linear relationship (if $r$ is between 0 and 1)
No linear relationship (if $r = 0$)
Some negative linear relationship (if $r$ is between -1 and 0)
Perfect negative linear relationship (if $r = -1$)
Examples:
1. Calculate the simple correlation between mid semester and final exam scores of 10 students
(both out of 50)
Student Mid Sem.Exam (X) Final Sem.Exam (Y)
1 31 31
2 23 29
3 41 34
4 32 35
5 29 25
6 33 35
7 28 33
8 31 42
9 31 31
10 33 34
Solution:
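$n = 10$, $\bar X = 31.2$, $\bar Y = 32.9$, $\bar X^2 = 973.4$, $\bar Y^2 = 1082.4$
$\sum XY = 10331$, $\sum X^2 = 9920$, $\sum Y^2 = 11003$
$$r = \frac{\sum XY - n\bar X\bar Y}{\sqrt{[\sum X^2 - n\bar X^2]\,[\sum Y^2 - n\bar Y^2]}} = \frac{10331 - 10(31.2)(32.9)}{\sqrt{(9920 - 10(973.4))\,(11003 - 10(1082.4))}} = \frac{66.2}{182.5} = 0.363$$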
This means mid semester exam and final exam scores have a slightly positive correlation.
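The hand computation for the exam scores can be checked with the short-cut formula written in terms of raw sums. The function name `pearson_r` is an assumption for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation via r = [n*SXY - SX*SY] / sqrt([n*SXX - SX^2][n*SYY - SY^2])."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

mid = [31, 23, 41, 32, 29, 33, 28, 31, 31, 33]    # mid semester exam (X)
final = [31, 29, 34, 35, 25, 35, 33, 42, 31, 34]  # final exam (Y)
r = pearson_r(mid, final)
print(round(r, 3))
```

Swapping the two lists gives the same value, which is the symmetry noted in the remark.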
2. The following data were collected from a certain household on the monthly income (X) and
consumption (Y) for the past 10 months. Compute the simple correlation coefficient.(
Exercise)
X: 650 654 720 456 536 853 735 650 536 666
Y: 450 523 235 398 500 632 500 635 450 360
The above formula and procedure are only applicable to quantitative data. When we have
qualitative data like efficiency, honesty, intelligence, etc.,
we calculate what is called Spearman's rank correlation coefficient as follows:
Steps
i. Rank the different items in X and Y.
ii. Find the difference of the ranks in each pair; denote them by $D_i$.
iii. Use the following formula
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$$
where $r_s$ = coefficient of rank correlation,
$D$ = the difference between paired ranks,
$n$ = the number of pairs.
Example:
Aster and Almaz were asked to rank 7 different types of lipsticks. See if there is correlation
between the tastes of the ladies.
Lipsticks A B C D E F G
Aster 2 1 4 3 5 7 6
Almaz 1 3 2 4 5 6 7
Solution:
X (R1)    Y (R2)    D = R1 - R2    $D^2$
2         1         1              1
1         3         -2             4
4         2         2              4
3         4         -1             1
5         5         0              0
7         6         1              1
6         7         -1             1
Total                              12
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786$$
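The rank correlation for the lipstick rankings can be checked as follows. The function name `spearman_from_ranks` is an assumption, and the sketch assumes the inputs are already ranks with no ties, as in this example.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's r_s = 1 - 6 * sum(D^2) / (n (n^2 - 1)) for untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

aster = [2, 1, 4, 3, 5, 7, 6]
almaz = [1, 3, 2, 4, 5, 6, 7]
rs = spearman_from_ranks(aster, almaz)
print(round(rs, 3))
```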
$$b = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{\sum (X_i - \bar X)^2} = \frac{\sum XY - n\bar X\bar Y}{\sum X^2 - n\bar X^2}$$
$$a = \bar Y - b\bar X$$
Example 1: The following data shows the score of 12 students for Accounting and Statistics
Examinations.
a) Calculate a simple correlation coefficient
b) Fit a regression line of Statistics on Accounting using least square estimates.
c) Predict the score of Statistics if the score of accounting is 85.
Accounting Statistics
X Y
1 74.00 81.00
2 93.00 86.00
3 55.00 67.00
4 41.00 35.00
5 23.00 30.00
6 92.00 100.00
7 64.00 55.00
8 40.00 52.00
9 71.00 76.00
10 33.00 24.00
11 30.00 48.00
12 71.00 87.00
Scatter Diagram of raw data.
     Accounting (X)   Statistics (Y)   $X^2$   $Y^2$   $XY$
1 74.00 81.00 5476.00 6561.00 5994.00
2 93.00 86.00 8649.00 7396.00 7998.00
3 55.00 67.00 3025.00 4489.00 3685.00
4 41.00 35.00 1681.00 1225.00 1435.00
5 23.00 30.00 529.00 900.00 690.00
6 92.00 100.00 8464.00 10000.00 9200.00
7 64.00 55.00 4096.00 3025.00 3520.00
8 40.00 52.00 1600.00 2704.00 2080.00
9 71.00 76.00 5041.00 5776.00 5396.00
10 33.00 24.00 1089.00 576.00 792.00
11 30.00 48.00 900.00 2304.00 1440.00
12 71.00 87.00 5041.00 7569.00 6177.00
Total 687.00 741.00 45591.00 52525.00 48407.00
Mean 57.25 61.75
a)
The Coefficient of Correlation (r) has a value of 0.92. This indicates that the two variables are
positively correlated (Y increases as X increases).
b) Using OLS:
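Parts (a) to (c) can be worked numerically from the twelve score pairs with the least squares formulas $b = (\sum XY - n\bar X\bar Y)/(\sum X^2 - n\bar X^2)$ and $a = \bar Y - b\bar X$. The sketch below is an illustration rather than the original worked solution; the function and variable names are assumptions.

```python
from math import sqrt

accounting = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]    # X
stat_scores = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]  # Y

def least_squares(x, y):
    """Return intercept a, slope b and correlation r for the line Y = a + bX."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum(p * q for p, q in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(p * p for p in x) - n * x_bar ** 2
    syy = sum(q * q for q in y) - n * y_bar ** 2
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / sqrt(sxx * syy)
    return a, b, r

a, b, r = least_squares(accounting, stat_scores)
predicted = a + b * 85  # part (c): Statistics score when Accounting is 85
print(round(r, 2), round(b, 3), round(a, 2), round(predicted, 2))
```

Under these formulas the fitted line is roughly Y = 7.02 + 0.956X, and r agrees with the 0.92 reported in part (a).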
$$\sum_{i=1}^{5} X_i = 23{,}000, \quad \sum_{i=1}^{5} Y_i = 36, \quad \sum_{i=1}^{5} X_i Y_i = 212{,}000$$
a) Find the least squares regression equation of Y on X
b) Compute the correlation coefficient and interpret it.
c) Estimate the maintenance cost of a car which has been driven for 6 km
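- To know how far the regression equation has been able to explain the variation in Y, we use a
measure called the coefficient of determination ($r^2$),
$$\text{i.e. } r^2 = \frac{\sum (\hat Y - \bar Y)^2}{\sum (Y - \bar Y)^2}$$
$$S_{XY} = \frac{\sum (X_i - \bar X)(Y_i - \bar Y)}{n-1} = \frac{\sum XY - n\bar X\bar Y}{n-1}$$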
o Next we will see the relationship between the coefficients.
i. $r = \dfrac{S_{XY}}{S_X S_Y} \Rightarrow r^2 = \dfrac{S_{XY}^2}{S_X^2 S_Y^2}$
ii. $r = \dfrac{b S_X}{S_Y} \Rightarrow b = \dfrac{r S_Y}{S_X}$
o When we fit the regression of X on Y, we interchange X and Y in all formulas, i.e. we fit
$$\hat X = a_1 + b_1 Y$$
$$b_1 = \frac{\sum XY - n\bar X\bar Y}{\sum Y^2 - n\bar Y^2}$$
$$a_1 = \bar X - b_1\bar Y, \quad r = \frac{b_1 S_Y}{S_X}$$
Here X is dependent and Y is independent.
Choice of Dependent and Independent variable
- In correlation analysis there is no need to identify the dependent and independent
variables, because r is symmetric. But in regression analysis,
if $b_{YX}$ is the regression coefficient of Y on X and
$b_{XY}$ is the regression coefficient of X on Y,
then $r = \dfrac{b_{YX} S_X}{S_Y} = \dfrac{b_{XY} S_Y}{S_X} \Rightarrow r^2 = b_{YX} \times b_{XY}$.
- Moreover, $b_{YX}$ and $b_{XY}$ are completely different numerically as well as conceptually.
- Let us consider three cases concerning these coefficients.
1. If the correlation is perfect positive, i.e. $r = 1$, then the b values are reciprocals of each other.
2. If $S_X = S_Y$, then irrespective of the value of r the b values are equal, i.e.
$r = b_{YX} = b_{XY}$ (but this is an unlikely case).
3. The most important case is when $S_X \neq S_Y$ and $r \neq 1$; here the b values are neither equal nor reciprocals
of each other, but rather the two lines differ, intersecting at the common point $(\bar X, \bar Y)$.
Thus to determine if a regression equation is X on Y or Y on X, we have to use the
formula $r^2 = b_{YX} \times b_{XY}$.
If $r \in [-1, 1]$, then our assumption is correct.
If $r \notin [-1, 1]$, then our assumption is wrong.
Example: The regression lines between height (X) in inches and weight (Y) in lbs of male
students are:
$4Y - 15X + 530 = 0$ and
$20X - 3Y - 975 = 0$
Determine which is the regression of Y on X and which is that of X on Y.
Solution
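We will assume one of the equations to be the regression of X on Y and the other to be that of Y on X, and
calculate r.
Assume $4Y - 15X + 530 = 0$ is the regression of X on Y and
$20X - 3Y - 975 = 0$ is the regression of Y on X.
Then write these in the standard form.
$$4Y - 15X + 530 = 0 \Rightarrow X = \frac{530}{15} + \frac{4}{15}Y \Rightarrow b_{XY} = \frac{4}{15}$$
$$20X - 3Y - 975 = 0 \Rightarrow Y = -\frac{975}{3} + \frac{20}{3}X \Rightarrow b_{YX} = \frac{20}{3}$$
$$\Rightarrow r^2 = b_{XY} \times b_{YX} = \frac{4}{15} \times \frac{20}{3} = 1.78 > 1$$
This is impossible (contradiction). Hence our assumption is not correct. Thus
$4Y - 15X + 530 = 0$ is the regression of Y on X and
$20X - 3Y - 975 = 0$ is the regression of X on Y.
To verify:
$$4Y - 15X + 530 = 0 \Rightarrow Y = -\frac{530}{4} + \frac{15}{4}X \Rightarrow b_{YX} = \frac{15}{4}$$
$$20X - 3Y - 975 = 0 \Rightarrow X = \frac{975}{20} + \frac{3}{20}Y \Rightarrow b_{XY} = \frac{3}{20}$$
$$\Rightarrow r^2 = b_{YX} \times b_{XY} = \frac{15}{4} \times \frac{3}{20} = \frac{9}{16} \in (0, 1)$$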
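The trial-and-check procedure can be automated with exact fractions. The sketch below reads the two lines as $4Y - 15X + 530 = 0$ and $20X - 3Y - 975 = 0$; the signs are an assumption recovered from the slopes used in the solution. Each line is stored as hypothetical coefficients (a, b, c) of $aX + bY + c = 0$.

```python
from fractions import Fraction

def slopes(line):
    """For a*X + b*Y + c = 0, return the slope when solved for Y and for X."""
    a, b, _ = line
    b_yx = Fraction(-a, b)  # Y = -(a/b) X - c/b
    b_xy = Fraction(-b, a)  # X = -(b/a) Y - c/a
    return b_yx, b_xy

def candidate_r2(line1, line2):
    """r^2 = b_YX * b_XY under each of the two possible assignments."""
    b_yx_1, b_xy_1 = slopes(line1)
    b_yx_2, b_xy_2 = slopes(line2)
    return {
        "line1 is Y on X": b_yx_1 * b_xy_2,
        "line2 is Y on X": b_yx_2 * b_xy_1,
    }

line1 = (-15, 4, 530)   # 4Y - 15X + 530 = 0
line2 = (20, -3, -975)  # 20X - 3Y - 975 = 0
candidates = candidate_r2(line1, line2)
valid = [label for label, r2 in candidates.items() if 0 <= r2 <= 1]
print(candidates)
print(valid)
```

Only the assignment whose product lies between 0 and 1 is admissible, which here is the one treating the first line as the regression of Y on X.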