Stat For Comp (7-9)

CHAPTER 7
Sampling and Sampling Distribution

Sampling: Sampling is a statistical process in which one can select and examine sample units
and provide statistical information by involving a variety of techniques instead of studying the
whole population units. In other words, it is a process that allows the investigator to obtain
accurate information from a sample and relate that information to the population characteristic
without examining every unit of that population.
The sampling distribution of a statistic is the distribution of values taken by the statistic in all
possible samples of the same size. If the original population has a normal distribution, then the
sample mean also has a normal distribution.
The sample mean will also have an approximately normal distribution if the
sample size n is fairly large (at least 30 is a common rule of thumb), even if the original population
does not have a normal distribution. The larger the sample size, the closer the distribution of the
sample mean will be to a normal distribution.
Definitions:
 Parameter: Characteristic or measure obtained from a population.
 Statistic: Characteristic or measure obtained from a sample.
 Sampling: The process or method of sample selection from the population.
 Sampling unit: the ultimate unit to be sampled or elements of the population to be sampled.
Examples:
 If somebody studies the socio-economic status of households, households are the sampling unit.
 If one studies performance of freshman students in college, the student is the sampling unit.
 Sampling frame: is the list of all elements in a population.
Examples:
 List of households.
 List of students in the registrar office.
 Errors in sample survey: There are two types of errors
Sampling error:
 Is the discrepancy between the population value and sample value.
 May arise due to inappropriate sampling techniques being applied.
Non-sampling errors: are errors due to procedural bias, such as:
 Incorrect responses
 Measurement errors
 Errors at different stages in processing the data.
The Need for Sampling
 Reduced cost
 Greater speed
 Greater accuracy
 Greater scope
 More detailed information can be obtained.
- There are two types of sampling.
1. Random Sampling or probability sampling.
- It is a method of sampling in which all elements in the population have a pre-assigned non-zero
probability of being included in the sample.
Examples:
 Simple random sampling
 Stratified random sampling
 Cluster sampling
 Systematic sampling
a) Simple Random Sampling:
- Is a method of selecting items from a population such that every possible sample of a specific
size has an equal chance of being selected. In this case, sampling may be with or without
replacement. Equivalently, all elements in the population have the same pre-assigned non-zero
probability of being included in the sample.
- Simple random sampling can be done using either the lottery method or a table of random
numbers.
b) Stratified Random Sampling:
- The population is divided into non-overlapping but exhaustive groups called strata.
- A simple random sample is chosen from each stratum.
- Elements in the same stratum should be more or less homogeneous, while elements in different
strata should differ.
- It is applied if the population is heterogeneous.
- Some of the criteria for dividing a population into strata are: Sex (male, female); Age (under
18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).
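The stratified procedure above can be sketched in a few lines of Python. This is only an illustration: the population of (id, sex) pairs is hypothetical, and proportional allocation (the same sampling fraction in every stratum) is an assumption, since the notes do not say how many units to take from each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = rng or random.Random()
    strata = defaultdict(list)
    for unit in units:                       # split the population into strata
        strata[stratum_of(unit)].append(unit)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        size = max(1, round(fraction * len(members)))   # proportional allocation (assumed)
        sample.extend(rng.sample(members, size))        # SRS without replacement in the stratum
    return sample

# Hypothetical population: 60 females and 40 males, stratified by sex.
population = [(i, "female" if i < 60 else "male") for i in range(100)]
picked = stratified_sample(population, lambda u: u[1], 0.10, random.Random(0))
counts = {s: sum(1 for _, sex in picked if sex == s) for s in ("female", "male")}
print(counts)
```

With a 10% fraction each stratum contributes in proportion to its size, so both groups are guaranteed to appear in the sample.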
c) Cluster Sampling:
- The population is divided into non-overlapping groups called clusters.
- A simple random sample of clusters is chosen and all the sampling units
in the selected clusters are surveyed.
- Clusters are formed in such a way that elements within a cluster are heterogeneous, i.e.
observations in each cluster should be more or less dissimilar.
- Cluster sampling is useful when it is difficult or costly to generate a simple random sample.
For example, to estimate the average annual household income in a large city we use cluster
sampling, because to use simple random sampling we need a complete list of households in
the city from which to sample. To use stratified random sampling, we would again need the
list of households. A less expensive way is to let each block within the city represent a
cluster. A sample of clusters could then be randomly selected, and every household within
these clusters could be interviewed to find the average annual household income.
d) Systematic Sampling:
- A complete list of all elements within the population (sampling frame) is required.
- The procedure starts by determining the first element to be included in the sample.
- Then every kth item is taken from the sampling frame.
- Let $N$ = population size, $n$ = sample size, $k = N/n$ = sampling interval.
- Choose any number between 1 and $k$. Suppose it is $j$ $(1 \le j \le k)$.
- The $j$th unit is selected first, and then the $(j+k)$th, $(j+2k)$th, etc., until the required
sample size is reached.
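A minimal Python sketch of the systematic procedure follows. The frame of 100 numbered units is hypothetical, and taking k as the integer part of N/n is an assumption for the case where N is not an exact multiple of n.

```python
import random

def systematic_sample(frame, n, rng=None):
    """Select n units from the sampling frame using a random start and interval k."""
    rng = rng or random.Random()
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    k = N // n                      # sampling interval (integer part of N/n, assumed)
    j = rng.randint(1, k)           # random start with 1 <= j <= k
    # 1-based positions j, j+k, j+2k, ... converted to 0-based list indices
    return [frame[j - 1 + i * k] for i in range(n)]

frame = list(range(1, 101))         # hypothetical frame of N = 100 units
sample = systematic_sample(frame, 10, random.Random(1))
print(sample)
```

Only the starting point is random; every later unit is fixed by the interval, which is why a complete, ordered sampling frame is required.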

2. Non Random Sampling or non-probability sampling.
- It is a sampling technique in which the choice of individuals for a sample depends on the
basis of convenience, personal choice or interest.
Examples: Judgment sampling, convenience sampling, quota sampling, etc.
a) Judgment Sampling
- In this case, the person taking the sample has direct or indirect control over which items are
selected for the sample.
b) Convenience Sampling
- In this method, the decision maker selects a sample from the population in a manner that is
relatively easy and convenient.
c) Quota Sampling
- In this method, the decision maker requires the sample to contain a certain number of items
with a given characteristic. Many political polls are, in part, quota sampling.
Note: let $N$ = population size, $n$ = sample size.
1. Suppose simple random sampling is used.
 We have $N^n$ possible samples if sampling is with replacement.
 We have $\binom{N}{n}$ possible samples if sampling is without replacement.
2. From here onwards we assume that samples are drawn from a given population using
simple random sampling.
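The two counts in the note can be checked quickly in Python. The helper name `count_samples` is made up for this illustration; `math.comb` is the standard-library binomial coefficient.

```python
import math

def count_samples(N, n, with_replacement):
    """Number of possible samples of size n from a population of size N."""
    if with_replacement:
        return N ** n               # ordered samples, repeats allowed
    return math.comb(N, n)          # unordered samples, no repeats

print(count_samples(5, 2, True))
print(count_samples(5, 2, False))
```

For N = 5 and n = 2 this gives 25 samples with replacement and 10 without.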
Sampling Distribution of the sample mean
- The sampling distribution of the sample mean is a theoretical probability distribution that shows
the functional relationship between the possible values of a given sample mean based on
samples of size n and the probability associated with each value, for all possible samples of
size n drawn from that particular population.
- There are commonly three properties of interest of a given sampling distribution.
 Its Mean
 Its Variance
 Its Functional form.
Steps for the construction of Sampling Distribution of the mean
1. From a finite population of size N , randomly draw all possible samples of size n .
2. Calculate the mean for each sample.
3. Summarize the mean obtained in step 2 in terms of frequency distribution or relative
frequency distribution.
Example: Suppose we have a population of size $N = 5$, consisting of the ages of five children:
6, 8, 10, 12, and 14
Population mean $= \mu = 10$
Population variance $= \sigma^2 = 8$
Take samples of size 2 with replacement and construct the sampling distribution of the sample mean.
Solution: $N = 5$, $n = 2$
We have $N^n = 5^2 = 25$ possible samples since sampling is with replacement.

Step 1: Draw all possible samples:
        6         8         10        12        14
6     (6,6)     (6,8)     (6,10)    (6,12)    (6,14)
8     (8,6)     (8,8)     (8,10)    (8,12)    (8,14)
10    (10,6)    (10,8)    (10,10)   (10,12)   (10,14)
12    (12,6)    (12,8)    (12,10)   (12,12)   (12,14)
14    (14,6)    (14,8)    (14,10)   (14,12)   (14,14)
Step 2: Calculate the mean for each sample:
        6     8     10    12    14
6       6     7     8     9     10
8       7     8     9     10    11
10      8     9     10    11    12
12      9     10    11    12    13
14      10    11    12    13    14
Step 3: Summarize the means obtained in step 2 in terms of a frequency distribution.
$\bar{X}_i$   Frequency ($f_i$)
6 1
7 2
8 3
9 4
10 5
11 4
12 3
13 2
14 1
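The three steps can be reproduced by brute-force enumeration. The sketch below uses `itertools.product` to list all ordered samples with replacement; the variable names are chosen for this illustration only.

```python
from collections import Counter
from itertools import product

population = [6, 8, 10, 12, 14]
n = 2

# Step 1: all ordered samples of size n, drawn with replacement
samples = list(product(population, repeat=n))

# Step 2: the mean of each sample
means = [sum(s) / n for s in samples]

# Step 3: frequency distribution of the sample means
freq = Counter(means)
for value in sorted(freq):
    print(value, freq[value])

# Mean and variance of the sampling distribution, weighted by frequency
total = sum(freq.values())
mean_of_means = sum(m * f for m, f in freq.items()) / total
var_of_means = sum((m - mean_of_means) ** 2 * f for m, f in freq.items()) / total
print(mean_of_means, var_of_means)
```

The enumeration gives 25 samples, a symmetric frequency table peaking at 10, and a mean and variance of the sample mean equal to 10 and 4.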
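a) Find the mean of $\bar{X}$, say $\mu_{\bar{X}}$:
$\mu_{\bar{X}} = \frac{\sum \bar{X}_i f_i}{\sum f_i} = \frac{250}{25} = 10 = \mu$
b) Find the variance of $\bar{X}$, say $\sigma_{\bar{X}}^2$:
$\sigma_{\bar{X}}^2 = \frac{\sum (\bar{X}_i - \mu_{\bar{X}})^2 f_i}{\sum f_i} = \frac{100}{25} = 4 \ne \sigma^2$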
Remark:
1. In general, if sampling is with replacement,
$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$
2. If sampling is without replacement,
$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \left( \frac{N - n}{N - 1} \right)$
3. In any case the sample mean is an unbiased estimator of the population mean, i.e.
$\mu_{\bar{X}} = \mu \Rightarrow E(\bar{X}) = \mu$ (Show!)
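As a numerical check of remarks 2 and 3, the sketch below enumerates every sample of size 2 drawn without replacement from the five ages used in the example and compares the variance of the sample means with the finite-population formula. It is an illustration on one small population, not a proof.

```python
from itertools import combinations

population = [6, 8, 10, 12, 14]
N, n = len(population), 2

mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N      # population variance

# All samples of size n without replacement (order ignored)
means = [sum(s) / n for s in combinations(population, n)]
mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

# Remark 2: sigma^2 / n multiplied by the correction factor (N - n) / (N - 1)
formula = sigma2 / n * (N - n) / (N - 1)
print(mean_of_means, var_of_means, formula)
```

Here the 10 possible samples give a mean of 10 (unbiasedness) and a variance of 3, which matches (8/2)(3/4).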
- Sampling may be from a normally distributed population or from a non-normally distributed
population.
- When sampling is from a normally distributed population, the distribution of $\bar{X}$ will possess
the following properties.
1. The distribution of $\bar{X}$ will be normal.
2. The mean of $\bar{X}$ is equal to the population mean, i.e. $\mu_{\bar{X}} = \mu$.
3. The variance of $\bar{X}$ is equal to the population variance divided by the sample size, i.e.
$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$
$\Rightarrow \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
$\Rightarrow Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$
Central Limit Theorem

Given a population of any functional form with mean $\mu$ and finite variance $\sigma^2$, the sampling
distribution of $\bar{X}$, computed from samples of size $n$ from the population, will be approximately
normally distributed with mean $\mu$ and variance $\frac{\sigma^2}{n}$ when the sample size is large.
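A small simulation can make the theorem concrete. The sketch below assumes an exponential population with mean 1 and variance 1 (a strongly skewed choice made only for illustration), draws many samples of size 40, and checks that the sample means behave as the theorem predicts. The seed, sample size and number of replications are arbitrary.

```python
import random

rng = random.Random(42)             # fixed seed so the run is repeatable
n, replications = 40, 5000

def sample_mean(size):
    # Exponential(1) population: mean 1, variance 1, far from normal
    return sum(rng.expovariate(1.0) for _ in range(size)) / size

means = [sample_mean(n) for _ in range(replications)]

grand_mean = sum(means) / replications
variance = sum((m - grand_mean) ** 2 for m in means) / replications

# Share of sample means within 1.96 standard errors of the population mean
se = (1.0 / n) ** 0.5
coverage = sum(abs(m - 1.0) < 1.96 * se for m in means) / replications
print(grand_mean, variance, coverage)
```

The mean of the sample means should be close to 1, their variance close to 1/40, and roughly 95% of them should fall within 1.96 standard errors of the population mean.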
CHAPTER 8
ESTIMATION AND HYPOTHESIS TESTING
STATISTICAL ESTIMATION
 Inference is the process of making interpretations or conclusions from sample data for the
totality of the population.
 It is only the sample data that is ready for inference.
 In statistics there are two ways through which inference can be made.
 Statistical estimation
 Statistical hypothesis testing.
Statistical Estimation
This is one way of making inference about the population parameter where the investigator does
not have any prior notion about values or characteristics of the population parameter.
There are two types of estimation.
1) Point Estimation
 It is a procedure that results in a single value as an estimate for a parameter.
2) Interval estimation
 It is the procedure that results in an interval of values as an estimate for a parameter,
that is, an interval that contains the likely values of the parameter. It deals with identifying
the upper and lower limits of a parameter. The limits themselves are random variables.
Definitions
 Confidence Interval: An interval estimate with a specific level of confidence
 Confidence Level: The percent of the time the true value will lie in the interval estimate
given.
 Degrees of Freedom: The number of data values which are allowed to vary once a
statistic has been determined.
 Estimator: A sample statistic which is used to estimate a population parameter. A good estimator
should be unbiased, consistent, and relatively efficient.
 Estimate: Any of the possible values which an estimator can assume.
 Interval Estimate: A range of values used to estimate a parameter.
 Point Estimate: A single value used to estimate a parameter.
Properties of best estimator
 Unbiased Estimator: An estimator whose expected value is the value of the parameter
being estimated.
 Consistent Estimator: An estimator which gets closer to the value of the parameter as
the sample size increases.
 Relatively Efficient Estimator: Estimator for a parameter with the smallest variance.
Point estimation of the population mean: µ
A point estimator is the formula by which we compute the point estimate. A statistic used in this
way is also called a point estimate, since it estimates the parameter value. For instance, the sum
of the $x_i$ divided by $n$ is the point estimator used to compute the estimate of the population
mean, $\mu$. That is,
$\bar{X} = \frac{\sum x_i}{n}$
is a point estimator of the population mean.
Confidence interval estimation of the population mean
Although X possesses nearly all the qualities of a good estimator, because of sampling error, we
know that it's not likely that our sample statistic will be equal to the population parameter, but
instead will fall into an interval of values. We will have to be satisfied knowing that the statistic
is "close to" the parameter.
There are different cases to be considered to construct confidence intervals.
Case 1: If sample size is large or if the population is normal with known variance
Consider samples of size n drawn from a population whose mean is $\mu$ and standard deviation is
$\sigma$, with replacement and order important. The population can have any frequency distribution.
The sampling distribution of $\bar{X}$ will have a mean $\mu_{\bar{X}} = \mu$ and a standard deviation
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$, and approaches a normal distribution as n gets large.
This allows us to use the normal distribution curve for computing confidence intervals.
$\Rightarrow Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ has a normal distribution with mean = 0 and variance = 1
$\Rightarrow \mu = \bar{X} \pm Z \sigma / \sqrt{n} = \bar{X} \pm \varepsilon$, where $\varepsilon$ is a measure of error.
$\Rightarrow \varepsilon = Z \sigma / \sqrt{n}$
- For the interval estimator to be good the error should be small. How can it be made small?
 By making n large, by having small variability, and by taking Z small.
- To obtain the value of Z, we have to attach this to an area of size $1 - \alpha$ such that
$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$
Where $\alpha$ is the probability that the parameter lies outside the interval, and
$Z_{\alpha/2}$ stands for the standard normal value to the right of which
$\alpha/2$ probability lies, i.e. $P(Z > Z_{\alpha/2}) = \alpha/2$
$\Rightarrow P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$
$\Rightarrow P(\bar{X} - Z_{\alpha/2}\, \sigma / \sqrt{n} < \mu < \bar{X} + Z_{\alpha/2}\, \sigma / \sqrt{n}) = 1 - \alpha$
$\Rightarrow (\bar{X} - Z_{\alpha/2}\, \sigma / \sqrt{n},\ \bar{X} + Z_{\alpha/2}\, \sigma / \sqrt{n})$ is a $100(1 - \alpha)\%$ confidence interval for $\mu$
But usually $\sigma^2$ is not known; in that case we estimate it by its point estimator $S^2$.
$\Rightarrow (\bar{X} - Z_{\alpha/2}\, S / \sqrt{n},\ \bar{X} + Z_{\alpha/2}\, S / \sqrt{n})$ is a $100(1 - \alpha)\%$ confidence interval for $\mu$
Here are the z values corresponding to the most commonly used confidence levels.
$100(1 - \alpha)\%$    $\alpha$    $\alpha/2$    $Z_{\alpha/2}$
90                 0.10     0.05      1.645
95                 0.05     0.025     1.96
99                 0.01     0.005     2.58
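The tabulated values can be reproduced without a printed table. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `z_critical` is made up for this illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return Z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)   # area alpha/2 lies to the right

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}  z = {z_critical(level):.3f}")
```

The computed values agree with the table up to rounding; the 99% value is 2.576 to three decimals, which the table rounds to 2.58.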
Case 2: If sample size is small and the population variance, $\sigma^2$, is not known.
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$ has a t distribution with $n - 1$ degrees of freedom.
$\Rightarrow (\bar{X} - t_{\alpha/2}\, S / \sqrt{n},\ \bar{X} + t_{\alpha/2}\, S / \sqrt{n})$ is a $100(1 - \alpha)\%$ confidence interval for $\mu$
The unit of measurement of the confidence interval is the standard error. This is just the standard
deviation of the sampling distribution of the statistic.
Examples:
1. From a normal sample of size 25 a mean of 32 was found. Given that the population standard
deviation is 4.2, find
a) A 95% confidence interval for the population mean.
b) A 99% confidence interval for the population mean.
Solution:
a)
$\bar{X} = 32$, $\sigma = 4.2$, $1 - \alpha = 0.95 \Rightarrow \alpha = 0.05$, $\alpha/2 = 0.025$
$\Rightarrow Z_{\alpha/2} = 1.96$ from table.
$\Rightarrow$ The required interval will be $\bar{X} \pm Z_{\alpha/2}\, \sigma / \sqrt{n}$
$= 32 \pm 1.96 \times 4.2 / \sqrt{25}$
$= 32 \pm 1.65$
$= (30.35, 33.65)$
b)
$\bar{X} = 32$, $\sigma = 4.2$, $1 - \alpha = 0.99 \Rightarrow \alpha = 0.01$, $\alpha/2 = 0.005$
$\Rightarrow Z_{\alpha/2} = 2.58$ from table.
$\Rightarrow$ The required interval will be $\bar{X} \pm Z_{\alpha/2}\, \sigma / \sqrt{n}$
$= 32 \pm 2.58 \times 4.2 / \sqrt{25}$
$= 32 \pm 2.17$
$= (29.83, 34.17)$
2. A drug company is testing a new drug which is supposed to reduce blood pressure. From the
six people who are used as subjects, it is found that the average drop in blood pressure is 2.28
points, with a standard deviation of .95 points. What is the 95% confidence interval for the
mean change in pressure?
Solution: (exercise)
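The arithmetic of Example 1 above (known population standard deviation) can be checked with a short Python sketch. The function name `z_interval` is made up for this illustration, and the z values are the tabulated ones from the notes. Example 2 is left as an exercise and is not solved here.

```python
import math

def z_interval(xbar, sigma, n, z):
    """Confidence interval xbar +/- z * sigma / sqrt(n) for a known sigma."""
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

lo95, hi95 = z_interval(32, 4.2, 25, 1.96)   # part a)
lo99, hi99 = z_interval(32, 4.2, 25, 2.58)   # part b)
print(round(lo95, 2), round(hi95, 2))
print(round(lo99, 2), round(hi99, 2))
```

Raising the confidence level from 95% to 99% widens the interval, because a larger z multiplies the same standard error.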
Hypothesis Testing
- This is also one way of making inference about population parameter, where the investigator
has prior notion about the value of the parameter.
Definitions:
- Statistical hypothesis: is an assertion or statement about the population whose plausibility is
to be evaluated on the basis of the sample data.
- Test statistic: is a statistic whose value serves to determine whether to reject or accept the
hypothesis to be tested. It is a random variable.
- Statistical test: is a test or procedure used to evaluate a statistical hypothesis; its outcome
depends on the sample data.
There are two types of hypothesis:
Null hypothesis:
- It is the hypothesis to be tested.
- It is the hypothesis of equality or the hypothesis of no difference.
- Usually denoted by H0.
Alternative hypothesis:
- It is the hypothesis available when the null hypothesis has to be rejected.
- It is the hypothesis of difference.
- Usually denoted by H1 or Ha.
Types and size of errors:
- Testing hypothesis is based on sample data which may involve sampling and non sampling
errors.
- The following table gives a summary of possible results of any hypothesis test:
                       Decision
Truth          Reject H0          Don't reject H0
H0 true        Type I Error       Right Decision
H1 true        Right Decision     Type II Error
- Type I error: Rejecting the null hypothesis when it is true.
- Type II error: Failing to reject the null hypothesis when it is false.
NOTE:
1. These errors are present in any two-choice decision-making problem.
2. There is always a possibility of committing one or the other error.
3. The probabilities of type I error ($\alpha$) and type II error ($\beta$) have an inverse relationship
and therefore cannot be minimized at the same time.
- In practice we set $\alpha$ at some value and design a test that minimizes $\beta$. This is because a type
I error is often considered to be more serious, and therefore more important to avoid, than a type
II error.
General steps in hypothesis testing:
1. The first step in hypothesis testing is to specify the null hypothesis (H0) and the alternative
hypothesis (H1).
2. The next step is to select a significance level, $\alpha$.
3. Identify the sampling distribution of the estimator.
4. The fourth step is to calculate a statistic analogous to the parameter specified by the null
hypothesis.
5. Identify the critical region.
6. Make a decision.
7. Summarize the result.
Hypothesis testing about the population mean, $\mu$:
Suppose the assumed or hypothesized value of $\mu$ is denoted by $\mu_0$; then one can formulate
two-sided (1) and one-sided (2 and 3) hypotheses as follows:
1. $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$
2. $H_0: \mu = \mu_0$ vs $H_1: \mu < \mu_0$
3. $H_0: \mu = \mu_0$ vs $H_1: \mu > \mu_0$
CASES:
Case 1: When sampling is from a normal distribution with $\sigma^2$ known
- The relevant test statistic is
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
- After specifying $\alpha$ we have the following regions (critical and acceptance) on the standard
normal distribution corresponding to the above three hypotheses.
Summary table for decision rule.
$H_1$                Reject $H_0$ if                 Accept $H_0$ if
$\mu \ne \mu_0$      $|Z_{cal}| > Z_{\alpha/2}$      $|Z_{cal}| < Z_{\alpha/2}$
$\mu < \mu_0$        $Z_{cal} < -Z_{\alpha}$         $Z_{cal} > -Z_{\alpha}$
$\mu > \mu_0$        $Z_{cal} > Z_{\alpha}$          $Z_{cal} < Z_{\alpha}$
Where: $Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
Case 2: When sampling is from a normal distribution with $\sigma^2$ unknown and small sample
- The relevant test statistic is
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t$ with $n - 1$ degrees of freedom.
- After specifying $\alpha$ we have the following regions on the student t-distribution corresponding
to the above three hypotheses.
$H_1$                Reject $H_0$ if                 Accept $H_0$ if
$\mu \ne \mu_0$      $|t_{cal}| > t_{\alpha/2}$      $|t_{cal}| < t_{\alpha/2}$
$\mu < \mu_0$        $t_{cal} < -t_{\alpha}$         $t_{cal} > -t_{\alpha}$
$\mu > \mu_0$        $t_{cal} > t_{\alpha}$          $t_{cal} < t_{\alpha}$
Where: $t_{cal} = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
Case 3: When sampling is from a non-normally distributed population or a population
whose functional form is unknown.
- If the sample size is large, one can test a hypothesis about the mean by using:
$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$, if $\sigma^2$ is known.
$Z_{cal} = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$, if $\sigma^2$ is unknown.
- The decision rule is the same as in Case 1.
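The Z-based decision rule can be written as a small function. This is a sketch only: the function name, the labels "two-sided", "less" and "greater" for the three alternatives, and the numbers in the usage lines are all hypothetical and are not taken from the notes. Critical values come from `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha, alternative):
    """Return (Z_cal, reject_H0) for H0: mu = mu0 against the chosen alternative."""
    z_cal = (xbar - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":                      # H1: mu != mu0
        reject = abs(z_cal) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "less":                         # H1: mu < mu0
        reject = z_cal < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":                      # H1: mu > mu0
        reject = z_cal > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("unknown alternative")
    return z_cal, reject

# Hypothetical numbers chosen only to exercise the function
z_cal, reject = z_test(xbar=52.0, mu0=50.0, sigma=6.0, n=36, alpha=0.05,
                       alternative="greater")
print(z_cal, reject)
```

When $\sigma^2$ is unknown and the sample is large, the same function can be called with S in place of sigma, as in Case 3.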
Examples:
1. Test the hypothesis that the average content of containers of a certain lubricant is 10
liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9,
10.4, 10.3, and 9.8 liters. Use the 0.01 level of significance and assume that the distribution of
contents is normal.
Solution: Let $\mu$ = population mean, $\mu_0 = 10$
Step 1: Identify the appropriate hypothesis
$H_0: \mu = 10$ vs $H_1: \mu \ne 10$
Step 2: Select the level of significance, $\alpha = 0.01$ (given)
Step 3: Select an appropriate test statistic
The t-statistic is appropriate because the population variance is not known and the sample size is small.
Step 4: Identify the critical region.
Here we have two critical regions since we have a two-tailed hypothesis.
The critical region is $|t_{cal}| > t_{0.005}(9) = 3.2498$
$\Rightarrow (-3.2498, 3.2498)$ is the acceptance region.
Step 5: Computations:
$\bar{X} = 10.06$, $S = 0.25$
$\Rightarrow t_{cal} = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} = \frac{10.06 - 10}{0.25 / \sqrt{10}} = 0.76$
Step 6: Decision: Accept $H_0$, since $t_{cal}$ is in the acceptance region.
Step 7: Conclusion
At the 1% level of significance, we have no evidence to say that the average content of
containers of the given lubricant is different from 10 liters, based on the given sample data.
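The computations in Step 5 can be checked from the raw data. The sketch below uses `statistics.mean` and `statistics.stdev` (the sample standard deviation with divisor n - 1) and takes the critical value 3.2498 from the notes rather than computing it. Because it uses the unrounded S of about 0.2459, it gives a t value of about 0.77 rather than the 0.76 obtained with S rounded to 0.25; the decision is unchanged.

```python
import math
from statistics import mean, stdev

contents = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]
mu0 = 10
t_crit = 3.2498                     # t_{0.005}(9), quoted from the notes

n = len(contents)
xbar = mean(contents)
s = stdev(contents)                 # sample standard deviation, divisor n - 1
t_cal = (xbar - mu0) / (s / math.sqrt(n))

reject = abs(t_cal) > t_crit        # two-tailed decision rule
print(round(xbar, 2), round(s, 4), round(t_cal, 2), reject)
```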
2. The mean life time of a sample of 16 fluorescent light bulbs produced by a company is
computed to be 1570 hours. The population standard deviation is 120 hours. Suppose the
hypothesized value for the population mean is 1600 hours. Can we conclude that the life time of
light bulbs is decreasing? (Use $\alpha = 0.05$ and assume the normality of the population)
(exercise!)
Test of Association
- Suppose we have a population consisting of observations having two attributes or qualitative
characteristics say A and B.
- If the attributes are independent, then the probability of possessing both A and B is $P_A \times P_B$,
where $P_A$ is the probability that a member has attribute A and
$P_B$ is the probability that a member has attribute B.
- Suppose A has r mutually exclusive and exhaustive classes.
B has c mutually exclusive and exhaustive classes
- The entire set of data can be represented using an $r \times c$ contingency table.
                          B
A        B1     B2     ...    Bj     ...    Bc     Total
A1       O11    O12    ...    O1j    ...    O1c    R1
A2       O21    O22    ...    O2j    ...    O2c    R2
...
Ai       Oi1    Oi2    ...    Oij    ...    Oic    Ri
...
Ar       Or1    Or2    ...    Orj    ...    Orc    Rr
Total    C1     C2     ...    Cj     ...    Cc     n
- The chi-square test is used to test the hypothesis of independence of two attributes.
For instance, we may be interested in
 Whether the presence or absence of hypertension is independent of smoking habit or not.
 Whether the size of the family is independent of the level of education attained by the mother.
 Whether there is association between father and son regarding boldness.
 Whether there is association between stability of marriage and period of acquaintanceship
prior to marriage.
- The $\chi^2$ statistic is given by:
$\chi^2_{cal} = \sum_{i=1}^{r} \sum_{j=1}^{c} \left[ \frac{(O_{ij} - e_{ij})^2}{e_{ij}} \right] \sim \chi^2_{(r-1)(c-1)}$
Where $O_{ij}$ = the number of units that belong to category $i$ of A and $j$ of B.
$e_{ij}$ = expected frequency of units that belong to category $i$ of A and $j$ of B.
The $e_{ij}$ is given by:
$e_{ij} = \frac{R_i \times C_j}{n}$
Where $R_i$ = the $i$th row total.
$C_j$ = the $j$th column total.
$n$ = total number of observations.
Remark: $n = \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} = \sum_{i=1}^{r} \sum_{j=1}^{c} e_{ij}$
- The null and alternative hypothesis may be stated as:
H 0 : There is no association between A and B.
H1 : not H 0 ( There is association between A and B).
Decision Rule: Reject $H_0$ (independence) at the $\alpha$ level of significance if the calculated value of
$\chi^2$ exceeds the tabulated value with degrees of freedom equal to $(r - 1)(c - 1)$.
$\Rightarrow$ Reject $H_0$ if $\chi^2_{cal} = \sum_{i=1}^{r} \sum_{j=1}^{c} \left[ \frac{(O_{ij} - e_{ij})^2}{e_{ij}} \right] > \chi^2_{(r-1)(c-1)}$ at $\alpha$

Examples:
1. A geneticist took a random sample of 300 men to study whether there is association between
father and son regarding boldness. He obtained the following results.
                Son
Father     Bold    Not
Bold       85      59
Not        65      91
Using $\alpha = 5\%$, test whether there is association between father and son regarding boldness.
Solution:
$H_0$: There is no association between father and son regarding boldness.
$H_1$: not $H_0$
- First calculate the row and column totals: $R_1 = 144$, $R_2 = 156$, $C_1 = 150$, $C_2 = 150$
- Then calculate the expected frequencies ($e_{ij}$'s) as $e_{ij} = \frac{R_i \times C_j}{n}$
$\Rightarrow e_{11} = \frac{R_1 \times C_1}{n} = \frac{144 \times 150}{300} = 72$, $e_{12} = \frac{R_1 \times C_2}{n} = \frac{144 \times 150}{300} = 72$,
$e_{21} = \frac{R_2 \times C_1}{n} = \frac{156 \times 150}{300} = 78$, $e_{22} = \frac{R_2 \times C_2}{n} = \frac{156 \times 150}{300} = 78$
- Obtain the calculated value of the chi-square.
$\chi^2_{cal} = \sum_{i=1}^{2} \sum_{j=1}^{2} \left[ \frac{(O_{ij} - e_{ij})^2}{e_{ij}} \right]$
$= \frac{(85 - 72)^2}{72} + \frac{(59 - 72)^2}{72} + \frac{(65 - 78)^2}{78} + \frac{(91 - 78)^2}{78} = 9.028$
- Obtain the tabulated value of chi-square
$\alpha = 0.05$
Degrees of freedom $= (r - 1)(c - 1) = 1 \times 1 = 1$
$\chi^2_{0.05}(1) = 3.841$ from table.
- The decision is to reject $H_0$ since $\chi^2_{cal} > \chi^2_{0.05}(1)$
Conclusion: At 5% level of significance we have evidence to say there is association between
father and son regarding boldness, based on this sample data.
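The calculation in Example 1 generalizes to any r x c table. The following sketch computes the expected frequencies and the chi-square statistic from the observed counts; the function name is made up for this illustration, and the critical value 3.841 is the tabulated one quoted in the notes.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n    # e_ij = R_i * C_j / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed = [[85, 59],       # father bold: son bold, son not
            [65, 91]]       # father not:  son bold, son not
chi2, df = chi_square_independence(observed)
critical = 3.841            # chi-square 0.05 critical value with 1 df, from the notes
print(round(chi2, 3), df, chi2 > critical)
```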
2. A random sample of 200 men, all retired, was classified according to education and number
of children, as shown below.
Education level          Number of children
                         0-1    2-3    Over 3
Elementary               14     37     32
Secondary and above      31     59     27
Test the hypothesis that the size of the family is independent of the level of education attained by
fathers. (Use 5% level of significance) (exercise)
CHAPTER 9
SIMPLE LINEAR REGRESSION AND CORRELATION
Linear regression and correlation deal with studying and measuring the linear relationship among two or
more variables. When only two variables are involved, the analysis is referred to as simple
correlation and simple linear regression analysis, and when there are more than two variables the
terms multiple regression and partial correlation are used.
Regression Analysis: is a statistical technique that can be used to develop a mathematical
equation showing how variables are related.
Correlation Analysis: deals with the measurement of the closeness of the relationship which is
described in the regression equation.
We say there is correlation when the two series of items vary together directly or inversely.
Simple Correlation
Suppose we have two variables $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$.
 When higher values of X are associated with higher values of Y and lower values of X are
associated with lower values of Y, then the correlation is said to be positive or direct.
Examples:
- Income and expenditure
- Height and weight
- Distance covered and fuel consumed by car.
 When higher values of X are associated with lower values of Y and lower values of X are
associated with higher values of Y, then the correlation is said to be negative or inverse.
Examples:
- Demand and supply
- Income and the proportion of income spent on food.
The correlation between X and Y may be one of the following
1. Perfect positive (r=1)
2. Positive (r is between 0 and 1)
3. No correlation (r=0)
4. Negative (r is between -1 and 0)
5. Perfect negative (r=-1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called “subject” or “independent”
variable, while the effect is called “dependent” variable.
2. Both variables being the result of a common cause. That is, the correlation that exists
between two variables is due to their being related to some third force.
Example: Let X1 be the ESLCE result, Y1 be the rate of surviving in the University, and Y2 be
the rate of getting a scholarship.
Both X1 & Y1 and X1 & Y2 have high positive correlation. Likewise, Y1 & Y2 have positive
correlation, but they are not directly related; they are related to each other via X1.
3. Chance: The correlation that arises by chance is called spurious correlation.
Examples:
 Price of teff in Addis Ababa and grade of students in USA.
 Weight of individuals in Ethiopia and income of individuals in Kenya.
Therefore, while interpreting correlation coefficient, it is necessary to see if there is any
likelihood of any relationship existing between variables under study.
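The correlation coefficient between X and Y, denoted by $r$, is given by
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
and the short cut formula is
$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
or, equivalently,
$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2][\sum Y^2 - n\bar{Y}^2]}}$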
Remark: $r$ always lies between -1 and 1 inclusive, and it is symmetric in X and Y.
Interpretation of r
 Perfect positive linear relationship (if $r = 1$)
 Some positive linear relationship (if $r$ is between 0 and 1)
 No linear relationship (if $r = 0$)
 Some negative linear relationship (if $r$ is between -1 and 0)
 Perfect negative linear relationship (if $r = -1$)
Examples:
1. Calculate the simple correlation between mid semester and final exam scores of 10 students
(both out of 50)
Student Mid Sem.Exam (X) Final Sem.Exam (Y)
1 31 31
2 23 29
3 41 34
4 32 35
5 29 25
6 33 35
7 28 33
8 31 42
9 31 31
10 33 34
Solution:
$n = 10$, $\bar{X} = 31.2$, $\bar{Y} = 32.9$, $\bar{X}^2 = 973.4$, $\bar{Y}^2 = 1082.4$
$\sum XY = 10331$, $\sum X^2 = 9920$, $\sum Y^2 = 11003$
$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2][\sum Y^2 - n\bar{Y}^2]}}$
$= \frac{10331 - 10(31.2)(32.9)}{\sqrt{(9920 - 10(973.4))(11003 - 10(1082.4))}}$
$= \frac{66.2}{182.5} = 0.363$
This means mid semester exam and final exam scores have a slightly positive correlation.
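The same result can be obtained from the raw scores with the short cut formula. The function name `pearson_r` is made up for this illustration.

```python
import math

def pearson_r(x, y):
    """Simple correlation coefficient via the short cut formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

mid = [31, 23, 41, 32, 29, 33, 28, 31, 31, 33]      # mid semester exam (X)
final = [31, 29, 34, 35, 25, 35, 33, 42, 31, 34]    # final exam (Y)
r = pearson_r(mid, final)
print(round(r, 3))
```

Swapping the two arguments gives the same value, which illustrates the symmetry of r.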
2. The following data were collected from a certain household on the monthly income (X) and
consumption (Y) for the past 10 months. Compute the simple correlation coefficient. (Exercise)
X: 650 654 720 456 536 853 735 650 536 666
Y: 450 523 235 398 500 632 500 635 450 360
The above formula and procedure are only applicable to quantitative data. When we have
qualitative data like efficiency, honesty, intelligence, etc.,
we calculate what is called Spearman's rank correlation coefficient as follows:
Steps
i. Rank the different items in X and Y.
ii. Find the difference of the ranks in each pair; denote them by $D_i$.
iii. Use the following formula
$r_s = 1 - \frac{6 \sum D_i^2}{n(n^2 - 1)}$
Where $r_s$ = coefficient of rank correlation
$D$ = the difference between paired ranks
$n$ = the number of pairs
Example:
Aster and Almaz were asked to rank 7 different types of lipsticks. See if there is correlation
between the tastes of the ladies.
Lipsticks A B C D E F G
Aster 2 1 4 3 5 7 6
Almaz 1 3 2 4 5 6 7
Solution:
X (R1)    Y (R2)    D = R1-R2    D^2
2         1         1            1
1         3         -2           4
4         2         2            4
3         4         -1           1
5         5         0            0
7         6         1            1
6         7         -1           1
Total                            12
$\Rightarrow r_s = 1 - \frac{6 \sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786$
Yes, there is positive correlation.
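The rank correlation for this example can be computed directly from the two rankings. The sketch assumes the data are already ranks with no ties, which is the situation the formula above covers.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks (no ties assumed)."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

aster = [2, 1, 4, 3, 5, 7, 6]
almaz = [1, 3, 2, 4, 5, 6, 7]
rs = spearman_rs(aster, almaz)
print(round(rs, 3))
```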

Simple Linear Regression
- Simple linear regression refers to the linear relationship between two variables
- We usually denote the dependent variable by Y and the independent variable by X.
- A simple regression line is the line fitted to the points plotted in the scatter diagram, which
would describe the average relationship between the two variables. Therefore, to see the type
of relationship, it is advisable to prepare scatter plot before fitting the model.
- The linear model is:
$Y = \alpha + \beta X + \varepsilon$
Where: $Y$ = dependent variable
$X$ = independent variable
$\alpha$ = regression constant
$\beta$ = regression slope
$\varepsilon$ = random disturbance term
$Y \sim N(\alpha + \beta X, \sigma^2)$
$\varepsilon \sim N(0, \sigma^2)$
- To estimate the parameters ($\alpha$ and $\beta$) we have several methods:
 The free hand method
 The semi-average method
 The least square method
 The maximum likelihood method
 The method of moments
 Bayesian estimation technique.
- The above model is estimated by:
$\hat{Y} = a + bX$
Where $a$ is a constant which gives the value of Y when X = 0. It is called the Y-intercept. $b$ is
a constant indicating the slope of the regression line, and it gives a measure of the change in Y
for a unit change in X. It is also the regression coefficient of Y on X.
- $a$ and $b$ are found by minimizing $SSE = \sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2$
Where: $Y_i$ = observed value
$\hat{Y}_i$ = estimated value $= a + bX_i$
This method is known as OLS (ordinary least squares).
- Minimizing $SSE = \sum \varepsilon_i^2$ gives
$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$
$a = \bar{Y} - b\bar{X}$
Example 1: The following data shows the score of 12 students for Accounting and Statistics
Examinations.
a) Calculate a simple correlation coefficient
b) Fit a regression line of Statistics on Accounting using least square estimates.
c) Predict the score of Statistics if the score of accounting is 85.
Accounting Statistics
X Y
1 74.00 81.00
2 93.00 86.00
3 55.00 67.00
4 41.00 35.00
5 23.00 30.00
6 92.00 100.00
7 64.00 55.00
8 40.00 52.00
9 71.00 76.00
10 33.00 24.00
11 30.00 48.00
12 71.00 87.00
Scatter Diagram of raw data.
Accounting(X) Statistics(Y) X2 Y2 XY
1 74.00 81.00 5476.00 6561.00 5994.00
2 93.00 86.00 8649.00 7396.00 7998.00
3 55.00 67.00 3025.00 4489.00 3685.00
4 41.00 35.00 1681.00 1225.00 1435.00
5 23.00 30.00 529.00 900.00 690.00
6 92.00 100.00 8464.00 10000.00 9200.00
7 64.00 55.00 4096.00 3025.00 3520.00
8 40.00 52.00 1600.00 2704.00 2080.00
9 71.00 76.00 5041.00 5776.00 5396.00
10 33.00 24.00 1089.00 576.00 792.00
11 30.00 48.00 900.00 2304.00 1440.00
12 71.00 87.00 5041.00 7569.00 6177.00
Total 687.00 741.00 45591.00 52525.00 48407.00
Mean 57.25 61.75
a)
The Coefficient of Correlation (r) has a value of 0.92. This indicates that the two variables are
positively correlated (Y increases as X increases).
b) Using OLS:
$\Rightarrow \hat{Y} = 7.0194 + 0.9560X$ is the estimated regression line.
Scatter Diagram and Regression Line

c) Insert X = 85 in the estimated regression line.
$\hat{Y} = 7.0194 + 0.9560X$
$= 7.0194 + 0.9560(85) = 88.28$
Example 2: A car rental agency is interested in studying the relationship between the
distance driven in kilometers (Y) and the maintenance cost of its cars (X, in birr). The
following summary information is based on a sample of size 5. (Exercise)
$$\sum_{i=1}^{5} X_i^2 = 147{,}000{,}000, \quad \sum_{i=1}^{5} Y_i^2 = 314$$
$$\sum_{i=1}^{5} X_i = 23{,}000, \quad \sum_{i=1}^{5} Y_i = 36, \quad \sum_{i=1}^{5} X_i Y_i = 212{,}000$$
a) Find the least squares regression equation of Y on X.
b) Compute the correlation coefficient and interpret it.
c) Estimate the maintenance cost of a car that has been driven for 6 km.
- To know how far the regression equation has been able to explain the variation in Y we use a
measure called the coefficient of determination ($r^2$), i.e.
$$r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$
where r = the simple correlation coefficient.
- $r^2$ gives the proportion of the variation in Y explained by the regression of Y on X.
- $1 - r^2$ gives the unexplained proportion and is called the coefficient of indetermination.
Example: For the above problem (Example 1), $r = 0.9194$
$\Rightarrow r^2 = 0.8453$ $\Rightarrow$ 84.53% of the variation in Y is explained by the regression; only 15.47% remains
unexplained, and it is accounted for by the random term.
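The definition of $r^2$ as explained variation over total variation can be checked directly on the Example 1 data. This is a small sketch in plain Python; the names `sst`, `ssr` and `sse` (total, regression and error sums of squares) are labels chosen here for illustration.

```python
# Scores from Example 1 (X = Accounting, Y = Statistics).
x = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]
y = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit of Y on X.
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation in Y
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # left unexplained

r_squared = ssr / sst
print(f"r^2 = {r_squared:.4f}, unexplained = {1 - r_squared:.4f}")
```

The total variation splits into the explained and unexplained parts (`sst` equals `ssr + sse` up to rounding), which is why $1 - r^2$ can be read as the unexplained proportion.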
o Covariance of X and Y measures the co-variability of X and Y together. It is denoted by
$S_{XY}$ and given by
$$S_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} = \frac{\sum XY - n\bar{X}\bar{Y}}{n - 1}$$
o Next we will see the relationship between the coefficients.
i. $r = \dfrac{S_{XY}}{S_X S_Y} \Rightarrow r^2 = \dfrac{S_{XY}^2}{S_X^2 S_Y^2}$
ii. $r = \dfrac{b S_X}{S_Y} \Rightarrow b = \dfrac{r S_Y}{S_X}$
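The two relationships can be verified numerically on the Example 1 data. The sketch below assumes that $S_X$ and $S_Y$ are sample standard deviations with divisor n - 1 (matching the covariance formula), and uses `statistics.stdev` from the Python standard library for them. The slope is computed as `s_xy / s_x**2`, which is the text's formula for b with numerator and denominator both divided by n - 1.

```python
import statistics

# Scores from Example 1 (X = Accounting, Y = Statistics).
x = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]
y = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sample covariance and sample standard deviations (divisor n - 1).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x = statistics.stdev(x)
s_y = statistics.stdev(y)

r = s_xy / (s_x * s_y)   # relationship (i)
b = s_xy / s_x ** 2      # slope of Y on X, expressed through S_XY and S_X

# Relationship (ii): r = b * S_X / S_Y, equivalently b = r * S_Y / S_X.
print(f"r = {r:.4f}, b = {b:.4f}, r*S_Y/S_X = {r * s_y / s_x:.4f}")
```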
o When we fit the regression of X on Y, we interchange X and Y in all formulas, i.e. we fit
$$\hat{X} = a_1 + b_1 Y$$
$$b_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}, \quad a_1 = \bar{X} - b_1\bar{Y}, \quad r = \frac{b_1 S_Y}{S_X}$$
Here X is dependent and Y is independent.
Choice of Dependent and Independent variable
- In correlation analysis there is no need to identify the dependent and independent
variables, because r is symmetric. In regression analysis, however, the choice matters.
If $b_{YX}$ is the regression coefficient of Y on X and
$b_{XY}$ is the regression coefficient of X on Y, then
$$r = \frac{b_{YX} S_X}{S_Y} = \frac{b_{XY} S_Y}{S_X} \Rightarrow r^2 = b_{YX} \cdot b_{XY}$$
- Moreover, $b_{YX}$ and $b_{XY}$ are completely different numerically as well as conceptually.
- Let us consider three cases concerning these coefficients.
1. If the correlation is perfect positive, i.e. $r = 1$, then the b values are reciprocals of each other.
2. If $S_X = S_Y$, then irrespective of the value of r the b values are equal, i.e.
$r = b_{YX} = b_{XY}$ (but this is an unlikely case).
3. The most important case is when $S_X \neq S_Y$ and $r \neq 1$. Here the b values are neither equal nor
reciprocals of each other; the two lines differ, intersecting at the common point $(\bar{X}, \bar{Y})$.
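The third case can be illustrated with the Example 1 data, where $S_X \neq S_Y$ and $r \neq 1$. The sketch below fits both regressions in plain Python and checks that the product of the two slopes equals $r^2$ and that both lines pass through the point of means. The names `b_yx`, `a_yx`, `b_xy` and `a_xy` are illustrative.

```python
import math

# Scores from Example 1 (X = Accounting, Y = Statistics).
x = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]
y = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
r = s_xy / math.sqrt(s_xx * s_yy)

# Regression of Y on X: Y_hat = a_yx + b_yx * X
b_yx = s_xy / s_xx
a_yx = y_bar - b_yx * x_bar

# Regression of X on Y: X_hat = a_xy + b_xy * Y (X and Y interchanged)
b_xy = s_xy / s_yy
a_xy = x_bar - b_xy * y_bar

print(f"b_YX = {b_yx:.4f}, b_XY = {b_xy:.4f}")
print(f"b_YX * b_XY = {b_yx * b_xy:.4f}, r^2 = {r ** 2:.4f}")

# Both fitted lines pass through the point of means (X_bar, Y_bar).
print(a_yx + b_yx * x_bar, y_bar)
print(a_xy + b_xy * y_bar, x_bar)
```

Here the two slopes are neither equal nor reciprocals of each other, so the two fitted lines are distinct and meet only at the point of means.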
- Thus, to determine whether a regression equation is X on Y or Y on X, we use the
formula $r^2 = b_{YX} \cdot b_{XY}$.
  - If $r \in [-1, 1]$, then our assumption is correct.
  - If $r \notin [-1, 1]$, then our assumption is wrong.
Example: The regression lines between height (X) in inches and weight (Y) in lbs of male
students are:
$4Y - 15X + 530 = 0$ and
$20X - 3Y - 975 = 0$
Determine which is the regression of Y on X and which is the regression of X on Y.
Solution
We assume that one of the equations is the regression of X on Y and the other is the regression
of Y on X, and calculate r.
Assume $4Y - 15X + 530 = 0$ is the regression of X on Y and
$20X - 3Y - 975 = 0$ is the regression of Y on X.
Then write these in the standard form:
$$4Y - 15X + 530 = 0 \Rightarrow X = \frac{530}{15} + \frac{4}{15}Y \Rightarrow b_{XY} = \frac{4}{15}$$
$$20X - 3Y - 975 = 0 \Rightarrow Y = -\frac{975}{3} + \frac{20}{3}X \Rightarrow b_{YX} = \frac{20}{3}$$
$$\Rightarrow r^2 = b_{XY} \cdot b_{YX} = \left(\frac{4}{15}\right)\left(\frac{20}{3}\right) = 1.78 > 1$$
This is impossible (contradiction). Hence our assumption is not correct. Thus
$4Y - 15X + 530 = 0$ is the regression of Y on X and
$20X - 3Y - 975 = 0$ is the regression of X on Y.
To verify:
$$4Y - 15X + 530 = 0 \Rightarrow Y = -\frac{530}{4} + \frac{15}{4}X \Rightarrow b_{YX} = \frac{15}{4}$$
$$20X - 3Y - 975 = 0 \Rightarrow X = \frac{975}{20} + \frac{3}{20}Y \Rightarrow b_{XY} = \frac{3}{20}$$
$$\Rightarrow r^2 = b_{YX} \cdot b_{XY} = \left(\frac{15}{4}\right)\left(\frac{3}{20}\right) = \frac{9}{16} \in (0, 1)$$
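The trial-and-check procedure in this example can be sketched in a few lines of Python using exact fractions. The helper names `slopes`, `identify` and `intersection` are hypothetical, introduced only for this illustration; each line is written as coef_x*X + coef_y*Y + c = 0. The intersection of the two lines is also computed, since the regression lines meet at the point of means.

```python
from fractions import Fraction

def slopes(line):
    """Return (b_YX, b_XY) for a line coef_x*X + coef_y*Y + c = 0:
    its slope when read as Y on X, and its slope when read as X on Y."""
    coef_x, coef_y, _ = line
    return Fraction(-coef_x, coef_y), Fraction(-coef_y, coef_x)

def identify(line1, line2):
    """Try both assignments and keep the one with 0 <= r^2 <= 1."""
    b_yx_1, b_xy_1 = slopes(line1)
    b_yx_2, b_xy_2 = slopes(line2)
    r2 = b_yx_1 * b_xy_2  # assume line1 is Y on X and line2 is X on Y
    if 0 <= r2 <= 1:
        return "line1 is Y on X, line2 is X on Y", r2
    r2 = b_yx_2 * b_xy_1  # otherwise try the reverse assignment
    if 0 <= r2 <= 1:
        return "line1 is X on Y, line2 is Y on X", r2
    raise ValueError("neither assignment gives a valid r^2")

def intersection(line1, line2):
    """Solve the two linear equations by Cramer's rule."""
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    det = a1 * b2 - b1 * a2
    x = Fraction(-c1 * b2 + b1 * c2, det)
    y = Fraction(-a1 * c2 + c1 * a2, det)
    return x, y

line1 = (-15, 4, 530)   # 4Y - 15X + 530 = 0
line2 = (20, -3, -975)  # 20X - 3Y - 975 = 0

label, r2 = identify(line1, line2)
means = intersection(line1, line2)
print(label, "with r^2 =", r2)
print("lines intersect at (X, Y) =", means)
```

For the two lines of this example the sketch selects the same assignment as the solution above, with $r^2 = 9/16$, and locates the intersection at X = 66, Y = 115.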
