Solved Problems and Exercises. Part 2
STATISTICS
Problems and Exercises. Part 1
Humbat Nabi oglu Aliyev
STATISTICS
SOLVED PROBLEMS AND EXERCISES. PART 2
H.N.Aliyev
STATISTICS
Solved problems and exercises
Part 2
Almaty – 2006
ББК 60.6
A 36
Aliyev H.N.
A 36 Statistics. Solved problems and exercises. Part 2.
Almaty, 2006. 245 p.
ISBN 9965-9605-7-7
CONTENTS
Exercises
2.3. The Wilcoxon signed test
2.3.1. The Wilcoxon signed test for paired samples (small sample size)
2.3.2. The Wilcoxon signed test for paired samples (large sample size)
Exercises
2.4. The Mann-Whitney test
Exercises
Chapter 3. Simple linear regression
3.1. Introduction
3.2. The scatter diagram
3.3. Correlation analysis
3.3.1. Hypothesis test for correlation
Exercises
3.4. Spearman rank correlation
Exercises
3.5. The linear regression model
3.5.1. Least squares coefficient estimators
3.5.2. Least squares procedure
3.5.3. Interpretation of a and b
3.5.4. Assumptions of the regression model
Exercises
3.6. The explanatory power of a linear regression equation
3.6.1. Coefficient of determination $R^2$
3.6.2. Estimation of model error variance
Exercises
3.7. Statistical inference: Hypothesis tests and confidence intervals
3.7.1. Hypothesis testing about $\beta$
3.7.2. Confidence intervals for the population regression slope
Exercises
3.8. Using the regression model for predicting a particular value of y
Exercises
Chapter 4. Multiple regression analysis
4.1. Introduction
4.2. Multiple regression model
4.3. Standard assumptions for the multiple regression models
4.4. The explanatory power of a multiple regression equation
4.4.1. Estimation of error variance distribution
4.4.2. The coefficient of determination
4.4.3. Adjusted coefficient of determination
4.4.4. Predictions from the multiple regression models
Exercises
4.5. Computer solution of multiple regressions
4.6. Confidence interval for individual coefficients
Exercises
4.7. Test of hypothesis about individual coefficients
4.8. Tests on sets of regression parameters
Exercises
4.9. Dummy variables in the regression models
Exercises
Chapter 5. Analysis of variance (ANOVA)
5.1. Introduction
5.2. One-way analysis of variance
Exercises
5.3. The Kruskal-Wallis test
Exercises
5.4. Two-way analysis of variance
Exercises
Chapter 6. Statistical quality control
6.1. Introduction
6.2. Variation
6.3. Control charts
6.3.1. Control charts for means and standard deviations
6.4. Interpretation of control charts
Exercises
6.5. Control charts for proportions
Exercises
6.6. Control charts for number of occurrences: c-chart
Exercises
Chapter 7. Time series analysis and forecasting
7.1. Introduction to index numbers
7.1.1. Price index for a single item (Simple index number)
7.1.2. Unweighted aggregate price index
7.1.3. A weighted aggregate price index
7.1.4. A weighted aggregate quantity index
7.2. Commonly used index numbers
7.3. Deflating a series by price indexes
Exercises
7.4. A nonparametric test for randomness
7.4.1. The runs test for the small sample sizes
7.4.2. The runs test for the large sample sizes
Exercises
7.5. Components of time series
7.6. Moving averages
Exercises
7.7. Exponential smoothing
Exercises
7.8. Double exponential smoothing (Holt-Winters exponential forecasting model)
Exercises
Appendix
References
Chapter 1
Hypothesis testing
1.1. Introduction
Let us consider an example about coffee cans. A company may claim that, on
average, its cans contain 100 grams of coffee. A government agency may
want to test whether or not such cans contain, on average, 100 grams of
coffee.
Suppose we take a sample of 50 cans of the coffee under investigation and
find that the mean amount of coffee in these 50 cans is 97 grams.
Based on this result, can we state that, on average, all such cans contain
less than 100 grams of coffee and that the company is lying to the public?
Not until we perform a test of hypothesis. The reason is that the mean
$\bar{x} = 97$ grams is obtained from a sample. The difference between 100
grams (the required amount for the population) and 97 grams (the observed
average amount for the sample) may have occurred only because of
sampling error. Another sample of 50 cans may give us a mean of 105
grams. Therefore, we make a test of hypothesis to find out how large the
difference between 100 grams and 97 grams is and to investigate whether or
not this difference has occurred as a result of chance alone. If 97 grams were
the mean for all cans and not only for the 50 cans in the sample, we would
not need to make a test of hypothesis. Instead, we could immediately state
that the mean amount of coffee in all such cans is less than 100 grams. We
perform a test of hypothesis only when we are making a decision about a
population parameter based on the value of a sample statistic.
$H_0$, then either $H_0$ is true or our evidence is not sufficient to reject $H_0$, and
hence we accept $H_0$. Thus we will be more comfortable with our decision if we
reject $H_0$ and accept $H_1$.
A hypothesis, whether null or alternative, might specify a single
value, say $\theta_0$, for the population parameter $\theta$. In that case, the hypothesis is
said to be a simple hypothesis, designated as
$$H_0: \theta = \theta_0$$
That is read as, “The null hypothesis is that the population parameter $\theta$ is
equal to the specific value $\theta_0$”.
Alternatively, a range of values might be specified for the unknown parameter.
We define such a hypothesis as a composite hypothesis; it holds true
for more than one value of the population parameter. In many applications, a
simple null hypothesis, say
$$H_0: \theta = \theta_0$$
is tested against a composite alternative. One possibility would be to test the
null hypothesis against the general two-sided alternative
$$H_1: \theta \ne \theta_0$$
In other cases, only alternatives on one side of the null hypothesis are of
interest. For example, a government agency would be perfectly happy if the
mean weight of coffee cans were greater than 100 grams. Then we could write the
null hypothesis as
$$H_0: \theta \ge \theta_0$$
and the alternative hypothesis of interest might be
$$H_1: \theta < \theta_0$$
We call these hypotheses one-sided composite alternatives.
Example:
A company intends to accept the product unless it has evidence to suspect
that more than 10% of products are defective. Let $\theta$ denote the population
proportion of defectives. The null hypothesis is that the proportion is at
most 0.1, that is
$$H_0: \theta \le 0.1$$
and the alternative hypothesis is
$$H_1: \theta > 0.1$$
The null hypothesis is that the product is of adequate quality overall, while
the alternative is that the product is not of adequate quality. In this case the
product would only be rejected if there is strong evidence that there are more
than 10% defectives.
Once we have specified a null hypothesis and an alternative hypothesis and
collected sample data, a decision concerning the null hypothesis must be
made. We can either accept the null hypothesis or reject it in favor of the
alternative. For good reasons many statisticians prefer not to say
“accept the null hypothesis” and instead say “fail to reject the null hypothesis”.
When we accept or fail to reject the null hypothesis, then either the hypothesis
is true, or our test procedure was not strong enough to reject it and we have
committed an error. When we use the term accept a null hypothesis, that
statement can be considered shorthand for failure to reject.
From our discussion of sampling distributions, we know that the
sample mean generally differs from the population mean. With only a sample
mean we cannot be certain of the value of the population mean. Thus the
decision rule we adopt will have some chance of reaching an erroneous
conclusion. One error we call a Type I error. A Type I error is defined as the
rejection of the null hypothesis when the null hypothesis is true. We will see
that our decision rules will be defined so that the probability of rejecting a
true null hypothesis, denoted $\alpha$, is “small”. The probability $\alpha$ is
defined as the significance level of the test. Since the null hypothesis is
either accepted or rejected, it follows that the probability of accepting the
null hypothesis when it is true is $(1 - \alpha)$. The other possible error, called a
Type II error, arises when a false null hypothesis is accepted. For
a particular decision rule, the probability of making such an error when the
null hypothesis is false is denoted $\beta$. Then the probability of rejecting a
false null hypothesis is $(1 - \beta)$, which is called the power of the test.
Type I error
A Type I error occurs when a true null hypothesis is rejected. The value $\alpha$
represents the probability of committing this type of error, that is
$$\alpha = P(H_0 \text{ is rejected} \mid H_0 \text{ is true})$$
The value $\alpha$ represents the significance level of the test.
Type II error
A Type II error occurs when a false null hypothesis is not rejected. The
value $\beta$ represents the probability of committing a Type II error,
that is
$$\beta = P(H_0 \text{ is not rejected} \mid H_0 \text{ is false})$$
The value $(1 - \beta)$ is called the power of the test. It represents the
probability of not making a Type II error.
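The two error probabilities can be made concrete with a short numerical sketch. The Python below is an illustration only, not part of the original text: it assumes a left tailed z test with a known population standard deviation, and the function name, the value sigma = 10, and the "true" mean of 97 grams are all hypothetical choices made for the example.

```python
from math import sqrt
from statistics import NormalDist

def left_tailed_error_rates(mu0, mu_true, sigma, n, alpha):
    """Return (alpha, beta, power) for a left tailed z test of H0: mu >= mu0.

    Assumes sigma is known and the sample mean is normally distributed.
    """
    se = sigma / sqrt(n)                       # standard error of the sample mean
    # Reject H0 when the sample mean falls below this cutoff.
    cutoff = mu0 + NormalDist().inv_cdf(alpha) * se
    # Power: probability of landing in the rejection region when mu = mu_true.
    power = NormalDist(mu_true, se).cdf(cutoff)
    beta = 1 - power                           # probability of a Type II error
    return alpha, beta, power

# Hypothetical numbers: claimed mean 100 g, assumed sigma 10 g, true mean 97 g.
alpha, beta, power = left_tailed_error_rates(
    mu0=100, mu_true=97, sigma=10, n=50, alpha=0.05
)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, power = {power:.3f}")
```

Under these assumptions, a smaller alpha moves the cutoff further into the tail, which lowers the chance of a Type I error but raises beta for the same sample size.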
[Fig. 1.1. Two tailed test: sampling distribution of $\bar{x}$ (3.75 marked on the axis), with a shaded rejection region of area $\alpha/2$ in each tail and the nonrejection region between them.]
As shown in Figure 1.1, a two tailed test has two rejection regions, one in
each tail of the distribution curve.
[Fig. 1.2. Rejection region (shaded area $\alpha$) in the left tail and nonrejection region to its right; 120 is marked on the horizontal axis.]
c) A right tailed test
Suppose that the mean monthly income of all households was 45 500 tg in 2001.
We want to test if the current mean income of all households is higher than 45 500 tg.
The key phrase in this case is higher than, which indicates a right tailed test.
Let $\mu$ be the mean income of all households.
We write the null and alternative hypotheses for this test as
$H_0: \mu = 45500$ (The current mean income is not higher than 45 500 tg)
$H_1: \mu > 45500$ (The current mean income is higher than 45 500 tg)
In this case, we can also write the null hypothesis as $H_0: \mu \le 45500$, which
states that the current mean income is either equal to or less than 45 500 tg.
Again, the result of the test will not be affected whether we use an equal to
(=) or a less than or equal to ($\le$) sign in $H_0$, as long as the alternative
hypothesis has a greater than (>) sign.
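[Fig. 1.3. Right tailed test: sampling distribution of $\bar{x}$ with 45500 marked on the axis; nonrejection region on the left, rejection region in the right tail.]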
When an alternative hypothesis has a greater than (>) sign, the test is always
right tailed. As shown in Fig. 1.3, in a right tailed test the rejection
region is in the right tail of the distribution curve. The area of this rejection
region is equal to $\alpha$, the significance level. We will reject $H_0$ if the value of
$\bar{x}$ obtained from the sample falls in the rejection region. Otherwise, we will
not reject $H_0$.
Remark: Note that the null hypothesis always has an equal to (=), a less than
or equal to ($\le$), or a greater than or equal to ($\ge$) sign, and the alternative
hypothesis always has a not equal to ($\ne$), a greater than (>), or a less than
(<) sign.
Exercises
1. Explain which of the following is a two tailed test, a left tailed test, or a
right tailed test.
a) H 0 : 25, H 1 : 25
b) H 0 : 134 , H 1 : 134
c) H 0 : 16 , H 1 : 16
Show the rejection and nonrejection regions for each of these cases by
drawing a sampling distribution curve for the sample mean, assuming that
sample size is large in each case.
2. Consider H 0 : 35, against H 1 : 35 .
a) What type of error would you make if the null hypothesis is actually false
and you fail to reject it?
b) What type of error would you make if the null hypothesis is actually true
and you reject it?
3. For each of the following rejection regions, sketch the sampling
distribution for z and indicate the location of rejection region.
a) z 2.05 ; b) z 2.75 ; c) z 1.28 ;
d) z 2.13 ; f) z 2.575 or z 2.575 ;
g) z 1.82 or z 1.82
4. Write the null hypothesis and alternative hypothesis for each of the
following examples. Determine if each is a case of a two tailed, a left tailed,
or a right tailed test.
a) To test whether or not the mean price of houses in a certain city is greater
than $ 45 000.
b) To test if the mean number of hours spent working per week by students
who hold jobs is different from 18 hours.
c) To test whether the mean life of a particular brand of auto batteries is less
than 28 days.
d) To test if the mean amount of time taken by all workers to do a certain job
is more than 45 minutes.
f) To test whether the mean age of all managers of companies is different from 40
years.
g) To test whether the mean time for an airline passenger to obtain his or her luggage,
once luggage starts coming out on the conveyor belt, is less than 180 seconds.
$H_0: \mu = \mu_0$ or $H_0: \mu \ge \mu_0$ against the alternative
$H_1: \mu < \mu_0$
the decision rule is
Reject $H_0$ if T.S. $< -z_{\alpha}$
3. To test the null hypothesis
$H_0: \mu = \mu_0$ against the two sided alternative
$H_1: \mu \ne \mu_0$
the decision rule is
Reject $H_0$ if T.S. $< -z_{\alpha/2}$ or T.S. $> z_{\alpha/2}$,
where $z_{\alpha/2}$ is the number for which
$$P(Z > z_{\alpha/2}) = \alpha/2$$
and $Z$ is a standard normal random variable.
A statistical test of hypothesis procedure contains the following five steps:
1. State the null and alternative hypothesis
2. Select the distribution to use
3. Determine the rejection and nonrejection regions
4. Calculate the value of the test statistic
5. Make a decision.
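The five steps can be sketched as a small Python function. This is a minimal illustration under stated assumptions rather than the book's own procedure: the function name `z_test`, the tail labels 'left', 'right' and 'two', and the sample numbers are hypothetical, and the population standard deviation is taken as known (or the sample as large).

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alpha, tail):
    """z test for a population mean.

    tail is 'left' (H1: mu < mu0), 'right' (H1: mu > mu0) or 'two' (H1: mu != mu0).
    Returns the test statistic and whether H0 is rejected.
    """
    z = (x_bar - mu0) / (sigma / sqrt(n))      # Step 4: value of the test statistic
    std_normal = NormalDist()
    # Step 3: the rejection region depends on the form of the alternative.
    if tail == "right":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif tail == "left":
        reject = z < -std_normal.inv_cdf(1 - alpha)
    elif tail == "two":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, reject                           # Step 5: the decision

# Made-up data: n = 64, sample mean 52, sigma = 8, H0: mu = 50, alpha = 0.05.
z, reject = z_test(x_bar=52, mu0=50, sigma=8, n=64, alpha=0.05, tail="two")
print(round(z, 2), reject)
```

Steps 1 and 2 (stating the hypotheses and choosing the distribution) are decisions made by the analyst; in this sketch they are encoded by the `tail` argument and by the choice of the normal distribution.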
Example:
A manufacturer of detergent claims that the contents of boxes sold weigh, on
average, at least 160 grams. The distribution of weights is known to be
normal, with a standard deviation of 14 grams. A random sample of 16 boxes
yielded a sample mean weight of 158.9 grams. Test at the 10% significance
level the null hypothesis that the population mean is at least 160 grams.
Solution:
Let $\mu$ be the mean weight of all boxes and $\bar{x}$ be the corresponding mean
for the sample.
$n = 16$; $\sigma = 14$; $\bar{x} = 158.9$
The significance level $\alpha$ is 0.1. That is, the probability of rejecting the
null hypothesis when it is actually true should not exceed 0.1. This is the
probability of making a Type I error. We perform the test of hypothesis
using the five steps as follows.
Step 1. State the null and alternative hypotheses
We write the null and alternative hypotheses as
$H_0: \mu \ge 160$ grams
$H_1: \mu < 160$ grams
Step 2. Select the distribution to use
Since the population is normal and the population standard deviation is known, we will use
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
Step 3. Determine the rejection and nonrejection regions
The significance level is 0.1. The < sign in the alternative hypothesis indicates
that the test is left tailed. We look for the area 0.9 in the standard normal
distribution table (Table 1 of Appendix); the corresponding value is 1.28, so
the critical value of $z$ is $-1.28$ (Fig. 1.4).
Step 4. Calculate the value of the test statistic
The decision to reject or not to reject the null hypothesis will depend on
whether the evidence from the sample falls in the rejection or the nonrejection
region. If the value of the sample mean $\bar{x}$ falls in the rejection region, we
reject $H_0$. Otherwise we do not reject the null hypothesis. To locate the
position of $\bar{x} = 158.9$ on the sampling distribution curve of $\bar{x}$ in Figure 1.4,
we first calculate the $z$ value for $\bar{x} = 158.9$. This is called the value of the test
statistic.
$$\text{T.S.} = z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{158.9 - 160}{14/\sqrt{16}} = -0.31$$
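[Fig. 1.4. Left tailed test: rejection region of area $\alpha = 0.1$ to the left of $z = -1.28$ and nonrejection region to its right; the $\bar{x}$ axis is centered at 160 ($z = 0$).]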
Step 5. Make a decision
In the final step we make a decision based on the value of the test statistic
T.S. $= z$ computed for $\bar{x}$ in the previous step. This value of $z = -0.31$ is not
less than the critical value of $z = -1.28$, and it falls in the nonrejection region.
Hence we accept $H_0$ and conclude that, based on the sample information, it
appears that the mean weight of all boxes is at least 160 grams.
By accepting the null hypothesis we are stating that the difference between
the sample mean $\bar{x} = 158.9$ and the hypothesized value of the population
mean $\mu = 160$ is not too large and may have occurred because of chance or
sampling error. There is still a possibility that the mean weight is less than 160
grams and that, by the luck of the draw, we selected a sample with a mean that
is not too far from the required mean of 160 grams.
Solution:
Let $\mu$ be the mean length of bolts made on this machine and $\bar{x}$ be the
corresponding mean for the sample.
$n = 49$; $\bar{x} = 2.49$ cm; $s = 0.021$ cm
The mean length of all bolts is supposed to be 2.5 cm. The significance level
$\alpha$ is 0.05. That is, the probability of rejecting the null hypothesis when it
is actually true should not exceed 0.05.
Step 1. State the null and alternative hypotheses
We are testing to find whether or not the machine needs to be adjusted. The
machine will need an adjustment if the mean length of these bolts is either
less than 2.5 cm or more than 2.5 cm.
We write the null and alternative hypotheses as
$H_0: \mu = 2.5$ cm (The machine does not need adjustment)
$H_1: \mu \ne 2.5$ cm (The machine needs an adjustment)
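[Fig. 1.5. Two tailed test: rejection regions of area $\alpha/2 = 0.025$ to the left of $z = -1.96$ and to the right of $z = 1.96$, with the nonrejection region between them; the $\bar{x}$ axis is centered at 2.5.]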
There is a possibility that the mean length of bolts is in fact equal to 2.5 cm.
If so, we have wrongfully rejected the null hypothesis $H_0$. This is a Type I
error, and the probability of making such an error in this case is 0.05.
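The intermediate calculation for this example is not reproduced in this part of the text. As a hedged check, assuming the same large-sample test statistic as in the previous example with $s$ used in place of $\sigma$, the sample values stated in the solution give:

```latex
\text{T.S.} = z = \frac{\bar{x} - \mu}{s/\sqrt{n}}
            = \frac{2.49 - 2.5}{0.021/\sqrt{49}}
            = \frac{-0.01}{0.003}
            \approx -3.33
```

Under that assumption the statistic lies below the critical value $-1.96$ marked in Fig. 1.5, which is consistent with the rejection of $H_0$ discussed above.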
Exercises
b) Test the null hypothesis that $\mu = 100$ against the alternative hypothesis
that $\mu \ne 100$ using $\alpha = 0.05$. Interpret the results of the test.
c) Compare the results of the two tests you conducted. Explain why the
results differ.
6. In a random sample of 250 observations, the mean and standard deviation
are found to be 169.8 and 31.6, respectively. Is the claim that $\mu$ is larger than
169 substantiated by these data at the 10% level of significance?
7. From records, it is known that the duration of treating a disease by a
standard therapy has a mean of 15 days. It is claimed that a new therapy can
reduce the treatment time. To test this claim, the new therapy is tried on 70
patients, and from the data of their times to recovery, the sample mean and
standard deviation are found to be 14.6 and 3.0 days, respectively.
Perform the hypothesis test using a 2.5% level of significance.
8. Suppose that you are to verify the claim that $\mu > 20$ on the basis of a
random sample of size 70, and you know that $\sigma = 5.6$.
a) If you set the rejection region to be $\bar{x} > 21.31$, what is the level of
significance of your test?
b) Find the numerical value of $c$ so that the test $\bar{x} > c$ has a 5% level of
significance.
Answers
1. a) T .S. 9.00 ; reject H 0 ; b) T.S. 1.49 ; do not reject H 0 ;c) T .S. 8.57;
reject H 0 ; 2. a) T.S. 1.33 ; do not reject H 0 ; b) T .S. 3.20 ; reject H 0 ;
3. T.S. 32.71; reject H 0 ; 4. T.S. 1.87 ; accept H 0 ; 5. a) z 1.67 ;
reject $H_0$; b) $z = 1.67$; accept $H_0$; 6. T.S. $= 0.4$; accept $H_0$; 7. $z = -1.12$;
$H_0$ is not rejected at $\alpha = 0.025$; 8. a) $\alpha = 0.025$; b) $c = 21.10$.
1.4. Hypothesis testing using the p-value approach
1. Determine the value of the test statistic T.S. $= z$ corresponding to the result
of the sampling experiment.
2. a) If the test is one tailed, the p-value is equal to the tail area beyond $z$ in the
same direction as the alternative hypothesis. Thus, if the alternative
hypothesis is of the form >, the p-value is the area to the right of, or above,
the observed $z$ value. Conversely, if the alternative is of the form <, the
p-value is the area to the left of, or below, the observed $z$ value (Fig. 1.6, 1.7).
[Fig. 1.6. Left tailed test: the p-value is the area to the left of the test statistic.]
[Fig. 1.7. Right tailed test: the p-value is the area to the right of the test statistic.]
b) If the test is two tailed, the p-value is equal to twice the area beyond the
observed z-value in the direction of the sign of z. That is, if z is positive, the
p-value is twice the area to the right of, or above, the observed z- value.
Conversely, if z is negative, the p-value is twice the area to the left of, or
below, the observed z-value. (See Fig.1.8)
[Fig. 1.8. Two tailed test: an area of p/2 in each tail.]
Example:
The management of a health club claims that its members lose an average of
10 kg or more within the first month after joining the club. A random sample
of 36 members of this health club was taken, and it was found that they lost an
average of 9.2 kg within the first month of membership, with a standard
deviation of 2.4 kg. Find the p-value for this test.
Solution:
Let $\mu$ be the mean weight lost during the first month of membership by all
members and $\bar{x}$ be the corresponding mean for the sample.
Step 1. State the null and alternative hypotheses
of $\bar{x}$ where $\bar{x}$ is less than 9.2. To find this area, we first find the $z$ value for
$\bar{x} = 9.2$ as follows
$$\text{T.S.} = z = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{9.2 - 10}{2.4/\sqrt{36}} = -2.00$$
The area to the left of $\bar{x} = 9.2$ under the sampling distribution of $\bar{x}$ is equal to
the area under the standard normal curve to the left of $z = -2.00$. The area to
the left of $z = -2.00$ is 0.0228. Consequently,
p-value $= 0.0228$
Thus, based on the p-value of 0.0228 we can state that for any $\alpha$
(significance level) greater than 0.0228 we will reject the null hypothesis,
and for any $\alpha$ less than 0.0228 we will not reject the null hypothesis.
Suppose we make the test for this example at $\alpha = 0.01$. Because $\alpha = 0.01$ is
less than the p-value of 0.0228, we will not reject the null hypothesis. Now
suppose we make the test at $\alpha = 0.05$. Because $\alpha = 0.05$ is greater than the
p-value of 0.0228, we will reject the null hypothesis.
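The p-value rules for one tailed and two tailed tests can be sketched in Python as follows. This is only an illustration: the helper name `p_value` and its tail labels are hypothetical, and the normal areas come from the standard library rather than from the table in the Appendix. The data are those of the health club example.

```python
from math import sqrt
from statistics import NormalDist

def p_value(z, tail):
    """p-value of an observed z statistic; tail is 'left', 'right' or 'two'."""
    area_left = NormalDist().cdf(z)            # area below the observed z
    if tail == "left":
        return area_left
    if tail == "right":
        return 1 - area_left
    if tail == "two":
        # Twice the tail area beyond z, in the direction of the sign of z.
        return 2 * min(area_left, 1 - area_left)
    raise ValueError("tail must be 'left', 'right' or 'two'")

# Health club example: n = 36, sample mean 9.2, s = 2.4, hypothesized mean 10.
z = (9.2 - 10) / (2.4 / sqrt(36))
p = p_value(z, "left")
print(round(z, 2), round(p, 4))                # -2.0 0.0228

for alpha in (0.01, 0.05):
    print(alpha, "reject H0" if p < alpha else "do not reject H0")
```

Comparing the p-value with each candidate significance level reproduces the two decisions described above without looking up a separate critical value for each level.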
Exercises
3. In a given situation, suppose $H_0$ was rejected at $\alpha = 0.05$. Answer the
following questions as “yes”, “no”, or “can’t tell” as the case may be.
a) Would $H_0$ also be rejected at $\alpha = 0.02$?
b) Would $H_0$ also be rejected at $\alpha = 0.10$?
c) Is the p-value smaller than 0.05?
4. In a problem of testing H 0 : 75 against H 1 : 75 , the following
sample quantities are recorded.
$n = 56$; $\bar{x} = 77.04$; $s = 6.80$
a) State the test statistic and find the rejection region with $\alpha = 0.05$.
b) Calculate the test statistic and draw a conclusion with $\alpha = 0.05$.
c) Find the p-value and interpret the results.
Answers
Many times the size of a sample that is used to make a test of hypothesis about
$\mu$ is small, that is, $n < 30$. If the population is (approximately) normally
distributed, the population standard deviation $\sigma$ is not known, and the sample
size is small ($n < 30$), then the normal distribution is replaced by the
Student’s t distribution to make a test of hypothesis about $\mu$. In such a case
the random variable
$$t_{n-1} = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
has a Student’s t distribution with $(n - 1)$ degrees of freedom.
The value of the test statistic $t$ for the sample mean $\bar{x}$ is computed as
$$\text{T.S.} = t_{n-1} = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
where $\mu$ is set to the hypothesized value $\mu_0$, and we can use the following
tests with significance level $\alpha$.
1. To test either null hypothesis
$H_0: \mu = \mu_0$ or $H_0: \mu \le \mu_0$ against the alternative
$H_1: \mu > \mu_0$
the decision rule is
Reject $H_0$ if T.S. $> t_{n-1,\alpha}$
2. To test either null hypothesis
$H_0: \mu = \mu_0$ or $H_0: \mu \ge \mu_0$ against the alternative
$H_1: \mu < \mu_0$
the decision rule is
Reject $H_0$ if T.S. $< -t_{n-1,\alpha}$
3. To test the null hypothesis
$H_0: \mu = \mu_0$ against the two sided alternative
$H_1: \mu \ne \mu_0$
the decision rule is
Reject $H_0$ if T.S. $< -t_{n-1,\alpha/2}$ or T.S. $> t_{n-1,\alpha/2}$.
Here, $t_{n-1,\alpha}$ is the number for which
$$P(t_{n-1} > t_{n-1,\alpha}) = \alpha$$
where the random variable $t_{n-1}$ follows a Student’s t distribution with
$(n - 1)$ degrees of freedom.
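The three decision rules can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `t_test`, its tail labels, and the sample numbers are hypothetical. Because the Python standard library has no Student's t distribution, the critical value is passed in by the caller after being read from a t table.

```python
from math import sqrt

def t_test(x_bar, mu0, s, n, t_crit, tail):
    """Small-sample t test for a population mean.

    t_crit is the positive table value t_{n-1, alpha} for a one tailed test,
    or t_{n-1, alpha/2} for a two tailed test, with n - 1 degrees of freedom.
    Returns the test statistic and whether H0 is rejected.
    """
    t = (x_bar - mu0) / (s / sqrt(n))
    if tail == "right":                        # rule 1: H1: mu > mu0
        reject = t > t_crit
    elif tail == "left":                       # rule 2: H1: mu < mu0
        reject = t < -t_crit
    elif tail == "two":                        # rule 3: H1: mu != mu0
        reject = abs(t) > t_crit
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return t, reject

# Made-up data: n = 9, sample mean 52, s = 6, H0: mu <= 50, alpha = 0.05.
# With 8 degrees of freedom, a standard t table gives t_{8, 0.05} = 1.860.
t, reject = t_test(x_bar=52, mu0=50, s=6, n=9, t_crit=1.860, tail="right")
print(round(t, 2), reject)
```

Passing the table value explicitly mirrors the hand calculation, in which the critical value is read from Table 2 of the Appendix for the given degrees of freedom.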
Example:
A company that produces auto batteries claims that its batteries are good,
on average, for at least 64 days. A consumer protection agency tested 15
such batteries to check this claim. It found the mean life of these 15 batteries
to be 62 days with a standard deviation of 3 days. At the 5% significance
level, can you conclude that the claim of the company is true? Assume that
the life of such a battery has an approximately normal distribution.
Solution:
Let $\mu$ be the mean life of all batteries and $\bar{x}$ be the corresponding mean for
the sample. Then from the given information,
$n = 15$; $\bar{x} = 62$ days; $s = 3$ days
The mean life of all batteries is supposed to be at least 64 days. The
significance level $\alpha$ is 0.05. That is, the probability of rejecting the null
hypothesis when it is actually true should not exceed 0.05.
Step 1. State the null and alternative hypotheses
We write the null and alternative hypotheses as
$H_0: \mu \ge 64$ days (The mean life is at least 64 days)
$H_1: \mu < 64$ days (The mean life is less than 64 days)
Step 2. Select the distribution to use
The sample size is small ($n = 15$), and the life of a battery is approximately
normally distributed. Since the population standard deviation is unknown, we
use the Student’s t distribution to make the test.
Step 3. Determine the rejection and nonrejection regions
The significance level is 0.05. The < sign in the alternative hypothesis indicates
that the test is left tailed, with the rejection region in the left tail of the t
distribution curve.
Area in the left tail $= \alpha = 0.05$
Degrees of freedom $= n - 1 = 15 - 1 = 14$
From the Student’s t distribution table (Table 2 of Appendix), the critical
value of $t$ for 14 degrees of freedom and an area of 0.05 in the left tail
is $-1.761$ (Fig. 1.9).
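[Fig. 1.9. Student’s t distribution curve with the rejection region of area 0.05 in the left tail, to the left of $t = -1.761$.]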
Step 4. Calculate the value of the test statistic
As $\sigma$ is not known and the sample size is small, we calculate the $t$ value as
follows
$$\text{T.S.} = t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{62 - 64}{3/\sqrt{15}} = -2.58$$
Step 5. Make a decision
The value of T.S. $= t = -2.58$ is less than the critical value of $t = -1.761$, and
it falls in the rejection region. Therefore, we reject $H_0$ and conclude that the
sample mean is too small compared to 64 days (the company’s claimed value
of $\mu$) and the difference between the two may not be attributed to chance
alone. We can conclude that the mean life of the company’s batteries is less than
64 days.
Remark: The conclusion of a t-test can also be strengthened by reporting
the significance probability (p-value) of the observed statistic. Since the t
table provides only a few selected percentage points, we can get an idea
about the p-value but not its exact determination. For instance, the data in
the example above gave an observed value T.S. $= t = -2.58$ with 14 degrees of
freedom. Scanning the t table for $(n - 1) = 14$, we notice that 2.58
lies between $t_{0.025} = 2.145$ and $t_{0.010} = 2.624$. Therefore, the p-value of
$t = -2.58$ is smaller than 0.025 but greater than 0.010.
Exercises
3. Consider H 0 : 40 versus H 0 : 40 for a population that is normally
distributed.
a) A random sample of 16 observations taken from this population produced
a sample mean of 45 and a standard deviation of 5. Using $\alpha = 0.025$, would
you reject the null hypothesis?
b) Another random sample of 16 observations taken from the same
population produced a sample mean of 41.9 and a standard deviation of 7.
Using $\alpha = 0.025$, would you reject the null hypothesis?
Comment on the results of parts a) and b).
4. Assuming that respective populations are normally distributed, make the
following hypothesis tests.
_
a) H 0 : 60 ; H 1 : 60 ; n 14 ; x 56 ; s 9 ; 0.05
_
b) H 0 : 35 ; H 1 : 35 ; n 24 ; x 29 ; s 5.4 ; 0.005
_
c) H 0 : 47 ; H 1 : 47 ; n 18 ; x 51 ; s 6; 0.001
5. A business school claims that students who complete a three-month typing
course can type, on average, at least 1200 words an hour.
A random sample of 25 students who completed this course typed, on
average, 1130 words an hour with a standard deviation of 85 words. Assume
that the typing speeds for all students who complete this course have an
approximately normal distribution.
Using the 5% significance level, can you conclude that the claim of the
business school is true?
6. The supplier of home heating furnaces of a new model claims that the
average efficiency of the new model is at least 60. Before buying these
heating furnaces, a distributor wants to verify the supplier’s claim is valid.
To this end, the distributor chooses a random sample of 9 heating furnaces of
a new model and measures their efficiency. The data are
63; 72; 64; 69; 59; 65; 66; 64; 65
Determine the rejection region of the test with $\alpha = 0.05$. Apply the test and
state your conclusion.
7. A past study claims that adults spend an average of 18 hours a week on
leisure activities. A researcher wanted to test this claim. He took a sample of
10 adults and asked them about the time they spend per week on leisure
activities. Their responses (in hours) were as follows
14; 25; 22; 38; 16; 26; 19; 23; 41; 33
Assume that the time spent on leisure activities by all adults is normally
distributed. Using the 5% significance level, can you conclude that the claim
of earlier study is true?
8. According to the Department of Labor, private sector workers earned, on
average, $354.32 a week in 2001. A recently taken random sample of 400
private sector workers showed that they earn, on average, $362.50 a week
with a standard deviation of $72. Find the p-value for the test with an alternative
hypothesis that the current mean weekly salary of private sector workers is
different from $354.32.
9. A manufacturer of a light bulbs claims that the mean life of these bulbs is
at least 2500 hours. A consumer agency wanted to check whether or not this
claim is true. The agency took a random sample of 36 such bulbs and tested
them. The mean life for the sample was found to be 2447 hours with a
standard deviation of 180 hours.
a) Do you think that the sample information supports the company’s claim?
Use $\alpha = 2.5\%$.
b) What is the Type I error in this case? Explain. What is the probability of
making this error?
c) Will your conclusion of part a) change if the probability of making a
Type I error is zero?
10. Given the eight sample observations 31, 29, 26, 33, 40, 28, 30, and 25,
test the null hypothesis that the mean equals 35 versus the alternative that it
does not. Let $\alpha = 0.01$.
Answers
1.6. Tests of the population proportion (Large sample)
Once again, $z_\alpha$ is the number for which
$$P(Z > z_\alpha) = \alpha$$
and $Z$ has the standard normal distribution.
Example:
Mr. A and Mr. B are running for local public office in a large city. Mr. A
says that only 30% of the voters are in favor of a certain issue, a law to sell
liquor on Sundays. Mr. B doubts A’s statement and believes that more than
30% favor such legislation. Mr. B pays for an independent organization to
make a study of this situation. In a random sample 400 voters, 160 favored
the legislation. What conclusions should the polling organization report to
Mr. B?
Solution:
Let $p$ be the proportion of all voters who favor such legislation and $\hat{p}$ the
corresponding sample proportion. Then, from the given information,
$$n = 400; \quad p_0 = 0.30; \quad \hat{p} = \frac{160}{400} = 0.40.$$
Let $\alpha = 0.05$.
The null and alternative hypotheses are as follows:
$$H_0: p = p_0 = 0.30$$
$$H_1: p > 0.30$$
The decision rule is to reject the null hypothesis in favor of the alternative if
$$T.S. > z_\alpha$$
Because the test is right tailed, the whole of $\alpha = 0.05$ lies in the upper tail:
$$P(Z > z_{0.05}) = 0.05, \quad P(Z < z_{0.05}) = F(z_{0.05}) = 0.95 \quad \text{and} \quad z_\alpha = z_{0.05} = 1.645$$
From the given information we calculate the value of the test statistic as
$$T.S. = z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.40 - 0.30}{\sqrt{0.30 \cdot 0.70/400}} = 4.36$$
Since $4.36 > 1.645$, we reject $H_0$. We conclude that more than 30% of
voters are in favor of a law to sell liquor on Sundays.
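The arithmetic of this example can be checked with a short Python sketch. It is only an illustration: the function name `one_sample_proportion_z` is hypothetical, and the critical value is computed with the standard library's `statistics.NormalDist` rather than read from the table in the appendix.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(successes, n, p0):
    """Return (p_hat, z) for a large-sample test of H0: p = p0."""
    p_hat = successes / n
    std_error = sqrt(p0 * (1 - p0) / n)  # standard error computed under H0
    return p_hat, (p_hat - p0) / std_error


alpha = 0.05
p_hat, z = one_sample_proportion_z(160, 400, 0.30)

# Right-tailed test: all of alpha is placed in the upper tail.
z_crit = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)
reject_h0 = z > z_crit

print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, z_crit = {z_crit:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The p-value is far below 0.05, which agrees with the decision reached from the critical value.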
Exercises
6. A magazine claims that 25% of its readers are university students. A
random sample of 200 readers is taken and 42 of these readers are university
students. Use the $\alpha = 0.10$ level of significance to test the validity of the
magazine’s claim.
7. Suppose that in order to test the hypothesis that $p = 0.6$ against the
alternative that $p < 0.6$, we decide to obtain a sample of size 100 and reject
H 0 if we obtain fewer than 48 successes.
a) What is the approximate size of the Type I error?
b) If the value of p is really 0.5, what is the size of Type II error?
8. An educator wishes to test H 0 : p 0.3 against H 1 : p 0.3 , where
$p$ is the proportion of football players who graduate from university in four years.
a) State the test statistic and the rejection region having $\alpha = 0.05$.
b) If 19 out of a random sample of 48 players graduated in four years, what
does the test conclude? Also evaluate p-value.
9. The president of a company that produces national brand coffee claims
that 40% of the people prefer to buy national brand coffee. A random sample
of 700 people who buy coffee showed that 252 of them buy national brand
coffee.
a) Using $\alpha = 0.01$, can you conclude that the percentage of people who buy
national brand coffee is different from 40%?
b) Find the p-value for the test. Using this p-value, would you reject the null
hypothesis at $\alpha = 0.05$? What if $\alpha = 0.02$?
Answers
1.7. Tests of the variance of a normal distribution
In addition to the need for tests based on the sample mean and sample
proportion, there are a number of situations where we want to determine if
the population variance is a particular value or set of values. The basis for
developing particular tests lies in the fact that the random variable
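$$\chi^2_{n-1} = \frac{(n-1)s^2}{\sigma^2}$$
follows a Chi-square distribution with $(n-1)$ degrees of freedom.
The value of the test statistic $\chi^2_{n-1}$ is calculated as
$$T.S. = \chi^2_{n-1} = \frac{(n-1)s^2}{\sigma_0^2}$$
We are given a random sample of $n$ observations from a normally distributed
population with variance $\sigma^2$. If we observe the sample variance $s^2$, then the
following tests have significance level $\alpha$:
1. To test either null hypothesis
$H_0: \sigma^2 = \sigma_0^2$ or $H_0: \sigma^2 \le \sigma_0^2$ against the
alternative
$H_1: \sigma^2 > \sigma_0^2$
the decision rule is
Reject $H_0$ if $T.S. > \chi^2_{n-1,\alpha}$
2. To test either null hypothesis
$H_0: \sigma^2 = \sigma_0^2$ or $H_0: \sigma^2 \ge \sigma_0^2$ against the
alternative
$H_1: \sigma^2 < \sigma_0^2$
the decision rule is
Reject $H_0$ if $T.S. < \chi^2_{n-1,1-\alpha}$
3. To test the null hypothesis
$H_0: \sigma^2 = \sigma_0^2$ against the two sided alternative
$H_1: \sigma^2 \ne \sigma_0^2$
the decision rule is
Reject $H_0$ if $T.S. > \chi^2_{n-1,\alpha/2}$ or $T.S. < \chi^2_{n-1,1-\alpha/2}$
where $\chi^2_{n-1}$ is a Chi-square random variable and $P(\chi^2_{n-1} > \chi^2_{n-1,\alpha}) = \alpha$.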
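As an illustration of the two sided rule, the following Python sketch computes the test statistic and applies the decision rule. The function names are hypothetical. The numbers are those of Exercise 1 of this section ($n = 24$, $s^2 = 12$, $\sigma_0^2 = 10$), and the critical values 38.08 and 11.69 are copied from the Answers rather than computed, because the Python standard library has no Chi-square quantile function.

```python
def variance_test_statistic(n, sample_var, sigma0_sq):
    """Chi-square statistic (n - 1) * s^2 / sigma0^2 for a test of a variance."""
    return (n - 1) * sample_var / sigma0_sq


def reject_two_sided(ts, upper_crit, lower_crit):
    """Two sided rule: reject H0 if T.S. is above the upper or below the lower critical value."""
    return ts > upper_crit or ts < lower_crit


# Exercise 1 of this section: n = 24, s^2 = 12, H0: sigma^2 = 10, alpha = 0.05.
ts = variance_test_statistic(24, 12, 10)

# Table values for 23 degrees of freedom (taken from the Answers, not computed here).
upper_crit = 38.08   # chi-square_{23, 0.025}
lower_crit = 11.69   # chi-square_{23, 0.975}

reject_h0 = reject_two_sided(ts, upper_crit, lower_crit)
print(f"T.S. = {ts:.1f}; reject H0: {reject_h0}")
```

The statistic lies between the two critical values, so the null hypothesis is not rejected, in agreement with the answer listed for that exercise.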
Example:
Variance of yearly earnings of all state employees for all 40 states is
$49000 square dollars. A sample of 29 employees selected from state A
produced a variance of their earnings equal to $600 000 square dollars. Test
at 5% significance level if the variance of yearly earnings of state employees
in state A is different from $490 000 square dollars. Assume that the yearly
earnings of all state employees in state A have an (approximate) normal
distribution.
Solution:
From the given information,
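$n = 29$; $\alpha = 0.05$; $s^2 = 600000$
The null and alternative hypotheses are
$H_0: \sigma^2 = 49000$
$H_1: \sigma^2 \ne 49000$
We use the Chi-square distribution to make the test. The decision rule is
Reject $H_0$ if $T.S. > \chi^2_{n-1,\alpha/2}$ or $T.S. < \chi^2_{n-1,1-\alpha/2}$
$\alpha/2 = 0.025$; $1 - \alpha/2 = 0.975$; $n - 1 = 29 - 1 = 28$
Then from Table 3 of the appendix we obtain
$$\chi^2_{n-1,\alpha/2} = \chi^2_{28,0.025} = 44.461 \quad \text{and} \quad \chi^2_{n-1,1-\alpha/2} = \chi^2_{28,0.975} = 15.308$$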
Exercises
1. A sample of 24 observations selected from a normally distributed
population produced a sample variance of 12.
a) Write the null hypothesis and alternative hypothesis, and decision rule to
test if the population variance is different from 10.
b) Using $\alpha = 0.05$, find the critical values of $\chi^2_{n-1}$. Show the rejection and
nonrejection regions on a Chi-square distribution curve.
c) Using the 5% significance level, will you reject the null hypothesis stated
in part a)?
2. A sample of 25 observations selected from a normally distributed
population produced a sample variance of 18. Using the 2.5% significance
level, test the hypothesis that the population variance is less than 25.
3. Usually people do not like waiting in line for service for a long time.
A bank management does not want the variance of the waiting time for her
customers to be higher than 4.0 square minutes. A random sample of 25
customers taken from this bank gave the variance of the waiting times equal
to 7.9 square minutes. Test at 1% significance level if the variance of the
waiting time for all customers at this bank is higher than 4.0 square minutes.
Assume that the waiting time for all customers is normally distributed.
4. Test H 0 : 10 against H 1 : 10 with 0.05 in each case
a) $n = 25$; $\sum_{i=1}^{25} (x_i - \bar{x})^2 = 4016$
b) $n = 15$; $s = 12$
5. A sample of seven observations taken from a population produced the
following data
10; 8; 13; 15; 6; 8; 13
Assuming that the population from which this sample is selected is normally
distributed, test at the 2.5% significance level if the population variance is
different from 10.
6. A drug manufacturer requires that the variance for a chemical contained in
the bottles of certain type of drug should not exceed 0.03 square grams. A
sample of 25 such bottles gave the variance for this chemical as 0.06 square
grams. Test at the 1% significance level if the variance of this chemical in all
such bottles exceeds 0.03 square grams. Assume that the amount of this
chemical in all such bottles is (approximately) normally distributed. Find and
interpret p-value of this test.
7. A random sample of ten students was asked, in hours, for time they spent
studying in the week before final exams. The data are as follows:
28; 57; 42; 35; 61; 39; 55; 46; 49; 38
Assuming that the population distribution is normal, test at 5% significance
level against two sided alternative the null hypothesis that the population
standard deviation is 10 hours.
8. A company claims that its employees earn a mean of at least $40 000 in a
year and that the population standard deviation is no more than $6 000. Earnings of a
random sample of nine employees of this company produced
$$\sum_{i=1}^{9} x_i = 333 \quad \text{and} \quad \sum_{i=1}^{9} (x_i - \bar{x})^2 = 312$$
Answers
1. b) reject $H_0$ if $\chi^2_{23} > 38.08$ or $\chi^2_{23} < 11.69$; c) $T.S. = \chi^2_{23} = 27.6$; do not
reject $H_0$; 2. $T.S. = \chi^2_{24} = 17.280$; do not reject $H_0$; 3. $T.S. = \chi^2_{24} = 47.400$;
reject H 0 ; 4. a) T.S. 40.16 ; reject H 0 ; b) T.S. 20.16 ; accept H 0 ;
5. T.S. 6.571 ; do not reject H 0 ; 6. T.S. 48.00 ; reject H 0 ; reject H 0 for
0.0005 ; 7. T .S . 2 9.999 ; accept H 0 ; 8. T .S. 8.67 ; accept H 0 ;
9. a) false; b) true; c) false; d) false; f) true; g) false.
1.8. Tests for the difference between two population means
1.8.1. Tests based on paired samples
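Let
$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n} \quad \text{and} \quad s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} d_i^2 - n(\bar{d})^2}{n-1}}$$
denote the observed sample mean and standard deviation for the $n$
differences $d_i = x_i - y_i$. Let us denote the difference between the two population
means by $D_0 = \mu_x - \mu_y$. In this case the test statistic is calculated as
$$T.S. = \frac{\bar{d} - D_0}{s_d/\sqrt{n}}$$
If the population of differences is normally distributed, then the following tests
have significance level $\alpha$:
1. To test either null hypothesis
$H_0: \mu_x - \mu_y = D_0$ or $H_0: \mu_x - \mu_y \le D_0$
against the alternative
$H_1: \mu_x - \mu_y > D_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n-1,\alpha}$
2. To test either null hypothesis
$H_0: \mu_x - \mu_y = D_0$ or $H_0: \mu_x - \mu_y \ge D_0$
against the alternative
$H_1: \mu_x - \mu_y < D_0$
the decision rule is
Reject $H_0$ if $T.S. < -t_{n-1,\alpha}$
3. To test the null hypothesis
$H_0: \mu_x - \mu_y = D_0$ against the two sided alternative
$H_1: \mu_x - \mu_y \ne D_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n-1,\alpha/2}$ or $T.S. < -t_{n-1,\alpha/2}$
Here, $t_{n-1,\alpha}$ is the number for which
$$P(t_{n-1} > t_{n-1,\alpha}) = \alpha$$
where the random variable $t_{n-1}$ follows a Student's t distribution with
$(n-1)$ degrees of freedom.
Remark: When we want to test the null hypothesis that the two population
means are equal, we set $D_0 = 0$.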
Example:
A medical researcher wishes to determine if a pill has the undesirable side
effect of reducing the blood pressure of the user. The study involves
recording the initial blood pressures of 7 college age adults. After they use
the pill regularly for three months, their blood pressures are again recorded.
The researcher wishes to draw inferences about the effect of the pill on blood
pressure from the information given in table
Before x i 64 71 68 66 73 62 70
After y i 60 66 66 69 63 57 62
Do the data substantiate the claim that use of the pill reduces the blood
pressure? Use $\alpha = 0.01$. Assume that the population of paired differences
has a normal distribution.
Solution:
Let $d$ be the difference between the blood pressures before and after using the pill:
$d$ = before - after = $x_i - y_i$
The necessary calculations are shown in the following table
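Before   After   Difference $d$   $d^2$
64       60        4               16
71       66        5               25
68       66        2                4
66       69       -3                9
73       63       10              100
62       57        5               25
70       62        8               64
                 $\sum d = 31$    $\sum d^2 = 243$
The values of $\bar{d}$ and $s_d$ are calculated as follows:
$$\bar{d} = \frac{\sum d}{n} = \frac{31}{7} = 4.43$$
$$s_d = \sqrt{\frac{1}{n-1}\left(\sum d^2 - n(\bar{d})^2\right)} = \sqrt{\frac{1}{6}\left(243 - 7 \cdot 4.43^2\right)} = 4.198.$$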
Let $\mu_x$ be the mean blood pressure for all adults before and $\mu_y$ the mean after using
the pill.
The null and alternative hypotheses are
$H_0: \mu_x - \mu_y = 0$ (no difference)
against
$H_1: \mu_x - \mu_y > 0$ (mean decreases)
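The quantities in this example can be reproduced with the following Python sketch, which computes $\bar{d}$, $s_d$ and the paired test statistic from the table of blood pressures. One assumption is labeled in the code: the critical value $t_{6,0.01} = 3.143$ comes from a standard Student's t table and is not quoted in the text above.

```python
from math import sqrt
from statistics import mean, stdev

before = [64, 71, 68, 66, 73, 62, 70]
after = [60, 66, 66, 69, 63, 57, 62]

# Paired differences d = before - after.
d = [x - y for x, y in zip(before, after)]
n = len(d)

d_bar = mean(d)   # sample mean of the differences
s_d = stdev(d)    # sample standard deviation (divisor n - 1)

d0 = 0            # hypothesized difference under H0
t_stat = (d_bar - d0) / (s_d / sqrt(n))

# Assumption: one-tailed critical value for 6 degrees of freedom at alpha = 0.01,
# read from a standard t table.
t_crit = 3.143
reject_h0 = t_stat > t_crit

print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.3f}, T.S. = {t_stat:.2f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Under that assumed critical value the statistic does not reach the rejection region of the right tailed test.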
Exercises
Before x i 90 86 72 65 44 52 46 38 43
After y i 85 87 70 62 44 53 42 35 46
Before x i 81 75 89 91 65 70 90 69
After y i 97 72 93 110 78 69 115 75
Using the 5% significance level, can you conclude that attending this course
increases the writing speed of secretaries?
Assume that the population of paired differences is (approximately)
normally distributed.
4. A random sample of nine employees was selected to test for the
effectiveness of hypnosis on their job performance. The following table
gives the job performance ratings (on a scale of 1 to 4, with 1 being the
lowest and 4 being the highest) before and after these employees tried
hypnosis.
Answers
Let us consider the case where we have independent random samples from
two normally distributed populations. The first population has mean $\mu_x$ and
variance $\sigma_x^2$ and we obtain a random sample of size $n_x$. The second
population has mean $\mu_y$ and variance $\sigma_y^2$ and we obtain a random sample of
size $n_y$.
We know that if the sample means are denoted $\bar{x}$ and $\bar{y}$, then the random
variable
$$Z = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}}$$
has a standard normal distribution. If the population variances are known,
tests for the difference between the population means can be based on this
result. The value of the test statistic $z$ for $(\bar{x} - \bar{y})$ is computed as
$$T.S. = z = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}}$$
where $\mu_x - \mu_y$ takes its hypothesized value $D_0$, and the following tests have a significance level $\alpha$:
1. To test either null hypothesis
$H_0: \mu_x - \mu_y = D_0$ or $H_0: \mu_x - \mu_y \le D_0$
against the alternative
$H_1: \mu_x - \mu_y > D_0$
the decision rule is
Reject $H_0$ if $T.S. > z_\alpha$
2. To test either null hypothesis
$H_0: \mu_x - \mu_y = D_0$ or $H_0: \mu_x - \mu_y \ge D_0$
against the alternative
$H_1: \mu_x - \mu_y < D_0$
the decision rule is
Reject $H_0$ if $T.S. < -z_\alpha$
3. To test the null hypothesis
$H_0: \mu_x - \mu_y = D_0$ against the two sided alternative
$H_1: \mu_x - \mu_y \ne D_0$
the decision rule is
Reject $H_0$ if $T.S. > z_{\alpha/2}$ or $T.S. < -z_{\alpha/2}$
Remark: If the sample sizes are large ($n_x \ge 30$; $n_y \ge 30$) then a good
approximation at significance level $\alpha$ can be made if the population
variances $\sigma_x^2$ and $\sigma_y^2$ are replaced by the sample variances $s_x^2$ and $s_y^2$.
In addition, the central limit theorem leads to good approximations even if
the populations are not normally distributed.
Example:
According to the Bureau of Labor Statistics, last year university instructors
earned an average $440 per month and college instructors earned an average
of $420 per month. Assume that these mean earnings have been calculated for
samples of 400 and 600 instructors taken from the two populations,
respectively. Further assume that the standard deviations of monthly earnings
of the two populations are $50 and $63, respectively. Test at 1% significance
level if the mean monthly earnings of the two groups of the instructors are
different.
Solution:
From the information given above,
$n_x = 400$; $\bar{x} = 440$; $\sigma_x = 50$;
$n_y = 600$; $\bar{y} = 420$; $\sigma_y = 63$;
where the subscript $x$ refers to university instructors and $y$ to college
instructors. Let
$\mu_x$ = mean monthly earnings of all university instructors
$\mu_y$ = mean monthly earnings of all college instructors.
We are to test if the two population means are different. The null and
alternative hypotheses are
$H_0: \mu_x - \mu_y = 0$ (the monthly earnings are not different)
$H_1: \mu_x - \mu_y \ne 0$ (the monthly earnings are different).
The decision rule is
Reject $H_0$ if $T.S. > z_{\alpha/2}$ or $T.S. < -z_{\alpha/2}$
First of all we find the value of $z_{\alpha/2}$. Since $\alpha/2 = 0.005$, the value of $z_{\alpha/2}$ is
(approximately) 2.58 and $-z_{\alpha/2} = -2.58$.
The value of the test statistic $T.S. = z$ is computed as follows:
$$T.S. = z = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}} = \frac{(440 - 420) - (0)}{\sqrt{\dfrac{50^2}{400} + \dfrac{63^2}{600}}} = 5.57.$$
Since $5.57 > 2.58$, the value of the test statistic $T.S. = z = 5.57$ falls in the rejection
region, and we reject the null hypothesis $H_0$. Therefore, we conclude that the
mean monthly earnings of the two groups of instructors are different.
Note that we can not say for sure that two means are different. All we can
say is that the evidence from the two samples is very strong that the
corresponding population means are different.
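The calculation in this example can be sketched in Python as follows. The function name `two_sample_z` is hypothetical, and the critical value and p-value are obtained from the standard library's `statistics.NormalDist` instead of the appendix table, so the critical value appears as 2.576 rather than the rounded 2.58.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x_bar, y_bar, sd_x, sd_y, n_x, n_y, d0=0.0):
    """z statistic for H0: mu_x - mu_y = d0 with known (or large-sample) variances."""
    std_error = sqrt(sd_x ** 2 / n_x + sd_y ** 2 / n_y)
    return ((x_bar - y_bar) - d0) / std_error


alpha = 0.01
z = two_sample_z(440, 420, 50, 63, 400, 600)

# Two sided test: alpha is split equally between the two tails.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_h0 = abs(z) > z_crit

print(f"z = {z:.2f}, z_crit = {z_crit:.3f}, p-value = {p_value:.2e}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The same function covers the one sided cases if the comparison `abs(z) > z_crit` is replaced by `z > z_alpha` or `z < -z_alpha`.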
Exercises
4. The management at the bank A claims that the mean waiting time for all
customers at its branches is less than that at the bank B, which is main
competitor. They took a sample of 200 customers from the bank A and found
that they waited an average of 4.60 minutes with a standard deviation of 1.2
minutes before being served. Another sample of 300 customers taken from
the bank B showed that these customers waited an average of 4.85 minutes
with a standard deviation of 1.5 minutes before being served.
a) Test at the 2.5% significance level if the claim of the management of the
bank A is true.
b) Calculate the p-value. Based on this p-value, would you reject the null
hypothesis if $\alpha = 0.01$? What if $\alpha = 0.05$?
5. A production line is designed on the assumption that the difference in
mean assembly times for two operations is 5 minutes. Independent tests for
the two assembly operations show the following results:
Operation A                  Operation B
$n_1 = 100$                  $n_2 = 50$
$\bar{x} = 14.8$ minutes     $\bar{y} = 10.4$ minutes
$s_x = 0.8$ minutes          $s_y = 0.6$ minutes
For $\alpha = 0.02$, test the hypothesis that the difference between the mean
assembly times is 5 minutes.
6. An investigation was carried out to determine if women employees are as
well paid as their male counterparts. Random samples of 75 males and 64
females are selected. Their mean salaries were 45 530 and 44 620, standard
deviations were 780 and 750, correspondingly. If you were to test the null
hypothesis that the mean salaries are equal against the two sided alternative,
what would be the conclusion of your test with $\alpha = 0.05$?
7. For a random sample of 125 state companies, the mean number of job
changes was 1.91 and the standard deviation was 1.32. For a random sample
of 86 private companies, the mean number of job changes was 0.21 and the
standard deviation was 0.53. Test the null hypothesis that the population
means are equal against the alternative that the mean number of job changes
is higher in state companies than for private companies.
Answers
1. $T.S. = z = 4.56$; reject $H_0$; 2. a) $T.S. = 28.27$; reject $H_0$; b) do not reject $H_0$;
3. $T.S. = 4.30$; reject $H_0$; 4. a) $T.S. = -2.06$; reject $H_0$; b) p-value = 0.0197;
do not reject $H_0$ at $\alpha = 0.01$; reject $H_0$ at $\alpha = 0.05$; 5. $T.S. = -5.15$;
reject $H_0$; 6. $T.S. = 7$; reject $H_0$; 7. $T.S. = 13$; reject $H_0$ at any level.
$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$$
The value of the test statistic $t$ for $(\bar{x} - \bar{y})$ is computed as
$$T.S. = t = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{s_p^2}{n_x} + \dfrac{s_p^2}{n_y}}}$$
and the following tests have a significance level $\alpha$:
1. To test either null hypothesis
$H_0: \mu_x - \mu_y = D_0$ or $H_0: \mu_x - \mu_y \le D_0$
against the alternative
$H_1: \mu_x - \mu_y > D_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n_x + n_y - 2,\alpha}$
2. To test either null hypothesis
$H_0: \mu_x - \mu_y = D_0$ or $H_0: \mu_x - \mu_y \ge D_0$
against the alternative
$H_1: \mu_x - \mu_y < D_0$
the decision rule is
Reject $H_0$ if $T.S. < -t_{n_x + n_y - 2,\alpha}$
3. To test the null hypothesis
$H_0: \mu_x - \mu_y = D_0$ against the two sided alternative
$H_1: \mu_x - \mu_y \ne D_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n_x + n_y - 2,\alpha/2}$ or $T.S. < -t_{n_x + n_y - 2,\alpha/2}$
Here, $t_{n_x + n_y - 2,\alpha}$ is the number for which
$$P(t_{n_x + n_y - 2} > t_{n_x + n_y - 2,\alpha}) = \alpha$$
where the random variable $t_{n_x + n_y - 2}$ follows a Student's t distribution with
$(n_x + n_y - 2)$ degrees of freedom.
Example:
A sample of 12 cans of Brand A diet soda gave a mean number of calories of
22 per can with a standard deviation of 2 calories. Another sample of 15 cans
of Brand B diet soda gave the mean number of calories of 24 per can with a
standard deviation of 3 calories. At the 1% significance level, are the mean
number of calories per can different for these two brands of diet soda?
Assume that the calories per can of diet soda are normally distributed for
each of the two brands and that the variances for the two populations are
equal.
Solution:
Let $\mu_x$ and $\mu_y$ be the mean number of calories per can for diet soda of Brand
A and Brand B, respectively, and let $\bar{x}$ and $\bar{y}$ be the means of the respective
samples. From the given information,
$n_x = 12$; $\bar{x} = 22$; $s_x = 2$;
$n_y = 15$; $\bar{y} = 24$; $s_y = 3$
The significance level is $\alpha = 0.01$.
We are to test for the difference in the mean number of calories per can for
the two brands. The null and alternative hypotheses are
$H_0: \mu_x - \mu_y = 0$ (the mean numbers of calories are not different)
$H_1: \mu_x - \mu_y \ne 0$ (the mean numbers of calories are different)
The decision rule is
Reject $H_0$ if $T.S. > t_{n_x + n_y - 2,\alpha/2}$ or $T.S. < -t_{n_x + n_y - 2,\alpha/2}$
$t_{n_x + n_y - 2,\alpha/2} = t_{12 + 15 - 2,\,0.005} = t_{25,0.005} = 2.787$ and $-t_{25,0.005} = -2.787$.
The pooled estimate is
$$s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2} = \frac{(12 - 1) \cdot 2^2 + (15 - 1) \cdot 3^2}{12 + 15 - 2} = 6.8$$
The test statistic is then computed as
$$T.S. = t = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{\sqrt{\dfrac{s_p^2}{n_x} + \dfrac{s_p^2}{n_y}}} = \frac{(22 - 24) - (0)}{\sqrt{\dfrac{6.8}{12} + \dfrac{6.8}{15}}} = -1.98$$
Because the value of the test statistic $T.S. = t = -1.98$ for $(\bar{x} - \bar{y})$ falls in the
nonrejection region (Fig.1.10), we fail to reject the null hypothesis.
Consequently, the samples do not provide sufficient evidence of a difference between the mean
number of calories per can for the two brands of diet soda. The difference in
$\bar{x}$ and $\bar{y}$ observed for the two samples may have occurred due to sampling error
only.
[Fig.1.10: Student's t distribution curve with rejection regions of area $\alpha/2 = 0.005$ in each tail, below -2.787 and above 2.787.]
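The pooled variance and the test statistic of this example can be verified with a short Python sketch. The function name `pooled_t` is hypothetical, and the critical value 2.787 is the table value $t_{25,0.005}$ quoted in the solution, not something the code computes.

```python
from math import sqrt


def pooled_t(x_bar, y_bar, s_x, s_y, n_x, n_y, d0=0.0):
    """Return (s_p^2, t) for two independent samples with equal population variances."""
    sp2 = ((n_x - 1) * s_x ** 2 + (n_y - 1) * s_y ** 2) / (n_x + n_y - 2)
    t = ((x_bar - y_bar) - d0) / sqrt(sp2 / n_x + sp2 / n_y)
    return sp2, t


sp2, t_stat = pooled_t(22, 24, 2, 3, 12, 15)

t_crit = 2.787                       # t_{25, 0.005}, taken from the solution above
reject_h0 = abs(t_stat) > t_crit     # two sided decision rule

print(f"s_p^2 = {sp2:.1f}, T.S. = {t_stat:.2f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Note that the statistic is negative because the Brand A sample mean is the smaller one; the two sided rule compares its absolute value with the critical value.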
Exercises
2. The following summary statistics are recorded for independent random
samples from two normally distributed populations with equal variances
Sample 1               Sample 2
$n_1 = 9$              $n_2 = 6$
$\bar{x} = 16.18$      $\bar{y} = 4.22$
$s_1 = 1.54$           $s_2 = 1.37$
Test the null hypothesis 1 2 10 against the alternative that
1 2 10 with 0.01 .
3. Salary surveys of marketing and management majors show the following
starting annual salary data
Marketing majors           Management majors
$n_1 = 14$                 $n_2 = 16$
$\bar{x}_1 = \$14800$      $\bar{x}_2 = \$14300$
$s_1 = \$1000$             $s_2 = \$1400$
Consider the test of the hypothesis that the mean annual salaries are the same
for both majors. For $\alpha = 0.05$, can you conclude that a difference exists in the
mean annual salary for the two majors?
4. A professor took two samples, one of 21 males and another of 15 females
from university students who were enrolled in business statistics at the same
university. He found that the mean score of male students in a mid-term
examination in statistics was 75.3 with a standard deviation of 6.4, and the
mean score of female students was 78.3 with a standard deviation of 7.3.
Assume that the scores of all male and all female students are normally
distributed with equal but unknown standard deviations.
Test at the 2.5% significance level if the mean score in business statistics for
all male and female students are the same against the alternative that male
students have lower score than that for all female students.
5. The management of a supermarket wanted to investigate if the male
customers spend less money on average, than the female customers. A
sample of 16 male customers who shopped at this supermarket showed that
they spent an average of $55 with a standard deviation of $12.50. Another
sample of 22 female customers who shopped at the supermarket showed that
they spent an average of $63 with a standard deviation of $14.5. Assume that
the amounts of money spent at this supermarket by all male and female
customers are normally distributed with equal but unknown population
variance. Test at the 5% significance level if the mean amount spent by all
male and female customers are the same against the alternative that male
customers at this supermarket spend less than that of female customers.
6. A bank has two branches. The quality department wanted to check if the
customers are equally satisfied with the service provided at these two
branches. Randomly selected customers asked to measure the satisfaction of
services (on scale of 1 to 11, 1 being the lowest and 11 being the highest).
A random sample of six customers from the branch A produced following
data:
9.50; 8.60; 8.59; 6.50; 4.79; 4.29
An independent random sample of six customers selected from the branch B
produced following data:
10.21; 9.66; 7.67; 5.12; 4.88; 3.12
Stating any assumptions you need to make, test against two sided alternative
the null hypothesis that the two populations mean satisfaction index for all
customers for the two branches are the same.
_ _ _
7. Given that n1 14 , x 22 , ( xi x) 2 30 , and n2 13 , y 18 ,
_
(y i y ) 2 24 . Test H 0 : 1 2 against H 1 : 1 2 with 0.05 .
8. A researcher wants to test the mean GPA (grade point averages) of all
male and all female university students. She took a random sample of 28
male students and 24 female students. She found that GPA’s of the two
groups to be 2.62 and 2.74, respectively, with the corresponding standard
deviations equal to 0.43 and 0.38. Test at the 5% significance level if the
mean GPA’s of the two populations are equal against two sided alternative.
Assume that the GPA’s of all male and female students are normally
distributed with equal but unknown standard deviations.
Answers
1. a) $T.S. = t = 3.514$; reject $H_0$; b) $T.S. = t = 3.514$; reject $H_0$;
2. $T.S. = t = 2.52$; $H_0$ is not rejected; 3. $T.S. = t = 1.11$; accept $H_0$;
4. $T.S. = -1.308$; accept $H_0$; 5. $T.S. = -1.778$; reject $H_0$; 6. We assume that
the values are normally distributed with equal variance; $T.S. = 0.183$; fail to
reject $H_0$ at the 20% significance level; 7. $T.S. = 7.071$; reject $H_0$;
8. $T.S. = -1.058$; accept $H_0$.
1.9. Tests for the difference between two population proportions
(Large samples)
Which of these formulas is used to calculate $p_0$ depends on whether the
values of $x_1$ and $y_1$ or the values of $\hat{p}_x$ and $\hat{p}_y$ are known.
Example:
A company is planning to buy a few machines. Company is considering two
types of machines, but will buy all of the same type. The company selects
one machine from each type and uses for a few days. A sample of 900 items
produced on machine A showed that 55 of them were defective. A sample of
700 items produced on machine B showed that 41 of them were defective.
Testing at 1% significance level, can we conclude based on the information
from these samples that the proportions of the defective items produced on
the two machines are different?
Solution:
Let $p_x$ be the proportion of defective items among all items produced on machine A,
and $p_y$ be the proportion of defective items among all items produced on machine B.
Let $\hat{p}_x$ and $\hat{p}_y$ be the corresponding sample proportions. Let $x_1$ and $y_1$ be the
numbers of defective items in the two samples, respectively.
Machine A: $n_x = 900$; $x_1 = 55$
Machine B: $n_y = 700$; $y_1 = 41$
The two sample proportions are calculated as follows:
$$\hat{p}_x = \frac{x_1}{n_x} = \frac{55}{900} = 0.0611; \quad \hat{p}_y = \frac{y_1}{n_y} = \frac{41}{700} = 0.0586$$
The null and alternative hypotheses are
$H_0: p_x - p_y = 0$ (the two proportions are equal)
$H_1: p_x - p_y \ne 0$ (the two proportions are different)
The decision rule is
Reject $H_0$ if $T.S. > z_{\alpha/2}$ or $T.S. < -z_{\alpha/2}$
Let us check if the sample sizes are large:
$$n_x \hat{p}_x \hat{q}_x > 9: \quad 900 \cdot 0.0611 \cdot 0.9389 = 51.63 > 9$$
$$n_y \hat{p}_y \hat{q}_y > 9: \quad 700 \cdot 0.0586 \cdot 0.9414 = 38.62 > 9$$
Since the samples are large and independent, we apply the normal
distribution to make the test.
The pooled sample proportion is
$$p_0 = \frac{x_1 + y_1}{n_x + n_y} = \frac{55 + 41}{900 + 700} = 0.06$$
The value of the test statistic is
$$T.S. = z = \frac{(\hat{p}_x - \hat{p}_y) - (p_x - p_y)}{\sqrt{\dfrac{p_0(1 - p_0)}{n_x} + \dfrac{p_0(1 - p_0)}{n_y}}} = \frac{0.0611 - 0.0586}{\sqrt{\dfrac{0.06 \cdot 0.94}{900} + \dfrac{0.06 \cdot 0.94}{700}}} = 0.2089.$$
Let us find the value of $z_{\alpha/2}$.
$\alpha = 0.01$; $\alpha/2 = 0.005$
$F_Z(z_{\alpha/2}) = F_Z(z_{0.005}) = 0.995$
$z_{0.005} = 2.58$ and $-z_{0.005} = -2.58$
The value of the test statistic $T.S. = z = 0.2089$ falls in the nonrejection
region. Consequently, we fail to reject the null hypothesis. The samples do not
provide sufficient evidence that the proportions of defective items produced by the two machines
are different.
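A Python sketch of the same calculation is given below; the function name `two_proportion_z` is hypothetical. Because the code keeps the sample proportions unrounded, it returns a statistic of about 0.21 rather than the 0.2089 obtained above from proportions rounded to four decimals. The conclusion is unaffected.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n_x, y1, n_y):
    """Return (p0, z) for H0: p_x - p_y = 0 using the pooled sample proportion."""
    p_hat_x = x1 / n_x
    p_hat_y = y1 / n_y
    p0 = (x1 + y1) / (n_x + n_y)  # pooled sample proportion
    std_error = sqrt(p0 * (1 - p0) / n_x + p0 * (1 - p0) / n_y)
    return p0, (p_hat_x - p_hat_y) / std_error


alpha = 0.01
p0, z = two_proportion_z(55, 900, 41, 700)

z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two sided critical value
reject_h0 = abs(z) > z_crit

print(f"p0 = {p0:.2f}, z = {z:.3f}, z_crit = {z_crit:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```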
Exercises
voters who are in favor of higher taxes on wealthy people is not different
from that of female voters against two-sided alternative.
6. A medical researcher investigates if the smoking results in wrinkled skin
around the eyes. By observing 150 smokers and 250 nonsmokers, the
researcher finds that 95 of the smokers and 103 of the nonsmokers have
prominent wrinkles around their eyes. Do these data substantiate the belief
that prominent wrinkles around eyes are more prevalent among smokers than
nonsmokers? Answer by calculating p-value.
7. In a comparative study of two new drugs, A and B, 120 patients treated
with drug A and 150 patients with drug B, and the following results were
obtained
Drug A Drug B
Cured: 52 88
Not cured: 68 62
Total: 120 150
Do these results support the statement that these two drugs have the same
effect, against the alternative of a higher cure rate with drug A? Test at
$\alpha = 0.05$.
8. According to a 2001 survey, 48% of managers “would choose the same
career if they were starting over again”. In a similar survey conducted 10
years ago, 60% of managers said that they “would choose the same career if
they were starting over again”. Assume that the 2001 survey is based on a
sample of 800 managers and the one done 10 years ago included 600
managers. Test at the 5% significance level if the proportion of all managers
who “would choose the same career if they were starting over again” has not
changed against the alternative that it decreased during the past 10 years.
Answers
Chapter 2
Some nonparametric tests
2.1. Introduction
2.2. 1. The Sign test for paired or matched samples
Suppose that paired random samples are taken from a population and the
differences equal to 0 are ignored. Calculate the difference for each pair and
record the sign of this difference. The Sign test is used to test:
$H_0: p = 0.5$
where $p$ is the proportion of nonzero observations in the population that are
positive. The test statistic $S$ for the Sign test for paired samples is simply
$S$ = the number of pairs with a positive difference
and $S$ has a binomial distribution with $p = 0.5$ and $n$ = the number of
nonzero differences.
After determining the null and alternative hypotheses and finding a test
statistic, the next step is to determine the p-value and to draw conclusions
based on a decision rule.
The p-value for a Sign test is found using the binomial distribution with
$n$ = the number of nonzero differences, $S$ = the number of pairs with positive
differences and $p = 0.5$.
1. For a right tailed test,
$H_1: p > 0.5$, p-value = $P(x \ge S)$
2. For a left tailed test,
$H_1: p < 0.5$, p-value = $P(x \le S)$
3. For a two tailed test,
$H_1: p \ne 0.5$, p-value = 2 × (one tailed p-value)
Example:
In the study 8 individuals were asked to rate on a scale from 1 to 10 the test
of products of two brands: Brand A and Brand B. The scores of the test
comparison are shown in the following table
N Brand A Brand B
1 5 7
2 3 10
3 4 8
4 9 6
5 8 8
6 5 7
7 6 5
8 9 6
Do the data indicate an overall tendency to prefer the Brand B to the
Brand A?
Solution:
First of all, let us calculate differences
$$p\text{-value} = P(x \le 3) = P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3)$$
$$= C_7^0 (0.5)^0 (0.5)^7 + C_7^1 (0.5)^1 (0.5)^6 + C_7^2 (0.5)^2 (0.5)^5 + C_7^3 (0.5)^3 (0.5)^4$$
$$= 0.0078 + 0.0547 + 0.1641 + 0.2734 = 0.5000$$
For this example the p-value is 50%. We are unable to reject the null hypothesis,
and we conclude that the data are not sufficient to suggest that the population has a
preference for Brand B. Since the p-value is the smallest significance level at
which the null hypothesis can be rejected, in this example the null
hypothesis can be rejected only at a significance level of 50% or higher. It is unlikely that one would be
willing to accept such a high significance level. Again, we conclude that the
data do not provide statistically significant evidence that Brand B is preferred by the
majority.
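The p-value of this example can be reproduced with a few lines of Python. The sketch assumes, as the binomial calculation above implies, that each difference is taken as Brand A minus Brand B, so that a preference for Brand B corresponds to the left tailed alternative $p < 0.5$.

```python
from math import comb

brand_a = [5, 3, 4, 9, 8, 5, 6, 9]
brand_b = [7, 10, 8, 6, 8, 7, 5, 6]

# Differences A - B; pairs with a zero difference are ignored.
diffs = [a - b for a, b in zip(brand_a, brand_b)]
nonzero = [d for d in diffs if d != 0]

n = len(nonzero)                      # number of nonzero differences
s = sum(1 for d in nonzero if d > 0)  # number of positive differences

# Left tailed p-value: P(x <= S) for a binomial(n, 0.5) random variable.
p_value = sum(comb(n, k) * 0.5 ** n for k in range(s + 1))

print(f"n = {n}, S = {s}, p-value = {p_value:.4f}")
```

For a right tailed alternative the sum would run over `range(s, n + 1)` instead.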
Reject H 0 if T .S. z
Hence the null hypothesis can be rejected at all significance levels greater
than 10.32%.
Exercises
Use the sign test to test the null hypothesis that there is no overall
preference for one method over the other.
4. A social researcher interviews 25 newly married couples. Each husband
and wife are independently asked the question: “How many children would
you like to have?” The following data are obtained
Answer of Answer of
Couple Husband Wife Couple Husband Wife
1 3 2 14 2 1
2 2 2 15 3 2
3 2 1 16 2 2
4 2 3 17 0 0
5 5 1 18 1 2
6 0 1 19 2 1
7 0 2 20 3 2
8 1 3 21 4 3
9 2 2 22 3 1
10 3 1 23 0 0
11 4 2 24 2 3
12 1 2 25 2 2
13 3 3
Use the Sign test with $\alpha = 0.05$ to test, against a two sided alternative, the null
hypothesis that, for the population of families, there is no difference in opinions
between husbands and wives.
5. A random sample of 80 sale managers was asked to predict whether next
year’s sale would be higher than, lower than, or about the same as in the
current year. The results are shown below. Test the null hypothesis that the
opinion of managers is evenly divided on the question against a two sided
alternative.
Prediction Number
Higher 37
Lower 28
About the same 15
6. Of a random sample of 120 university students, 67 expected to achieve a
better GPA than last year, 48 expected a lower GPA than last year, and
5 expected about the same GPA. Do these data present strong evidence that,
for population of students they are divided evenly on the expectations,
against the alternative that more expect a lower GPA compared with last
year?
7. Of a random sample of 150 university instructors, 62 believed that
student’s skills in solving problems increased over the last decade, 54
believed these skills had deteriorated and 4 saw no change. Evaluate the
strength of the sample evidence suggesting that, for all university instructors,
teachers are divided evenly on the issue against the alternative that more
teachers believe that student’s skills in solving problems have improved.
8. In a coffee taste test 48 individuals stated a preference for one of two well-
known brands. Results showed 28 favoring brand A, 16 favoring brand B,
and 4 undecided. Use the sign test with $\alpha = 0.10$ to test the null hypothesis
that there is no difference in the preferences for the two brands of coffee
against a two sided alternative.
Answers
2.3. The Wilcoxon signed rank test
One disadvantage of the sign test is that it takes into account only a
very limited amount of information, namely the signs of the differences. The
Wilcoxon signed rank test provides a method to use information about the
magnitude of the differences between matched pairs. It is still a distribution
free test. Like many nonparametric tests, it is based on ranks.
Table 2.1
Worker   Method I   Method II   Difference   Absolute     Rank (+)   Rank (-)
                                             difference
1        10.2       9.5          0.7         0.7          8
2        9.6        9.8         -0.2         0.2                     2
3        9.2        8.8          0.4         0.4          3.5
4        10.6       10.1         0.5         0.5          5.5
5        9.9        10.3        -0.4         0.4                     3.5
6        10.2       9.3          0.9         0.9          10
7        10.6       10.5         0.1         0.1          1
8        10.0       10.0         0           0            --         --
9        11.2       10.6         0.6         0.6          7
10       10.7       10.2         0.5         0.5          5.5
11       10.6       9.8          0.8         0.8          9
Sum of ranks                                              49.5       5.5
To demonstrate the use of the Wilcoxon signed rank test, let us consider a
manufacturing firm that is attempting to determine whether a difference exists
between two production methods. A sample of 11 workers was selected, and each
worker completed the production task using each of the two production
methods. Each worker in the sample provides a pair of observations, as
shown in Table 2.1. Table 2.1 also provides the difference in the completion
times. A positive value indicates that Method I requires more time, and a
negative value indicates that Method II requires more time. The statistical
question is whether or not the data indicate that the methods are significantly
different in terms of completion times. Thus the null and alternative
hypotheses can be written as
\( H_0 \): The two populations of task completion times are identical
\( H_1 \): The two populations of task completion times are not identical
As with the sign test, we ignore any difference of “0”, so the sample size in
the example above is reduced to \( n = 10 \). The nonzero absolute differences are
then ranked in ascending order of magnitude. That is, the smallest absolute
value, 0.1, is given a rank of “1”. If two or more values are equal, they are
assigned the average of the next available ranks. In the example above, the
absolute difference 0.4 occurs twice. The rank assigned to each is therefore
the average of ranks 3 and 4, that is 3.5. The next absolute value, 0.5, also occurs
twice. The rank assigned to each is therefore the average of ranks 5 and 6,
that is 5.5. The next absolute value is assigned rank 7, and so on.
The ranks for positive and negative differences are summed separately. The
smaller of these sums is the Wilcoxon Signed Rank Statistic T.S.
Hence T.S.=5.5.
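The ranking procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `signed_rank_sums` and the rounding to one decimal place (used so that equal differences compare equal despite floating-point noise) are assumptions made here, not part of the text.

```python
# Completion times from Table 2.1 (workers 1..11).
method_1 = [10.2, 9.6, 9.2, 10.6, 9.9, 10.2, 10.6, 10.0, 11.2, 10.7, 10.6]
method_2 = [9.5, 9.8, 8.8, 10.1, 10.3, 9.3, 10.5, 10.0, 10.6, 10.2, 9.8]


def signed_rank_sums(first, second, decimals=1):
    """Return (n, T_plus, T_minus) for paired samples, averaging tied ranks."""
    # Round so that equal differences compare equal in floating point.
    diffs = [round(a - b, decimals) for a, b in zip(first, second)]
    nonzero = [d for d in diffs if d != 0]      # zero differences are ignored
    ordered = sorted(abs(d) for d in nonzero)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2   # mean of ranks i+1, ..., j
        i = j
    t_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    return len(nonzero), t_plus, t_minus


n, t_plus, t_minus = signed_rank_sums(method_1, method_2)
t_stat = min(t_plus, t_minus)                   # the Wilcoxon statistic T.S.
print(n, t_plus, t_minus, t_stat)
```

For the data of Table 2.1 this reproduces n = 10, rank sums 49.5 and 5.5, and T.S. = 5.5.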
We will now suppose that the population distribution of the paired
differences is symmetric. The null hypothesis to be tested is that the center of
this distribution is 0. In example above, we are assuming that differences in
the task completion times have a symmetric distribution, and we want to test
whether that distribution is centered on 0-that is no difference between task
completion times.
Cutoff points for the distribution of this random variable are given in the
Appendix (Table 4) for tests against a one-sided alternative, in which the
population distribution of the paired differences is specified either to be
centered on some number bigger than 0 or to be centered on some number
less than 0. For sample size \( n \), the table shows, for selected probabilities \( \alpha \),
the number \( T_\alpha \) such that \( P(T \le T_\alpha) = \alpha \). In other words, the null hypothesis
is rejected if T.S. is less than or equal to the corresponding number in
Table 4.
In the example above, T.S. = 5.5. For \( n = 10 \) we find that the null hypothesis will
be rejected for any significance level greater than 0.005.
Steps in the Wilcoxon Signed Rank test for paired samples
When the number \( n \) of nonzero differences in the sample is large (\( n > 20 \)), the
normal distribution provides a good approximation to the distribution of the
Wilcoxon statistic \( T \) under the null hypothesis that the population differences
are centered on 0.
Let \( T \) denote the smaller of the rank sums.
Under the null hypothesis that the population differences are centered on 0,
the Wilcoxon signed rank statistic has mean and variance given by
\[ E(T) = \mu_T = \frac{n(n+1)}{4} \]
and
\[ Var(T) = \sigma_T^2 = \frac{n(n+1)(2n+1)}{24} \]
For large \( n \), the distribution of the random variable \( Z \) is approximately
standard normal, where
\[ Z = \frac{T - \mu_T}{\sigma_T} \]
If the number of nonzero differences is large and \( T \) is the observed value of
the Wilcoxon signed rank test statistic, then the following tests have significance
level \( \alpha \):
1. If the alternative hypothesis is one-sided, reject the null hypothesis if
\[ \frac{T - \mu_T}{\sigma_T} < -z_\alpha \]
2. If the alternative hypothesis is two-sided, reject the null hypothesis if
\[ \frac{T - \mu_T}{\sigma_T} < -z_{\alpha/2} \]
Example:
A random sample of 38 students who had just completed courses in statistics
and accounting was asked to rate each in terms of level of interest, on a scale
from one (very uninteresting) to ten (very interesting). The 38 differences in
the pairs of ratings were calculated and the absolute differences ranked. The
smaller of the rank sums, which was for those finding accounting the more
interesting, was 278. Test at 5 % significance level the null hypothesis that
the population of students would rate these courses equally against the
alternative that the statistics course is viewed as the more interesting.
Also find the p-value.
Solution:
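From the given information
\( n = 38 \); \( T = 278 \)
The mean and variance of the Wilcoxon statistic are
\[ \mu_T = \frac{n(n+1)}{4} = \frac{38\,(38+1)}{4} = 370.5 \]
\[ \sigma_T^2 = \frac{n(n+1)(2n+1)}{24} = \frac{38 \cdot 39 \cdot 77}{24} = 4754.75 \]
So the standard deviation is
\( \sigma_T = 68.95 \)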
According to the condition, the null and alternative hypothesis can be written
as
H 0 : both courses rated equally interesting
H 1 : statistics course rated more interesting
If \( T \) is the observed value of the test statistic, the null hypothesis is rejected
against the one-sided alternative if
\[ \frac{T - \mu_T}{\sigma_T} < -z_\alpha \]
Here the observed value is \( T = 278 \), and the value of the test statistic is
\[ \frac{T - \mu_T}{\sigma_T} = \frac{278 - 370.5}{68.95} = -1.34 \]
\( \alpha = 0.05 \); \( F_z(z_{0.05}) = 0.95 \); \( z_{0.05} = 1.65 \); \( -z_{0.05} = -1.65 \)
Since -1.34 is not less than -1.65, we fail to reject \( H_0 \).
The value of \( \alpha \) corresponding to \( z = 1.34 \) is, from Table 1 of the
Appendix, \( (1 - 0.9099) = 0.0901 \). Then the null hypothesis can be rejected at
all significance levels greater than 9.01%. The data contain modest evidence
suggesting that the statistics course is more interesting.
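The arithmetic of this example can be checked with a short Python sketch. The helper name `wilcoxon_normal_approx` is hypothetical, and the p-value is taken from the standard library's `statistics.NormalDist` instead of the Appendix table, so it is computed from the unrounded z and may differ from the tabulated 0.0901 in the last digit.

```python
from math import sqrt
from statistics import NormalDist


def wilcoxon_normal_approx(n, t_smaller):
    """Large-sample approximation for the Wilcoxon signed rank statistic.

    n is the number of nonzero differences, t_smaller the smaller rank sum.
    Returns (mean, variance, z, one-sided p-value).
    """
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    z = (t_smaller - mean) / sqrt(variance)
    p_one_sided = NormalDist().cdf(z)   # lower-tail probability
    return mean, variance, z, p_one_sided


mean, variance, z, p_value = wilcoxon_normal_approx(38, 278)
print(mean, variance, round(z, 2), round(p_value, 3))
```

With n = 38 and T = 278 the sketch gives a mean of 370.5, a variance of 4754.75 and z of about -1.34, in line with the solution above.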
Exercises
3. Twelve customers were asked to estimate the selling price of two models
of refrigerators. The estimates of selling price provided by the customers are
shown below:
Customer Model A Model B
1 $650 $900
2 760 720
3 740 690
4 700 850
5 590 920
6 620 800
7 700 890
8 690 920
9 900 1000
10 500 690
11 610 700
12 720 700
Use these data and test at the 0.05 level of significance whether there is a
difference in the customers’ perception of the selling price of the two
models.
4. A certain brand of microwave oven was priced at 12 stores in two
different cities.
These data are presented below:
District A District B
18 500 16 700
16 000 20 500
12 000 23 000
20 000 17 500
19 000 22000
17 000 21 000
16 500 21 500
19 000 19 500
15 500 17 000
16 000 23 000
17 500 21 000
18 000 22 000
Use a 0.05 level of significance and apply the Wilcoxon signed rank test to
test whether or not prices for the microwave oven are the same in the two
cities.
5. A company is interested in the impact of a newly introduced quality
management program on the job satisfaction of workers. A random sample of 34
workers was asked to assess their level of satisfaction on a scale from 1 to 10 two
months before the program. The same sample members were asked to make
this assessment again two months after the introduction of the program. The
34 differences in the pairs of ratings were calculated and the absolute differences
ranked. The smaller of the rank sums, which was for those more satisfied
before the introduction of the program, was 178. What can be concluded
from these findings?
6. A random sample of 90 members was taken. Each sample member was
asked to assess the amount of time in a month spent watching TV and the
amount of time in a month spent reading. The 90 differences in times spent
were then calculated and their absolute differences ranked. The smaller of
the rank sums, which was for watching TV, was 1680. Test the null
hypothesis that the population amounts of time spent on watching TV and
reading are the same against the alternative that watching TV takes more
time.
7. Suppose you wish to test the hypothesis that two treatments, A and B, are
equivalent against the alternative that the responses for A tend to be larger
than those for B. If the number of pairs equals 25, and the smaller of the rank
sums of the absolute differences is 273, then what would you decide? Use \( \alpha = 5\% \),
then find the p-value for the test and interpret it.
8. An experiment was conducted to compare two print types, A and B, to
determine whether type A is easier to read. A sample of 22 persons was
given the same material to read. First they read the material printed with type
A, then read the same material printed with type B. The times necessary for
each person to read the materials (in seconds) were
Type A: 95;122;101;99;108;122;135;127;119;127;99;98;97;96;112;97;100;
116; 111;117;102;103
Type B: 110;102;115;112;120;117;119;127;137;119;99;100;102;103;118;
99;89;97;112;116; 178; 94.
Do the data provide sufficient evidence to reject the hypothesis that print type A
and print type B are equally easy to read, against the alternative that print type A is
easier to read? Test using \( \alpha = 0.05 \).
Answers
1. T.S. = 7; accept \( H_0 \); 2. T.S. = 1; reject \( H_0 \); 3. T.S. = 6; reject \( H_0 \);
4. T.S. = 3; reject \( H_0 \) at virtually any level; 5. T.S. = -2.04; p-value = 4.12%;
6. T.S. = -1.48; reject \( H_0 \) at levels higher than 6.94%; 7. T.S. = 2.97; accept
\( H_0 \) at any level; 8. T.S. = 0.71; reject \( H_0 \).
Then for large sample sizes (both at least 10), the distribution of the random
variable
\[ Z = \frac{U - \mu_U}{\sigma_U} \]
is well approximated by the standard normal distribution.
Table 2.2
        Branch 1                       Branch 2
Sampled    Account            Sampled    Account
account    balance ($)        account    balance ($)
1          1 095              1          885
2          955                2          850
3          1 200              3          915
4          1 195              4          950
5          925                5          800
6          950                6          750
7          805                7          865
8          945                8          1 000
9          875                9          1 050
10         1 055              10         935
11         1 025
12         975
The first step in the Mann-Whitney test is to rank the combined (pooled)
data from the two samples from low to high. Using the combined set of 22
observations shown in Table 2.2, the lowest value of $750 (item 6 of
sample 2) is ranked number 1. Continuing the ranking, we have
Account balance Item Rank
750 6 of sample 2 1
800 5 of sample 2 2
805 7 of sample 1 3
…… ……………. …
1 195 4 of sample1 21
1 200 3 of sample 1 22
Item 6 of sample 1 and item 4 of sample 2 both have the same account
balance, $950. We could give one of these items a rank 12 and the other a
rank 13, but this could lead to an erroneous conclusion. In order to avoid this
difficulty the usual treatment for tied data values is to assign each value the
rank equal to the average of the ranks associated with the tied items. Thus
the tied observations of $950 are both assigned ranks of 12.5. Table 2.3
shows the entire data set with the rank of each observation.
Table 2.3
              Branch 1                            Branch 2
Sampled    Account                   Sampled    Account
account    balance ($)    Rank       account    balance ($)    Rank
1          1 095          20         1          885            7
2          955            14         2          850            4
3          1 200          22         3          915            8
4          1 195          21         4          950            12.5
5          925            9          5          800            2
6          950            12.5       6          750            1
7          805            3          7          865            5
8          945            11         8          1 000          16
9          875            6          9          1 050          18
10         1 055          19         10         935            10
11         1 025          17
12         975            15
Sum of ranks              169.5                                83.5
The next step in the Mann-Whitney test is to sum the ranks for each sample.
These sums are shown in Table 2.3. The test procedure can be based upon
the sum of the ranks for either sample. In the following discussion we use
the sum of the ranks for the sample from branch 1. We will denote this sum
by \( R_1 \). Thus, in our example \( R_1 = 169.5 \).
The value observed for the Mann-Whitney statistic is
\[ U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1 = 12 \cdot 10 + \frac{12 \cdot 13}{2} - 169.5 = 28.5 \]
If the two samples are selected from identical populations and \( n_1 \) and \( n_2 \)
are each 10 or greater, the sampling distribution of \( U \) can be approximated by
a normal distribution with mean
\[ E(U) = \mu_U = \frac{n_1 n_2}{2} = \frac{12 \cdot 10}{2} = 60 \]
and variance
\[ Var(U) = \sigma_U^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} = \frac{12 \cdot 10 \cdot 23}{12} = 230 \]
Suppose that we want to test the null hypothesis that the central locations of
the distributions of account balances are identical against the two-sided
alternative for \( \alpha = 0.05 \). The decision rule is to reject the null hypothesis if
\[ \frac{U - \mu_U}{\sigma_U} < -z_{\alpha/2} \quad \text{or} \quad \frac{U - \mu_U}{\sigma_U} > z_{\alpha/2} \]
Here
\[ \frac{U - \mu_U}{\sigma_U} = \frac{28.5 - 60}{\sqrt{230}} = -2.08 \]
\( z_{\alpha/2} = z_{0.025} = 1.96 \) and \( -z_{0.025} = -1.96 \)
Since -2.08 is less than -1.96, we reject the null hypothesis that the two
population distributions of account balances are identical. Thus we conclude
that the two populations are not identical: the probability distribution of account
balances at branch 1 is not the same as that at branch 2.
Now, from Table 1 of the Appendix, the value of \( \alpha/2 \) corresponding to a
value of -2.08 is 0.0188, so the corresponding \( \alpha \) is 0.0376:
\[ p\text{-value} = 2\,(1 - F_z(|\text{test statistic}|)) = 2\,(1 - 0.9812) = 0.0376 \]
The null hypothesis will be rejected for any significance level higher than
3.76%. Thus these data contain fairly strong evidence against the
hypothesis that the central locations of account balances at the two branches are the
same: the hypothesis is rejected at the 5% level, although not at the 1% level.
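The whole Mann-Whitney calculation for Table 2.2 can be retraced with the following Python sketch. The helper `average_ranks` is a hypothetical name introduced here, and the p-value comes from `statistics.NormalDist` applied to the unrounded z, so it may differ slightly from the tabulated 0.0376.

```python
from math import sqrt
from statistics import NormalDist

# Account balances from Table 2.2.
branch1 = [1095, 955, 1200, 1195, 925, 950, 805, 945, 875, 1055, 1025, 975]
branch2 = [885, 850, 915, 950, 800, 750, 865, 1000, 1050, 935]


def average_ranks(values):
    """Map each distinct value to its rank in the pooled data, averaging ties."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        ranks[ordered[i]] = (i + 1 + j) / 2   # mean of ranks i+1, ..., j
        i = j
    return ranks


ranks = average_ranks(branch1 + branch2)
n1, n2 = len(branch1), len(branch2)
r1 = sum(ranks[v] for v in branch1)           # rank sum for branch 1
u = n1 * n2 + n1 * (n1 + 1) / 2 - r1          # Mann-Whitney statistic
mu_u = n1 * n2 / 2
var_u = n1 * n2 * (n1 + n2 + 1) / 12
z = (u - mu_u) / sqrt(var_u)
p_two_sided = 2 * NormalDist().cdf(-abs(z))
print(r1, u, mu_u, var_u, round(z, 2), round(p_two_sided, 3))
```

The sketch reproduces the rank sum 169.5, U = 28.5, mean 60, variance 230 and z of about -2.08.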
Exercises
1. Starting salaries were recorded for ten recent business administration
graduates at each of two well-known universities. Use \( \alpha = 0.1 \) and test the
null hypothesis that the difference in the starting salaries from the two
universities is zero against the alternative that starting salaries are higher for
university A.
University A University B
Student Monthly salary ($) Student Monthly salary ($)
1 890 1 1 000
2 950 2 1 020
3 1 200 3 1 140
4 1 150 4 1 000
5 1 300 5 975
6 1 350 6 925
7 990 7 900
8 1 050 8 1 025
9 1 400 9 1 075
10 1 450 10 930
2. The following data show product weights for items produced on two
production lines
Line 1: 13.6; 13.8; 14.0; 13.9; 13.4; 13.2; 13.3; 13.6; 12.9; 14.4
Line 2: 13.7; 14.1; 14.2; 14.0; 14.6; 13.5; 14.4; 14.8; 14.5; 14.3; 15.0; 14.9
Test the null hypothesis that the difference between the product weights for the
two lines is zero against the alternative that product weights for the second line
are higher. Use \( \alpha = 0.10 \). Also find the p-value.
3. A random sample of 14 male students and an independent random sample
of 16 female students were asked to write essays at the conclusion of a
writing course. Their grades were recorded below:
Male: 75; 80; 60; 80; 95; 100; 65; 70; 75; 60; 50; 55; 90; 95
Female: 85; 70; 90; 100; 95; 67; 50; 50; 67; 83; 78; 62; 43; 97; 89; 73
Test at the 5% significance level the null hypothesis that, in the aggregate, the male
and female students are equally ranked, against a two-sided alternative. Also
find the p-value.
4. A random sample of 12 management department graduates and 14
economics department graduates were asked their starting salaries. Those
salaries were then ranked from 1 to 26. The following rankings resulted
Management: 2; 6; 7; 1; 11; 20; 8; 14; 21; 12; 4; 26
Economics: 13; 3; 17; 25; 5; 9; 10; 24; 15; 23; 16; 22; 18; 19
Analyze the data using the Mann-Whitney test, and comment on the results.
5. Starting salaries of graduates from two leading universities were
compared. Independent random samples of 40 from each university were
taken, and the 80 starting salaries were pooled and ranked. The sum of the
ranks for students from one of these universities was 1450. Test the null
hypothesis that the central locations of the population distributions are
identical against two sided alternative.
6. A stock market analyst produced at the beginning of the year a list of
stocks to buy and another list of stocks to sell. For a random sample of ten
stocks from the “buy list”, percentage returns over the year were as follows:
10.6; 5.2; 12.8; 16.2; 10.6; 4.3; 3.1; 11.7; 13.9; 11.3
For an independent random sample of ten stocks from the “sell list”,
percentage returns over the year were as follows:
-2.6; 6.1; 9.9; 11.3; 2.3; 3.9; -2.3; 1.3; 7.9; 10.8
For \( \alpha = 0.05 \) use the Mann-Whitney test to interpret these data. Also find
and interpret the p-value.
Answers
1. T .S. ; reject H 0 ;2. T .S. ; reject H 0 ; p-value = 0.3%;
3. T .S. ; accept H 0 ; 4. T .S. ; p- value =12.36%; H 0 will be rejected
at all levels higher than 12.36%; 5. T .S. ; p-value = 0.101; H 0 will be
rejected at any level higher than 10.1%; 6. T .S. ; reject H 0 at 5%;
p- value = 2.58%.
Chapter 3
3.1. Introduction
Salesperson               1    2    3    4    5    6    7    8    9    10
Years of experience       1    3    4    4    6    8    10   10   11   13
Annual sales ($1000’s)    80   97   92   102  103  111  119  123  117  136
Let us plot these data on a graph with years of selling experience on the
horizontal axis and annual sales on the vertical axis. We now have a scatter
diagram. It is given this name because the plotted points are “scattered”
over the graph or diagram. The scatter diagram for these data is shown in
Figure 3.1.
Fig. 3.1. Scatter diagram of annual sales ($1000’s, vertical axis) and years of experience (horizontal axis)
3.3. Correlation analysis
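\[ r_{xy} = \frac{Cov(x, y)}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (3.1) \]
An equivalent expression is
\[ r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 \right)}} \qquad (3.2) \]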
Household 1 2 3 4 5 6 7
Income (100’s of $) 35 49 21 39 15 28 25
Food expenditure (100’s of $) 9 15 7 11 5 8 9
\[ r_{xy} = \frac{2150 - 7\,(30.29)(9.14)}{\sqrt{(7222 - 7\,(30.29)^2)\,(646 - 7\,(9.14)^2)}} = \frac{212.05}{221.25} = 0.96 \]
The sample correlation, 0.96, indicates a very strong positive relationship
between monthly income and food expenditure. Higher values of monthly
income tend to be associated with higher values of food expenditure.
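As a check on this calculation, formula (3.2) can be evaluated directly in Python for the seven households. The function name `sample_correlation` is hypothetical; because the sketch works with unrounded means, its intermediate values differ slightly from the rounded ones in the text, while the result still rounds to 0.96.

```python
from math import sqrt

# Household data (hundreds of dollars).
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]


def sample_correlation(x, y):
    """Sample correlation coefficient computed as in formula (3.2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    s_xx = sum(a * a for a in x) - n * x_bar ** 2
    s_yy = sum(b * b for b in y) - n * y_bar ** 2
    return s_xy / sqrt(s_xx * s_yy)


r = sample_correlation(income, food)
print(round(r, 2))
```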
where
\[ T.S. = \frac{r_{xy} \sqrt{n-2}}{\sqrt{1 - r_{xy}^2}} \]
and \( t_{n-2,\alpha} \) is the number for which
\[ P(t_{n-2} > t_{n-2,\alpha}) = \alpha \]
where the random variable \( t_{n-2} \) follows a Student’s t distribution with \( (n-2) \)
degrees of freedom.
Example:
A sample data set produced the following information
\( n = 10 \);
\( \sum x_i = 66 \);
\( \sum y_i = 588 \);
\( \sum x_i y_i = 2244 \);
Exercises
1. For the data set
x 0 1 6 3 5
y 4 3 0 2 1
Experience 14 3 5 6 4 9 18 5 16
Monthly salary 22 12 15 17 15 19 24 13 27
a) Develop a scatter diagram for the above data.
b) Compute the sample correlation coefficient between grade point average
and salary.
c) Test at the 5% significance level the null hypothesis that the population
correlation coefficient is zero against the alternative that it is positive.
6. The management of a supermarket wanted to check the effect of the
number of TV commercials on the gross sales at the store. The management
experimented for eight weeks by broadcasting a different number of
commercials each week on TV. The following table gives the number of
commercials during each week and the gross sales (in 1000’s of dollars)
Number of commercials     22    16    28    12    30    19    24    32
Gross sales per week      3.64  3.12  4.08  2.84  3.98  3.55  4.02  4.38
3.4. Spearman rank correlation
Suppose that a random sample \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \) of n pairs of
observations is taken. If the \( x_i \) and \( y_i \) are each ranked in ascending order and the
sample correlation of these ranks is calculated, the resulting coefficient is
called the Spearman rank correlation coefficient. If there are no tied
ranks, an equivalent formula for computing this coefficient is
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
where \( d_i \) is the difference between the ranks of \( x_i \) and \( y_i \).
a) Find and interpret the Spearman rank correlation coefficient.
b) Test the null hypothesis that aggressiveness and sales are independent
against the alternative that they are positively correlated. Take \( \alpha = 0.05 \).
Solution:
a) First of all, let us rank x and y separately. Here the largest value receives
rank 1; ranking both variables in the same direction (both ascending or both
descending) gives the same coefficient. The two sets of ranks appear in the
third and fourth columns of Table 3.2.
Table 3.2
x     y     Rank of x_i   Rank of y_i   d_i (rank of x_i - rank of y_i)   d_i^2
30    35    4             5             -1                                1
17    31    8             8              0                                0
35    40    2             4             -2                                4
28    46    5             2              3                                9
42    50    1             1              0                                0
25    32    6             7             -1                                1
19    33    7             6              1                                1
34    42    3             3              0                                0
Sum                                                                       16
The differences between ranks and the squared differences between ranks are
shown in the last two columns of the table. Substituting the values \( n = 8 \) and
\( \sum d_i^2 = 16 \) into the formula for the Spearman rank correlation, we obtain
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot 16}{8 \cdot 63} = 1 - 0.19 = 0.81 \]
This means that there is a strong positive correlation between aggressiveness
and sales volume.
b) The null and alternative hypotheses are
\( H_0 \): x and y are independent
\( H_1 \): x and y are positively correlated
The decision rule is
Reject \( H_0 \) if \( r_s > r_{s,\alpha} \)
For a sample of size \( n = 8 \) and \( \alpha = 0.05 \),
\( r_{s,\alpha} = r_{8,0.05} = 0.643 \)
Since 0.81 > 0.643, we reject \( H_0 \) and accept the alternative hypothesis that x
and y are positively correlated.
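The ranks and the coefficient for this example can be recomputed with a small Python sketch. It assumes, as in the data above, that there are no tied values; the helper `descending_ranks` is a hypothetical name, and the cutoff 0.643 is simply the table value quoted in the solution.

```python
# Data of the example (x, y pairs in the order of Table 3.2).
x = [30, 17, 35, 28, 42, 25, 19, 34]
y = [35, 31, 40, 46, 50, 32, 33, 42]


def descending_ranks(values):
    """Rank 1 for the largest value; valid only when there are no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


rank_x = descending_ranks(x)
rank_y = descending_ranks(y)
d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(x)
r_s = 1 - 6 * d_squared / (n * (n ** 2 - 1))

critical_value = 0.643            # r_{8, 0.05}, as quoted in the text
reject_h0 = r_s > critical_value
print(rank_x, rank_y, d_squared, round(r_s, 2), reject_h0)
```

The sum of squared rank differences is 16 and the coefficient rounds to 0.81, so the null hypothesis of independence is rejected at the 5% level, as in the solution.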
Exercises
1. Specify the rejection region for Spearman’s nonparametric test for rank
correlation in each of the following cases
a) H 0 : 0; H1 : 0; n 10; 0.05
b) H 0 : 0; H1 : 0; n 20; 0.025
c) H 0 : 0; H 1 : 0; n 30; 0.01
2. Compute Spearman’s rank correlation coefficient for each of the
following pairs of sample observations
a)
x   33  61  20  19  40
y   26  36  65  25  35
b)
x   5   20  15  10  3
y   80  83  91  82  87
Answers
where a and b are estimated values of the coefficients and e is the difference
between the observed value \( y_i \) and the predicted value of y on the regression
line, defined as
\[ \hat{y}_i = a + b x_i \]
The difference between \( y_i \) and \( \hat{y}_i \) for each value of x is defined as the residual
\[ e_i = y_i - \hat{y}_i = y_i - (a + b x_i) \]
Thus for each observed value of x there is a predicted value of y from the
estimated model and an observed value. The residual \( e_i \) is not the model
error \( \varepsilon_i \), but a combined measure of the model error and the errors in
estimating a and b, and in turn the errors in estimating the predicted value.
Fig. 3.2. Observed points \( (x_i, y_i) \), the corresponding points \( (x_i, \hat{y}_i) \) on the fitted line, and the residuals \( e_i \)
\[ SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 \]
The coefficients a and b are chosen so that the quantity
\[ SSE = \sum e_i^2 = \sum (y_i - (a + b x_i))^2 \]
is minimized. It can be shown that the resulting estimates are
\[ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \]
and
\[ a = \bar{y} - b \bar{x} \]
where \( \bar{x} \) and \( \bar{y} \) are the respective sample means.
The line
\[ \hat{y} = a + b x \]
is called the sample regression line or the least squares regression line of
y on x.
Example:
Find the least squares regression line for the data on incomes (in hundreds of
dollars) and food expenditures of seven households given in the table below.
Household 1 2 3 4 5 6 7
Income x 35 49 21 39 15 28 25
Food expenditure y 9 15 7 11 5 8 9
Solution:
We are to find the values of a and b for the regression model \( \hat{y}_i = a + b x_i \).
Table 3.3 shows the calculations required for the computation of
a and b.
Using data from Table 3.3 we find
\[ \bar{x} = \frac{212}{7} = 30.2857; \qquad \bar{y} = \frac{64}{7} = 9.1429 \]
Table 3.3
Household   Income (x_i)   Food expenditure (y_i)   x_i y_i   x_i^2
1           35             9                        315       1225
2           49             15                       735       2401
3           21             7                        147       441
4           39             11                       429       1521
5           15             5                        75        225
6           28             8                        224       784
7           25             9                        225       625
Sums        212            64                       2150      7222
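\[ b = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{2150 - 7\,(30.2857)(9.1429)}{7222 - 7\,(30.2857)^2} = 0.2642 \]
\[ a = \bar{y} - b \bar{x} = 9.1429 - (0.2642)(30.2857) = 1.1414 \]
Thus, our estimated regression model \( \hat{y} = a + b x \) is
\[ \hat{y} = 1.1414 + 0.2642\,x \]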
This regression line is called the least squares regression line. It gives the
regression of food expenditure on income.
Using this estimated model, we can find the predicted value of y for a
specific value of x. For example, suppose that we randomly select a
household whose monthly income is $3500, so that \( x = 35 \) (x denotes income
in hundreds of dollars in our example). The predicted value of food
expenditure for this household is
\[ \hat{y} = 1.1414 + 0.2642 \cdot 35 = 10.3884 \text{ (hundred dollars)} \]
In other words, based on our regression line, we predict that a household
with a monthly income of $3500 is expected to spend $1038.84 per month
on food.
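The least squares formulas can be applied to the household data with the short Python sketch below. The function name `least_squares` is hypothetical. Note that the sketch keeps full precision, so it gives a of about 1.1422, whereas the text obtains 1.1414 by using the rounded slope 0.2642; the slope and the prediction agree with the text to the digits shown.

```python
# Household data (hundreds of dollars), as in Table 3.3.
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]


def least_squares(x, y):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares(income, food)
predicted = a + b * 35            # household with a monthly income of $3500
print(round(a, 4), round(b, 4), round(predicted, 2))
```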
3.5.3. Interpretation of a and b
a) Interpretation of a
Consider a household with zero income. Using the estimated regression line
obtained above, the predicted value of y for \( x = 0 \) is
\[ \hat{y} = 1.1414 + 0.2642 \cdot 0 = 1.1414 \text{ (hundred dollars)} \]
Thus, we can state that a household with no income is expected to spend
$114.14 per month on food. We should be very careful when making this
interpretation of a. In the example of seven households, the incomes vary from a
minimum of $1500 to a maximum of $4900. Hence, our regression line is
valid only for values of x between 15 and 49. If we predict y for a value
of x outside this range, the prediction usually will not hold true. Thus, since
\( x = 0 \) is outside the range of household incomes that we have in the sample
data, the prediction that a household with zero income spends $114.14 per
month on food does not carry much credibility.
b) Interpretation of b
The value of b in a regression model gives the change in y (dependent
variable) due to a change of one unit in x (independent variable).
For example, by using the regression line \( \hat{y} = 1.1414 + 0.2642\,x \):
when \( x = 30 \): \( \hat{y} = 1.1414 + 0.2642 \cdot 30 = 9.0674 \)
when \( x = 31 \): \( \hat{y} = 1.1414 + 0.2642 \cdot 31 = 9.3316 \)
Hence, when x increased by one unit, from 30 to 31, \( \hat{y} \) increased
by \( 9.3316 - 9.0674 = 0.2642 \), which is the value of b. Because the unit of
measurement is hundreds of dollars, we can state that, on average, a $100
increase in income is associated with a $26.42 increase in food expenditure. We can
also state that, on average, a $1 increase in household income
increases food expenditure by $0.2642.
Note that when b is positive, an increase in x will lead to an increase in y and
decrease in x will lead to a decrease in y. Such a relationship between x and y
is called a positive linear relationship. On the other hand, if the value of b is
negative, an increase in x will cause a decrease in y and a decrease in x will
cause an increase in y. Such a relationship between x and y is called a
negative linear relationship.
The values of the y-intercept and slope calculated from sample data on x and y
are called the estimated values of \( \alpha \) and \( \beta \) and are denoted by a and b. Using a and
b we can write the estimated model as
\[ \hat{y} = a + b x \]
where \( \hat{y} \) (read as y hat) is the estimated or predicted value of y for a given
value of x.
3.5.4. Assumptions of the regression model
Like any other theory, linear regression analysis is based on certain
assumptions. Consider the population regression model
\[ y_i = \alpha + \beta x_i + \varepsilon_i \]
There are four assumptions made about this model.
Assumption 1: The random error term \( \varepsilon \) has a mean equal to zero for each x.
In other words, among all households with the same income, some spend
more than the predicted food expenditure and others spend less. The sum of
the positive errors equals the sum of the negative errors, so
that the mean of the errors for all households with the same income is zero.
Assumption 2: The errors associated with different observations are
independent. According to this assumption, the errors for any two
households are independent. All households decide independently how much
to spend on food.
Assumption 3: For any given x, the distribution of errors is normal. In other
words, food expenditures for all households with the same income are
normally distributed.
Assumption 4: The distribution of population errors for each x has the same
(constant) standard deviation, which is denoted by \( \sigma_\varepsilon \). This assumption
indicates that the spread of points around the regression line is similar for all
x values.
Exercises
1. Plot the following straight lines. Give the values of the y-intercept and
slope for each of these lines and interpret them. Indicate whether each of the
lines gives a positive or negative relationships between x and y .
a) y 53 7 x ; b) y 75 6 x
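2. The following information is obtained from sample data:
\( n = 10 \); \( \sum_{i=1}^{10} x_i = 100 \); \( \sum_{i=1}^{10} y_i = 220 \); \( \sum_{i=1}^{10} x_i y_i = 3680 \); \( \sum_{i=1}^{10} x_i^2 = 1140 \)
\( \sum_{i=1}^{14} (x_i - \bar{x})(y_i - \bar{y}) = 2.677 \); \( \sum_{i=1}^{14} (y_i - \bar{y})^2 = 2.01 \)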
x 0 1 6 3 5
y 4 3 0 2 1
b) Suppose five electronics companies spent $2000 each on advertising
during that year. Do you expect these five companies to have the same actual
gross sales for that year? Explain.
6. An economist wanted to determine whether or not the amount of phone
bills and income of households are related. The following table gives
information on the monthly incomes (in hundreds of dollars) and monthly
telephone bills (in dollars) for a random sample of 10 households
Income 16 45 36 32 30 13 41 15 36 40
Phone bill 35 78 102 56 75 26 130 42 59 85
a) Find the regression line with income as an independent variable and the
amount of the phone bill as a dependent variable.
b) Give an interpretation of the values of a and b calculated in part a.
c) Estimate the amount of the monthly phone bill for a household with a
monthly income of $2500.
7. An auto manufacturing company wanted to investigate how the price of
one of its car models depreciates with age. The research department at the
company took a sample of 9 cars of this model and collected the following
information on the ages (in years) and prices (in hundreds of dollars) of these
cars.
Age 8 3 7 10 3 5 6 9
Price 16 74 38 21 98 56 49 30
x 0.5 1 1.5
y 2 1 3
a) Plot the following two lines on your scatter diagram
1) \( y = 3 - x \) and 2) \( y = 1 + x \)
b) Which of these lines would you choose to characterize the relationship
between x and y? Explain
c) Show that the sum of errors for both of these lines equals 0.
d) Which of these lines has smaller SSE ?
e) Find the least squares regression line for the data and compare it to two
lines described in part a.
Answers
^ ^ ^
2. y 83.714 10.571 x ;3. y 4.225 0.247 x ;4.c) y 3.845 0.615 x ;
^
5. a)$50.6 thousand; b)different amounts;6.a) y 2.3173 2.1869 x ;
^
c) $56.99; 7. y 111 9.84 x ; 8. b) The second line; d) The second line;
^
e) y 1 x .
In Figure 3.3 it is shown that the deviation of an individual y value from its
mean can be partitioned into the deviation of the predicted value from the mean
and the deviation of the observed value from the predicted value
\[ y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \]
Fig. 3.3. Partition of the deviation \( y_i - \bar{y} \) (SST) into \( e_i = y_i - \hat{y}_i \) (SSE) and \( \hat{y}_i - \bar{y} \) (SSR) about the line \( \hat{y} = a + bx \)
We square each side of the equation, because the sum of deviations about the
mean is equal to zero, and sum the results over all n points
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]
Some of you may note that squaring the right-hand side should produce the
cross product of the two terms in addition to their squared quantities. It can
be shown that the cross-product term is equal to zero. This equation is
expressed as
\[ SST = SSR + SSE \]
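A brief sketch of why the cross-product term vanishes, added here as a supporting step and not taken from the text: it relies on the two first-order conditions of least squares, \( \sum e_i = 0 \) and \( \sum e_i x_i = 0 \), which follow from setting the partial derivatives of SSE with respect to a and b equal to zero.

```latex
\[
\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
  = \sum_{i=1}^{n} e_i \,(a + b x_i - \bar{y})
  = (a - \bar{y}) \sum_{i=1}^{n} e_i + b \sum_{i=1}^{n} e_i x_i
  = 0
\]
```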
We see that the total variability, SST, consists of two components: SSR, the
amount of variability explained by the regression equation, named the
“Regression Sum of Squares”, and SSE, the random or unexplained deviation of
points from the regression line, named the “Error Sum of Squares”. Thus
Total Sum of Squares:
\[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
Regression Sum of Squares:
\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Error Sum of Squares:
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (a + b x_i))^2 = \sum_{i=1}^{n} e_i^2 \]
For a given set of observed values of the dependent variable y, the SST is
fixed as the total variability of all observations from the mean. We see that in
the partitioning, larger values of SSR and hence smaller values of SSE indicate
a regression equation that “fits” or comes closer to the observed data. This
partitioning is shown graphically in Figure 3.3.
Example:
Let us find SST, SSR and SSE for the data on incomes and food expenditure.
Using the calculations given in Table 3.3 we find the value of the total sum of
squares as
\[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{7} y_i^2 - \frac{\left( \sum_{i=1}^{7} y_i \right)^2}{7} = 646 - \frac{64^2}{7} = 60.8571 \]
Table 3.4
x   y   \( \hat{y} \)   \( y^2 \)   \( e_i \)   \( e_i^2 \)   \( x_i - \bar{x} \)   \( (x_i - \bar{x})^2 \)
The error sum of squares SSE is the sum of the \( e_i^2 \) column in
Table 3.4. Thus,
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = 4.9283 \]
The regression sum of squares can be found from \( SST = SSR + SSE \).
Thus
\[ SSR = SST - SSE = 60.8571 - 4.9283 = 55.9288 \]
The value of SSR can also be computed by using the formula (check!)
\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The total sum of squares SST is a measure of the total variation in food
expenditures, SSR is the portion of total variation explained by the regression
model (or by income), and the error sum of squares SSE is the portion of
total variation not explained by the regression model.
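The three sums of squares for the household data can be verified numerically with the sketch below. It refits the line at full precision, so the values agree with the text up to rounding; the final ratio SSR/SST is included only as a plausibility check against the 92% figure quoted in the discussion of \( R^2 \), on the assumption that this ratio is what that figure refers to.

```python
# Household data (hundreds of dollars).
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(food) / n

# Least squares estimates of slope and intercept.
b = (sum(x * y for x, y in zip(income, food)) - n * x_bar * y_bar) / (
    sum(x * x for x in income) - n * x_bar ** 2
)
a = y_bar - b * x_bar
fitted = [a + b * x for x in income]

sst = sum((y - y_bar) ** 2 for y in food)                 # total
sse = sum((y - f) ** 2 for y, f in zip(food, fitted))     # unexplained
ssr = sum((f - y_bar) ** 2 for f in fitted)               # explained
ratio = ssr / sst

print(round(sst, 4), round(ssr, 4), round(sse, 4), round(ratio, 2))
```

Computing SSR directly from the fitted values, and not as SST minus SSE, also confirms the identity SST = SSR + SSE for these data.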
3.6.1. Coefficient of determination \( R^2 \)
We can state that 92% of the variability in y is explained by linear
regression, and the linear model seems very satisfactory in this respect. In
other words, we can state that 92% of the total variation in food expenditures
of households occurs because of the variation in their incomes, and the
remaining 8% is due to other variables, like differences in size of the
household, preferences and tastes and so on.
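The partition and the 92% figure can be reproduced from the summary values quoted in the example. The sketch below is only an illustration in Python; it assumes $\sum y_i^2 = 646$, $\sum y_i = 64$, $n = 7$ and $SSE = 4.9283$ as given above, and the variable names are not part of the original text.

```python
# Summary values quoted in the food expenditure example.
n = 7
sum_y = 64.0
sum_y_sq = 646.0
sse = 4.9283

# Total sum of squares via the shortcut: sum(y^2) - (sum(y))^2 / n.
sst = sum_y_sq - sum_y ** 2 / n

# Partition SST = SSR + SSE, so SSR is what is left after removing SSE.
ssr = sst - sse

# Share of the total variation explained by the regression line.
r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, R^2 = {r_squared:.4f}")
```

The printed share is about 0.92, which is the "92% of the variability" quoted in the text.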
When we consider income and food expenditures, households with the
same income are expected to spend different amounts on food.
Consequently, the random error $\varepsilon_i$ will have different values for these
households. The variance $\sigma_\varepsilon^2$ measures the spread of these errors around the
population regression line. Note that $\sigma_\varepsilon^2$ denotes the variance of errors for
the population. However, $\sigma_\varepsilon^2$ is usually unknown. In such cases, it is
estimated by $s_e^2$, which is the variance of errors for the sample
data; its positive square root $s_e$ is the standard deviation of errors.
An estimator for the variance of the population model error is
$\hat{\sigma}_\varepsilon^2 = s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2} = \frac{SSE}{n - 2}$
Division by $(n - 2)$ instead of $(n - 1)$ results because the simple regression
model uses two estimated parameters, $a$ and $b$, instead of one.
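As a quick numerical illustration of this estimator, the following Python sketch divides SSE by the degrees of freedom. It assumes the SSE value 4.9283 and $n = 7$ from the food expenditure example; the function name `error_variance` is hypothetical.

```python
import math

def error_variance(sse, n, n_params=2):
    """Estimate the error variance as SSE / (n - number of estimated parameters)."""
    dof = n - n_params
    if dof <= 0:
        raise ValueError("need more observations than estimated parameters")
    return sse / dof

# Simple regression estimates two parameters (a and b), so dof = 7 - 2 = 5.
s_e_sq = error_variance(4.9283, 7)
s_e = math.sqrt(s_e_sq)
print(f"s_e^2 = {s_e_sq:.4f}, s_e = {s_e:.4f}")
```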
The formula for SSE is
$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
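If we introduce the following notations
$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$
$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$
$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$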
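Each of these notations has a definitional form (sums of deviations from the means) and a shortcut form (raw sums). The Python sketch below checks that the two forms agree. The four data points are hypothetical and chosen only to keep the arithmetic easy to follow; the function names are illustrative.

```python
def sums_of_squares_shortcut(xs, ys):
    """SS_xx, SS_yy, SS_xy from raw sums (the shortcut formulas)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return ss_xx, ss_yy, ss_xy

def sums_of_squares_definition(xs, ys):
    """SS_xx, SS_yy, SS_xy from deviations about the sample means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xx, ss_yy, ss_xy

# Hypothetical data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

shortcut = sums_of_squares_shortcut(xs, ys)
definition = sums_of_squares_definition(xs, ys)
print(shortcut, definition)
```

The shortcut form is convenient for hand calculation because it needs only the column totals of a table such as Table 3.3.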
$\sum_{i=1}^{12} x_i^2 = 396$; and $\sum_{i=1}^{12} y_i^2 = 58734$
$\sum_{i=1}^{460} x_i^2 = 48530$; and $\sum_{i=1}^{460} y_i^2 = 39347$
4. Computing from a data set of (x, y) values produced the following
summary statistics
$n = 14$; $\bar{x} = 1.2$; $\bar{y} = 5.1$;
$SS_{xx} = 14.10$; $SS_{xy} = 2.31$; $SS_{yy} = 2.01$
Determine the proportion of variation in y that is explained by linear
regression.
5. A calculation shows that $SS_{xx} = 10.1$, $SS_{yy} = 16.5$, and $SS_{xy} = 9.3$.
Determine the proportion of variation in y that is explained by linear
regression.
6. The following table lists the sizes of offices (in hundreds of square meters)
and the rents (in dollars) paid for those offices.
Size of offices 22 17 19 28 35 24
Monthly rent 710 590 730 880 1080 820
a) Find the regression line $\hat{y} = a + bx$ with the size of an office as an
independent variable and monthly rent as a dependent variable.
b) Give a brief interpretation of the values of a and b.
c) Predict the monthly rent for the office with 2400 square meters.
d) One of the offices is 2600 square meters and its rent is $850. What is the
predicted rent for this office? Find the error for this office.
e) Compute the standard deviation of errors.
f) Calculate the coefficient of determination. What percentage of the
variation in monthly rents is explained by the sizes of the offices? What
percentage of this variation is not explained?
7. Refer to exercise 7 of the previous chapter. The following table, which gives
the ages (in years) and prices (in hundreds of dollars) of eight cars of a specific
model, is reproduced from that exercise.
Age 8 3 7 10 3 5 6 9
Price 16 74 38 21 98 56 49 30
Answers
1. 22.2; 0.99; 2. 50.06; 0.04; 3. a) $\hat{y}$ = 1.454 [?] 0.247x; b) $s_e^2 = 0.031$;
4. 0.188; 5. 0.5190; 6. a) $\hat{y} = 194 + 25.1x$; e) 40.2; f) 0.953;
7. a) 11.39; b) 0.856
One of the main purposes for determining a regression line is to find the true
value of the slope $\beta$ of the population regression line. However, in almost
all cases, the regression line is estimated using sample data. Then, based on
the sample regression line, inferences are made about the population
regression line. The slope $b$ of a sample regression line is a point estimator
of the slope $\beta$ of the population regression line. Different samples taken
from the same population will give different sample regression lines and hence
different values of $b$. If only one sample is selected,
then the value of $b$ will depend on which elements are included in the
sample. Thus, $b$ is a random variable and it possesses a probability
distribution called a sampling distribution.
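Assume that assumptions 3.5.4 hold. Then $b$ is an unbiased estimator of
$\beta$ and has population variance
$\sigma_b^2 = \frac{\sigma_\varepsilon^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma_\varepsilon^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$
An unbiased estimator of $\sigma_b^2$ is
$s_b^2 = \frac{s_e^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_e^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{s_e^2}{SS_{xx}}$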
3.7.1. Hypothesis testing about $\beta$
Let $\beta$ be a population regression slope and $b$ its least squares estimate based
on n pairs of sample observations. Assume that assumptions 3.5.4 hold and
also assume that the errors $\varepsilon_i$ are normally distributed. Then the random
variable
$t = \frac{b - \beta}{s_b}$
is distributed as Student’s t distribution with $(n - 2)$ degrees of freedom.
If we use the notation
$T.S. = t = \frac{b - \beta_0}{s_b}$
for the test statistic, then the following tests have significance level $\alpha$
1. To test either null hypothesis
$H_0: \beta = \beta_0$ or $H_0: \beta \le \beta_0$
against the alternative
$H_1: \beta > \beta_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n-2,\alpha}$
2. To test either null hypothesis
$H_0: \beta = \beta_0$ or $H_0: \beta \ge \beta_0$
against the alternative
$H_1: \beta < \beta_0$
the decision rule is
Reject $H_0$ if $T.S. < -t_{n-2,\alpha}$
3. To test the null hypothesis
$H_0: \beta = \beta_0$
against the two sided alternative
$H_1: \beta \ne \beta_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n-2,\alpha/2}$ or $T.S. < -t_{n-2,\alpha/2}$
Remark 1: To test the hypothesis that x does not determine y linearly, that is,
that there is no linear relationship, we test the null hypothesis that the slope
of the regression line is zero, $H_0: \beta = 0$ (that is, $\beta_0 = 0$). The alternative
hypothesis $H_1: \beta \ne 0$ means x determines y linearly;
$H_1: \beta > 0$ means x determines y positively; $H_1: \beta < 0$ means x
determines y negatively.
Remark 2: The null hypothesis does not always have to be $\beta = 0$. We may
test the null hypothesis that $\beta$ is equal to a value different from zero.
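The three decision rules can be written as a small routine. The Python sketch below is an illustration only: the function names are hypothetical, and the critical value is passed in by the caller from a t table ($t_{n-2,\alpha}$ for a one sided test, $t_{n-2,\alpha/2}$ for a two sided test) rather than computed. The numbers used are those quoted for the food expenditure data ($b = 0.2642$, $s_e = 0.9922$, $SS_{xx} = 801.4286$), with the table value $t_{5,0.05} = 2.015$.

```python
import math

def slope_t_statistic(b, s_e, ss_xx, beta_0=0.0):
    """Test statistic (b - beta_0) / s_b, where s_b = s_e / sqrt(SS_xx)."""
    s_b = s_e / math.sqrt(ss_xx)
    return (b - beta_0) / s_b

def reject_h0(t_stat, t_crit, alternative):
    """Apply the decision rule for the stated alternative hypothesis."""
    if alternative == "greater":      # H1: beta > beta_0
        return t_stat > t_crit
    if alternative == "less":         # H1: beta < beta_0
        return t_stat < -t_crit
    if alternative == "two-sided":    # H1: beta != beta_0
        return abs(t_stat) > t_crit
    raise ValueError(f"unknown alternative: {alternative}")

# Right tailed test of H0: beta = 0 at the 5% level with n - 2 = 5 d.f.
t_stat = slope_t_statistic(b=0.2642, s_e=0.9922, ss_xx=801.4286)
decision = reject_h0(t_stat, t_crit=2.015, alternative="greater")
print(f"T.S. = {t_stat:.3f}, reject H0: {decision}")
```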
Example:
Test at the 5% significance level if the slope of the population regression line
for the example on incomes and food expenditure of seven households is
positive.
Solution:
From earlier calculations we have
$n = 7$; $b = 0.2642$ and $s_e = 0.9922$
$s_b^2 = \frac{s_e^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{0.9856}{801.429} = 0.001229$; and $s_b = 0.0350$.
positively. That is, food expenditure increases with an increase in income
and it decreases with a decrease in income.
Exercises
1. The following information is obtained for a sample of 16 observations
taken from a population
$SS_{xx} = 340.700$; $s_e = 1.951$; and $\hat{y} = 12.45 + 6.32x$
a) Make a 99% confidence interval for $\beta$.
b) Using a significance level of 0.025, test the null hypothesis that $\beta$ is zero
against the alternative that $\beta$ is positive.
c) Using a significance level of 0.01, can you conclude that $\beta$ is zero against
the alternative that it is different from zero?
d) Using a significance level of 0.02, test whether $\beta$ is different from 4.50.
2. The following information is obtained for a sample of 100 observations
taken from a population. (Note that because $n > 30$, we can use the normal
distribution to make a confidence interval and test a hypothesis about $\beta$.)
$SS_{xx} = 524.884$; $s_e = 1.464$; and $\hat{y} = 5.48 + 2.50x$
a) Make a 98% confidence interval for $\beta$.
b) Test at the 2% significance level whether $\beta$ is zero against the alternative
that it is positive.
c) Can you conclude that $\beta$ is zero? Use $\alpha = 0.01$.
d) Using a significance level of 0.01, test whether $\beta$ is 1.75 against the
alternative that it is greater than 1.75.
3. Refer to exercise 7 of the previous chapter. The following table, which gives
the ages (in years) and prices (in hundreds of dollars) of eight cars of a specific
model, is reproduced from that exercise.
Age 8 3 7 10 3 5 6 9
Price 16 74 38 21 98 56 49 30
4. The following table gives the experience (in years) and monthly salaries
(in thousands of tenge) of nine randomly selected secretaries
Experience 14 3 5 6 4 9 18 5 16
Monthly salary 22 12 15 17 15 19 24 13 27
Size of offices 22 17 19 28 35 24
Monthly rent 710 590 730 880 1080 820
a) Construct a 99% confidence interval for $\beta$. You can use the calculations
made in exercise 6 of the previous section here.
b) Test at the 5% significance level the null hypothesis that $\beta$ is zero against
the alternative that it is different from zero.
6. The following data give information on the ages (in years) and the number
of breakdowns during the past year for a sample of six machines at a large
company.
Age 9 14 18 15 10 11
Number of breakdowns 34 46 52 64 42 44
a) Find the least squares regression line $\hat{y} = a + bx$.
b) Give a brief interpretation of the values a and b.
c) Compute and interpret $R^2$.
d) Compute the standard deviation of errors.
e) Construct a 98% confidence interval for $\beta$.
f) Test at the 2.5% significance level the null hypothesis that $\beta$ is zero
against the alternative that it is positive.
7. The following table gives information on the temperature in a city and
volume of the ice cream (in thousands) sold at the supermarket for a random
sample of eight days during the summer.
Temperature 22 16 28 12 30 19 24 32
Ice cream sold 3.64 3.12 4.08 2.84 3.98 3.55 4.02 4.38
a) Find the least squares regression line $\hat{y} = a + bx$. Take temperature as an
independent variable and volume of ice cream sold as a dependent variable.
b) Give a brief interpretation of the values a and b.
c) Compute and interpret $R^2$.
d) Compute the standard deviation of errors.
e) Construct a 95% confidence interval for $\beta$.
f) Test at the 1% significance level the null hypothesis that $\beta$ is zero against
the alternative that it is positive.
Answers
1. a) 6.01 to 6.63; b) T.S. = t = 59.792; reject $H_0$; c) T.S. = t = 59.792; reject $H_0$;
d) T.S. = t = 17.219; reject $H_0$;
2. a) 2.35 to 2.65; b) T.S. = z = 39.12; reject $H_0$;
c) T.S. = z = 39.12; reject $H_0$; d) T.S. = z = 11.74; reject $H_0$;
3. a) -15.53 to -7.17; b) T.S. = t = 6.645; reject $H_0$;
4. a) $\hat{y} = 10.4986 + 0.8689x$; b) 0.5559 to 1.1819; c) T.S. = t = 8.323; reject $H_0$;
5. a) 12.34 to 37.92; b) T.S. = t = 4.604; reject $H_0$;
6. a) $\hat{y} = 1.4337 + 0.8916x$; c) $R^2 = 0.94$;
d) $s_e = 0.9285$; e) 0.5708 to 1.2124; f) T.S. = t = 9.356; reject $H_0$;
7. a) $\hat{y} = 2.0680 + 0.0714x$; c) $R^2 = 0.92$; d) $s_e = 0.1537$; e) 0.0511 to
0.0917; f) T.S. = t = 8.602; reject $H_0$.
3.8. Using the regression model for predicting a particular value of y
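$\hat{y}_{n+1} \pm t_{n-2,\alpha/2} \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \; s_e$
where
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ and $\hat{y}_{n+1} = a + b x_{n+1}$.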
Example:
For the data on incomes and food expenditures of seven households, find
a) a 99% prediction interval for the predicted food expenditure for a single
household with a monthly income of $3500;
b) a 99% confidence interval for the expected food expenditure for all
households with a monthly income of $3500.
Solution:
a) The point estimate of the predicted food expenditure for $x = 35$ is given by
$\hat{y} = 1.1414 + 0.2642(35) = 10.3884$
$100(1 - \alpha)\% = 99\%$
$\alpha = 0.01$
$\alpha/2 = 0.005$
$t_{n-2,\alpha/2} = t_{5,0.005} = 4.032$
Using data from the previous chapters
$s_e = 0.9922$; $\bar{x} = 30.2857$; and $SS_{xx} = 801.4286$
Hence, the 99% prediction interval for $y_p$ for $x = 35$ is
$\hat{y}_{n+1} \pm t_{n-2,\alpha/2} \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \; s_e$
$= 10.3884 \pm 4.032 \sqrt{1 + \frac{1}{7} + \frac{(35 - 30.2857)^2}{801.4286}} \; (0.9922)$
$= 10.3884 \pm 4.3284 = 6.0600$ to $14.7168$
Thus, with 99% confidence we can state that the predicted food expenditure
of a household with a monthly income of $3500 is between $606.00 and
$1471.68.
b) Once again, the point estimate of the expected food expenditure for
$x = 35$ is
$\hat{y} = 1.1414 + 0.2642(35) = 10.3884$
Hence, the 99% confidence interval for $E(y_{n+1} \mid x = 35)$ is
$\hat{y}_{n+1} \pm t_{n-2,\alpha/2} \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \; s_e$
$= 10.3884 \pm 4.032 \sqrt{\frac{1}{7} + \frac{(35 - 30.2857)^2}{801.4286}} \; (0.9922)$
$= 10.3884 \pm 1.6523 = 8.7361$ to $12.0407$
Thus, with 99% confidence we can state that the mean food expenditure for
all households with a monthly income of $3500 is between $873.61 and
$1204.07.
As we can observe, the interval in part a), 606.00 to 1471.68, is much wider
than the one for the mean value of y for $x = 35$ calculated in part b),
873.61 to 1204.07. This is always true. The prediction interval for
a single value of y is always wider than the confidence interval for
the mean value of y at the same value of x.
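The two half-widths differ only by the extra 1 under the square root. The Python sketch below recomputes both from the quantities quoted in the example; the function name `interval_half_widths` is hypothetical, and the critical value $t_{5,0.005} = 4.032$ is taken from the text rather than computed.

```python
import math

def interval_half_widths(x_new, n, x_bar, ss_xx, s_e, t_crit):
    """Return (confidence half-width, prediction half-width) at x_new."""
    # Term shared by both intervals: 1/n + (x_new - x_bar)^2 / SS_xx.
    shared = 1.0 / n + (x_new - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s_e * math.sqrt(shared)        # mean value of y
    pi_half = t_crit * s_e * math.sqrt(1.0 + shared)  # single value of y
    return ci_half, pi_half

y_hat = 1.1414 + 0.2642 * 35
ci_half, pi_half = interval_half_widths(
    x_new=35, n=7, x_bar=30.2857, ss_xx=801.4286, s_e=0.9922, t_crit=4.032
)
print(f"prediction interval: {y_hat - pi_half:.4f} to {y_hat + pi_half:.4f}")
print(f"confidence interval: {y_hat - ci_half:.4f} to {y_hat + ci_half:.4f}")
```

Because the shared term is never negative, the prediction half-width always exceeds the confidence half-width, which is the point made in the paragraph above.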
Exercises
1. Construct a 99% confidence interval for the mean value of y and a 99%
prediction interval for the predicted value of y for the following
^
a) y 3.25 80 x for x 15 given se 0.954 ;
_
x 18.52 ; SS xx 144.65 ; and n 10
^ _
b) y 27 7.67 x for x 12 given s e 2.46 ; x 13.43 ;
SS xx 369.77 ; and n 10
2. Refer to Exercise 4 of the previous section. Construct a 90% confidence
interval for the mean monthly salary of secretaries with 10 years of
experience. Construct a 90% prediction interval for the monthly salary of a
randomly selected secretary with 10 years of experience.
3. Refer to Exercise 6 of the previous section. Construct a 95% confidence
interval for the mean number of breakdowns for all machines which are 16 years
old. Determine a 95% prediction interval for $y_p$ for $x = 16$.
4. The following data give information on the lowest cost price (in dollars)
and the average attendance (thousand) for the past year for eight football
teams
Ticket price 3.6 3.3 2.8 2.6 2.7 2.9 2.0 2.6
Attendance 24 21 22 22 18 13 9 6
^
y 13.6 1.2 x
was estimated by least squares for these data. It was also found that
$\bar{x} = 6.0$; $\sum_{i=1}^{25} (x_i - \bar{x})^2 = 130$; $SSE = 80.6$
Chapter 4
4.1. Introduction
Thus, $\alpha$ is the expected value of the dependent variable when every independent
variable takes the value 0. Frequently this interpretation is of no practical
interest and is often meaningless.
The interpretation of the coefficients $\beta_1, \beta_2, \beta_3, \ldots, \beta_k$ is extremely
important. For example, $\beta_1$ is the expected increase in y resulting from a 1-unit
increase in $x_1$ when the values of the other independent variables remain
constant. In general, $\beta_i$ is the expected increase in the dependent variable
resulting from a 1-unit increase in the independent variable $x_i$ when the
values of the other independent variables remain constant.
If model (1) is estimated using sample data, which is usually the case, the
estimated regression model is written as
$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \ldots + b_k x_k$ (2)
In model (2) $a, b_1, b_2, b_3, \ldots, b_k$ are the sample statistics, which are the
point estimators of $\alpha, \beta_1, \beta_2, \beta_3, \ldots, \beta_k$, respectively.
In model (1) y denotes the actual values of the dependent variable. In model
(2), $\hat{y}$ denotes the predicted or estimated values of the dependent variable.
The difference between $y$ and $\hat{y}$ gives the error of prediction.
The method of fitting the multiple regression model is the same as that used
for the simple linear regression model: the method of least squares. That is,
we choose the estimated model
$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \ldots + b_k x_k$
that minimizes
$SSE = \sum (y - \hat{y})^2$.
4.3. Standard assumptions for the multiple regression models
Like the simple linear regression model, the multiple regression model is
also based on certain assumptions.
Consider the multiple regression model
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_k x_k + \varepsilon$
The following assumptions are often made:
Assumption 1: For any given set of values of $x_1, x_2, x_3, \ldots, x_k$, the
random error $\varepsilon$ has a normal probability distribution with mean equal to 0
and variance equal to $\sigma_\varepsilon^2$.
Assumption 2: The errors associated with different sets of values of
independent variables are independent.
Assumption 3: The independent variables are not linearly related. If any of
them are linearly related, then we can eliminate one of the variables by
substitution and so reduce the number of independent variables.
Assumption 4: It is not possible to find a set of numbers $c_0, c_1, c_2, \ldots, c_k$,
not all zero, such that
$c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \ldots + c_k x_k = 0$
The variance of errors (also called the variance of the estimate) for the
multiple regression model
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_k x_k + \varepsilon$
is denoted by $\sigma_\varepsilon^2$. When sample data are used to estimate
multiple regression model (1), the variance of errors, denoted by $s_e^2$, is an
unbiased estimate of $\sigma_\varepsilon^2$. The formula for calculating $s_e^2$ is as follows
$s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - K - 1} = \frac{SSE}{n - K - 1}$
where
n is the sample size
K is the number of independent variables included in the model.
The positive square root of the variance, $s_e$, is also called the standard error
of the estimate.
4.4.2 The coefficient of determination
Error sum of squares: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$
Regression sum of squares: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
The coefficient of determination for a multiple regression model, usually
called the coefficient of multiple determination, is denoted by $R^2$ and is defined as
the proportion of the total sample sum of squares SST that is explained by
the multiple regression model.
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
It tells us how good the multiple regression model is and how well the
independent variables included in the model explain the dependent variable.
The value of the coefficient of determination $R^2$ always lies in the range 0 to
1, that is
$0 \le R^2 \le 1$
$\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - K - 1}$ or
$\bar{R}^2 = 1 - \frac{SSE/(n - K - 1)}{SST/(n - 1)}$
4.4.4 Predictions from the multiple regression models
Exercises
1. The regression model $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$ was fitted to a data
set obtained from 20 runs of an experiment in which two predictors $x_{1i}$ and
$x_{2i}$ were observed along with the response $y_i$. The least squares estimates
were
$a = 4.21$; $b_1 = 11.37$; $b_2 = -0.513$
Predict the response for
a) $x_1 = 8$; $x_2 = 30$
b) $x_1 = 8$; $x_2 = 50$
2. The following model was fitted to a sample of 25 families in order to
explain household milk consumption
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
where
$y_i$ - milk consumption, in liters per week
$x_{1i}$ - weekly income, in hundreds of dollars
$x_{2i}$ - family size
The least squares estimates of the regression parameters were
a 0.30 ; b1 2.32 ; b2 1.41
a) Interpret the estimates b1 and b 2
b) Is it possible to provide a meaningful interpretation of the estimate a ?
3. The following model was fitted to a sample of 20 students using data
obtained at the end of the education year. The aim was to explain students’
weight gains.
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$
where
$y_i$ - weight gained, in kilograms, during the academic year
$x_{1i}$ - average number of meals eaten per week
$x_{2i}$ - average amount of exercise per week (in hours)
$x_{3i}$ - average number of beers consumed per week
The least squares estimates of the regression parameters were
a 12.9 ; b1 4.5 ; b2 6.3 ; b3 3.14
a) Interpret the estimates b1 , b2 and b 3
b) Is it possible to provide a meaningful interpretation of the estimate a ?
4. In the study of exercise 2, where the least squares estimates were based on
25 sets of sample observations, the following data were found
$SST = 160.6$ and $SSR = 80.3$
a) Find and interpret the coefficient of determination.
b) Find the adjusted coefficient of determination.
5. In the study of exercise 3, a sample of 20 observations was used to
calculate the least squares estimates. The total sum of squares and error
sum of squares were found to be
$SST = 82.6$ and $SSE = 49.3$
a) Find and interpret the coefficient of determination.
b) Find the adjusted coefficient of determination.
6. A multiple linear regression was fitted to a data set obtained from 27 runs
of an experiment, in which four predictors $x_1, x_2, x_3$, and $x_4$ were observed
along with the response y. The following results were obtained:
$a = -5.46$; $b_1 = 2.35$; $b_2 = 18.4$; $b_3 = -0.91$; $b_4 = 6.2$;
$SSR = 920.60$; $SSE = 78.92$
a) Predict the response for $x_1 = 14$; $x_2 = 0.6$; $x_3 = 5$; $x_4 = 5.2$
b) Estimate the error standard deviation.
c) What proportion of the variability in y is explained by the fitted regression?
Answers
1. a) 79.78; b) 69.52; 4. a) 0.5; b) 0.45; 5. a) 0.4; b) 0.52; 6. a) 66.17;
b) 1.89; c) 0.92.
Usually the calculations for a multiple regression model are made by using
a statistical software package, such as MINITAB, instead of
applying the formulas manually. In this chapter we will analyze the multiple
regression models using the MINITAB statistical software. The solutions
obtained using other packages can be interpreted in the same way.
Remark:
To run a regression from the MINITAB menu, follow these instructions
1. Select Stat>Regression
2. Select Response column
3. Select Predictors columns
4. Click OK.
Example1:
Suppose that we want to find the effect of driving experience and the number
of driving violations on auto insurance premiums. A random sample of 10
drivers insured with a company and having similar auto insurance policies
was selected. Table 4.1 lists the yearly auto insurance premiums (in dollars)
paid by these drivers, y, their driving experience ($x_1$, in years), and the
number of driving violations ($x_2$) that each of them has committed during the past
five years.
Table 4.1
y x1 x2
74 5 2
50 6 1
97 4 6
57 11 3
99 3 1
35 19 0
40 15 1
49 13 2
101 2 8
42 10 3
Use a computer package to perform a regression analysis using the model
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
and answer the following questions:
a) Write the estimated regression equation;
b) Interpret the meaning of the estimated regression coefficients;
c) What are the values of the variance and standard deviation of errors, the
coefficient of determination, and the adjusted coefficient of determination?
d) What is the predicted auto insurance premium paid per year by a driver
with seven years of experience and four driving violations?
Solution:
Using MINITAB, we first enter the data of y, x1 , x 2 in three different
columns and then use the regression command. The computer executes a
multiple regression analysis. We focus our attention on the principal aspects
of the output as shown in Figure 4.1
Figure 4.1
The regression equation is
Y= 87.9 - 3.39 X1 + 2.33 X2
Predictor Coef St. dev. T P
Constant 87.92 13.96 6.30 0.000
X1 -3.3869 0.9930 -3.41 0.011
X2 2.327 2.264 1.03 0.338
Analysis of Variance
Source DF SS MS F P
Regression 2 4824.5 2412.3 12.72 0.005
Residual Error 7 1327.9 189.7
Total 9 6152.4
Source DF SEQ. SS
X1 1 4624.1
X2 1 200.4
We now proceed to interpret the results in Figure 4.1 and use them to make
further statistical inferences.
a) The equation of the fitted linear regression is
$\hat{y} = 87.9 - 3.39 x_1 + 2.33 x_2$
From this equation,
$a = 87.9$; $b_1 = -3.39$; $b_2 = 2.33$
We can also read the values of these coefficients from the column labeled
Coef in the MINITAB solution of Figure 4.1.
Notice that in this column the coefficients appear with more digits after the
decimal point. With these coefficient values, we can write the estimated
regression equation as
$\hat{y} = 87.92 - 3.3869 x_1 + 2.327 x_2$
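For readers without MINITAB, the coefficients can be reproduced from Table 4.1 by solving the least squares normal equations directly. The Python sketch below does this for the special case of two predictors, using deviations from the means; it is an illustration and not MINITAB's own algorithm, and the function name `fit_two_predictors` is hypothetical.

```python
def fit_two_predictors(y, x1, x2):
    """Least squares estimates (a, b1, b2) for y = a + b1*x1 + b2*x2."""
    n = len(y)
    y_bar, x1_bar, x2_bar = sum(y) / n, sum(x1) / n, sum(x2) / n
    # Sums of squares and cross products of deviations from the means.
    s11 = sum((u - x1_bar) ** 2 for u in x1)
    s22 = sum((v - x2_bar) ** 2 for v in x2)
    s12 = sum((u - x1_bar) * (v - x2_bar) for u, v in zip(x1, x2))
    s1y = sum((u - x1_bar) * (w - y_bar) for u, w in zip(x1, y))
    s2y = sum((v - x2_bar) * (w - y_bar) for v, w in zip(x2, y))
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = y_bar - b1 * x1_bar - b2 * x2_bar
    return a, b1, b2

# Table 4.1: premium (y), driving experience (x1), driving violations (x2).
y = [74, 50, 97, 57, 99, 35, 40, 49, 101, 42]
x1 = [5, 6, 4, 11, 3, 19, 15, 13, 2, 10]
x2 = [2, 1, 6, 3, 1, 0, 1, 2, 8, 3]

a, b1, b2 = fit_two_predictors(y, x1, x2)
print(f"a = {a:.2f}, b1 = {b1:.4f}, b2 = {b2:.3f}")

# Point prediction for seven years of experience and four violations.
prediction = a + b1 * 7 + b2 * 4
print(f"predicted premium = {prediction:.2f}")
```

The approach relies on the determinant being nonzero, which is the requirement that the two predictors are not linearly related.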
b) The value $a = 87.92$ in the estimated regression equation gives the
value of $\hat{y}$ for $x_1 = 0$ and $x_2 = 0$. It means that a driver with no experience
and no driving violations is expected to pay an auto insurance premium of
$87.92 per year. This is the technical interpretation of a.
The value $b_1 = -3.3869$ in the estimated regression model gives the
change in $\hat{y}$ for a one unit change in $x_1$ when $x_2$ is held constant. Thus, we
can state that a driver with one extra year of experience but with the same
number of violations is expected to pay $3.3869 less for the auto insurance
premium per year.
The value $b_2 = 2.327$ in the estimated regression model gives the change in $\hat{y}$
for a one unit change in $x_2$ when $x_1$ is held constant. Thus, we can state that
a driver with one extra driving violation but with the same years of driving
experience is expected to pay $2.327 more per year for the auto insurance
premium.
c) $\sigma_\varepsilon^2$ is estimated by $s_e^2 = \frac{SSE}{n - K - 1} = \frac{1327.9}{7} = 189.7$, so $s_e = 13.77$.
The values of the standard deviation of errors, the coefficient of
determination and the adjusted coefficient of determination are also given in
the MINITAB solution. From Figure 4.1 we obtain
$s = s_e = 13.77$; R-SQ $= R^2 = 78.4\%$; R-SQ(adj) $= \bar{R}^2 = 72.3\%$
The value of $R^2 = 78.4\%$ tells us that the two independent variables included
in our model explain 78.4% of the variation in the dependent variable.
The value of $\bar{R}^2 = 72.3\%$ is the value of the coefficient of determination
adjusted for degrees of freedom. It states that when adjusted for degrees of
freedom, the two independent variables explain 72.3% of the variation in the
dependent variable.
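These summary measures follow directly from the analysis of variance table. The Python sketch below recomputes them from the rounded sums of squares printed in Figure 4.1 (SSR = 4824.5, SSE = 1327.9, SST = 6152.4, n = 10, K = 2). Because the inputs are rounded, the last digit can differ slightly from the MINITAB printout; the function name is hypothetical.

```python
import math

def fit_summary(sse, sst, n, k):
    """Return (s_e, R^2, adjusted R^2) from SSE, SST, sample size n, and k predictors."""
    dof_error = n - k - 1
    s_e = math.sqrt(sse / dof_error)
    r_squared = 1.0 - sse / sst
    adj_r_squared = 1.0 - (sse / dof_error) / (sst / (n - 1))
    return s_e, r_squared, adj_r_squared

s_e, r_squared, adj_r_squared = fit_summary(sse=1327.9, sst=6152.4, n=10, k=2)
print(f"s_e = {s_e:.2f}, R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
```

The adjusted value is always below $R^2$ when K is at least 1 and $R^2 < 1$, since the factor $(n - 1)/(n - K - 1)$ exceeds 1.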
d) To predict the auto insurance premium paid per year by a driver with seven years of
experience and four driving violations, we substitute $x_1 = 7$ and $x_2 = 4$ in the
estimated regression model
$\hat{y} = 87.92 - 3.3869 x_1 + 2.327 x_2 = 87.92 - 3.3869(7) + 2.327(4) = 73.5197$
Note that this value of $\hat{y}$ is a point estimate of the predicted value of y,
which is denoted by $y_p$.
Remark:
At the end of the MINITAB solution in Figure 4.1 there is a portion, reproduced below,
Source DF SEQ. SS
X1 1 4624.1
X2 1 200.4
which we have not used in any of the examples. From Figure 4.1 we have
$SSR = 4824.5$
If we estimate the simple linear regression of y on $x_1$,
$y = \alpha + \beta_1 x_1 + \varepsilon$
the value of SSR will be 4624.1, which is the value in the row of X1 and the
column labeled SEQ. SS. That is, $x_1$ alone accounts for 4624.1 of SST.
Then, if we add $x_2$ to the model above, a further 200.4 of SST is accounted for,
which is the value in the row of X2 and the column labeled SEQ. SS.
The sum of the two numbers in the column of SEQ. SS is
$4624.1 + 200.4 = 4824.5$
which is the value of SSR in Figure 4.1.
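The sequential sums of squares can be checked from Table 4.1 without MINITAB. The Python sketch below is an illustration under the assumption that SEQ. SS is obtained by fitting $x_1$ first and then adding $x_2$, as the remark describes; the helper name `cross` is hypothetical.

```python
# Table 4.1: premium (y), driving experience (x1), driving violations (x2).
y = [74, 50, 97, 57, 99, 35, 40, 49, 101, 42]
x1 = [5, 6, 4, 11, 3, 19, 15, 13, 2, 10]
x2 = [2, 1, 6, 3, 1, 0, 1, 2, 8, 3]

def cross(u, v):
    """Sum of cross products of deviations from the means of u and v."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)

# SSR of the simple regression of y on x1 alone.
ssr_x1_only = s1y ** 2 / s11

# SSR of the regression on x1 and x2 together: b1*S1y + b2*S2y.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
ssr_full = b1 * s1y + b2 * s2y

# Extra variation explained by adding x2 after x1.
seq_x2 = ssr_full - ssr_x1_only
print(f"SEQ SS X1 = {ssr_x1_only:.1f}, SEQ SS X2 = {seq_x2:.1f}, SSR = {ssr_full:.1f}")
```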
Example2:
We are interested in studying the blood pressure y of males in relation to
weight $x_1$ and age $x_2$. A sample of 10 males was selected. The data set is listed
below:
y x1 x2
120 76 60
160 84 45
134 95 37
149 99 46
153 74 49
164 83 70
130 92 38
170 110 54
148 80 28
125 79 19
Use a computer package to perform a regression analysis using the model
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
Solution:
Using MINITAB, we first enter the data of y, x1 , x 2 in three different
columns and then use the regression command. The computer executes a
multiple regression analysis. We focus our attention on the principal aspects
of the output as shown in Table 4.2
Table 4.2
The regression equation is
Y = 81.2 + 0.493 X1 + 0.474 X2
Source DF SS MS F P
Regression 2 805.2 402.6 1.51 0.285
Residual Error 7 1864.9 266.4
Total 9 2670.1
Source DF SEQ. SS
X1 1 353.4
X2 1 451.8
We now proceed to interpret the results in Table 4.2 and use them to make
further statistical inferences.
a) The equation of the fitted linear regression is
$\hat{y} = 81.2 + 0.493 x_1 + 0.474 x_2$
This means that the mean blood pressure increases by 0.493 if weight x1
increases by 1 kilogram and age x 2 remains fixed.
Similarly, a 1-year increase in age with the weight held fixed will increase
the mean blood pressure by 0.474.
b) The estimated regression coefficients and the corresponding estimated
standard errors are
$a = 81.17$, estimated standard error $S.E.(a) = 43.50$
$b_1 = 0.4929$, estimated standard error $S.E.(b_1) = 0.4749$
$b_2 = 0.4741$, estimated standard error $S.E.(b_2) = 0.3641$
Further, the error standard deviation is estimated by $s = 16.32$ with
degrees of freedom $n - (\text{number of variables}) - 1 = 10 - 2 - 1 = 7$.
These results are useful in interval estimation and hypothesis tests about the
regression coefficients.
c) In Table 4.2, the result "R-SQ = 30.2%" or $R^2 = 0.302$ tells us that
30.2% of the variability of y is explained by the fitted multiple regression of
y on $x_1$ and $x_2$. The analysis of variance shows the decomposition of the
total variability $\sum (y - \bar{y})^2 = 2670.1$ into the two components
4.6. Confidence interval for individual coefficients
The 90% confidence interval for $\beta_1$ is
$b_1 - t_{n-K-1,\alpha/2}\, s_{b_1} < \beta_1 < b_1 + t_{n-K-1,\alpha/2}\, s_{b_1}$
$-3.3869 - 1.895 \times 0.9930 < \beta_1 < -3.3869 + 1.895 \times 0.9930$
$-5.269 < \beta_1 < -1.505$
Thus, the 90% confidence interval for $\beta_1$ is $-5.269$ to $-1.505$. That is, we
can state with 90% confidence that for one extra year of driving experience,
the yearly auto insurance premium decreases by an amount between $1.505
and $5.269.
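The interval is the estimate plus or minus the critical value times the standard error. A minimal Python sketch follows, using the Coef and St. dev. values for X1 from Figure 4.1 and the critical value $t_{7,0.05} = 1.895$ quoted above; the function name is hypothetical and the critical value is supplied by the caller from a t table.

```python
def coefficient_interval(b, s_b, t_crit):
    """Confidence interval b -/+ t_crit * s_b for one regression coefficient."""
    margin = t_crit * s_b
    return b - margin, b + margin

lower, upper = coefficient_interval(b=-3.3869, s_b=0.9930, t_crit=1.895)
print(f"90% confidence interval for beta_1: {lower:.3f} to {upper:.3f}")
```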
Exercises
Analysis of Variance
Source DF SS MS F P
Regression 2 849.65 424.83 130.10 0.000
Residual Error 10 32.65 3.27
Total 12 882.30
Source DF SEQ. SS
X1 1 841.25
X2 1 8.40
Using the MINITAB solution, answer the following questions for the
population regression model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
a) Write the estimated regression equation.
b) Write the values of $a$, $b_1$, $b_2$ and explain the meaning of these estimated
regression coefficients.
c) Write the values of the standard deviations of the coefficients $a$, $b_1$, $b_2$.
d) What are the values of the variance and standard deviation of errors, the
coefficient of determination, the adjusted coefficient of determination, SST,
SSR, SSE, MSR, and MSE?
e) What is the predicted value of y for $x_1 = 74$ and $x_2 = 140$?
f) Construct a 99% confidence interval for the coefficient of $x_1$ in the
population regression model.
g) Make a 95% confidence interval for the coefficient of $x_2$ in the population
regression model.
h) Determine a 90% confidence interval for $\alpha$, the constant term in the
population regression model.
4. The following is the MINITAB solution for a regression of y on x1 , x 2
and x 3 .
X3 -0.21648 0.04836 -4.48 0.004
Analysis of Variance
Source DF SS MS F P
Regression 3 422.21 140.74 136.44 0.000
Residual Error 6 6.19 1.03
Total 9 428.40
Source DF SEQ SS
X1 1 401.42
X2 1 0.12
X3 1 20.67
Using the MINITAB solution, answer the following questions for the
population regression model $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
a) Write the estimated regression equation.
b) Write the values of $a$, $b_1$, $b_2$, $b_3$ and explain the meaning of these
estimated regression coefficients.
c) Write the values of the standard deviations of the coefficients $a$, $b_1$, $b_2$,
$b_3$.
d) What are the values of the variance and standard deviation of errors, the
coefficient of determination, the adjusted coefficient of determination, SST,
SSR, SSE, MSR, and MSE?
e) What is the predicted value of y for $x_1 = 33$, $x_2 = 50$ and $x_3 = 60$?
f) Construct a 95% confidence interval for the coefficient of $x_1$ in the
population regression model.
g) Make a 90% confidence interval for the coefficient of $x_2$ in the population
regression model.
h) Determine a 99% confidence interval for the coefficient of $x_3$ in the
population regression model.
5. In a study of revenue generated by national lotteries, the following
regression equation was fitted to a data from 26 countries with lotteries:
y 30.29 0.0354 x1 0.9734 x 2 340 .9524 x3
(0.00652) (0.3210) (225.78)
R 2 0.56%
where
$y$ = dollars of net revenue per capita per year generated by the lottery;
$x_1$ = mean per capita personal income of the country
$x_2$ = number of hotel, motel, and resort rooms per thousand people
$x_3$ = spendable revenue per capita per year generated by legalized gambling
The numbers in parentheses below the coefficient estimates are the
corresponding estimated standard errors.
a) Interpret the estimated coefficient on x1 , x 2 and x 3 .
b) Find and interpret a 90% confidence interval for the coefficient on x 2 , in
the population regression.
c) Find and interpret a 99% confidence interval for the coefficient on x 3 , in
the population regression.
Answers
1. a) 2.167 to 2.473; 2.135 to 2.504; b) 0.48 to 2.34; 0.142 to 2.678;
2. 4.19 to 4.814; 4.12 to 4.88; 3.98 to 5.02;
3. a) $\hat{y} = 12.4 + 0.24 x_1 + 0.036 x_2$;
b) $a = 12.410$; $b_1 = 0.2415$; $b_2 = 0.0362$; c) $s_a = 5.234$; $s_{b_1} = 0.0345$;
$s_{b_2} = 0.024$; d) $s_e^2 = 3.0976$; $s_e = 1.76$; $R^2 = 97.8\%$; $\bar{R}^2 = 96.6\%$;
SST = 882.30; SSR = 849.65; SSE = 32.65; MSR = 424.83; MSE = 3.27;
e) $y_p = 35.2$; f) 0.132 to 0.348; g) -0.017 to 0.089; h) 2.93 to 21.89;
4. a) $\hat{y} = 22.2 + 0.203 x_1 - 0.0499 x_2 - 0.216 x_3$; b) $a = 2.212$; $b_1 = 0.20276$;
$b_2 = -0.04991$; $b_3 = -0.21648$; c) $s_a = 4.602$; $s_{b_1} = 0.06171$; $s_{b_2} = 0.04166$;
$s_{b_3} = 0.04836$; d) $s_e^2 = 1.032$; $s_e = 1.016$; $R^2 = 98.6\%$; $\bar{R}^2 = 97.8\%$;
SST = 428.40; SSR = 422.21; SSE = 6.19; MSR = 140.74; MSE = 1.03;
e) $y_p = 13.444$; f) 0.051 to 0.355; g) -0.132 to 0.031; h) -0.394 to -0.038;
5. b) 0.422 to 1.524; c) -977.42 to 295.52.
4.7. Test of hypothesis about individual coefficients
We can make a test of hypothesis about any of the $\beta_i$ coefficients of the model
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_k x_k + \varepsilon$
using the same procedure that we used to make a test of hypothesis about $\beta$
for a simple regression model in the previous chapter. The only difference is the
degrees of freedom, which are equal to $(n - K - 1)$ for a multiple regression.
In this case the value of the test statistic t for $b_i$ is calculated as
$T.S. = t = \frac{b_i - \beta_i}{s_{b_i}}$
The value of $\beta_i$ is substituted from the null hypothesis.
If the regression errors $\varepsilon_i$ are normally distributed and the standard
regression assumptions hold, then the following hypothesis tests have
significance level $\alpha$
1. To test either null hypothesis
$H_0: \beta_i = \beta_0$ or $H_0: \beta_i \le \beta_0$
against the alternative
$H_1: \beta_i > \beta_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n-K-1,\alpha}$
2. To test either null hypothesis
$H_0: \beta_i = \beta_0$ or $H_0: \beta_i \ge \beta_0$
against the alternative
$H_1: \beta_i < \beta_0$
the decision rule is
Reject $H_0$ if $T.S. < -t_{n-K-1,\alpha}$
3. To test the null hypothesis
$H_0: \beta_i = \beta_0$
against the two sided alternative
$H_1: \beta_i \ne \beta_0$
the decision rule is
Reject $H_0$ if $T.S. > t_{n-K-1,\alpha/2}$ or $T.S. < -t_{n-K-1,\alpha/2}$
Remark: In most cases we are interested in the null hypothesis $H_0: \beta_i = 0$.
Example:
For Example 1 of Section 4.5, using the 1% significance level, test the null
hypothesis that the coefficient of the number of driving violations in the
regression model is zero against the alternative that it is positive. Use the
MINITAB solution given in Figure 4.1.
Solution:
\(x_2\) is the number of driving violations committed during the past five years.
The relevant portion of the solution is reproduced below
Predictor Coef St. dev. T P
Constant 87.92 13.96 6.30 0.000
X1 -3.3869 0.9930 -3.41 0.011
X2 2.327 2.264 1.03 0.338
We are to test the following null and alternative hypotheses
\[H_0: \beta_2 = 0\]
\[H_1: \beta_2 > 0\]
The decision rule is
Reject \(H_0\) if \(T.S. > t_{n-K-1,\alpha}\)
From the solution we obtain \(T.S. = t = 1.03\). It can also be found as
\[T.S. = t = \frac{b_2 - \beta_2}{s_{b_2}} = \frac{2.327 - 0}{2.264} = 1.03\]
\[d.f. = n - K - 1 = 10 - 2 - 1 = 7\]
\[t_{n-K-1,\alpha} = t_{7,0.01} = 2.998\]
Since \(1.03 < 2.998\), we fail to reject the null hypothesis. Consequently, the
data do not provide sufficient evidence at the 1% level that the slope of \(x_2\)
in the regression model is positive. That is, the number of driving violations
is not statistically significant in this model, and we cannot conclude that an
increase (or decrease) in the number of driving violations affects the auto
insurance premium.
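The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `t_statistic` is our own, and the critical value 2.998 is copied from the text rather than computed from a t distribution.

```python
def t_statistic(b, s_b, beta_null=0.0):
    """Test statistic t = (b - beta_null) / s_b for H0: beta = beta_null."""
    return (b - beta_null) / s_b

# Values read from the MINITAB output for X2 (number of driving violations).
b2, s_b2 = 2.327, 2.264
n, K = 10, 2          # number of observations and of independent variables
df = n - K - 1        # degrees of freedom
t_crit = 2.998        # t_{7, 0.01}, taken from the text (t table)

ts = t_statistic(b2, s_b2)
reject_h0 = ts > t_crit   # right-tailed rule: reject if T.S. > t_{n-K-1, alpha}
print(f"d.f. = {df}, T.S. = {ts:.2f}, reject H0: {reject_h0}")
```

For a null value other than zero, the same function can be called with the `beta_null` argument set to that value.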
Remark:
Note that the observed value of the test statistic \(t\) can be read directly
from the MINITAB solution only if the null hypothesis is \(H_0: \beta_2 = 0\).
However, if the null hypothesis is that \(\beta_2\) is equal to a number other than
zero, then the t value obtained from the MINITAB solution is no longer
valid. In this case the observed value of the test statistic is calculated as
\[T.S. = t = \frac{b_2 - \beta_2}{s_{b_2}}\]
4.8. Tests on sets of regression parameters
where \(F_{K,n-K-1,\alpha}\) is the number for which \(P(F_{K,n-K-1} > F_{K,n-K-1,\alpha}) = \alpha\) and
\(F_{K,n-K-1}\) follows an F distribution with numerator degrees of freedom \(K\) and
denominator degrees of freedom \((n - K - 1)\).
Example:
Using the 5% significance level, test the null hypothesis that the coefficients
of all independent variables in Example 4.1 are equal to zero. Use the
MINITAB solution shown in Figure 4.1.
Solution:
The two hypotheses are
\[H_0: \beta_1 = \beta_2 = 0\]
\[H_1: \text{at least one } \beta_i \ne 0\]
The portion of the solution is reproduced below
Analysis of Variance
Source DF SS MS F P
Regression 2 4824.5 2412.3 12.72 0.005
Residual Error 7 1327.9 189.7
Total 9 6152.4
From this portion of the MINITAB solution we obtain
\[MSR = 2412.3; \quad MSE = 189.7\]
and the value of the test statistic is
\[T.S. = F = \frac{MSR}{MSE} = \frac{2412.3}{189.7} = 12.72\]
\[F_{K,n-K-1,\alpha} = F_{2,10-2-1,0.05} = F_{2,7,0.05} = 4.74\]
Because the value of the test statistic, \(F = 12.72\), is greater than 4.74, it
falls in the rejection region. Consequently, we reject the null hypothesis and
conclude that at least one of the two \(\beta\)'s is different from zero.
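As a check on the arithmetic, the F statistic can be recomputed from the sums of squares in the ANOVA table. The sketch below assumes only the figures quoted above; the critical value 4.74 is copied from the text, not computed.

```python
# Sums of squares from the MINITAB analysis of variance table.
ssr, sse = 4824.5, 1327.9
n, K = 10, 2                  # observations and independent variables

msr = ssr / K                 # mean square due to regression
mse = sse / (n - K - 1)       # mean square error
f_stat = msr / mse

f_crit = 4.74                 # F_{2, 7, 0.05}, taken from the text (F table)
reject_h0 = f_stat > f_crit
print(f"MSR = {msr:.1f}, MSE = {mse:.1f}, F = {f_stat:.2f}, reject H0: {reject_h0}")
```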
Exercises
3. The following is the MINITAB solution for a regression of y on x1 , x 2
and x 3 .
Analysis of Variance
Source DF SS MS F P
Regression 3 2909.91 969.97 1198.86 0.000
Residual Error 10 8.09 0.81
Total 13 2918.00
Source DF SEQ SS
X1 1 2901.82
X2 1 7.85
X3 1 0.24
Using the MINITAB solution, answer the following questions for the
population regression model \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\)
a) Write the estimated regression equation.
b) Write the values of a , b1 , b 2 , b 3 and explain the meaning of these
estimated regression coefficients.
c) What are the values of the standard deviation of errors, the coefficient of
determination, the adjusted coefficient of determination, SST, SSR, SSE,
MSR, and MSE?
d) Write the values of the standard deviation, the value of test statistic, and
the p-value for each of the coefficients of a , b1 , b 2 , b 3 .
e) What is the predicted value of \(y\) for \(x_1 = 310\), \(x_2 = 260\) and \(x_3 = 180\)?
f) Construct a 95% confidence interval for the coefficient of x1 in the
population regression model.
g) Make a 99% confidence interval for the coefficient of x 2 in the population
regression model.
h) Make a 98% confidence interval for the coefficient of x 3 in the population
regression model.
i) Determine a 95% confidence interval for \(\alpha\), the constant term in the
population regression model.
j) Using the 5% significance level, test the null hypothesis that the
coefficient of x1 in the population regression model is zero against the
alternative that it is negative.
k) Using the 1% significance level, test the null hypothesis that the
coefficient of \(x_2\) in the population regression model is zero against the
alternative that it is positive.
l) At the 2.5% significance level, test if the coefficient of x 3 in the
population regression model is zero against the alternative that it is negative.
m) Using the 5% significance level, can you conclude that the coefficients of
all independent variables in the population regression model are equal to
zero?
4. A corporation has a large number of restaurants throughout the country.
The research department wanted to find out whether the sales of the restaurants
depend on the size of the population within a certain area surrounding the
restaurants and on the mean income of households in those areas. They collected
information on these variables for 10 restaurants. The following table gives
the monthly sales (in thousands of dollars) of these
restaurants, the population (in thousands) within 10 kilometers of the
restaurants, and the mean monthly income (in hundreds of dollars) of the
households in those areas.
Sales 18 28 16 20 13 29 34 23 19 28
Population 22 16 33 19 47 70 30 45 77 41
Income 39 51 28 32 28 37 42 27 20 18
Using MINITAB (or any other statistical software package), find the
regression of sales on population and income. Using the solution, answer the
following questions.
a) Write the estimated regression equation.
b) Explain the meaning of the estimates of the constant term and regression
coefficients of the population and income.
c) What are the values of the standard deviation of errors, the coefficient of
determination, the adjusted coefficient of determination?
d) What is the value of the total sum of squares (SST)? What portion of SST is
explained by our regression model? What portion of SST is not explained by
our regression model?
e) What is the predicted sales for a restaurant with 52 thousand people living
within 10 km surrounding it and $3600 mean monthly income of households
living in those areas?
f) Construct a 95% confidence interval for the coefficient of income.
g) Using the 5% significance level, test the null hypothesis that the
coefficient of population in regression model is zero against the two-sided
alternative.
h) Using the 1% significance level, can you conclude that the coefficients of
both independent variables in the population regression model are equal to
zero?
Answers
1. T.S. = 11; reject \(H_0\) at virtually any level; 2. T.S. = 3.06; reject \(H_0\) at the 5%
level; 3. a) \(\hat{y} = 51.6 - 0.0599 x_1 + 0.0850 x_2 - 0.0048 x_3\); b) \(a = 51.61\);
\(b_1 = -0.05993\); \(b_2 = 0.08497\); \(b_3 = -0.004773\); c) \(s_e = 0.8995\); \(R^2 = 99.7\%\);
\(\bar{R}^2 = 99.6\%\); SST = 2918.00; SSR = 2909.91; SSE = 8.09; MSR = 969.97;
MSE = 0.81; d) \(s_a = 11.25\); \(t_a = 4.59\); \(p_a = 0.000\); \(s_{b_1} = 0.01526\); \(t_{b_1} = -3.93\);
\(p_{b_1} = 0.003\); \(s_{b_2} = 0.02875\); \(t_{b_2} = 2.96\); \(p_{b_2} = 0.014\); \(s_{b_3} = 0.008707\);
\(t_{b_3} = -0.55\); \(p_{b_3} = 0.596\); e) \(y_p = 54.2724\); f) -0.09393 to -0.02593;
g) -0.00614 to 0.17608; h) -0.028839 to 0.019293; i) 26.545 to 76.675;
j) T.S. = -3.927; reject \(H_0\); k) T.S. = 2.955; reject \(H_0\); l) T.S. = -0.548;
fail to reject \(H_0\); m) T.S. = F = 1198.86; reject \(H_0\)
4.9. Dummy variables in the regression models
\[y = \alpha + \beta_1 x_1 + \varepsilon\]
Now suppose that we introduce a dummy variable, \(x_2\), that takes the values 0 and
1, so that the resulting equation becomes
\[y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\]
Example:
Refer to Example 1. The following table reproduces the data from that example
with an additional column that gives the gender of each of the 10 drivers.
Yearly premium   Driving experience   Number of violations (past 5 years)   Gender
y                x1                   x2
74               5                    2                                     Male
50               6                    1                                     Female
97               4                    6                                     Female
57               11                   3                                     Female
99               3                    1                                     Female
35               19                   0                                     Male
40               15                   1                                     Female
49               13                   2                                     Female
101              2                    8                                     Male
42               10                   3                                     Male
In this case, our population regression model becomes
\[y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 D + \varepsilon\]
Assigning the values 0 and 1 to male and female respectively, we rewrite the
data
Yearly premium   Driving experience   Number of violations (past 5 years)   Gender
y                x1                   x2                                    D
74               5                    2                                     0
50               6                    1                                     1
97               4                    6                                     1
57               11                   3                                     1
99               3                    1                                     1
35               19                   0                                     0
40               15                   1                                     1
49               13                   2                                     1
101              2                    8                                     0
42               10                   3                                     0
Analysis of Variance
Source DF SS MS F P
Regression 3 4853.1 1617.7 7.47 0.019
Residual Error 6 1299.3 216.5
Total 9 6152.4
Source DF Seq SS
X1 1 4624.1
X2 1 200.4
D 1 28.6
\[\hat{y} = 84.54 - 3.318(14) + 2.559(3) + 3.573(0) = 45.765 = \$45.765\]
Thus, a male driver with 14 years of driving experience and 3 driving
violations is expected to pay a yearly auto insurance premium of $45.765.
d) To find the predicted auto insurance premium for a female driver with 14
years of driving experience and 3 driving violations, we substitute \(x_1 = 14\),
\(x_2 = 3\), and \(D = 1\) in the estimated regression model (2),
\[\hat{y} = 84.54 - 3.318(14) + 2.559(3) + 3.573(1) = 49.338 = \$49.338\]
Thus, a female driver with 14 years of driving experience and 3 driving
violations is expected to pay a yearly auto insurance premium of $49.338.
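The two predictions differ only in the value of the dummy variable, so they differ by exactly \(b_3\). A minimal sketch, assuming the coefficients quoted above (the function name `predicted_premium` is ours):

```python
# Estimated coefficients a, b1, b2, b3 as quoted in the text.
A, B1, B2, B3 = 84.54, -3.318, 2.559, 3.573

def predicted_premium(experience, violations, female):
    """Predicted yearly premium; `female` is the dummy D (0 = male, 1 = female)."""
    return A + B1 * experience + B2 * violations + B3 * female

male = predicted_premium(14, 3, 0)
female = predicted_premium(14, 3, 1)
print(f"male: {male:.3f}, female: {female:.3f}, difference: {female - male:.3f}")
```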
e) We are to construct a 99% confidence interval for \(\beta_3\). From the given
information and from the MINITAB solution we obtain
\[n = 10; \quad b_3 = 3.573; \quad s_{b_3} = 9.829\]
\[t_{n-K-1,\alpha/2} = t_{10-3-1,0.005} = t_{6,0.005} = 3.707\]
So, from
\[b_3 - t_{n-K-1,\alpha/2}\, s_{b_3} < \beta_3 < b_3 + t_{n-K-1,\alpha/2}\, s_{b_3}\]
a 99% confidence interval for \(\beta_3\) is
\[3.573 - 3.707 \times 9.829 < \beta_3 < 3.573 + 3.707 \times 9.829\]
\[-32.863 < \beta_3 < 40.009\]
Thus, the 99% confidence interval for \(\beta_3\) is -$32.863 to $40.009. We can
state with 99% confidence that female drivers pay somewhere between
$32.863 less and $40.009 more than male drivers with the same values of
the \(x_1\) and \(x_2\) variables.
f) We are to test whether or not the coefficient \(\beta_3\) of gender in model (1) is
zero. The two hypotheses are
\[H_0: \beta_3 = 0\]
\[H_1: \beta_3 \ne 0\]
The decision rule is
Reject \(H_0\) if \(T.S. = t > t_{n-K-1,\alpha/2}\) or \(T.S. = t < -t_{n-K-1,\alpha/2}\)
From the MINITAB solution we find that the value of the test statistic is
\[T.S. = t = 0.36\]
\[t_{n-K-1,\alpha/2} = t_{10-3-1,0.005} = t_{6,0.005} = 3.707 \quad \text{and} \quad -t_{n-K-1,\alpha/2} = -3.707\]
Since 0.36 lies between -3.707 and 3.707, the value of the test statistic falls
in the nonrejection region. Consequently, we fail to reject the null hypothesis:
the data do not show that \(\beta_3\) in the regression model is different from
zero. That is, the variable gender has no statistically significant effect on
the auto insurance premiums paid by drivers.
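The interval in part e) and the test in part f) use the same three numbers, so both can be checked together. This is a sketch under the assumption that the critical value 3.707 is taken from the t table as in the text; no distribution function is evaluated.

```python
b3, s_b3 = 3.573, 9.829     # estimate and standard error of the gender coefficient
t_crit = 3.707              # t_{6, 0.005}, taken from the text (t table)

# 99% confidence interval: b3 -/+ t * s_b3
margin = t_crit * s_b3
lower, upper = b3 - margin, b3 + margin

# Two-sided test of H0: beta_3 = 0
ts = (b3 - 0.0) / s_b3
reject_h0 = abs(ts) > t_crit

print(f"99% CI: {lower:.3f} to {upper:.3f}")
print(f"T.S. = {ts:.2f}, reject H0: {reject_h0}")
```

The interval contains zero exactly when the two-sided test at the same level fails to reject, which is why the two parts agree.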
Remark:
The number of dummy variables used for qualitative variable in a regression
model is one less than the number of categories for that variable. For
example, we may want to investigate influence of quarters. Because the
variable quarter is a qualitative variable, we will use dummy variables to
represent it in our regression model. Since there are 4 quarters in a year, we
will use 3 dummy variables. Let \(D_1\) be the dummy variable for the first
quarter, \(D_2\) be the dummy variable for the second quarter and \(D_3\) be the
dummy variable for the third quarter. Then
\(D_1 = 1\) for the first quarter, and zero for other quarters
\(D_2 = 1\) for the second quarter, and zero for other quarters
\(D_3 = 1\) for the third quarter, and zero for other quarters
If our regression model contains two independent variables \(x_1\) and \(x_2\),
then we will estimate the regression model as
\[y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 D_1 + \beta_4 D_2 + \beta_5 D_3 + \varepsilon\]
Exercises
b) Test against a two-sided alternative the null hypothesis that the true
coefficient on the dummy variable is zero. Take \(\alpha = 0.05\).
2. The following model was fitted to a sample of 815 sales to explain the
selling prices of homes.
y 1264 48.18 x1 3382 x 2 3219 x3 2005 x 4 _ 2
R 0.86
(0.91) (515) (947) (768)
where
\(y\) = selling price of home, in thousands of dollars
\(x_1\) = square meters of living area
\(x_2\) = size of garage, in square meters
\(x_3\) = dummy variable taking the value 1 if the house has a fireplace,
and 0 otherwise
\(x_4\) = dummy variable taking the value 1 if the house has wood
floors, and 0 otherwise
a) Interpret the estimated coefficient of x 3 .
b) Interpret the estimated coefficient of x 4 .
c) Find a 95% confidence interval for the impact of a fireplace on the selling
price, all other things being equal.
d) Test the null hypothesis that the type of flooring has no impact on the selling
price, against the alternative that, all other things being equal, houses with wood
floors have a higher selling price than houses with other flooring.
3. The following MINITAB solution was obtained for the regression model
\[y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 D + \varepsilon\]
for a sample data set.
S = 0.6183 R-Sq = 99.6% R-Sq(adj) = 99.2%
Analysis of Variance
Source DF SS MS F P
Regression 4 426.49 106.62 278.87 0.000
Residual Error 5 1.91 0.38
Total 9 428.40
Source DF Seq SS
X1 1 401.42
X2 1 0.12
X3 1 20.67
D 1 4.28
Salary 30 22 21 45 36 39 17 22 18 19
Studying 18 16 15 22 20 20 14 16 12 14
Experience 8 7 6 15 14 16 2 4 3 4
Gender F F M M F F M M F M
Using MINITAB (or any other statistical software package), find the
regression of salary on studying, experience, and gender. Then answer the
following questions.
a) Write the estimated regression equation.
b) Explain the meaning of the estimated regression coefficient of the dummy
variables.
c) By estimating the regression model with gender as a dummy variable, you
have actually estimated two regression models-one for males and the other
for females. Write these two regression equations.
d) How much salary is a male worker with 18 years of studying and 7 years
of work experience expected to earn?
e) How much salary is a female worker with 18 years of studying and 7
years of work experience expected to earn?
f) Determine a 95% confidence interval for the coefficient of dummy
variable.
g) Using the 5% significance level, can you conclude that female workers
are paid lower salaries than male workers?
Answers
Chapter 5
5.1. Introduction
Figure 5.1
Block POPULATION (GROUP)
1 2 …. K
x11 x 21 …. x K1
x12 x 22 …. xK 2
…. …. …. ….
…. …. …. …..
x1n1 x 2 n2 …. x Kn K
1) The first step is to calculate the sample means for the K groups of
observations. These sample means will be denoted \(\bar{x}_1, \bar{x}_2, \dots, \bar{x}_K\).
In general
\[\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i}\]
where \(n_i\) denotes the number of observations in the \(i\)th group.
2) The second step is to find the overall mean of all the sample observations,
denoted \(\bar{\bar{x}}\), and defined as
\[\bar{\bar{x}} = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n_i} x_{ij}}{n}\]
where \(n\) denotes the total number of sample observations
\[n = \sum_{i=1}^{K} n_i\]
An equivalent expression for the overall mean is
\[\bar{\bar{x}} = \frac{\sum_{i=1}^{K} n_i \bar{x}_i}{n}\]
3) In the third step, we consider variability within groups. To measure
variability in any group, we calculate the sum of squared deviations of
the observations about their sample mean. Within-group variability will be
denoted by SS. For example, for the first group the sum of squared
deviations of the observations about their sample mean \(\bar{x}_1\) is
\[SS_1 = \sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2\]
For the second group, whose sample mean is \(\bar{x}_2\), we calculate
\[SS_2 = \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2\]
and so on.
4) In the fourth step, we find the total within-groups variability, denoted SSW. That
is
\[SSW = SS_1 + SS_2 + \dots + SS_K\]
or
\[SSW = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2\]
5) Now we need a measure of variability between groups. It is based on the
discrepancies between the individual group means and the overall mean.
The total between-groups sum of squares is denoted SSG and defined as
\[SSG = \sum_{i=1}^{K} n_i (\bar{x}_i - \bar{\bar{x}})^2\]
6) As a last step, we calculate the sum of squared discrepancies of all the
sample observations about their overall mean. This is called the total sum of
squares, denoted SST, expressed as
\[SST = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{\bar{x}})^2\]
It can be shown that the total sum of squares is the sum of the within-groups
and between-groups sums of squares, that is
\[SST = SSW + SSG\]
If \(H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_K\)
is true, each of SSW and SSG can be used as the basis for an estimate of the
common population variance. To obtain these estimates, the sums of squares
must be divided by the corresponding numbers of degrees of freedom.
SSW divided by \((n - K)\) gives an estimate called the within-groups mean
square, denoted MSW, so that
\[MSW = \frac{SSW}{n - K}\]
SSG divided by \((K - 1)\) gives an estimate called the between-groups mean
square, denoted MSG, so that
\[MSG = \frac{SSG}{K - 1}\]
The test of the null hypothesis is based on the ratio of the mean squares
\[F = \frac{MSG}{MSW}\]
If this ratio is close to 1, there would be little cause to doubt the null
hypothesis of equality of population means.
Summary
We define the following sums of squares:
Within groups:
\[SSW = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2\]
Between groups:
\[SSG = \sum_{i=1}^{K} n_i (\bar{x}_i - \bar{\bar{x}})^2\]
Total:
\[SST = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{\bar{x}})^2\]
The decision rule is
Reject \(H_0\) if \(T.S. = F = \frac{MSG}{MSW} > F_{K-1,n-K,\alpha}\)
where \(F_{K-1,n-K,\alpha}\) is the number for which \(P(F_{K-1,n-K} > F_{K-1,n-K,\alpha}) = \alpha\) and
\(F_{K-1,n-K}\) follows an F distribution with numerator degrees of freedom \((K - 1)\)
and denominator degrees of freedom \((n - K)\) (Table 6 of Appendix).
For convenience, these calculations are often recorded in a table called a
one-way analysis of variance table or ANOVA table, shown below
(Table 5.1):
Table 5.1
Source of variation   Sum of squares   Degrees of freedom   Mean squares   F ratio
Between groups        SSG              (K - 1)              MSG            MSG/MSW
Within groups         SSW              (n - K)              MSW
Total                 SST              (n - 1)
Example:
A company buys thousands of light bulbs every year. The company is
considering three brands of light bulbs to choose from. Before the company
decides which light bulbs to buy, it wants to investigate whether the mean life of
the three brands of light bulbs is the same. The research department randomly
selected a few bulbs of each brand and tested them. The table lists the number of
hours (in thousands) that each of the bulbs of each brand lasted before
burning out.
Brand I   Brand II   Brand III
22        18         27
23        23         24
26        22         20
27        21         21
22                   23
At the 5% significance level, test the null hypothesis that the mean life of
bulbs for each of these three brands is the same.
Solution:
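\[\bar{\bar{x}} = \frac{\sum_{i=1}^{K} n_i \bar{x}_i}{n} = \frac{5 \times 24 + 4 \times 21 + 5 \times 23}{14} = \frac{319}{14} = 22.79\]
3) In the first group, the sum of squared deviations is
\[SS_1 = \sum_{j=1}^{5} (x_{1j} - \bar{x}_1)^2 = (22 - 24)^2 + (23 - 24)^2 + (26 - 24)^2 + (27 - 24)^2 + (22 - 24)^2 = 4 + 1 + 4 + 9 + 4 = 22\]
Similarly,
\[SS_2 = \sum_{j=1}^{4} (x_{2j} - \bar{x}_2)^2 = (18 - 21)^2 + (23 - 21)^2 + (22 - 21)^2 + (21 - 21)^2 = 9 + 4 + 1 + 0 = 14\]
and
\[SS_3 = \sum_{j=1}^{5} (x_{3j} - \bar{x}_3)^2 = (27 - 23)^2 + (24 - 23)^2 + (20 - 23)^2 + \dots\]
\[SSG = \sum_{i=1}^{K} n_i (\bar{x}_i - \bar{\bar{x}})^2 = 5 (24 - 22.79)^2 + \dots\]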
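The six steps of the one-way analysis of variance can be sketched in Python and applied to the light bulb data. This is an illustration only: the function name `one_way_anova` and the dictionary keys are our own, and the code uses unrounded means, so its between-groups sum of squares may differ slightly from a hand calculation based on 22.79.

```python
def one_way_anova(groups):
    """Return the sums of squares, mean squares and F ratio for a list of samples."""
    K = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]          # step 1: group means
    grand = sum(sum(g) for g in groups) / n            # step 2: overall mean
    ssw = sum((x - m) ** 2                             # steps 3-4: within groups
              for g, m in zip(groups, means) for x in g)
    ssg = sum(len(g) * (m - grand) ** 2                # step 5: between groups
              for g, m in zip(groups, means))
    sst = sum((x - grand) ** 2 for g in groups for x in g)  # step 6: total
    msg, msw = ssg / (K - 1), ssw / (n - K)
    return {"means": means, "grand": grand, "SSW": ssw, "SSG": ssg,
            "SST": sst, "MSG": msg, "MSW": msw, "F": msg / msw}

bulbs = [[22, 23, 26, 27, 22],    # Brand I
         [18, 23, 22, 21],        # Brand II
         [27, 24, 20, 21, 23]]    # Brand III
result = one_way_anova(bulbs)
for key in ("SSW", "SSG", "SST", "MSG", "MSW", "F"):
    print(f"{key} = {result[key]:.3f}")
```

The resulting F ratio is then compared with \(F_{K-1,n-K,\alpha}\) from the F table, with \(K - 1 = 2\) and \(n - K = 11\) degrees of freedom for these data.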
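Remark 1: Alternative formulas for SSG and SSW are
\[SSG = \left(\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} + \dots\right) - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\]
\[SSW = \sum_{i=1}^{n} x_i^2 - \left(\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} + \dots\right)\]
where
\(T_i\) = the sum of the values in sample \(i\)
\(\sum_{i=1}^{n} x_i\) = the sum of the values in all samples \(= T_1 + T_2 + T_3 + \dots\)
\(\sum_{i=1}^{n} x_i^2\) = the sum of the squares of the values in all samples.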
Example:
Consider the following data obtained for two samples selected from two
populations
Sample I   Sample II
9          4
3          1
7          1
8          6
           8
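\[\sum_{i=1}^{n} x_i = T_1 + T_2 = 27 + 20 = 47\]
\[n_1 = 4; \quad n_2 = 5; \quad n = n_1 + n_2 = 9\]
\[\sum_{i=1}^{n} x_i^2 = 9^2 + 3^2 + 7^2 + 8^2 + 4^2 + 1^2 + 1^2 + 6^2 + 8^2 = 321\]
Substituting all the values in the formulas for SSG and SSW, we obtain
\[SSG = \left(\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2}\right) - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = \left(\frac{27^2}{4} + \frac{20^2}{5}\right) - \frac{47^2}{9} = 16.81\]
\[SSW = \sum_{i=1}^{n} x_i^2 - \left(\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2}\right) = 321 - \left(\frac{27^2}{4} + \frac{20^2}{5}\right) = 58.75\]
Hence, the variance between samples MSG and the variance within samples
MSW are
\[MSG = \frac{SSG}{K - 1} = \frac{16.81}{2 - 1} = 16.81\]
\[MSW = \frac{SSW}{n - K} = \frac{58.75}{9 - 2} = 8.39\]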
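The shortcut formulas are convenient for checking by machine because they need only the sample totals and the sum of squares. A minimal sketch for the two samples above (variable names are ours):

```python
samples = [[9, 3, 7, 8],        # Sample I
           [4, 1, 1, 6, 8]]     # Sample II

K = len(samples)
n = sum(len(s) for s in samples)
totals = [sum(s) for s in samples]                 # T_1, T_2
sum_x = sum(totals)                                # sum of all values
sum_x2 = sum(x * x for s in samples for x in s)    # sum of squared values

# Common term: T_1^2/n_1 + T_2^2/n_2 + ...
t_term = sum(t * t / len(s) for t, s in zip(totals, samples))

ssg = t_term - sum_x ** 2 / n
ssw = sum_x2 - t_term
msg = ssg / (K - 1)
msw = ssw / (n - K)
print(f"SSG = {ssg:.2f}, SSW = {ssw:.2f}, MSG = {msg:.2f}, MSW = {msw:.2f}")
```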
We write an ANOVA table for our example as
Exercises
1. The following ANOVA table, based on information obtained for four
samples selected from four independent populations that are normally
distributed with equal variances, has a few missing values
a) Set out the analysis of variance table for these data.
b) Test at a 1% significance level, the null hypothesis that the means of these
three populations are equal.
4. Consider the following data obtained for two samples selected at random
from two populations that are independent and normally distributed with
equal variances
Sample I Sample II
29 37
31 27
27 36
28 20
25
6. A consumer agency that wanted to compare drying times for paints made
by three companies tested a few samples of paints from each of these
companies. The following table lists the drying times (in minutes) for these
samples of paints
Answers
1. b) T .S. F 5.67 ; reject H 0 ; 2. a) 4; d) T .S. F 11.13; reject H 0 ;
3. a)
5. a)
Source of Sum of Degrees of Mean F ratio
variation squares freedom squares
Between groups 144.00 2 72.00
Within groups 70.00 15 4.67 F 15.43
Total 214.00 17
b) reject H 0 ;
6. a)
Source of Sum of Degrees of Mean F ratio
variation squares freedom squares
Between groups 571.7 2 285.8
Within groups 481.3 17 28.3 F 10.10
Total 1053.0 19
b) reject H 0 .
5.3. The Kruskal-Wallis test
A test of significance level \(\alpha\) is given by the decision rule
Reject \(H_0\) if \(W > \chi^2_{K-1,\alpha}\)
where \(\chi^2_{K-1,\alpha}\) is the number that is exceeded with probability \(\alpha\) by a
\(\chi^2\) random variable with \((K - 1)\) degrees of freedom.
Example:
The following table gives the response time (in minutes) of three fire
companies in a city for certain randomly selected incidents after a fire was
reported.
Perform at the 5% significance level the Kruskal-Wallis test to test the null
hypothesis that the mean response time for each of these fire companies for
all fire incidents are the same.
Solution:
First of all we pool all sample observation together and rank them in
ascending order. The following table illustrates this procedure
The null hypothesis is
\[H_0: \mu_1 = \mu_2 = \mu_3\]
The decision rule is
Reject \(H_0\) if \(W > \chi^2_{K-1,\alpha}\)
The value of the test statistic is
\[W = \frac{12}{n(n+1)} \sum_{i=1}^{K} \frac{R_i^2}{n_i} - 3(n+1)\]
\[= \frac{12}{20 (20 + 1)} \left(\frac{90^2}{7} + \frac{69^2}{6} + \frac{51^2}{7}\right) - 3 (20 + 1) = 66.35 - 63 = 3.35\]
\[\chi^2_{K-1,\alpha} = \chi^2_{2,0.05} = 5.99\]
Since 3.35 is not greater than 5.99, we fail to reject the null hypothesis. The
data do not provide sufficient evidence that the mean response times of these
fire companies for all fire incidents differ.
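The value of W can be checked from the rank sums alone. The sketch below assumes the rank sums and sample sizes used in the calculation above; the function name `kruskal_wallis_w` is ours, and the critical value 5.99 is copied from the text.

```python
def kruskal_wallis_w(rank_sums, sizes):
    """W = 12 / (n (n + 1)) * sum(R_i^2 / n_i) - 3 (n + 1)."""
    n = sum(sizes)
    return 12 / (n * (n + 1)) * sum(r * r / m for r, m in zip(rank_sums, sizes)) - 3 * (n + 1)

rank_sums = [90, 69, 51]     # R_1, R_2, R_3
sizes = [7, 6, 7]            # n_1, n_2, n_3
n = sum(sizes)

# Sanity check: the ranks 1..n must add up to n (n + 1) / 2.
assert sum(rank_sums) == n * (n + 1) // 2

w = kruskal_wallis_w(rank_sums, sizes)
chi2_crit = 5.99             # chi-square_{2, 0.05}, taken from the text
print(f"W = {w:.2f}, reject H0: {w > chi2_crit}")
```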
Exercises
At the 5% significance level perform a Kruskal-Wallis test of the null
hypothesis that the population mean sales levels are identical for three wall
paper colors. Also find the p-value.
2. A study was conducted in which samples were selected independently
from four populations. The sample size from each population was 21. The
data were converted to ranks. The sum of the ranks for the data from each
sample is
\(n_1 = 20\), \(n_2 = 25\), \(n_3 = 35\)
\(R_1 = 1660\), \(R_2 = 1150\), \(R_3 = 1350\)
Based on these data, what can be concluded about the means for the three
populations? Apply the Kruskal-Wallis test at \(\alpha = 0.01\).
4. Given the following data:
Use the Kruskal-Wallis procedure to test the null hypothesis that the mean
values for all four populations are the same. What conclusion should be
reached using a significance level of 0.10? Also find the p-value.
5. Suppose as a part of your job you are responsible for installing emergency
lighting in a series of buildings. Bids have been received from four
manufacturers of battery-operated emergency lights. The costs are about
equal, so the decision will be based on the length of time the lights last
before failing. A sample of five lights from each manufacturer has been
tested, and values (time in hours) recorded for each manufacturer
Using \(\alpha = 0.01\), what conclusion should you reach about the mean length of
time the lights last before failing for the four manufacturers? Explain.
Answers
Our aim is to test the null hypothesis that all group means are equal, and the
null hypothesis that all block means are equal.
To develop these tests we need to set two-way ANOVA table.
BLOCK GROUP
1 2 …… K
1 x11 x 21 ……. x K1
2 x12 x 22 ……. xK 2
. … … ……. …..
. …. …. ….. …
H x1H x2H ……. x KH
Figure 5.2
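\[\bar{x}_{i\cdot} = \frac{\sum_{j=1}^{H} x_{ij}}{H}; \quad (i = 1, 2, 3, \dots, K)\]
2) Find the sample mean for each block. For the mean of the \(j\)th block we use
the notation \(\bar{x}_{\cdot j}\), defined as
\[\bar{x}_{\cdot j} = \frac{\sum_{i=1}^{K} x_{ij}}{K}; \quad (j = 1, 2, 3, \dots, H)\]
3) Find the overall mean of the sample observations. The overall mean,
denoted \(\bar{x}\), is defined as
\[\bar{x} = \frac{\sum_{i=1}^{K} \sum_{j=1}^{H} x_{ij}}{n} = \frac{\sum_{i=1}^{K} \bar{x}_{i\cdot}}{K} = \frac{\sum_{j=1}^{H} \bar{x}_{\cdot j}}{H}\]
4) Find the between-groups sum of squares, denoted SSG, defined as
\[SSG = H \sum_{i=1}^{K} (\bar{x}_{i\cdot} - \bar{x})^2\]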
2) A test of the null hypothesis \(H_0\) that the H population block means are the same is
provided by the decision rule
Reject \(H_0\) if \(\frac{MSB}{MSE} > F_{H-1,(K-1)(H-1),\alpha}\)
where \(F_{\nu_1,\nu_2,\alpha}\) is the number exceeded with probability \(\alpha\) by a random
variable following an F distribution with numerator degrees of freedom
\(\nu_1\) and denominator degrees of freedom \(\nu_2\).
It is very convenient to summarize the calculations in tabular form, called a
two-way analysis of variance table or ANOVA table, shown below
(Table 5.2):
Table 5.2
Source of variation   Sum of squares   Degrees of freedom   Mean squares                    F ratios
Between groups        SSG              (K - 1)              MSG = SSG/(K - 1)               MSG/MSE
Between blocks        SSB              (H - 1)              MSB = SSB/(H - 1)               MSB/MSE
Error                 SSE              (K - 1)(H - 1)       MSE = SSE/((K - 1)(H - 1))
Total                 SST              (n - 1)
Exercise:
Four drivers tested three types of cars for fuel consumptions. The
accompanying table shows fuel consumptions of cars
Block (Drivers) Group (Cars)
A B C
1 22 24 26
2 21 25 22
3 19 20 23
4 18 19 21
a) Set out the two-way analysis of variance table.
b) Test the null hypothesis that the population mean fuel consumption is the
same for all three types of cars. Take \(\alpha = 0.05\).
c) Test the null hypothesis that population values of mean fuel consumption
are the same for each driver. Take \(\alpha = 0.05\).
Solution:
a)
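1) Let us find the sample mean for each group
\[\bar{x}_{i\cdot} = \frac{\sum_{j=1}^{H} x_{ij}}{H} \quad (i = 1, 2, 3)\]
\[\bar{x}_{1\cdot} = \frac{22 + 21 + 19 + 18}{4} = \frac{80}{4} = 20\]
\[\bar{x}_{2\cdot} = \frac{24 + 25 + 20 + 19}{4} = \frac{88}{4} = 22\]
\[\bar{x}_{3\cdot} = \frac{26 + 22 + 23 + 21}{4} = \frac{92}{4} = 23\]
2) Find the sample mean for each block.
\[\bar{x}_{\cdot j} = \frac{\sum_{i=1}^{K} x_{ij}}{K}; \quad (j = 1, 2, 3, 4)\]
\[\bar{x}_{\cdot 1} = \frac{22 + 24 + 26}{3} = \frac{72}{3} = 24; \quad \bar{x}_{\cdot 2} = \frac{21 + 25 + 22}{3} = \frac{68}{3} = 22.67\]
\[\bar{x}_{\cdot 3} = \frac{19 + 20 + 23}{3} = \frac{62}{3} = 20.67; \quad \bar{x}_{\cdot 4} = \frac{18 + 19 + 21}{3} = \frac{58}{3} = 19.33\]
3) Find the overall mean of the sample observations.
\[\bar{x} = \frac{\sum_{i=1}^{K} \sum_{j=1}^{H} x_{ij}}{n} = \frac{\sum_{i=1}^{K} \bar{x}_{i\cdot}}{K} = \frac{\sum_{j=1}^{H} \bar{x}_{\cdot j}}{H}\]
\[\bar{x} = \frac{\sum_{i=1}^{K} \bar{x}_{i\cdot}}{K} = \frac{20 + 22 + 23}{3} = 21.67\]
4) Find the between-groups sum of squares
\[SSG = H \sum_{i=1}^{K} (\bar{x}_{i\cdot} - \bar{x})^2 = 4 \left[(20 - 21.67)^2 + (22 - 21.67)^2 + (23 - 21.67)^2\right] = 18.67\]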
5) Find the between-blocks sum of squares
\[SSB = K \sum_{j=1}^{H} (\bar{x}_{\cdot j} - \bar{x})^2 = 3 \left[(24 - 21.67)^2 + (22.67 - 21.67)^2 + \dots\right]\]
b) We can write the null hypothesis that the population mean fuel
consumption is the same for all three types of cars as
\[H_0: \mu_1 = \mu_2 = \mu_3\]
The decision rule is
Reject \(H_0\) if \(\frac{MSG}{MSE} > F_{K-1,(K-1)(H-1),\alpha}\)
\[F_{K-1,(K-1)(H-1),\alpha} = F_{2,6,0.05} = 5.14\]
Since the value of the test statistic, 4.97, is not greater than 5.14, we fail to
reject the null hypothesis. The data do not provide sufficient evidence that the
mean fuel consumption differs among the three types of cars.
c) We write the null hypothesis of equality of the population values of mean
fuel consumption for all four drivers as
\[H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\]
The decision rule is
Reject \(H_0\) if \(\frac{MSB}{MSE} > F_{H-1,(K-1)(H-1),\alpha}\)
\[F_{H-1,(K-1)(H-1),\alpha} = F_{3,6,0.05} = 4.76\]
Since the value of the test statistic, 6.86, is greater than 4.76, we reject the null
hypothesis. Therefore, we conclude that the mean fuel consumption
is not the same for all drivers. In other words, the fuel consumption
of a car depends on the driver's habits.
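The two-way table for this exercise can be recomputed with a short script. This is a sketch, not the book's worked solution: it assumes the standard decomposition SST = SSG + SSB + SSE to obtain the error sum of squares, and it works with unrounded means. With exact arithmetic the F ratios come out at about 4.94 and 6.82; the slightly different values 4.97 and 6.86 quoted above presumably reflect rounded intermediate results, and the conclusions are the same.

```python
# Rows are blocks (drivers 1-4); columns are groups (cars A, B, C).
data = [[22, 24, 26],
        [21, 25, 22],
        [19, 20, 23],
        [18, 19, 21]]

H, K = len(data), len(data[0])
n = H * K
grand = sum(sum(row) for row in data) / n
group_means = [sum(row[i] for row in data) / H for i in range(K)]
block_means = [sum(row) / K for row in data]

ssg = H * sum((m - grand) ** 2 for m in group_means)    # between groups
ssb = K * sum((m - grand) ** 2 for m in block_means)    # between blocks
sst = sum((x - grand) ** 2 for row in data for x in row)
sse = sst - ssg - ssb    # assumed identity: SST = SSG + SSB + SSE

msg = ssg / (K - 1)
msb = ssb / (H - 1)
mse = sse / ((K - 1) * (H - 1))
f_groups, f_blocks = msg / mse, msb / mse

print(f"SSG = {ssg:.2f}, SSB = {ssb:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}")
print(f"MSG/MSE = {f_groups:.2f} (compare with F_2,6,0.05 = 5.14)")
print(f"MSB/MSE = {f_blocks:.2f} (compare with F_3,6,0.05 = 4.76)")
```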
Remark1:
To use the MINITAB menu, follow these instructions:
1. Select Stat>ANOVA>Two-way
2. Enter Response variable
3. Enter row factor
4. Enter column factor
5. Click OK.
Remark2:
For the example above, the data are entered in the MINITAB worksheet as shown below
C1 C2 C3
Driver Car Fuel consumption
1 1 22
1 2 24
1 3 26
2 1 21
2 2 25
2 3 22
3 1 19
3 2 20
3 3 23
4 1 18
4 2 19
4 3 21
In this case the row factor is “Car” and the column factor is “driver”
Exercises
2. Three analysts were asked to predict earnings growth over the coming
year for four companies producing cars. Their forecasts (in percentage
increase in earnings) are given below
House Agents
A B C
I 200 210 220
II 190 192 196
III 180 195 205
IV 160 182 194
V 170 171 185
a) Set out analysis of variance table.
b) Test at 5% significance level the null hypothesis that population mean
valuations are the same for the three real estate agents.
Answers
1. b) 8; c) 15; d) \(\frac{MSG}{MSE} = 44.06\); reject \(H_0\); e) \(\frac{MSB}{MSE} = 19.63\); reject \(H_0\);
2. a)
Source of Sum of Degrees of Mean F
variation squares freedom squares ratios
Between car comp. 54.00 3 18.00 10.29
Within analysts 22.17 2 11.08 6.33
Error 10.50 6 1.75
Total 86.67 11
b) T.S. = 10.29, reject \(H_0\); c) fail to reject \(H_0\);
3. a)
Chapter 6
6.1. Introduction
6.2. Variation
can and must be detected, and corrective actions must be taken to remove
them from the process. Not taking actions will increase variation and lower
the quality.
Definition:
A production process is called stable (in-control) if all assignable causes are
removed; thus, variation results only from common causes.
Figure 6.1 illustrates the general format of a process control chart. The upper
and lower control limits define the normal operating region for the process.
[Figure 6.1. General format of a process control chart, with the average (center line) and the lower control limit (LCL) marked.]
The horizontal axis reflects the passage of time, or order of production. The
vertical axis corresponds to the variable of interest. There are a number of
different types of process control charts. We will consider the most
commonly used process control charts:
\(\bar{x}\) chart
s chart
p chart
c chart
6.3.1 Control charts for means and standard deviations
and hence that an unbiased estimate of the process standard deviation is
given by
\[\hat{\sigma} = \bar{s}/c_4\]
The value of the control chart factor \(c_4\) can be found in Table 6.1. Table 6.1
lists values of \(c_4\) corresponding to sample sizes from two to ten. It also
contains factors for other control charts that will be discussed throughout this
chapter.
Table 6.1
n     c4       A3      B3      B4
2     0.7979   2.659   0       3.267
3     0.8862   1.954   0       2.568
4     0.9213   1.628   0       2.266
5     0.9400   1.427   0       2.089
6     0.9515   1.287   0.030   1.970
7     0.9594   1.182   0.118   1.882
8     0.9650   1.099   0.185   1.815
9     0.9693   1.032   0.239   1.761
10    0.9727   0.975   0.284   1.716
To determine control limits for \(\bar{x}\) charts, we assume that the process has
been operating at a constant level of performance over the whole observation
period, and that all sample observations have been drawn from the
same normal distribution.
The sampling distribution is centered on the overall mean, and the value of
the overall mean determines the central line, called the center line. Then, if
three-standard error limits are to be used, the control limits are
\[\bar{\bar{x}} \pm 3 \hat{\sigma} / \sqrt{n} = \bar{\bar{x}} \pm 3 \bar{s} / (c_4 \sqrt{n}) = \bar{\bar{x}} \pm A_3 \bar{s}\]
where
\[A_3 = 3 / (c_4 \sqrt{n})\]
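The factors in Table 6.1 can be reproduced numerically. The sketch below relies on the standard closed form of \(c_4\) for normal samples in terms of the gamma function, which is an assumption brought in from outside this text; \(A_3\) is then obtained from the relation given above. The function names are ours.

```python
import math

def c4(n):
    """Control chart factor c4 for sample size n (standard gamma-function form,
    assumed here; it is not derived in the text)."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def a3(n):
    """A3 = 3 / (c4 * sqrt(n)), as defined in the text."""
    return 3.0 / (c4(n) * math.sqrt(n))

for n in range(2, 11):
    print(f"n = {n:2d}  c4 = {c4(n):.4f}  A3 = {a3(n):.3f}")
```

Dividing an average sample standard deviation \(\bar{s}\) by `c4(n)` gives the unbiased estimate \(\hat{\sigma}\) described above.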
Control chart for means (\(\bar{x}\) chart)
The \(\bar{x}\) chart is a time plot of the sequence of sample means.
The center line is
\[CL_{\bar{x}} = \bar{\bar{x}}\]
In addition, there are three-standard error control limits.
The lower control limit is
\[LCL_{\bar{x}} = \bar{\bar{x}} - A_3 \bar{s}\]
The upper control limit is
\[UCL_{\bar{x}} = \bar{\bar{x}} + A_3 \bar{s}\]
where the values of \(A_3\) are given in Table 6.1.
Example:
The accompanying table shows sample means and sample standard
deviations for a sequence of 10 samples of seven observations on a quality
characteristic of a product
Sample \(\bar{x}\) s
1 145.2 2.3
2 139.2 3.1
3 146.3 2.1
4 138.2 1.9
5 141.2 2.4
6 144.3 2.2
7 140.1 3.1
8 139.9 2.3
9 145.5 2.7
10 143.3 2.8
a) Find the center line and lower and upper control limits for an \(\bar{x}\) chart.
b) Draw the \(\bar{x}\) chart.
Solution:
a) First of all, we find the overall mean and the average of the sample standard
deviations.
[Figure 6.2. Control chart for \(\bar{x}\): the sample means of samples 1 to 10 plotted against CL = 142.32, LCL = 139.38 and UCL = 145.26.]
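The limits shown in Figure 6.2 can be reproduced from the table of sample means and standard deviations. A minimal sketch, assuming \(A_3 = 1.182\) from Table 6.1 for samples of seven observations (variable names are ours):

```python
means = [145.2, 139.2, 146.3, 138.2, 141.2, 144.3, 140.1, 139.9, 145.5, 143.3]
sds = [2.3, 3.1, 2.1, 1.9, 2.4, 2.2, 3.1, 2.3, 2.7, 2.8]
A3 = 1.182                       # Table 6.1, n = 7

overall_mean = sum(means) / len(means)     # center line
s_bar = sum(sds) / len(sds)                # average sample standard deviation
lcl = overall_mean - A3 * s_bar
ucl = overall_mean + A3 * s_bar

# Samples (numbered from 1) whose means fall outside the control limits.
outside = [i + 1 for i, m in enumerate(means) if m < lcl or m > ucl]

print(f"CL = {overall_mean:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print("samples outside the limits:", outside)
```

Sample means beyond the limits are the points that would call for investigation of assignable causes.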
Control chart for standard deviations (s chart)
The s chart is a time plot of the sequence of sample standard deviations.
The center line is
\[CL_s = \bar{s}\]
In addition, there are three-standard error control limits.
The lower control limit is
\[LCL_s = B_3 \bar{s}\]
The upper control limit is
\[UCL_s = B_4 \bar{s}\]
where the values of \(B_3\) and \(B_4\) are given in Table 6.1.
Example: Refer to the data of example above.
a) Find the center line and lower and upper control limits for an s chart.
b) Draw the s chart and discuss its features
Solution:
a) \[\bar{s} = \frac{2.3 + 3.1 + \dots + 2.8}{10} = 2.49\]
The sample size is seven. So from Table 6.1 we obtain
\(B_3 = 0.118\) and \(B_4 = 1.882\)
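The s chart limits follow directly from these two factors. A short sketch using the sample standard deviations of the example (variable names are ours):

```python
sds = [2.3, 3.1, 2.1, 1.9, 2.4, 2.2, 3.1, 2.3, 2.7, 2.8]
B3, B4 = 0.118, 1.882            # factors for n = 7

s_bar = sum(sds) / len(sds)      # center line
lcl_s = B3 * s_bar
ucl_s = B4 * s_bar

# Samples (numbered from 1) whose standard deviations fall outside the limits.
outside = [i + 1 for i, s in enumerate(sds) if s < lcl_s or s > ucl_s]

print(f"CL = {s_bar:.2f}, LCL = {lcl_s:.2f}, UCL = {ucl_s:.2f}")
print("samples outside the limits:", outside)
```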
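[Figure: control chart for s, with the sample standard deviations of samples 1 to 10 plotted against CL = 2.49, LCL = 0.29 and UCL = 4.69 (\(B_4 \bar{s} = 1.882 \times 2.49\)).]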
Exercises
1. Data were collected on a quantitative measure for a sequence of thirty
samples, each of 6 observations. The following results were found
\[\bar{\bar{x}} = 42.3; \quad \bar{s} = 4.2\]
a) Find the center line and lower and upper control limits for an \(\bar{x}\) chart.
b) Find the center line and lower and upper control limits for an s chart.
2. Weights of samples of canned fruit were measured. Results were available
for a sequence of thirty samples, each of seven observations. The overall
mean of the sample observations was 192.6 grams, and the average sample
standard deviation was 5.42.
a) Use an unbiased estimator to find an estimate of the process standard
deviation.
b) Find the center line and lower and upper control limits for an \(\bar{x}\) chart.
c) Find the center line and lower and upper control limits for an s chart.
3. The accompanying table shows sample means and sample standard
deviations for a sequence of 14 samples of eight observations on a quality
characteristic of a product
Sample \(\bar{x}\) s
1 146.4 4.37
2 152.8 6.79
3 150.6 3.17
4 149.2 4.71
5 150.6 4.98
6 150.4 6.28
7 151.1 6.20
8 152.9 6.97
9 147.2 4.28
10 154.3 7.29
11 151.8 3.1
12 149.9 5.31
13 146.7 4.73
14 152.1 6.12
a) Find the overall mean of the sample observations.
b) Find the average sample standard deviation.
c) Use an unbiased estimator to find an estimate of the process standard
deviation.
d) Find the center line and lower and upper control limits for an $\bar{x}$ chart.
e) Draw the $\bar{x}$ chart and discuss its features.
f) Find the center line and lower and upper control limits for an s chart.
g) Draw the s chart and discuss its features.
4. Ten samples, each consisting of five automobile batteries, are tested for
strength. The means and sample standard deviations are given here.
Answers
1. a) $CL_{\bar{x}} = 42.3$; $LCL_{\bar{x}} = 36.89$; $UCL_{\bar{x}} = 47.71$; b) $CL_s = 4.2$; $LCL_s = 0.13$; $UCL_s = 8.27$;
2. a) $\hat{\sigma} = 5.65$; b) $CL_{\bar{x}} = 192.6$; $LCL_{\bar{x}} = 186.20$; $UCL_{\bar{x}} = 199.00$; c) $CL_s = 5.42$; $LCL_s = 0.64$; $UCL_s = 10.19$;
3. a) 150.43; b) 5.307; c) $\hat{\sigma} = 5.50$; d) $CL_{\bar{x}} = 150.43$; $LCL_{\bar{x}} = 144.60$; $UCL_{\bar{x}} = 156.26$; f) $CL_s = 5.307$; $LCL_s = 0.98$; $UCL_s = 9.63$;
4. a) 12.00; b) 1.57; c) 1.67; d) $CL_{\bar{x}} = 12.00$; $LCL_{\bar{x}} = 9.76$; $UCL_{\bar{x}} = 14.24$; f) $CL_s = 1.57$; $LCL_s = 0.00$; $UCL_s = 3.28$.
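$$\bar{p} = \frac{\sum_{i=1}^{K} \hat{p}_i}{K}$$
or, when the sample sizes $n_i$ differ,
$$\bar{p} = \frac{\sum_{i=1}^{K} n_i \hat{p}_i}{\sum_{i=1}^{K} n_i}$$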
We know that the individual sample proportions $\hat{p}_i$ have a sampling
distribution with mean estimated by $\bar{p}$ and standard deviation
(standard error) given by
$$\hat{\sigma}_p = \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$
As with the $\bar{x}$ chart and the s chart, three-standard-error limits are
used for the p chart.
Control chart for proportions (p chart)
The p chart is a time plot of the sequence of sample proportions of
nonconforming items.
The central line is
$$CL_p = \bar{p}$$
The lower control limit is
$$LCL_p = \bar{p} - 3\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$
The upper control limit is
$$UCL_p = \bar{p} + 3\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$
Remark 1:
If the computed lower control limit is negative, which is impossible for a
proportion, we set the lower control limit at zero.
Remark 2:
The interpretation of the p chart is similar to that of the $\bar{x}$ chart
and the s chart.
Remark 3:
To produce a p chart with the MINITAB menus, follow these steps:
1. Select Stat>Control charts>Select P
2. Enter variable location (for example, C1)
3. Enter the subgroup size (200, 300, etc.). If the sample sizes vary, enter
the column that holds the corresponding sample sizes (for example, C2)
4. Click OK.
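For readers without MINITAB, the p chart limits and the zero floor from Remark 1 can be sketched in a few lines of Python. The data are the twelve sample proportions from Exercise 5 of this section (samples of 400 observations); the function name `p_chart_limits` is an illustrative assumption, and equal sample sizes are assumed.

```python
import math


def p_chart_limits(proportions, n):
    """Return (LCL, CL, UCL) for a p chart with equal sample sizes n."""
    p_bar = sum(proportions) / len(proportions)
    std_err = math.sqrt(p_bar * (1 - p_bar) / n)
    lcl = max(0.0, p_bar - 3 * std_err)  # a negative limit is set to zero
    ucl = p_bar + 3 * std_err
    return lcl, p_bar, ucl


# Sample proportions of nonconforming items from Exercise 5.
p_hat = [0.061, 0.060, 0.043, 0.051, 0.037, 0.042,
         0.068, 0.036, 0.064, 0.056, 0.046, 0.039]
lcl, cl, ucl = p_chart_limits(p_hat, n=400)

# Sample numbers (1-based) that fall outside the control limits.
out_of_control = [i + 1 for i, p in enumerate(p_hat) if not lcl <= p <= ucl]
print(f"LCL = {lcl:.3f}, CL = {cl:.3f}, UCL = {ucl:.3f}", out_of_control)
```

The rounded limits agree with the values listed in the answers for Exercise 5.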
Exercises
a) Find the center line and upper and lower control limits for p-chart.
b) Draw the p-chart and discuss its features.
5. Proportions of nonconforming items in a sequence of 12 samples, each
400 observations, are given below
Sample $\hat{p}$ Sample $\hat{p}$
1 0.061 7 0.068
2 0.060 8 0.036
3 0.043 9 0.064
4 0.051 10 0.056
5 0.037 11 0.046
6 0.042 12 0.039
a) Find the center line and upper and lower control limits for p-chart.
b) Draw the p-chart and discuss its features.
Answers
1. $CL_p = 0.064$; $LCL_p = 0.022$; $UCL_p = 0.106$;
2. $CL_p = 0.090$; $LCL_p = 0.004$; $UCL_p = 0.176$;
3. a) $CL_p = 0.098$; $LCL_p = 0.035$; $UCL_p = 0.161$;
4. a) $CL_p = 0.108$; $LCL_p = 0.015$; $UCL_p = 0.202$;
5. a) $CL_p = 0.050$; $LCL_p = 0.017$; $UCL_p = 0.083$.
6.6. Control charts for number of occurrences: c-chart
The p-chart just discussed is used when you select a sample of items
and determine the number of sample items that possess a specific
attribute of interest. Each item either has or does not have that attribute. In
practice we often meet situations that involve attribute data but differ
from the p-chart setting: each sampling unit can have one or more of the
attributes of interest. The count of such attributes, called the number of
occurrences, records, for example, the number of imperfections per item
over time. The control chart for this count is called a c-chart.
As with the other control charts studied in this chapter, some general
notation is needed for control charts for the number of occurrences.
Assume that a sequence of K items is inspected over time, and for
each item the number of occurrences of some event, such as imperfections,
is recorded. These numbers of occurrences are denoted $c_i$ for $i = 1, 2, \ldots, K$.
The sample mean number of occurrences is
$$\bar{c} = \frac{\sum_{i=1}^{K} c_i}{K}$$
A 3-sigma (3 standard deviation) control chart for the number of occurrences
(c-chart) can be constructed in the usual way:
The c chart is a time plot of the number of occurrences over time.
The central line is
$$CL_c = \bar{c}$$
The lower control limit is
$$LCL_c = \bar{c} - 3\sqrt{\bar{c}} \quad \text{if } \bar{c} > 9$$
$$LCL_c = 0 \quad \text{if } \bar{c} \le 9$$
The upper control limit is
$$UCL_c = \bar{c} + 3\sqrt{\bar{c}}$$
Example:
Handheld calculators are manufactured and checked for defects. If a
calculator is not defective, it is packaged and shipped to a retail store. Any
defective calculators are repaired before they are shipped. Twelve of the
defective calculators are checked for the number of defects per calculator.
The numbers of defects per calculator are:
6, 3, 2, 5, 6, 7, 4, 3, 7, 8, 9 and 5
a) Find the central line and lower and upper limits for c-chart.
b) Draw the c-chart and discuss its features.
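Solution:
a) First, let us find the mean number of defects per calculator, $\bar{c}$:
$$\bar{c} = \frac{\sum_{i=1}^{K} c_i}{K} = \frac{6+3+2+5+6+7+4+3+7+8+9+5}{12} = \frac{65}{12} = 5.42$$
$$CL_c = \bar{c} = 5.42$$
The lower and upper control limits are
$$LCL_c = 0 \quad \text{since } \bar{c} < 9$$
$$UCL_c = \bar{c} + 3\sqrt{\bar{c}} = 5.42 + 3\sqrt{5.42} = 12.40$$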
b) Figure 6.4 shows the c-chart.
[Figure 6.4. Control chart for c: calculators 1 to 12 plotted against CL = 5.42, LCL = 0.00, and UCL = 12.40.]
Since all points fall within the upper and lower control limits, the process is
in control. That is, among the defective calculators, the number of defects per
calculator is not excessive.
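The c-chart calculation for this example can be sketched as follows. The function name `c_chart_limits` is an illustrative assumption; taking the larger of zero and $\bar{c} - 3\sqrt{\bar{c}}$ gives the same lower limit as the two-case rule, because that expression is negative exactly when $\bar{c} < 9$.

```python
import math


def c_chart_limits(counts):
    """Return (LCL, CL, UCL) for a c chart of occurrence counts."""
    c_bar = sum(counts) / len(counts)
    spread = 3 * math.sqrt(c_bar)
    lcl = max(0.0, c_bar - spread)  # zero whenever c_bar is at most 9
    ucl = c_bar + spread
    return lcl, c_bar, ucl


# Numbers of defects per calculator from the example.
defects = [6, 3, 2, 5, 6, 7, 4, 3, 7, 8, 9, 5]
lcl, cl, ucl = c_chart_limits(defects)

# The process is judged in control when every count lies within the limits.
in_control = all(lcl <= c <= ucl for c in defects)
print(f"LCL = {lcl:.2f}, CL = {cl:.2f}, UCL = {ucl:.2f}, in control: {in_control}")
```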
Remark:
To produce a c chart with the MINITAB menus, follow these steps:
1. Select Stat>Control charts>Select C
2. Enter variable location (C1; C2; etc.)
3. Enter subgroup size (200, 300 etc.)
4. Click OK.
Exercises
1. Sixty minute cassette tapes are checked for defects. The number of defects
in each of 8 tapes is shown below
Tape 1 2 3 4 5 6 7 8
Number of defects 1 2 1 1 1 3 6 2
a) Find the central line and lower and upper limits for c-chart.
b) Draw the c-chart and discuss its features.
Sample 1 2 3 4 5 6 7 8 9 10
Number of defects 6 12 9 8 8 6 12 10 11 8
a) Find the central line and lower and upper limits for c-chart.
b) Draw the c-chart and discuss its features.
3. A reader carefully read a local newspaper for 15 weeks. For each
Sunday edition he counted the number of typographical and spelling
errors. The results are shown below
Week Errors Week Errors
1 13 9 12
2 15 10 13
3 14 11 21
4 17 12 9
5 12 13 17
6 20 14 19
7 7 15 22
8 14
Answers
1. a) $CL_c = 2.125$; $LCL_c = 0$; $UCL_c = 6.498$;
2. a) $CL_c = 9.00$; $LCL_c = 0$; $UCL_c = 18$;
3. a) 15; b) $CL_c = 15$; $LCL_c = 3.381$; $UCL_c = 26.62$;
4. a) 17.94; b) $CL_c = 17.94$; $LCL_c = 5.232$; $UCL_c = 30.64$.
Chapter 7
When analyzing time series data, a decision maker often compares a
value measured at one point in time with values measured at
other points in time. For example, a student may wish to compare
the price of a book in 2005 with its price in previous years. A common
procedure for making such relative comparisons is to choose a base
period against which all other data values are compared. The resulting
measure is called a price index for a single item. To form a price
index, one time period is chosen as the base, and the prices for all other
periods are expressed as a percentage of the base period price. If we
denote the price in the base period by $p_0$ and the price in another
period by $p_1$, then the price index for the other period is
$$100\,\frac{p_1}{p_0}$$
For example, the price of a house was \$73,000 in 2001 and \$121,000 in
2004. If we take 2001 as the base period, then $p_0 = 73{,}000$ and
$p_1 = 121{,}000$ (in dollars). The price index is
$$100\,\frac{p_1}{p_0} = 100 \cdot \frac{121{,}000}{73{,}000} = 165.75\%$$
This means that the price of the house increased by 65.75% in 2004
compared with 2001.
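The house price calculation can be reproduced with a one-line function. This is a minimal sketch; the name `price_index` is an assumption made for illustration.

```python
def price_index(p1, p0):
    """Price index of a single item: current price as a percentage of the base price."""
    return 100 * p1 / p0


# House prices from the example: base period 2001, comparison period 2004.
index_2004 = price_index(p1=121_000, p0=73_000)
print(f"Index: {index_2004:.2f}%, increase: {index_2004 - 100:.2f}%")
```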
While a price index for a single item can be used to identify price
changes of that item, we are often interested in the general price
change for a group of items taken as a whole. An unweighted
aggregate price index can be developed by simply summing the unit
prices in the period of interest and dividing this sum by the sum of the
unit prices in the base period. Suppose that we have K items.
Let
$p_{0i}$ denote the price of the i-th item in the base period;
$p_{1i}$ denote the price of the i-th item in period t.
The unweighted aggregate price index in period t, denoted $I_t$, is given
by
$$I_t = 100\,\frac{\sum_{i=1}^{K} p_{1i}}{\sum_{i=1}^{K} p_{0i}}$$
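A minimal sketch of the unweighted aggregate index follows. The prices used here are hypothetical values chosen only to illustrate the formula; they are not data from the text, and the function name `unweighted_aggregate_index` is likewise an assumption.

```python
def unweighted_aggregate_index(base_prices, current_prices):
    """100 times the sum of current prices over the sum of base-period prices."""
    return 100 * sum(current_prices) / sum(base_prices)


# Hypothetical unit prices of three items (illustration only).
base_prices = [12.0, 15.0, 23.0]     # base period
current_prices = [15.0, 18.0, 27.0]  # period t
index_t = unweighted_aggregate_index(base_prices, current_prices)
print(f"I_t = {index_t:.2f}%")
```

With these made-up prices the sums are 60 and 50, so the group of items is 20% more expensive than in the base period.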
Example:
The following table lists prices of three different types of cars in
different years
For example, 116.67% indicates that, as a group, prices for the three
types of cars in 2002 have increased by 16.67% since 2000.
The philosophy behind the weighted aggregate index is that each item
in the group should be weighted according to its importance. In most
cases the quantity of usage provides the best measure of importance.
Suppose that we have a group of K items. Let
$q_{0i}$ be the quantity of the i-th item in the base period;
$p_{0i}$ be the price of the i-th item in the base period;
$p_{1i}$ be the price of the i-th item in period t.
The Laspeyres price index for period t is given by
$$I_p = 100\,\frac{\sum_{i=1}^{K} q_{0i}\, p_{1i}}{\sum_{i=1}^{K} q_{0i}\, p_{0i}}$$
Example:
A company sells beer, wine, and soft drinks. Prices and quantities are
shown below
Item        Quantity (bottles)   Unit price in 2000 ($)   Unit price in 2004 ($)
Beer        35 000               5.10                     5.70
Wine        5 000                35.00                    37.00
Soft drink  50 000               2.80                     3.50
Compute a weighted aggregate price index for the company sales in
2004, with 2000 as the base period.
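Solution:
From the information given,
$q_{01} = 35000$; $q_{02} = 5000$; $q_{03} = 50000$
$p_{01} = 5.1$; $p_{02} = 35.0$; $p_{03} = 2.80$
$p_{11} = 5.7$; $p_{12} = 37.0$; $p_{13} = 3.5$
$$I_p = 100 \cdot \frac{5.7 \cdot 35000 + 37.0 \cdot 5000 + 3.5 \cdot 50000}{5.1 \cdot 35000 + 35.0 \cdot 5000 + 2.80 \cdot 50000} = 100 \cdot \frac{559500}{493500} = 113.37\%$$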
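The same Laspeyres calculation can be sketched in Python using the quantities and prices of the example. The function name `laspeyres_index` is an illustrative assumption.

```python
def laspeyres_index(base_quantities, base_prices, current_prices):
    """Laspeyres index: current and base prices both weighted by base quantities."""
    numerator = sum(q * p for q, p in zip(base_quantities, current_prices))
    denominator = sum(q * p for q, p in zip(base_quantities, base_prices))
    return 100 * numerator / denominator


# Beer, wine, soft drink: quantities (bottles) and unit prices from the example.
q0 = [35_000, 5_000, 50_000]
p0 = [5.1, 35.0, 2.80]  # prices in 2000 (base period)
p1 = [5.7, 37.0, 3.5]   # prices in 2004
index_2004 = laspeyres_index(q0, p0, p1)
print(f"I_p = {index_2004:.2f}%")
```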
b) Producer Price Index
The Producer Price Index (PPI) measures the monthly
changes in prices in primary markets in the US. The index is based on
prices for the transaction of each product in nonretail markets. All
commodities sold in commercial transactions in these markets are
represented, including those imported for sale. One of the common
uses of this index is as a leading indicator of the future trend of
consumer prices and the cost of living. An increase in the PPI reflects
producer price increases that will eventually be passed on to the
consumer through higher retail prices.
Year Salary ($) CPI (2000 base)
2000 490 100
2001 540 105
2002 585 113
2003 640 122
2004 700 138
At first glance, we see a sharply increasing trend in monthly salaries,
with excellent growth from $490 to $700. Should the factory workers
be pleased with this growth in salary? Perhaps yes, but on the other
hand, if the cost of living has increased just as fast as salary, maybe the
answer is no. If we can compare the purchasing power of the $490 in 2000
with the purchasing power of the $700 in 2004, we will have a better
idea of the relative improvement in salaries. The table above also shows
the Consumer Price Index (CPI) for the period 2000-2004. Here we
use 2000 as the base for the CPI. With these data we will see how the CPI
can be used to deflate the series of monthly salaries. In effect, we
remove the consumer price increases from the salaries in order to
measure the change in the purchasing power of the wages.
The calculations used to deflate the salaries are not difficult.
The deflated series is found by dividing the monthly salary in each
year by the corresponding value of the CPI, expressed as a fraction of
the base value (CPI/100).
What does the deflated series of salaries tell us about the “real salary” or
“purchasing power” of workers during 2000-2004? In terms of 2000
dollars, the monthly salary has risen from $490 to $507.2, or
approximately 3.5%. In fact, after we remove the price increase effect,
we see that factory workers are doing little more than keeping even
with the inflationary price increases of the period. Thus, we see that
the advantage of using price indexes to deflate a series is that we have
a clearer picture of the real dollar changes that are occurring.
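The deflation step can be sketched as follows, using the salary and CPI figures from the table above. The variable names are illustrative assumptions; each salary is divided by CPI/100 so that the result is expressed in 2000 dollars.

```python
# Monthly salary ($) and CPI (2000 = 100) from the table above.
salary = {2000: 490, 2001: 540, 2002: 585, 2003: 640, 2004: 700}
cpi = {2000: 100, 2001: 105, 2002: 113, 2003: 122, 2004: 138}

# Real (deflated) salary in base-year dollars.
real_salary = {year: salary[year] / (cpi[year] / 100) for year in salary}

# Percentage change in real salary between 2000 and 2004.
real_growth = (real_salary[2004] / real_salary[2000] - 1) * 100

for year, value in real_salary.items():
    print(year, round(value, 1))
print(f"Real growth 2000-2004: {real_growth:.1f}%")
```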
Exercises
b) Compute a weighted aggregate Laspeyres price index. What is your
interpretation of this index value?
4. Total personal income for the 5 years 1995 to 1999 is as follows
Year Total personal income (In millions of dollars)
1995 1200
1996 1450
1997 1650
1998 1800
1999 2050
Use the following Consumer Price Index
Year CPI
1995 100
1996 110
1997 118
1998 122
1999 130
to deflate the personal income series. What has been the percent
increase in “real personal income” from 1995 to 1999? Sketch actual
and real personal incomes and interpret your results.
5. The following table reports total inventories of all companies for
the 5 years 1990 to 1994 as follows
Year Total inventories (In billions of dollars)
1990 155
1991 163
1992 178
1993 198
1994 227
Use the following Producer Price Index
Year PPI
1990 100
1991 105
1992 113
1993 122
1994 138
to deflate this series. Sketch the real and actual total inventories and
interpret your findings.
Answers
1. 100%; 124.18%; 138.46%; 179.12%; 212.09%; 218.13%; 220.88%;
226.92%; 2. a) 100%; 113.45%; 122.69%; 131.09%; 140.34%; b) 100%;
113.61%; 121.89%; 131.36%; 141.12%; 3. a) 158%; 113%; 96%; b)
120%; 4. 1200; 1318.2; 1398.3; 1475.4; 1576.9; 5. 155; 155; 158; 162;
164.
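First of all, we write the data in ascending order and find the median.
$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} = \left(\frac{15}{2}\right)^{\text{th}} = \frac{7^{\text{th}} + 8^{\text{th}}}{2} = \frac{92 + 99}{2} = 95.5$$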
The runs test developed here separates the observations into a subgroup
above the median and a subgroup below the median. Then, letting
“+” denote observations above the median and “-” denote
observations below the median, we find the following pattern over the
sequence:
+ - + - + - - - + + + - - +
This sequence consists of a run of one “+”, followed by a run of
one “-”, a run of one “+”, a run of one “-”, a run of one “+”, a run of
three “-”, a run of three “+”, a run of two “-”, and a run of one “+”. In
total there are $R = 9$ runs.
The null hypothesis is that the series is a set of random variables.
Table 7 in the Appendix gives the smallest significance level at
which this null hypothesis can be rejected against the alternative of
positive association between adjacent observations, as a function of
R and n.
If the alternative hypothesis is the two-sided hypothesis of nonrandomness,
the significance level must be doubled if it is less than 0.5.
Alternatively, if the significance level, $\alpha$, read from the table is greater
than 0.5, the corresponding significance level for the test against the
two-sided alternative is $2(1 - \alpha)$.
In our case, $n = 14$ and $R = 9$. From the table in the appendix we see that
for $n = 14$ observations, the probability under the null hypothesis of
finding 9 or fewer runs is 0.791. Therefore, the null hypothesis of
randomness can only be rejected against the alternative hypothesis of
positive association between adjacent observations at the 79.1%
significance level. We have not found strong evidence to reject the null
hypothesis that the series is random.
If the sample size is large $(n > 20)$, the distribution of the number of
runs under the null hypothesis can be approximated by a normal distribution.
Under the null hypothesis the random variable
$$Z = \frac{R - \dfrac{n}{2} - 1}{\sqrt{\dfrac{n^2 - 2n}{4(n-1)}}}$$
has a standard normal distribution. In the formula above, R denotes the
number of runs, that is, the number of sequences above or below the
median.
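We want to test the null hypothesis
$H_0$: The series is random
1) If the alternative hypothesis is positive association between adjacent
observations, the decision rule is
$$\text{Reject } H_0 \text{ if } Z = \frac{R - \dfrac{n}{2} - 1}{\sqrt{\dfrac{n^2 - 2n}{4(n-1)}}} < -z_{\alpha}$$
2) If the alternative hypothesis is that the series is nonrandom, then the
decision rule is
$$\text{Reject } H_0 \text{ if } Z = \frac{R - \dfrac{n}{2} - 1}{\sqrt{\dfrac{n^2 - 2n}{4(n-1)}}} < -z_{\alpha/2} \quad \text{or} \quad Z = \frac{R - \dfrac{n}{2} - 1}{\sqrt{\dfrac{n^2 - 2n}{4(n-1)}}} > z_{\alpha/2}$$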
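The large-sample runs test can be sketched in Python as below. The function names and the short series at the end are hypothetical and serve only as illustration; observations exactly equal to the median are dropped, which is an assumption of this sketch rather than a rule stated in the text. The standard normal CDF is computed from `math.erf` instead of being read from Table 1.

```python
import math
from statistics import median


def count_runs(series):
    """Return (R, n): runs above/below the median and the number of values used."""
    med = median(series)
    signs = [x > med for x in series if x != med]  # ties with the median dropped
    if not signs:
        return 0, 0
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return runs, len(signs)


def runs_z(r, n):
    """Large-sample statistic Z = (R - n/2 - 1) / sqrt((n^2 - 2n) / (4(n - 1)))."""
    return (r - n / 2 - 1) / math.sqrt((n * n - 2 * n) / (4 * (n - 1)))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical series, used only to show how runs are counted.
r_demo, n_demo = count_runs([5, 1, 6, 2, 7, 3, 8, 4])

# Statistic for n = 24 observations with R = 10 runs.
z = runs_z(10, 24)
p_one_sided = normal_cdf(z)  # test against positive association
print(r_demo, n_demo, round(z, 4), round(p_one_sided, 4))
```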
Remark:
To run the test with the MINITAB menus, follow these steps:
1. Select Stat>Nonparametrics>Runs test
2. Enter the time series variable (for example, C1)
3. Select Above and below
4. Insert value of the median
5. Select “generate forecasts”
6. Click OK.
Exercises
1. The following table shows country’s industrial production index
over 14 years.
2. The following table shows 24 annual observations on sale of certain
brand of product
Use the large-sample variant of the runs test to test this series for
randomness.
3. The following table shows annual return on a stock market index
over 14 years.
4. The table shows earnings per share of a company over a period of
28 years.
Use the large-sample variant of the runs test to test this series for
randomness.
Answers
1. Median = 88.5; $R = 6$; p-value = $2(0.209) = 0.418$; fail to reject $H_0$ at the 10% level;
2. Median = 737; $R = 10$; $Z = -1.2523$; p-value = $1 - 0.8944 = 0.1056$; reject $H_0$ at levels above 10.56%;
3. Median = 17.5; $R = 9$; p-value = 0.791; fail to reject $H_0$ at the 10% level;
4. Median = 30.7; $R = 7$; $Z = -3.0813$; p-value = $1 - 0.999 = 0.001$; reject $H_0$ at the 0.1% level.
7.5. Components of time series
such as production and sales, exhibit seasonal patterns over different
time periods of a year. For example, a manufacturer of snow removal
equipment and heavy clothing expects low sales activity in the spring
and summer months, with peak sales occurring in the fall and winter
months.
d) Irregular component
The random or chance variations in a time series are referred to as the
irregular component. For example, strikes and natural disasters such
as storms and earthquakes can cause unpredicted irregular movement
in the time series. Since this component accounts for the random
variability in the time series, it is unpredictable. Thus we can not
attempt to predict its impact on the time series in advance.
$$X_t^* = \frac{1}{2m+1}\sum_{j=-m}^{m} x_{t+j} \qquad (t = m+1, m+2, \ldots, n-m)$$
For instance, to find 3-point moving averages, solve
$2m + 1 = 3$
to find $m = 1$. With $m = 1$, the first available value is $X_2^*$.
In general, $X_t^*$ in this case is
$$X_t^* = \frac{x_{t-1} + x_t + x_{t+1}}{3}$$
If we set $m = 2$, then 5-point moving averages are formed as
$$X_t^* = \frac{x_{t-2} + x_{t-1} + x_t + x_{t+1} + x_{t+2}}{5}.$$
Example:
The following data show the sales over the past six years. Compute
simple centered 3-point moving averages to smooth the data
Year Sales
1999 2169
2000 3678
2001 2789
2002 4783
2003 1280
2004 2379
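Solution:
Since we need 3-point moving averages, $2m + 1 = 3$ and $m = 1$. Then
$$X_t^* = \frac{x_{t-1} + x_t + x_{t+1}}{3}$$
Using the formula above, we obtain
$$X_2^* = \frac{x_1 + x_2 + x_3}{3} = \frac{2169 + 3678 + 2789}{3} = 2878.67$$
$$X_3^* = \frac{x_2 + x_3 + x_4}{3} = \frac{3678 + 2789 + 4783}{3} = 3750$$
$$X_4^* = \frac{x_3 + x_4 + x_5}{3} = \frac{2789 + 4783 + 1280}{3} = 2950.67$$
$$X_5^* = \frac{x_4 + x_5 + x_6}{3} = \frac{4783 + 1280 + 2379}{3} = 2814$$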
The original data and smoothed data are given below:
Year Sales X t*
1999 2169 --
2000 3678 2878.67
2001 2789 3750
2002 4783 2950.67
2003 1280 2814
2004 2379 --
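The centered moving averages in the table can be reproduced with a short Python sketch. The function name `centered_moving_average` is an illustrative assumption; it returns only the values that exist, that is, for $t = m+1, \ldots, n-m$.

```python
def centered_moving_average(x, m):
    """Simple centered (2m+1)-point moving averages of the sequence x."""
    width = 2 * m + 1
    return [sum(x[t - m:t + m + 1]) / width for t in range(m, len(x) - m)]


# Sales for 1999-2004 from the example.
sales = [2169, 3678, 2789, 4783, 1280, 2379]
smoothed = centered_moving_average(sales, m=1)  # values for 2000-2003
print([round(v, 2) for v in smoothed])
```

Setting `m=2` in the same call would give 5-point moving averages, at the cost of losing two values at each end of the series.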
Figure 7.1
The original data and smoothed data are graphed in Figure 7.1.
Remark:
To compute moving averages with the MINITAB menus, follow these steps:
1. Select Stat>Time series>Moving averages
2. Enter time series variable (for example, C1)
3. Enter the moving average length
4. Click results and select summary table and results table
5. Click OK.
Exercises
1. The following table gives the gross domestic product (in billions of
dollars) of the country for the years 1990 through 1997
Year GDP
1990 1768.4
1991 1974.1
1992 2488.6
1993 3030.6
1994 3405.0
1995 4038.7
1996 4268.6
1997 4900.4
Compute a simple centered 3-point moving average for the GDP. Plot
the smoothed series and comment on your results.
2. The following table shows the year-end price of gold (in dollars)
over 10 consecutive years.
Compute simple centered 5-point moving averages for the gold price
data.
Draw a time plot of the smoothed series and comment on your results.
3. The table shows earnings per share of a corporation over a period of
14 years.
Year Earnings Year Earnings
1 49.2 8 43.2
2 34.7 9 56.2
3 23.6 10 34.2
4 45.7 11 35.8
5 34.8 12 43.2
6 53.2 13 28.9
7 23.7 14 36.5
where $\alpha$ is a smoothing constant whose value is fixed between 0
and 1. Standing at time n, we obtain forecasts of future values
$x_{n+h}$ of the series by
$$\hat{x}_{n+h} = \hat{x}_n; \quad h = 1, 2, 3, \ldots$$
Example:
The following table shows the price of a share of common stock for a
well-known computer firm over the past 8 weeks. The price shown is
the closing price on the same day of the week for 8 consecutive weeks.
Week Stock price Smoothed time series value
1 50 50.00
2 53 51.80
3 49 50.12
4 50 50.05
5 42 45.22
6 57 52.29
7 52 52.12
8 57 55.05
[Figure 7.2. Exponential smoothing of stock price time series
with smoothing constant $\alpha = 0.4$. Horizontal axis: Week; plotted: smoothed time series.]
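The smoothed values in the table can be reproduced with the sketch below. The recursion used, $\hat{x}_t = \alpha \hat{x}_{t-1} + (1-\alpha) x_t$ with $\hat{x}_1 = x_1$ and $\alpha = 0.4$, is inferred from the tabulated values and should be read as an assumption of this sketch; the function name is likewise illustrative.

```python
def exponential_smoothing(x, alpha):
    """Smoothed series with weight alpha on the previous smoothed value.

    Assumed recursion: s_t = alpha * s_{t-1} + (1 - alpha) * x_t, s_1 = x_1.
    """
    smoothed = [float(x[0])]
    for value in x[1:]:
        smoothed.append(alpha * smoothed[-1] + (1 - alpha) * value)
    return smoothed


# Weekly closing prices from the example.
prices = [50, 53, 49, 50, 42, 57, 52, 57]
smoothed = exponential_smoothing(prices, alpha=0.4)

# Every future week is forecast by the latest smoothed value.
forecast = smoothed[-1]
print([round(s, 2) for s in smoothed], round(forecast, 2))
```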
Remark:
To apply single exponential smoothing with the MINITAB menus, follow these steps:
1. Select Stat>Time series>Single exponential smoothing
2. Enter time series variable (for example, C1)
3. Select Use
4. Insert $(1 - \alpha)$
5. Select “generate forecasts”
6. Number of forecasts: Insert an integer to indicate number of
forecasts you want.
7. Starting from origin: Enter a positive integer to specify a starting
point for the forecasts. For example, if you specify 4 forecasts and 10
for the origin, Minitab computes forecasts for periods 11, 12, 13, and
14, based on the level and trend components at period 10. If you leave
this space blank, Minitab generates forecasts from the end of the data.
8. Select Options
9. Select graphics, outputs
10. Enter 1 to the window “Use average of __ observations”
11. Click OK.
Exercises
1. The following time series shows the sales of a particular product
over the past 12 months.
2. The following table gives the gross domestic product (in billions of
dollars) of the country for the years 1990 through 1997.
Year GDP
1990 1768.4
1991 1974.1
1992 2488.6
1993 3030.6
1994 3405.0
1995 4038.7
1996 4268.6
1997 4900.4
7.8. Double exponential smoothing
(Holt-Winters exponential smoothing forecasting model)
Month Sales
1 145
2 165
3 175
4 149
5 167
6 156
7 176
8 195
For $t = 5$:
$$\hat{x}_5 = 0.2(\hat{x}_4 + T_4) + 0.8x_5 = 0.2(157.48 - 9.344) + 0.8(167) = 163.23$$
$$T_5 = 0.3T_4 + 0.7(\hat{x}_5 - \hat{x}_4) = 0.3(-9.344) + 0.7(163.23 - 157.48) = 1.22$$
For $t = 6$:
$$\hat{x}_6 = 0.2(\hat{x}_5 + T_5) + 0.8x_6 = 0.2(163.23 + 1.22) + 0.8(156) = 157.7$$
$$T_6 = 0.3T_5 + 0.7(\hat{x}_6 - \hat{x}_5) = 0.3(1.22) + 0.7(157.7 - 163.23) = -3.51$$
For $t = 7$:
$$\hat{x}_7 = 0.2(\hat{x}_6 + T_6) + 0.8x_7 = 0.2(157.7 - 3.51) + 0.8(176) = 171.64$$
$$T_7 = 0.3T_6 + 0.7(\hat{x}_7 - \hat{x}_6) = 0.3(-3.51) + 0.7(171.64 - 157.7) = 8.7$$
For $t = 8$:
$$\hat{x}_8 = 0.2(\hat{x}_7 + T_7) + 0.8x_8 = 0.2(171.64 + 8.7) + 0.8(195) = 192.07$$
$$T_8 = 0.3T_7 + 0.7(\hat{x}_8 - \hat{x}_7) = 0.3(8.7) + 0.7(192.07 - 171.64) = 16.91$$
In general, the forecast h periods ahead is
$$\hat{x}_{n+h} = \hat{x}_n + hT_n$$
The most recent level and trend estimates are
$$\hat{x}_8 = 192.07; \quad T_8 = 16.91$$
Then the forecasts for the next three months are
$$\hat{x}_9 = 192.07 + 1(16.91) = 208.98$$
$$\hat{x}_{10} = 192.07 + 2(16.91) = 225.89$$
$$\hat{x}_{11} = 192.07 + 3(16.91) = 242.80$$
The results of these calculations are shown below:
Month Sales $\hat{x}_t$ MINITAB solution
1 145 -- 146.150
2 165 165 161.457
3 175 177 174.503
4 149 157.48 156.590
5 167 163.23 163.157
6 156 157.7 157.823
7 176 171.64 171.735
8 195 192.07 192.106
The last column of the table above contains MINITAB solution.
According to MINITAB solution the predictions are
$$\hat{x}_9 = 209.004; \quad \hat{x}_{10} = 225.902; \quad \hat{x}_{11} = 242.800$$
The values calculated by the MINITAB program differ slightly from
those in the third column of the table above. The MINITAB
procedures will generally provide slightly better forecasts compared to
the more simplified procedure we have shown. The observed time
series and forecasts are shown in Figure 7.3.
Figure 7.3
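The hand recursion for level and trend can be sketched in Python as follows. The starting values ($\hat{x}_2 = x_2$, $T_2 = x_2 - x_1$) and the parameter names `alpha` and `beta` are assumptions of this sketch, chosen because they reproduce the level estimates 177 and 157.48 for months 3 and 4. Evaluated in full precision, the month-8 level comes out near 192.07 and the trend near 16.92.

```python
def holt_smoothing(x, alpha, beta):
    """Level and trend smoothing with weights alpha, beta on the previous estimates.

    Assumed start: level_2 = x_2 and trend_2 = x_2 - x_1.
    Returns the list of level estimates (from period 2 on) and the final trend.
    """
    level, trend = float(x[1]), float(x[1] - x[0])
    levels = [level]
    for value in x[2:]:
        new_level = alpha * (level + trend) + (1 - alpha) * value
        trend = beta * trend + (1 - beta) * (new_level - level)
        level = new_level
        levels.append(level)
    return levels, trend


# Monthly sales from the example.
sales = [145, 165, 175, 149, 167, 156, 176, 195]
levels, trend = holt_smoothing(sales, alpha=0.2, beta=0.3)

# Forecasts for months 9, 10 and 11: last level plus h times last trend.
forecasts = [levels[-1] + h * trend for h in (1, 2, 3)]
print([round(v, 2) for v in levels], round(trend, 2))
print([round(f, 2) for f in forecasts])
```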
Remark:
To apply double exponential smoothing with the MINITAB menus, follow these steps:
1. Select Stat>Time series>Double exponential smoothing
2. Enter time series variable (for example, C1)
3. Select Use
4. Enter $1 - \alpha$ for level
5. Enter $1 - \beta$ for trend
6. Select “generate forecasts”
7. Enter the number of forecasts
8. Enter the starting point for the forecasts
9. Select Options
10. Select graphics, outputs
11. Click OK.
Exercises
Use the Holt-Winters procedure, with smoothing constants 0.3 and
0.4 to obtain forecasts for the next 3 months. Graph the data and
forecasts.
3. The following table shows percentage profit of a company over a
period of 8 years.
Find forecasts for the next three years, using the Holt-Winters
procedure, with smoothing constants 0.7 and 0.6. Graph the
data and forecasts.
Answers
APPENDIX
Table 1: Cumulative distribution function of the standard normal distribution
z F (z ) z F (z ) z F (z ) z F (z ) z F (z ) z F (z )
.00 .5000
.01 .5040 .21 .5832 .41 .6591 .61 .7291 .81 .7910 1.01 .8438
.02 .5080 .22 .5871 .42 .6628 .62 .7324 .82 .7939 1.02 .8461
.03 .5120 .23 .5910 .43 .6664 .63 .7357 .83 .7967 1.03 .8485
.04 .5160 .24 .5948 .44 .6700 .64 .7389 .84 .7995 1.04 .8508
.05 .5199 .25 .5987 .45 .6736 .65 .7422 .85 .8023 1.05 .8531
.06 .5239 .26 .6026 .46 .6772 .66 .7454 .86 .8051 1.06 .8554
.07 .5279 .27 .6064 .47 .6808 .67 .7486 .87 .8078 1.07 .8577
.08 .5319 .28 .6103 .48 .6844 .68 .7517 .88 .8106 1.08 .8599
.09 .5359 .29 .6141 .49 .6879 .69 .7549 .89 .8133 1.09 .8621
.10 .5398 .30 .6179 .50 .6915 .70 .7580 .90 .8159 1.10 .8643
.11 .5438 .31 .6217 .51 .6950 .71 .7611 .91 .8186 1.11 .8665
.12 .5478 .32 .6255 .52 .6985 .72 .7642 .92 .8212 1.12 .8686
.13 .5517 .33 .6293 .53 .7019 .73 .7673 .93 .8238 1.13 .8708
.14 .5557 .34 .6331 .54 .7054 .74 .7704 .94 .8264 1.14 .8729
.15 .5596 .35 .6368 .55 .7088 .75 .7734 .95 .8289 1.15 .8749
.16 .5636 .36 .6406 .56 .7123 .76 .7764 .96 .8315 1.16 .8770
.17 .5675 .37 .6443 .57 .7157 .77 .7794 .97 .8340 1.17 .8790
.18 .5714 .38 .6480 .58 .7190 .78 .7823 .98 .8365 1.18 .8810
.19 .5753 .39 .6517 .59 .7224 .79 .7852 .99 .8389 1.19 .8830
.20 .5793 .40 .6554 .60 .7257 .80 .7881 1.00 .8413 1.20 .8849
1.21 .8869 1.46 .9279 1.71 .9564 1.96 .9750 2.21 .9864 2.46 .9931
1.22 .8888 1.47 .9292 1.72 .9573 1.97 .9756 2.22 .9868 2.47 .9932
1.23 .8907 1.48 .9306 1.73 .9582 1.98 .9761 2.23 .9871 2.48 .9934
1.24 .8925 1.49 .9319 1.74 .9591 1.99 .9767 2.24 .9875 2.49 .9936
1.25 .8944 1.50 .9332 1.75 .9599 2.00 .9772 2.25 .9878 2.50 .9938
1.26 .8962 1.51 .9345 1.76 .9608 2.01 .9778 2.26 .9881 2.51 .9940
1.27 .8980 1.52 .9357 1.77 .9616 2.02 .9783 2.27 .9884 2.52 .9941
1.28 .8997 1.53 .9370 1.78 .9625 2.03 .9788 2.28 .9887 2.53 .9943
1.29 .9015 1.54 .9382 1.79 .9633 2.04 .9793 2.29 .9890 2.54 .9945
1.30 .9032 1.55 .9394 1.80 .9641 2.05 .9798 2.30 .9893 2.55 .9946
1.31 .9049 1.56 .9406 1.81 .9649 2.06 .9803 2.31 .9896 2.56 .9948
1.32 .9066 1.57 .9418 1.82 .9656 2.07 .9808 2.32 .9898 2.57 .9949
1.33 .9082 1.58 .9429 1.83 .9664 2.08 .9812 2.33 .9901 2.58 .9951
1.34 .9099 1.59 .9441 1.84 .9671 2.09 .9817 2.34 .9904 2.59 .9952
1.35 .9115 1.60 .9452 1.85 .9678 2.10 .9821 2.35 .9906 2.60 .9953
1.36 .9131 1.61 .9463 1.86 .9686 2.11 .9826 2.36 .9909 2.61 .9955
1.37 .9147 1.62 .9474 1.87 .9693 2.12 .9830 2.37 .9911 2.62 .9956
1.38 .9162 1.63 .9484 1.88 .9699 2.13 .9834 2.38 .9913 2.63 .9957
1.39 .9177 1.64 .9495 1.89 .9706 2.14 .9838 2.39 .9916 2.64 .9959
1.40 .9192 1.65 .9505 1.90 .9713 2.15 .9842 2.40 .9918 2.65 .9960
1.41 .9207 1.66 .9515 1.91 .9719 2.16 .9846 2.41 .9920 2.66 .9961
1.42 .9222 1.67 .9525 1.92 .9726 2.17 .9850 2.42 .9922 2.67 .9962
1.43 .9236 1.68 .9535 1.93 .9732 2.18 .9854 2.43 .9925 2.68 .9963
1.44 .9251 1.69 .9545 1.94 .9738 2.19 .9857 2.44 .9927 2.69 .9964
1.45 .9265 1.70 .9554 1.95 .9744 2.20 .9861 2.45 .9929 2.70 .9965
z F (z ) z F (z ) z F (z ) z F (z ) z F (z ) z F (z ) z F (z )
2.71 .9966 2.91 .9982 3.11 .9991 3.31 .9995 3.51 .9998 3.71 .9999 3.91 1.0000
2.72 .9967 2.92 .9982 3.12 .9991 3.32 .9996 3.52 .9998 3.72 .9999 3.92 1.0000
2.73 .9968 2.93 .9983 3.13 .9991 3.33 .9996 3.53 .9998 3.73 .9999 3.93 1.0000
2.74 .9969 2.94 .9984 3.14 .9992 3.34 .9996 3.54 .9998 3.74 .9999 3.94 1.0000
2.75 .9970 2.95 .9984 3.15 .9992 3.35 .9996 3.55 .9998 3.75 .9999 3.95 1.0000
2.76 .9971 2.96 .9985 3.16 .9992 3.36 .9996 3.56 .9998 3.76 .9999 3.96 1.0000
2.77 .9972 2.97 .9985 3.17 .9992 3.37 .9996 3.57 .9998 3.77 .9999 3.97 1.0000
2.78 .9973 2.98 .9986 3.18 .9993 3.38 .9996 3.58 .9998 3.78 .9999 3.98 1.0000
2.79 .9974 2.99 .9986 3.19 .9993 3.39 .9997 3.59 .9998 3.79 .9999 3.99 1.0000
2.80 .9974 3.00 .9986 3.20 .9993 3.40 .9997 3.60 .9998 3.80 .9999
2.81 .9975 3.01 .9987 3.21 .9993 3.41 .9997 3.61 .9998 3.81 .9999
2.82 .9976 3.02 .9987 3.22 .9994 3.42 .9997 3.62 .9999 3.82 .9999
2.83 .9977 3.03 .9988 3.23 .9994 3.43 .9997 3.63 .9999 3.83 .9999
2.84 .9977 3.04 .9988 3.24 .9994 3.44 .9997 3.64 .9999 3.84 .9999
2.85 .9978 3.05 .9989 3.25 .9994 3.45 .9997 3.65 .9999 3.85 .9999
2.86 .9979 3.06 .9989 3.26 .9994 3.46 .9997 3.66 .9999 3.86 .9999
2.87 .9979 3.07 .9989 3.27 .9995 3.47 .9997 3.67 .9999 3.87 .9999
2.88 .9980 3.08 .9990 3.28 .9995 3.48 .9997 3.68 .9999 3.88 .9999
2.89 .9981 3.09 .9990 3.29 .9995 3.49 .9998 3.69 .9999 3.89 1.0000
2.90 .9981 3.10 .9990 3.30 .9995 3.50 .9998 3.70 .9999 3.90 1.0000
Table 2: Cut-off points of Student’s t distribution
$\nu$ $\alpha$ = 0.100 0.050 0.025 0.010 0.005
1 3.078 6.314 12.706 31.821 63.657
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250
10 1.372 1.812 2.228 2.764 3.169
Table 3: Chi-square distribution
$\nu$ $\alpha$ = .995 .990 .975 .950 .900 .100 .050 .025 .010 .005
1 0.0000393 0.000157 0.000982 0.00393 0.0158 2.71 3.84 5.02 6.63 7.88
2 0.0100 0.0201 0.0506 0.103 0.211 4.61 5.99 7.38 9.21 10.60
3 0.072 0.115 0.216 0.352 0.584 6.25 7.81 9.35 11.34 12.84
4 0.207 0.297 0.484 0.711 1.064 7.78 9.49 11.14 13.28 14.86
5 0.412 0.554 0.831 1.145 1.61 9.24 11.07 12.83 15.09 16.75
6 0.676 0.872 1.24 1.64 2.20 10.64 12.59 14.45 16.81 18.55
7 0.989 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48 20.28
8 1.34 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09 21.96
9 1.73 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67 23.59
10 2.16 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21 25.19
11 2.60 3.05 3.82 4.57 5.58 17.28 19.68 21.92 24.73 26.76
12 3.07 3.57 4.40 5.23 6.30 18.55 21.03 23.34 26.22 28.30
13 3.57 4.11 5.01 5.89 7.04 19.81 22.36 24.74 27.69 29.82
14 4.07 4.66 5.63 6.57 7.79 21.06 23.68 26.12 29.14 31.32
15 4.60 5.23 6.26 7.26 8.55 22.31 25.00 27.49 30.58 32.80
16 5.14 5.81 6.91 7.96 9.31 23.54 26.30 28.85 32.00 34.27
17 5.70 6.41 7.56 8.67 10.09 24.77 27.59 30.19 33.41 35.72
18 6.26 7.01 8.23 9.39 10.86 25.99 28.87 31.53 34.81 37.16
19 6.84 7.63 8.91 10.12 11.65 27.20 30.14 32.85 36.19 38.58
20 7.43 8.26 9.59 10.85 12.44 28.41 31.41 34.17 37.57 40.00
21 8.03 8.90 10.28 11.59 13.24 29.62 32.67 35.48 38.93 41.40
22 8.64 9.54 10.98 12.34 14.04 30.81 33.92 36.78 40.29 42.80
23 9.26 10.20 11.69 13.09 14.85 32.01 35.17 38.08 41.64 44.18
24 9.89 10.86 12.40 13.85 15.66 33.20 36.42 39.36 42.98 45.56
25 10.52 11.52 13.12 14.61 16.47 34.38 37.65 40.65 44.31 46.93
26 11.16 12.20 13.84 15.38 17.29 35.56 38.89 41.92 45.64 48.29
27 11.81 12.88 14.57 16.15 18.11 36.74 40.11 43.19 46.96 49.64
28 12.46 13.56 15.31 16.93 18.94 37.92 41.34 44.46 48.28 50.99
29 13.12 14.26 16.05 17.71 19.77 39.09 42.56 45.72 49.59 52.34
30 13.79 14.95 16.79 18.49 20.60 40.26 43.77 46.98 50.89 53.67
40 20.71 22.16 24.43 26.51 29.05 51.81 55.76 59.34 63.69 66.77
50 27.99 29.71 32.36 34.76 37.69 63.17 67.50 71.42 76.15 79.49
60 35.53 37.48 40.48 43.19 46.46 74.40 79.08 83.30 88.38 91.95
70 43.28 45.44 48.76 51.74 55.33 85.53 90.53 95.02 100.4 104.2
80 51.17 53.54 57.15 60.39 64.28 96.58 101.9 106.6 112.3 116.3
90 59.20 61.75 65.65 69.16 73.29 107.6 113.1 118.1 124.1 128.3
100 67.33 70.06 74.22 77.93 82.36 118.5 124.3 129.6 135.8 140.2
Table 4: Cutoff points of the distribution of the Wilcoxon test statistic
n $\alpha$ = 0.005 0.01 0.025 0.05 0.10
4 0 0 0 0 1
5 0 0 0 1 3
6 0 0 1 3 4
7 0 1 3 4 6
8 1 2 4 6 9
9 2 4 6 9 11
10 4 6 9 11 15
11 6 8 11 14 18
12 8 10 14 18 22
13 10 13 18 22 27
14 13 16 22 26 32
15 16 20 26 31 37
16 20 24 30 36 43
17 24 28 35 42 49
18 28 33 41 48 56
19 33 38 47 54 63
20 38 44 53 61 70
Table 5: Cutoff points of the distribution of Spearman’s rank correlation coefficient
n $\alpha$ = 0.05 0.025 0.01 0.005
5 .900 - - -
6 .829 .886 .943 -
7 .714 .786 .893 -
8 .643 .738 .833 .881
9 .600 .683 .783 .833
10 .564 .648 .745 .794
Table 2: Cutoff points for the F distribution
$\alpha = 0.05$
Rows: $\nu_2$ (denominator degrees of freedom); columns: $\nu_1$ (numerator degrees of freedom)
1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 $\infty$
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 243.9 245.9 248.0 249.1 250.1 251.1 252.2 253.3 254.3
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.50
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00
α = 0.01
ν1 = numerator degrees of freedom (columns); ν2 = denominator degrees of freedom (rows)
ν2 \ ν1 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
1 4052 4999 5403 5625 5764 5859 5928 5982 6022 6056 6106 6157 6209 6235 6261 6287 6313 6339 6366
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.45 99.46 99.47 99.47 99.48 99.49 99.50
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.87 26.69 26.60 26.50 26.41 26.32 26.22 26.13
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.20 14.02 13.93 13.84 13.75 13.65 13.56 13.46
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11 9.02
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97 6.88
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95 4.86
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00 3.91
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69 3.60
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45 3.36
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25 3.17
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09 3.00
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.38 3.29 3.21 3.13 3.05 2.96 2.87
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84 2.75
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75 2.65
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66 2.57
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58 2.49
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46 2.36
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40 2.31
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35 2.26
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.70 2.62 2.54 2.45 2.36 2.27 2.17
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.81 2.66 2.58 2.50 2.42 2.33 2.23 2.13
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.78 2.63 2.55 2.47 2.38 2.29 2.20 2.10
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.75 2.60 2.52 2.44 2.35 2.26 2.17 2.06
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.73 2.57 2.49 2.41 2.33 2.23 2.14 2.03
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.60
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32 1.00
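
The last row of each F table (infinite denominator degrees of freedom) can be checked by hand, because in that limit the F cutoff equals the chi-square cutoff divided by the numerator degrees of freedom. The sketch below covers only the two cases that have closed forms in the standard library: one numerator degree of freedom (the square of a standard normal cutoff) and two (an exponential tail). The function name `f_cutoff_infinite_denominator` is hypothetical.

```python
from math import log
from statistics import NormalDist


def f_cutoff_infinite_denominator(numerator_df: int, alpha: float) -> float:
    """Upper-tail F cutoff when the denominator degrees of freedom are infinite."""
    if numerator_df == 1:
        # A chi-square with 1 df is the square of a standard normal variable.
        z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
        return z * z
    if numerator_df == 2:
        # A chi-square with 2 df has P(X > x) = exp(-x / 2); divide the cutoff by 2.
        return -log(alpha)
    raise ValueError("this sketch handles only 1 or 2 numerator degrees of freedom")


for alpha in (0.05, 0.01):
    print(alpha, [round(f_cutoff_infinite_denominator(k, alpha), 2) for k in (1, 2)])
```

Other entries need the incomplete beta function, which is outside the scope of this sketch.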
Table 7: Cumulative distribution function of the runs test statistic
n \ R 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
6 .100 .300 .700 .900 1.00
8 .029 .114 .371 .629 .886 .971 1.00
10 .008 .040 .167 .357 .643 .833 .960 .992 1.00
12 .002 .013 .067 .175 .392 .608 .825 .933 .987 .998 1.00
14 .001 .004 .025 .078 .209 .383 .617 .791 .922 .975 .996 .999 1.00
16 .000 .001 .009 .032 .100 .214 .405 .595 .786 .900 .968 .991 .999 1.00 1.00
18 .000 .000 .003 .012 .044 .109 .238 .399 .601 .762 .891 .956 .988 .997 1.00 1.00 1.00
20 .000 .000 .001 .004 .019 .051 .128 .242 .414 .586 .758 .872 .949 .981 .996 .999 1.00 1.00 1.00
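
The entries of this table can be recomputed by counting arrangements. The sketch below assumes n is even, that exactly n/2 observations fall above and n/2 below the median, and that all arrangements are equally likely under randomness; under those assumptions it reproduces the first rows of the table. The function name `runs_cdf` is hypothetical.

```python
from math import comb


def runs_cdf(n: int, r: int) -> float:
    """Return P(R <= r) for the number of runs R in n observations (n even)."""
    if n < 2 or n % 2:
        raise ValueError("n must be an even integer of at least 2")
    m = n // 2                  # observations of each type
    total = comb(n, m)          # equally likely arrangements
    count = 0
    for runs in range(2, min(r, n) + 1):
        k = runs // 2
        if runs % 2 == 0:
            # 2k runs: k runs of each type, and either type may come first.
            count += 2 * comb(m - 1, k - 1) ** 2
        else:
            # 2k + 1 runs: one type has k + 1 runs, the other has k.
            count += 2 * comb(m - 1, k) * comb(m - 1, k - 1)
    return count / total


# Recompute the row for n = 10.
print([round(runs_cdf(10, r), 3) for r in range(2, 11)])
```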
References