
CHAPTER 4

CONFIDENCE INTERVALS AND HYPOTHESIS TESTING


The only relevant test of the validity of a
hypothesis is comparison of its predictions with
experience.
Milton Friedman
A confidence interval is an interval that provides an estimated range of values which is likely to include, with a certain level of confidence, an unknown population parameter of interest. This estimated range of values is calculated from a given set of sample data. Confidence intervals are one way to represent how good an estimate is; the larger a confidence interval for a particular estimate, the more caution is required when using the estimate. Hypothesis testing, on the other hand, is a method of drawing inferences about a population based on statistical evidence from a sample. A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true. The purpose of hypothesis testing is to assess the validity of a statistical hypothesis made about a population.
4.1 POINT AND INTERVAL ESTIMATION
Point estimation involves the use of sample data to calculate a single value (statistic) which serves as an estimate of an
unknown population parameter, while interval estimation involves the use of sample data to calculate an interval of
possible (or probable) values of an unknown population parameter.
Let $X_1, \ldots, X_n$ be a random sample from a population with an unknown parameter $\theta$. A point estimator of $\theta$ is the statistic $\hat{\theta} = h(X_1, \ldots, X_n)$, where $h$ is a given function. A numerical value $\hat{\theta}$ of the statistic $\hat{\theta}$, calculated from the observed sample data $X_1 = x_1, \ldots, X_n = x_n$, is called the point estimate of the parameter $\theta$. For example, the sample mean $\bar{X}$ is an estimator of a normal population mean $\mu$, while the numerical value $\hat{\mu} = \bar{x}$ calculated from an observed sample is the point estimate of $\mu$. A good estimator must satisfy three conditions:
Unbiased: The expected value of the estimator must be equal to the parameter
Consistent: The value of the estimator approaches the value of the parameter as the sample size increases
Relatively Efficient: The estimator has the smallest variance of all estimators which could be used.
Definition 4.1 Assume that $X_1, \ldots, X_n$ is a random sample from a population with an unknown parameter $\theta$. Then,
The point estimator $\hat{\theta}$ is an unbiased estimator for the parameter $\theta$ if $E(\hat{\theta}) = \theta$.
The minimum variance unbiased estimator (MVUE) is an unbiased estimator that has lower variance than any other unbiased estimator for all possible values of the parameter $\theta$.
The standard error of a point estimator $\hat{\theta}$ is given by $\text{s.e.}(\hat{\theta}) = \sqrt{\operatorname{var}(\hat{\theta})}$.

A biased point estimator is an estimator such that $E(\hat{\theta}) = \theta + \text{bias}$. For example, both the sample mean $\bar{X}$ and the sample variance $S^2$ are unbiased estimators for the population mean $\mu$ and the population variance $\sigma^2$, respectively, i.e. $E(\bar{X}) = \mu$ and $E(S^2) = \sigma^2$. However, the sample standard deviation $S$ is a biased estimator of the population standard deviation $\sigma$, i.e. $E(S) \neq \sigma$.

The MVUE is the most efficient estimator. An efficient estimator $\hat{\theta}$ will produce an estimate closer to the true parameter $\theta$, and it can be shown that the sample mean $\bar{X}$ is the MVUE for the population mean $\mu$. On the other hand, an interval estimate refers to a range of values used to estimate the population parameter. Such an interval estimate is obtained by making use of the probability distribution of the point estimator.
Example 4.1 In a sample of five measurements, the diameter of a sphere was recorded by a scientist as 6.33, 6.37, 6.36, 6.32, and 6.37 centimeters (cm).
(i) Determine an unbiased and efficient estimate of the population mean.
(ii) Determine an unbiased and efficient estimate of the population variance.
(iii) Give an unbiased and inefficient estimate of the population mean.
Solution: From the given information, the sample size is $n = 5$.
(i) The unbiased and efficient estimate of the population mean is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{5}(6.33 + 6.37 + 6.36 + 6.32 + 6.37) = 6.35$$
(ii) The unbiased and efficient estimate of the population variance is
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 = 0.00055.$$
(iii) The median $Q_2$ is one example of an unbiased and inefficient estimate of the population mean. By ordering the data, we obtain $Q_2 = 6.36$.
MATLAB code
>> X = [6.33 6.37 6.36 6.32 6.37];
>> mean(X), var(X), median(X)
4.2 CONFIDENCE INTERVALS
A confidence interval is an interval estimate with a specific confidence level, $100(1-\alpha)\%$, where $\alpha \in (0, 1)$. The confidence coefficient, denoted $1-\alpha$, is the probability that the interval estimate will contain the population parameter $\theta$. More specifically, we want to construct a $100(1-\alpha)\%$ confidence interval $\ell \le \theta \le u$ such that
$$P(\ell \le \theta \le u) = 1 - \alpha \qquad (4.2.1)$$
where $\ell$ and $u$ are called the lower and upper confidence limits, respectively. These confidence limits are calculated from the observed sample data.
When a one-sided specification is employed, only a one-sided confidence limit is needed. In this case, a lower-confidence interval on $\theta$ is $\ell$ such that $P(\ell \le \theta) = 1 - \alpha$, and an upper-confidence interval on $\theta$ is $u$ such that $P(\theta \le u) = 1 - \alpha$. A typical value of the confidence level $100(1-\alpha)\%$ is 95%, which means that if all samples of the same size were selected, 95% of them would include the population parameter somewhere within the confidence interval, and 5% would not.
4.2.1 CONFIDENCE INTERVAL ON THE POPULATION MEAN WHEN THE VARIANCE IS KNOWN
Let $X_1, \ldots, X_n$ be a random sample of size $n$ ($\ge 30$) from a normal population with an unknown mean $\mu$ and known variance $\sigma^2$. Since the sample size is large, it follows from the central limit theorem that the statistic $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$, and as illustrated in Figure 4.1 we can write
$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha,$$
where $z_{\alpha/2}$ is the upper $100\alpha/2$ percentage point of the standard normal distribution.
Rearranging the terms inside the parentheses yields
$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$

FIGURE 4.1: Illustration of confidence interval on the normal mean when the variance is known.
Definition 4.2 Let $X_1, \ldots, X_n$ be a random sample from an $N(\mu, \sigma^2)$ distribution with an unknown mean $\mu$ and known variance $\sigma^2$.
A $100(1-\alpha)\%$ confidence interval on $\mu$ is given by
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \qquad (4.2.2)$$
A $100(1-\alpha)\%$ upper-confidence interval on $\mu$ is given by
$$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}} \qquad (4.2.3)$$
A $100(1-\alpha)\%$ lower-confidence interval on $\mu$ is given by
$$\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}} \le \mu \qquad (4.2.4)$$
where $\bar{x}$ is the observed sample mean.
Example 4.2 A survey was conducted of companies that use solar panels as a primary source of electricity. The question that was asked was this: How much of the electricity used in your company comes from the solar panels? A random sample of 55 responses produced a mean of 45 megawatts. Suppose the population standard deviation for this question is 15.5 megawatts.
(i) Find the 95% confidence interval for the mean.
(ii) Find the 95% upper-confidence interval for the mean.
(iii) Find the 95% lower-confidence interval for the mean.
Solution: From the given information, we have $\bar{x} = 45$, $\sigma = 15.5$, and $n = 55$.
For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$, $z_{\alpha/2} = z_{0.025} = 1.96$, and $z_{\alpha} = z_{0.05} = 1.64$.
(i) The 95% confidence interval is given by
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \implies 45 - 1.96\frac{15.5}{\sqrt{55}} \le \mu \le 45 + 1.96\frac{15.5}{\sqrt{55}} \implies 40.9 \le \mu \le 49.1$$
Thus, we can be 95% sure that the mean will be between 40.9 and 49.1 megawatts. In other words, the probability for the mean to be between 40.9 and 49.1 will be 0.95:
$$P(40.9 \le \mu \le 49.1) = 0.95.$$
MATLAB code
>> xbar = 45; sigma = 15.5; n = 55; alpha = 1-0.95;
>> zalpha2 = icdf('norm',1-alpha/2,0,1);
>> LC = xbar - zalpha2*sigma/sqrt(n);
>> UC = xbar + zalpha2*sigma/sqrt(n);
(ii) The 95% upper-confidence interval is given by
$$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}} = 45 + 1.64\frac{15.5}{\sqrt{55}} \implies \mu \le 48.44$$
(iii) The 95% lower-confidence interval is given by
$$\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}} = 45 - 1.64\frac{15.5}{\sqrt{55}} \le \mu \implies 41.56 \le \mu.$$
4.2.2 CONFIDENCE INTERVAL ON THE POPULATION MEAN WHEN THE VARIANCE IS UNKNOWN
When the sample size is small, we cannot apply the central limit theorem. Thus, we either assume that the sampled population is normally distributed, or we need to verify that the sample data are approximately normally distributed using, for example, the normal probability plot or the box plot. Therefore, when the population is normal and the sample size is small, the statistic $T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$ follows a $t$-distribution with $n-1$ degrees of freedom. Hence, as illustrated in Figure 4.2 we can write
$$P\left(-t_{\alpha/2,n-1} \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{\alpha/2,n-1}\right) = 1 - \alpha,$$
where $t_{\alpha/2,n-1}$ is the upper $100\alpha/2$ percentage point of the $t$-distribution with $n-1$ degrees of freedom. It follows that
$$P\left(\bar{X} - t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right) = 1 - \alpha$$

FIGURE 4.2: Illustration of confidence interval on the normal mean when the variance is unknown.
Definition 4.3 Let $X_1, \ldots, X_n$ be a random sample from a normal population with unknown mean $\mu$ and unknown variance $\sigma^2$.
A $100(1-\alpha)\%$ confidence interval on $\mu$ is given by
$$\bar{x} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \qquad (4.2.5)$$
A $100(1-\alpha)\%$ upper-confidence interval on $\mu$ is given by
$$\mu \le \bar{x} + t_{\alpha,n-1}\frac{s}{\sqrt{n}} \qquad (4.2.6)$$
A $100(1-\alpha)\%$ lower-confidence interval on $\mu$ is given by
$$\bar{x} - t_{\alpha,n-1}\frac{s}{\sqrt{n}} \le \mu \qquad (4.2.7)$$
where $\bar{x}$ and $s$ are the observed sample mean and sample standard deviation, respectively.
Example 4.3 A random sample of size 25 of a certain kind of lightbulb yielded an average lifetime of 1875 hours and a standard deviation of 100 hours. From past experience it is known that the lifetime of this kind of bulb is normally distributed. Find the 99% confidence interval for the population mean.
Solution: From the information given, we have $n = 25$, $\bar{x} = 1875$, and $s = 100$. For a 99% confidence level, we have $\alpha = 1 - 0.99 = 0.01$ and $t_{\alpha/2,n-1} = t_{0.005,24} = 2.7969$. Also, the population is assumed to be normally distributed. The 99% confidence interval is given by
$$\bar{x} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \implies 1875 - 2.7969\frac{100}{\sqrt{25}} \le \mu \le 1875 + 2.7969\frac{100}{\sqrt{25}} \implies 1819.1 \le \mu \le 1930.9$$
We can be 99% sure that the mean will be between 1819.1 and 1930.9. In other words, the probability for the mean to be between 1819.1 and 1930.9 will be 0.99. That is,
$$P(1819.1 \le \mu \le 1930.9) = 0.99.$$
MATLAB code
>> n=25; xbar = 1875; s = 100; alpha = 1-0.99;
>> talpha2 = icdf('t',1-alpha/2,n-1)
>> LC = xbar - talpha2*s/sqrt(n)
>> UC = xbar + talpha2*s/sqrt(n)
Example 4.4 A manager of a car rental company wants to estimate the average number of times luxury cars would be rented a
month. She takes a random sample of 19 cars that produces the following number of times the cars are rented in a month:
3 7 12 5 9 13 2 8 6 14 6 1 2 3 2 5 11 13 5
(i) Check the assumption of normality for the number of times the cars are rented in a month.
(ii) Find the 95% confidence interval to estimate the average.
(iii) Find the 95% upper-confidence interval to estimate the average.
(iv) Find the 95% lower-confidence interval to estimate the average.
Solution: The sample size is $n = 19$. Thus, the sample mean and sample standard deviation of the data are $\bar{x} = 127/19 = 6.68$ and $s = 4.23$. For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$, $t_{\alpha/2,n-1} = t_{0.025,18} = 2.101$, and $t_{\alpha,n-1} = t_{0.05,18} = 1.734$.
(i) According to the normal probability plot shown in Figure 4.3, there does not seem to be a severe deviation from
normality for this data. This is evident by the fact that the data appears to fall along a straight line.
MATLAB code
>> X = [3 7 12 5 9 13 2 8 6 14 6 1 2 3 2 5 11 13 5];
>> normplot(X);
FIGURE 4.3: Normal probability plot for the number of times the cars are rented in a month.
(ii) The 95% confidence interval is given by
$$\bar{x} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \implies 6.68 - 2.101\frac{4.23}{\sqrt{19}} \le \mu \le 6.68 + 2.101\frac{4.23}{\sqrt{19}} \implies 4.65 \le \mu \le 8.72$$
We can be 95 percent sure that the mean will be between 4.65 and 8.72. In other words, the probability for the mean to be between 4.65 and 8.72 will be 0.95:
$$P(4.65 \le \mu \le 8.72) = 0.95.$$
MATLAB code
>> X = [3 7 12 5 9 13 2 8 6 14 6 1 2 3 2 5 11 13 5];
>> xbar = mean(X); s = std(X); n = 19; alpha = 1-0.95;
>> talpha2 = icdf('t',1-alpha/2,n-1)
>> LC = xbar - talpha2*s/sqrt(n)
>> UC = xbar + talpha2*s/sqrt(n)
(iii) The 95% upper-confidence interval is given by
$$\mu \le \bar{x} + t_{\alpha,n-1}\frac{s}{\sqrt{n}} = 6.68 + 1.734\frac{4.23}{\sqrt{19}} \implies \mu \le 8.37$$
(iv) The 95% lower-confidence interval is given by
$$\bar{x} - t_{\alpha,n-1}\frac{s}{\sqrt{n}} = 6.68 - 1.734\frac{4.23}{\sqrt{19}} \le \mu \implies 5 \le \mu.$$
4.2.3 CONFIDENCE INTERVAL ON THE POPULATION VARIANCE
In quality control, in most cases the objective of the auditor is not to find the mean of a population but rather to determine the level of variation of the output. For instance, they would want to know how much variation the production process exhibits about the target to see what adjustments are needed to reach a defect-free process.
When the population is normally distributed, the statistic $(n-1)\dfrac{S^2}{\sigma^2} \sim \chi^2(n-1)$ follows a $\chi^2$ distribution with $n-1$ degrees of freedom, and as illustrated in Figure 4.4 we can construct the $100(1-\alpha)\%$ confidence interval for $\sigma^2$ as follows:
$$P\left(\chi^2_{1-\alpha/2,n-1} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2,n-1}\right) = 1 - \alpha,$$
where $\chi^2_{\alpha/2,n-1}$ is the upper $100\alpha/2$ percentage point of the $\chi^2$-distribution with $n-1$ degrees of freedom. It follows that
$$P\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}}\right) = 1 - \alpha$$
Definition 4.4 Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^2$.
A $100(1-\alpha)\%$ confidence interval on $\sigma^2$ is given by
$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}} \qquad (4.2.8)$$
A $100(1-\alpha)\%$ upper-confidence interval on $\sigma^2$ is given by
$$\sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha,n-1}} \qquad (4.2.9)$$
A $100(1-\alpha)\%$ lower-confidence interval on $\sigma^2$ is given by
$$\frac{(n-1)s^2}{\chi^2_{\alpha,n-1}} \le \sigma^2 \qquad (4.2.10)$$
where $s^2$ is the observed sample variance.
FIGURE 4.4: Illustration of confidence interval on the normal variance.
Example 4.5 A sample of 9 screws was taken out of a production line and the sizes of the diameters in millimeters are as follows:
13 13 12 12.55 12.99 12.89 12.88 12.97 12.99
(i) Find the 95% confidence interval to estimate the population variance.
(ii) Find the 95% upper-confidence interval to estimate the population variance.
(iii) Find the 95% lower-confidence interval to estimate the population variance.
Solution: The sample size is $n = 9$. Thus, the sample variance of the data is $s^2 = 0.11$. For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$, $\chi^2_{\alpha/2,n-1} = \chi^2_{0.025,8} = 17.53$, $\chi^2_{1-\alpha/2,n-1} = 2.18$, $\chi^2_{\alpha,n-1} = 15.51$, and $\chi^2_{1-\alpha,n-1} = 2.73$.
(i) The 95% confidence interval is given by
$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}} \implies 0.05 \le \sigma^2 \le 0.41$$
Thus, we can be 95% sure that the population variance will be between 0.05 and 0.41. In other words, the probability for the population variance to be between 0.05 and 0.41 will be 0.95:
$$P(0.05 \le \sigma^2 \le 0.41) = 0.95.$$
MATLAB code
>> X = [13 13 12 12.55 12.99 12.89 12.88 12.97 12.99];
>> n = length(X); s2 = var(X); alpha = 1-0.95;
>> LC = (n-1)*s2/icdf('chi2',1-alpha/2,n-1)
>> UC = (n-1)*s2/icdf('chi2',alpha/2,n-1)
(ii) The 95% upper-confidence interval is given by
$$\sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha,n-1}} \implies \sigma^2 \le 0.33$$
(iii) The 95% lower-confidence interval is given by
$$\frac{(n-1)s^2}{\chi^2_{\alpha,n-1}} \le \sigma^2 \implies 0.06 \le \sigma^2$$
Example 4.6 The time taken by a worker in a car manufacturing company to finish a paint job on a car is normally distributed with mean $\mu$ and variance $\sigma^2$. A sample of 15 paint jobs is randomly selected and assigned to that worker, and the time taken by the worker to finish each job is jotted down. These data yield a sample standard deviation of 2.5 hours.
(i) Find the 95% confidence interval to estimate the population standard deviation.
(ii) Find the 95% upper-confidence interval to estimate the population standard deviation.
(iii) Find the 95% lower-confidence interval to estimate the population standard deviation.
Solution: From the given information, we have $n = 15$ and $s = 2.5$. For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$, $\chi^2_{\alpha/2,n-1} = \chi^2_{0.025,14} = 26.1189$, $\chi^2_{1-\alpha/2,n-1} = 5.6287$, $\chi^2_{\alpha,n-1} = 23.6848$, and $\chi^2_{1-\alpha,n-1} = 6.5706$.
(i) The 95% confidence interval is given by
$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}} \implies 3.3501 \le \sigma^2 \le 15.5453$$
We can be 95% sure that the population variance will be between 3.3501 and 15.5453. Therefore, by taking the square root of the lower and upper confidence limits for $\sigma^2$, the 95% confidence interval of the population standard deviation is (1.8303, 3.9427).
(ii) The 95% upper-confidence interval is given by
$$\sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha,n-1}} \implies \sigma^2 \le 13.3168$$
Thus, the 95% upper-confidence interval of the population standard deviation is (0, 3.6492).
(iii) The 95% lower-confidence interval is given by
$$\frac{(n-1)s^2}{\chi^2_{\alpha,n-1}} \le \sigma^2 \implies 3.6944 \le \sigma^2$$
Thus, the 95% lower-confidence interval of the population standard deviation is $(1.9221, +\infty)$.
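The text gives no MATLAB snippet for this example; a minimal sketch of the same computation, using only the summary statistics above (variable names are illustrative), is:
MATLAB code
>> n = 15; s = 2.5; alpha = 1-0.95;
>> LCvar = (n-1)*s^2/icdf('chi2',1-alpha/2,n-1);  % lower limit for sigma^2
>> UCvar = (n-1)*s^2/icdf('chi2',alpha/2,n-1);    % upper limit for sigma^2
>> [sqrt(LCvar) sqrt(UCvar)]                      % CI for sigma, approx (1.8303, 3.9427)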
4.2.4 CONFIDENCE INTERVAL ON THE POPULATION PROPORTION
Consider a population of items, each of which independently meets certain standards with some unknown probability $p$, and suppose that a sample of size $n$ was taken from this population. If $X$ denotes the number of the $n$ items that meet the standards, then $X \sim \text{bino}(n, p)$. Thus, for $n$ large and both $np \ge 5$ and $n(1-p) \ge 5$, it follows that
$$\frac{X - np}{\sqrt{np(1-p)}} \mathrel{\dot\sim} N(0, 1)$$
where $\dot\sim$ means "is approximately distributed as". Since $\hat{p} = X/n$ is a point estimate of $p$, it follows that $\sqrt{n\hat{p}(1-\hat{p})} \approx \sqrt{np(1-p)}$ and
$$\frac{n\hat{p} - np}{\sqrt{n\hat{p}(1-\hat{p})}} \mathrel{\dot\sim} N(0, 1)$$
Thus,
$$P\left(-z_{\alpha/2} \le \frac{n\hat{p} - np}{\sqrt{n\hat{p}(1-\hat{p})}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
or, equivalently,
$$P\left(\hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\right) \approx 1 - \alpha$$
Definition 4.5 If $\hat{p}$ is the proportion of observations in a random sample of size $n$, then
A $100(1-\alpha)\%$ confidence interval on $p$ is given by
$$\hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \qquad (4.2.11)$$
A $100(1-\alpha)\%$ upper-confidence interval on $p$ is given by
$$p \le \hat{p} + z_{\alpha}\sqrt{\hat{p}(1-\hat{p})/n} \qquad (4.2.12)$$
A $100(1-\alpha)\%$ lower-confidence interval on $p$ is given by
$$\hat{p} - z_{\alpha}\sqrt{\hat{p}(1-\hat{p})/n} \le p \qquad (4.2.13)$$
Example 4.7 The fraction of defective integrated circuits produced in a photolithography process is being studied. A random sample of 300 circuits is tested, revealing 13 defectives. Find a 95% two-sided confidence interval on the fraction of defective circuits produced by this particular tool.
Solution: From the given information, we have $n = 300$ and $x = 13$. Thus, the point estimate of the proportion is $\hat{p} = x/n = 13/300 = 0.0433$. Since both $n\hat{p} = 13 \ge 5$ and $n(1-\hat{p}) = 287 \ge 5$ are satisfied, $\hat{p}$ is approximately normal. For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$ and $z_{\alpha/2} = z_{0.025} = 1.96$. Thus, the 95% confidence interval is given by
$$\hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \implies 0.02 \le p \le 0.07$$
We can be 95% sure that the fraction of defective circuits will be between 0.02 and 0.07. In other words, the probability for the proportion to be between 0.02 and 0.07 will be 0.95.
MATLAB code
>> n = 300; x = 13; phat = x/n; alpha = 1-0.95;
>> zalpha2 = icdf('norm',1-alpha/2,0,1);
>> LC = phat - zalpha2*sqrt(phat*(1-phat)/n)
>> UC = phat + zalpha2*sqrt(phat*(1-phat)/n)
Example 4.8 A random sample of 400 computer chips is taken from a large lot of chips and 50 of them are found defective. Find a 90% two-sided confidence interval for the proportion of defective chips contained in the lot.
Solution: From the given information, we have $n = 400$ and $x = 50$. Thus, the point estimate of the proportion of defective chips contained in the lot is $\hat{p} = x/n = 50/400 = 0.125$. Since both $n\hat{p} = 50 \ge 5$ and $n(1-\hat{p}) = 350 \ge 5$ are satisfied, $\hat{p}$ is approximately normal. For a 90% confidence level, we have $\alpha = 1 - 0.9 = 0.1$ and $z_{\alpha/2} = z_{0.05} = 1.6449$. The 90% confidence interval is given by
$$\hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \implies 0.0978 \le p \le 0.1522$$
We can be 90% sure that the proportion of defective chips contained in the lot will be between 0.0978 and 0.1522.
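The original provides no code for this example; a minimal sketch mirroring the snippet of Example 4.7 (illustrative variable names) is:
MATLAB code
>> n = 400; x = 50; phat = x/n; alpha = 1-0.90;
>> zalpha2 = icdf('norm',1-alpha/2,0,1);
>> LC = phat - zalpha2*sqrt(phat*(1-phat)/n)   % approx 0.0978
>> UC = phat + zalpha2*sqrt(phat*(1-phat)/n)   % approx 0.1522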
4.2.5 CONFIDENCE INTERVAL ON THE DIFFERENCE IN POPULATIONS MEANS
Just as in the analysis of a single population, to estimate the difference between two populations the researcher would
draw samples from each population.
Assume that $X_1, \ldots, X_{n_1} \sim N(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_{n_2} \sim N(\mu_2, \sigma_2^2)$ are two independent samples from two independent normal populations, as depicted in Figure 4.5. Let $\bar{X}$, $\bar{Y}$, $S_1^2$ and $S_2^2$ be the sample means and sample variances, respectively. Then,
$$\bar{X} \sim N\!\left(\mu_1, \frac{\sigma_1^2}{n_1}\right) \text{ and } \bar{Y} \sim N\!\left(\mu_2, \frac{\sigma_2^2}{n_2}\right) \implies \bar{X} - \bar{Y} \sim N\!\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$$
To construct the confidence interval on the difference in means $\mu_1 - \mu_2$, we consider two particular cases:

FIGURE 4.5: Illustration of two independent distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$.
Case 1: If $\sigma_1$ and $\sigma_2$ are known, then the statistic
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \sim N(0, 1)$$
Thus,
$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \le z_{\alpha/2}\right) = 1 - \alpha$$
or, equivalently,
$$P\left(\bar{X} - \bar{Y} - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{X} - \bar{Y} + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha$$
Definition 4.6 Let $\bar{x}$ and $\bar{y}$ be the observed sample means of independent random samples of sizes $n_1$ and $n_2$ from two independent normal populations with known variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Then,
A $100(1-\alpha)\%$ confidence interval on $\mu_1 - \mu_2$ is given by
$$\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \qquad (4.2.14)$$
A $100(1-\alpha)\%$ upper-confidence interval on $\mu_1 - \mu_2$ is given by
$$\mu_1 - \mu_2 \le \bar{x} - \bar{y} + z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \qquad (4.2.15)$$
A $100(1-\alpha)\%$ lower-confidence interval on $\mu_1 - \mu_2$ is given by
$$\bar{x} - \bar{y} - z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \qquad (4.2.16)$$
Example 4.9 The variances of two populations 1 and 2 are 16 and 9, respectively. A sample of 25 items was taken from Population 1 with a mean of 50, and a sample of 22 items was taken from Population 2 with a mean of 45. Construct a 95% two-sided confidence interval for the difference in population means.
Solution: We have $n_1 = 25$, $n_2 = 22$, $\bar{x} = 50$, $\bar{y} = 45$, $\sigma_1 = 4$, and $\sigma_2 = 3$. For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$ and $z_{\alpha/2} = z_{0.025} = 1.96$. Thus, the 95% two-sided confidence interval for the difference in population means is given by
$$\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \implies 2.99 \le \mu_1 - \mu_2 \le 7.01.$$
Hence, the probability for the difference in population means to be between 2.99 and 7.01 will be 0.95.
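No MATLAB snippet accompanies this example in the text; a minimal sketch of the computation from the given summary values (illustrative variable names) is:
MATLAB code
>> n1 = 25; n2 = 22; xbar = 50; ybar = 45;
>> sigma1 = 4; sigma2 = 3; alpha = 1-0.95;
>> zalpha2 = icdf('norm',1-alpha/2,0,1);
>> se = sqrt(sigma1^2/n1 + sigma2^2/n2);
>> LC = (xbar-ybar) - zalpha2*se   % approx 2.99
>> UC = (xbar-ybar) + zalpha2*se   % approx 7.01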
Case 2: If $\sigma_1$ and $\sigma_2$ are unknown and equal to $\sigma$, then
$$\bar{X} \sim N\!\left(\mu_1, \frac{\sigma^2}{n_1}\right) \text{ and } \bar{Y} \sim N\!\left(\mu_2, \frac{\sigma^2}{n_2}\right) \implies \bar{X} - \bar{Y} \sim N\!\left(\mu_1 - \mu_2, \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}\right)$$
and
$$(n_1 - 1)\frac{S_1^2}{\sigma^2} \sim \chi^2(n_1 - 1) \text{ and } (n_2 - 1)\frac{S_2^2}{\sigma^2} \sim \chi^2(n_2 - 1) \implies (n_1 - 1)\frac{S_1^2}{\sigma^2} + (n_2 - 1)\frac{S_2^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2)$$
or, equivalently,
$$(n_1 + n_2 - 2)\frac{S_p^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2), \quad \text{where } S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
is called the pooled variance.
Since $Z = \dfrac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma^2/n_1 + \sigma^2/n_2}} \sim N(0, 1)$ and $V = (n_1 + n_2 - 2)\dfrac{S_p^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2)$, it follows that
$$\frac{Z}{\sqrt{V/(n_1 + n_2 - 2)}} = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S_p\sqrt{1/n_1 + 1/n_2}} \sim t(n_1 + n_2 - 2)$$
Therefore,
$$P\left(-t_{\alpha/2,n_1+n_2-2} \le \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S_p\sqrt{1/n_1 + 1/n_2}} \le t_{\alpha/2,n_1+n_2-2}\right) = 1 - \alpha$$
Definition 4.7 Let $\bar{x}$ and $\bar{y}$ be the observed sample means of independent random samples of sizes $n_1$ and $n_2$ from two independent normal populations with unknown variances $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Then,
A $100(1-\alpha)\%$ confidence interval on $\mu_1 - \mu_2$ is given by
$$\bar{x} - \bar{y} - t_{\alpha/2,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{\alpha/2,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \qquad (4.2.17)$$
A $100(1-\alpha)\%$ upper-confidence interval on $\mu_1 - \mu_2$ is given by
$$\mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{\alpha,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \qquad (4.2.18)$$
A $100(1-\alpha)\%$ lower-confidence interval on $\mu_1 - \mu_2$ is given by
$$\bar{x} - \bar{y} - t_{\alpha,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \qquad (4.2.19)$$
where $s_p$ is the observed pooled standard deviation.
Example 4.10 The variances of two populations are assumed to be equal. A sample of 15 items was taken from Population I with a mean of 50 and a standard deviation of 3, and a sample of 19 items was taken from Population II with a mean of 47 and a standard deviation of 2.
(i) Calculate the pooled sample variance.
(ii) Construct a 95% two-sided confidence interval for the difference between the two population means.
Solution: From the given information, we have $n_1 = 15$, $n_2 = 19$, $\bar{x} = 50$, $\bar{y} = 47$, $s_1 = 3$, and $s_2 = 2$.
(i) The value of the pooled sample variance is given by
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(15 - 1)(3^2) + (19 - 1)(2^2)}{15 + 19 - 2} = 6.19.$$
(ii) For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$ and $t_{\alpha/2,n_1+n_2-2} = t_{0.025,32} = 2.04$. Thus, the 95% two-sided confidence interval for the difference between the two population means is given by
$$\bar{x} - \bar{y} - t_{\alpha/2,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{\alpha/2,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \implies 1.25 \le \mu_1 - \mu_2 \le 4.75.$$
Hence, a 95% two-sided confidence interval for $\mu_1 - \mu_2$ is (1.25, 4.75).
MATLAB code
>> n1 = 15; n2 = 19; alpha = 0.05;
>> talpha2 = icdf('t',1-alpha/2,n1+n2-2)
Example 4.11 A pharmaceutical company sets two machines to fill 15 oz bottles with cough syrup. Two random samples of $n_1 = 16$ bottles from machine 1 and $n_2 = 12$ bottles from machine 2 are selected. The two samples yield the following sample statistics:
$$\bar{X} = 15.24, \quad S_1^2 = 0.64 \qquad \bar{Y} = 14.96, \quad S_2^2 = 0.36$$
(i) Calculate the pooled sample variance.
(ii) Construct a 95% two-sided confidence interval for the mean difference of the amount of cough syrup filled in bottles by the two machines.
Solution: From the given information, we have $n_1 = 16$, $n_2 = 12$, $\bar{x} = 15.24$, $\bar{y} = 14.96$, $s_1^2 = 0.64$, and $s_2^2 = 0.36$.
(i) The value of the pooled sample variance is given by
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(16 - 1)(0.64) + (12 - 1)(0.36)}{16 + 12 - 2} = 0.5215$$
(ii) For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$ and $t_{\alpha/2,n_1+n_2-2} = t_{0.025,26} = 2.0555$. Thus, the 95% two-sided confidence interval for the difference between the two population means is given by
$$\bar{x} - \bar{y} - t_{\alpha/2,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{\alpha/2,n_1+n_2-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \implies -0.2869 \le \mu_1 - \mu_2 \le 0.8469.$$
Hence, a 95% two-sided confidence interval for $\mu_1 - \mu_2$ is $(-0.2869, 0.8469)$.
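The text gives no code for this example; a minimal sketch of the full interval computation from the summary statistics above (illustrative variable names) is:
MATLAB code
>> n1 = 16; n2 = 12; xbar = 15.24; ybar = 14.96;
>> s1sq = 0.64; s2sq = 0.36; alpha = 1-0.95;
>> sp = sqrt(((n1-1)*s1sq + (n2-1)*s2sq)/(n1+n2-2));   % pooled standard deviation
>> talpha2 = icdf('t',1-alpha/2,n1+n2-2);
>> LC = (xbar-ybar) - talpha2*sp*sqrt(1/n1+1/n2)       % approx -0.2869
>> UC = (xbar-ybar) + talpha2*sp*sqrt(1/n1+1/n2)       % approx  0.8469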
4.2.6 CONFIDENCE INTERVAL ON THE RATIO OF VARIANCES
Assume that $X_1, \ldots, X_{n_1} \sim N(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_{n_2} \sim N(\mu_2, \sigma_2^2)$ are two independent samples from two independent normal populations. Let $S_1^2$ and $S_2^2$ be the sample variances, respectively. Then the statistic
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1)$$
Denote by $f_{\alpha/2,n_1-1,n_2-1}$ and $f_{1-\alpha/2,n_1-1,n_2-1}$ the upper and lower $\alpha/2$ percentage points of the $F(n_1 - 1, n_2 - 1)$ distribution, as shown in Figure 4.6.

FIGURE 4.6: Illustration of confidence interval on the ratio of variances.
Thus,
$$P\left(f_{1-\alpha/2,n_1-1,n_2-1} \le \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \le f_{\alpha/2,n_1-1,n_2-1}\right) = 1 - \alpha$$
or, equivalently,
$$P\left(f_{1-\alpha/2,n_1-1,n_2-1}\frac{S_1^2}{S_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le f_{\alpha/2,n_1-1,n_2-1}\frac{S_1^2}{S_2^2}\right) = 1 - \alpha$$
Definition 4.8 Let $s_1^2$ and $s_2^2$ be the observed sample variances of two independent random samples of sizes $n_1$ and $n_2$ from two independent normal populations with unknown variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Then,
A $100(1-\alpha)\%$ confidence interval on $\sigma_1^2/\sigma_2^2$ is given by
$$f_{1-\alpha/2,n_1-1,n_2-1}\frac{s_1^2}{s_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le f_{\alpha/2,n_1-1,n_2-1}\frac{s_1^2}{s_2^2} \qquad (4.2.20)$$
A $100(1-\alpha)\%$ upper-confidence interval on $\sigma_1^2/\sigma_2^2$ is given by
$$\frac{\sigma_1^2}{\sigma_2^2} \le f_{\alpha,n_1-1,n_2-1}\frac{s_1^2}{s_2^2} \qquad (4.2.21)$$
A $100(1-\alpha)\%$ lower-confidence interval on $\sigma_1^2/\sigma_2^2$ is given by
$$f_{1-\alpha,n_1-1,n_2-1}\frac{s_1^2}{s_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \qquad (4.2.22)$$
Example 4.12 The variances of two populations are assumed to be equal. A sample of 15 items was taken from Population I with a standard deviation of 3, and a sample of 19 items was taken from Population II with a standard deviation of 2. Construct a 95% two-sided confidence interval for the ratio of variances.
Solution: We have $n_1 = 15$, $n_2 = 19$, $s_1 = 3$, and $s_2 = 2$. For a 95% confidence level, we have $\alpha = 1 - 0.95 = 0.05$, $f_{\alpha/2,n_1-1,n_2-1} = f_{0.025,14,18} = 2.70$, and $f_{1-\alpha/2,n_1-1,n_2-1} = 0.35$. Thus, the 95% two-sided confidence interval for the ratio of variances is given by
$$f_{1-\alpha/2,n_1-1,n_2-1}\frac{s_1^2}{s_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le f_{\alpha/2,n_1-1,n_2-1}\frac{s_1^2}{s_2^2} \implies 0.78 \le \frac{\sigma_1^2}{\sigma_2^2} \le 6.07.$$
MATLAB code
>> n1 = 15; n2 = 19; alpha = 0.05;
>> falpha1 = icdf('f',1-alpha/2,n1-1,n2-1)  % f_{alpha/2,n1-1,n2-1}
>> falpha2 = icdf('f',alpha/2,n1-1,n2-1)    % f_{1-alpha/2,n1-1,n2-1}
4.3 HYPOTHESIS TESTING
A statistical hypothesis is a statement or claim about a set of parameters of one or more populations. It is called a hypothesis because it is not known whether or not it is true. A hypothesis test is the decision-making procedure about the hypothesis. In statistical terms, the hypothesis that we try to establish is called the alternative hypothesis $H_1$, while its contradiction is called the null hypothesis $H_0$.
Consider a population with unknown parameter $\theta$. Basically, there are three ways to set up the null and alternative hypotheses:
1. Two-tailed hypothesis test: $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$ (4.3.1)
2. Upper-tailed hypothesis test: $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$ (4.3.2)
3. Lower-tailed hypothesis test: $H_0: \theta = \theta_0$ versus $H_1: \theta < \theta_0$ (4.3.3)
The two-tailed test is also called a two-sided test, whereas the upper- and lower-tailed tests are also referred to as one-sided tests.
Example 4.13 A manufacturer of a certain brand of rice cereal claims that the average saturated fat content $\mu$ does not exceed 1.5 grams per serving. State the null and alternative hypotheses to be used in testing this claim.
Solution: The manufacturer's claim should be rejected only if $\mu$ is greater than 1.5 grams and should not be rejected if $\mu$ is less than or equal to 1.5 grams. Thus, we test
$$H_0: \mu = 1.5 \quad \text{versus} \quad H_1: \mu > 1.5.$$
Example 4.14 A real estate agent claims that 60% of all private residences being built today are 3-bedroom homes. To test this claim, a large sample of new residences is inspected; the proportion of these homes with 3 bedrooms is recorded and used as the test statistic. State the null and alternative hypotheses to be used in this test.
Solution: If the test statistic were substantially higher or lower than $p = 0.6$, we would reject the agent's claim. Hence, we should make the hypotheses:
$$H_0: p = 0.6 \quad \text{versus} \quad H_1: p \neq 0.6$$
The alternative hypothesis implies a two-tailed test with the critical region divided equally in both tails of the distribution of the test statistic.
The goal of any hypothesis test is to make a decision; in particular, we will decide whether to reject the null hypothesis in favor of the alternative hypothesis $H_1$. Although we would like to be able to always make a correct decision, we must remember that the decision will be based on the sample data. When a test is done, there are four possible outcomes, as summarized in Table 4.1.

                       $H_0$ is true        $H_0$ is false
Reject $H_0$           Type I Error         Correct Decision
Do not reject $H_0$    Correct Decision     Type II Error

TABLE 4.1: Possible outcomes for a hypothesis test.
From Table 4.1, we can observe that there are two ways of making a mistake when doing a hypothesis test. Thus, we may make one of the following two types of errors:
Type I error: the error of rejecting $H_0$ when it is true. The probability of making a Type I error, denoted by $\alpha$, is given by
$$\alpha = P(\text{Type I error}) = P(\text{Reject } H_0 \text{ when } H_0 \text{ is true})$$
Type II error: the error of accepting $H_0$ when it is false. The probability of making a Type II error, denoted by $\beta$, is given by
$$\beta = P(\text{Type II error}) = P(\text{Accept } H_0 \text{ when } H_0 \text{ is false})$$
The Type I error and Type II error are related. A decrease in the probability of one generally results in an increase in the probability of the other. The probability $\alpha$ is also called the significance level for the hypothesis test. If the significance level is fixed, then the rejection of $H_0$ is done with a fixed degree of confidence in the decision. Because we specify the level of significance before performing the hypothesis test, we basically control the risk of making a Type I error. Typical values for $\alpha$ are 0.1, 0.05, and 0.01. For example, if $\alpha = 0.1$ for a test and the null hypothesis is rejected, then one will be 90% certain that this is the correct decision.
After the hypotheses are stated, the next step is to design the study. An appropriate statistical test will be selected, the level of significance will be chosen, and a plan to conduct the study will be formulated. To make an inference for the study, the statistical test and level of significance are used. Once the level of significance is selected, a critical value for the appropriate test is selected from a table in the Appendix.
A hypothesis testing procedure consists of four main steps:
Step 1: Specify the null and alternative hypotheses, $H_0$ and $H_1$, and the significance level $\alpha$
Step 2: Determine an appropriate test statistic and compute its value using the sample data
Step 3: Specify the rejection region
Step 4: Make the appropriate conclusion by deciding whether $H_0$ should be rejected.
4.3.1 TESTS ON THE MEAN OF A NORMAL POPULATION WITH KNOWN VARIANCE
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a $N(\mu, \sigma^2)$ population with unknown mean $\mu$ and known variance $\sigma^2$. In testing the population mean $\mu$, there are three ways to structure the hypothesis test:
Two-tailed: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$
Upper-tailed: $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$
Lower-tailed: $H_0: \mu = \mu_0$ versus $H_1: \mu < \mu_0$
Recall that the sample mean $\bar{X} \sim N(\mu, \sigma^2/n)$. Under the assumption that the null hypothesis is true (i.e. $H_0: \mu = \mu_0$), it follows that the test statistic
$$Z_0 = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$$
has a standard normal distribution. Thus, this hypothesis test is called the z-test.
A critical region or rejection region is the set of all values such that the null hypothesis is rejected. We can then determine a critical region based on the computed test statistic.
Let $z_0$ be the numerical value, calculated from the sample, of the test statistic $Z_0$. Then, for a selected significance level $\alpha$, the critical regions are (see Figure 4.7):
Two-tailed: if $z_0 < -z_{\alpha/2}$ or $z_0 > z_{\alpha/2}$, reject $H_0$
Upper-tailed: if $z_0 > z_{\alpha}$, reject $H_0$
Lower-tailed: if $z_0 < -z_{\alpha}$, reject $H_0$

FIGURE 4.7: Critical regions for the z-test alternative hypothesis: (a) $\mu \neq \mu_0$; (b) $\mu > \mu_0$; (c) $\mu < \mu_0$.
Usually, $\alpha$ is specified in advance before any samples are drawn so that the results will not influence the choice for the level of significance. To conclude a statistical test, we compare our $\alpha$ value with the p-value, which is the probability of observing the given sample result under the assumption that the null hypothesis is true. The p-value is computed using the sample data and the sampling distribution. If the p-value is less than the significance level $\alpha$, then we reject the null hypothesis. For example, if $\alpha = 0.05$ and the p-value is 0.03, then we reject the null hypothesis. The converse is not true. If the p-value is greater than $\alpha$, then we have insufficient evidence to reject the null hypothesis.
If p-value $\le \alpha$, we reject the null hypothesis and say the data are statistically significant at the level $\alpha$.
If p-value $> \alpha$, we do not reject the null hypothesis.
Denote by $\Phi(\cdot)$ the cdf of the $N(0, 1)$ distribution. Then, the z-test may be summarized as follows:
Hypotheses:
Two-tailed: $H_0: \mu = \mu_0$, $H_1: \mu \neq \mu_0$
Upper-tailed: $H_0: \mu = \mu_0$, $H_1: \mu > \mu_0$
Lower-tailed: $H_0: \mu = \mu_0$, $H_1: \mu < \mu_0$
Test Statistic (z-test): $Z_0 = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Critical Regions:
Two-tailed: $|z_0| > z_{\alpha/2}$; Upper-tailed: $z_0 > z_{\alpha}$; Lower-tailed: $z_0 < -z_{\alpha}$
p-values:
Two-tailed: $2[1 - \Phi(|z_0|)]$; Upper-tailed: $1 - \Phi(z_0)$; Lower-tailed: $\Phi(z_0)$
Example 4.15 The CEO of a large financial corporation claims that the average distance that commuting employees travel to work is 32 km. The commuting employees feel otherwise. A sample of 64 employees was randomly selected and yielded a mean of 35 km. Assuming a population standard deviation of 5 km,
(i) Test the CEO's claim at the 5% level of significance.
(ii) Calculate the p-value for this test.
(iii) Test the CEO's claim using a confidence interval with a 95% confidence coefficient.
Solution: From the given information, we have $n = 64$, $\bar{x} = 35$, $\mu_0 = 32$, $\sigma = 5$, and $\alpha = 0.05$.
(i) Step 1: This is a two-tailed test, since the employees feel that the CEO's claim is not correct, but whether they feel that the average distance is less than 32 km or more than 32 km is not specified. Thus,
$$H_0: \mu = 32 \quad \text{versus} \quad H_1: \mu \neq 32$$
Step 2: Since $\sigma$ is known, the appropriate test statistic is the z-test and its value is given by
$$z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{35 - 32}{5/\sqrt{64}} = 4.8$$
Step 3: The rejection region is $|z_0| > z_{\alpha/2}$, where $z_{\alpha/2} = z_{0.025} = 1.96$.
Step 4: Since $|z_0| = 4.8 > z_{\alpha/2} = 1.96$, we reject the null hypothesis $H_0$. There is sufficient sample evidence to refute the CEO's claim. The sample evidence supports the employees' claim that the average distance commuting employees travel to work is not equal to 32 km at the 5 percent level of significance. That is, there is a significant difference between the sample mean and the postulated value of the population mean of 32 km.
(ii) The p-value is equal to $2[1 - \Phi(|z_0|)] = 2[1 - \Phi(4.8)] \approx 0$, which is less than the significance level $\alpha = 0.05$; therefore we reject the null hypothesis.
(iii) A confidence interval with a 95% confidence coefficient implies $\alpha = 0.05$. The 95% confidence interval is given by
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \implies 35 - 1.96\frac{5}{\sqrt{64}} \le \mu \le 35 + 1.96\frac{5}{\sqrt{64}} \implies 33.775 \le \mu \le 36.225$$
This interval clearly does not contain 32, the value of $\mu$ under the null hypothesis. Thus, we reject the null hypothesis.
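The text gives no MATLAB snippet for this z-test; a minimal sketch using the summary statistics above (illustrative variable names) is:
MATLAB code
>> n = 64; xbar = 35; mu0 = 32; sigma = 5; alpha = 0.05;
>> z0 = (xbar - mu0)/(sigma/sqrt(n))         % test statistic, 4.8
>> zcrit = icdf('norm',1-alpha/2,0,1)        % critical value, 1.96
>> pval = 2*(1 - cdf('norm',abs(z0),0,1))    % two-tailed p-value, approx 0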
Example 4.16 A random sample of 36 pieces of copper wire produced in a plant of a wire manufacturing company yields a mean tensile strength of 950 psi. Suppose that the population of tensile strengths of all copper wires produced in that plant is distributed with mean $\mu$ and standard deviation $\sigma = 120$ psi. Test the statistical hypothesis
$$H_0: \mu = 980 \quad \text{versus} \quad H_1: \mu < 980$$
at the 1% level of significance. Then, calculate the p-value.
Solution: From the given information, we have $n = 36$, $\bar{x} = 950$, $\mu_0 = 980$, $\sigma = 120$, and $\alpha = 0.01$.
Step 1: This is a lower-tailed test. Thus,
$$H_0: \mu = 980 \quad \text{versus} \quad H_1: \mu < 980$$
Step 2: Since $\sigma$ is known, the appropriate test statistic is the z-test and its value is given by
$$z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{950 - 980}{120/\sqrt{36}} = -1.5$$
Step 3: The rejection region is $z_0 < -z_{\alpha}$, where $z_{\alpha} = z_{0.01} = 2.3263$.
Step 4: Since $z_0 = -1.5 \ge -z_{\alpha} = -2.3263$, we do not reject the null hypothesis $H_0$.
The p-value is equal to $\Phi(z_0) = \Phi(-1.5) = 1 - \Phi(1.5) = 0.0668$, which is greater than the significance level $\alpha = 0.01$; therefore, there is insufficient evidence to reject the null hypothesis.
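A minimal MATLAB sketch of this lower-tailed z-test, not part of the original text (illustrative variable names), is:
MATLAB code
>> n = 36; xbar = 950; mu0 = 980; sigma = 120; alpha = 0.01;
>> z0 = (xbar - mu0)/(sigma/sqrt(n))      % test statistic, -1.5
>> zcrit = -icdf('norm',1-alpha,0,1)      % lower critical value, -2.3263
>> pval = cdf('norm',z0,0,1)              % lower-tailed p-value, 0.0668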
4.3.2 TESTS ON THE MEAN OF A NORMAL POPULATION WITH UNKNOWN VARIANCE
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a $N(\mu, \sigma^2)$ population with unknown mean $\mu$ and unknown variance $\sigma^2$. In testing the population mean $\mu$, there are three ways to structure the hypothesis test:
Two-tailed: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$
Upper-tailed: $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$
Lower-tailed: $H_0: \mu = \mu_0$ versus $H_1: \mu < \mu_0$
Under the assumption that the null hypothesis is true (i.e. $H_0: \mu = \mu_0$), it follows that the test statistic
$$T_0 = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t(n-1)$$
has a $t$-distribution with $n-1$ degrees of freedom. Thus, this hypothesis test is called the t-test. When the population standard deviation is not known, we typically use the t-test either (i) when the sample size is large (i.e. $n \ge 30$) or (ii) when the sample size is small (i.e. $n < 30$) and the population from which the sample is selected is approximately normal.
A critical region or rejection region is the set of all values such that the null hypothesis is rejected. Let $t_0$ be the numerical value, calculated from the sample, of the test statistic $T_0$. Then, for a selected significance level $\alpha$, the critical regions are (see Figure 4.8):
Two-tailed: if $t_0 < -t_{\alpha/2,n-1}$ or $t_0 > t_{\alpha/2,n-1}$, reject $H_0$
Upper-tailed: if $t_0 > t_{\alpha,n-1}$, reject $H_0$
Lower-tailed: if $t_0 < -t_{\alpha,n-1}$, reject $H_0$

FIGURE 4.8: Critical regions for the t-test alternative hypothesis: (a) $\mu \neq \mu_0$; (b) $\mu > \mu_0$; (c) $\mu < \mu_0$.

Denote by $F(\cdot)$ the cdf of the $t(n-1)$ distribution. Then, the t-test may be summarized as follows:
Hypotheses:
Two-tailed: $H_0: \mu = \mu_0$, $H_1: \mu \neq \mu_0$
Upper-tailed: $H_0: \mu = \mu_0$, $H_1: \mu > \mu_0$
Lower-tailed: $H_0: \mu = \mu_0$, $H_1: \mu < \mu_0$
Test Statistic (t-test): $T_0 = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$
Critical Regions:
Two-tailed: $|t_0| > t_{\alpha/2,n-1}$; Upper-tailed: $t_0 > t_{\alpha,n-1}$; Lower-tailed: $t_0 < -t_{\alpha,n-1}$
p-values:
Two-tailed: $2[1 - F(|t_0|)]$; Upper-tailed: $1 - F(t_0)$; Lower-tailed: $F(t_0)$
Example 4.17 The Atlas Electric Institute has published figures on the number of kilowatt hours used annually by various home appliances. It is claimed that a vacuum cleaner uses an average of 46 kilowatt hours per year. If a random sample of 12 homes included in a planned study indicates that vacuum cleaners use an average of 42 kilowatt hours per year with a standard deviation of 11.9 kilowatt hours, does this suggest at the 0.05 level of significance that vacuum cleaners use, on average, less than 46 kilowatt hours annually? Then, calculate the p-value. Assume the population of kilowatt hours to be normal.
Solution: From the information given, we have $n = 12$, $\bar{x} = 42$, $\mu_0 = 46$, $s = 11.9$, and $\alpha = 0.05$.
Step 1: The hypotheses are:
$$H_0: \mu = 46 \quad \text{versus} \quad H_1: \mu < 46$$
Step 2: Since $\sigma$ is unknown and the population of kilowatt hours is assumed to be normally distributed, the appropriate test statistic is the t-test and its value is given by
$$t_0 = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{42 - 46}{11.9/\sqrt{12}} = -1.1644$$
Step 3: The rejection region is $t_0 < -t_{\alpha,n-1}$, where $t_{\alpha,n-1} = t_{0.05,11} = 1.7959$.
Step 4: Since $t_0 = -1.1644 \ge -t_{\alpha,n-1} = -1.7959$, we do not reject the null hypothesis $H_0$. We conclude that the average number of kilowatt hours used annually by home vacuum cleaners is not significantly less than 46.
The p-value is equal to $F(t_0) = F(-1.1644) = 0.1344$, which is greater than the significance level $\alpha = 0.05$; therefore we have insufficient evidence to reject the null hypothesis.
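The text gives no code here; a minimal MATLAB sketch of this lower-tailed t-test from the summary statistics (illustrative variable names) is:
MATLAB code
>> n = 12; xbar = 42; mu0 = 46; s = 11.9; alpha = 0.05;
>> t0 = (xbar - mu0)/(s/sqrt(n))       % test statistic, -1.1644
>> tcrit = -icdf('t',1-alpha,n-1)      % lower critical value, -1.7959
>> pval = cdf('t',t0,n-1)              % lower-tailed p-value, 0.1344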
Example 4.18 Grand Auto Corporation produces auto batteries. The company claims that its top-of-the-line Never Die batteries are good, on average, for at least 65 months. A consumer protection agency tested 45 such batteries to check this claim. It found that the mean life of these 45 batteries is 63.4 months and the standard deviation is 3 months. Using a 1% significance level, can you conclude that the company's claim is true? Then, calculate the p-value. Assume the population of battery lives to be normal.
Solution: From the given information, we have $n = 45$, $\bar{x} = 63.4$, $\mu_0 = 65$, $s = 3$, and $\alpha = 0.01$.
Step 1: This is a lower-tailed test:
$$H_0: \mu \ge 65 \text{ (The mean life of batteries is at least 65 months)}$$
$$H_1: \mu < 65 \text{ (The mean life of batteries is less than 65 months)}$$
Step 2: Since $\sigma$ is unknown and $n \ge 30$, the appropriate test statistic is the t-test and its value is given by
$$t_0 = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{63.4 - 65}{3/\sqrt{45}} = -3.5777$$
Step 3: The rejection region is $t_0 < -t_{\alpha,n-1}$, where $t_{\alpha,n-1} = t_{0.01,44} = 2.4141$.
Step 4: Since $t_0 = -3.5777 < -t_{\alpha,n-1} = -2.4141$, we reject the null hypothesis $H_0$. We conclude that the mean life of such batteries is less than 65 months.
The p-value is equal to $F(t_0) = F(-3.5777) \approx 0$, which is less than the significance level $\alpha = 0.01$; therefore we reject the null hypothesis.
Example 4.19 A tool assembling company believes that a worker should take no more than 30 minutes to assemble a particular tool. A sample of 16 workers who assembled that tool showed that the average time was 33 minutes with a sample standard deviation of 6 minutes. Test at the 5% level of significance if the data provide sufficient evidence to indicate the validity of the company's belief. Then, calculate the p-value. Assume that the assembly times are normally distributed.
Solution: From the information given, we have $n = 16$, $\bar{x} = 33$, $\mu_0 = 30$, $s = 6$, and $\alpha = 0.05$.
Step 1: This is an upper-tailed test:
$$H_0: \mu \le 30 \text{ (The mean assembly time is no more than 30 minutes)}$$
$$H_1: \mu > 30 \text{ (The mean assembly time is more than 30 minutes)}$$
Step 2: Since $\sigma$ is unknown and the assembly times are assumed to be normally distributed, the appropriate test statistic is the t-test and its value is given by
$$t_0 = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{33 - 30}{6/\sqrt{16}} = 2$$
Step 3: The rejection region is $t_0 > t_{\alpha,n-1}$, where $t_{\alpha,n-1} = t_{0.05,15} = 1.7531$.
Step 4: Since $t_0 = 2 > t_{\alpha,n-1} = 1.7531$, we reject the null hypothesis $H_0$. We conclude that the mean assembly time is more than 30 minutes.
The p-value is equal to $1 - F(t_0) = 1 - F(2) = 0.032$, which is less than the significance level $\alpha = 0.05$; therefore we reject the null hypothesis.
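A minimal MATLAB sketch of this upper-tailed t-test, not part of the original text (illustrative variable names), is:
MATLAB code
>> n = 16; xbar = 33; mu0 = 30; s = 6; alpha = 0.05;
>> t0 = (xbar - mu0)/(s/sqrt(n))       % test statistic, 2
>> tcrit = icdf('t',1-alpha,n-1)       % upper critical value, 1.7531
>> pval = 1 - cdf('t',t0,n-1)          % upper-tailed p-value, approx 0.032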
Example 4.20 A city health department wishes to determine if the mean bacteria count per unit volume of water at a lake beach is within the safety level of 200. A researcher collected 10 water samples of unit volume and found the bacteria counts to be
175 190 205 193 184 207 204 193 196 180
(i) Check the assumption of normality for the bacteria counts.
(ii) Do the data strongly indicate that there is no cause for concern at the 5% level of significance?
Solution: From the given information, we have $n = 10$, $\bar{x} = 192.7$, $s = 10.812$, $\mu_0 = 200$, and $\alpha = 0.05$.
(i) Because the sample size is small, we must be willing to assume that the population distribution of bacteria counts is normally distributed. As shown in Figure 4.9, the normal probability plot and boxplot indicate that the measurements constitute a sample from a normal population. The normal probability plot appears to be reasonably straight. Although the boxplot is not perfectly symmetric, it is not too skewed and there are no outliers.
MATLAB code
>> X = [175 190 205 193 184 207 204 193 196 180];
>> subplot(1,2,1); normplot(X); subplot(1,2,2); boxplot(X);
FIGURE 4.9: Normal probability and box plots for the bacteria counts.
(ii) Step 1: Let $\mu$ denote the current (population) mean bacteria count per unit volume of water. Then, the statement "no cause for concern" translates to $\mu < 200$, and the researcher is seeking strong evidence in support of this hypothesis. So the formulation of the null and alternative hypotheses should be
$$H_0: \mu = 200 \quad \text{versus} \quad H_1: \mu < 200$$
Step 2: Since $\sigma$ is unknown, the appropriate test statistic is the t-test and its value is given by
$$t_0 = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{192.7 - 200}{10.812/\sqrt{10}} = -2.1351$$
Step 3: The rejection region is $t_0 < -t_{\alpha,n-1}$, where $t_{\alpha,n-1} = t_{0.05,9} = 1.8331$.
Step 4: Since $t_0 = -2.1351 < -t_{\alpha,n-1} = -1.8331$, we reject the null hypothesis $H_0$. On the basis of the data obtained from these 10 measurements, there does seem to be strong evidence that the true mean is within the safety level.
The p-value is equal to $F(t_0) = F(-2.1351) = 0.0308$, which is less than the significance level $\alpha = 0.05$; therefore we reject the null hypothesis. There is strong evidence that the mean bacteria count is within the safety level.
MATLAB code
>> X = [175 190 205 193 184 207 204 193 196 180];
>> mu0 = 200; alpha = 0.05;
>> [h,p,ci,stats] = ttest(X,mu0,alpha,'left')
4.3.3 TESTS ON THE VARIANCE OF A NORMAL POPULATION
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a $N(\mu, \sigma^2)$ population with unknown variance $\sigma^2$. In testing the population variance $\sigma^2$, there are three ways to structure the hypothesis test:
Two-tailed: $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 \neq \sigma_0^2$
Upper-tailed: $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 > \sigma_0^2$
Lower-tailed: $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 < \sigma_0^2$
Under the assumption that the null hypothesis is true (i.e. $H_0: \sigma^2 = \sigma_0^2$), it follows that the test statistic
$$X_0^2 = \frac{(n-1)S^2}{\sigma_0^2} \sim \chi^2(n-1)$$
has a $\chi^2$-distribution with $n-1$ degrees of freedom. Thus, this hypothesis test is called the $\chi^2$-test.
Let $\chi_0^2$ be the numerical value, calculated from the sample, of the test statistic $X_0^2$. Then, for a selected significance level $\alpha$, the critical regions are:
Two-tailed: if $\chi_0^2 < \chi^2_{1-\alpha/2,n-1}$ or $\chi_0^2 > \chi^2_{\alpha/2,n-1}$, reject $H_0$
Upper-tailed: if $\chi_0^2 > \chi^2_{\alpha,n-1}$, reject $H_0$
Lower-tailed: if $\chi_0^2 < \chi^2_{1-\alpha,n-1}$, reject $H_0$
Then, the $\chi^2$-test may be summarized as follows:
Hypotheses:
Two-tailed: $H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 \neq \sigma_0^2$
Upper-tailed: $H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 > \sigma_0^2$
Lower-tailed: $H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 < \sigma_0^2$
Test Statistic: $\chi_0^2 = \dfrac{(n-1)S^2}{\sigma_0^2}$
Critical Regions:
Two-tailed: $\chi_0^2 < \chi^2_{1-\alpha/2,n-1}$ or $\chi_0^2 > \chi^2_{\alpha/2,n-1}$; Upper-tailed: $\chi_0^2 > \chi^2_{\alpha,n-1}$; Lower-tailed: $\chi_0^2 < \chi^2_{1-\alpha,n-1}$
p-values (with $\chi^2(n-1)$ denoting a chi-square random variable with $n-1$ degrees of freedom):
Two-tailed: $2\min\!\left(P(\chi^2(n-1) < \chi_0^2),\ 1 - P(\chi^2(n-1) < \chi_0^2)\right)$; Upper-tailed: $P(\chi^2(n-1) > \chi_0^2)$; Lower-tailed: $P(\chi^2(n-1) < \chi_0^2)$
Example 4.21 A manufacturer of car batteries claims that the life of the company's batteries is approximately normally distributed with a standard deviation equal to 0.9 year. If a random sample of 10 of these batteries has a standard deviation of 1.2 years, do you think that $\sigma > 0.9$ year? Then, calculate the p-value. Use a 0.05 level of significance.
Solution: From the given information and data, we have $n = 10$, $s^2 = (1.2)^2 = 1.44$, $\sigma_0 = 0.9$, and $\alpha = 0.05$.
Step 1: This is an upper-tailed test, that is
$$H_0: \sigma^2 = 0.81 \quad \text{versus} \quad H_1: \sigma^2 > 0.81$$
Step 2: The appropriate test statistic is the $\chi^2$-test and its value is given by
$$\chi_0^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(10-1)(1.2)^2}{(0.9)^2} = 16$$
Step 3: The rejection region is $\chi_0^2 > \chi^2_{\alpha,n-1}$, where $\chi^2_{\alpha,n-1} = \chi^2_{0.05,9} = 16.919$.
Step 4: Since $\chi_0^2 = 16 \le \chi^2_{\alpha,n-1} = 16.919$, we do not reject the null hypothesis $H_0$.
The p-value is equal to $1 - F(\chi_0^2) = 1 - F(16) = 0.0669$, where $F$ is the cdf of the $\chi^2$-distribution with 9 degrees of freedom. Since the p-value is greater than the significance level $\alpha = 0.05$, we have insufficient evidence to reject the null hypothesis.
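The text gives no code for this example; a minimal MATLAB sketch of the upper-tailed $\chi^2$-test (illustrative variable names) is:
MATLAB code
>> n = 10; s2 = 1.2^2; sigma0 = 0.9; alpha = 0.05;
>> chi2_0 = (n-1)*s2/sigma0^2             % test statistic, 16
>> chi2crit = icdf('chi2',1-alpha,n-1)    % critical value, 16.919
>> pval = 1 - cdf('chi2',chi2_0,n-1)      % upper-tailed p-value, 0.0669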
Example 4.22 The production manager of a light bulb manufacturer believes that the lifespan of the 14W bulb with a light output of 800 lumens is 6000 hours. A random sample of 25 bulbs produced a sample mean of 6180 hours and a sample standard deviation of 178 hours. Test at the 5% level of significance that the population standard deviation is less than 200 hours. Then, calculate the p-value for the test. Assume that the lifespan of these bulbs is normally distributed.
Solution: From the given information and data, we have $n = 25$, $s = 178$, $\sigma_0 = 200$, and $\alpha = 0.05$. The lifespan of the light bulbs is assumed to be normally distributed.
Step 1: This is a lower-tailed test, that is
$$H_0: \sigma^2 = (200)^2 \quad \text{versus} \quad H_1: \sigma^2 < (200)^2$$
or equivalently
$$H_0: \sigma = 200 \quad \text{versus} \quad H_1: \sigma < 200$$
Step 2: The appropriate test statistic is the $\chi^2$-test and its value is given by
$$\chi_0^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(25-1)(178)^2}{(200)^2} = 19.0104$$
Step 3: The rejection region is $\chi_0^2 < \chi^2_{1-\alpha,n-1}$, where $\chi^2_{1-\alpha,n-1} = \chi^2_{1-0.05,24} = 13.8484$.
Step 4: Since $\chi_0^2 = 19.0104 \ge \chi^2_{1-\alpha,n-1} = 13.8484$, we do not reject the null hypothesis $H_0$. The manager can conclude that the standard deviation of the lifespan of a light bulb is 200 hours.
The p-value is equal to $F(\chi_0^2) = F(19.0104) = 0.2486$, which is greater than the significance level $\alpha = 0.05$; therefore, there is insufficient evidence to reject the null hypothesis.
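A minimal MATLAB sketch of this lower-tailed $\chi^2$-test, not part of the original text (illustrative variable names), is:
MATLAB code
>> n = 25; s = 178; sigma0 = 200; alpha = 0.05;
>> chi2_0 = (n-1)*s^2/sigma0^2          % test statistic, 19.0104
>> chi2crit = icdf('chi2',alpha,n-1)    % lower critical value, 13.8484
>> pval = cdf('chi2',chi2_0,n-1)        % lower-tailed p-value, 0.2486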
4.3.4 TESTS ON A PROPORTION
Consider a population of items, each of which independently meets certain standards with some unknown probability $p$, and suppose that a sample of size $n$ was taken from this population. In testing the proportion $p$, there are three ways to structure the hypothesis test:
Two-tailed: $H_0: p = p_0$ versus $H_1: p \neq p_0$
Upper-tailed: $H_0: p = p_0$ versus $H_1: p > p_0$
Lower-tailed: $H_0: p = p_0$ versus $H_1: p < p_0$
If $X$ denotes the number of the $n$ items that meet the standards, then $X \sim \text{bino}(n, p)$. Thus, for $n$ large and under the assumption that the null hypothesis is true (i.e. $H_0: p = p_0$), it follows that the test statistic
$$Z_0 = \frac{X - np_0}{\sqrt{np_0(1-p_0)}} \mathrel{\dot\sim} N(0, 1)$$
has an approximate standard normal distribution.
Let $z_0$ be the numerical value, calculated from the sample, of the test statistic $Z_0$. Then, for a selected significance level $\alpha$, the critical regions are:
Two-tailed: if $z_0 < -z_{\alpha/2}$ or $z_0 > z_{\alpha/2}$, reject $H_0$
Upper-tailed: if $z_0 > z_{\alpha}$, reject $H_0$
Lower-tailed: if $z_0 < -z_{\alpha}$, reject $H_0$
Denote by $\Phi(\cdot)$ the cdf of the $N(0, 1)$ distribution. Then, the proportion test may be summarized as follows:
Hypotheses:
Two-tailed: $H_0: p = p_0$, $H_1: p \neq p_0$
Upper-tailed: $H_0: p = p_0$, $H_1: p > p_0$
Lower-tailed: $H_0: p = p_0$, $H_1: p < p_0$
Test Statistic: $Z_0 = \dfrac{X - np_0}{\sqrt{np_0(1-p_0)}}$
Critical Regions:
Two-tailed: $|z_0| > z_{\alpha/2}$; Upper-tailed: $z_0 > z_{\alpha}$; Lower-tailed: $z_0 < -z_{\alpha}$
p-values:
Two-tailed: $2[1 - \Phi(|z_0|)]$; Upper-tailed: $1 - \Phi(z_0)$; Lower-tailed: $\Phi(z_0)$
Example 4.23 A builder claims that heat pumps are installed in 70% of all homes being constructed today in the city of Granada, Spain. Would you agree with this claim if a random survey of new homes in this city showed that 8 out of 15 had heat pumps installed? Then, calculate the p-value. Use a 0.10 level of significance.
Solution: From the given information, we have $n = 15$, $p_0 = 0.7$, $x = 8$, and $\alpha = 0.10$.
Step 1: This is a two-tailed test on the proportion:
$$H_0: p = 0.7 \quad \text{versus} \quad H_1: p \neq 0.7$$
Step 2: The appropriate test statistic is the z-test and its value is given by
$$z_0 = \frac{x - np_0}{\sqrt{np_0(1-p_0)}} = \frac{8 - (15)(0.7)}{\sqrt{(15)(0.7)(1-0.7)}} = -1.4086$$
Step 3: The rejection region is $|z_0| > z_{\alpha/2}$, where $z_{\alpha/2} = z_{0.05} = 1.6449$.
Step 4: Since $|z_0| = 1.4086 \le z_{\alpha/2} = 1.6449$, we do not reject the null hypothesis $H_0$. We conclude that there is insufficient reason to doubt the builder's claim.
The p-value is equal to $2[1 - \Phi(|z_0|)] = 2[1 - \Phi(1.4086)] = 0.1590$, which is greater than the significance level $\alpha = 0.10$; therefore we have insufficient evidence to reject the null hypothesis.
Example 4.24 A commonly prescribed drug for relieving nervous tension is believed to be only 60% effective. Experimental results with a new drug administered to a random sample of 100 adults who were suffering from nervous tension show that 70 received relief. Is this sufficient evidence to conclude that the new drug is superior to the one commonly prescribed? Then, calculate the p-value. Use a 0.05 level of significance.
Solution: From the given information, we have $n = 100$, $p_0 = 0.6$, $x = 70$, and $\alpha = 0.05$.
Step 1: This is an upper-tailed test on the proportion:
$$H_0: p = 0.6 \quad \text{versus} \quad H_1: p > 0.6$$
Step 2: The appropriate test statistic is the z-test and its value is given by
$$z_0 = \frac{x - np_0}{\sqrt{np_0(1-p_0)}} = \frac{70 - (100)(0.6)}{\sqrt{(100)(0.6)(1-0.6)}} = 2.0412$$
Step 3: The rejection region is $z_0 > z_{\alpha}$, where $z_{\alpha} = z_{0.05} = 1.6449$.
Step 4: Since $z_0 = 2.0412 > z_{\alpha} = 1.6449$, we reject the null hypothesis $H_0$. We conclude that the new drug is superior.
The p-value is equal to $1 - \Phi(z_0) = 1 - \Phi(2.0412) = 0.0206$, which is smaller than the significance level $\alpha = 0.05$; therefore we reject the null hypothesis.
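The text gives no code for this example; a minimal MATLAB sketch of the upper-tailed proportion test (illustrative variable names) is:
MATLAB code
>> n = 100; x = 70; p0 = 0.6; alpha = 0.05;
>> z0 = (x - n*p0)/sqrt(n*p0*(1-p0))    % test statistic, 2.0412
>> zcrit = icdf('norm',1-alpha,0,1)     % critical value, 1.6449
>> pval = 1 - cdf('norm',z0,0,1)        % upper-tailed p-value, 0.0206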
Example 4.25 Direct Mailing Company sells computers and computer parts by mail. The company claims that at least 90% of all orders are mailed within 72 hours after they are received. The quality control department at the company often takes samples to check if this claim is valid. A recently taken sample of 150 orders showed that 129 of them were mailed within 72 hours. Using a 2.5% significance level, do you think the company's claim is true?
Solution: From the given information, we have $n = 150$, $p_0 = 0.9$, $x = 129$, and $\alpha = 0.025$.
Step 1: This is a lower-tailed test on the proportion:
$$H_0: p \ge 0.9 \text{ (The company's claim is true)}$$
$$H_1: p < 0.9 \text{ (The company's claim is false)}$$
Step 2: The appropriate test statistic is the z-test and its value is given by
$$z_0 = \frac{x - np_0}{\sqrt{np_0(1-p_0)}} = \frac{129 - (150)(0.9)}{\sqrt{(150)(0.9)(1-0.9)}} = -1.6330$$
Step 3: The rejection region is $z_0 < -z_{\alpha}$, where $z_{\alpha} = z_{0.025} = 1.96$.
Step 4: Since $z_0 = -1.6330 \ge -z_{\alpha} = -1.96$, we do not reject the null hypothesis $H_0$. We conclude that the company's claim is true.
The p-value is equal to $\Phi(z_0) = \Phi(-1.6330) = 1 - \Phi(1.6330) = 0.0512$, which is greater than the significance level $\alpha = 0.025$; therefore we have insufficient evidence to reject the null hypothesis.
4.3.5 TESTS ON THE DIFFERENCE IN MEANS
Assume that $X_1, \ldots, X_{n_1} \sim N(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_{n_2} \sim N(\mu_2, \sigma_2^2)$ are two independent samples from two independent normal populations. Let $\bar{X}$, $\bar{Y}$, $S_1^2$ and $S_2^2$ be the sample means and sample variances, respectively. In testing the mean difference $\mu_1 - \mu_2 = \delta_0$, there are three ways to structure the hypothesis test:
Two-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$ versus $H_1: \mu_1 - \mu_2 \neq \delta_0$
Upper-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$ versus $H_1: \mu_1 - \mu_2 > \delta_0$
Lower-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$ versus $H_1: \mu_1 - \mu_2 < \delta_0$
Case I) When $\sigma_1$ and $\sigma_2$ are known:
Under the assumption that the null hypothesis is true (i.e. $H_0: \mu_1 - \mu_2 = \delta_0$), the test statistic
$$Z_0 = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} = \frac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \sim N(0, 1)$$
has a standard normal distribution.
Let $z_0$ be the numerical value, calculated from the sample, of the test statistic $Z_0$. Then, for a selected significance level $\alpha$, the critical regions are
Two-tailed: if $z_0 < -z_{\alpha/2}$ or $z_0 > z_{\alpha/2}$, reject $H_0$
Upper-tailed: if $z_0 > z_{\alpha}$, reject $H_0$
Lower-tailed: if $z_0 < -z_{\alpha}$, reject $H_0$
and the z-test for the difference in means may be summarized as follows:
Hypotheses:
Two-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$, $H_1: \mu_1 - \mu_2 \neq \delta_0$
Upper-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$, $H_1: \mu_1 - \mu_2 > \delta_0$
Lower-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$, $H_1: \mu_1 - \mu_2 < \delta_0$
Test Statistic (z-test): $Z_0 = \dfrac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
Critical Regions:
Two-tailed: $|z_0| > z_{\alpha/2}$; Upper-tailed: $z_0 > z_{\alpha}$; Lower-tailed: $z_0 < -z_{\alpha}$
p-values:
Two-tailed: $2[1 - \Phi(|z_0|)]$; Upper-tailed: $1 - \Phi(z_0)$; Lower-tailed: $\Phi(z_0)$
Example 4.26 A random sample of size $n_1 = 36$ selected from a normal distribution with standard deviation $\sigma_1 = 4$ has a mean $\bar{x} = 75$. A second random sample of size $n_2 = 25$ selected from a different normal distribution with a standard deviation $\sigma_2 = 6$ has a mean $\bar{y} = 85$. Is there a significant difference between the population means at the 5 percent level of significance? Then, calculate the p-value.
Solution: From the given information, we have $n_1 = 36$, $n_2 = 25$, $\bar{x} = 75$, $\bar{y} = 85$, $\delta_0 = 0$, $\sigma_1 = 4$, $\sigma_2 = 6$, and $\alpha = 0.05$.
Step 1: Since we want to determine whether there is a difference between the population means, this will be a two-tailed test. Hence,
$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_1: \mu_1 \neq \mu_2$$
Step 2: Since $\sigma_1$ and $\sigma_2$ are known, the appropriate test statistic is the z-test and its value is given by
$$z_0 = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} = \frac{75 - 85}{\sqrt{\dfrac{16}{36} + \dfrac{36}{25}}} = -7.2846$$
Step 3: The rejection region is $|z_0| > z_{\alpha/2}$, where $z_{\alpha/2} = z_{0.025} = 1.96$.
Step 4: Since $|z_0| = 7.2846 > z_{\alpha/2} = 1.96$, we reject the null hypothesis $H_0$. We can conclude that the means are significantly different from each other.
The p-value is equal to $2[1 - \Phi(|z_0|)] = 2[1 - \Phi(7.2846)] \approx 0$, which is less than the significance level $\alpha = 0.05$; therefore we reject the null hypothesis.
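The text gives no code for this example; a minimal MATLAB sketch of the two-sample z-test from the summary statistics (illustrative variable names) is:
MATLAB code
>> n1 = 36; n2 = 25; xbar = 75; ybar = 85;
>> sigma1 = 4; sigma2 = 6; alpha = 0.05;
>> z0 = (xbar - ybar)/sqrt(sigma1^2/n1 + sigma2^2/n2)   % test statistic, -7.2846
>> zcrit = icdf('norm',1-alpha/2,0,1)                   % critical value, 1.96
>> pval = 2*(1 - cdf('norm',abs(z0),0,1))               % two-tailed p-value, approx 0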
Case II) When $\sigma_1$ and $\sigma_2$ are unknown and equal to $\sigma$:
Under the assumption that the null hypothesis is true (i.e. $H_0: \mu_1 - \mu_2 = \delta_0$), the test statistic
$$T_0 = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S_p\sqrt{1/n_1 + 1/n_2}} = \frac{\bar{X} - \bar{Y} - \delta_0}{S_p\sqrt{1/n_1 + 1/n_2}} \sim t(n_1 + n_2 - 2)$$
has a $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom, and $S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$ is the pooled variance.
Let $t_0$ be the numerical value, calculated from the sample, of the test statistic $T_0$. Then, for a selected significance level $\alpha$, the critical regions are
Two-tailed: if $t_0 < -t_{\alpha/2,n_1+n_2-2}$ or $t_0 > t_{\alpha/2,n_1+n_2-2}$, reject $H_0$
Upper-tailed: if $t_0 > t_{\alpha,n_1+n_2-2}$, reject $H_0$
Lower-tailed: if $t_0 < -t_{\alpha,n_1+n_2-2}$, reject $H_0$
Denote by $F(\cdot)$ the cdf of the $t(n_1 + n_2 - 2)$ distribution. Then, the t-test for the difference in means may be summarized as follows:
Hypotheses:
Two-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$, $H_1: \mu_1 - \mu_2 \neq \delta_0$
Upper-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$, $H_1: \mu_1 - \mu_2 > \delta_0$
Lower-tailed: $H_0: \mu_1 - \mu_2 = \delta_0$, $H_1: \mu_1 - \mu_2 < \delta_0$
Test Statistic (t-test): $T_0 = \dfrac{\bar{X} - \bar{Y} - \delta_0}{S_p\sqrt{1/n_1 + 1/n_2}}$
Critical Regions:
Two-tailed: $|t_0| > t_{\alpha/2,n_1+n_2-2}$; Upper-tailed: $t_0 > t_{\alpha,n_1+n_2-2}$; Lower-tailed: $t_0 < -t_{\alpha,n_1+n_2-2}$
p-values:
Two-tailed: $2[1 - F(|t_0|)]$; Upper-tailed: $1 - F(t_0)$; Lower-tailed: $F(t_0)$
Example 4.27 An experiment was performed to compare the abrasive wear of two different laminated materials X and Y. Twelve
pieces of material X were tested by exposing each piece to a machine measuring wear. Ten pieces of material Y were similarly tested.
In each case, the depth of wear was observed. The samples of material X gave an average (coded) wear of 85 units with a sample
standard deviation of 4, while the samples of material Y gave an average of 81 with a sample standard deviation of 5. Can we
conclude at the 0.05 level of signicance that the abrasive wear of material X exceeds that of material Y by more than 2 units?
Assume the populations to be approximately normal with equal variances.
Solution: From the given information, we have $n_1 = 12$, $n_2 = 10$, $\bar{x} = 85$, $\bar{y} = 81$, $\delta_0 = 2$, $s_1 = 4$, $s_2 = 5$, and $\alpha = 0.05$.
Step 1: Let $\mu_1$ and $\mu_2$ represent the population means of the abrasive wear for material X and material Y, respectively. This is an upper-tailed test. Thus,
$$H_0: \mu_1 - \mu_2 = 2 \qquad H_1: \mu_1 - \mu_2 > 2$$
Step 2: Since the standard deviations of the populations are unknown, the appropriate test statistic is the t-test and its value is given by
$$t_0 = \frac{\bar{x} - \bar{y} - \delta_0}{s_p\sqrt{1/n_1 + 1/n_2}} = \frac{85 - 81 - 2}{4.4777\sqrt{1/12 + 1/10}} = 1.0432$$
where the pooled variance is
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(12-1)(4)^2 + (10-1)(5)^2}{12 + 10 - 2} = 20.05$$
Step 3: The rejection region is $t_0 > t_{\alpha,\,n_1+n_2-2}$, where $t_{\alpha,\,n_1+n_2-2} = t_{0.05,20} = 1.7247$.
Step 4: Since $t_0 = 1.0432 \leq t_{\alpha,\,n_1+n_2-2} = 1.7247$, we do not reject the null hypothesis $H_0$. We are unable to conclude that the abrasive wear of material X exceeds that of material Y by more than 2 units.
The p-value is equal to $1 - F(t_0) = 1 - F(1.0432) = 0.1547$, which is greater than the significance level $\alpha = 0.05$; therefore we have insufficient evidence to reject the null hypothesis.
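As a check (not part of the original text), the pooled t-test above can be reproduced from the summary statistics in MATLAB, assuming the Statistics Toolbox functions tinv and tcdf:
n1 = 12; n2 = 10; xbar = 85; ybar = 81; s1 = 4; s2 = 5; d0 = 2; alpha = 0.05;
sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2));    % pooled std, about 4.4777
t0 = (xbar - ybar - d0)/(sp*sqrt(1/n1 + 1/n2));      % about 1.0432
tcrit = tinv(1-alpha, n1+n2-2);                      % upper critical value, about 1.7247
pval = 1 - tcdf(t0, n1+n2-2);                        % upper-tailed p-value, about 0.1547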
Example 4.28 One process of making green gasoline, not just a gasoline additive, takes biomass in the form of sucrose and converts
it into gasoline using catalytic reactions. This research is still at the pilot plant stage. At one step in a pilot plant process, the
product volume (liters) consists of carbon chains of length 3. Nine runs were made with each of two catalysts and the product
volumes measured:
catalyst 1: 1.86 2.05 2.06 1.88 1.75 1.64 1.86 1.75 2.13
catalyst 2: 0.32 1.32 0.93 0.84 0.55 0.84 0.37 0.52 0.34
Is the mean yield with catalyst 1 more than 0.80 liters higher than the yield with catalyst 2? Test with = 0.05.
Solution: From the given information and data, we have $n_1 = 9$, $n_2 = 9$, $\bar{x} = 1.8867$, $\bar{y} = 0.67$, $\delta_0 = 0.80$, $s_1 = 0.1642$, $s_2 = 0.3366$, and $\alpha = 0.05$.
As shown in Figure 4.10, the normal probability plots and box plots indicate that the measurements constitute samples from normal populations.
FIGURE 4.10: Normal probability and box plots for the product volumes.
Step 1: Let $\mu_1$ and $\mu_2$ represent the population means for catalyst 1 and catalyst 2, respectively. This is an upper-tailed test. Thus,
$$H_0: \mu_1 - \mu_2 = 0.80 \qquad H_1: \mu_1 - \mu_2 > 0.80$$
Step 2: Since the standard deviations of the populations are unknown, the appropriate test statistic is the t-test and its value is given by
$$t_0 = \frac{\bar{x} - \bar{y} - \delta_0}{s_p\sqrt{1/n_1 + 1/n_2}} = \frac{1.8867 - 0.67 - 0.80}{0.2648\sqrt{1/9 + 1/9}} = 3.3381$$
where the pooled variance is
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(9-1)(0.1642)^2 + (9-1)(0.3366)^2}{9 + 9 - 2} = 0.0701$$
Step 3: The rejection region is $t_0 > t_{\alpha,\,n_1+n_2-2}$, where $t_{\alpha,\,n_1+n_2-2} = t_{0.05,16} = 1.7459$.
Step 4: Since $t_0 = 3.3381 > t_{\alpha,\,n_1+n_2-2} = 1.7459$, we reject the null hypothesis $H_0$. We conclude that the mean product volume from catalyst 1 is more than 0.80 liters higher than that from catalyst 2.
The p-value is equal to $1 - F(t_0) = 1 - F(3.3381) = 0.0021$, which is smaller than the significance level $\alpha = 0.05$; therefore we reject the null hypothesis.
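A short MATLAB sketch (not from the text) that carries out the same test directly from the raw catalyst data; it assumes the Statistics Toolbox tcdf:
x = [1.86 2.05 2.06 1.88 1.75 1.64 1.86 1.75 2.13];   % catalyst 1 product volumes
y = [0.32 1.32 0.93 0.84 0.55 0.84 0.37 0.52 0.34];   % catalyst 2 product volumes
d0 = 0.80; n1 = numel(x); n2 = numel(y);
sp = sqrt(((n1-1)*var(x) + (n2-1)*var(y))/(n1+n2-2)); % pooled std, about 0.2648
t0 = (mean(x) - mean(y) - d0)/(sp*sqrt(1/n1 + 1/n2)); % about 3.34
pval = 1 - tcdf(t0, n1+n2-2);                         % about 0.002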
4.3.6 TESTS ON THE EQUALITY OF VARIANCES
The quality of any process depends on the amount of variability present in the process, which we measure in terms of
the variance of the quality characteristic. For example, if we have to choose between two similar processes, we would
prefer the one with smaller variance. Any process with smaller variance is more dependable and more predictable. In
fact, one of the most important criteria used to improve the quality of a process or to achieve six sigma (6σ) quality is to reduce the
variance of the quality characteristic in the process. In practice, comparing the variances of two processes is common.
Assume that $X_1, \ldots, X_{n_1} \sim N(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_{n_2} \sim N(\mu_2, \sigma_2^2)$ are two independent samples from two independent normal populations. Let $S_1^2$ and $S_2^2$ be the sample variances. In testing the equality of variances $\sigma_1^2 = \sigma_2^2$, there are three ways to structure the hypothesis test:
Two-tailed: $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_1: \sigma_1^2 \neq \sigma_2^2$
Upper-tailed: $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_1: \sigma_1^2 > \sigma_2^2$
Lower-tailed: $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_1: \sigma_1^2 < \sigma_2^2$
Under the assumption that the null hypothesis is true (i.e. $H_0: \sigma_1^2 = \sigma_2^2$), the test statistic
$$F_0 = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{S_1^2}{S_2^2} \sim F(n_1 - 1, n_2 - 1)$$
Let $f_0$ be the numerical value, calculated from the sample, of the test statistic $F_0$. Then, for a selected significance level $\alpha$, the critical regions are
Two-tailed: reject $H_0$ if $f_0 < f_{1-\alpha/2,\,n_1-1,\,n_2-1}$ or $f_0 > f_{\alpha/2,\,n_1-1,\,n_2-1}$
Upper-tailed: reject $H_0$ if $f_0 > f_{\alpha,\,n_1-1,\,n_2-1}$
Lower-tailed: reject $H_0$ if $f_0 < f_{1-\alpha,\,n_1-1,\,n_2-1}$
and the f-test for the equality of variances may be summarized as follows:
Hypotheses:
Two-tailed: $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_1: \sigma_1^2 \neq \sigma_2^2$; Upper-tailed: $H_1: \sigma_1^2 > \sigma_2^2$; Lower-tailed: $H_1: \sigma_1^2 < \sigma_2^2$
Test Statistic (f-test): $F_0 = \dfrac{S_1^2}{S_2^2}$
Critical Regions:
Two-tailed: $f_0 > f_{\alpha/2,\,n_1-1,\,n_2-1}$ or $f_0 < f_{1-\alpha/2,\,n_1-1,\,n_2-1}$; Upper-tailed: $f_0 > f_{\alpha,\,n_1-1,\,n_2-1}$; Lower-tailed: $f_0 < f_{1-\alpha,\,n_1-1,\,n_2-1}$
Example 4.29 In testing for the difference in the abrasive wear of the two materials in the previous example, we assumed that the two unknown population variances were equal. Were we justified in making this assumption? Use a 0.10 level of significance.
Solution: From the given information, we have $n_1 = 12$, $n_2 = 10$, $s_1 = 4$, $s_2 = 5$, and $\alpha = 0.10$.
Step 1: Let $\sigma_1^2$ and $\sigma_2^2$ represent the population variances of the abrasive wear for material X and material Y, respectively. This is a two-tailed test. Thus,
$$H_0: \sigma_1^2 = \sigma_2^2 \qquad H_1: \sigma_1^2 \neq \sigma_2^2$$
Step 2: Since the standard deviations of the populations are unknown, the appropriate test statistic is the f-test and its value is given by
$$f_0 = \frac{s_1^2}{s_2^2} = 0.64$$
Step 3: The rejection region is $f_0 > f_{\alpha/2,\,n_1-1,\,n_2-1}$ or $f_0 < f_{1-\alpha/2,\,n_1-1,\,n_2-1}$, where $f_{\alpha/2,\,n_1-1,\,n_2-1} = 3.1025$ and $f_{1-\alpha/2,\,n_1-1,\,n_2-1} = 0.3453$.
Step 4: Since $f_{1-\alpha/2,\,n_1-1,\,n_2-1} = 0.3453 \leq f_0 = 0.64 \leq f_{\alpha/2,\,n_1-1,\,n_2-1} = 3.1025$, we do not reject the null hypothesis $H_0$. We conclude that there is insufficient evidence that the variances differ.
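The two F critical values quoted above can be obtained in MATLAB (a sketch, not from the text), assuming the Statistics Toolbox finv:
n1 = 12; n2 = 10; s1 = 4; s2 = 5; alpha = 0.10;
f0 = s1^2/s2^2;                               % 0.64
fU = finv(1-alpha/2, n1-1, n2-1);             % upper critical value, about 3.10
fL = finv(alpha/2, n1-1, n2-1);               % lower critical value, about 0.345
reject = (f0 > fU) || (f0 < fL);              % false here, so H0 is not rejected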
Example 4.30 Suppose the following is the sample summary of samples from two independent processes:
$$n_1 = 21, \quad S_1^2 = 24.6 \qquad\qquad n_2 = 16, \quad S_2^2 = 16.4$$
We assume that the quality characteristics of the two processes are normally distributed $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively. Test at the 5% level of significance the hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$, and find the p-value for the test.
Solution: From the given information, we have $n_1 = 21$, $n_2 = 16$, $s_1^2 = 24.6$, $s_2^2 = 16.4$, and $\alpha = 0.05$.
Step 1: This is a two-tailed test. Thus,
$$H_0: \sigma_1^2 = \sigma_2^2 \qquad H_1: \sigma_1^2 \neq \sigma_2^2$$
Step 2: Since the standard deviations of the populations are unknown, the appropriate test statistic is the f-test and its value is given by
$$f_0 = \frac{s_1^2}{s_2^2} = 1.50$$
Step 3: The rejection region is $f_0 > f_{\alpha/2,\,n_1-1,\,n_2-1}$ or $f_0 < f_{1-\alpha/2,\,n_1-1,\,n_2-1}$, where $f_{\alpha/2,\,n_1-1,\,n_2-1} = 2.7559$ and $f_{1-\alpha/2,\,n_1-1,\,n_2-1} = 0.3886$.
Step 4: Since $f_0 = 1.50 \leq f_{\alpha/2,\,n_1-1,\,n_2-1} = 2.7559$ and $f_0 = 1.50 \geq f_{1-\alpha/2,\,n_1-1,\,n_2-1} = 0.3886$, we do not reject the null hypothesis $H_0$. We conclude that there is insufficient evidence that the variances differ.
The p-value is $2[1 - F(f_0)] = 2[1 - F(1.5)] = 0.4274$, where $F$ is the cdf of the $F(n_1 - 1, n_2 - 1)$ distribution with degrees of freedom 20 and 15. Since the p-value is greater than $\alpha$, there is insufficient evidence to reject the null hypothesis.
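A minimal MATLAB sketch (not from the text) for the two-tailed F-test p-value, assuming the Statistics Toolbox fcdf:
n1 = 21; n2 = 16; s1sq = 24.6; s2sq = 16.4;
f0 = s1sq/s2sq;                                              % 1.50
pval = 2*min(fcdf(f0,n1-1,n2-1), 1 - fcdf(f0,n1-1,n2-1));    % about 0.43, well above 0.05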
4.4 PROBLEMS
The total weight of a lled tire can dramatically affect the performance and safety of an automobile. Some trans-
portation ofcials argue that mechanics should check the tire weights of every vehicle as part of an annual inspec-
tion. Suppose the weight of a 185/60/14 lled tire is normally distributed with standard deviation 1.25 pounds.
In a random sample of 15 lled tires, the sample mean weight was 18.75 pounds. Find a 95% condence interval
for the true mean weight of 185/60/14 tires.
An electro Pneumatic hammer has an advertised impact force of 2.2 joules. In a random sample of 23 hammers,
the impact force for each tool was carefully measured (in joules), and the resulting data are as follows:
2.16 1.69 2.30 2.08 1.72 2.17 2.25 2.06 2.00 2.29 2.15 2.49
2.12 2.17 1.93 2.39 2.22 2.26 2.14 1.92 2.06 2.09 2.08
a) Check the assumption of normality for the impact force data.
b) Find the 95% two-sided condence interval for the true mean impact force for this type of pneumatic ham-
mer.
c) Using the condence interval constructed in part b), is there any evidence to suggest that the true mean
impact force is different from 2.2 joules as advertised? Justify your answer.
Adobeware dishes are made from clay and are fired, or exposed to heat, in a large kiln. Large fluctuations in the kiln temperature can cause cracks, bumps, or other flaws (and increase cost). With the kiln set at 800°C, a random sample of 19 temperature measurements (in °C) was obtained. The sample variance was 17.55.
a) Find the 95%two-sided condence interval for the true population variance in temperature of the kiln when
it is set to 800C. Assume that the underlying distribution is normal.
b) Quality control engineers have determined that the maximum variance in temperature during ring should
be 16C. Using the condence interval constructed in part a), is there any evidence to suggest that the true
temperature variance is greater than 16C? Justify your answer.
A successful company usually has high brand name and logo recognition among consumers. For example, Coca-Cola products are available to 98% of all people in the world, and the company may therefore have the highest logo recognition of any company. A software firm developing a product would like to estimate the proportion of people who
recognize the Linux penguin logo. Of the 952 randomly selected consumers surveyed, 132 could identify the
product associated with the penguin.
a) Is the distribution of the sample proportion, p, approximately normal? Justify your answer.
b) Find the 95% two-sided condence interval for the true proportion of consumers who recognize the Linux
penguin.
c) The company will market a Linux version of their new software if the true proportion of people who rec-
ognize the logo is greater than 0.10. Is there any evidence to suggest that the true proportion of people who
recognize the logo is greater than 0.10? Justify your answer.
An engineer wants to measure the bias in a pH meter. She uses the meter to measure the pH in 15 neutral
substances (pH = 7.0) and obtains the following data:
7.04 7.0 7.03 7.01 6.97 7.00 6.95 7.00 6.99 7.04 6.97 7.07 7.04 6.97 7.08
a) Check the assumption of normality for the pH meter data.
b) Is there sufficient evidence to support the claim that the pH meter is not correctly calibrated at the 5% level of significance?
c) Find the 95% two-sided condence interval to estimate the mean. Comment on your result.
A quality control supervisor in a cannery knows that the exact amount each can contains will vary, since there are
certain uncontrollable factors that affect the amount of ll. Suppose regulatory agencies specify that the standard
deviation of the amount of fill should be less than 0.1 ounce. The quality control supervisor sampled 10 cans and
measured the amount of ll in each. The resulting data measurements are:
7.96, 7.90, 7.98, 8.01, 7.97, 7.96, 8.03, 8.02, 8.04, 8.02
Does this information, at the 0.05 level of signicance, provide sufcient evidence to indicate that the standard
deviation of the ll measurements is less than 0.1 ounce? Then, calculate the p-value.
The management of a luxurious hotel is concerned with increasing the return rate for hotel guests. One aspect of
rst impressions by guests relates to the time it takes to deliver the guests luggage to the room after check-in
to the hotel. A random sample of 20 deliveries on a particular day were selected in Wing A of the hotel and a
random sample of 20 deliveries were selected in Wing B.
Wing A: 10.70, 9.89, 11.83, 9.04, 9.37, 11.68, 8.36, 9.76, 13.67, 8.96, 9.51, 10.85, 10.57, 11.06, 8.91, 11.79, 10.59, 9.13, 12.37, 9.91
Wing B: 7.20, 6.68, 9.29, 8.95, 6.61, 8.53, 8.92, 7.95, 7.57, 6.38, 8.89, 10.03, 9.30, 5.28, 9.23, 9.25, 8.44, 6.57, 10.61, 6.77
a) Is the normality assumption of the data satised? Justify your answer.
b) Was there a difference in the mean delivery time between the two wings of the hotel? Test with α = 0.05.
c) Determine whether the variance in luggage delivery time is the same for Wing A and Wing B of the hotel at the α = 0.05 level of significance.
d) Assume that delivery times to Wing A and Wing B are two independent normal populations with unknown variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Construct a 90% two-sided confidence interval on the ratio of the two standard deviations $\sigma_1/\sigma_2$. Comment on your result.
CHAPTER 5
STATISTICAL PROCESS AND QUALITY CONTROL
Almost all quality improvement comes via
simplification of design, manufacturing, layout,
processes, and procedures.
Tom Peters
Statistical quality control (SQC) is a term used to describe the activities associated with ensuring that goods and
services satisfy customer needs. SQC uses statistical analysis based on measurements taken from a process or from a
sample of products or services, to make decisions regarding the quality of goods and services. The statistical methods
of SQC may be divided into two main categories: Statistical process control (SPC) and acceptance sampling. SPC refers
to the use of statistical methods to measure and control the performance of a process to ensure that the output meets
customer needs. Acceptance sampling is a methodology of taking samples from lots of materials or products and
inspecting the items to determine if the items meet customer requirements. A process may include customer services,
production systems, and administrative activities. SPC may be used to help control almost all processes that can be
measured or monitored to ensure that the process performs within limits.
5.1 STATISTICAL PROCESS CONTROL
Statistical process control allows engineers to understand and monitor process variation through control charts. The
causes of process variation in a product quality characteristic may be broadly classied into two main categories:
common causes of variation (variation due to the system itself) and assignable causes of variation (variation due to
factors external to the system).
The concept of control charts was rst introduced by Walter A. Shewhart of Bell Telephone Laboratories during
the 1920s. For this reason, statistical control charts are also known as Shewhart control charts. A control chart is
a graphical method to quickly spot assignable causes of variation in a process. Variation is present in any process;
deciding when the variation is natural and when it needs correction is the key to quality control. A control chart
displays a quality characteristic that has been measured or computed from a sample versus the sample number or
time. The sample values to be used in a quality control effort are divided into subgroups with a sample representing
a subgroup. A control chart contains a center line (CL) that represents the average value of the quality characteristic
when the process is in control. Two other horizontal lines, called the upper control limit (UCL) and the lower control
limit (LCL), are also shown on the chart. These control limits are chosen so that if the process is in control, nearly all
of the sample points will fall between them. In general, as long as the points plot within the control limits, the process
is assumed to be in-control, and no action is necessary. However, a point that plots outside of the control limits is
interpreted as evidence that the process is out-of-control, and investigation and corrective action are required to nd
and eliminate the assignable cause or causes responsible for this behavior. The sample points on the control chart are
usually connected with straight-line segments so that it is easier to visualize how the sequence of points has evolved
over time. Figure 5.1 illustrates the concept of a control chart, where the process is found to be out of control due to the
sample number 15 which falls outside the control limits.
FIGURE 5.1: Illustration of a control chart.
5.1.1 HYPOTHESIS TESTING AND CONTROL CHARTS
There is a close connection between control charts and hypothesis testing. A control chart may be formulated as a hypothesis test:
$$H_0: \text{the process is in-control} \qquad\text{vs.}\qquad H_1: \text{the process is out-of-control} \qquad (5.1.1)$$
Control limits are established to control the probability of making the error of concluding that the process is out of
control when in fact it is not. This corresponds to the probability of making a Type I error if we were testing the null
hypothesis that the process is in control. On the other hand, we must be attentive to the error of not nding the process
out of control when in fact it is (Type II error). Thus, the choice of control limits is similar to the choice of a critical
region. When a point plots within the control limits, the null hypothesis is not rejected; and when a point plots outside
the control limits, the null hypothesis is rejected.
Definition 5.1 Let $w$ be a sample statistic that measures some quality characteristic of interest, with mean $\mu_w$ and standard deviation $\sigma_w$. The upper control limit, center line, and lower control limit are given by
$$\mathrm{UCL} = \mu_w + 3\sigma_w, \qquad \mathrm{CL} = \mu_w, \qquad \mathrm{LCL} = \mu_w - 3\sigma_w \qquad (5.1.2)$$
The $3\sigma_w$ limits imply that there is a probability of only about 0.0027 that the sample statistic falls outside the control limits when the process is in-control.
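This false-alarm probability can be checked with one line of MATLAB (a sketch, not from the text), assuming the Statistics Toolbox normcdf:
p_out = 2*(1 - normcdf(3));   % probability beyond +/- 3 sigma, about 0.0027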
Control charts are broadly classified into control charts for variables and control charts for attributes. Variable control
charts are used for quality characteristics that are measured on a continuous scale such as length, temperature, weight,
and time. Attribute control charts are used for quality characteristics in discrete (count) data, such as number of
defects. Attribute control charts are further divided into two main classes: attributes control charts for defective units,
and attribute control charts for defects per unit.
5.1.2 RULES FOR DETERMINING OUT-OF-CONTROL POINTS
The control chart is an important tool for distinguishing between the common causes of variation that are due to the
process and special causes of variation that are not due to the process. Only management can change the process. One
of the main goals of using a control chart is to determine when the process is out-of-control so that necessary actions
may be taken. The simplest rule for detecting the presence of an assignable (or special) cause of variation is one or
more plotted points falling outside the control limits UCL and LCL. Assignable causes are special causes of variation
that are ordinarily not part of the process, and should be corrected as warranted. Common causes, on the other hand,
are inherent in the design of the system and reect the typical variation to be expected. An unstable (or out-of-control)
process exhibits variation due to both assignable and common causes. Improvement can be achieved by identifying
and removing the assignable cause(s). A stable process is one that exhibits only common-cause variation, and can be
improved only by changing the design of the process. Attempts to make adjustments to a stable process, which is
called tampering, results in more variation in the quality of the output. Control charts are used to detect the occurrence
of assignable causes affecting the quality of process output. Figure 5.2 depicts a control chart in which the area between UCL and LCL is subdivided into bands, each of which is $1\sigma_w$ wide.
FIGURE 5.2: Illustration of control chart bands, each of which is $1\sigma_w$ wide.
The rules for determining out-of-control points in a control chart may be summarized in five main rules (referred to as the Western Electric rules). That is, a process is considered out-of-control (unstable) if:
Rule 1: A point falls outside the upper and lower control limits, i.e. above UCL or below LCL.
Rule 2: Two out of three consecutive points fall above $\mu_w + 2\sigma_w$ or below $\mu_w - 2\sigma_w$.
Rule 3: Four out of five consecutive points fall above $\mu_w + 1\sigma_w$ or below $\mu_w - 1\sigma_w$.
Rule 4: Eight or more consecutive points fall above $\mu_w$ or below $\mu_w$.
Rule 5: Eight or more consecutive points move upward (increasing) or downward (decreasing) in value.
5.2 CONTROL CHARTS FOR VARIABLES
Control charts for variables are used to study a process when a characteristic is a measurement; for example, temper-
ature, cost, revenue, processing time, area, and waiting time. Variable charts are typically used in pairs. One chart
studies the variation in a process, and the other chart studies the variation in the process mean. A chart that studies
the process variability must be examined before the chart that studies the process mean. This is due to the fact that the
chart that studies the process mean assumes that the process variability is stable over time. One of the most commonly
used pairs of charts is the X-chart and the R-chart. Another pair is the X-chart and the s-chart. In this section, we
discuss in detail these two pairs of charts.
5.2.1 CONTROL CHARTS FOR THE MEAN AND RANGE
Control Chart for the Mean (X̄-chart):
An X̄-chart is a control chart plotting the sample means vs the sample number. Denote by $\mu$ and $\sigma$ the process mean and standard deviation, respectively. Considering the sample mean $\bar{X}$ as the sample statistic yields $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. When the parameters $\mu$ and $\sigma$ are unknown, we usually estimate them on the basis of preliminary samples (subgroups), taken when the process is thought to be in control. Suppose m preliminary samples are available, each of size n. Denote by $\bar{X}_i$ and $R_i$ the sample mean and range of the i-th sample, respectively. An unbiased estimator of $\mu$ is obtained by averaging the m sample (subgroup) means when the process is in control
$$\mathrm{CL} = \bar{\bar{X}} = \frac{1}{m}\sum_{i=1}^{m}\bar{X}_i \qquad (5.2.1)$$
where $\bar{\bar{X}}$ (pronounced X-double-bar) is the average of the $\bar{X}_i$'s.
On the other hand, it can be shown that the mean and standard deviation of the relative range $W = R/\sigma$ are $\mu_W = d_2$ and $\sigma_W = d_3$, where $d_2$ and $d_3$ are constants that depend on the sample size n. That is, $R/d_2$ is an unbiased estimator of $\sigma$. Since $R = W\sigma$, it follows that $\mu_R = d_2\sigma$ and $\sigma_R = d_3\sigma$. An unbiased estimator of $\mu_R$ is the average of the sample ranges given by
$$\bar{R} = \frac{1}{m}\sum_{i=1}^{m}R_i \qquad (5.2.2)$$
where $R_i = X_{(n),i} - X_{(1),i}$ is the range of the i-th sample (subgroup). Here $X_{(n),i}$ and $X_{(1),i}$ denote the largest and smallest observations, respectively, in the sample.
Thus, an unbiased estimator of $\sigma$ is
$$\hat{\sigma} = \frac{\bar{R}}{d_2} \qquad (5.2.3)$$
The upper and lower control limits are located at a distance of $3\sigma_{\bar{X}} = 3\sigma/\sqrt{n}$ above and below the center line. An estimator of this distance is given by
$$3\hat{\sigma}/\sqrt{n} = \frac{3(\bar{R}/d_2)}{\sqrt{n}} = \frac{3}{d_2\sqrt{n}}\bar{R} = A_2\bar{R} \qquad (5.2.4)$$
where $A_2$ is a constant that depends on the sample size n (see Appendix Table V).
Control chart for the mean (X̄-chart):
The upper control limit, center line, and lower control limit of the X̄-chart are given by
$$\mathrm{UCL} = \bar{\bar{x}} + A_2\bar{r}, \qquad \mathrm{CL} = \bar{\bar{x}}, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_2\bar{r} \qquad (5.2.5)$$
Example 5.1 A control chart for $\bar{X}$ is to be set up for an important quality characteristic. The sample size is n = 4, and $\bar{x}$ and r are computed for each of 25 preliminary samples. The summary data are
$$\sum_{i=1}^{25}\bar{x}_i = 7657, \qquad \sum_{i=1}^{25}r_i = 1180$$
(i) Find the control limits for the X̄-chart.
(ii) Assuming the process is in control, estimate the process mean and standard deviation.
Solution: From the given information, the number of samples is m = 25 and the sample size is n = 4.
(i) The grand mean and the average range are given by
$$\bar{\bar{x}} = \frac{1}{m}\sum_{i=1}^{m}\bar{x}_i = \frac{7657}{25} = 306.28, \qquad \bar{r} = \frac{1}{m}\sum_{i=1}^{m}r_i = \frac{1180}{25} = 47.20$$
The value of $A_2$ for samples of size 4 is $A_2 = 0.729$. Therefore, the control limits of the X̄-chart are
$$\mathrm{UCL} = \bar{\bar{x}} + A_2\bar{r} = 306.28 + (0.729)(47.20) = 340.69$$
$$\mathrm{CL} = \bar{\bar{x}} = 306.28$$
$$\mathrm{LCL} = \bar{\bar{x}} - A_2\bar{r} = 306.28 - (0.729)(47.20) = 271.87$$
(ii) The estimates of the process mean and standard deviation are
$$\hat{\mu} = \bar{\bar{x}} = 306.28, \qquad \hat{\sigma} = \frac{\bar{r}}{d_2} = \frac{47.20}{2.059} = 22.92$$
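These limits follow from the summary sums in a few lines of MATLAB (a sketch, not from the text; the constants A2 and d2 are taken from the table values quoted above):
m = 25; sum_xbar = 7657; sum_r = 1180; A2 = 0.729; d2 = 2.059;  % n = 4
xbarbar = sum_xbar/m; rbar = sum_r/m;        % 306.28 and 47.20
UCL = xbarbar + A2*rbar;                     % about 340.69
LCL = xbarbar - A2*rbar;                     % about 271.87
sigma_hat = rbar/d2;                         % about 22.92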
Control Chart for the Range (R-chart):
In quality control, we want to control not only the mean value of a quality characteristic but also its variability. A range chart (or simply R-chart) is a control chart plotting the sample ranges vs the sample number, and it monitors the variation of a quality characteristic. Considering the sample range R as the statistic yields $\mu_R = d_2\sigma$ and $\sigma_R = d_3\sigma$, where $d_2$ and $d_3$ are constants (see Appendix Table V) that depend on the sample size n. An unbiased estimator of $\mu_R$ is the mean $\bar{R}$ of the ranges of the m samples
$$\mathrm{CL} = \bar{R} = \frac{1}{m}\sum_{i=1}^{m}R_i$$
The upper and lower control limits are located at a distance of $3\sigma_R = 3d_3\sigma$ above and below the center line. Since $\bar{R}/d_2$ is an unbiased estimator of $\sigma$, the upper and lower control limits may be expressed as
$$\mathrm{UCL} = \bar{R} + 3\frac{d_3}{d_2}\bar{R} = \Big(1 + 3\frac{d_3}{d_2}\Big)\bar{R} = D_4\bar{R} \qquad (5.2.6)$$
and
$$\mathrm{LCL} = \bar{R} - 3\frac{d_3}{d_2}\bar{R} = \Big(1 - 3\frac{d_3}{d_2}\Big)\bar{R} = D_3\bar{R} \qquad (5.2.7)$$
where $D_3$ and $D_4$ are constants that depend on the sample size n (see Appendix Table V).
Control chart for the range (R-chart):
The upper control limit, center line, and lower control limit for the R-chart are given by
$$\mathrm{UCL} = D_4\bar{r}, \qquad \mathrm{CL} = \bar{r}, \qquad \mathrm{LCL} = D_3\bar{r} \qquad (5.2.8)$$
The R-chart highlights changes in the process variability and gives better results when analyzed in conjunction with the X̄-chart. A sample mean alone may not reveal a shift in the process at all. Thus, it is necessary to analyze both the X̄- and R-charts together to decide whether the process is in-control or out-of-control.
Example 5.2 Samples of size 5 are collected from a process every hour. After 30 samples have been collected, we calculate the value of the average range $\bar{r} = 2.5$. Find the control limits for the R-chart.
Solution: From the given information, the number of samples is m = 30 and the sample size is n = 5. From Appendix Table V, the values of $D_3$ and $D_4$ are $D_3 = 0$ and $D_4 = 2.114$. Thus, the upper control limit, center line, and lower control limit for the R-chart are given by
$$\mathrm{UCL} = D_4\bar{r} = (2.114)(2.5) = 5.29, \qquad \mathrm{CL} = \bar{r} = 2.5, \qquad \mathrm{LCL} = D_3\bar{r} = 0$$
Example 5.3 The data provided in Table 5.1 were obtained by measuring five consecutive units on an assembly line every 30 minutes until 20 subgroups (samples) were obtained. Each subgroup has 5 observations. Construct the X̄- and R-charts. Is the process under statistical control? Explain.
TABLE 5.1: Assembly Data in Subgroups (Samples) Obtained at Regular Intervals.
Sample Number   X1   X2   X3   X4   X5
1 29 51 75 62 42
2 97 73 75 99 56
3 46 60 68 76 57
4 40 61 66 41 59
5 66 76 70 76 52
6 58 61 45 41 78
7 53 38 71 57 42
8 48 71 37 62 60
9 86 65 68 72 74
10 77 55 43 50 63
11 54 50 41 66 46
12 58 48 49 87 59
13 83 57 35 25 52
14 87 43 57 39 47
15 76 48 76 77 80
16 51 32 53 52 54
17 47 48 65 56 47
18 70 45 60 73 44
19 42 67 78 95 59
20 39 82 54 35 32
Solution: The upper control limit, center line, and lower control limit for the R-chart are
$$\mathrm{UCL} = D_4\bar{r} = 74.219, \qquad \mathrm{CL} = \bar{r} = 35.10, \qquad \mathrm{LCL} = D_3\bar{r} = 0$$
where for a sample of size n = 5, Appendix Table V gives $D_3 = 0$ and $D_4 = 2.114$.
load assembly.mat                                                        % load assembly data
[stats,plotdata] = controlchart(X,'chart',{'r','xbar'},'sigma','range'); % R- and Xbar-charts
First we examine the R-chart for signs of special variation. The R-chart is shown in Figure 5.3(a), where all samples appear to be in-control. None of the points on the R-chart is outside the control limits, and there are no other signals indicating a lack of control. Thus, there are no indications of special sources of variation on the R-chart; in other words, only common-cause variation appears to exist. In this case, we can proceed further to calculate the upper control limit, center line, and lower control limit for the X̄-chart
$$\mathrm{UCL} = \bar{\bar{x}} + A_2\bar{r} = 78.926, \qquad \mathrm{CL} = \bar{\bar{x}} = 58.680, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_2\bar{r} = 38.434$$
where the value of $A_2$ for a sample of size n = 5 is $A_2 = 0.577$.
The X̄-chart is shown in Figure 5.3(b), where sample number 2 appears to be out-of-control. Further investigation is warranted to determine the source(s) of this special variation. Since the R-chart showed no lack of control, the X̄-chart can be interpreted without concern that the observed variability in the sample means is associated with a lack of control of the process variability. Therefore, once remedial actions have been taken to remove the special causes, sample number 2 is discarded and new limits are calculated using the remaining 19 samples (i.e. samples 1 and 3 to 20). These limits are referred to as revised control limits.
(a) R-chart (b) X-chart
FIGURE 5.3: R- and X-charts for assembly data.
With sample 2 deleted, the revised grand mean and mean range are
$$\bar{\bar{x}} = 57.56, \qquad \bar{r} = 34.69$$
The revised control limits for the new R-chart are
$$\mathrm{UCL} = D_4\bar{r} = 73.340, \qquad \mathrm{CL} = \bar{r} = 34.684, \qquad \mathrm{LCL} = D_3\bar{r} = 0$$
and the revised control limits for the new X̄-chart are
$$\mathrm{UCL} = \bar{\bar{x}} + A_2\bar{r} = 77.564, \qquad \mathrm{CL} = \bar{\bar{x}} = 57.558, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_2\bar{r} = 37.551$$
The new X- and R-charts are shown in Figure 5.4. Notice now that all the points fall within the limits, indicating that
the process may be stable.
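The control-limit calculations of this example can also be done directly from the data matrix with a few lines of MATLAB (a sketch, not from the text; it assumes X is the 20-by-5 matrix of Table 5.1, one subgroup per row, and uses the table constants quoted above):
A2 = 0.577; D3 = 0; D4 = 2.114;                   % constants for subgroups of size n = 5
xbar = mean(X,2);                                 % subgroup means
r = max(X,[],2) - min(X,[],2);                    % subgroup ranges
xbarbar = mean(xbar); rbar = mean(r);             % grand mean and mean range
UCLx = xbarbar + A2*rbar; LCLx = xbarbar - A2*rbar; % Xbar-chart limits
UCLr = D4*rbar; LCLr = D3*rbar;                     % R-chart limits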
5.2.2 CONTROL CHARTS FOR THE MEAN AND STANDARD DEVIATION
It is customary to examine the control chart for the standard deviation first, verify that all the plotted points fall within the control limits, and then proceed to constructing the X̄-chart. Bringing the process variability under control first and then controlling the mean makes sense because, without controlling the process variability, it is almost impossible to bring the process mean under control.
Control Chart for the Standard Deviation (S-chart):
As the sample size n increases, the range becomes increasingly less efficient as a measure of variability. This is the case because the range ignores all information between the two most extreme values (minimum and maximum sample
(a) R-chart (b) X-chart
FIGURE 5.4: Revised R- and X-charts for assembly data.
values). A standard deviation control chart (or simply S-chart) is sensitive to changes in the variation of the measurement process, and it is preferable for larger sample sizes ($n \geq 10$). The S-chart plots the sample standard deviations vs the
sample number. Suppose m preliminary samples are available, each of size n. Denote by $S_i$ the sample standard deviation of the i-th sample. An unbiased estimator of $\mu_S$ is the mean $\bar{S}$ of the sample standard deviations of the m samples
$$\mathrm{CL} = \bar{S} = \frac{1}{m}\sum_{i=1}^{m}S_i \qquad (5.2.9)$$
The upper and lower control limits are located at a distance of $3\sigma_S$ above and below the center line. It can be shown that $E(S) = \mu_S = c_4\sigma$ and $\sigma_S = \sigma\sqrt{1 - c_4^2}$, where $c_4$ is a constant that depends on the sample size n (see Appendix Table V). Thus, $\bar{S}/c_4$ is an unbiased estimator of $\sigma$, which in turn implies that $(\bar{S}/c_4)\sqrt{1 - c_4^2}$ is an estimator of $\sigma_S$. Therefore, the upper and lower control limits may be expressed as
$$\mathrm{UCL} = \bar{S} + 3\frac{\bar{S}}{c_4}\sqrt{1 - c_4^2} = \Big(1 + \frac{3}{c_4}\sqrt{1 - c_4^2}\Big)\bar{S} = B_4\bar{S} \qquad (5.2.10)$$
and
$$\mathrm{LCL} = \bar{S} - 3\frac{\bar{S}}{c_4}\sqrt{1 - c_4^2} = \Big(1 - \frac{3}{c_4}\sqrt{1 - c_4^2}\Big)\bar{S} = B_3\bar{S} \qquad (5.2.11)$$
where $B_3$ and $B_4$ are constants that depend on the sample size n (see Appendix Table V).
Control chart for the standard deviation (S-chart):
The upper control limit, center line, and lower control limit of the S-chart are given by
$$\mathrm{UCL} = B_4\bar{s}, \qquad \mathrm{CL} = \bar{s}, \qquad \mathrm{LCL} = B_3\bar{s} \qquad (5.2.12)$$
Example 5.4 Containers are produced by a process where the volume of the containers is subject to quality control. Twenty-five samples of size 5 each were used to establish the quality control parameters, and the sum of the sample standard deviations is
$$\sum_{i=1}^{25}s_i = 0.903$$
(i) Find the control limits for the S-chart.
(ii) Assuming the process is in control, estimate the process standard deviation.
Solution: From the given information, the number of samples is m = 25 and the sample size is n = 5. Thus, Appendix Table V gives $B_3 = 0$, $B_4 = 2.089$, and $c_4 = 0.940$.
(i) The average of the sample standard deviations is
$$\bar{s} = \frac{1}{m}\sum_{i=1}^{m}s_i = \frac{0.903}{25} = 0.0361$$
Therefore, the control limits for the S-chart are
$$\mathrm{UCL} = B_4\bar{s} = (2.089)(0.0361) = 0.0754, \qquad \mathrm{CL} = \bar{s} = 0.0361, \qquad \mathrm{LCL} = B_3\bar{s} = 0$$
(ii) The process standard deviation can be estimated as
$$\hat{\sigma} = \frac{\bar{s}}{c_4} = \frac{0.0361}{0.940} = 0.0384$$
Control Chart for the Mean (from s):
We can now write the parameters of the corresponding X̄-chart involving the use of the sample standard deviation. Let us assume that the average sample standard deviation $\bar{S}$ and the grand mean $\bar{\bar{X}}$ are available from the preliminary samples. Since $\bar{S}/c_4$ is an unbiased estimator of $\sigma$, the upper and lower control limits of the X̄-chart may also be written as
$$\mathrm{UCL} = \bar{\bar{X}} + 3\frac{\bar{S}}{c_4\sqrt{n}} = \bar{\bar{X}} + A_3\bar{S} \qquad (5.2.13)$$
and
$$\mathrm{LCL} = \bar{\bar{X}} - 3\frac{\bar{S}}{c_4\sqrt{n}} = \bar{\bar{X}} - A_3\bar{S} \qquad (5.2.14)$$
where $A_3$ is a constant that depends on the sample size n (see Appendix Table V).
Control chart for the mean (X̄-chart from s):
The upper control limit, center line, and lower control limit of the X̄-chart (from s) are given by
$$\mathrm{UCL} = \bar{\bar{x}} + A_3\bar{s}, \qquad \mathrm{CL} = \bar{\bar{x}}, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_3\bar{s} \qquad (5.2.15)$$
Example 5.5 Using the data of Table 5.1, construct the S- and X̄-charts.
Solution: The upper control limit, center line, and lower control limit for the S-chart are given by
$$\mathrm{UCL} = B_4\bar{s} = 29.502, \qquad \mathrm{CL} = \bar{s} = 14.122, \qquad \mathrm{LCL} = B_3\bar{s} = 0$$
where $B_3 = 0$ and $B_4 = 2.089$ for samples of size n = 5.
Similar to the analysis performed for the R-chart, we first examine the S-chart for signs of special variation. The S-chart is given in Figure 5.5(a), which shows that all plotted points fall within the control limits and there is no evidence of any special pattern. Thus, we may conclude that the only variation present in the process is due to common causes. In this case, we can proceed further to calculate the upper control limit, center line, and lower control limit for the X̄-chart:
$$\mathrm{UCL} = \bar{\bar{x}} + A_3\bar{s} = 78.837, \qquad \mathrm{CL} = \bar{\bar{x}} = 58.680, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_3\bar{s} = 38.523$$
where $A_3 = 1.427$ for samples of size n = 5.
load assembly.mat                                                     % load assembly data
[stats,plotdata] = controlchart(X,'chart',{'s','xbar'},'sigma','std'); % S- and Xbar-charts
The X̄-chart is shown in Figure 5.5(b), where sample number 2 is out-of-control. This indicates that the process may not be under control and that some special causes are affecting the process mean. Thus, a thorough investigation should be launched to find the special causes, and appropriate action should be taken to eliminate them before we proceed to recalculate the control limits for the ongoing process.
(a) s-chart (b) X-chart
FIGURE 5.5: S- and X-charts for assembly data.
With sample 2 deleted, the new upper control limit, center line, and lower control limit for the S-chart are given by
$$\mathrm{UCL} = B_4\bar{s} = 29.072, \qquad \mathrm{CL} = \bar{s} = 13.917, \qquad \mathrm{LCL} = B_3\bar{s} = 0$$
The new upper control limit, center line, and lower control limit for the X̄-chart are given by
$$\mathrm{UCL} = \bar{\bar{x}} + A_3\bar{s} = 77.421, \qquad \mathrm{CL} = \bar{\bar{x}} = 57.558, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_3\bar{s} = 37.694$$
The new X̄- and S-charts are shown in Figure 5.6. Notice now that all the points fall within the limits, indicating that the process may be stable.
Example 5.6 A component part for a jet aircraft engine is manufactured by an investment casting process. The vane opening on
this casting is an important functional parameter of the part. Table 5.2 presents 20 samples of five parts each. The values given in
the table have been coded by using the last three digits of the dimension; that is, 31.6 should be 0.50316 inch.
(i) Estimate the process mean and standard deviation.
(a) S-chart (b) X-chart
FIGURE 5.6: Revised S- and X-charts for assembly data.
TABLE 5.2: Vane-Opening Data.
Sample Number   X1   X2   X3   X4   X5
1 33 29 31 32 33
2 33 31 35 37 31
3 35 37 33 34 36
4 30 31 33 34 33
5 33 34 35 33 34
6 38 37 39 40 38
7 30 31 32 34 31
8 29 39 38 39 39
9 28 33 35 36 43
10 38 33 32 35 32
11 28 30 28 32 31
12 31 35 35 35 34
13 27 32 34 35 37
14 33 33 35 37 36
15 35 37 32 35 39
16 33 33 27 31 30
17 35 34 34 30 32
18 32 33 30 30 33
19 25 27 34 27 28
20 35 35 36 33 30
(ii) Construct the R- and X̄-charts.
(iii) Construct the S- and X̄-charts.
Solution: Table 5.3 shows the vane-opening data with extra columns displaying the sample means, sample ranges, and sample standard deviations. The grand mean, average range, and average standard deviation are also listed at the bottom of the table. For a sample size of n = 5, we have $D_3 = 0$, $D_4 = 2.114$, $A_2 = 0.577$, $B_3 = 0$, $B_4 = 2.089$, and $A_3 = 1.427$.
(i) Using Table 5.3, the process mean and standard deviation can be estimated as
$$\hat{\mu} = \bar{\bar{X}} = 33.32 \qquad\text{and}\qquad \hat{\sigma} = \frac{\bar{R}}{d_2} = \frac{5.8}{2.326} = 2.4936$$
where $d_2 = 2.326$ for samples of size n = 5.
TABLE 5.3: Vane-Opening Data with Sample Means, Ranges, and Standard Deviations.
Sample Number   X1   X2   X3   X4   X5   $\bar{X}_i$   $R_i$   $S_i$
1 33 29 31 32 33 31.60 4 1.67
2 33 31 35 37 31 33.40 6 2.61
3 35 37 33 34 36 35.00 4 1.58
4 30 31 33 34 33 32.20 4 1.64
5 33 34 35 33 34 33.80 2 0.84
6 30 31 32 34 31 31.60 4 1.52
7 38 33 32 35 32 34.00 6 2.55
8 31 35 35 35 34 34.00 4 1.73
9 27 32 34 35 37 33.00 10 3.81
10 33 33 35 37 36 34.80 4 1.79
11 35 37 32 35 39 35.60 7 2.61
12 33 33 27 31 30 30.80 6 2.49
13 35 34 34 30 32 33.00 5 2.00
14 32 33 30 30 33 31.60 3 1.52
15 35 35 36 33 30 33.80 6 2.39
16 33 33 27 31 30 30.80 6 2.49
17 35 34 34 30 32 33.00 5 2.00
18 32 33 30 30 33 31.60 3 1.52
19 25 27 34 27 28 28.20 9 3.42
20 35 35 36 33 30 33.80 6 2.39
X = 33.32 R = 5.8 S = 2.345
(ii) The upper control limit, center line, and lower control limit of the R-chart are
$$\mathrm{UCL} = D_4\bar{r} = 12.27, \qquad \mathrm{CL} = \bar{r} = 5.8, \qquad \mathrm{LCL} = D_3\bar{r} = 0$$
The R-chart is analyzed first to determine whether it is stable. Figure 5.7(a) shows that there is an out-of-control point on the R-chart at sample (subgroup) 9. Assuming that the out-of-control point at sample 9 has an assignable cause, it can be discarded from the data.
The X̄-chart can now be analyzed. The control limits of the X̄-chart are
$$\mathrm{UCL} = \bar{\bar{x}} + A_2\bar{r} = 36.67, \qquad \mathrm{CL} = \bar{\bar{x}} = 33.32, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_2\bar{r} = 29.97$$
Figure 5.7(b) shows that there are out-of-control points at samples 6, 8, 11, and 19. Assuming assignable causes, we can discard these samples from the data.
MATLAB code
>> load vaneopening.mat;  % load vane-opening data
>> [stats,plotdata] = controlchart(X,'chart',{'r','xbar'},'sigma','range');  % R & Xbar charts
Thus, if the out-of-control points at samples 6, 8, 9, 11, and 19 are discarded, then the new control limits are calculated using the remaining 15 samples. The revised grand mean and mean range are then given by
$$\bar{\bar{x}} = 33.21, \qquad \bar{r} = 5$$
The revised control limits for the new R-chart are
$$\mathrm{UCL} = D_4\bar{r} = 10.57, \qquad \mathrm{CL} = \bar{r} = 5, \qquad \mathrm{LCL} = D_3\bar{r} = 0$$
(a) R-chart (b) X-chart
FIGURE 5.7: R- and X-charts for the vane-opening data.
and the revised control limits for the new X̄-chart are
$$\mathrm{UCL} = \bar{\bar{x}} + A_2\bar{r} = 36.10, \qquad \mathrm{CL} = \bar{\bar{x}} = 33.21, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_2\bar{r} = 30.33$$
The new X̄- and R-charts are shown in Figure 5.8. Notice now that all the points fall within the limits, indicating that the process may be stable.
(a) R-chart (b) X-chart
FIGURE 5.8: Revised R- and X-charts for the vane-opening data.
(iii) The upper control limit, center line, and lower control limit of the S-chart are given by
$$\mathrm{UCL} = B_4\bar{s} = 4.899, \qquad \mathrm{CL} = \bar{s} = 2.345, \qquad \mathrm{LCL} = B_3\bar{s} = 0$$
The upper control limit, center line, and lower control limit of the X̄-chart are given by
$$\mathrm{UCL} = \bar{\bar{x}} + A_3\bar{s} = 36.67, \qquad \mathrm{CL} = \bar{\bar{x}} = 33.32, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_3\bar{s} = 29.97$$
We first examine the S-chart for signs of special variation. The S-chart is shown in Figure 5.9(a), where sample 9 is out-of-control. The X̄-chart is shown in Figure 5.9(b), where samples 6, 8, 11, and 19 are out-of-control.
(a) S-chart (b) X-chart
FIGURE 5.9: S- and X-charts for vane-opening data.
MATLAB code
>> load vaneopening.mat;  % load vane-opening data
>> [stats,plotdata] = controlchart(X,'chart',{'s','xbar'},'sigma','std');  % S & Xbar charts
With samples 6, 8, 9, 11, and 19 deleted, the new upper control limit, center line, and lower control limit of the S-chart are given by
$$\mathrm{UCL} = B_4\bar{s} = 4.281, \qquad \mathrm{CL} = \bar{s} = 2.049, \qquad \mathrm{LCL} = B_3\bar{s} = 0$$
and the new upper control limit, center line, and lower control limit of the X̄-chart are given by
$$\mathrm{UCL} = \bar{\bar{x}} + A_3\bar{s} = 36.14, \qquad \mathrm{CL} = \bar{\bar{x}} = 33.21, \qquad \mathrm{LCL} = \bar{\bar{x}} - A_3\bar{s} = 30.29$$
As shown in Figure 5.10, the remaining plotted points on the S- and X-charts indicate a stable process.
5.3 PROCESS CAPABILITY ANALYSIS
Process capability is the long-term performance level of the process after it has been brought under statistical control.
In other words, process capability is the range over which the natural variation of the process occurs as determined by
the system of common causes. A process may be in statistical control, but due to a high level of variation may not be
(a) s-chart (b) X-chart
FIGURE 5.10: Revised S- and X-charts for vane-opening data.
capable of producing output that is acceptable to customers. A process capability study assumes the system is stable and that the data are normally distributed. A process is stable when only normal random variation is present.
A process capability analysis is simply the comparison of the distribution of the process output with the product tolerances. Control chart limits can be compared with specification limits to determine the process capability. These specification limits are often set by customers, management, and/or product designers. Moreover, specification limits are usually two-sided, with an upper specification limit (USL) and a lower specification limit (LSL); they can also be one-sided, with either a USL or an LSL. Knowing the capability of our processes, we can better specify the quality performance requirements for new machines, parts, and processes. The capability of a process centered on the desired mean can be measured using the process capability potential $C_p$, which is defined as the ratio between the specification spread (USL − LSL) and the process spread (6σ):
$$C_p = \frac{\text{Specification spread}}{\text{Process spread}} = \frac{\mathrm{USL} - \mathrm{LSL}}{6\sigma} \qquad (5.3.1)$$
where $\sigma$ is the process standard deviation. The idea is illustrated graphically in Figure 5.11. Note that the specification spread is the performance spread acceptable to customers, management, and/or product designers.
Let X be the process quality characteristic that we want to monitor. The performance of the process with respect to the specification limits USL and LSL is described as follows:
P(X > USL) = percentage of nonconforming product produced by the process at the upper end.
P(X < LSL) = percentage of nonconforming product produced by the process at the lower end.
Thus, the total percentage of nonconforming product produced by the process is
$$P(X < \mathrm{LSL} \text{ or } X > \mathrm{USL}) = 1 - P(\mathrm{LSL} < X < \mathrm{USL}) \qquad (5.3.2)$$
Other capability indices that are frequently used in process capability analysis include:
Upper capability index: $C_{pU} = (\mathrm{USL} - \mu)/(3\sigma)$
Lower capability index: $C_{pL} = (\mu - \mathrm{LSL})/(3\sigma)$
Process capability index: $C_{pk} = \min(C_{pU}, C_{pL})$
These four measures of process capability quantify the degree to which a process produces output that meets the customer's specifications, and can be used effectively to summarize process capability information in a convenient unitless system. Calculating the process capability measures requires knowledge of the process mean and standard deviation, $\mu$ and $\sigma$, which are usually estimated from data collected from the process. Assume m preliminary samples
FIGURE 5.11: Specification spread vs. process spread.
TABLE 5.4: Process Capability Indices with $\hat{\sigma} = \bar{R}/d_2$.
Index — Estimated Equation
$C_p$: $(\mathrm{USL} - \mathrm{LSL})/(6\hat{\sigma})$
$C_{pU}$: $(\mathrm{USL} - \bar{\bar{X}})/(3\hat{\sigma})$
$C_{pL}$: $(\bar{\bar{X}} - \mathrm{LSL})/(3\hat{\sigma})$
$C_{pk}$: $\min(C_{pU}, C_{pL})$
(with equal sample size n) are available; then $\hat{\mu} = \bar{\bar{X}}$ is the grand mean and $\hat{\sigma} = \bar{R}/d_2$, where $\bar{R}$ is the mean of the sample ranges. Table 5.4 shows a summary of the process capability measures. Note that $C_{pk} \leq C_p$, and the two are equal when $\bar{\bar{X}}$ is at target.
The lower and upper capability indices $C_{pL}$ and $C_{pU}$ are used when only one direction from the mean is important. The process capability index $C_{pk}$ measures the distance of the process average $\bar{\bar{X}}$ from the closest specification limit. In other words, unlike $C_p$, the index $C_{pk}$ takes the process location into account. Moreover, $C_{pk}$ can be calculated in situations where there is only one specification limit. Thus, $C_{pk}$ can be used in place of the other three capability measures. Three possible cases can be considered:
Case 1: If $C_{pk} < 1$, the process is not capable of consistently producing product within the specifications; it produces more than about 2700 nonconforming units per million. It is impossible for the current process to meet specifications even when it is in statistical control. If the specifications are realistic, an effort must be made immediately to improve the process (i.e. reduce variation) to the point where it can produce consistently within specifications.
Case 2: If $C_{pk} \geq 1.33$, the process is highly capable and produces at most about 63 nonconforming units per million. $C_{pk}$ values of 1.33 or greater are considered to be industry benchmarks.
Case 3: If $1 \leq C_{pk} < 1.33$, the process is barely capable and produces more than about 63 but fewer than about 2700 nonconforming units per million. Such a process has a spread just about equal to the specification width; if the process mean moves to the left or the right, a significant portion of the product will start falling outside one of the specification limits. This process must be closely monitored.
Example 5.7 A pharmaceutical company carried out a process capability study on the weight of tablets produced and showed that
the process was in-control with a process mean X = 2504 mg and a mean range R = 91 mg from samples of size n = 4. Compute
the process capability indices for the specications limits 2800 mg and 2200 mg, and interpret your result.
Solution: From the given information, we have $\bar{\bar{X}} = 2504$, $\bar{R} = 91$, LSL = 2200, USL = 2800, and n = 4. For a sample of size n = 4, Appendix Table V gives $d_2 = 2.059$. Thus,
$$\hat{\sigma} = \frac{\bar{R}}{d_2} = \frac{91}{2.059} = 44.1962$$
$$C_p = \frac{\mathrm{USL} - \mathrm{LSL}}{6\hat{\sigma}} = 2.263, \qquad C_{pU} = \frac{\mathrm{USL} - \bar{\bar{X}}}{3\hat{\sigma}} = 2.232, \qquad C_{pL} = \frac{\bar{\bar{X}} - \mathrm{LSL}}{3\hat{\sigma}} = 2.293$$
$$C_{pk} = \min(C_{pU}, C_{pL}) = 2.232$$
Since $C_{pk} = 2.232 \geq 1.33$, the process is highly capable. Moreover, $C_{pU}$ is smaller than $C_{pL}$, which indicates that the process mean is shifted toward the upper specification limit. Thus, the process is highly capable of meeting the requirements but is not centered; some corrective action may have to be taken to center the process.
MATLAB code
n = 4; d2 = 2.059;                 % sample size equal to 4
LSL = 2200; USL = 2800;
Xbarbar = 2504; Rbar = 91;
sigma = Rbar/d2;
Cp = (USL - LSL)/(6*sigma);
Cpl = (Xbarbar - LSL)/(3*sigma);
Cpu = (USL - Xbarbar)/(3*sigma);
Cpk = min(Cpl,Cpu);
fprintf('Process Capability Indices: Cp=%.3f, Cpl=%.3f, Cpu=%.3f, Cpk=%.3f\n',Cp,Cpl,Cpu,Cpk);
Example 5.8 Assume that the vane-opening data (Example 5.6) are normally distributed and that the specifications are 34 ± 6.5.
(i) Determine the process capability indices.
(ii) What proportion of the product will not meet specifications?
Solution: From the given information, we have LSL = 27.5 and USL = 40.5. From the revised R- and X̄-charts in Example 5.6, we have $\bar{\bar{X}} = 33.21$ and $\bar{R} = 5$. Thus, the estimated process standard deviation is $\hat{\sigma} = \bar{R}/d_2 = 5/2.326 = 2.1496$.
(i) The process capability indices are given by
$$C_p = \frac{\mathrm{USL} - \mathrm{LSL}}{6\hat{\sigma}} = 1.0079, \qquad C_{pU} = \frac{\mathrm{USL} - \bar{\bar{X}}}{3\hat{\sigma}} = 1.1299, \qquad C_{pL} = \frac{\bar{\bar{X}} - \mathrm{LSL}}{3\hat{\sigma}} = 0.8859$$
$$C_{pk} = \min(C_{pU}, C_{pL}) = 0.8859$$
The $C_{pk}$ value is quite low ($C_{pk} < 1$), which indicates that the process is not capable of consistently producing product within the specifications. Since $C_{pL}$ is smaller than $C_{pU}$, the process mean is shifted toward the lower specification limit, reflecting a relatively poor capability of meeting the low side of the design specification.
(ii) The total percentage of nonconforming product produced by the process is
$$\text{Percentage of Nonconforming} = 1 - P(\mathrm{LSL} < X < \mathrm{USL}) = 1 - P\Big(\frac{\mathrm{LSL} - \bar{\bar{X}}}{\hat{\sigma}} < \frac{X - \bar{\bar{X}}}{\hat{\sigma}} < \frac{\mathrm{USL} - \bar{\bar{X}}}{\hat{\sigma}}\Big)$$
$$= 1 - P(-2.6578 < Z < 3.3898) = 1 - \big(\Phi(3.3898) - \Phi(-2.6578)\big) = 0.0043$$
where $\Phi(\cdot)$ is the cdf of the standard normal distribution N(0, 1). Thus, the proportion of product not meeting specifications is 0.43%, which is quite low. Although the process is in control, it is not capable of meeting the stated specifications. In this case, common causes must be addressed for process improvement.
Example 5.9 A dimension has specifications of 2.125 ± 0.005. Data from the process indicate that the distribution is normal, and the X̄- and R-charts indicate that the process is stable. The control charts used a sample size of five, and it is found that $\bar{\bar{X}} = 2.1261$ and $\bar{R} = 0.0055$. Determine the fraction of the manufactured product that will have this particular dimension outside the specification limits.
Solution: From the given information, we have LSL = 2.120, USL = 2.130, $\bar{\bar{X}} = 2.1261$, and $\bar{R} = 0.0055$. For a sample of size n = 5, Appendix Table V gives $d_2 = 2.326$. Thus, the estimated process standard deviation is $\hat{\sigma} = \bar{R}/d_2 = 0.0055/2.326 = 0.00236$.
The total percentage of nonconforming product produced by the process is
$$\text{Percentage of Nonconforming} = 1 - P(\mathrm{LSL} < X < \mathrm{USL}) = 1 - P\Big(\frac{\mathrm{LSL} - \bar{\bar{X}}}{\hat{\sigma}} < \frac{X - \bar{\bar{X}}}{\hat{\sigma}} < \frac{\mathrm{USL} - \bar{\bar{X}}}{\hat{\sigma}}\Big)$$
$$= 1 - P(-2.58 < Z < 1.65) = 1 - \big(\Phi(1.65) - \Phi(-2.58)\big) = 0.0544$$
where $\Phi(\cdot)$ is the cdf of the standard normal distribution N(0, 1). Thus, approximately 5.44% of the products will fall outside the specifications for this quality characteristic.
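A small MATLAB sketch (not from the text) that reproduces this nonconforming fraction, assuming the Statistics Toolbox normcdf:
LSL = 2.120; USL = 2.130; xbarbar = 2.1261; rbar = 0.0055; d2 = 2.326;
sigma = rbar/d2;                                                 % about 0.00236
p_nc = 1 - (normcdf((USL - xbarbar)/sigma) - normcdf((LSL - xbarbar)/sigma));  % about 0.054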
5.4 INDIVIDUAL CONTROL CHARTS
Some situations exist in which the sample consists of a single observation, that is, the sample size is equal to 1. A
sample of size one can occur when production is very slow or costly, and it is impractical to allow the sample size to
be greater than one. The individual control charts, I- and MR-charts, for variable data are appropriate for this type of
situation. The I-chart (also called X-chart) serves the same function as the X̄-chart except that X is now the value of an individual measurement. Assuming that $X \sim N(\mu_x, \sigma_x^2)$, the control limits of the I-chart are given by
$$\mathrm{UCL} = \mu_x + 3\sigma_x, \qquad \mathrm{CL} = \mu_x, \qquad \mathrm{LCL} = \mu_x - 3\sigma_x \qquad (5.4.1)$$
Since $\mu_x$ and $\sigma_x$ are unknown, they need to be estimated. Suppose m preliminary observations (samples) are available, each of size one. Then the process mean $\mu_x$ can be estimated as the average of the individual measurements, $\hat{\mu}_x = \bar{x} = (1/m)\sum_{i=1}^{m}x_i$.
Since only individual measurements are available, the moving ranges $MR_i = |X_i - X_{i-1}|$, $i = 2, \ldots, m$, between two successive samples $X_{i-1}$ and $X_i$ need to be calculated to estimate the process variability (standard deviation) as follows:
$$\hat{\sigma} = \frac{\overline{MR}}{d_2} \qquad (5.4.2)$$
where
$$\overline{MR} = \frac{1}{m-1}\sum_{i=2}^{m}MR_i \qquad (5.4.3)$$
Note that since the data are taken as pairs $(X_{i-1}, X_i)$ to calculate the moving ranges $MR_i = |X_i - X_{i-1}|$, the value of $d_2$ is that for samples of size two, namely $d_2 = 1.128$. Notice also that the division is by $m - 1$, since only $m - 1$ moving range values can be calculated (there is no moving range for subgroup 1), where m is the number of observations.
Control chart for individuals (I-chart):
The upper control limit, center line, and lower control limit for the I-chart are given by
$$\mathrm{UCL} = \bar{x} + 3\frac{\overline{mr}}{d_2} = \bar{x} + 3\frac{\overline{mr}}{1.128}, \qquad \mathrm{CL} = \bar{x}, \qquad \mathrm{LCL} = \bar{x} - 3\frac{\overline{mr}}{d_2} = \bar{x} - 3\frac{\overline{mr}}{1.128} \qquad (5.4.4)$$
Example 5.10 A company is manufacturing high-precision tubes. The quality control department wants to determine whether the production process is under control. For simplicity, we assume that from each batch the outer diameter of one randomly selected tube is measured. The measured data are given in Table 5.5. Construct the I-chart.
TABLE 5.5: Outer diameter of tubes.
Sample 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Diameter 99.82 99.63 99.89 99.45 100.03 99.76 100.23 99.81 99.91 100.12 100.05 99.78 100.01 100.04 99.95
Solution: The upper control limit, center line, and lower control limit for the I-chart are given by
$$\mathrm{UCL} = \bar{x} + 3\frac{\overline{mr}}{d_2} = 99.8987 + 3\frac{0.2593}{1.128} = 100.588$$
$$\mathrm{CL} = \bar{x} = 99.8987$$
$$\mathrm{LCL} = \bar{x} - 3\frac{\overline{mr}}{d_2} = 99.8987 - 3\frac{0.2593}{1.128} = 99.2093$$
X = [99.82 99.63 99.89 99.45 100.03 99.76 100.23 99.81 ...
     99.91 100.12 100.05 99.78 100.01 100.04 99.95];       % outer diameter data
[stats,plotdata] = controlchart(X,'chart','i','width',2);   % plot I-chart
fprintf('Control limits for I-chart: UCL=%g, CL=%g, LCL=%g\n', ...
    plotdata.ucl(1), plotdata.cl(1), plotdata.lcl(1));
The I-chart is shown in Figure 5.12, where all samples appear to be in-control. Thus, the process of producing the tubes
is considered to be in statistical control.
On the other hand, the MR-chart is used to monitor the process variability. It can be shown that $\mu_{MR}$ and $\sigma_{MR}$ can be estimated as
$$\hat{\mu}_{MR} = \overline{MR} \qquad\text{and}\qquad \hat{\sigma}_{MR} = \frac{d_3}{d_2}\overline{MR} \qquad (5.4.5)$$
Control chart for moving ranges (MR-chart):
The upper control limit, center line, and lower control limit for the MR-chart are given by
$$\mathrm{UCL} = D_4\overline{mr} = 3.267\,\overline{mr}, \qquad \mathrm{CL} = \overline{mr}, \qquad \mathrm{LCL} = D_3\overline{mr} = 0 \qquad (5.4.6)$$
FIGURE 5.12: I-chart for outer diameter data.
It is important to note that the moving range control chart cannot be interpreted in the same way as the R-chart presented earlier with respect to patterns or trends. Patterns or trends identified on the moving range chart do not necessarily indicate that the process is out of control, because the moving ranges $MR_i = |X_i - X_{i-1}|$ are correlated: there is a natural dependency between successive $MR_i$ values.
Example 5.11 Using the data of Table 5.5, construct the MR-chart.
Solution: The upper control limit, center line, and lower control limit for the MR-chart are given by
$$\mathrm{UCL} = D_4\overline{mr} = (3.267)(0.2593) = 0.8471, \qquad \mathrm{CL} = \overline{mr} = 0.2593, \qquad \mathrm{LCL} = D_3\overline{mr} = 0$$
[stats, plotdata] = controlchart(X, 'chart', 'mr', 'width', 2);  % plot MR-chart
fprintf('Control limits for MR-chart: UCL=%g, CL=%g, LCL=%g\n', ...
        plotdata.ucl(2), plotdata.cl(2), plotdata.lcl(2));
The MR-chart is shown in Figure 5.13, where all samples appear to be in-control.
Example 5.12 Packages of a particular instant dry food are filled by a machine and weighed. The weights (in ounces) for 15
successive packages have been collected and are displayed in Table 5.6. Construct the I- and MR-charts.
TABLE 5.6: Weights for dry food packages.
Bottle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Weight 19.85 19.92 19.93 19.26 20.36 19.96 19.87 19.80 20.40 19.98 20.17 19.81 20.21 19.64 20.15
Solution: The moving ranges are calculated using MR_i = |X_i − X_{i−1}|. To illustrate, consider the first moving range at subgroup 2:
MR_2 = |X_2 − X_1| = |19.92 − 19.85| = 0.07
The remaining moving ranges are calculated accordingly and are given in Table 5.7.
FIGURE 5.13: MR-chart for outer diameter data.
TABLE 5.7: Weights for dry food packages with moving ranges.
Bottle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Weight 19.85 19.92 19.93 19.26 20.36 19.96 19.87 19.80 20.40 19.98 20.17 19.81 20.21 19.64 20.15
Moving Range - 0.07 0.01 0.67 1.10 0.40 0.09 0.07 0.60 0.42 0.19 0.36 0.40 0.57 0.51
The upper control limit, center line, and lower control limit for the I-chart are given by
UCL = x̄ + 3 mr̄/d_2 = 19.954 + 3(0.39/1.128) = 20.9909
CL = x̄ = 19.954
LCL = x̄ − 3 mr̄/d_2 = 19.954 − 3(0.39/1.128) = 18.9171
The upper control limit, center line, and lower control limit for the MR-chart are given by
UCL = D_4 mr̄ = (3.267)(0.39) = 1.27395
CL = mr̄ = 0.39
LCL = D_3 mr̄ = 0
The I- and MR-charts are displayed in Figure 5.14, which shows that all samples fall within the control limits. Thus,
the process appears to be in statistical control.
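For completeness, the charts of Figure 5.14 can be reproduced with the same controlchart calls used in Examples 5.10 and 5.11. The sketch below assumes that the Statistics Toolbox controlchart function is available; W holds the weights of Table 5.6.

W = [19.85 19.92 19.93 19.26 20.36 19.96 19.87 19.80 ...
     20.40 19.98 20.17 19.81 20.21 19.64 20.15];   % package weights (Table 5.6)
controlchart(W, 'chart', 'i');    % I-chart for the individual weights
figure;
controlchart(W, 'chart', 'mr');   % MR-chart for the moving ranges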
5.5 CUMULATIVE SUM CONTROL CHART (CUSUM-CHART)
Although Shewhart charts with 3σ limits can quickly detect large process changes, they are ineffective for small, sustained process changes (for example, changes smaller than 1.5σ). An alternative control chart that has been developed to detect small shifts in the process mean is the so-called cumulative sum (CUSUM) control chart (CUSUM-chart), which is more sensitive to small shifts in the process because it is based not only on the current observation, but also on the most recent past observations. Moreover, a CUSUM-chart is especially effective with samples of size n = 1.
Suppose m preliminary samples of size n ≥ 1 are available. Then, the CUSUM-chart plots the cumulative sums C_i of deviations of the observations from some target mean μ_0,
C_i = Σ_{j=1}^{i} (x̄_j − μ_0) = [ Σ_{j=1}^{i−1} (x̄_j − μ_0) ] + (x̄_i − μ_0) = C_{i−1} + (x̄_i − μ_0),   i = 1, ..., m    (5.5.1)
against the subgroup (sample) i, where x̄_j is the mean of the j-th sample (j = 1, ..., i).
FIGURE 5.14: I- and MR-charts for package weights.
As long as the process remains in control at the target mean μ_0, the cumulative sums C_i will be approximately zero. Otherwise, if the process shifts away from the target mean μ_0, then C_i becomes increasingly large in absolute value. The tabular CUSUM for monitoring the process mean involves two statistics, C_i^+ and C_i^−, defined as
C_i^+ = max[0, x̄_i − (μ_0 + K) + C_{i−1}^+]
C_i^− = max[0, (μ_0 − K) − x̄_i + C_{i−1}^−]    (5.5.2)
where
C_i^+ is the accumulation of deviations above the target mean, with initial value C_0^+ = 0;
C_i^− is the accumulation of deviations below the target mean, with initial value C_0^− = 0;
K is called the reference value, given by K = |μ_1 − μ_0|/2, where μ_1 is the out-of-control mean that we are interested in detecting.
A deviation from the target that is larger than K increases either the one-sided upper CUSUM C_i^+ or the one-sided lower CUSUM C_i^−. If the out-of-control mean μ_1 is unknown, there are methods for determining the value K. In this situation we can let K = k σ_x̄, where k is some constant chosen so that a particular shift is detected, and σ_x̄ = σ/√n, with σ denoting the process standard deviation. For example, if a shift from target of 1 standard deviation is important to detect (i.e., we want to detect whether the target has shifted to μ_0 + 1σ_x̄ or μ_0 − 1σ_x̄), then k = 1 and K = 1σ_x̄. If the process standard deviation is not known, it must be estimated from the data provided. The two-sided CUSUM-chart plots the values of C_i^+ and C_i^− for each sample i. A control limit violation (out-of-control) occurs when either C_i^+ or C_i^− exceeds a specified control limit (or threshold) H = hσ_x̄, where h is typically equal to 5.
For simplicity of CUSUM limits calculation, it is preferable to use the standardized value y_i = (x̄_i − μ_0)/σ_x̄ = (x̄_i − μ_0)/(σ/√n) of the variable x̄_i. Since the value of σ is unknown, it is usually estimated as σ̂ = MR̄/d_2 for individual observations, and as σ̂ = R̄/d_2 for samples of size n > 1.
Control chart for standardized cumulative sums (CUSUM-chart):
The one-sided upper and lower CUSUMs of the standardized CUSUM-chart are given by
C_i^+ = max[0, y_i − k + C_{i−1}^+]
C_i^− = max[0, −k − y_i + C_{i−1}^−]    (5.5.3)
The upper control limit, center line, and lower control limit of the CUSUM-chart are given by
UCL = h
CL = 0
LCL = −h    (5.5.4)
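The recursion (5.5.3) is easy to implement directly. The following sketch uses only base MATLAB; the standardized observations y are simulated here purely for illustration.

y = randn(1, 30);                           % hypothetical standardized observations y_i
k = 0.5;  h = 5;                            % reference value and decision interval
m = numel(y);
Cplus = zeros(1, m);  Cminus = zeros(1, m);
for i = 1:m
    if i == 1
        prevP = 0;  prevM = 0;              % initial values C_0^+ = C_0^- = 0
    else
        prevP = Cplus(i-1);  prevM = Cminus(i-1);
    end
    Cplus(i)  = max(0,  y(i) - k + prevP);  % accumulation above the target
    Cminus(i) = max(0, -y(i) - k + prevM);  % accumulation below the target
end
outOfControl = find(Cplus > h | Cminus > h) % samples violating the limit h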
Example 5.13 The data given in Table 5.8 are average readings from a process, taken every hour. (Read the observations down, from left.) The target value for the mean is μ_0 = 160.
TABLE 5.8: Process readings.
159.0480 160.2766 160.3368 162.0092 162.2135
156.8969 159.2432 157.7590 159.2716 164.2690
159.1034 160.2861 160.8015 160.2469 159.7530
156.8976 163.2101 162.4781 162.5760 158.2998
161.8597 162.6982 162.8001 161.5964 158.6755
161.8039 159.1031 157.9376 160.6725 159.1114
1. Estimate the process standard deviation σ.
2. Set up and apply a tabular CUSUM for this process, using standardized values h = 5 and k = 0.5. Interpret this chart.
Solution:
1. Since the data are individual observations, an estimate of the process standard deviation is given by
σ̂ = MR̄/d_2 = 2.0112/1.128 = 1.7830
X = [159.0480 156.8969 159.1034 156.8976 161.8597 161.8039 160.2766 159.2432 ...
     160.2861 163.2101 162.6982 159.1031 160.3368 157.7590 160.8015 162.4781 ...
     162.8001 157.9376 162.0092 159.2716 160.2469 162.5760 161.5964 160.6725 ...
     162.2135 164.2690 159.7530 158.2998 158.6755 159.1114];
mu0 = 160;                      % target value for the mean
MR = slidefun(@range, 2, X);    % moving ranges
MR2bar = mean(MR(2:end));
d2 = 1.128;                     % from Table in the Appendix
sigmahat = MR2bar/d2;           % estimate of the process standard deviation
h = 5;
k = 0.5;
[Cplus, Cminus] = cusumchart(X, mu0, sigmahat, h, k);   % plot CUSUM-chart
2. The CUSUM-chart is shown in Figure 5.15, where the process appears to be in-control. So there does not appear to have been a shift of 0.5σ from the target value of 160.
5.6 EXPONENTIALLY WEIGHTED MOVING AVERAGE CONTROL CHART (EWMA-CHART)
Another alternative control chart that is generally used for detecting small shifts in the process mean is the so-called exponentially weighted moving average (EWMA) chart, which plots weighted moving average values. Like the CUSUM-chart, the EWMA-chart is also preferred when the samples are of size n = 1 (i.e. individual measurements).
FIGURE 5.15: CUSUM-chart for process readings.
Suppose m preliminary samples of size n ≥ 1 are available. The EWMA statistic is defined by
z_i = λ x̄_i + (1 − λ) z_{i−1},   i = 1, ..., m    (5.6.1)
where
the weighting factor λ, 0 < λ ≤ 1, determines the depth of memory for the EWMA;
x̄_i is the most current observation (i.e. the average of the sample at time i);
z_{i−1} is the previous EWMA statistic, in which the initial value z_0 is equal to the process target mean μ_0. If μ_0 is unknown, then z_0 = x̄.
The EWMA-chart plots z_i against the sample i. Let σ be the process standard deviation (i.e. σ = σ_x). It can be shown that for large i, the standard deviation of z_i is given by
σ_{z_i} = σ_x̄ √{ (λ/(2−λ)) [1 − (1−λ)^{2i}] } → (σ/√n) √{λ/(2−λ)},   as i → ∞    (5.6.2)
Note that large i means that the EWMA control chart has been running for several time periods.
Control chart for the exponentially weighted moving average (EWMA-chart):
The upper control limit, center line, and lower control limit of the EWMA-chart are given by
UCL = μ_0 + L (σ/√n) √{λ/(2−λ)}
CL = μ_0
LCL = μ_0 − L (σ/√n) √{λ/(2−λ)}    (5.6.3)
where L is the width of the control limits. For individual observations, we use σ̂ = MR̄/d_2; and for samples of size n > 1, we use σ̂ = R̄/d_2.
The values of the parameters L and λ can have a considerable impact on the performance of the chart. The parameter λ determines the rate at which past (historical) data enter into the calculation of the EWMA statistic. A value of λ = 1 implies that only the most recent observation influences the EWMA. Thus, a large value of λ gives more weight to recent data and less weight to historical data, while a small value of λ gives more weight to historical data. Although the choice of λ and L is somewhat arbitrary, in practice the values 0.05 ≤ λ ≤ 0.25 and 2.6 ≤ L ≤ 3 work well. Note that L = 3 matches other control charts, but it may be necessary to reduce L slightly for small values of λ.
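A minimal sketch of the EWMA computation in (5.6.1) and the asymptotic limits in (5.6.3) is given below, using base MATLAB only; the observation vector, target, and process standard deviation are assumed values for illustration.

x = randn(1, 30) + 160;        % hypothetical individual observations
mu0 = 160;  sigma = 1;  n = 1; % assumed target, process sigma, and sample size
lambda = 0.15;  L = 2.7;       % chart parameters
m = numel(x);
z = zeros(1, m);  z0 = mu0;    % initial EWMA value z_0 = mu_0
for i = 1:m
    if i == 1, zprev = z0; else, zprev = z(i-1); end
    z(i) = lambda*x(i) + (1 - lambda)*zprev;                   % EWMA statistic (5.6.1)
end
UCL = mu0 + L*(sigma/sqrt(n))*sqrt(lambda/(2 - lambda));       % asymptotic limits (5.6.3)
LCL = mu0 - L*(sigma/sqrt(n))*sqrt(lambda/(2 - lambda));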
Example 5.14 Using the data of Table 5.8:
1. Apply an EWMA control chart to these data using λ = 0.15 and L = 2.7.
Solution: The EWMA-chart is shown in Figure 5.16, where the process is out-of-control at sample number 26.
lambda = 0.15;
L = 2.7;
ewmachart(X, lambda, L, mu0, sigmahat);   % plot EWMA-chart
FIGURE 5.16: EWMA-chart for process readings.
Example 5.15 The concentration of a chemical product is measured by taking four samples from each batch of material. The average concentration of these measurements is shown for the last 20 batches in Table 5.9. Assume the target value of concentration for this process is 100.
TABLE 5.9: Concentration Measurements.
Batch Concentration Batch Concentration
1 104.50 11 95.40
2 99.90 12 94.50
3 106.70 13 104.50
4 105.20 14 99.70
5 94.80 15 97.70
6 94.60 16 97.00
7 104.40 17 95.80
8 99.40 18 97.40
9 100.30 19 99.00
10 100.30 20 102.60
(i) Estimate the process standard deviation σ.
(ii) Set up and apply a tabular CUSUM for this process, using standardized values h = 5 and k = 0.5. Does the process appear to be in control at the target?
(iii) Apply an EWMA control chart to these data using λ = 0.1 and L = 2.7. Interpret this chart.
Solution:
(i) Since the data are individual observations, an estimate of the process standard deviation is given by
σ̂ = MR̄/d_2 = 3.71/1.128 = 3.29
(ii) The CUSUM-chart is given in Figure 5.17(a), which shows that the process is in-control.
(iii) The EWMA-chart is given in Figure 5.17(b), which shows that the process is in-control.
FIGURE 5.17: CUSUM- and EWMA-charts for concentration measurements.
Example 5.16 Packages of a particular instant dry food are filled by a machine and weighed. The weights (in ounces) for 24
successive packages have been collected and are displayed in Table 5.10. Assume the target mean weight for this process is 20
ounces.
TABLE 5.10: Weights for dry food packages.
Package Weight Package Weight
1 20.26 13 20.30
2 19.97 14 19.77
3 19.76 15 20.40
4 19.72 16 19.98
5 19.69 17 19.91
6 19.85 18 20.18
7 19.96 19 20.08
8 20.03 20 20.05
9 20.06 21 20.20
10 19.71 22 19.90
11 19.68 23 19.95
12 19.94 24 20.12
(i) Estimate the process standard deviation σ.
(ii) Set up and apply a tabular CUSUM for this process, using standardized values h = 5 and k = 0.5. Does the process appear to be in control at the target?
(iii) Apply an EWMA control chart to these data using λ = 0.1 and L = 2.7. Interpret this chart.
Solution:
(i) Since the data are individual observations, an estimate of the process standard deviation is given by
σ̂ = MR̄/d_2 = 0.20/1.128 = 0.18
(ii) The CUSUM-chart is given in Figure 5.18(a), which shows that the process is in-control. Thus, there does not appear to have been a shift of 0.5σ from the target value of 20 ounces.
(iii) The EWMA-chart is given in Figure 5.18(b), which shows that the process is in-control since all EWMA statistics fall within the control limits.
FIGURE 5.18: CUSUM- and EWMA-charts for dry food packages.
5.7 CONTROL CHARTS FOR ATTRIBUTES
In quality control, a defective quality characteristic is called a defect or non-conformance, whereas a unit that has at
least one defect is called a defective or nonconforming unit. In other words, a defective unit may have more than one
defect.
5.7.1 CONTROL CHART FOR PROPORTION DEFECTIVE: p-CHART
The proportion or fraction defective in a population is dened as the ratio of the number of defective items in the
population to the total number of items in that population. The goal of the fraction nonconforming control chart is
to monitor the proportion of defective units (fraction defective) for a process of interest using the data collected over
m samples (subgroups) each of size n. The p-chart plots the sample proportions (fraction defectives) vs the sample
number. If D is the number of units that are defective in a random sample of size n, then D ~ bino(n, p). The sample proportion defective P̂ = D/n is the ratio of the number of defective units in the sample, D, to the sample size n. The mean and standard deviation of P̂ are μ_P̂ = p and σ_P̂ = √(p(1−p)/n), respectively.
Since p is unknown, it must be estimated from the available data. Suppose m preliminary samples are available, each of size n. If D_i is the number of defectives in the i-th sample, then the fraction defective of the i-th sample is P̂_i = D_i/n, i = 1, ..., m. An unbiased estimator of p is the average p̄ of the fraction defectives of the m samples,
CL = p̄ = (1/m) Σ_{i=1}^{m} P̂_i = (1/(mn)) Σ_{i=1}^{m} D_i
Since σ_P̂ = √(p(1−p)/n), the upper and lower control limits may be expressed as
UCL = p̄ + 3√(p̄(1−p̄)/n)   and   LCL = p̄ − 3√(p̄(1−p̄)/n)
Control chart for proportion defective (p-chart):
The upper control limit, center line, and lower control limit of the p-chart are given by
UCL = p̄ + 3√(p̄(1−p̄)/n)
CL = p̄
LCL = p̄ − 3√(p̄(1−p̄)/n)    (5.7.1)
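A direct computation of the p-chart limits in (5.7.1) requires only the defect counts and the sample size. The sketch below uses hypothetical counts D for m samples of size n and truncates a negative lower limit at zero, as is done in the examples that follow.

D = [4 7 5 9 6 3 8 5 7 6];       % hypothetical numbers of defectives in m samples
n = 200;  m = numel(D);          % assumed sample size and number of samples
pbar = sum(D)/(m*n);             % center line CL
UCL  = pbar + 3*sqrt(pbar*(1 - pbar)/n);
LCL  = max(0, pbar - 3*sqrt(pbar*(1 - pbar)/n));   % negative limits are set to 0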
Example 5.17 A quality control inspector wishes to construct a fraction-defective control chart for a light bulb production line.
Packages containing 1000 light bulbs are randomly selected, and all 1000 bulbs are light-tested. The results of the tests are given
in Table 5.11. Construct and plot a p-chart.
TABLE 5.11: Number of defectives observed in samples of 1000 light bulbs.
Sample 1 2 3 4 5 6 7 8 9 10 11 12 13
Number of Defectives 9 12 13 12 11 9 7 0 12 8 9 7 11
Solution: The center line CL is given by
CL = p̄ = Σ(Number of Defectives) / ((13)(1000)) = 120/13000 = 0.0092
The upper and lower control limits are given by
UCL = p̄ + 3√(p̄(1−p̄)/n) = 0.0183
LCL = p̄ − 3√(p̄(1−p̄)/n) = 0.0001
X = [9 12 13 12 11 9 7 0 12 8 9 7 11];   % number of defectives
n = 1000;                                % total number of inspected items
[stats, plotdata] = controlchart(X, 'chart', 'p', 'unit', n);   % plot p-chart
fprintf('Control limits for p-chart: UCL=%g, CL=%g, LCL=%g\n', ...
        plotdata.ucl(1), plotdata.cl(1), plotdata.lcl(1));
The p-chart, shown in Figure 5.19, indicates that sample number 8 is outside the control limits.
Example 5.18 When a coupon redemption process is in control, a maximum of 3% of the rebates are done incorrectly, for a maximum acceptable proportion of errors of 0.03. For 20 sequential samples of 100 coupon redemptions each, an audit reveals that the numbers of errors found in the rational subgroup samples are as given in Table 5.12. Construct and plot a p-chart.
FIGURE 5.19: p-chart for lightbulb data.
TABLE 5.12: Number of errors observed in samples of 100 coupon redemptions.
Sample 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Number of Errors 2 2 3 6 1 3 6 4 7 2 5 6 3 2 4 5 3 8 1 4
Solution: The center line CL is given by
CL = p̄ = Σ(Number of Errors) / ((20)(100)) = 77/2000 = 0.0385
The upper and lower control limits are given by
UCL = p̄ + 3√(p̄(1−p̄)/n) = 0.09622
LCL = p̄ − 3√(p̄(1−p̄)/n) = 0
The p-chart is shown in Figure 5.20. As can be observed in the gure, only common-cause variation is included in the
chart, and the process appears to be stable.
FIGURE 5.20: p-chart for the coupon data.
5.7.2 CONTROL CHART FOR NUMBER OF DEFECTIVES: np-CHART
The np control chart is a slight variation of the p-chart, except now the actual numbers of defective items D_i are plotted on the control chart against the sample number, where i = 1, ..., m. The control limits are based on the number of defective units instead of the fraction defective. The np-chart and the p-chart will give the same resulting information. That is, if the p-chart indicates an out-of-control situation for a process, then the np-chart for the same data will also signal out-of-control. For the np-chart, the average fraction defective is estimated as
p̄ = (1/(mn)) Σ_{i=1}^{m} D_i
Control chart for number of defectives (np-chart):
The upper control limit, center line, and lower control limit of the np-chart are given by
UCL = n p̄ + 3√(n p̄(1 − p̄))
CL = n p̄
LCL = n p̄ − 3√(n p̄(1 − p̄))    (5.7.2)
Example 5.19 Using the data in Table 5.11, construct and plot an np-chart.
Solution: The center line CL is given by
CL = n p̄ = 9.23077
The upper and lower control limits are given by
UCL = n p̄ + 3√(n p̄(1 − p̄)) = 18.3033
LCL = n p̄ − 3√(n p̄(1 − p̄)) = 0.15828

X = [9 12 13 12 11 9 7 0 12 8 9 7 11];   % number of defectives
n = 1000;                                % total number of inspected items
[stats, plotdata] = controlchart(X, 'chart', 'np', 'unit', n);   % plot np-chart
fprintf('Control limits for np-chart: UCL=%g, CL=%g, LCL=%g\n', ...
        plotdata.ucl(1), plotdata.cl(1), plotdata.lcl(1));
The np-chart, shown in Figure 5.21, indicates that sample number 8 is outside the control limits.
Example 5.20 For each of 15 days, a number of magnets used in electric relays are inspected and the number of defectives is recorded. The total number of magnets tested is 15,000. The results are given in Table 5.13. Construct and plot the p- and np-charts.
TABLE 5.13: Number of defectives observed in samples of 1000 magnets.
Day Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of Defectives 201 152 190 214 186 193 183 225 168 182 174 182 182 169 187
Solution: Since the total number of magnets tested over the 15 days is 15,000, the average sample size is n = 15000/15 = 1000.
Control limits for the p-chart: UCL = 0.22277, CL = 0.185867, LCL = 0.148963
Control limits for the np-chart: UCL = 222.77, CL = 185.867, LCL = 148.963
FIGURE 5.21: np-chart for lightbulb data.
A close examination of the control charts, shown in Figure 5.22, reveals that for the 8th day the sample is above the upper control limit. This indicates a significantly high percentage defective, which implies that there is an assignable cause in the manufacturing process. Such cases may also result from a lapse by the inspection department or from the sample size being quite different from the average used to calculate the control limits.
X = [201 152 190 214 186 193 183 225 168 182 174 182 182 169 187];   % number of defectives
n = 1000;                                % total number of inspected items
[stats, plotdata] = controlchart(X, 'chart', 'p', 'unit', n);    % plot p-chart
fprintf('Control limits for p-chart: UCL=%g, CL=%g, LCL=%g\n', ...
        plotdata.ucl(1), plotdata.cl(1), plotdata.lcl(1));
figure;
[stats, plotdata] = controlchart(X, 'chart', 'np', 'unit', n);   % plot np-chart
fprintf('Control limits for np-chart: UCL=%g, CL=%g, LCL=%g\n', ...
        plotdata.ucl(1), plotdata.cl(1), plotdata.lcl(1));
FIGURE 5.22: (a) p-chart, and (b) np-chart for magnet data.
5.7.3 CONTROL CHART FOR COUNT OF DEFECTS: c-CHART
The c-chart plots the number of defects per inspection unit (item). Assume that the number of defects X in a given inspection unit follows a Poisson distribution poiss(c), with parameter c. Then, the mean and standard deviation of X are μ_X = c and σ_X = √c, respectively. Since c is unknown, it must be estimated from the available data. Suppose m preliminary samples are available, each of size n. If X_i is the number of defects in the i-th sample, then an estimator of the mean number of defects over the entire data set is given by
c̄ = (1/m) Σ_{i=1}^{m} X_i
Control chart for count of defects (c-chart):
The upper control limit, center line, and lower control limit of the c-chart are given by
UCL = c̄ + 3√c̄
CL = c̄
LCL = c̄ − 3√c̄    (5.7.3)
Example 5.21 The number of noticeable defects found by quality control inspectors in a randomly selected 1-square-meter spec-
imen of woolen fabric from a certain loom is recorded each hour for a period of 20 hours. The results are shown in Table 5.14.
Construct and plot a c-chart to monitor the textile production process.
TABLE 5.14: Number of defects observed in specimens of woolen fabric over 20 consecutive hours.
Hour 1 2 3 4 5 6 7 8 9 10
Number of Defects 11 14 10 8 3 9 10 2 5 6
Hour 11 12 13 14 15 16 17 18 19 20
Number of Defects 12 3 4 5 6 8 11 8 7 9
Solution: The center line CL is the mean number of defects per square meter of woolen fabric, given by
CL = c̄ = Σ(Number of Defects)/20 = 151/20 = 7.55
The upper and lower control limits are given by
UCL = c̄ + 3√c̄ = 15.7932
LCL = c̄ − 3√c̄ = −0.69
X = [11 14 10 8 3 9 10 2 5 6 12 3 4 5 6 8 11 8 7 9];   % number of defects
n = 20;                   % size of inspected units (every 20 hours): sample size
[stats, plotdata] = controlchart(X, 'chart', 'c', 'unit', n);   % plot c-chart
fprintf('Control limits for c-chart: UCL=%g, CL=%g, LCL=%g\n', ...
        plotdata.ucl(1), plotdata.cl(1), plotdata.lcl(1));
Since a negative number of defects cannot be observed, the LCL value is set to 0. The c-chart, shown in Figure 5.23, indicates that the process is in control.
Example 5.22 Samples of fabric from a textile mill, each 100 m², are selected, and the number of occurrences of foreign matter is recorded. Data for 25 samples are shown in Table 5.15. Construct and plot a c-chart for the number of nonconformities.
Solution: The center line CL is the mean number of nonconformities,
CL = c̄ = Σ(Number of Nonconformities)/25 = 189/25 = 7.56
FIGURE 5.23: c-chart for woolen fabric data.
TABLE 5.15: Foreign Matter Data.
Sample Number 1 2 3 4 5 6 7 8 9 10 11 12 13
Nonconformities 5 4 7 6 8 5 6 5 16 10 9 7 8
Sample Number 14 15 16 17 18 19 20 21 22 23 24 25
Nonconformities 11 9 5 7 6 10 8 9 9 7 5 7
The upper and lower control limits for the c-chart are given by
UCL = c̄ + 3√c̄ = 15.8086
LCL = c̄ − 3√c̄ = 0
The c-chart is displayed in Figure 5.24, which indicates that sample number 9 is out-of-control.
FIGURE 5.24: c-chart for foreign matter data.
Assuming special causes for the out-of-control point (sample 9 is deleted), the revised center line and control limits for the c-chart are
CL = c̄ = 7.20833
UCL = c̄ + 3√c̄ = 15.2628
LCL = c̄ − 3√c̄ = 0
The revised c-chart is given in Figure 5.25, which shows that all the remaining points fall within the control limits.
FIGURE 5.25: Revised c-chart for foreign matter data.
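The revised limits above can be verified with a few lines of base MATLAB; the sketch below recomputes the c-chart limits from the Table 5.15 counts before and after deleting the out-of-control sample 9.

X = [5 4 7 6 8 5 6 5 16 10 9 7 8 11 9 5 7 6 10 8 9 9 7 5 7];   % nonconformities (Table 5.15)
cbar = mean(X);                              % CL = 7.56
UCL  = cbar + 3*sqrt(cbar);                  % 15.8086
Xrev = X([1:8 10:end]);                      % delete out-of-control sample 9
cbarRev = mean(Xrev);                        % revised CL = 7.20833
UCLrev  = cbarRev + 3*sqrt(cbarRev);         % revised UCL = 15.2628
LCLrev  = max(0, cbarRev - 3*sqrt(cbarRev)); % revised LCL = 0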
5.7.4 CONTROL CHART FOR DEFECTS PER UNIT (u-CHART)
The u-chart monitors the average number of defects per unit. Like the c-chart, the control limits of the u-chart are computed based on the Poisson distribution. Suppose m preliminary samples are available, each of size n. In this chart we plot the rate of defects U_i = X_i/n, which is the number of defects X_i per sample divided by the number n (sample size) of units inspected. Thus, an estimator of the average number of defects per unit over the m samples is given by
ū = (1/m) Σ_{i=1}^{m} U_i
Unlike the c-chart, the u-chart does not require a constant number of units, and it can be used, for example, when the
samples are of different sizes.
Control chart for defects per unit (u-chart):
The upper control limit, center line, and lower control limit of the u-chart are given by
UCL = ū + 3√(ū/n)
CL = ū
LCL = ū − 3√(ū/n)    (5.7.4)
Example 5.23 Using the data of Table 5.14, construct and plot the u-chart.
Solution: The upper control limit, center line, and lower control limit of the u-chart are given by
UCL = ū + 3√(ū/n) = 0.3775 + 3√(0.3775/20) = 0.7897
CL = ū = 0.3775
LCL = ū − 3√(ū/n) = 0.3775 − 3√(0.3775/20) = −0.0347
Since a negative number of defects per unit cannot be observed, the LCL value is set to 0. The u-chart, shown in Figure 5.26, indicates that the process is in control.
X = [11 14 10 8 3 9 10 2 5 6 12 3 4 5 6 8 11 8 7 9];   % number of defects
n = 20;                   % size of inspected units (every 20 hours): sample size
[stats, plotdata] = controlchart(X, 'chart', 'u', 'unit', n);   % plot u-chart
fprintf('Control limits for u-chart: UCL=%g, CL=%g, LCL=%g\n', ...
        plotdata.ucl(1), plotdata.cl(1), plotdata.lcl(1));
FIGURE 5.26: u-chart for woolen fabric data.
5.8 ACCEPTANCE SAMPLING
Acceptance sampling is a methodology commonly used in quality control and improvement to determine whether to accept or reject a particular lot or batch of products before it is shipped to customers. This is done by devising a sampling plan that sets the product acceptability criteria. A 100% inspection does not guarantee 100% compliance and is too time consuming and costly. Rather than evaluating all items, a specified sample is taken, inspected or tested, and a decision is made about accepting or rejecting the entire production lot. There are two major classifications of acceptance plans: by attributes and by variables. When the decision to accept or reject a lot is based on classification of the items as either defective (nonconforming) or nondefective (conforming), the sampling plan is called inspection by attributes. The lot is accepted if no more than an allowable number of defective items are found. The attribute case is the most common for acceptance sampling, and will be assumed for the rest of this section. A sampling plan based on one sample is known as a single sampling plan, while sampling plans based on two or more successively drawn samples are known as double or multiple sampling plans. Selection of an acceptance sampling plan will depend on the nature of the inspection test and the data produced.
An acceptance sampling plan works in the following way. A fixed number n of items is sampled from each lot of size N, carefully inspected, and each item is judged to be either defective or nondefective. If the number d of defectives in the sample is less than or equal to a prespecified acceptance number c, the lot is accepted. Otherwise, the lot is rejected.
Thus, to design a sampling plan we need to know how many items n to sample and how many defective items c in that sample are enough to convince us that the lot is unacceptable.
In quality control, there is some relationship between the lot size N and the sample size n because the probability distribution for the number d of defectives in a sample of n items from a lot will depend on the lot size N. For example, a good sampling plan will provide for effective decisions with a sample of 10% or less of the lot size. If N is large and n is small relative to N, then the probability distribution of the number d of defectives follows a binomial distribution
P(d = k) = C(n, k) p^k (1 − p)^{n−k},   k = 0, 1, ..., n
where C(n, k) = n!/(k!(n−k)!) is the binomial coefficient and p denotes the lot fraction defective (true proportion nonconforming).
Thus, for a sampling plan with sample size n and acceptance number c, the probability of accepting a lot with lot fraction defective p is given by
P_a(p) = P(Accept lot) = P(d ≤ c) = Σ_{k=0}^{c} P(d = k) = Σ_{k=0}^{c} C(n, k) p^k (1 − p)^{n−k}    (5.8.1)
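Equation (5.8.1) is simply the binomial cdf evaluated at the acceptance number, so it can be computed with binocdf (equivalent to the cdf('bino', ...) calls used later in this section); the parameter values below are assumptions for illustration.

n = 10;  c = 1;             % assumed sample size and acceptance number
p = 0.05;                   % assumed lot fraction defective
Pa = binocdf(c, n, p);      % probability of lot acceptance P(d <= c), d ~ bino(n, p)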
5.8.1 SAMPLING PLAN CRITERIA
In order to design a sampling plan, four values must be known; they are typically determined from past experience, engineering estimates, and/or management decisions:
1. Acceptable Quality Level (AQL). This is the maximum fraction defective, p_1, considered acceptable off the producer's line, and it is generally on the order of 1-2%. That is, P_a(p_1) = P_a(AQL) should be large, typically near 0.95.
2. Producer's Risk, α. This is the probability of rejecting a lot that is within the acceptable quality level (AQL). That is, the producer's risk is the probability of rejecting H_0: p = p_1 when H_0 is true. Thus, α = 1 − P_a(AQL) is the probability of a Type I error. That is, the producer's risk is the probability that a lot containing an acceptable quality level is rejected.
3. Lot Tolerance Percent Defective (LTPD). This is the largest lot fraction defective, p_2, that can be tolerated by a consumer. The LTPD has a low probability of acceptance.
4. Consumer's Risk, β. This is the probability that a bad lot, containing a greater number of defects than the LTPD limit, will be accepted. Thus, β = P_a(LTPD) is the probability of making a Type II error.
Example 5.24 A manufacturer of USB drives ships a particular USB drive in lots of 500 each. The acceptance sampling plan used prior to shipment is based on a sample size n = 10 and acceptance number c = 1.
1. Find the producer's risk if AQL = 0.05.
2. Find the consumer's risk if LTPD = 0.2.
Solution: The lot size N = 500 is much larger than the sample size n = 10. Thus, the probability of lot acceptance is given by
P_a(p) = Σ_{k=0}^{1} C(10, k) p^k (1 − p)^{10−k} = C(10, 0)(1 − p)^{10} + C(10, 1) p(1 − p)^9 = (1 − p)^{10} + 10p(1 − p)^9
1. For p = 0.05, we have P_a(0.05) = 0.914. Thus, the producer's risk is α = 1 − P_a(AQL) = 1 − P_a(0.05) = 1 − 0.914 = 0.086. That is, the producer will reject 8.6% of the lots, even if the lot fraction defective is as small as 0.05.
2. The consumer's risk is β = P_a(LTPD) = P_a(0.2) = 0.376.
c = 1;            % acceptance number
n = 10;           % sample size
AQL = 0.05;       % acceptable quality level
alpha = 1 - cdf('bino', c, n, AQL);   % producer's risk
fprintf('Producer Risk = %.3f\n', alpha)
LTPD = 0.2;       % lot tolerance percent defective
beta = cdf('bino', c, n, LTPD);       % consumer's risk
fprintf('Consumer Risk = %.3f\n', beta)
5.8.2 OPERATING CHARACTERISTIC (OC) CURVES
The operating characteristic (OC) curve is an excellent graphical tool for evaluating quality sampling plans. The OC curve depicts the plot of the probability of lot acceptance P_a versus the lot fraction defective p. The OC curve for Example 5.24 is shown in Figure 5.27. Note that as the lot fraction defective increases, the probability of lot acceptance decreases until it reaches 0.
c = 1;                       % acceptance number
n = 10;                      % sample size
p = 0:0.01:1;                % lot fraction defective
Pa = cdf('bino', c, n, p);   % probability of lot acceptance
plot(p, Pa);                 % plot OC curve
FIGURE 5.27: OC curve for Example 5.24.
In general, the steeper the OC curve, the better the protection for both consumer and producer. In fact, the ideal OC curve would be a vertical line (parallel to the y-axis) at the AQL value, as shown in Figure 5.28.
When the sample size is increased, the OC curve becomes steeper, as shown in Figure 5.29(a). In the same vein, when the acceptance number is decreased, the curve gets steeper. Moreover, changing the acceptance number alone does not change the OC curve as significantly, as shown in Figure 5.29(b).
5.8.3 AVERAGE OUTGOING QUALITY
The average outgoing quality (AOQ) is the expected average quality of outgoing products for a given value of incoming product quality. The AOQ curve can be used to evaluate a sampling plan by showing the average quality accepted by the consumer for a given fraction defective.
AOQ = (1 − n/N) p P_a    (5.8.2)
where N is the lot size and n is the sample size. Note that when N ≫ n, then AOQ ≈ p P_a.
FIGURE 5.28: Ideal OC curve (100% inspection).
FIGURE 5.29: (a) OC curves for different sample sizes (n = 25, c = 1; n = 50, c = 2; n = 100, c = 4); (b) OC curves for the same sample size n = 10 with different acceptance numbers (c = 0, 1, 2).
The AOQ curve for Example 5.24 is shown in Figure 5.30.
c = 1;                        % acceptance number
n = 10;                       % sample size
N = 500;                      % lot size
p = 0:0.01:1;                 % lot fraction defective
Pa = cdf('bino', c, n, p);    % probability of lot acceptance
AOQ = (1 - n/N) * p .* Pa;    % average outgoing quality
plot(p, AOQ);                 % plot AOQ curve
The AOQ curve initially increases: as more defectives are produced, more are released. As more and more lots are rejected, 100% inspection becomes more common, and the AOQ curve starts to decrease as a result. The average outgoing quality limit (AOQL) is simply the maximum value on the AOQ curve, and it represents the maximum possible fraction defective for the sampling plan.
FIGURE 5.30: AOQ curve for Example 5.24.
5.8.4 AVERAGE TOTAL INSPECTION
In practice, a rejected lot usually means that the lot is to follow a particular routine: to be screened, repaired, corrected, or even rejected, or perhaps accepted after argument or waiver of specifications. In screening, the rest of the lot is 100% inspected. Defective items are supposedly removed, and what is left of the lot is supposedly perfect. But the question that arises is: what is the total amount of inspection when rejected lots are screened? If all lots contain zero defectives, no lot will be rejected. If all items are defective, all lots will be inspected, and the amount to be inspected is N. Finally, if the lot quality is 0 < p < 1, the average amount of inspection per lot will vary between the sample size n and the lot size N.
The average total inspection (ATI) is the average total number of items inspected per lot of size N. Let the quality of the lot be p and the probability of lot acceptance be P_a; then the ATI per lot is
ATI = n + (1 − P_a)(N − n)    (5.8.3)
The ATI curve for Example 5.24 is shown in Figure 5.31.
c = 1;                        % acceptance number
n = 10;                       % sample size
N = 500;                      % lot size
p = 0:0.01:1;                 % lot fraction defective
Pa = cdf('bino', c, n, p);    % probability of lot acceptance
ATI = n + (N - n)*(1 - Pa);   % average total inspection
plot(p, ATI);                 % plot ATI curve
5.9 MULTIVARIATE CONTROL CHARTS
Many manufacturing and service businesses use univariate statistical control charts to monitor the performance of their processes [1]. However, in most processes there is more than one measurement process to monitor [2], and it is increasingly difficult to determine the root cause of defects if multiple process variables exhibit faults or process deviations at the same moment in time. Moreover, most processes are highly correlated, particularly for assembly operations and chemical processes [2, 3]. Univariate control charts not only lead to frequent adjustments of the process but also do not account for the correlation information between the measurement processes [3]. Multivariate quality control methods overcome these limitations by monitoring the interactions of several process variables simultaneously and also by determining hidden factors using dimensionality-reduction techniques [2]. The use of multivariate statistical process control is also facilitated by the proliferation of sensor data that is typically complex, high-dimensional, and
generally correlated. Multivariate charts are used to detect shifts in the mean or in the relationship (covariance) between several related parameters.

FIGURE 5.31: ATI curve for Example 5.24.
In recent years, several multivariate statistical process control techniques have been proposed to analyze and monitor multivariate data [5, 4, 6]. With multivariate quality control charts, it is possible to have well-defined control limits while taking into consideration the cross-correlation between the variables. In addition, these multivariate charts may be used to analyze the stability of the processes without the complication of simultaneously monitoring several univariate control charts [2].
Mapping a multivariate situation as a univariate one may lead to results where processes might seem to be in control when in fact they are not, and vice-versa, as illustrated in Fig. 5.32, which depicts the result of modelling two highly-correlated variables as independent. The ellipse defines a region where the process is operating under normal operating conditions. Any observation falling outside the ellipse is identified as a fault. If the variables were, however, modeled as independent, then the control region would be defined by the rectangle. As can be seen in Fig. 5.32, some out-of-control observations would be misidentified, indicating that the correlation structure between the variables should be taken into account in order to accurately characterize the behavior of multivariate industrial environments [2].
FIGURE 5.32: Motivation behind using multivariate control charts.
5.9.1 χ² AND HOTELLING T² CONTROL CHART
In many industrial applications, the output of a process is characterized by p variables that are measured simultaneously. Independent variables can be charted individually, but if the variables are correlated, a multivariate chart is needed to determine whether the process is in control. Generally, the univariate process variables make up a random vector x = (X_1, X_2, ..., X_p)'. That is, the process has p quality characteristics X_1, X_2, ..., X_p that we are interested in monitoring. Suppose the random vector x follows a multivariate normal distribution N(μ, Σ) with mean vector μ = (μ_1, μ_2, ..., μ_p)' and covariance matrix Σ. The statistic given by
χ² = n (x̄ − μ)' Σ⁻¹ (x̄ − μ)    (5.9.1)
follows a χ²(p) distribution with p degrees of freedom, where x̄ is the vector of sample means of the p quality characteristics from a sample of size n, μ is the vector of in-control means for each quality characteristic, and Σ⁻¹ is the inverse of the covariance matrix. The upper control limit of the χ² control chart is given by UCL = χ²_{α,p}, where α is a given significance level.
Since μ and Σ are unknown in practice, we usually estimate them on the basis of preliminary samples (subgroups) taken when the process is thought to be in control. Suppose m preliminary samples are available, each of size n. The Hotelling T² control chart is the most common monitoring technique for multivariate data, and it can be thought of as the multivariate counterpart of the Shewhart x̄-chart. The T² statistic is given by
T² = n (x̄ − x̿)' S̄⁻¹ (x̄ − x̿)    (5.9.2)
where x̄ is the vector of sample means, x̿ is the estimated vector of in-control means, and S̄ is the estimated covariance matrix of the quality characteristics when the process is in control.
To find x̿ and S̄, consider the vector of sample means
x̄_k = (x̄_{1k}, x̄_{2k}, ..., x̄_{pk})',   k = 1, 2, ..., m    (5.9.3)
where
x̄_{jk} = (1/n) Σ_{i=1}^{n} x_{ijk},   j = 1, 2, ..., p;  k = 1, 2, ..., m    (5.9.4)
is the sample mean of the jth quality characteristic for the kth sample, and x_{ijk} is the ith observation on the jth quality characteristic in the kth sample.
The sample variance of the jth quality characteristic in the kth sample is given by
s²_{jk} = (1/(n−1)) Σ_{i=1}^{n} (x_{ijk} − x̄_{jk})²,   j = 1, 2, ..., p;  k = 1, 2, ..., m    (5.9.5)
The sample covariance between the jth and hth quality characteristics in the kth sample is given by
s_{jhk} = (1/(n−1)) Σ_{i=1}^{n} (x_{ijk} − x̄_{jk})(x_{ihk} − x̄_{hk}),   k = 1, 2, ..., m;  j ≠ h    (5.9.6)
The target mean of each quality characteristic over the m samples is given by
x̿_j = (1/m) Σ_{k=1}^{m} x̄_{jk},   j = 1, 2, ..., p    (5.9.7)
and the averaged sample variance and covariance are given by
s̄²_j = (1/m) Σ_{k=1}^{m} s²_{jk},   j = 1, 2, ..., p    (5.9.8)
and
s̄_{jh} = (1/m) Σ_{k=1}^{m} s_{jhk},   j ≠ h    (5.9.9)
Thus, μ and Σ can be estimated by the grand mean vector x̿ and the average covariance matrix C, respectively, as follows:
x̿ = (x̿_1, x̿_2, ..., x̿_p)'   and   C = [ s̄²_1  s̄_12  ⋯  s̄_1p ;  s̄_12  s̄²_2  ⋯  s̄_2p ;  ⋮  ;  s̄_1p  s̄_2p  ⋯  s̄²_p ]    (5.9.10)
The Hotelling T² statistic for the kth subgroup is given by
T²_k = n (x̄_k − x̿)' C⁻¹ (x̄_k − x̿),   k = 1, 2, ..., m    (5.9.11)
For phase I, the upper control limit is
UCL = [p(m − 1)(n − 1) / (mn − m − p + 1)] F_{α; p, mn−m−p+1}    (5.9.12)
For phase II, the upper control limit is
UCL = [p(m + 1)(n − 1) / (mn − m − p + 1)] F_{α; p, mn−m−p+1}    (5.9.13)
Example 5.25 Table 5.16 shows bivariate data for two quality characteristics X_1 and X_2 for 25 samples, each of size 5. Plot the Hotelling T² control chart. Assume α = 0.05.
Solution: We have m = 25, n = 5, and p = 2. The summary statistics are shown in Table 5.17.
For phase I, the upper control limit is
UCL = [p(m − 1)(n − 1) / (mn − m − p + 1)] F_{α; p, mn−m−p+1} = 1.9394 F_{0.05; 2, 99} = (1.9394)(3.0882) = 5.9893
where F_{0.05; 2, 99} can be computed using MATLAB as follows: icdf('F', 1-0.05, 2, 99). The Hotelling T² control chart is shown in Figure 5.33, where samples 15 and 18 appear to be out-of-control. However, all the samples in the x̄-charts for each quality characteristic appear to be in-control, as shown in Figure 5.34. Thus, the process is out-of-control.
5.10 PRINCIPAL COMPONENTS ANALYSIS
Principal components analysis (PCA) is an exploratory technique to learn about data sets. The objective of PCA is to reduce the dimensionality of the data set while retaining as much as possible of the variation in the data set. Principal components (PCs) are linear transformations of the original set of variables, and are uncorrelated and ordered so that the first few components carry most of the variation in the original data set. The first PC has the geometric interpretation that it is a new coordinate axis that maximizes the variation of the projections of the data points on the new coordinate axis. The general idea of PCA is as follows: if we have a set of moderately or strongly correlated variables (i.e. the variables share much common information), it may then be possible to construct new variables that are combinations of these variables and that account for much of the original information contained in the data. The output of PCA consists of the coefficients that define the linear combinations used to obtain the new variables (PC loadings) and the new variables (PCs) themselves. Examining the PC loadings and plotting the PCs can aid in data interpretation, particularly with higher dimensional data.
Sample Number k    Quality Characteristic X_1 (five observations)    Quality Characteristic X_2 (five observations)
1 65 69 79 66 62 33 41 36 36 42
2 64 71 72 73 72 35 37 37 37 39
3 78 75 73 59 60 40 34 39 32 38
4 63 66 74 69 76 34 34 39 34 38
5 87 81 71 62 67 37 33 39 36 38
6 76 72 80 67 75 34 38 33 36 38
7 64 63 60 75 61 34 38 32 34 37
8 66 65 68 85 75 34 35 41 33 42
9 73 81 78 65 67 39 39 39 39 36
10 75 64 69 73 68 37 34 38 37 34
11 78 61 74 65 67 38 41 42 41 36
12 72 78 56 70 74 35 31 43 39 41
13 61 72 72 91 83 37 36 35 34 39
14 70 87 78 70 76 40 39 37 30 41
15 57 67 51 64 69 33 37 38 37 36
16 74 74 67 77 63 35 37 37 36 37
17 74 69 59 76 70 40 35 38 39 32
18 70 72 61 57 61 42 43 38 40 40
19 62 56 74 64 68 34 36 40 36 37
20 79 71 78 65 77 38 39 29 37 37
21 77 75 67 67 68 33 41 37 35 40
22 63 69 65 70 79 39 36 36 41 36
23 58 66 83 65 65 37 41 36 37 38
24 77 69 68 72 73 34 38 37 35 35
25 74 52 72 73 72 39 33 39 36 35
TABLE 5.16: Bivariate data for two quality characteristics.
FIGURE 5.33: T² control chart for bivariate data.
Let x_1, x_2, ..., x_n be n observation vectors on a random vector x = (X_1, X_2, ..., X_p)' with p quality characteristics, where
x_i = (x_{i1}, x_{i2}, ..., x_{ip})'
is a p-dimensional column vector containing the values of the p quality characteristics for the ith observation.
Sample Number k    x̄_1k    x̄_2k    s²_1k    s²_2k    s_12k    T²_k
1 68.20 37.60 42.70 14.30 -5.90 0.55
2 70.40 37.00 13.30 2.00 4.00 0.02
3 69.00 36.60 78.50 11.80 14.50 0.17
4 69.60 35.80 29.30 6.20 11.90 0.88
5 73.60 36.60 104.80 5.30 -7.45 1.43
6 74.00 35.80 23.50 5.20 -6.00 2.54
7 64.60 35.00 36.30 6.00 -2.00 5.24
8 71.80 37.00 69.70 17.50 -5.25 0.35
9 72.80 38.40 47.20 1.80 4.35 2.12
10 69.80 36.00 18.70 3.50 5.50 0.60
11 69.00 39.60 47.50 6.30 -3.00 4.52
12 70.00 37.80 70.00 23.20 -30.00 0.44
13 75.80 36.20 132.70 3.70 -4.95 3.81
14 76.20 37.40 49.20 19.30 11.40 4.02
15 61.60 36.20 55.80 3.70 0.10 7.19
16 71.00 36.40 33.50 0.80 -3.00 0.32
17 69.60 36.80 43.30 10.70 3.65 0.03
18 64.20 40.60 41.70 3.80 10.35 11.71
19 64.80 36.60 45.20 4.80 11.40 2.66
20 74.00 36.00 35.00 16.00 -8.25 2.26
21 70.80 37.20 23.20 11.20 -2.20 0.11
22 69.20 37.60 38.20 5.30 -3.65 0.31
23 67.40 37.80 86.30 3.70 -5.90 1.07
24 71.80 35.80 12.70 2.70 -5.30 1.23
25 68.60 36.40 86.80 6.80 18.20 0.37
TABLE 5.17: Summary Statistics for the bivariate data.
All n observation vectors x_1, x_2, ..., x_n on the p process variables can be transposed to row vectors and placed in a matrix X of dimension n × p, called the data matrix or data set, as follows:
X = [ x'_1 ;  x'_2 ;  ⋮ ;  x'_i ;  ⋮ ;  x'_n ] = [ x_11  x_12  ⋯  x_1j  ⋯  x_1p ;  x_21  x_22  ⋯  x_2j  ⋯  x_2p ;  ⋮ ;  x_i1  x_i2  ⋯  x_ij  ⋯  x_ip ;  ⋮ ;  x_n1  x_n2  ⋯  x_nj  ⋯  x_np ]    (5.10.1)
An individual column of X corresponds to the data collected on a particular variable, while an individual row (observation) refers to the data collected on a particular individual or object for all variables. The value of the jth variable for the ith observation x'_i = (x_{i1}, ..., x_{ij}, ..., x_{ip}) is x_{ij}, which is the element in the ith row and jth column of X.
The sample mean vector x̄ is a p-dimensional vector given by
x̄ = (1/n) Σ_{i=1}^{n} x_i = (1/n) X'1 = (x̄_1, x̄_2, ..., x̄_j, ..., x̄_p)'    (5.10.2)
where 1 = (1, 1, ..., 1)' is an n-dimensional column vector of all 1s, X' denotes the transpose of X, and x̄_j is the mean of the jth variable. That is, x̄_j is the mean of the jth column of the data matrix X.
FIGURE 5.34: x̄-chart for each quality characteristic of the bivariate data: (a) x̄-chart for X_1; (b) x̄-chart for X_2.
The sample covariance matrix S = (s_{ij}) is a p × p symmetric matrix given by
S = [ s_11  s_12  ⋯  s_1p ;  s_12  s_22  ⋯  s_2p ;  ⋮ ;  s_1p  s_2p  ⋯  s_pp ]    (5.10.3)
where s_{ii} = s²_i is the sample variance of the ith variable,
s_{ii} = s²_i = (1/(n−1)) Σ_{k=1}^{n} (x_{ki} − x̄_i)²    (5.10.4)
and s_{ij} is the sample covariance of the ith and jth variables,
s_{ij} = (1/(n−1)) Σ_{k=1}^{n} (x_{ki} − x̄_i)(x_{kj} − x̄_j).    (5.10.5)
The sample covariance matrix S = (s_{ij}) can be written as
S = (1/(n−1)) X'HX    (5.10.6)
where H = I − J/n is called the centering matrix, and J = 11' is an n × n matrix of all 1s. Note that a centered data matrix Y is obtained by multiplying the centering matrix H by the data matrix X, that is, Y = HX.
The sample correlation matrix R = (r_{ij}) is a symmetric p × p matrix given by
R = [ 1  r_12  ⋯  r_1p ;  r_12  1  ⋯  r_2p ;  ⋮ ;  r_1p  r_2p  ⋯  1 ]    (5.10.7)
where r_{ij} is the sample correlation between the ith and jth variables,
r_{ij} = s_{ij} / √(s_{ii} s_{jj}) = s_{ij} / (s_i s_j).    (5.10.8)
Note that if D = diag(s_1, s_2, ..., s_p) is the p × p diagonal matrix of sample standard deviations, then the covariance and correlation matrices can be written as S = DRD and R = D⁻¹SD⁻¹, where D⁻¹ = diag(1/s_1, 1/s_2, ..., 1/s_p) is the inverse matrix of D.
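The identities S = DRD and R = D⁻¹SD⁻¹, as well as the centering-matrix expression (5.10.6), can be checked numerically; the sketch below uses a randomly generated data matrix purely as an example (corr requires the Statistics Toolbox).

X = randn(20, 3);                         % hypothetical 20 x 3 data matrix
n = size(X, 1);
S = cov(X);                               % sample covariance matrix
R = corr(X);                              % sample correlation matrix
D = diag(std(X));                         % diagonal matrix of sample standard deviations
err1 = max(max(abs(D*R*D - S)));          % S = D R D      (zero up to rounding)
err2 = max(max(abs(D\S/D - R)));          % R = D^-1 S D^-1 (zero up to rounding)
H = eye(n) - ones(n)/n;                   % centering matrix H = I - J/n
err3 = max(max(abs(X'*H*X/(n-1) - S)));   % reproduces (5.10.6)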
5.10.1 PCA ALGORITHM
Given a data matrix X, the PCA algorithm consists of four main steps:
Step 1: Compute the centered data matrix Y = HX by subtracting off the column means.
Step 2: Compute the p × p covariance matrix S of the centered data matrix as follows:
S = (1/(n−1)) Y'Y    (5.10.9)
Step 3: Compute the eigenvectors and eigenvalues of S using the eigen-decomposition
S = AΛA' = Σ_{j=1}^{p} λ_j a_j a'_j    (5.10.10)
where
A = (a_1, a_2, ..., a_p) is a p × p orthogonal matrix (A'A = I) whose columns a_j = (a_{j1}, a_{j2}, ..., a_{jp})' are the eigenvectors of S, normalized so that a'_j a_j = 1, j = 1, ..., p. Each eigenvector tells us a direction, i.e. a linear combination of the existing variables; the first eigenvector a_1 defines the direction with the most variation in the data matrix. The eigenvectors are the principal component (PC) coefficients, also known as loadings. That is, each eigenvector a_j contains the coefficients of the jth PC. These eigenvectors are in order of decreasing component variance.
Λ = diag(λ_1, λ_2, ..., λ_p) is a p × p diagonal matrix whose elements are the eigenvalues of S arranged in decreasing order, i.e. λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p. The eigenvalues tell us about the amount of variation in the data matrix, that is, the variance in a particular direction; they measure the importance of the PCs.
Step 4: Compute the transformed data matrix Z = YA of size n × p,
Z = [ z'_1 ;  z'_2 ;  ⋮ ;  z'_i ;  ⋮ ;  z'_n ] = [ z_11  z_12  ⋯  z_1j  ⋯  z_1p ;  z_21  z_22  ⋯  z_2j  ⋯  z_2p ;  ⋮ ;  z_i1  z_i2  ⋯  z_ij  ⋯  z_ip ;  ⋮ ;  z_n1  z_n2  ⋯  z_nj  ⋯  z_np ]    (5.10.11)
which contains the coordinates of the original data in the new coordinate system defined by the PCs. The rows of Z correspond to the observations z_i = A'(x_i − x̄), while its columns correspond to the PC scores. The eigenvalues are the variances of the columns of Z (the PC scores). Thus, the first PC score accounts for as much of the variability in the data as possible, and each succeeding component score accounts for as much of the remaining variability as possible.
To apply PCA to the data matrix X, we use the MATLAB function princomp as follows:
>> [A,Z,lambda,Tsquare] = princomp(X)
where A is the eigenvector matrix, Z is the transformed data matrix, lambda is a p-dimensional vector of eigenvalues, i.e. Λ = diag(lambda) = diag(λ_1, λ_2, ..., λ_p), and Tsquare is Hotelling's T², a statistical measure of the multivariate distance of each observation from the center of the data set. Tsquare = (T²_1, T²_2, ..., T²_n) is an n-dimensional vector (one value per observation) whose values are given by
T²_i = (x_i − x̄)' S⁻¹ (x_i − x̄)    (5.10.12)
Using the fact that S⁻¹ = AΛ⁻¹A' and z_i = A'(x_i − x̄), it follows that
T²_i = (x_i − x̄)' AΛ⁻¹A' (x_i − x̄) = z'_i Λ⁻¹ z_i    (5.10.13)
Standardizing the data is often preferable when the variables are in different units or when the variance of the different columns is substantial. In this case, we use princomp(zscore(X)) instead of princomp(X). That is, PCA is performed on the correlation matrix instead of the covariance matrix.
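The four algorithm steps can also be carried out by hand with eig, which is useful for seeing what princomp does internally; the sketch below uses a random data matrix as a stand-in, and the resulting loadings agree with princomp up to the arbitrary signs of the eigenvectors.

X = randn(30, 4);                            % hypothetical 30 x 4 data matrix
n = size(X, 1);
Y = X - repmat(mean(X), n, 1);               % Step 1: centered data matrix Y = HX
S = (Y'*Y)/(n - 1);                          % Step 2: covariance matrix (5.10.9)
[A, L] = eig(S);                             % Step 3: eigen-decomposition S = A*Lambda*A'
[lambda, idx] = sort(diag(L), 'descend');    % eigenvalues in decreasing order
A = A(:, idx);                               % PC coefficients (loadings)
Z = Y*A;                                     % Step 4: PC scores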
Hotelling Control Chart:
The Hotelling control chart is a multivariate extension of the x̄-chart that does take the correlation into account. The plotted points on this chart are given by the Hotelling statistic T²_i for individual observations:
T²_i = (x_i − x̄)' S⁻¹ (x_i − x̄)
For phase I, the upper control limit is
UCL = ((n − 1)²/n) B_{α; p/2, (n−p−1)/2}    (5.10.14)
where B_{α; p/2, (n−p−1)/2} denotes the inverse cdf of the Beta distribution with parameters p/2 and (n−p−1)/2, evaluated at 1 − α, and α is the significance level (typically set to 0.05 or 0.01).
For phase II, the upper control limit is
UCL = [p(n + 1)(n − 1) / (n(n − p))] F_{α; p, n−p}    (5.10.15)
Example 5.26 The phosphorus content data set contains 18 observations, where each observation has 3 variables:
X1: Inorganic phosphorus
X2: Organic phosphorus
X3: Plant phosphorus
This data set studies the effect of organic and inorganic phosphorus in the soil in comparison with the phosphorus content of the
corn grown. Start by loading the data in phosphorus.mat:
>> load phosphorus.mat;
>> whos
Name Size Bytes Class Attributes
Description 7x72 1008 char
X 18x3 432 double
observations 18x2 72 char
variables 3x2 12 char
1. Display the box plot for the phosphorus data matrix.
2. Compute the mean vector x̄, covariance matrix S, and correlation matrix R.
3. Plot the second PC score vs. the first PC score.
4. Plot the Hotelling T² control chart.
Solution:
load phosphorus.mat;
[n, p] = size(X);    % size of data matrix
xbar = mean(X);      % mean vector
S = cov(X);          % covariance matrix
R = corr(X);         % correlation matrix
boxplot(X);
[A, Z, lambda, Tsquare] = princomp(X);   % perform PCA on data matrix using covariance
% PC2 score vs. PC1 score
scatter(Z(:,1), Z(:,2), 3, 'o', 'MarkerFaceColor', [.49 1 .63], 'LineWidth', 1);
FIGURE 5.35: Boxplot for the phosphorus content data.
1. The side-by-side box plots for the data are shown in Figure 5.35. We can see that the third column of the data (i.e. plant phosphorus) contains an outlier.
2. The mean vector x̄, covariance matrix S, and correlation matrix R are
x̄ = (11.94, 42.11, 81.28)'
S = [ 103.12  63.86  190.09 ;  63.86  185.63  130.38 ;  190.09  130.38  728.80 ]
R = [ 1.00  0.46  0.69 ;  0.46  1.00  0.35 ;  0.69  0.35  1.00 ]
Note that R can also be obtained from S using the MATLAB command corrcov as follows:
>> [R,sigma]=corrcov(S);
where sigma is the vector of sample standard deviations.
3. The plot of the second PC score vs. the first PC score is shown in Figure 5.36. The labels displayed in Figure 5.36(b) represent the observation numbers. Notice that sample number 17 is an outlier.
FIGURE 5.36: PC2 vs. PC1 score for the phosphorus content data.
4. The plot of the Hotelling T² control chart is displayed in Figure 5.37, which shows that sample numbers 6 and 17 appear to be out-of-control.
% Plot Hotelling's T^2 control chart
alpha = 0.05;
UCL = ((n-1)^2/n) * icdf('beta', 1-alpha, p/2, (n-p-1)/2);
plot(Tsquare, 'bo', 'MarkerFaceColor', [.49 1 .63], 'MarkerSize', 2);
FIGURE 5.37: Hotelling T² control chart for the phosphorus content data.
5.10.2 PCA THEORY
Let x = (X_1, X_2, . . . , X_p)' denote the original variables measured on the objects or individuals. In principal component
analysis, the p original variables are transformed into linear combinations of uncorrelated variables (PCs) Z_1, Z_2, . . . , Z_p
such that the jth PC Z_j is given by Z_j = a_j' x, where a_j = (a_1j, . . . , a_pj)' is the jth eigenvector of S. That is,

Z_j = a_1j X_1 + a_2j X_2 + · · · + a_pj X_p     (5.10.16)
Since S a_j = λ_j a_j, it follows that var(Z_j) = λ_j. Thus, the first PC Z_1 has the largest variance, while the last PC Z_p has the
smallest. Moreover, the PCs are uncorrelated, that is corr(Z_j, Z_k) = 0 for j ≠ k, which results from the orthogonality of
the eigenvectors: a_j' a_k = 0.
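These properties can be checked numerically with a short MATLAB sketch (not from the text), assuming X is an n-by-p data matrix in the workspace:

S = cov(X);
[V,Dg] = eig(S);                           % columns of V are the eigenvectors a_j
[lambda_s,idx] = sort(diag(Dg),'descend'); % eigenvalues in decreasing order
V = V(:,idx);                              % reorder the eigenvectors accordingly
Zs = (X - repmat(mean(X),size(X,1),1))*V;  % PC scores (centering does not change the variances)
[var(Zs)' lambda_s]                        % the two columns agree: var(Z_j) = lambda_j
corrcoef(Zs)                               % approximately the identity: the PCs are uncorrelated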
If x is standardized to have zero mean and unit variance for each X_i, then the PCs Z_j are correlated with the original
variables X_i such that

corr(Z_j, X_i) = a_ij √λ_j     (5.10.17)
Denote by z = (Z_1, Z_2, . . . , Z_p)' the p-dimensional vector of PCs. Then, the correlation matrix between the PCs and the
original variables is given by

C = corr(z, x) = A Λ^{1/2}     (5.10.18)

where A and Λ^{1/2} = diag(√λ_1, √λ_2, . . . , √λ_p) are obtained by performing PCA on the correlation matrix, that is by
applying PCA on the standardized data matrix using princomp(zscore(X)).
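Relations (5.10.17)-(5.10.18) can be verified numerically; the following sketch (not from the text, applied to the standardized matrix zscore(X)) compares the two ways of obtaining C:

Xs = zscore(X);                       % standardized data (zero mean, unit variance)
[Ac,Zc,lambdac] = princomp(Xs);       % PCA on the correlation matrix
C1 = Ac*diag(sqrt(lambdac));          % C = A*Lambda^(1/2), eq. (5.10.18)
C2 = corr(Xs,Zc);                     % direct correlations corr(X_i, Z_j)
max(abs(C1(:)-C2(:)))                 % should be essentially zero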
Example 5.27 Using the phosphorus content data
1. Determine the first and second PCs
2. Display the scatter plot of PC2 coefficients vs. PC1 coefficients, and label the points
3. Compute the correlation matrix between the PCs and the original variables
Solution
1. Using the columns of the eigenvector matrix

A = [ 0.27   0.16  0.95
      0.22   0.95  0.22
      0.94  −0.27  0.22 ],

the first PC is then given by

Z_1 = 0.27 X_1 + 0.22 X_2 + 0.94 X_3

and the second PC is given by

Z_2 = 0.16 X_1 + 0.95 X_2 − 0.27 X_3
2. The scatter plot of PC2 coefficients vs. PC1 coefficients is shown in Figure 5.38. This plot helps understand which
variables have a similar involvement within the PCs. As can be seen in Figure 5.38, the variables X_1 and X_2 are
located on the left of the plot, while the variable X_3 is located on the bottom right. This is consistent with the
values of the coefficients of PC1 and PC2.
scatter(A(:,1),A(:,2),3,'o','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
gname(variables);  %press the Enter or Escape key to stop labeling.
FIGURE 5.38: Coefficients of PC2 vs. PC1 for the phosphorus content data.
3. The correlation matrix between the PCs and the original variables is given by
[Ac,Zc,lambdac,Tsquarec] = princomp(zscore(X));  %PCA on standardized data
C = Ac*sqrt(diag(lambdac));                      %component correlation matrix
C = [ 0.90  0.19  0.40
      0.70  0.71  0.09
      0.85  0.39  0.35 ]
Note that all three variables are highly correlated with the first PC. We can also see that the organic phosphorus
(X_2) is highly correlated with PC2.
5.10.3 SCREE PLOT
The percentage of variance accounted for by the jth PC is called the explained variance, and it is given by

ℓ_j = ( λ_j / Σ_{i=1}^p λ_i ) × 100%,   j = 1, . . . , p.     (5.10.19)
After computing the eigenvectors, we sort them by their corresponding eigenvalues and then we pick the d principal
components with the largest eigenvalues. The other components will be discarded. The natural question that arises is:
How do we select the value of d? The proportion of variance retained by mapping down from p to d dimensions
can be found as the normalized sum of the d largest eigenvalues

( Σ_{j=1}^d λ_j / Σ_{i=1}^p λ_i ) × 100%.     (5.10.20)
In many applications, d is chosen such that a relatively high percentage, say 70-95%, of the explained variance is retained.
The remaining variance is assumed to be due to noise. The number of principal components d (d << p) can also be
determined using the scree graph, which is the plot of the explained variance against the component number k. The
optimal number k is usually selected as the one where the kink (elbow) in the curve appears.
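One possible way to automate this choice (a sketch under the assumption that lambda holds the eigenvalues returned by princomp, with a 90% threshold picked purely for illustration) is:

threshold = 90;                            % percent of variance to retain (assumption)
expvar = 100*lambda/sum(lambda);           % explained variance per PC, eq. (5.10.19)
cumvar = cumsum(expvar);                   % cumulative explained variance
d = find(cumvar >= threshold,1,'first')    % smallest d meeting the threshold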
Example 5.28 Using the phosphorus content data
1. Compute the explained variance
2. Plot the explained variance vs. the number of PCs.
3. What would be the lowest-dimensional space to represent the phosphorus content data?
Solution
expvar = 100*lambda/sum(lambda);  %percent of the total variability explained by each PC.
plot(expvar,'ko-','MarkerFaceColor',[.49 1 .63],'LineWidth',1);  %Scree plot
pareto(expvar);  %Pareto plot
1. The explained variances of the three PCs are ℓ_1 = 80.04%, ℓ_2 = 15.65%, and ℓ_3 = 4.31%. Notice that PC1 and
PC2 combined account for 95.69% of the variance in the data.
2. The scree and pareto plots of the explained variance vs. the number of PCs are shown in Figure 5.39.
(a) Scree plot (b) Pareto plot
FIGURE 5.39: Scree and Pareto plots for the phosphorus content data.
3. Based on the explained variance by both PC1 and PC2 and also from the scree and Pareto plots, it can be deduced
that the lowest-dimensional space to represent the phosphorus content data corresponds to d = 2.
5.10.4 BIPLOT
The plot of the principal component coefficients is a graphical display of the variables, while the plot of the principal
component scores is a graphical display of the observations. The biplot was originally proposed by Gabriel (1971) as a
graphical tool that allows information on both the observations and the variables of a data matrix to be displayed graphically,
hence the "bi" in the name. Observations are displayed as points, while variables are displayed as vectors. The
biplot helps visualize both the principal component coefficients for each variable and the principal component scores
for each observation in a single plot. Each of the p variables is represented in the biplot by a vector, and the direction
and length of the vector indicate how each variable contributes to the two principal components in the biplot, as
shown in Figure 5.40(a). The axes in the biplot represent the principal components (columns of the eigenvector matrix
A), and the observed variables (rows of A) are represented as vectors. A biplot allows us to visualize the magnitude
and sign of each variable's contribution to the first two or three principal components, and how each observation is
represented in terms of those components. Each of the n observations is represented in the biplot by a point, and their
locations indicate the score of each observation for the two principal components in the plot. For example, points near
the left edge of this plot have the lowest scores for the first principal component.
A 2D biplot displays the first two PCs, i.e. PC2 vs. PC1, while a 3D biplot displays the first 3 PCs, i.e., PC1, PC2, and
PC3, as shown in Figure 5.40(b). It is usually difficult to visualize a 3D biplot on a 2D plane, but rotating a 3D biplot can
be very useful when the first two PCs do not explain most of the variance in the data. The axes in the biplot represent
the PCs, and the observed variables are represented as vectors.
Example 5.29 Using the phosphorus content data
1. Display the 2D biplot of PC2 vs. PC1
2. Display the 3D biplot of PC1, PC2, and PC3
Solution
biplot(A(:,1:2),'Scores',Z(:,1:2),'VarLabels',variables);  %2D biplot
biplot(A(:,1:3),'Scores',Z(:,1:3),'VarLabels',variables);  %3D biplot
1. The 2D biplot of PC2 vs. PC1 is shown in Figure 5.40(a). The first principal component, represented in this biplot
by the horizontal axis, has positive coefficients for all 3 variables. That corresponds to the 3 vectors directed
into the right half of the plot. The second principal component, represented by the vertical axis, has 2 positive
coefficients for the variables X_1 and X_2, and 1 negative coefficient for the variable X_3. That corresponds to vectors
directed into the top and bottom halves of the plot, respectively. This indicates that this component distinguishes
between observations that have high values for the first set of variables and low for the second, and observations
that have the opposite. Each of the 18 observations (rows of scores) is represented in this plot by a point, and their
locations indicate the score of each observation for the two principal components in the plot. For example, points
near the left edge of this plot have the lowest scores for the first principal component. The angles between
the vectors representing the variables and the PCs indicate the contribution of the variable to the PCs. A narrow
angle indicates that the variable plays a major role in the PC. For example, plant phosphorus (X_3) is important in
the first PC, while organic phosphorus (X_2) is important in the second PC.
2. The 3D biplot of PC1, PC2, and PC3 is shown in Figure 5.40(b).
5.10.5 PCA CONTROL CHART
Recall that the mean and variance of the jth principal component Z_j are Z̄_j = 0 and var(Z_j) = λ_j. That is,
σ_{Z_j} = √λ_j.
(a) 2D biplot (b) 3D biplot
FIGURE 5.40: 2D and 3D biplots for the phosphorus content data.
PCA Control chart for the jth PC:
The upper control limit, center line, and lower control limit of the jth PC are given by

UCL = 3√λ_j
CL = 0
LCL = −3√λ_j     (5.10.21)
The control chart for the first PC of the phosphorus content data is depicted in Figure 5.41, which shows that
sample number 17 is out-of-control.
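A minimal sketch of such a chart (not the text's code; it assumes Z and lambda come from [A,Z,lambda,Tsquare] = princomp(X)) is:

j = 1;                                  % chart the first PC
UCL =  3*sqrt(lambda(j));
LCL = -3*sqrt(lambda(j));
plot(Z(:,j),'ko-','MarkerFaceColor',[.49 1 .63]); hold on;
plot(xlim,[UCL UCL],'r--');             % upper control limit
plot(xlim,[LCL LCL],'r--');             % lower control limit
plot(xlim,[0 0],'g-');                  % center line
xlabel('Sample Number'); ylabel(sprintf('Z_%d',j));
find(Z(:,j) > UCL | Z(:,j) < LCL)       % out-of-control samples
hold off;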
FIGURE 5.41: PC1 control chart for the phosphorus content data.
Example 5.30 The European Jobs data are the percentage employed in different industries in European countries during 1979.
The job categories are agriculture, mining, manufacturing, power supplies, construction, service industries, finance, social and
personal services, and transportation and communications. It is important to note that these data were collected during the Cold
War. The European Jobs data set contains 26 observations (countries), where each observation has 9 variables:
X_1: agriculture (Agr)
X_2: mining (Min)
X_3: manufacturing (Man)
X_4: power supply industries (PS)
X_5: construction (Con)
X_6: service industries (SI)
X_7: finance (Fin)
X_8: social and personal services (SPS)
X_9: transport and communications (TC)
To load the data set into the MATLAB workspace, type:
>> load europeanjobs.mat
>> whos
Name Size Bytes Class Attributes
X 26x9 1872 double
countries 26x14 728 char
description 15x96 2880 char
variables 9x3 54 char
1. Display the correlation matrix of the data.
2. Display the scatterplot matrix of the data.
3. Plot the second PC score vs. the rst PC score.
4. Determine the first and second PCs
5. Display the scatter plot of PC2 coefficients vs. PC1 coefficients, and label the points
6. Compute the explained variance, and plot it against the number of PCs. What would be the lowest-dimensional space to
represent the European jobs data?
7. Display the 2D biplot of PC2 vs. PC1. Then, display the 3D biplot of PC1, PC2, and PC3
8. Plot the Hotelling and first PC control charts. Identify the out-of-control points
Solution:
MATLAB code
>> load europeanjobs.mat
>> whos
Name Size Bytes Class Attributes
X 26x9 1872 double
countries 26x1 1970 cell
description 15x96 2880 char
variables 1x9 588 cell
>> [n,p]=size(X); %size of data matrix
>> R = corrcoef(X); %correlation matrix
>> plotmatrix(X);
>> boxplot(X);
>> [A,Z,lambda,Tsquare]=princomp(X); %perform PCA on data matrix using covariance
>> scatter(Z(:,1),Z(:,2),3,'o','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
1. The correlation matrix of the data is shown in Figure 5.42. From this correlation matrix plot we can see that the
percentage of people employed in agriculture is negatively correlated with virtually all of the other employment
areas, indicating a contrast between industrial and agricultural economies. We also see that the percentage of people
employed in manufacturing is positively correlated with the employment areas which are required to support
manufacturing, such as power supply, mining, construction, and transportation. Other interesting relationships between
these variables are also evident.
FIGURE 5.42: Correlation matrix of the European jobs data.
MATLAB code
>> R = corrcoef(X); %correlation matrix
>> imagesc(R);
>> set(gca,'XTick',1:p); set(gca,'YTick',1:p);
>> set(gca,'XTickLabel',variables); set(gca,'YTickLabel',variables);
>> axis([0 p+1 0 p+1]); grid; colorbar;
2. When interpreting correlations it is important to visualize the bivariate relationships between all pairs of variables.
This can be achieved by looking at a scatterplot matrix, which is shown in Figure 5.43.
3. The plot of the second PC score vs. the first PC score is shown in Figure 5.44. The labels displayed in Figure 5.44(b)
represent the names of the countries.
MATLAB code
>> scatter(Z(:,1),Z(:,2),15,'ko','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
>> xlabel('PC1 score','fontsize',14,'fontname','times');
>> ylabel('PC2 score','fontsize',14,'fontname','times');
>> gname(countries); %press Enter or Escape key to stop labeling.
4. Using the columns of the eigenvector matrix
FIGURE 5.43: Scatterplot matrix of the European jobs data.
(a) unlabeled plot (b) labeled plot
FIGURE 5.44: PC2 vs. PC1 score for the European jobs data.
A = [ 0.8918  0.0068  0.1185  0.0968  0.1800  0.1526  0.0916  0.0687  0.3354
      0.0019  0.0923  0.0794  0.0102  0.0011  0.4564  0.7665  0.2905  0.3240
      0.2713  0.7703  0.1847  0.0104  0.3360  0.2009  0.1620  0.0741  0.3375
      0.0084  0.0120  0.0068  0.0181  0.0025  0.2309  0.0629  0.9092  0.3399
      0.0496  0.0690  0.0773  0.0829  0.7243  0.5584  0.1943  0.0045  0.3253
      0.1918  0.2344  0.5796  0.6076  0.2659  0.0216  0.0879  0.1044  0.3367
      0.0311  0.1301  0.4700  0.7812  0.1211  0.0553  0.0800  0.1228  0.3344
      0.2980  0.5668  0.5977  0.0483  0.2359  0.2479  0.0045  0.0521  0.3324
      0.0454  0.0099  0.1594  0.0378  0.4349  0.5459  0.5675  0.2238  0.3342 ]
the first PC is then given by

Z_1 = 0.89 X_1 − 0.27 X_3 − 0.192 X_6 − 0.298 X_8
    = 0.89 Agr − 0.27 Man − 0.192 SI − 0.298 SPS
We can see that the first PC is essentially a contrast between agriculture and industrial/urban employment areas.
This is evidenced by the positive coefficient for agriculture and the negative coefficients for manufacturing,
service industries, and social and personal services.
The second PC is given by

Z_2 = 0.77 X_3 − 0.234 X_6 − 0.13 X_7 − 0.567 X_8
    = 0.77 Man − 0.234 SI − 0.13 Fin − 0.567 SPS

We can see that the second PC appears to be a contrast between manufacturing and non-industrial areas such as
service industries and finance. This is evidenced by the positive coefficient for manufacturing and the negative
coefficients for service industries, finance, and social and personal services.
5. The scatter plot of PC2 coefficients vs. PC1 coefficients is shown in Figure 5.45. This plot helps understand which
variables have a similar involvement within the PCs. As can be seen in Figure 5.45, the variable Agr is located on
the right of the plot, while the other variables are located on the left of the plot. This is consistent with the values
of the coefficients of PC1 and PC2.
MATLAB code
>> scatter(A(:,1),A(:,2),3,'o','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
>> gname(variables); %press Enter or Escape key to stop labeling.
FIGURE 5.45: Coefficients of PC2 vs. PC1 for the European jobs data.
6. The explained variances of the first three PCs are ℓ_1 = 81.58%, ℓ_2 = 11.75%, and ℓ_3 = 4.09%. Notice that PC1 and PC2
combined account for 93.33% of the variance in the data. The scree and Pareto plots of the explained variance
vs. the number of PCs are shown in Figure 5.46. Based on the explained variance by both PC1 and PC2 and also
from the scree and Pareto plots, it can be deduced that the lowest-dimensional space to represent the European
jobs data corresponds to d = 2.
7. The 2D biplot of PC2 vs. PC1 is shown in Figure 5.47(a). The axes in the biplot represent the principal components
(columns of A), and the observed variables (rows of A) are represented as vectors. Each observation (row of Z)
is represented as a point in the biplot. From Figure 5.47(a), we can see that the first principal component has 1
positive coefficient for the first variable Agr and 3 negative coefficients for the variables Man, SI, and SPS. That
corresponds to 1 vector directed into the right half of the plot, and 3 vectors directed into the left half of the
plot, respectively. The second principal component, represented by the vertical axis, has 1 positive coefficient
for the variable Man, and 3 negative coefficients for the variables SI, Fin, and SPS. That corresponds to vectors
directed into the top and bottom halves of the plot, respectively. This indicates that this component distinguishes
between observations that have high values for the first set of variables and low for the second, and observations
that have the opposite.
(a) Scree plot (b) Pareto plot
FIGURE 5.46: Scree and Pareto plots for the European jobs data.
MATLAB code
>> expvar=100*lambda/sum(lambda); %percent of the total variability explained by each PC.
>> plot(expvar,'ko-','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
>> figure;
>> pareto(expvar);
Each of the 26 countries is represented in this plot by a red point, and their locations
indicate the score of each observation for the two principal components in the plot. For example, points near the
left edge of this plot have the lowest scores for the first principal component. The variables are represented by
rays extending out from the plot origin. Rays that tend to point in the same direction represent variables that
are positively correlated. For example, we can see that the rays for manufacturing, mining, and power supply
all point in the same direction, indicating the positive correlation of these employment areas with one another.
The ray for agriculture is fairly isolated, indicating its weak positive and, more often, negative
correlation with the other employment areas. The cases are plotted in accordance with their scores on the first
three PCs. We can see that Turkey and Yugoslavia both extend far to the right in the direction of the ray for the
agriculture variable. This indicates that these countries have a larger percentage of the workforce employed in
agriculture in comparison to the other countries in this data set. We can also see that Norway has a relatively
large percentage of its workforce employed in the social and service areas. The 3D biplot of PC1, PC2, and PC3 is
shown in Figure 5.47(b).
MATLAB code
biplot(A(:,1:2),'Scores',Z(:,1:2),'VarLabels',variables)
figure; biplot(A(:,1:3),'Scores',Z(:,1:3),'VarLabels',variables)
8. The Hotelling and first PC charts are displayed in Figure 5.48. The Hotelling chart indicates that samples 7
(Luxembourg), 18 (Yugoslavia), and 26 (Turkey) are out-of-control. All the plotted points on the first PC chart are
within the control limits.
MATLAB code
>> alpha = 0.05;
>> [outliers, h] = tsquarechart(X,alpha); %T^2 chart
>> figure;
>> k=1;
>> [outliers, h] = pcachart(X,k); %1st PC control chart
(a) 2D biplot (b) 3D biplot
FIGURE 5.47: 2D and 3D biplots for the European jobs data.
FIGURE 5.48: Hotelling and PCA charts for the European jobs data.
5.11 PROBLEMS
The management of a bank has embarked on a program of statistical process control and has decided to use
variable control charts to study the waiting time of customers during the peak noon to 1 p.m. lunch hour to
detect special causes of variation. Four customers are selected during the one-hour period; the first customer to
enter the bank every 15 minutes. Each set of four measurements makes up a subgroup (sample). Table 5.18 lists
the waiting time (operationally dened as the time from when the customer enters the line until he or she begins
to be served by the teller) for 20 days.
a) Construct a table that shows the waiting time data with extra columns displaying the sample means and
sample ranges.
b) Estimate the process mean and standard deviation.
c) Construct the R- and the X̄-charts. Identify the out-of-control points using all Western Electric rules. If
necessary, revise your control limits, assuming that any samples that violate Western Electric rules can be
discarded.
A sample data set called parts.mat in the MATLAB Statistics Toolbox contains measurements on newly ma-
chined parts, taken at one hour intervals for 36 hours. Each row of the runout matrix contains the measurements
TABLE 5.18: Waiting Time for Customers at a Bank.
Sample    Data
Number    X_1    X_2    X_3    X_4
1 7.2 8.4 7.9 4.9
2 5.6 8.7 3.3 4.2
3 5.5 7.3 3.2 6.0
4 4.4 8.0 5.4 7.4
5 9.7 4.6 4.8 5.8
6 8.3 8.9 9.1 6.2
7 4.7 6.6 5.3 5.8
8 8.8 5.5 8.4 6.9
9 5.7 4.7 4.1 4.6
10 3.7 4.0 3.0 5.2
11 2.6 3.9 5.2 4.8
12 4.6 2.7 6.3 3.4
13 4.9 6.2 7.8 8.7
14 7.1 6.3 8.2 5.5
15 7.1 5.8 6.9 7.0
16 6.7 6.9 7.0 9.4
17 5.5 6.3 3.2 4.9
18 4.9 5.1 3.2 7.6
19 7.2 8.0 4.1 5.9
20 6.1 3.4 7.2 5.9
for 4 parts chosen at random. The values indicate, in thousandths of an inch, the amount the part radius differs
from the target radius. To load the data set into the MATLAB workspace, type:
>> load parts
>> whos
Name Size Bytes Class Attributes
runout 36x4 1152 double
(i) Construct the R- and the X̄-charts. Identify the out-of-control points. If necessary, revise your control limits,
assuming that any samples that plot outside the control limits can be discarded.
(ii) Assuming the process is in control, estimate the process mean and standard deviation.
(iii) Construct the s- and the X̄-charts. Identify the out-of-control points. If necessary, revise your control limits,
assuming that any samples that plot outside the control limits can be discarded.
Table 5.19 presents the weights, in ounces, for a sequence of 15 rational subgroup samples of potato chips, with
n = 4 for each sample. Assume that the specifications are 14 ± 1.37.
(i) Construct the R- and the X̄-charts. Is the process under statistical control? Explain.
(ii) Construct the s- and the X̄-charts. Is the process under statistical control? Explain.
(iii) Assuming that the package weights are normally distributed, calculate the process capability index and the
proportion of the product that will not meet specifications.
(iv) Comment on the ability of the process to produce items that meet specifications.
The diameter of holes is measured in consecutive order by an automatic sensor. The results of measuring 25 holes
are given in Table 5.20. Assume the target diameter is 10 millimeters.
(i) Estimate the process standard deviation.
(ii) Set up and apply a tabular cusum for this process, using standardized values h = 5 and k = 0.5. Does the
process appear to be operating in a state of statistical control at the desired target level?
(iii) Apply an EWMA control chart to these data using λ = 0.4 and L = 3. Interpret this chart.
TABLE 5.19: Potato chip Data.
Sample    Package Weights (oz)
Number    X_1    X_2    X_3    X_4
1 15.01 14.98 15.16 14.80
2 15.09 15.14 15.08 15.03
3 15.04 15.10 14.93 15.13
4 14.90 15.03 14.94 14.92
5 15.04 15.05 15.08 14.98
6 14.96 14.81 14.96 14.91
7 15.01 15.10 14.90 15.03
8 14.71 14.92 14.77 14.95
9 14.81 14.80 14.64 14.95
10 15.03 14.89 14.99 15.03
11 15.16 14.91 14.95 14.83
12 14.92 15.05 15.01 15.02
13 15.06 15.03 14.95 15.02
14 14.99 15.14 15.04 15.11
15 14.94 15.08 14.90 15.17
TABLE 5.20: Diameter measurements.
Sample Diameter Sample Diameter
1 9.94 14 9.99
2 9.93 15 10.12
3 10.09 16 9.81
4 9.98 17 9.73
5 10.11 18 10.14
6 9.99 19 9.96
7 10.11 20 10.06
8 9.84 21 10.11
9 9.82 22 9.95
10 10.38 23 9.92
11 9.99 24 10.09
12 10.41 25 9.85
13 10.36
The wafer dataset (wafer.mat) is based on one found in "Statistical case studies for industrial process improvement"
by Czitrom and Spagon. It consists of oxide layer thickness measured in 9 locations on each of 116 semiconductor
wafers. The measurements were taken by position on the wafer as shown in Figure 5.49. Note that the first location
is in the center, the next 4 are halfway out, and the last 4 are on the edge.
FIGURE 5.49: Layout of wafer data.
To load the data set into the MATLAB workspace, type:
>> load wafer.mat
>> whos
Name Size Bytes Class Attributes
150
X 116x9 8352 double
description 5x89 890 char
variables 9x2 36 char
a) Display the correlation matrix of the data.
b) Display the side-by-side boxplots of the data. Comment on the plots.
c) Plot the second PC score vs. the rst PC score. Comment on the plot.
d) Determine the first and second PCs.
e) Display the scatter plot of PC2 coefficients vs. PC1 coefficients, and label the points. Comment on the plot.
f) Compute the explained variance, and plot it against the number of PCs. What would be the lowest-
dimensional space to represent the wafer data?
g) Display the 2D biplot of PC2 vs. PC1. Then, display the 3D biplot of PC1, PC2, and PC3. Comment on the
plots.
h) Plot the Hotelling and rst PC control charts. Identify the out-of-control points.
5.12 REFERENCES
[1] D. C. Montgomery, Introduction to Statistical Quality Control, John Wiley & Sons, 6th edition, 2009.
[2] K. Yang and J. Trewn, Multivariate Statistical Process Control in Quality Management, Mc-Graw Hill Professional,
2004.
[3] K.H. Chen, D.S. Boning, and R.E. Welch, Multivariate statistical process control and signature analysis using
eigenfactor detection methods, Proc. Symposium on the Interface of Computer Science and Statistics, Costa Mesa, CA,
June 2001.
[4] N.D. Tracy, J.C. Young, and R.L. Mason, Multivariate quality control charts for individual observations, Journal
of Quality Technology, vol. 24, no. 22, pp. 88-95, 1992.
[5] J.A. Vargas, Robust estimation in multivariate control charts for individual observations, Journal of Quality Tech-
nology, vol. 35, no. 4, pp. 367-376, 2003.
[6] J.H. Sullivan and W.H. Woodall, A comparison of multivariate control charts for individual observations, Journal
of Quality Technology, vol. 28, no. 24, pp. 398-408, 1996.
[7] I.T. Jolliffe, Principal Component Analysis, New York: Springer, 1986.
[8] T.F. Cox, An Introduction to Multivariate Data Analysis, Hodder Arnold, 2005.