STAT 231 Course Notes W16 Print
2016 Edition
Contents
4. ESTIMATION
4.1 Statistical Models and Estimation
4.2 Estimators and Sampling Distributions
4.3 Interval Estimation Using the Likelihood Function
4.4 Confidence Intervals and Pivotal Quantities
4.5 The Chi-squared and t Distributions
4.6 Likelihood-Based Confidence Intervals
Preface
These notes are a work-in-progress with contributions from those students taking the
courses and the instructors teaching them. An original version of these notes was prepared
by Jerry Lawless. Additions and revisions were made by Don McLeish, Cyntha Struthers,
Jock MacKay, and others. Richard Cook supplied the example in Chapter 8. In order to
provide improved versions of the notes for students in subsequent terms, please email lists of
errors, sections that are confusing, or additional remarks/suggestions to your instructor
or [email protected].
Specific topics in these notes also have associated video files or PowerPoint shows that
can be accessed at www.watstat.ca. Where possible we reference these videos in the text.
1. INTRODUCTION TO STATISTICAL SCIENCES
Statistical Sciences are concerned with all aspects of empirical studies including problem
formulation, planning of an experiment, data collection, analysis of the data, and the con-
clusions that can be made. An empirical study is one in which we learn by observation or
experiment. A key feature of such studies is that there is usually uncertainty in the conclu-
sions. An important task in empirical studies is to quantify this uncertainty. In disciplines
such as insurance or finance, decisions must be made about what premium to charge for
an insurance policy or whether to buy or sell a stock, on the basis of available data. The
uncertainty as to whether a policy holder will have a claim over the next year, or whether
the price of a stock will rise or fall, is the basis of …nancial risk for the insurer and the
investor. In medical research, decisions must be made about the safety and e¢ cacy of new
treatments for diseases such as cancer and HIV.
Empirical studies deal with populations and processes; both of which are collections
of individual units. In order to increase our knowledge about a process, we examine a
sample of units generated by the process. To study a population of units we examine
a sample of units carefully selected from that population. Two challenges arise since we
only see a sample from the process or population and not all of the units are the same.
For example, scientists at a pharmaceutical company may conduct a study to assess the
effect of a new drug for controlling hypertension (high blood pressure) because they do not
know how the drug will perform on different types of people, what its side effects will be,
and so on. For cost and ethical reasons, they can involve only a relatively small sample
of subjects in the study. Variability in human populations is ever-present; people have
varying degrees of hypertension, they react differently to the drug, they have different side
effects. One might similarly want to study variability in currency or stock values, variability
in sales for a company over time, or variability in the number of hits and response times
for a commercial web site. Statistical Sciences deal both with the study of variability
in processes and populations, and with good (that is, informative, cost-effective) ways to
collect and analyze data about such processes.
We can have various objectives when we collect and analyze data on a population or
process. In addition to furthering knowledge, these objectives may include decision-making
and the improvement of processes or systems. Many problems involve a combination of
objectives. For example, government scientists collect data on fish stocks in order to further
scientific knowledge and also to provide information to policy makers who must set quotas
or limits on commercial fishing.
Statistical data analysis occurs in a huge number of areas. For example, statistical
algorithms are the basis for software involved in the automated recognition of handwritten
or spoken text; statistical methods are commonly used in law cases, for example in DNA
profiling; statistical process control is used to increase the quality and productivity of
manufacturing and service processes; individuals are selected for direct mail marketing
campaigns through a statistical analysis of their characteristics. With modern information
technology, massive amounts of data are routinely collected and stored. But data do not
equal information, and it is the purpose of the Statistical Sciences to provide and analyze
data so that the maximum amount of information or knowledge may be obtained³. Poor
or improperly analyzed data may be useless or misleading. The same could be said about
poorly collected data.
We use probability models to represent many phenomena, populations, or processes
and to deal with problems that involve variability. You studied these models in your first
probability course and you have seen how they describe variability. This course will focus
on the collection, analysis and interpretation of data and the probability models studied
earlier will be used extensively. The most important material from your probability course
is the material dealing with random variables, including distributions such as the Binomial,
Poisson, Multinomial, Normal or Gaussian, Uniform and Exponential. You should review
this material.
Statistical Sciences is a large discipline and this course is only an introduction. Our
broad objective is to discuss all aspects of: problem formulation, planning of an empirical
study, formal and informal analysis of data, and the conclusions and limitations of the
analysis. We must remember that data are collected and models are constructed for a
specific reason. In any given application we should keep the big picture in mind (e.g. Why
are we studying this? What else do we know about it?) even when considering one specific
aspect of a problem. We finish this introduction with a recent quote⁴ from Hal Varian,
Google's chief economist.
“The ability to take data - to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it’s going to be a hugely important skill in the next
decades, not only at the professional level but even at the educational level for elementary
³ A brilliant example of how to create information through data visualization is found in the video by
Hans Rosling at: http://www.youtube.com/watch?v=jbkSRLYSojo
⁴ For the complete article see "How the web challenges managers", Hal Varian, The McKinsey Quarterly,
January 2009.
school kids, for high school kids, for college kids. Because now we really do have essen-
tially free and ubiquitous data. So the complemintary (sic) scarce factor is the ability to
understand that data and extract value from it.
I think statisticians are part of it, but it’s just a part. You also want to be able to
visualize the data, communicate the data, and utilize it effectively. But I do think those
skills - of being able to access, understand, and communicate the insights you get from data
analysis - are going to be extremely important. Managers need to be able to access and
understand the data themselves.”
Variates can also be complex such as an image or an open ended response to a survey
question.
We are interested in functions of the variates over the whole population; for example
the average drop in blood pressure due to a treatment for individuals with hypertension
or the proportion of a population having a certain characteristic. We call these functions
attributes of the population or process.
We represent variates by letters such as x, y, z. For example, we might define a variate
y as the size of the insurance claim or the time between claims. The values of y typically
vary across the units in a population or process. This variability generates uncertainty and
makes it necessary to study populations and processes by collecting data about them. By
data, we mean the values of the variates for a sample of units in the population or a sample
of units taken from the process.
In planning to collect data about some process or population, we must carefully specify
what the objectives are. Then, we must consider feasible methods for collecting data as
well as the extent to which it will be possible to answer questions of interest. This sounds simple
but is usually difficult to do well, especially since resources are always limited.
There are several ways in which we can obtain data. One way is purely according to
what is available: that is, data are provided by some existing source. Huge amounts of
data collected by many technological systems are of this type, for example, data on credit
card usage or on purchases made by customers in a supermarket. Sometimes it is not
clear what available data represent and they may be unsuitable for serious analysis. For
example, people who voluntarily provide data in a web survey may not be representative of
the population at large. Alternatively, we may plan and execute a sampling plan to collect
new data. Statistical Sciences stress the importance of obtaining data that will be objective
and provide maximal information at a reasonable cost. There are three broad approaches:
(i) Sample Surveys. The object of many studies is to learn about a finite population
(e.g. all persons over 19 in Ontario as of September 12 in a given year). In this case
information about the population may be obtained by selecting a “representative”
sample of units from the population and determining the variates of interest for each
unit in the sample. Obtaining such a sample can be challenging and expensive. In
a survey sample the variates of interest are most often collected using a question-
naire. Sample surveys are widely used in government statistical studies, economics,
marketing, public opinion polls, sociology, quality assurance and other areas.
(ii) Observational Studies. An observational study is one in which data are collected
about a process or population without any attempt to change the value of one or
more variates for the sampled units. For example, in studying risk factors associated
with a disease such as lung cancer, we might investigate all cases of the disease at a
particular hospital (or perhaps a sample of them) that occur over a given time period.
We would also examine a sample of individuals who did not have the disease. A dis-
tinction between a sample survey and an observational study is that for observational
The three types of studies described above are not mutually exclusive, and many studies
involve aspects of all of them. Here are some slightly more detailed examples.
Consider, for example, soft drinks sold in nominal 355 ml cans. Because of inherent
variation in the filling process, the amount of liquid y that goes into a can varies over a
small range. Note that the manufacturer would like the variability in y to be as small as
possible, and for cans to contain at least 355 ml. Suppose that the manufacturer has just
added a new filling machine to increase the plant's capacity. The process engineer wants
to compare the new machine with an old one. Here the population of interest is the cans
filled in the future by both machines. She decides to do this by sampling some filled cans
from each machine and accurately measuring the amount of liquid y in each can. This is
an observational study.
How exactly should the sample be chosen? The machines may drift over time (that is,
the average of the y values or the variability in the y values may vary systematically up or
down over time) so we should select cans over time from each machine. We have to decide
how many, over what time period, and when to collect the cans from each machine.
campaigns in which large numbers of individuals are contacted by mail and invited to
acquire a product or service. Such individuals are usually picked from a much larger number
of persons on whom the company has information. For example, in a credit card marketing
campaign a company might have data on several million persons, pertaining to demographic
(e.g. sex, age, place of residence), financial (e.g. salary, other credit cards held, spending
patterns) and other variates. Based on the data, the company wishes to select persons whom
it considers have a good chance of responding positively to the mail-out. The challenge is
to use data from previous mail campaigns, along with the current data, to achieve as high
a response rate as possible.
Numerical Summaries
We now describe some numerical summaries which are useful for describing features of
a single variate in a data set when the variate is either continuous or discrete. These
summaries fall generally into three categories: measures of location (mean, median, and
mode), measures of variability or dispersion (variance, range, and interquartile range), and
measures of shape (skewness and kurtosis).
1. Measures of location:
The (sample) mean, also called the sample average: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
The (sample) median $\hat{m}$, which is the middle value when n is odd and the sample is ordered
from smallest to largest, and the average of the two middle values when n is even.
Since the median is less affected by a few extreme observations (see Problem 1) it is
a more robust measure of location.
The (sample) mode, or the value of y which appears in the sample with the highest
frequency (not necessarily unique).
2. Measures of variability or dispersion:
The (sample) variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right]$
and the (sample) standard deviation: $s = \sqrt{s^2}$.
The range $= y_{(n)} - y_{(1)}$ where $y_{(n)} = \max(y_1, y_2, \ldots, y_n)$ and $y_{(1)} = \min(y_1, y_2, \ldots, y_n)$.
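As a quick illustration, these summaries are available as built-in functions in R; a minimal sketch (the data vector below is made up purely for the example):
y <- c(3.1, 4.7, 2.2, 5.9, 4.4, 3.8, 6.3, 2.9)   # hypothetical sample of n = 8 values
mean(y)            # sample mean
median(y)          # sample median
var(y)             # sample variance (uses the n - 1 divisor)
sd(y)              # sample standard deviation
max(y) - min(y)    # range = y(n) - y(1)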
3. Measures of shape:
Measures of shape generally indicate how the data, in terms of a relative frequency
histogram, differ from the Normal bell-shaped curve, for example whether one tail of the
relative frequency histogram is substantially larger than the other so the histogram is asym-
metric, or whether both tails of the relative frequency histogram are large so the data are
more prone to extreme values than data from a Normal distribution.
[Figure 1.1: Relative frequency histogram for data with positive skewness (skewness = 1.15)]
When the relative frequency histogram of the data is approximately symmetric then
there is an approximately equal balance between the positive and negative values in the
sum $\sum_{i=1}^{n}(y_i - \bar{y})^3$ and this results in a value for the skewness that is approximately zero. If
the relative frequency histogram of the data has a long right tail (see Figure 1.1), then the
positive values of $(y_i - \bar{y})^3$ dominate the negative values in the sum and the value of the
skewness will be positive.
Similarly if the relative frequency histogram of the data has a long left tail (see Figure
1.2) then the value of the skewness will be negative.
[Figure 1.2: Relative frequency histogram for data with negative skewness (skewness = -1.35)]
The (sample) kurtosis measures the heaviness of the tails and the peakedness of the data relative to data
that are Normally distributed. For the Normal distribution the kurtosis is equal to 3.
Since the term $(y_i - \bar{y})^4$ is always positive, the kurtosis is always positive and values
greater than three indicate heavier tails (and a more peaked center) than data that are
Normally distributed. See Figures 1.3 and 1.4. Typical financial data such as the S&P 500
index have kurtosis greater than three, because the extreme returns (both large and small)
are more frequent than one would expect for Normally distributed data.
[Figure 1.3: Relative frequency histogram for data with kurtosis > 3, with a G(0.15, 1.52) p.d.f. superimposed]
[Figure 1.4: Relative frequency histogram for data with kurtosis < 3 (skewness = 0.08, kurtosis = 1.73), with a G(0.49, 0.29) p.d.f. superimposed]
Definition 1 (sample percentiles and sample quantiles): The pth (sample) quantile (also
called the 100pth (sample) percentile) is a value, call it q(p), determined as follows:
If $m \notin \{1, 2, \ldots, n\}$ but $1 < m < n$ then determine the closest integer $j$ such that
$j < m < j + 1$ and take $q(p) = \frac{1}{2}\left[y_{(j)} + y_{(j+1)}\right]$.
Depending on the size of the data set, quantiles are not uniquely defined for all values
of p. For example, what is the median of the values $\{1, 2, 3, 4, 5, 6\}$? What is the lower
quartile? There are different conventions for defining quantiles in these cases; if the sample
size is large, the differences in the quantiles based on the various definitions are small.
Definition 2 The values q(0.5), q(0.25) and q(0.75) are called the median, the lower or
first quartile, and the upper or third quartile respectively.
We can easily understand what the sample mean, quantiles and percentiles tell us about
the variate values in a data set. The sample variance and sample standard deviation measure
the variability or spread of the variate values in a data set. We prefer the standard deviation
because it has the same scale as the original variate. Another way to measure variability
is to use the interquartile range, the difference between the upper or third quartile and the
lower or first quartile.
Since the interquartile range is less affected by a few extreme observations (see Problem
2) it is a more robust measure of variability.
Definition 4 The five number summary of a data set consists of the smallest observation,
the lower quartile, the median, the upper quartile and the largest value, that is, the five
values $y_{(1)}, q(0.25), q(0.5), q(0.75), y_{(n)}$.
The five number summary provides a concise numerical summary of a data set which
provides information about the location (through the median), the spread (through the
lower and upper quartiles) and the range (through the minimum and maximum values).
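In R the five number summary and the quartiles can be obtained directly; a minimal sketch (the data vector is made up for the example, and note the caveat above about differing quantile conventions):
y <- c(3.1, 4.7, 2.2, 5.9, 4.4, 3.8, 6.3, 2.9)   # hypothetical data vector
fivenum(y)                        # smallest value, lower quartile, median, upper quartile, largest value
quantile(y, c(0.25, 0.5, 0.75))   # quartiles (R offers several quantile conventions; see ?quantile)
IQR(y)                            # interquartile range q(0.75) - q(0.25)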
Their initial Body Mass Index (BMI) was also calculated. BMI is used to measure obesity
or severely low weight. It is defined as:
$$BMI = \frac{\text{weight (kg)}}{\text{height (m)}^2}.$$
There is some variation in what different guidelines refer to as "overweight", "under-
weight", etc. We present one such classification in Table 1.1. The BMI obesity classification
is an example of an ordinal variate.
The data are available in the file ch1example131.txt available on the course web page
and are listed in Appendix C. For statistical analysis of the data, it is convenient to record
the data in row-column format (see Table 1.2). The first row of the file gives the variate
names, in this case subject number, sex (M=male or F=female), height, weight and BMI.
Each subsequent row gives the variate values for a particular subject.
The five number summaries for the BMI data for each sex are given in Table 1.3 along
with the sample mean and standard deviation.
From the table, we see that there are only small differences in the median and the mean.
For the standard deviation, IQR and the range we notice that the values are all larger for
the females. In other words, there is more variability in the BMI measurements for females
than for males in this sample.
We can also construct a relative frequency table that gives the proportion of subjects
that fall within each obesity class by sex.
From Table 1.4, we see that the reason for the larger variability for females is that there
is a greater proportion of females in the extreme classes.
Sample Correlation
So far we have looked only at numerical summaries of a data set $\{y_1, y_2, \ldots, y_n\}$. Often
we have bivariate data of the form $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. A numerical summary
of such data is the sample correlation.
Definition 5 The sample correlation, denoted by r, for data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, \qquad S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
and
$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2.$$
The sample correlation, which takes on values between $-1$ and $1$, is a measure of the
linear relationship between the two variates x and y. If the value of r is close to $1$ then
we say that there is a strong positive linear relationship between the two variates while if
the value of r is close to $-1$ then we say that there is a strong negative linear relationship
between the two variates. If the value of r is close to $0$ then we say that there is no linear
relationship between the two variates.
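A minimal sketch of this calculation in R, using a small made-up bivariate data set (the vectors x and y below are hypothetical, not data from the notes):
x <- c(1.6, 1.7, 1.8, 1.7, 1.9, 1.6)    # hypothetical heights
y <- c(60, 72, 80, 68, 85, 55)          # hypothetical weights
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxy / sqrt(Sxx * Syy)                   # sample correlation from the definition
cor(x, y)                               # the built-in function gives the same value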
Relative Risk
Recall that categorical variates consist of group or category names that do not neces-
sarily have any ordering. If two variates of interest in a study are categorical variates then
it does not make sense to use sample correlation as a measure of the relationship between
the two variates.
What measure can be used to summarize the relationship between taking daily aspirin and
the occurrence of CHD?
One measure which is used to summarize the relationship between two categorical vari-
ates is relative risk. To define relative risk consider a generalized version of Table 1.5 (Table 1.6).
If the events A and B are independent then
$$\frac{P(A \mid B)}{P(A \mid \bar{B})} = 1$$
and otherwise the ratio is not equal to one. In the PHS if we let A = takes daily aspirin
and B = CHD then we can estimate this ratio using the ratio of the sample proportions.
Definition 6 For categorical data in the form of Table 1.6 the relative risk of event A in
group B as compared to group $\bar{B}$ is
$$\text{relative risk} = \frac{y_{11}/(y_{11} + y_{12})}{y_{21}/(y_{21} + y_{22})}.$$
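A minimal sketch of this calculation in R; the 2 x 2 counts below are made up and simply stand in for Table 1.6 (row 1 = group B, row 2 = group B-bar; column 1 = event A occurs, column 2 = event A does not occur):
tab <- matrix(c(20, 180, 40, 160), nrow = 2, byrow = TRUE)  # hypothetical counts y11, y12, y21, y22
p1 <- tab[1, 1] / sum(tab[1, ])   # estimated P(A | B)
p2 <- tab[2, 1] / sum(tab[2, ])   # estimated P(A | B-bar)
p1 / p2                           # estimated relative risk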
Graphical Summaries
We consider several types of plots for a data set $\{y_1, y_2, \ldots, y_n\}$ and one type of plot for a
data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Frequency histograms
Consider measurements $\{y_1, y_2, \ldots, y_n\}$ on a variate y. Partition the range of y into k
non-overlapping intervals $I_j = [a_{j-1}, a_j)$, $j = 1, 2, \ldots, k$, and then calculate, for $j = 1, \ldots, k$,
the frequency $f_j$, the number of sample values which fall in the interval $I_j$. We can then construct:
(a) a "standard" frequency histogram where the intervals $I_j$ are of equal length. The
height of the rectangle for $I_j$ is the frequency $f_j$ or relative frequency $f_j/n$. This type
of histogram is similar to a bar chart.
(b) a "relative" frequency histogram, where the intervals $I_j = [a_{j-1}, a_j)$ may or may not
be of equal length. The height of the rectangle for $I_j$ is chosen so that its area equals
$f_j/n$, that is, the height of the rectangle for $I_j$ is equal to
$$\frac{f_j/n}{a_j - a_{j-1}}.$$
Note that in this case the sum of the areas of the rectangles in the histogram is equal
to one.
We can make the two types of frequency histograms visually comparable by using inter-
vals of equal length for both types. If we wish to compare two groups which have different
sample sizes then a relative frequency histogram must be used. If we wish to superimpose
a probability density function on a frequency histogram to see how well the data fit the
model then a relative frequency histogram must always be used.
To construct a frequency histogram, the number and location of the intervals must be
chosen. The intervals are typically selected so that there are ten to fifteen intervals and each
interval contains at least one y-value from the sample (that is, each $f_j \ge 1$). If a software
package is used to produce the frequency histogram (see Section 1.7) then the intervals are
usually chosen automatically. An option for user specified intervals is also usually provided.
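For example, a relative frequency histogram with user specified intervals can be drawn in R roughly as follows (a minimal sketch; the data vector and break points below are invented for illustration):
y <- c(21.3, 24.8, 19.5, 27.2, 23.1, 30.4, 22.6, 25.9, 28.7, 20.2)  # hypothetical BMI values
br <- seq(18, 32, 2)                         # user specified interval endpoints a0, a1, ..., ak
hist(y, breaks = br, freq = FALSE,           # freq = FALSE makes the rectangle areas sum to one
     xlab = 'BMI', ylab = 'Relative frequency')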
[Figure: relative frequency histogram of BMI (kurtosis = 3.03)]
[Figure: relative frequency histogram of BMI]
The data are listed in Appendix C and are available in the file Brake Pad Lifetime
Data.txt which is posted on the course web page. Notice that the distribution has a very
different shape compared to the BMI histograms. The brake pad lifetimes have a long right
tail which is consistent with a skewness value which is positive and not close to zero. The
high degree of variability in lifetimes is due to the wide variety of driving conditions which
different cars are exposed to, as well as to variability in how soon car owners decide to
replace their brake pads.
[Figure: relative frequency histogram of brake pad lifetimes (skewness = 1.28)]
[Figure 1.8: Empirical cumulative distribution function of heights for males and for females]
Boxplots
In many situations, we want to compare the values of a variate for two or more groups,
as in Example 1.3.1 where we compared BMI values and heights for males versus females.
Especially when the number of groups is large (or the sample sizes within groups are small),
side-by-side boxplots are a convenient way to display the data. Boxplots are also called box
and whisker plots.
The boxplot is usually displayed vertically. The center line in each box corresponds
to the median and the lower and upper sides of the box correspond to the lower quartile
q(0.25) and the upper quartile q(0.75) respectively. The so-called whiskers extend down
and up from the box to a horizontal line. The lower line is placed at the smallest observed
data value that is larger than the value $q(0.25) - 1.5\,IQR$ where $IQR = q(0.75) - q(0.25)$
is the interquartile range. Similarly the upper line is placed at the largest observed data
value that is smaller than the value $q(0.75) + 1.5\,IQR$. Any values beyond the whiskers
(often called outliers) are plotted with special symbols.
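Side-by-side boxplots such as Figure 1.9 are produced in R with the boxplot() function; a one-line sketch, assuming the variates weight and sex from Example 1.3.1 are available (for example after read.table() and attach()):
boxplot(weight ~ sex, ylab = 'Weight (kg)')   # one box per group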
[Figure 1.9: side-by-side boxplots of weight for males and females]
Figure 1.9 displays side-by-side boxplots of male and female weights from Example 1.3.1.
We can see for this sample that males are generally heavier than females but that the spread
of the two distributions is about the same. For the males and the females, the center line
in the box, which corresponds to the median, divides the box and whiskers approximately
in half which indicates that both distributions are roughly symmetric about the median.
For the females there are two very large weights.
Boxplots are particularly useful for comparing several groups. Figure 1.10 shows a
comparison of the miles per gallon (MPG) for 100 cars by country of origin. The boxplot
makes it easy to see the differences and similarities between the cars from different countries.
The graphical summaries we have just discussed are most useful for summarizing variates
which are either continuous or discrete with many possible values. For categorical variates
[Figure 1.10: Boxplots for miles per gallon for 100 cars from six different countries]
the data can be best summarized using bar graphs and pie charts. Such graphs can be used
incorrectly. See Problems 15-18 at the end of this chapter.
The graphical summaries discussed to this point deal with a single variate. If we have
data on two variates x and y for each unit in the sample then the data set is represented
as $\{(x_i, y_i), i = 1, \ldots, n\}$. We are often interested in examining the relationships between
the two variates.
Scatterplots
A scatterplot, which is a plot of the points $(x_i, y_i)$, $i = 1, \ldots, n$, can be used to see
whether the two variates are related in some way.
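In R a scatterplot is drawn with the plot() function; a one-line sketch, assuming vectors height and weight for the sampled units are available (for instance from the data of Example 1.3.1):
plot(height, weight, xlab = 'Height (m)', ylab = 'Weight (kg)')   # one point per sampled unit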
[Figure 1.11: scatterplot of weight versus height for males (r = 0.55)]
Figures 1.11 and 1.12 give the scatterplots of weight versus height for males
and females respectively for the data in Example 1.3.1. As expected, there is a tendency
[Figure 1.12: scatterplot of weight versus height for females (r = 0.31)]
for weight to increase as height increases for both sexes. What might be surprising is the
variability in weights for a given height.
the variate values vary so random variables can describe this variation
empirical studies usually lead to inferences that involve some degree of uncertainty,
and probability is used to quantify this uncertainty
models allow us to characterize processes and to simulate them via computer experi-
ments
of the units in the population. Getting such a list would be expensive and time consuming
so the actual selection procedure is likely to be very different. We select a sample of 500
units from the list at random and count the number of smokers in the sample. We model
this selection process using a Binomial random variable Y with probability function (p.f.)
$$P(Y = y; \theta) = \binom{500}{y}\,\theta^y (1-\theta)^{500-y} \quad \text{for } y = 0, 1, \ldots, 500$$
Here the parameter $\theta$ represents the unknown proportion of smokers in the population,
one attribute of interest in the study.
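This probability function can be evaluated in R with dbinom(); a minimal sketch (the value of $\theta$ below is made up purely for illustration):
theta <- 0.3                              # hypothetical value of the unknown proportion
dbinom(153, size = 500, prob = theta)     # P(Y = 153; theta) under the Binomial model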
Here the parameter $\theta > 0$ represents the mean lifetime of the brake pads in the population
since, in the model, the expected value of Y is $E(Y) = \theta$.
To model the sampling procedure, we assume that the data $\{y_1, \ldots, y_{200}\}$ represent 200
independent realizations of the random variable Y. That is, we let $Y_i$ = the lifetime for
the ith brake pad in the sample, $i = 1, 2, \ldots, 200$, and we assume that $Y_1, Y_2, \ldots, Y_{200}$ are
independent Exponential random variables each having the same mean $\theta$.
We can use the model and the data to estimate $\theta$ and other attributes of interest such
as the proportion of brake pads that fail in the first 100,000 km of use. In terms of the
model, we can represent this proportion by
$$P(Y \le 100; \theta) = \int_0^{100} f(y; \theta)\,dy = 1 - e^{-100/\theta}$$
where lifetime is measured in thousands of kilometres.
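A rough sketch of how this estimate could be obtained in R (the placeholder lifetimes below are simulated and stand in for the 200 observed values, which are not reproduced here):
lifetime <- rexp(200, rate = 1/75)    # placeholder for the 200 observed lifetimes (in 1000s of km)
theta.hat <- mean(lifetime)           # the sample mean estimates the Exponential mean theta
1 - exp(-100 / theta.hat)             # estimated proportion failing in the first 100,000 km
pexp(100, rate = 1 / theta.hat)       # the same quantity via the built-in Exponential c.d.f.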
coffee consumption (a somewhat unlikely proposition to be sure) then CHD would be the
explanatory variate and coffee habits would be the response variate.
In some cases it is not clear which is the explanatory variate and which is the response
variate. For example, the response variable Y might be the weight (in kg) of a randomly
selected female in the age range 16-25, in some population. A person’s weight is related to
their height. We might want to study this relationship by considering females with a given
height x (say in meters), and proposing that the distribution of Y, given x, is Gaussian,
$G(\alpha + \beta x, \sigma)$. That is, we propose that the average (expected) weight of a female depends
linearly on her height x and we write this as
$$E(Y \mid x) = \alpha + \beta x.$$
However it would be possible to reverse the roles of the two variates here and consider
the weight to be an explanatory variate and height the response variate, if for example we
wished to predict height using data on individuals' weights.
Models for describing the relationships among two or more variates are considered in
more detail in Chapters 6 and 7.
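As a concrete illustration of what this model says, here is a small sketch in R; the parameter values below are invented purely for illustration and are not estimates from any data in these notes:
alpha <- -100; beta <- 95; sigma <- 9                 # hypothetical values of the parameters
x <- 1.70                                             # a given height in metres
alpha + beta * x                                      # E(Y | x), the expected weight at this height
1 - pnorm(70, mean = alpha + beta * x, sd = sigma)    # P(Y > 70 | x) under the Gaussian model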
Suppose we are interested in the question “Is the smoking rate among teenage girls higher
than the rate among teenage boys?” From the data, we see that the sample proportion of
girls who smoke is 82/250 = 0.328 or 32.8% and the sample proportion of males who smoke
is 71/250 = 0.284 or 28.4%. In the sample, the smoking rate for females is higher. But
what can we say about the whole population? To proceed, we formulate the hypothesis
that there is no difference in the population rates. Then assuming the hypothesis is true,
we construct two Binomial models as in Example 1.4.1, each with a common parameter $\theta$.
We can estimate $\theta$ using the combined data so that $\hat{\theta} = 153/500 = 0.306$ or 30.6%. Then
using the model and the estimate, we can calculate the probability of such a large difference
in the observed rates. Such a large difference occurs about 20% of the time (if we selected
samples over and over and the hypothesis of no difference is true) so such a large difference
in observed rates happens fairly often and therefore, based on the observed data, there is no
evidence of a difference in the population smoking rates. In Chapter 7 we discuss a formal
method for testing the hypothesis of no difference in rates between teenage girls and boys.
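The probability of a large difference can also be approximated by simulation; the sketch below is illustrative only (it measures "such a large difference" by the absolute difference in sample proportions, and other reasonable choices will give somewhat different numbers):
set.seed(1)
nsim <- 10000
girls <- rbinom(nsim, 250, 0.306)       # simulated numbers of smokers among 250 girls
boys  <- rbinom(nsim, 250, 0.306)       # and among 250 boys, under a common rate of 0.306
diffs <- girls/250 - boys/250
mean(abs(diffs) >= (0.328 - 0.284))     # proportion of simulations with a difference at least as large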
[Figure 1.13: Run chart of the volume for the new machine over time]
First we examine if the behaviour of the two machines is stable over time. In Figures
1.13 and 1.14, we show a run chart of the volumes over time for each machine. There is no
indication of a systematic pattern for either machine so we have some confidence that the
data can be used to predict the performance of the machines in the near future.
The sample mean and standard deviation for the new machine are 356.8 and 0.54 ml
respectively and, for the old machine, are 357.5 and 0.80 ml. Figures 1.15 and 1.16 show the
relative frequency histograms of the volumes for the new machine and the old machine re-
spectively. To see how well a Gaussian model might fit these data we superimpose Gaussian
probability density functions with the mean equal to the sample mean and the standard
deviation equal to the sample standard deviation on each histogram. The agreement is
[Figure 1.14: Run chart of the volume for the old machine over time]
[Figure 1.15: Relative frequency histogram of volumes for the new machine, with a G(356.76, 0.54) p.d.f. superimposed]
reasonable given that the sample size for both data sets is only forty. Note that it only
makes sense to compare density functions and relative frequency histograms (not standard
frequency histograms) since the areas of both equal one.
None of the 80 cans had volume less than the required 355 ml. However, we examined
only 40 cans per machine. We can use the Gaussian models to estimate the long term
proportion of cans that fall below the required volume. For the new machine, we find
that if $V \sim G(356.8, 0.54)$ then $P(V \le 355) = 0.0005$ so about 5 in 10,000 cans will be
underfilled. The corresponding rate for the old machine is about 8 in 10,000 cans. These
estimates are subject to a high degree of uncertainty because they are based on a small
sample and we have no way to test that the models are appropriate so far into the tails of
the distribution.
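These tail probabilities come directly from the fitted Gaussian models and can be computed in R with pnorm() (the arguments are the volume followed by the mean and standard deviation of the fitted model):
pnorm(355, mean = 356.8, sd = 0.54)   # P(V <= 355) for the new machine
pnorm(355, mean = 357.5, sd = 0.80)   # P(V <= 355) for the old machine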
We can also see that the new machine is superior because of its smaller sample mean
[Figure 1.16: Relative frequency histogram of volumes for the old machine (skewness = 0.54, kurtosis = 2.84), with a G(357.5, 0.80) p.d.f. superimposed]
which translates into less overfill (and hence less cost to the manufacturer). It is possible
to adjust the mean of the new machine to a lower value because of its smaller standard
deviation.
Using R
Lots of online help is available in R. You can use a search engine to find the answer to most
questions. For example, if you search for "R tutorial", you will find a number of excellent
introductions to R that explain how to carry out most tasks. Within R, you can find help
for a specific function using the command help(function name) but it is often easier to look
externally using a search engine.
Here we show how to use R on a Windows machine. You should have R open as you
read this material so you can play along.
Some R Basics
R is command-line driven. For example, if you want to define a quantity x, use the assign-
ment function <- (that is, < followed by -).
x <- 15
If you want to change x, you can up-arrow to return to the assignment and make the
change you want, followed by a carriage return.
If you are doing something more complicated, you can type the code in Notepad or
some other text editor (Word is not advised!) and cut and paste the code into R.
You can save your session and, if you choose, it will be restored the next time you
open R.
You can add comments by entering # with the comment following on the same line.
Vectors
Vectors can consist of numbers or other symbols; we will consider only numbers here.
Vectors are defined using the function c( ). For example,
x <- c(1, 3, 5, 7, 9)
defines a vector of length 5 with the elements given. You can display the vector by typing
x and carriage return. Vectors and other objects possess certain attributes. For example,
typing
length(x)
will give the length of the vector x.
You can cut and paste comma-delimited strings of data into the function c(). This is
one way to enter data into R. See below to learn how you can read a file into R.
Arithmetic
R can be used as a calculator. Enter the calculation after the prompt > and hit return as
shown below.
> 7+3
[1] 10
> 7*3
[1] 21
> 7/3
[1] 2.333333
> 2^3
[1] 8
You can save the result of the calculation by assigning it to a variable such as y<-7+3
Some Functions
Functions in R operate element-wise on vectors. For example, round(y, n) rounds the elements
of y to n decimal places. With x <- c(1, 3, 5, 7, 9) as above and y <- seq(1, 2, 0.25):
> round(exp(y), 2)
[1] 2.72 3.49 4.48 5.75 7.39
> x+2*y
[1] 3.0 5.5 8.0 10.5 13.0
We often want to compare summary statistics of variate values by group (such as sex). We
can use the by() function. For example,
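the following is a minimal sketch (it assumes the file ch1example131.txt from Example 1.3.1 is in the working directory; the column names sex and BMI follow the description of that file but the exact names are an assumption):
a <- read.table('ch1example131.txt', header=T)   # BMI data from Example 1.3.1
attach(a)
by(BMI, sex, summary)   # numerical summary of BMI separately for each sex
by(BMI, sex, mean)      # sample mean of BMI for each sex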
Graphs
Note that in R, a graphics window opens automatically when a graphical function is used. A
useful way to create several plots in the same window is the function par() so, for example,
following the command
par(mfrow=c(2,2))
the next 4 plots will be placed in a 2 x 2 array within the same window.
There are various plotting and graphical functions. Three useful ones are hist() (frequency
histograms), plot() (scatterplots) and boxplot() (boxplots).
You can control the axes of plots (especially useful when you are making comparisons) by
including xlim = c(a, b) and ylim = c(d, e) as arguments separated by commas within
the plotting function. Also you can label the axes by including xlab = "your choice" and
ylab = "your choice". A title can be added using main = "your choice". There are many
other options. Check out the Html help “An Introduction to R” for more information on
plotting.
To save a graph, you can copy and paste into a Word document for example or alternately
use the "Save as" menu to create a file in one of several formats.
Probability Distributions
There are functions which compute values of probability functions or probability density
functions, cumulative distribution functions, and quantiles for various distributions. It is
also possible to generate random samples from these distributions. Some examples follow
for the Gaussian distribution. For other distributions, type help(distributionname) or
check the “Introduction to R” in the Html help menu.
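For the Gaussian distribution, the four standard functions are dnorm (probability density function), pnorm (cumulative distribution function), qnorm (quantiles) and rnorm (random samples); a brief sketch with arbitrary numerical arguments:
dnorm(1.5, mean = 0, sd = 1)       # value of the G(0,1) probability density function at 1.5
pnorm(1.5, mean = 0, sd = 1)       # cumulative distribution function P(Y <= 1.5)
qnorm(0.975, mean = 0, sd = 1)     # 0.975 quantile (approximately 1.96)
rnorm(10, mean = 356.8, sd = 0.54) # random sample of 10 values from G(356.8, 0.54)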
R stores and retrieves data from the current working directory. You can use the command
getwd()
to determine the current working directory. To change the working directory, look in
the File menu for "Change dir..." and browse until you reach your choice.
There are many ways to read data into R. The files we used in Chapter 1 are in .txt
format with the variate labels in the first row separated by spaces and the corresponding
variate values in subsequent rows. We created the files from EXCEL and then saved the
files as text files.
To read such files, first be sure the file is in your working directory. Then use the
commands
a<-read.table(’filename.txt’,header=T) #filename in single quotes
attach(a)
The "header=T" tells R that the variate names are in the first row of the data file. The
object a is called a data frame in R and the variate names are of the form "a$v1" where
v1 is the name of the first column in the file. The R function attach(a) allows you to drop
the "a$" from the variate names.
You can cut and paste output generated by R in the sessions window although the format
is usually messed up. This approach works best for figures. You can write an R vector or
other object to a text file through
write(y,file="filename")
To see more about the write function use help(write).
In the file ch1example152.txt, there are three columns labelled hour, machine and volume.
The data are
dd1<-dnorm(w1,356.8,0.53)
points(w1,dd1,type='l') # Superimpose Gaussian p.d.f.
hist(v2,br,freq=F,xlab='volume',ylab='density',main='Old Machine')
w2<-357.5+0.799*seq(-3,3,0.01) # Values where Gaussian p.d.f. is located
dd2<-dnorm(w2,357.5,0.8)
points(w2,dd2,type='l') # Superimpose Gaussian p.d.f.
2. The sample standard deviation and the interquartile range are two different measures
of the variability of a data set $(y_1, y_2, \ldots, y_n)$. Let s be the sample standard deviation
and let IQR be the interquartile range of the data set.
(a) Suppose we transform the data so that $u_i = a + by_i$, $i = 1, \ldots, n$ where a and b are
constants and $b \ne 0$. How are the sample standard deviation and interquartile
range of $u_1, \ldots, u_n$ related to s and IQR?
(b) Show that $\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$.
(c) Suppose we add an extra observation y0 to the data set. Use the result in
(b) to write the sample standard deviation of the augmented data set in terms
of y0 and the original sample standard deviation. What happens when y0 gets
large (or small)?
3. The sample skewness and kurtosis are two different measures of the shape of a data
set $(y_1, y_2, \ldots, y_n)$. Let $g_1$ be the sample skewness and let $g_2$ be the sample kurtosis
of the data set. Suppose we transform the data so that $u_i = a + by_i$, $i = 1, \ldots, n$ where
a and b are constants and $b \ne 0$. How are the sample skewness and sample kurtosis
of $u_1, \ldots, u_n$ related to $g_1$ and $g_2$?
4. Suppose the data $c_1, c_2, \ldots, c_{24}$ represent the costs of production for a firm every
month from January 2013 to December 2014. For this data set the sample mean was
$2500, the sample standard deviation was $5500, the sample median was $2600, the sample
skewness was 1.2, the sample kurtosis was 3.9, and the range was $7500. The rela-
tionship between cost and revenue is given by $r_i = 7c_i + 1000$, $i = 1, 2, \ldots, 24$. Find
the sample mean, standard deviation, median, skewness, kurtosis and range of the
revenues.
(a) Plot a relative frequency histogram of the data. Is the process producing pistons
within the specifications?
(b) Calculate the sample mean $\bar{y}$ and the sample median of the diameters.
(c) Calculate the sample standard deviation s and the IQR.
(d) Such data are often summarized using a single performance index called Ppk
defined as
$$Ppk = \min\left(\frac{U - \bar{y}}{3s},\; \frac{\bar{y} - L}{3s}\right)$$
where $(L, U) = (-10, 10)$ are the lower and upper specification limits. Calculate
Ppk for these data.
(e) Explain why larger values of Ppk (i.e. greater than 1) are desirable.
(f) Suppose we fit a Gaussian model to the data with mean and standard deviation
equal to the corresponding sample quantities, that is, with $\mu = \bar{y}$ and $\sigma = s$. Use
the fitted model to estimate the proportion of diameters (in the process) that
are out of specification.
6. In the above problem, we saw how to estimate the performance measure Ppk based on
a sample of 50 pistons, a very small proportion of one day's production. To get an idea
of how reliable this estimate is, we can model the process output by a Gaussian random
variable Y with mean and standard deviation equal to the corresponding sample
quantities. Then we can use R to generate another 50 observations and recalculate
Ppk. We do this many times. Here is some R code. Make sure you replace x with
the appropriate values.
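A rough sketch of the kind of simulation described (ybar and s below are placeholders for the sample mean and standard deviation from Problem 5, i.e. the values the problem refers to as x):
ybar <- 0.1; s <- 3.2          # placeholders: replace with the values computed in Problem 5
L <- -10; U <- 10              # specification limits
ppk <- numeric(1000)
for (i in 1:1000) {
  y <- rnorm(50, ybar, s)      # generate a new sample of 50 diameters from the fitted model
  ppk[i] <- min((U - mean(y)) / (3 * sd(y)), (mean(y) - L) / (3 * sd(y)))
}
hist(ppk, freq = FALSE, xlab = 'Ppk')   # distribution of the recalculated Ppk values
mean(ppk)                               # average Ppk over the 1000 iterations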
(a) Compare the Ppk from the original data with the average Ppk value from the
1000 iterations. Mark the original Ppk value on the histogram of generated Ppk
values. What do you notice? What would you conclude about how good the
original estimate of Ppk was?
(b) Repeat the above exercise but this time use a sample of 300 pistons rather than
50 pistons. What conclusion would you make about using a sample of 300 versus
50 pistons?
7. Construct the empirical cumulative distribution function for the following data:
0.76 0.43 0.52 0.45 0.01 0.85 0.63 0.39 0.72 0.88
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\right].$$
9. The data below show the lengths (in cm) of 43 male coyotes and 40 female coyotes
captured in Nova Scotia. (Based on Table 2.3.2 in Wild and Seber 1999.) The data
are available in the file ch1exercise5.txt.
Females (x)
71.0 73.7 80.0 81.3 83.5 84.0 84.0 84.5 85.0 85.0 86.0 86.4
86.5 86.5 88.0 87.0 88.0 88.0 88.5 89.5 90.0 90.0 90.2 91.0
91.4 91.5 91.7 92.0 93.0 93.0 93.5 93.5 93.5 96.0 97.0 97.0
97.8 98.0 101.6 102.5
$\sum_{i=1}^{40} x_i = 3569.6 \qquad \sum_{i=1}^{40} x_i^2 = 320223.38$
Males (y)
78.0 80.0 80.0 81.3 83.8 84.5 85.0 86.0 86.4 86.5 87.0 88.0
88.0 88.9 88.9 90.0 90.5 91.0 91.0 91.0 91.4 92.0 92.5 93.0
93.5 95.0 95.0 95.0 94.0 95.5 96.0 96.0 96.0 96.0 97.0 98.5
100.0 100.5 101.0 101.6 103.0 104.1 105.0
$\sum_{i=1}^{43} y_i = 3958.4 \qquad \sum_{i=1}^{43} y_i^2 = 366276.84$
(a) Plot relative frequency histograms of the lengths for females and males sepa-
rately. Be sure to use the same bins.
(b) Determine the five number summary for each data set.
(c) Compute the sample mean $\bar{y}$ and sample standard deviation s for the lengths
of the female and male coyotes separately. Assuming $\mu = \bar{y}$ and $\sigma = s$, overlay
the corresponding $G(\mu, \sigma)$ probability density function on the histograms for the
females and males separately. Comment on how well the Normal model fits each
data set.
(d) Plot the empirical distribution function of the lengths for females and males
separately. Assuming $\mu = \bar{y}$ and $\sigma = s$, overlay the corresponding $G(\mu, \sigma)$
cumulative distribution functions. Comment on how well the Normal model fits
each data set.
10. Does the value of an actor influence the amount grossed by a movie? The "value
of an actor" will be measured by the average amount the actor's movies have made.
The "amount grossed by a movie" is measured by taking the highest grossing movie
in which that actor played a major part. For example, Tom Hanks, whose value is
103.2, had his best results with Toy Story 3 (gross 415.0). All numbers are corrected
to 2012 dollar amounts and have units "millions of U.S. dollars". Twenty actors
were selected by taking the first twenty alphabetically listed by name on the website
(http://boxofficemojo.com/people/), and the corresponding measurements (above)
were obtained for each actor. The data for 20 actors, their value (x) and the gross
(y) of their best movie are given below:
Actor 1 2 3 4 5 6 7 8 9 10
Value (x) 67 49.6 37.7 47.3 47.3 32.9 36.5 92.8 17.6 14.4
Gross (y) 177.2 201.6 183.4 55.1 154.7 182.8 277.5 415 90.8 83.9
Actor 11 12 13 14 15 16 17 18 19 20
Value (x) 51.1 54 30.5 42.1 23.6 62.4 32.9 26.9 43.7 50.3
Gross (y) 158.7 242.8 37.1 220 146.3 168.4 173.8 58.4 199 533
$\sum_{i=1}^{20} x_i = 860.6 \qquad \sum_{i=1}^{20} x_i^2 = 43315.04 \qquad \sum_{i=1}^{20} x_i y_i = 184540.93$
$\sum_{i=1}^{20} y_i = 3759.5 \qquad \sum_{i=1}^{20} y_i^2 = 971560.19$
(a) What are the two variates in this data set? Choose one variate to be an explana-
tory variate and the other to be a response variate. Justify your choice.
(b) Plot a scatterplot of the data.
(c) Calculate the sample correlation for the data $(x_i, y_i)$, $i = 1, 2, \ldots, 20$. Is there a
strong positive or negative relationship between the two variates?
(d) Is it reasonable to conclude that the explanatory variate in this problem causes
the response variate? Explain.
11. In a very large population a proportion $\theta$ of people have blood type A. Suppose n
people are selected at random. Define the random variable Y = number of people
with blood type A in the sample of size n.
(a) What is the probability function for Y? What are E(Y) and Var(Y)? What
assumptions have you made?
(b) Suppose n = 50. What is the probability of observing 20 people with blood type
A as a function of $\theta$?
12. The IQ's of UWaterloo Math students are Normally distributed with
mean $\mu$ and standard deviation $\sigma$. Define the random variable Y = IQ of a
UWaterloo Math student.
(a) What is the probability density function of Y ? What are E(Y ) and V ar(Y )?
(b) Suppose that the IQ’s for 16 students were:
127 108 127 136 125 130 127 117 123 112 129 109 109 112 91 134
$\sum_{i=1}^{16} y_i = 1916, \qquad \sum_{i=1}^{16} y_i^2 = 231618$
What is a reasonable estimate of $\mu$ based on these data? What is a reasonable
estimate of $\sigma^2$ based on these data? Estimate the probability that a randomly
chosen UWaterloo Math student will have an IQ greater than 120.
(c) Suppose $Y_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, n$ independently.
(i) What is the distribution of
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i\,?$$
Find $E(\bar{Y})$ and $Var(\bar{Y})$. What happens to $Var(\bar{Y})$ as $n \to \infty$? What does
this imply about how far $\bar{Y}$ is from $\mu$ for large n?
(ii) Find $P\left(\bar{Y} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{Y} + 1.96\,\sigma/\sqrt{n}\right)$.
(iii) If $\sigma = 12$, find the smallest value of n such that $P\left(\left|\bar{Y} - \mu\right| \le 1.0\right) \ge 0.95$.
13. The lifetimes of a certain type of battery are Exponentially distributed with parameter
$\theta$. Define the random variable Y = lifetime of a battery.
(a) What is the probability density function of Y? What are E(Y) and Var(Y)?
(b) Suppose the lifetimes (in hours) for 20 batteries were:
20.5 9.9 206.4 9.1 45.8 232.7 127.8 60.4 4.3 3.6
184.8 3.0 4.4 72.3 22.3 195.3 86.3 8.8 23.3 4.1
$\sum_{i=1}^{20} y_i = 1325.1$
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
(i) Find $E(\bar{Y})$ and $Var(\bar{Y})$. What happens to $Var(\bar{Y})$ as $n \to \infty$? What does
this imply about how far $\bar{Y}$ is from $\theta$ for large n?
(ii) Approximate $P\left(\bar{Y} - 1.6449\,\theta/\sqrt{n} \le \theta \le \bar{Y} + 1.6449\,\theta/\sqrt{n}\right)$.
(a) What is the probability function for Y? What are E(Y) and Var(Y)?
(b) Suppose on 6 consecutive Wednesdays the number of accidents observed was
0, 2, 0, 1, 3, 1. What is the probability of observing these data as a function
of $\theta$? (Remember the Poisson process assumption that the numbers of events in
non-overlapping time intervals are independent.) What is a reasonable estimate
of $\theta$ based on these data? Estimate the probability that there is at least one
accident at this intersection next Wednesday.
(c) Suppose $Y_i \sim \text{Poisson}(\theta)$, $i = 1, 2, \ldots, n$ independently. Let
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
(i) Find $E(\bar{Y})$ and $Var(\bar{Y})$. What happens to $Var(\bar{Y})$ as $n \to \infty$? What does
this imply about how far $\bar{Y}$ is from $\theta$ for large n?
(ii) Approximate $P\left(\bar{Y} - 1.96\sqrt{\theta/n} \le \theta \le \bar{Y} + 1.96\sqrt{\theta/n}\right)$. You may ignore
the continuity correction.
Figure 1.17: Pie chart for support for Republican Presidential candidates
15. The pie chart in Figure 1.17, from Fox News, shows the support for various Republican
Presidential candidates. What do you notice about this pie chart? Comment on how
effective pie charts are in general at conveying information.
16. For the graph in Figure 1.18 indicate whether you believe the graph is effective in
conveying information by giving at least one feature of the graph which is either good
or bad.
[Figure 1.18: graph comparing boys and girls across the categories candy, chips, chocolate bars, cookies, crackers, fruit, ice cream, popcorn, pretzels and vegetables]
17. The graphs in Figures 1.19 and 1.20 are two more classic Fox News graphs. What do
you notice? What political message do you think they were trying to convey to their
audience?
18. Information about the mortality from malignant neoplasms (cancer) for females living
in Ontario is given in Figures 1.21 and 1.22 for the years 1970 and 2000 respectively.
The same information displayed in these two pie charts is also displayed in the bar
graph in Figure 1.23. Which display seems to carry the most information?
[Figure 1.21: Mortality from malignant neoplasms for females in Ontario 1970 (pie chart with categories Lung, Other, Breast, Stomach, Colorectal)]
[Figure 1.22: Mortality from malignant neoplasms for females in Ontario in 2000 (pie chart with categories Lung, Other, Stomach, Breast, Colorectal)]
[Figure 1.23: Mortality from malignant neoplasms for females living in Ontario, 1970 and 2000 (bar graph with categories Lung, Leuk. & Lymph., Breast, Colorectal, Stomach, Other)]
2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION
2. Past experience with data sets from the population or process, which has shown that
certain distributions are suitable.
⁶ The material in this section is largely a review of material you have seen in a previous probability course.
This material is available in the STAT 230 Notes which are posted on the course website.
⁷ The University of Wisconsin-Madison statistician George E.P. Box (18 October 1919 - 28 March 2013)
says of statistical models that "All models are wrong but some are useful" which is to say that although
rarely do they fit very large amounts of data perfectly, they do assist in describing and drawing inferences
from real data.
In probability theory, there is a large emphasis on factor 1 above, and there are many
"families" of probability distributions that describe certain types of situations. For example,
the Binomial distribution was derived as a model for outcomes in repeated independent
trials with two possible outcomes on each trial while the Poisson distribution was derived
as a model for the random occurrence of events in time or space. The Gaussian or Normal
distribution, on the other hand, is often used to represent the distributions of continuous
measurements such as the heights or weights of individuals. This choice is based largely on
past experience that such models are suitable and on mathematical convenience.
In choosing a model we usually consider families of probability distributions. To be
specific, we suppose that for a random variable Y we have a family of probability func-
tions/probability density functions $f(y; \theta)$ indexed by the parameter $\theta$ (which may be a
vector of values). In order to apply the model to a specific problem we need a value for $\theta$.
The process of selecting a value for $\theta$ based on the observed data is referred to as "estimat-
ing" the value of $\theta$ or "fitting" the model. The next section describes the most widely used
method for estimating $\theta$.
Most applications require a sequence of steps in the formulation (the word "specifica-
tion" is also used) of a model. In particular, we often start with some family of models in
mind, but find after examining the data set and fitting the model that it is unsuitable in cer-
tain respects. (Methods for checking the suitability of a model will be discussed in Section
2.4.) We then try other models, and perhaps look at more data, in order to work towards
a satisfactory model. This is usually an iterative process, which is sometimes represented
by diagrams such as:
Statistics devotes considerable effort to the steps of this process. However, in this
course we will focus on settings in which the models are not too complicated, so that
model formulation problems are minimized. There are several distributions that you should
review before continuing since they will appear frequently in these notes. See the STAT
220/230/240 Course Notes available on the course webpage. You should also consult the
Table of Distributions in these course notes for a condensed table of properties of these
distributions including their moment generating functions and their moments.
Discrete versus continuous random variables:
c.d.f. (discrete): $F(x) = P(X \le x) = \sum_{t \le x} P(X = t)$; F is a right continuous step function for all $x \in \Re$.
c.d.f. (continuous): $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$; F is a continuous function for all $x \in \Re$.
p.f./p.d.f. (discrete): $f(x) = P(X = x)$.
p.f./p.d.f. (continuous): $f(x) = \frac{d}{dx}F(x) \ne P(X = x) = 0$.
Probability of an event (discrete): $P(X \in A) = \sum_{x \in A} P(X = x) = \sum_{x \in A} f(x)$.
Probability of an event (continuous): $P(a < X \le b) = F(b) - F(a) = \int_{a}^{b} f(x)\,dx$.
Total probability (discrete): $\sum_{\text{all } x} P(X = x) = \sum_{\text{all } x} f(x) = 1$.
Total probability (continuous): $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
Expectation (discrete): $E[g(X)] = \sum_{\text{all } x} g(x) f(x)$.
Expectation (continuous): $E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$.
Binomial Distribution
The discrete random variable (r.v.) Y has a Binomial distribution if its probability
function is of the form
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where $\theta$ is a parameter with $0 \le \theta \le 1$. We write $Y \sim \text{Binomial}(n, \theta)$. Recall that
$E(Y) = n\theta$ and $Var(Y) = n\theta(1-\theta)$.
Poisson Distribution
The discrete random variable Y has a Poisson distribution if its probability function is
of the form
$$f(y; \theta) = \frac{\theta^y e^{-\theta}}{y!} \quad \text{for } y = 0, 1, 2, \ldots$$
where $\theta$ is a parameter with $\theta > 0$. We write $Y \sim \text{Poisson}(\theta)$. Recall that $E(Y) = \theta$ and
$Var(Y) = \theta$.
Exponential Distribution
The continuous random variable Y has an Exponential distribution if its probability
density function is of the form
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \quad \text{for } y > 0$$
where $\theta$ is a parameter with $\theta > 0$. We write $Y \sim \text{Exponential}(\theta)$. Recall that $E(Y) = \theta$
and $Var(Y) = \theta^2$.
Gaussian (Normal) Distribution
The continuous random variable Y has a Gaussian (or Normal) distribution if its
probability density function is of the form
$$f(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right] \quad \text{for } y \in \Re$$
where $\mu$ and $\sigma$ are parameters, with $\mu \in \Re$ and $\sigma > 0$. Recall that $E(Y) = \mu$, $Var(Y) =
\sigma^2$, and the standard deviation of Y is $sd(Y) = \sigma$. We write either $Y \sim G(\mu, \sigma)$ or
$Y \sim N(\mu, \sigma^2)$. Note that in the former case, $G(\mu, \sigma)$, the second parameter is the stan-
dard deviation whereas in the latter, $N(\mu, \sigma^2)$, the second parameter is the variance $\sigma^2$.
Most software syntax including R requires that you input the standard deviation for the
parameter. As seen in examples in Chapter 1, the Gaussian distribution provides a suitable
model for the distribution of measurements on characteristics like the height or weight of
individuals in certain populations, but is also used in many other settings. It is particularly
useful in finance where it is the most commonly used model for asset prices, exchange rates,
interest rates, etc.
Multinomial Distribution
The Multinomial distribution is a multivariate distribution in which the discrete random variables Y1, ..., Yk (k ≥ 2) have the joint probability function
$$P(Y_{1} = y_{1}, \ldots, Y_{k} = y_{k}; \theta) = f(y_{1}, \ldots, y_{k}; \theta) = \frac{n!}{y_{1}!\,y_{2}!\cdots y_{k}!}\,\theta_{1}^{y_{1}}\theta_{2}^{y_{2}}\cdots\theta_{k}^{y_{k}} \tag{2.1}$$
where each yi, for i = 1, ..., k, is an integer between 0 and n, satisfying the condition $\sum_{i=1}^{k} y_{i} = n$. The elements of the parameter vector θ = (θ1, ..., θk) satisfy 0 < θi < 1 for i = 1, ..., k and $\sum_{i=1}^{k}\theta_{i} = 1$. This distribution is a generalization of the Binomial distribution. It arises when there are repeated independent trials, where each trial has k possible outcomes (call them outcomes 1, ..., k), and the probability that outcome i occurs is θi. If Yi, i = 1, ..., k, is the number of times that outcome i occurs in a sequence of n independent trials, then
(Y1, ..., Yk) have the joint probability function given in (2.1). We write (Y1, ..., Yk) ∼ Multinomial(n, θ).
Since $\sum_{i=1}^{k} Y_{i} = n$ we can rewrite f(y1, ..., yk; θ) using only k − 1 variables, say y1, ..., y_{k−1}, by replacing yk with n − y1 − ⋯ − y_{k−1}. We see that the Multinomial distribution with k = 2 is just the Binomial distribution, where the two possible outcomes are S (Success) and F (Failure).
We now turn to the problem of fitting a model. This requires estimating or assigning numerical values to the parameters in the model (for example, θ in an Exponential model or μ and σ in the Gaussian model). A function of the data that does not depend on any unknown parameters is called a statistic. The numerical summaries discussed in Chapter 1 are all examples of statistics. A point estimate is also a statistic.
Instead of ad hoc approaches to estimation as in (2.2), it is desirable to have a general
method for estimating parameters. The method of maximum likelihood is a very general
method, which we now describe.
Let the discrete (vector) random variable Y represent potential data that will be used to estimate θ, and let y represent the actual observed data that are obtained in a specific
application. Note that to apply the method of maximum likelihood, we must know (or
make assumptions about) how the data y were collected. It is usually assumed here that
the data set consists of measurements on a random sample of population units.
$$L(\theta) = L(\theta; y) = P(Y = y; \theta) \quad \text{for } \theta \in \Omega$$
Note that the likelihood function is a function of the parameter θ and the given data y. For convenience we usually write just L(θ). Also, the likelihood function is the probability that we observe at random the observation y, considered as a function of the parameter θ. Obviously values of the parameter that make our observation y more probable would seem more credible or likely than those that make it less probable. Therefore values of θ for which L(θ) is large are more consistent with the observed data y. This seems like a “sensible” approach, and it turns out to have very good properties.
Definition 9 The value of θ which maximizes L(θ) for given data y is called the maximum likelihood estimate (m.l. estimate) of θ. It is the value of θ which maximizes the probability of observing the data y. The value is denoted by θ̂.
We are surrounded by polls. They guide the policies of political leaders, the products
that are developed by manufacturers, and increasingly the content of the media. The fol-
lowing is an example of a public opinion poll.
The poll described in the article was conducted in November 2010. Harris/Decima uses a telephone poll of 2000 “representative” adults. Figure 2.1 shows the results for the polls conducted in fall 2009 and 2010. In 2009 and 2010, 26% of respondents agreed and 48% disagreed with the statement: “University and college teachers earn too much”.
Figure 2.1: Harris/Decima poll. The two bars are from polls conducted in Nov. 9, 2009 (left bar) and Nov. 10, 2010 (right bar)
Harris/Decima declared their result to be accurate to within ±2.2%, 19 times out of 20 (the margin of error for regional, demographic or other subgroups is larger). What does this mean and how were these estimates and intervals obtained?
Suppose that the random variable Y represents the number of individuals who, in a randomly selected group of n persons, agreed with the statement. Suppose we assume that Y is closely modelled by a Binomial distribution with probability function
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where θ represents the proportion of the Canadian adult population that agree with the statement. If n people are selected and y people agree with the statement then the likelihood function is given by
$$L(\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 < \theta < 1. \tag{2.3}$$
It is easy to see that (2.3) is maximized by the value θ = θ̂ = y/n. (You should show this.) The estimate θ̂ = y/n is called the sample proportion. For the Harris/Decima poll conducted in 2010, y = 520 people out of n = 2000 people agreed with the statement so the likelihood function is
$$L(\theta) = \binom{2000}{520}\theta^{520}(1-\theta)^{1480} \quad \text{for } 0 < \theta < 1 \tag{2.4}$$
and the maximum likelihood estimate is 520/2000 = 0.26 or 26%. This is also easily seen from a graph of the likelihood function (2.4) given in Figure 2.2.
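A plot like Figure 2.2 is easy to reproduce in R by evaluating (2.4) over a grid of θ values; the sketch below is illustrative only (the variable names are not from the notes).
n <- 2000                                  # number of people polled in 2010
y <- 520                                   # number who agreed with the statement
theta <- seq(0.22, 0.30, by = 0.0001)      # grid of parameter values
L <- dbinom(y, size = n, prob = theta)     # choose(n,y)*theta^y*(1-theta)^(n-y)
plot(theta, L, type = "l", xlab = "theta", ylab = "L(theta)")
theta[which.max(L)]                        # grid value maximizing L, approximately 0.26 = y/n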
Figure 2.2: Likelihood function for the Harris/Decima poll and corresponding interval estimate for θ
The interval suggested by the pollsters was 26% ± 2.2%, or [23.8, 28.2]. Looking at Figure 2.2 we see that the interval [0.238, 0.282] is a reasonable interval for the parameter θ since it seems to contain most of the values of θ with large values of the likelihood L(θ). We will return to the construction of such interval estimates in Chapter 4.
Note that the likelihood function’s shape and where its maximum occurs are not affected if L(θ) is multiplied by a constant. Indeed it is not the absolute value of the likelihood function that is important but the relative values at two different values of the parameter, e.g. L(θ1)/L(θ2). You might think of this ratio as how much more or less consistent the data are with the parameter θ1 versus θ2. The ratio L(θ1)/L(θ2) is also unaffected if L(θ) is multiplied by a constant. In view of this the likelihood may be defined as P(Y = y; θ) or as any constant multiple of it, so, for example, we could drop the term $\binom{n}{y}$ in (2.3) and define L(θ) = θ^y(1 − θ)^{n−y}. This function and (2.3) are maximized by the same value θ̂ = y/n and have the same shape. Indeed we might rescale the likelihood function by dividing through by its maximum value L(θ̂) so that the new function has a maximum value equal to one.
Sometimes it is easier to work with the log (log = ln in these course notes) of the likelihood function.
Figure 2.3: The functions L(θ) (upper graph) and l(θ) (lower graph) are both maximized at the same value θ = θ̂
$$l(\theta) = \log L(\theta) \quad \text{for } \theta \in \Omega.$$
Note that θ̂ also maximizes l(θ). In fact in Figure 2.3 we see that l(θ), the lower of the two curves, is a monotone function of L(θ) so they increase together and decrease together. This implies that both functions have a maximum at the same value θ = θ̂.
Because functions are often (but not always!) maximized by setting their derivatives equal to zero (see footnote 11 below), we can usually obtain θ̂ by solving the equation
$$\frac{dl}{d\theta} = 0.$$
For example, from L(θ) = θ^y(1 − θ)^{n−y} we get l(θ) = y log θ + (n − y) log(1 − θ) and
$$\frac{dl}{d\theta} = \frac{y}{\theta} - \frac{n-y}{1-\theta}.$$
Solving dl/dθ = 0 gives θ = y/n. The First Derivative Test can be used to verify that this corresponds to a maximum value so the maximum likelihood estimate of θ is θ̂ = y/n.
Footnote 11: Can you think of an example of a continuous function f(x) defined on the interval [0, 1] for which the maximum $\max_{0 \le x \le 1} f(x)$ is NOT found by setting f′(x) = 0?
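As a purely numerical check on this calculus, the log likelihood can also be maximized directly in R with optimize(); a small sketch using the poll values n = 2000 and y = 520 for concreteness (everything else here is illustrative, not from the notes):
n <- 2000; y <- 520
loglik <- function(theta) y*log(theta) + (n - y)*log(1 - theta)   # l(theta) for the Binomial model
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum  # close to y/n = 0.26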
(You should recall from probability that if Y1 ; : : : ; Yn are independent random variables
then their joint probability function is the product of their individual probability functions.)
or more simply
$$L(\theta) = \theta^{n\bar{y}}e^{-n\theta} \quad \text{for } \theta > 0.$$
The log likelihood is
$$l(\theta) = n(\bar{y}\log\theta - \theta) \quad \text{for } \theta > 0$$
with derivative
$$\frac{d}{d\theta}l(\theta) = n\left(\frac{\bar{y}}{\theta} - 1\right) = \frac{n}{\theta}(\bar{y} - \theta).$$
A first derivative test easily verifies that the value θ = ȳ maximizes l(θ) and so θ̂ = ȳ is the maximum likelihood estimate of θ.
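The Poisson calculation is easy to check numerically in R; the data vector below is made up for illustration, and the log likelihood is written up to an additive constant (the term not involving θ).
y <- c(3, 5, 2, 4, 6, 3, 4, 5, 2, 4)          # hypothetical Poisson counts
n <- length(y); ybar <- mean(y)
l <- function(theta) n*(ybar*log(theta) - theta)   # log likelihood, up to an additive constant
optimize(l, c(0.01, 20), maximum = TRUE)$maximum   # agrees with the sample mean ybar = 3.8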
$$L(\theta) = L_{1}(\theta)\,L_{2}(\theta) \quad \text{for } \theta \in \Omega$$
$$L(\theta) = \theta^{1060}(1-\theta)^{2940} \quad \text{for } 0 < \theta < 1.$$
Sometimes the likelihood function for a given set of data can be constructed in more
than one way as the following example illustrates.
Example 2.2.3
Suppose that the random variable Y represents the number of persons infected with the human immunodeficiency virus (HIV) in a randomly selected group of n persons. We assume the data are reasonably modeled by Y ∼ Binomial(n, θ) with probability function
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where θ represents the proportion of the population that are infected. In this case, if we select a random sample of n persons and test them for HIV, we have Y = Y and y = y as the observed number infected. Thus
$$L(\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 < \theta < 1$$
or more simply
$$L(\theta) = \theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 < \theta < 1 \tag{2.5}$$
and again L(θ) is maximized by the value θ̂ = y/n.
For this random sample of n persons who are tested for HIV, we could also define the indicator random variable
Figure: plot of the log likelihood function l(θ) against θ
vi (ml) 8 4 2 1
no. of samples 10 10 10 10
no. with zi = 1 10 8 7 3
This gives
$$l(\theta) = 10\log(1 - e^{-8\theta}) + 8\log(1 - e^{-4\theta}) + 7\log(1 - e^{-2\theta}) + 3\log(1 - e^{-\theta}) - 21\theta \quad \text{for } \theta > 0.$$
In R the function nlm() is powerful and easy to use. In addition, statistical software packages contain special functions for fitting and analyzing a large number of statistical models. The R package MASS (which can be accessed by the command library(MASS)) has a function fitdistr that will fit many common models.
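As an illustration of these numerical tools, the sketch below fits an Exponential model to simulated data with fitdistr(); the data and the true mean of 10 are made up for illustration, and note that fitdistr parameterizes the Exponential by its rate 1/θ, so the reciprocal of the fitted rate is compared with the sample mean (the m.l. estimate of θ).
library(MASS)                      # provides fitdistr()
y <- rexp(200, rate = 1/10)        # simulated lifetimes with mean theta = 10
fit <- fitdistr(y, "exponential")  # numerical maximum likelihood fit
1/fit$estimate["rate"]             # reciprocal of the fitted rate ...
mean(y)                            # ... should essentially equal the sample mean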
with derivative
$$\frac{d}{d\theta}l(\theta) = n\left(\frac{\bar{y}}{\theta^{2}} - \frac{1}{\theta}\right) = \frac{n}{\theta^{2}}(\bar{y} - \theta).$$
A first derivative test easily verifies that the value θ = ȳ maximizes l(θ) and so θ̂ = ȳ is the maximum likelihood estimate of θ.
$$f(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^{2}}(y-\mu)^{2}\right] \quad \text{for } y \in \mathbb{R}.$$
$$L(\theta) = L(\mu, \sigma) = \prod_{i=1}^{n} f(y_{i}; \mu, \sigma) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^{2}}(y_{i}-\mu)^{2}\right] = (2\pi)^{-n/2}\sigma^{-n}\exp\left[-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu)^{2}\right] \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0$$
or more simply
$$L(\theta) = L(\mu, \sigma) = \sigma^{-n}\exp\left[-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu)^{2}\right] \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0.$$
$$l(\theta) = l(\mu, \sigma) = -n\log\sigma - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu)^{2} \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0.$$
To maximize l(μ, σ) with respect to both parameters μ and σ we solve the two equations (see footnotes 12 and 13)
$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu) = \frac{n}{\sigma^{2}}(\bar{y}-\mu) = 0$$
$$\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^{3}}\sum_{i=1}^{n}(y_{i}-\mu)^{2} = 0,$$
which give
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}y_{i} = \bar{y} \quad \text{and} \quad \hat{\sigma} = \left[\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}\right]^{1/2}.$$
Footnote 12: To maximize a function of two variables, set the derivative with respect to each variable equal to zero. Of course finding values at which the derivatives are zero does not prove this is a maximum. Showing it is a maximum is another exercise in calculus.
Footnote 13: In case you have not met partial derivatives, the notation ∂l/∂μ means we are taking the derivative with respect to μ while holding the other parameter σ constant. Similarly ∂l/∂σ is the derivative with respect to σ while holding μ constant.
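The Gaussian maximum likelihood estimates just derived are one-line calculations in R; note that σ̂ divides by n, not n − 1, so it differs slightly from the value returned by sd(). A minimal sketch with simulated data (the sample size and true parameter values are illustrative):
y <- rnorm(50, mean = 10, sd = 2)        # simulated sample from G(10, 2)
mu.hat <- mean(y)                        # maximum likelihood estimate of mu
sigma.hat <- sqrt(mean((y - mu.hat)^2))  # m.l. estimate of sigma; divides by n, unlike sd(y)
c(mu.hat, sigma.hat, sd(y))              # compare sigma.hat with the usual sample standard deviation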
$$P(Y_{1} = y_{1}, Y_{2} = y_{2}, Y_{3} = y_{3}) = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,\theta_{1}^{y_{1}}\theta_{2}^{y_{2}}\theta_{3}^{y_{3}}$$
with
$$\theta_{1} = \theta^{2}, \quad \theta_{2} = 2\theta(1-\theta), \quad \theta_{3} = (1-\theta)^{2}$$
so that
$$P(Y_{1} = y_{1}, Y_{2} = y_{2}, Y_{3} = y_{3}) = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,[\theta^{2}]^{y_{1}}[2\theta(1-\theta)]^{y_{2}}[(1-\theta)^{2}]^{y_{3}}.$$
The likelihood function is therefore
$$L(\theta) = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,[\theta^{2}]^{y_{1}}[2\theta(1-\theta)]^{y_{2}}[(1-\theta)^{2}]^{y_{3}} = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,2^{y_{2}}\,\theta^{2y_{1}+y_{2}}(1-\theta)^{y_{2}+2y_{3}} \quad \text{for } 0 < \theta < 1$$
or more simply
$$L(\theta) = \theta^{2y_{1}+y_{2}}(1-\theta)^{y_{2}+2y_{3}} \quad \text{for } 0 < \theta < 1.$$
The log likelihood is l(θ) = (2y1 + y2) log θ + (y2 + 2y3) log(1 − θ) and
$$\frac{dl}{d\theta} = \frac{2y_{1}+y_{2}}{\theta} - \frac{y_{2}+2y_{3}}{1-\theta}$$
and
$$\frac{dl}{d\theta} = 0 \quad \text{if} \quad \theta = \frac{2y_{1}+y_{2}}{2y_{1}+2y_{2}+2y_{3}} = \frac{2y_{1}+y_{2}}{2n}$$
so
$$\hat{\theta} = \frac{2y_{1}+y_{2}}{2n}$$
is the maximum likelihood estimate of θ.
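With observed counts (y1, y2, y3) this estimate is a one-line calculation; the counts below are hypothetical and used only to illustrate the formula.
y1 <- 42; y2 <- 46; y3 <- 12        # hypothetical category counts, n = 100
n <- y1 + y2 + y3
theta.hat <- (2*y1 + y2)/(2*n)      # m.l. estimate (2*y1 + y2)/(2n)
theta.hat                           # here (84 + 46)/200 = 0.65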
Example 2.5.1
Suppose we want to estimate attributes associated with BMI for some population of individuals (for example, Canadian males age 21-35). If the distribution of BMI values in the population is well described by a Gaussian model, Y ∼ G(μ, σ), then by estimating μ and σ we can estimate any attribute associated with the BMI distribution. For example:
(i) The mean BMI in the population corresponds to μ = E(Y) for the Gaussian distribution.
(ii) The median BMI in the population corresponds to the median of the Gaussian distribution, which equals μ since the Gaussian distribution is symmetric about its mean.
(iii) For the BMI population, the 0.1 (population) quantile is Q(0.1) = μ − 1.28σ. (To see this, note that P(Y ≤ μ − 1.28σ) = P(Z ≤ −1.28) = 0.1, where Z = (Y − μ)/σ has a G(0, 1) distribution.)
(iv) The fraction of the population with BMI over 35.0 is given by
$$p = 1 - P(Y \le 35.0) = 1 - \Phi\!\left(\frac{35.0-\mu}{\sigma}\right)$$
where Φ denotes the standard Normal cumulative distribution function.
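By the invariance property, each of these attributes is estimated by substituting the maximum likelihood estimates μ̂ and σ̂ into the expressions above; a small sketch in R with purely illustrative values of the estimates:
mu.hat <- 27.1; sigma.hat <- 4.6         # hypothetical m.l. estimates of mu and sigma
mu.hat - 1.28*sigma.hat                  # estimate of the 0.1 quantile Q(0.1)
qnorm(0.1, mu.hat, sigma.hat)            # essentially the same quantity, computed with qnorm
1 - pnorm(35.0, mu.hat, sigma.hat)       # estimate of p = P(Y > 35.0)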
Example 2.6.1 Rutherford and Geiger study of alpha-particles and the Poisson model
In 1910 the physicists Ernest Rutherford and Hans Geiger conducted an experiment in which they recorded the number of alpha particles emitted from a polonium source (as detected by a Geiger counter) during 2608 time intervals each of length 1/8 minute. The number of particles j detected in the time interval and the frequency fj of that number of particles is given in Table 2.1.
We can see whether a Poisson model fits these data by comparing the observed frequencies with the expected frequencies calculated assuming a Poisson model. To calculate these expected frequencies we need to specify the mean θ of the Poisson model. We estimate θ using the sample mean for the data, which is
$$\hat{\theta} = \frac{1}{2608}\sum_{j=0}^{14} j f_{j} = \frac{1}{2608}(10097) = 3.8715.$$
The expected frequencies are then
$$e_{j} = 2608\,\frac{(3.8715)^{j}e^{-3.8715}}{j!}, \quad j = 0, 1, \ldots$$
The expected frequencies are also given in Table 2.1.
Since the observed and expected frequencies are reasonably close, the Poisson model seems to fit these data well. Of course, we have not specified how close the expected and observed frequencies need to be in order to conclude that the model is reasonable. We will look at a formal method for doing this in Chapter 7.
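The expected frequencies are a direct calculation in R once θ̂ is known; only the totals n = 2608 and θ̂ = 3.8715 from the text are used here, so the output can be compared against Table 2.1.
theta.hat <- 3.8715
n <- 2608
j <- 0:14
e <- n * dpois(j, theta.hat)   # expected frequency e_j = n * P(Y = j; theta.hat)
round(e, 1)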
This comparison of observed and expected frequencies to check the fit of a model can also be used for data that have arisen from a continuous model. The following is an example.
$$e_{j} = 200\int_{a_{j-1}}^{a_{j}}\frac{1}{49.0275}e^{-y/49.0275}\,dy = 200\left[e^{-a_{j-1}/49.0275} - e^{-a_{j}/49.0275}\right].$$
The expected frequencies are also given in Table 2.2. We notice that the observed and
expected frequencies are not close in this case and therefore the Exponential model does
not seem to be a good model for these data.
The drawback of this method for continuous data is that the intervals must be selected
and this adds a degree of arbitrariness to the method. The following graphical methods
provide better techniques for checking the fit of the model for continuous data.
Graphical Checks of Models
We may also use graphical techniques for checking the fit of a model. These methods are particularly useful for continuous data.
The first graphical method is to superimpose the probability density function on the relative frequency histogram of the data as we did in Figures 1.15 and 1.16 for the data from the can filler study.
A second graphical procedure is to plot the empirical cumulative distribution function F̂(y) and then to superimpose on this a plot of the model-based cumulative distribution function, P(Y ≤ y; θ) = F(y; θ). We saw an example of such a plot in Chapter 1 but we provide more detail here. The objective is to compare two cumulative distribution functions, one that we hypothesized is the cumulative distribution function for the population, and the other obtained from the sample. If they differ a great deal, this would suggest that the hypothesized distribution is a poor fit.
Footnote 15: See the video at www.watstat.ca called "The empirical c.d.f. and the qqplot" on the material in this section.
0.76  0.43  0.52  0.45  0.01  0.85  0.63  0.39  0.72  0.88
The first step in constructing the empirical cumulative distribution function is to order the observations from smallest to largest (see footnote 16), obtaining
0.01  0.39  0.43  0.45  0.52  0.63  0.72  0.76  0.85  0.88
If you were then asked, purely on the basis of this data, what you thought the probability is that a random value in the population falls below a given value y, you would probably respond with the proportion in the sample that falls below y. For example, since four of the values 0.01, 0.39, 0.43, 0.45 are less than 0.5, we would estimate the cumulative distribution function at 0.5 using 4/10. Thus, we define the empirical cumulative distribution function for all real numbers y as the proportion of the sample less than or equal to y, that is,
$$\hat{F}(y) = \frac{\text{number of values in the sample which are} \le y}{n}.$$
More generally for a sample of size n we first order the yi's, i = 1, ..., n, to obtain the ordered values y(1) ≤ y(2) ≤ ... ≤ y(n). F̂(y) is a step function with a jump at each of the ordered observed values y(i). If y(1), y(2), ..., y(n) are all different values, then F̂(y(j)) = j/n and the jumps are all of size 1/n. In general the size of a jump at a particular point y is the number of values in the sample that are equal to y, divided by n.
Footnote 16: We usually denote the ordered values y(1) ≤ y(2) ≤ ... ≤ y(n), where y(1) is the smallest and y(n) is the largest. In this case y(n) = 0.88.
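In R the empirical cumulative distribution function is computed with ecdf(); a minimal sketch for the ten values above:
y <- c(0.76, 0.43, 0.52, 0.45, 0.01, 0.85, 0.63, 0.39, 0.72, 0.88)
Fhat <- ecdf(y)    # step function with jumps of size 1/10 at each observation
Fhat(0.5)          # proportion of the sample <= 0.5, equal to 4/10
plot(Fhat)         # plot of the empirical c.d.f.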
Figure 2.5: The empirical cumulative distribution function for n = 10 data values and a superimposed Uniform(0, 1) cumulative distribution function.
As an example we consider data (see Appendix C) for the time between 300 eruptions, between the first and the fifteenth of August 1985, of the geyser Old Faithful in Yellowstone National Park. One might hypothesize that the random distribution of times between consecutive eruptions follows a Normal distribution. We plot the empirical cumulative distribution function in Figure 2.6 together with the cumulative distribution function of a Gaussian distribution. Of course we don’t know the parameters of the appropriate Gaussian distribution so we use the sample mean 72.3 and sample standard deviation 13.9 in order to approximate these parameters. Are the differences between the two curves in Figure 2.6 sufficient that we would have to conclude a distribution other than the Gaussian? There are two ways of trying to get another view of the magnitude of these differences. The first way is to plot the relative frequency histogram of the data and then superimpose the Gaussian curve. The second way is to use a qqplot which will be discussed in the next section.
Figure 2.6: Empirical c.d.f. of times between eruptions of Old Faithful and superimposed G(72.3, 13.9) c.d.f.
Figure 2.7 seems to indicate that the distribution of the times between eruptions is not very Normal because it appears to have two modes. The plot of the empirical cumulative distribution function did not show the shape of the distribution as clearly as the histogram. The empirical cumulative distribution function does allow us to determine the pth quantile or 100pth percentile (the left-most value yp on the horizontal axis where F̂(yp) = p). For example, from the empirical cumulative distribution function of the Old Faithful data, we see that the median time (F̂(m̂) = 0.5) between eruptions is around m̂ = 78.
Figure 2.7: Relative frequency histogram for times between eruptions of Old Faithful and superimposed G(72.3, 13.9) p.d.f.
Figure 2.8: Empirical c.d.f. of female heights and G(1.62, 0.064) c.d.f.
Figure 2.9: Relative frequency histogram of female heights and G(1.62, 0.064) p.d.f.
Figure 2.9 shows a relative frequency histogram for these data with the G(1.62, 0.0637) probability density function superimposed. The two types of plots give complementary but consistent pictures. An advantage of the distribution function comparison is that the exact heights in the sample are used, whereas in the histogram plot the data are grouped into intervals to form the histogram. However, the histogram and probability density function show the distribution of heights more clearly. Both graphs indicate that a Normal model seems reasonable for these data.
Qqplots
An alternative view, which is really just another method of graphing the empirical cumulative distribution function, tailored to the Normal distribution, is a graph called a qqplot. Suppose the data Yi, i = 1, ..., n were in fact drawn from the G(μ, σ) distribution so that the standardized variables, after we order them from smallest Y(1) to largest Y(n), are
$$Z_{(i)} = \frac{Y_{(i)} - \mu}{\sigma}.$$
These behave like the ordered values from a sample of the same size taken from the G(0, 1) distribution. Approximately what value do we expect Z(i) to take? If Φ denotes the standard Normal cumulative distribution function then for 0 < u < 1
$$P(\Phi(Z) \le u) = P(Z \le \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u$$
so that Φ(Z) has a Uniform distribution. It is easy to check that the expected value of the i'th smallest value in a random sample of size n from a Uniform(0, 1) distribution is equal to i/(n + 1) (see footnote 17), so we expect Φ(Z(i)) to be close to i/(n + 1). In other words we expect $Z_{(i)} = \frac{Y_{(i)}-\mu}{\sigma}$ to be approximately $\Phi^{-1}\!\left(\frac{i}{n+1}\right)$, or Y(i) to be roughly $\mu + \sigma\,\Phi^{-1}\!\left(\frac{i}{n+1}\right)$, a linear function of $\Phi^{-1}\!\left(\frac{i}{n+1}\right)$. This is the basic argument underlying the qqplot. If the distribution is actually Normal, then a plot of the points $\left(\Phi^{-1}\!\left(\frac{i}{n+1}\right),\; Y_{(i)}\right)$, i = 1, ..., n, should be approximately linear (subject to the usual randomness).
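The construction just described can be carried out by hand in R and compared with the built-in qqnorm(); a short sketch with simulated Normal data (the sample size and parameter values are illustrative):
y <- rnorm(100, mean = -2, sd = 3)                 # simulated G(-2, 3) sample
n <- length(y)
plot(qnorm((1:n)/(n + 1)), sort(y),                # theoretical quantiles vs ordered sample
     xlab = "Standard Normal Quantiles", ylab = "Sample Quantiles")
qqnorm(y); qqline(y)                               # built-in version with a reference line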
Since reading qqplots is an art acquired from experience, it is a good idea to generate similar plots where we know the answer. This can be done by generating data from a known distribution and then plotting a qqplot. See the R code below and Chapter 2, Problem 14. A qqplot of 100 observations randomly generated from a G(−2, 3) distribution is given in Figure 2.10. The theoretical quantiles are plotted on the horizontal axis and the empirical quantiles are plotted on the vertical axis. Since the quantiles of the Normal distribution change more rapidly in the tails of the distribution, we expect the points at both ends of the plot to lie further from the line.
Footnote 17: This is intuitively obvious since the n values Y(i) break the interval into n + 1 spacings, and it makes sense that each should have the same expected length. For empirical evidence see http://www.math.uah.edu/stat/applets/OrderStatisticExperiment.html. More formally we must first show that the p.d.f. of Y(i) is $\frac{n!}{(i-1)!(n-i)!}u^{i-1}(1-u)^{n-i}$ for 0 < u < 1. Then find the integral $E(Y_{(i)}) = \int_{0}^{1}\frac{n!}{(i-1)!(n-i)!}u^{i}(1-u)^{n-i}\,du = \frac{i}{n+1}$.
Figure 2.10: Qqplot of 100 observations generated from a G(−2, 3) distribution (Standard Normal quantiles on the horizontal axis, sample quantiles on the vertical axis)
A qqplot of the female heights is given in Figure 2.11. Overall the points lie reasonably
along a straight line. The qqplot has a staircase look because the heights are rounded to the
closest centimeter. As was the case for the relative frequency histogram and the empirical
cumulative distribution function, the qqplot indicates that the Normal model is reasonable
for these data.
Figure 2.11: Qqplot of the female heights (Standard Normal quantiles on the horizontal axis, sample quantiles on the vertical axis)
A qqplot of the times between eruptions of Old Faithful is given in Figure 2.12. The
points do not lie along a straight line which indicates as we saw before that the Normal is
not a reasonable model for these data. The two places at which the shape of the points
changes direction correspond to the two modes of these data that we observed previously.
Figure 2.12: Qqplot of the times between eruptions of Old Faithful
A qqplot of the lifetimes of brake pads (Example 1.3.3) is given in Figure 2.13. The
points form a U-shaped curve. This pattern is consistent with the long right tail and
positive skewness that we observed before. The Normal is not a reasonable model for these
data.
Figure 2.13: Qqplot of the lifetimes of brake pads
A qqplot of the data in Figure 1.4 is given in Figure 2.14. These points form an S-shaped
curve which is consistent with the fact that the data are reasonably symmetric but the data
do not have tails like the Normal distribution.
Figure 2.14: Qqplot of the data in Figure 1.4
R Code for Checking Models Using Histograms, Empirical c.d.f.’s and Qqplots
# Normal Data Example
y<-rnorm(100,5,2)     # generate 100 observations from a G(5,2) distribution
mn<-mean(y)           # find the sample mean
s<-sd(y)              # find the sample standard deviation
summary(y)            # five number summary
# skewness() is not in base R; it is assumed here to come from a package such as e1071
skewness(y,type=1)    # find the sample skewness as given in the Course Notes
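The lines above stop at the numerical summaries; a minimal continuation covering the histogram, empirical c.d.f. and qqplot checks described in this section (assuming y, mn and s from the lines above) might look like:
hist(y, freq = FALSE, main = "Relative frequency histogram")  # histogram on the density scale
curve(dnorm(x, mn, s), add = TRUE)                            # superimposed Gaussian p.d.f.
plot(ecdf(y), main = "Empirical c.d.f.")                      # empirical c.d.f.
curve(pnorm(x, mn, s), add = TRUE)                            # superimposed Gaussian c.d.f.
qqnorm(y); qqline(y)                                          # qqplot with reference line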
(a) $G(\theta) = \theta^{a}(1-\theta)^{b}$,  $0 < \theta < 1$
(b) $G(\theta) = \theta^{a}e^{-b/\theta}$,  $\theta > 0$
(c) $G(\theta) = \theta^{a}e^{-b\theta}$,  $\theta > 0$
(d) $G(\theta) = e^{-a(\theta-b)^{2}}$,  $\theta \in \mathbb{R}$
2. Consider the following two experiments whose purpose was to estimate θ, the fraction of a large population with blood type B.
Experiment 1: Individuals were selected at random until 10 with blood type B were
found. The total number of people examined was 100.
Experiment 2: One hundred individuals were selected at random and it was found
that 10 of them have blood type B.
(a) Find the probability of the observed results (as a function of θ) for the two experiments. Thus obtain the likelihood function for θ for each experiment and show that they are proportional. Show that the maximum likelihood estimate θ̂ is the same in each case. What is the maximum likelihood estimate of θ?
(b) Suppose n people came to a blood donor clinic. Assuming θ = 0.10, use the Normal approximation to the Binomial distribution (remember to use a continuity correction) to determine how large n should be to ensure that the probability of getting 10 or more donors with blood type B is at least 0.90. Use the R functions gbinom() or pbinom() to determine the exact value of n.
3. Specimens of a new high-impact plastic are tested by repeatedly striking them with a hammer until they fracture. Let Y = the number of blows required to fracture a specimen. If the specimen has a constant probability θ of surviving a blow, independently of the number of previous blows received, then the probability function for Y is
$$f(y; \theta) = P(Y = y; \theta) = \theta^{y-1}(1-\theta) \quad \text{for } y = 1, 2, \ldots, \; 0 < \theta < 1.$$
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the relative likelihood function R(θ). If n = 200 and $\sum_{i=1}^{200} y_{i} = 400$ then plot R(θ).
(c) Estimate the probability that a specimen fractures on the first blow.
(a) The numbers of transactions received in 10 separate one minute intervals were 8, 3, 2, 4, 5, 3, 6, 5, 4, 1. Write down the likelihood function for θ and find the maximum likelihood estimate θ̂.
(b) Estimate the probability that during a two-minute interval, no transactions arrive.
(c) Use the R function rpois() with the value θ = 4.1 to simulate the number of transactions received in 100 one minute intervals. Calculate the sample mean and variance; are they approximately the same? (Note that E(Y) = Var(Y) = θ for the Poisson model.)
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the relative likelihood function R(θ). If n = 20 and $\sum_{i=1}^{20} y_{i}^{2} = 72$ then plot R(θ).
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the log relative likelihood function r(θ). If n = 15 and $\sum_{i=1}^{15}\log y_{i} = 34.5$ then plot r(θ).
7. Suppose that in a population of twins, males (M) and females (F) are equally likely to occur and that the probability that a pair of twins is identical is θ. If twins are not identical, their sexes are independent.
8. The following model has been proposed for the distribution of Y = the number of children in a family, for a large population of families:
$$P(Y = 0; \theta) = \frac{1-2\theta}{1-\theta}$$
and
$$P(Y = y; \theta) = \theta^{y} \quad \text{for } y = 1, 2, \ldots \text{ and } 0 < \theta < \tfrac{1}{2}.$$
(a) What does the parameter θ represent?
(b) Suppose that n families are selected at random and the observed data were
y     0    1   ...   ymax   > ymax   Total
fy    f0   f1  ...   fmax   0        n
where fy = the observed number of families with y children and ymax = maximum
number of children observed in a family. Find the probability of observing these
data and thus determine the maximum likelihood estimate of θ.
(c) Consider a different type of sampling in which a single child is selected at random
and then the number of o¤spring in that child’s family is determined. Let X
represent the number of children in the family of a randomly chosen child. Show
that
$$P(X = x; \theta) = c\,x\,\theta^{x} \quad \text{for } x = 1, 2, \ldots \text{ and } 0 < \theta < \tfrac{1}{2}$$
and determine c.
(d) Suppose that the type of sampling in part (c) was used and that with n = 33
the following data were obtained:
x     1    2    3    4    > 4   Total
fx    22   7    3    1    0     33
Find the probability of observing these data and thus determine the maximum
likelihood estimate of θ. Estimate the probability a couple has no children using
these data.
(e) Suppose the sample in (d) was incorrectly assumed to have arisen from the sampling plan in (b). What would θ̂ be found to be? This problem shows that the way the data have been collected can affect the model.
9. When Wayne Gretzky played for the Edmonton Oilers he scored an incredible 1669
points in 696 games. The data are given in the frequency table below:
The Poisson(θ) model has been proposed for the random variable Y = number of points Wayne scores in a game.
(a) Show that the likelihood function for θ based on the Poisson model and the data in the frequency table simplifies to
$$L(\theta) = \theta^{1669}e^{-696\theta}, \quad \theta > 0.$$
10. Radioactive particles are emitted randomly over time from a source at an average rate of θ per second. In n time periods of varying lengths t1, t2, ..., tn (seconds), the numbers of particles emitted (as determined by an automatic counter) were y1, y2, ..., yn respectively.
(a) Determine an estimate of θ from these data. What assumptions have you made to do this?
(b) Suppose that the intervals are all of equal length (t1 = t2 = ... = tn = t) and that instead of knowing the yi's, we know only whether or not there were one or more particles emitted in each time interval of length t. Find the likelihood function for θ based on these data, and determine the maximum likelihood estimate of θ.
11. The marks for 100 students on a tutorial test in STAT 231 were:
Frequency histogram of the Marks
Qqplot of Marks (N(0,1) quantiles on the horizontal axis, sample quantiles on the vertical axis)
(a) Determine the sample mean ȳ and the sample standard deviation s for these data.
(b) Determine the proportion of observations in the intervals [ȳ − s, ȳ + s] and [ȳ − 2s, ȳ + 2s]. Compare these proportions with P(Y ∈ [μ − σ, μ + σ]) and P(Y ∈ [μ − 2σ, μ + 2σ]) where Y ∼ G(μ, σ).
(c) Find the sample skewness and sample kurtosis for these data. Are these values
close to what you would expect for Normally distributed data?
(d) Find the …ve-number summary for these data.
(e) Find the IQR for these data. Does the IQR agree with what you expect for Normally distributed data?
(f) Construct a relative frequency histogram and superimpose a Gaussian probability density function with μ = ȳ and σ = s.
(g) Construct an empirical distribution function for these data and superimpose a Gaussian cumulative distribution function with μ = ȳ and σ = s.
(h) Draw a boxplot for these data.
(i) Plot a qqplot for these data. Do you observe anything unusual about the qqplot? What might cause this?
(j) Based on the above information indicate whether it is reasonable to assume a Gaussian distribution for these data.
13. Consider the data on heights of adult males and females from Chapter 1. (The data are posted on the course webpage.)
(a) Assuming that for each sex the heights Y in the population from which the samples were drawn are adequately represented by Y ∼ G(μ, σ), obtain the maximum likelihood estimates μ̂ and σ̂ in each case.
(b) Give the maximum likelihood estimates for q(0.1) and q(0.9), the 10th and 90th percentiles of the height distribution for males and for females.
(c) Give the maximum likelihood estimate for the probability P(Y > 1.83) for males and females (i.e. the fraction of the population over 1.83 m, or 6 ft).
(d) A simpler estimate of P(Y > 1.83) that doesn't use the Gaussian model is
14. The qqplot of the brake pad data in Figure 2.13 indicates that the Normal distribution
is not a reasonable model for these data. Sometimes transforming the data gives a
data set for which the Normal model is more reasonable. A log transformation is often
used. Plot a qqplot of the log lifetimes and indicate whether the Normal distribution
is a reasonable model for these data. The data are posted on the course webpage.
15. In a large population of males ages 40-50, the proportion who are regular smokers is α where 0 < α < 1 and the proportion who have hypertension (high blood pressure) is β where 0 < β < 1. If the events S (a person is a smoker) and H (a person has hypertension) are independent, then for a man picked at random from the population the probabilities he falls into the four categories $SH$, $S\bar{H}$, $\bar{S}H$, $\bar{S}\bar{H}$ are respectively αβ, α(1 − β), (1 − α)β, (1 − α)(1 − β). Explain why this is true.
(a) Suppose that 100 men are selected and the numbers in each of the four categories are as follows:
Category     $SH$   $S\bar{H}$   $\bar{S}H$   $\bar{S}\bar{H}$
Frequency    20     15           22           43
Assuming that S and H are independent events, determine the likelihood function for α and β based on the Multinomial distribution, and find the maximum likelihood estimates of α and β.
(b) Compute the expected frequencies for each of the four categories using the max-
imum likelihood estimates. Do you think the model used is appropriate? Why
might it be inappropriate?
(a) y<-rnorm(100)
(b) y<-runif(100)
(c) y<-rexp(100)
(d) y<-rgamma(100,4,1)
(e) y<-rt(100,3)
(f) y<-rcauchy(100)
17. A qqplot was generated for 100 values of a variate. See Figure 2.17. Based on this
qqplot, answer the following questions:
(a) What is the approximate value of the sample median of these data?
(b) What is the approximate value of the IQR of these data?
(c) Would the frequency histogram of these data be reasonably symmetric about
the sample mean?
(d) The frequency histogram for these data would most resemble a Normal probabil-
ity density function, an Exponential probability density function or a Uniform
probability density function?
Figure 2.17: Qqplot of 100 values of a variate (N(0,1) quantiles on the horizontal axis, sample quantiles on the vertical axis)
(b) For observed k, n and y find the value N̂ that maximizes the probability in part (a). Does this ever differ much from the intuitive estimate Ñ = kn/y? (Hint: The likelihood L(N) depends on the discrete parameter N, and a good way to find where L(N) is maximized over {1, 2, 3, ...} is to examine the ratios L(N + 1)/L(N).)
(c) When might the model in part (a) be unsatisfactory?
20. Censored lifetime data: Consider the Exponential distribution as a model for the
lifetimes of equipment. In experiments, it is often not feasible to run the study long
enough that all the pieces of equipment fail. For example, suppose that n pieces of
equipment are each tested for a maximum of C hours (C is called a “censoring time”).
The observed data are: k (where 0 k n) pieces fail, at times y1 ; : : : ; yk and n k
pieces are still working after time C.
$$\hat{\theta} = \frac{1}{k}\left[\sum_{i=1}^{k} y_{i} + (n-k)C\right].$$
(c) What does part (b) give when k = 0? Explain this intuitively.
(d) A standard test for the reliability of electronic components is to subject them to large fluctuations in temperature inside specially designed ovens. For one particular type of component, 50 units were tested and k = 5 failed before 400 hours, when the test was terminated, with $\sum_{i=1}^{5} y_{i} = 450$ hours. Find the maximum likelihood estimate of θ.
21. Poisson model with a covariate: Let Y represent the number of claims in a given
year for a single general insurance policy holder. Each policy holder has a numerical
“risk score” x assigned by the company, based on available information. The risk score
may be used as a covariate (explanatory variable) when modeling the distribution of
Y , and it has been found that models of the form
[ (x)]y (x)
P (Y = yjx) = e for y = 0; 1; : : :
y!
(a) Suppose that n randomly chosen policy holders with risk scores x1, x2, ..., xn had y1, y2, ..., yn claims, respectively, in a given year. Determine the likelihood function for the unknown parameters based on these data.
(b) Can the maximum likelihood estimates be found explicitly?
3. PLANNING AND CONDUCTING EMPIRICAL STUDIES
Problem: a clear statement of the study's objectives, usually involving one or more questions.
Plan: the procedures used to carry out the study, including how we will collect the data.
Data: the collection of the data according to the Plan.
Analysis: the analysis of the data collected in light of the Problem and the Plan.
Conclusion: the conclusions that are drawn about the Problem and their limitations.
PPDAC has been designed to emphasize the statistical aspects of empirical studies. We develop each of the five steps in more detail below. Several examples of the use of PPDAC in an empirical study will be given. We identify the steps in the following example.
Example 3.1
The following news item was published by the University of Sussex, UK on February 16, 2015. It describes an empirical investigation in the field of psychology.
Campaigns to get young people to drink less should focus on the benefits of not
drinking and how it can be achieved:
Pointing out the advantages and achievability of staying sober is more effective than traditional
approaches that warn of the risks of heavy drinking, according to the research carried out at the
University of Sussex by researcher Dr Dominic Conroy. The study, published this week in the British
Journal of Health Psychology, found that university students were more likely to reduce their overall
drinking levels if they focused on the benefits of abstaining, such as more money and better health.
They were also less likely to binge drink if they had imagined strategies for how non-drinking might
be achieved – for example, being direct but polite when declining a drink, or choosing to spend
time with supportive friends. Typical promotions around healthy drinking focus on the risks of high
alcohol consumption and encourage people to monitor their drinking behaviour (e.g. by keeping a
drinks diary). However, the current study found that completing a drinks diary was less effective in
encouraging safer drinking behaviour than completing an exercise relating to non-drinking.
Dr Conroy says: “We focused on students because, in the UK, they remain a group who drink
heavily relative to their non-student peers of the same age. Similarly, attitudes about the acceptabil-
ity of heavy drinking are relatively lenient among students. “Recent campaigns, such as the NHS
Change4Life initiative, give good online guidance as to how many units you should be drinking
and how many units are in speci…c drinks. “Our research contributes to existing health promotion
advice, which seeks to encourage young people to consider taking ’dry days’ yet does not always
indicate the range of benefits nor suggest how non-drinking can be more successfully ‘managed’ in
social situations.”
Dr Conroy studied 211 English university students aged 18-25 over the course of a month. Par-
ticipants in the study completed one of four exercises involving either: imagining positive outcomes
of non-drinking during a social occasion; imagining strategies required to successfully not drink
during a social occasion; imagining both positive outcomes and required strategies; or completing a
drinks diary task.
At the start of the study, participants in the outcome group were asked to list positive outcomes
of not drinking and those in the process group listed what strategies they might use to reduce their
drinking. Those in the combined group did both. They were reminded of their answers via email
during the one month course of the study and asked to continue practising this mental simulation.
All groups completed an online survey at various points, indicating how much they had drunk
the previous week. Over the course of one month, Dr Conroy found that students who imagined
positive outcomes of non-drinking reduced their weekly alcohol consumption from 20 units to 14
units on average. Similarly, students who imagined required strategies for non-drinking reduced the
frequency of binge drinking episodes – classified as six or more units in one session for women, and eight or more units for men – from 1.05 episodes a week to 0.73 episodes a week on average.
Interestingly, the research indicates that perceptions of non-drinkers were also more favourable
after taking part in the study. Dr Conroy says this could not be directly linked to the intervention
but was an interesting additional feature of the study. He says: “Studies have suggested that holding
negative views of non-drinkers may be closely linked to personal drinking behaviour and we were
interested to see in the current study that these views may have improved as a result of taking
part in a non-drinking exercise. “I think this shows that health campaigns need to be targeted
and easy to …t into daily life but also help support people to accomplish changes in behaviour that
might sometimes involve ‘going against the grain’, such as periodically not drinking even when in
the company of other people who are drinking.”
Plan: Recruit 211 university students aged 18-25 in the United Kingdom and assign the students to one of the four mental exercises. (The article in the British Journal of
Health Psychology indicated that academic departments across English universities
were asked to forward a pre-prepared recruitment message to their students containing
a URL to an online survey.) Collect information from the students via online surveys
at various points including how much alcohol they had drunk the previous week.
Data: The data collected included which mental exercise group the student was in and information about their alcohol consumption in the week before they completed the various online surveys.
Conclusion: The study found that completing mental exercises relating to non-
drinking was more effective in encouraging safer drinking behaviour than completing
a drinks diary alone.
Note that in the Problem step, we describe what we are trying to learn or what
questions we want to answer. The Plan step describes how the data are to be measured
and collected. In the Data step, the Plan is executed. The Analysis step corresponds
to what many people think Statistics is all about. We carry out both simple and complex
calculations to process the data into information. Finally, in the Conclusion step, we answer
the questions formulated at the Problem step.
PPDAC can be used in two ways - first to actively formulate, plan and carry out investigations and second as a framework to critically scrutinize reported empirical investigations. These reports include articles in the popular press (as in the above example), scientific
papers, government policy statements and various business reports. If you see the phrase
“evidence based decision” or “evidence based management”, look for an empirical study.
To discuss the steps of PPDAC in more detail we need to introduce a number of technical
terms. Every subject has its own jargon, i.e. words with special meaning, and you need to
learn the terms describing the details of PPDAC to be successful in this course.
1. Problem
The elements of the Problem address questions starting with “What”
Types of Problems
Three common types of statistical problems that are encountered are described below.
“Does taking a low dose of aspirin reduce the risk of heart disease among men over the
age of 50?”
“Does changing from assignments to multiple term tests improve student learning in
STAT 231?”
“Does second-hand smoke from parents cause asthma in their children?”
“Does compulsory driver training reduce the incidence of accidents among new drivers?”
Predictive: The problem is to predict the response of a variate for a given unit. This is often the case in finance or in economics. For example, financial institutions need to predict the price of a stock or interest rates in a week or a month because this affects the value of their investments.
In the second type of problem, the experimenter is interested in whether one variate
x tends to cause an increase or a decrease in another variate Y . Where possible this is
conducted in a controlled experiment in which x is increased or decreased while holding
everything else in the experiment constant and we observe the changes in Y: As indicated
in Chapter 1, an experiment in which the experimenter manipulates the values of the ex-
planatory variates is referred to as an experimental study. On the other hand in the study
of whether second-hand smoke causes asthma, it is unlikely that the experimenter would
be able to manipulate the explanatory variate and so the experimenter needs to rely on
a potentially less informative observational study, one that depends on data that is col-
lected without the ability to control explanatory variates. We will see in Chapter 8 how
an empirical study must be carefully designed in order to answer such causative questions.
Important considerations in an observational study are the design of the survey and ques-
tionnaire, who to ask, what to ask, how many to ask, where to sample etc.
Definition 14 The target population or process is the collection of units to which the experimenters conducting the empirical study wish the conclusions to apply.
For each teenager (unit) in the target population, the variate of primary interest is
whether or not the teenager smokes. Other variates of interest de…ned for each unit might
be age and sex. In the can-filling example, the volume of liquid in each can is a variate. The machine that filled the can is another variate. A key point to notice is that the values
of the variates change from unit to unit in the population. There are usually many variates
associated with each unit. At this stage, we will be interested in only those that help specify
the questions of interest.
We specify the questions of interest in the Problem in terms of attributes of the target population. In the smoking example, one important attribute is the proportion of teenagers in the target population who smoke. In the can-filling example, the attributes of interest were the average volume and the variability of the volumes for all cans filled by each machine under current conditions. Possible questions of interest (among others) are:
“What proportion of teenagers in Ontario smoke?”
“Is the standard deviation of volumes of cans filled by the new machine less than that of the old machine?”
We can also ask questions about graphical attributes of the target population such as
the population histogram or a scatterplot of one variate versus another over the whole
population.
It is very important that the Problem step contain clear questions about one or more
attributes of the target population.
2. Plan
In most cases, we cannot calculate the attributes of interest for the target population directly
because we can only examine a subset of the units in the target population. This may be
due to lack of resources and time, as in the smoking survey or a physical impossibility as
in the can-filling study where we can only look at cans available now and not in the future. Or, in an even more difficult situation, we may be forced to carry out a clinical trial using
mice because it is unethical to use humans and so we do not examine any units in the target
population. Obviously there will be uncertainty in our answers. The purpose of the Plan
step is to decide what units we will examine (the sample), what data we will collect and
how we will do so. The Plan depends on the questions posed in the Problem step.
Definition 17 The study population or study process is the collection of units available to be included in the study.
Often the study population is a subset of the target population (as in the teenage
smoking survey). However, in many medical applications, the study population consists of
laboratory animals whereas the target population consists of humans. In this case the units
in the study population are laboratory animals and the units in the target population are
humans. In the development of new products, we may want to draw conclusions about a
production process in the future but we can only look at units produced in a laboratory
in a pilot process. In this case, the study units are not part of the target population. In
many surveys, the study population is a list of people de…ned by their telephone number.
The sample is selected by calling a subset of the telephone numbers. Therefore the study
population excludes those people without telephones or with unlisted numbers.
The study population is often not identical to the target population.
Definition 18 If the attributes in the study population differ from the attributes in the target population then the difference is called study error.
We cannot quantify study error but must rely on context experts to know, for example,
that conclusions from an investigation using mice will be relevant to the human target
population. We can however warn the context experts of the possibility of such error,
especially when the study population is very different from the target population.
Definition 19 The sampling protocol is the procedure used to select a sample of units from the study population. The number of units sampled is called the sample size.
In Chapter 2, we discussed modeling the data and often claimed that we had a “random sample” so that our model was simple. In practice, it is exceedingly difficult and expensive to select a random sample of units from the study population and so other less rigorous
methods are used. Often we “take what we can get”. Sample size is usually driven by
economics or availability. We will show in later chapters how we can use the model to help
with sample size determination.
Definition 20 If the attributes in the sample differ from the attributes in the study population the difference is called sample error or sampling error.
Even with random sampling, we are looking at only a subset of the units in the study
population. Differing sampling protocols are likely to produce different sample errors. Also,
since we do not know the values of the study population attributes, we cannot know the
sampling error. However, we can use the model to get an idea of how large this error might
be. These ideas are discussed in Chapter 4.
We must decide which variates we are going to measure or determine for the units in
the sample. For any attributes of interest, as defined in the Problem step, we will certainly measure the corresponding variates for the units in the sample. As we shall see, we may also decide to measure other variates that can aid the analysis. In the smoking survey, we will try to determine whether each teenager in the sample smokes or not (this requires a careful definition) and also many demographic variates such as age and sex so that we
can compare the smoking rate across age groups, sex etc. In experimental studies, the
experimenters assign the value of a variate to each unit in the sample. For example, in a
clinical trial, sampled units can be assigned to the treatment group or the placebo group
by the experimenters. When the value of a variate is determined for a given unit, errors
are often introduced by the measurement system which determines the value.
Definition 21 If the measured value and the true value of a variate are not identical the difference is called measurement error.
Measurement errors are usually unknown. In practice, we need to ensure that the
measurement systems used do not contribute substantial error to the conclusions. We may
have to study the measurement systems which are used in separate studies to ensure that
this is so.
The figure below shows the steps in the Plan and the sources of error:
Target Population
   ↓  (study error)
Study Population
   ↓  (sample error)
Sample
   ↓  (measurement error)
Measured variate values
A person using PPDAC for an empirical study should, by the end of the Plan step, have
a good understanding of the study population, the sampling protocol, the variates which
are to be measured, and the quality of the measurement systems that are intended for use.
In this course you will most often use PPDAC to critically examine the Conclusions from
a study done by someone else. You should examine each step in the Plan (you may have
to ask to see the Plan since many reports omit it) for strengths and weaknesses. You must
also pay attention to the various types of error that may occur and how they might impact
the conclusions.
3. Data
The object of the Data step is to collect the data according to the Plan. Any deviations
from the Plan should be noted. The data must be stored in a way that facilitates the
Analysis.
The previous sections noted the need to define variates clearly and to have satisfactory methods of measuring them. It is difficult to discuss the Data step except in the context of specific examples, but we mention a few relevant points.
Mistakes can occur in recording or entering data into a data base. For complex
investigations, it is useful to put checks in place to avoid these mistakes. For example,
if a field is missed, the data base should prompt the data entry person to complete
the record if possible.
In many studies the units must be tracked and measured over a long period of time
(e.g. consider a study examining the ability of aspirin to reduce strokes in which
persons are followed for 3 to 5 years). This requires careful management.
When data are recorded over time or in di¤erent locations, the time and place for
each measurement should be recorded.
There may be departures from the study Plan that arise over time (e.g. persons may
drop out of a long term medical study because of adverse reactions to a treatment; it
may take longer than anticipated to collect the data so the number of units sampled
must be reduced). Departures from the Plan should be recorded since they may have
an important impact on the Analysis and Conclusion.
In some studies the amount of data may be extremely large, so data base design and
management is important.
4. Analysis
In Chapter 1 we discussed different methods of summarizing the data using numerical and graphical summaries. A key step in formal analyses is the selection of an appropriate model that can describe the data and how it was collected. In Chapter 2 we discussed methods for checking the fit of the model. We also need to describe the Problem in terms of the
model parameters and properties. You will see many more formal analyses in subsequent
chapters.
5. Conclusions
The purpose of the Conclusion step is to answer the questions posed in the Problem. In
other words, the Conclusion is directed by the Problem. An attempt should be made
Footnote 18: http://www.youtube.com/watch?v=0A7ojjsmSsY
Footnote 19: http://www.cbc.ca/news/technology/story/2010/10/20/long-form-census-world-statistics-day.html
to quantify (or at least discuss) potential errors as described in the Plan step and any
limitations to the conclusions.
Background
An automatic in-line gauge measures the diameter of a crankshaft journal on 100% of
the 500 parts produced per shift. The measurement system does not involve an operator
directly except for calibration and maintenance. Figure 3.1 shows the diameter in question.
The journal is a “cylindrical” part of the crankshaft. The diameter of the journal must
be defined since the cross-section of the journal is not perfectly round and there may be
taper along the axis of the cylinder. The gauge measures the maximum diameter as the
crankshaft is rotated at a fixed distance from the end of the cylinder.
The specification for the diameter is −10 to +10 units with a target of 0. The measurements
are re-scaled automatically by the gauge to make it easier to see deviations from
the target. If the measured diameter is less than −10, the crankshaft is scrapped and a
cost is incurred. If the diameter exceeds +10, the crankshaft can be reworked, again at
considerable cost. Otherwise, the crankshaft is judged acceptable.
Overall Project
A project is planned to reduce scrap/rework by reducing part-to-part variation in the
diameter. A first step involves an investigation of the measurement system itself. There
is some speculation that the measurement system contributes substantially to the overall
process variation and that bias in the measurement system is resulting in the scrapping
and reworking of good parts. To decide if the measurement system is making a substantial
contribution to the overall process variability, we also need a measure of this attribute for
the current and future population of crankshafts. Since there are three different attributes
of interest, it is convenient to split the project into three separate applications of PPDAC.
Study 1
In this application of PPDAC, we estimate the properties of the errors produced by the
measurement system. In terms of the model, we will estimate the bias and variability due
to the measurement system. We hope that these estimates can be used to predict the future
performance of the system.
Problem
The target process is all future measurements made by the gauge on crankshafts to be
produced. The response variate is the measured diameter associated with each unit. The
attributes of interest are the average measurement error and the population standard de-
viation of these errors. We can quantify these concepts using a model (see below). A
detailed fishbone diagram for the measurement system is also shown in Figure 3.2. In such
a diagram, we list explanatory variates organized by the major “bones” that might be re-
sponsible for variation in the response variate, here the measured journal diameter. We can
use the diagram in formulating the Plan.
Note that the measurement system includes the gauge itself, the way the part is loaded
into the gauge, who loads the part, the calibration procedure (every two hours, a master
part is put through the gauge and adjustments are made based on the measured diameter
of the master part; that is “the gauge is zeroed”), and so on.
Plan
To determine the properties of the measurement errors we must measure crankshafts with
known diameters. “Known” implies that the diameters were measured by an off-line mea-
surement system that is very reliable. For any measurement system study in which bias is
an issue, there must be a reference measurement system which is known to have negligible
bias and variability which is much smaller than the system under study.
There are many issues in establishing a study process or a study population. For con-
venience, we want to conduct the study quickly using only a few parts. However, this
restriction may lead to study error if the bias and variability of the measurement system
change as other explanatory variates change over time or parts.
Figure 3.2: Fishbone diagram for the measurement system (major bones: Gauge, Measurement, Calibration, Operator, Environment)
We guard against this latter possibility by using three crankshafts with known diameters as part of the definition
of the study process. Since the units are the taking of measurements, we define the study
population as all measurements that can be taken in one day on the three selected crank-
shafts. These crankshafts were selected so that the known diameters were spread out over
the range of diameters Normally seen. This will allow us see if the attributes of the system
depend on the size of the diameter being measured. The known diameters which were used
were: 10, 0, and +10: Remember the diameters have been rescaled so that a diameter of
10 is okay.
No other explanatory variates were measured. To de…ne the sampling protocol, it
was proposed to measure the three crankshafts ten times each in a random order. Each
measurement involved the loading of the crankshaft into the gauge. Note that this was to
be done quickly to avoid delay of production of the crankshafts. The whole procedure took
only a few minutes.
The preparation for the data collection was very simple. One operator was instructed
to follow the sampling protocol and write down the measured diameters in the order that
they were collected.
Data
The repeated measurements on the three crankshafts are shown below. Note that due to
poor explanation of the sampling protocol, the operator measured each part ten times in
a row and did not use a random ordering. (Unfortunately non-adherence to the sampling
protocol often happens when real data are collected and it is important to consider the
effects of this in the Analysis and Conclusion steps.)
Analysis
A reasonable model for these data is
Yij = µi + Rij where Rij ~ G(0, σm) independently,    (3.1)
where i = 1, 2, 3 indexes the three crankshafts and j = 1, ..., 10 indexes the ten repeated
measurements. The parameter µi represents the long term average measurement for crankshaft i.
The random variables Rij (called the residuals) represent the variability of the
measurement system, while σm quantifies this variability. Note that we have assumed, for
simplicity, that the variability σm is the same for all three crankshafts in the study.
We can rewrite the model in terms of the random variables Yij so that Yij ~ G(µi, σm).
Now we can write the likelihood as in Example 2.3.2 and maximize it with respect to the
four parameters µ1, µ2, µ3, and σm (the trick is to solve ∂ℓ/∂µi = 0, i = 1, 2, 3 first). Not
surprisingly the maximum likelihood estimates for µ1, µ2, µ3 are the sample averages for
each crankshaft so that
µ̂i = ȳi = (1/10) Σ_{j=1}^{10} yij for i = 1, 2, 3.
To examine the assumption that σm is the same for all three crankshafts we can calculate
the sample standard deviation for each of the three crankshafts. Let
si = √[(1/9) Σ_{j=1}^{10} (yij − ȳi)²] for i = 1, 2, 3.
                 ȳi       si
Crankshaft 1   −10.3     1.49
Crankshaft 2     0.6     1.17
Crankshaft 3    10.3     1.42
The estimate of the bias for crankshaft 1 is the difference between the observed average
ȳ1 and the known diameter value, which is equal to −10 for crankshaft 1; that is, the
estimated bias is −10.3 − (−10) = −0.3. For crankshafts 2 and 3 the estimated biases are
0.6 − 0 = 0.6 and 10.3 − 10 = 0.3 respectively, so the estimated biases in this study are all
small.
Note that the sample standard deviations s1, s2, s3 are all about the same size and
our assumption about a common value seems reasonable. (Note: it is possible to test this
assumption more formally.) An estimate of σm is given by
sm = √[(s1² + s2² + s3²)/3] = 1.37.
Note that this estimate is not the average of the three sample standard deviations but the
square root of the average of the three sample variances. (Why does this estimate make
sense? Is it the maximum likelihood estimate of σm? What if the number of measurements
for each crankshaft were not equal?)
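The pooled estimate sm can be reproduced with a few lines of R. This is a minimal sketch using only the three sample standard deviations quoted in the table above (the 30 raw measurements are not listed in these notes); the object names are arbitrary.
s <- c(1.49, 1.17, 1.42)            # sample standard deviations for the three crankshafts
s_m <- sqrt(mean(s^2))              # square root of the average of the three sample variances
s_m                                 # approximately 1.37
# If the raw measurements were available as a 3 x 10 matrix y (one row per crankshaft),
# the table entries could be obtained with apply(y, 1, mean) and apply(y, 1, sd).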
Conclusion
The observed biases −0.3, 0.6, 0.3 appear to be small, especially when measured against
the estimate of σm, and there is no apparent dependence of bias on crankshaft diameter.
To interpret the variability, we can use the model (3.1). Recall that if Yij ~ G(µi, σm)
then
P(µi − 2σm ≤ Yij ≤ µi + 2σm) = 0.95.
Therefore if we repeatedly measure the same journal diameter, then about 95% of the time
we would expect to see the observations vary by about ±2(1.37) = ±2.74.
There are several limitations to these conclusions. Because we have carried out the
study on one day only and used only three crankshafts, the conclusion may not apply to
all future measurements (study error). The fact that the measurements were taken within
a few minutes on one day might be misleading if something special was happening at that
time (sampling error). Since the measurements were not taken in random order, another
source of sampling error is the possible drift of the gauge over time.
We could recommend that, if the study were to be repeated, more than three known-
value crankshafts could be used, that the time frame for taking the measurements could be
extended and that more measurements be taken on each crankshaft. Of course, we would
also note that these recommendations would add to the cost and complexity of the study.
We would also insist that the operator be better informed about the Plan.
Study 2
The second study is designed to estimate the overall population standard deviation of the
diameters of current and future crankshafts (the target population). We need to estimate
this attribute to determine what variation is due to the process and what is due to the mea-
surement system. A cause-and-effect or fishbone diagram listing some possible explanatory
variates for the variability in journal diameter is given in Figure 3.3. Note that there are
many explanatory variates other than the measurement system. Variability in the response
variate is induced by changes in the explanatory variates, including those associated with
the measurement system.
Figure 3.3: Fishbone diagram of explanatory variates for journal diameter (major bones: Method, Machine, Measurement, Environment, Material, Operator)
Plan
The study population is de…ned as those crankshafts available over the next week, about
7500 parts (500 per shift times 15 shifts). No other explanatory variates were measured.
Initially it was proposed to select a sample of 150 parts over the week (ten from each
shift). However, when it was learned that the gauge software stores the measurements for
the most recent 2000 crankshafts measured, it was decided to select a point in time near
the end of the week and use the 2000 measured values from the gauge memory to be the
sample. One could easily criticize this choice (sampling error), but the data were easily
available and inexpensive.
Data
The individual observed measurements are too numerous to list but a histogram of the data
is shown in Figure 3.4. From this, we can see that the measured diameters vary from −14
to +16.
Analysis
A reasonable model for these data is Yi = µ + Ri, where Ri ~ G(0, σ), i = 1, ..., 2000,
where Yi represents the distribution of the measurement of the ith diameter, µ represents
the study population mean diameter and the residual Ri represents the variability due to
sampling and the measurement system. We let σ quantify this variability. We have not
included a bias term in the model because we assume, based on our results from Study 1,
Figure 3.4: Histogram of 2000 measured values from the gauge memory
that the measurement system bias is small. As well we assume that the sampling protocol
does not contribute substantial bias.
The histogram of the 2000 measured diameters shows that there is considerable spread in
the measured diameters. About 4.2% of the parts require reworking and 1.8% are scrapped.
The shape of the histogram is approximately symmetrical and centred close to zero. The
sample mean is
ȳ = (1/2000) Σ_{i=1}^{2000} yi = 0.82
which gives us an estimate of µ (the maximum likelihood estimate) and the sample standard
deviation is
s = √[(1/1999) Σ_{i=1}^{2000} (yi − ȳ)²] = 5.17
which gives us an estimate of σ (not quite the maximum likelihood estimate).
Conclusion
The overall process variation is estimated by s. Since the sample contained 2000 parts
measured consecutively, many of the explanatory variates did not have time to change as
they would in the study population. Thus, there is a danger of sampling error producing
an estimate of the variation that is too small.
The variability due to the measurement system, estimated to be 1.37 in Study 1, is much
less than the overall variability, which is estimated to be 5.17. One way to compare the two
standard deviations σm and σ is to separate the total variability σ into the variability due
to the measurement system σm and that due to all other sources. In other words, we are
interested in estimating the variability that would be present if there were no variability
in the measurement system (σm = 0). If we assume that the total variability arises from
two independent sources, the measurement system and all other sources, then we have
σ² = σm² + σp², or
σp² = σ² − σm²,
where σp quantifies the variability due to all other uncontrollable variates (sampling variability).
An estimate of σp is given by
√(s² − sm²) = √[(5.17)² − (1.37)²] = 4.99.
Hence, eliminating all of the variability due to the measurement system would produce an
estimated variability of 4.99, which is a small reduction from 5.17. The measurement system
seems to be performing well and not contributing substantially to the overall variation.
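A minimal R sketch of this variance decomposition, using the two estimated standard deviations from Studies 1 and 2:
s_total <- 5.17                     # overall standard deviation estimated in Study 2
s_m <- 1.37                         # measurement system standard deviation from Study 1
s_p <- sqrt(s_total^2 - s_m^2)      # estimated variability from all other sources
s_p                                 # approximately 4.99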
Comments
Study 3 revealed that the measurement system had a serious long term problem. At first,
it was suspected that the cause of the variability was the fact that the gauge was not
calibrated over the course of the study. Study 3 was repeated with a calibration before
each measurement. A pattern similar to that for Study 3 was seen. A detailed examination
of the gauge by a repairperson from the manufacturer revealed that one of the electronic
components was not working properly. This was repaired and Study 3 was repeated. This
study showed variation similar to the variation of the short term study (Study 1) so that
the overall project could continue. When Study 2 was repeated, the overall variation and
the number of scrap and reworked crankshafts was substantially reduced. The project was
considered complete and long term monitoring showed that the scrap rate was reduced to
about 0.7% which produced an annual savings of more than $100,000.
As well, three similar gauges that were used in the factory were put through the “long
term” test. All were working well.
Summary
An important part of any Plan is the choice and assessment of the measurement
system.
The measurement system may contribute substantial error that can result in poor
decisions (e.g. scrapping good parts, accepting bad parts).
We represent systematic measurement error by bias in the model. The bias can be
assessed only by measuring units with known values, taken from another reference
measurement system. The bias may be constant or depend on the size of the unit
being measured, the person making the measurements, and so on.
Variability can be assessed by repeatedly measuring the same unit. The variability
may depend on the unit being measured or any other explanatory variates.
Both bias and variability may be a function of time. This can be assessed by examining
these attributes over a sufficiently long time span as in Study 3.
3.4 Chapter 3 Problems
                         Support Party
                         YES     NO
 Plan to Vote   YES      351     381
                NO       107     265
(a) De…ne the Problem for this study. What type of Problem is this and why?
(b) What is the target population?
(c) Identify the variates and their types for this study.
(d) What is the study population?
(e) What is the sample?
(f) Describe one possible source of study error.
(g) Describe one possible source of sampling error.
(h) There are two attributes of interest in the target population. In each case,
describe the attribute and provide an estimate based on the given data.
2. U.S. to fund study of Ontario math curriculum, Globe & Mail, January 17,
2014, Caroline Alphonso - Education Reporter (article has been condensed)
The U.S. Department of Education has funded a $2.7-million (U.S.) project, led by
a team of Canadian researchers at Toronto’s Hospital for Sick Children. The study
will look at how elementary students at several Ontario schools fare in math using
the current provincial curriculum as compared to the JUMP math program, which
combines the conventional way of learning the subject with so-called discovery learn-
ing. Math teaching has come under scrutiny since OECD results that measured the
scholastic abilities of 15-year-olds in 65 countries showed an increasing percentage of
Canadian students failing the math test in nearly all provinces. Dr. Tracy Solomon
and her team are collecting and analyzing two years of data on students in primary
and junior grades from one school board, which she declined to name. The students
were in Grades 2 and 5 when the study began, and are now in Grades 3 and 6, which
means they will participate in Ontario’s standardized testing program this year. The
research team randomly assigned some schools to teach math according to the Ontario
curriculum, which allows open-ended student investigations and problem-solving. The
other schools are using the JUMP program. Dr. Solomon said the research team is
using classroom testing data, lab tests on how children learn and other measures to
study the impact of the two programs on student learning.
Answer the questions below based on this article.
3. Playing racing games may encourage risky driving, study …nds, Globe &
Mail, January 8, 2015 (article has been condensed)
Playing an intense racing game makes players more likely to take risks such as speed-
ing, passing on the wrong side, running red lights or using a cellphone in a simulated
driving task shortly afterwards, according to a new study. Young adults with more
adventurous personalities were more inclined to take risks, and more intense games
led to greater risk-taking, the authors write in the journal Injury Prevention. Other
research has found a connection between racing games and inclination to risk-taking
while driving, so the new results broaden that evidence base, said lead author of the
new study, Mingming Deng of the School of Management at Xi’an Jiaotong University
in Xi’an, China. “I think racing gamers should be [paying] more attention in their
real driving,” Deng said.
The researchers recruited 40 student volunteers at Xi’an Jiaotong University, mostly
men, for the study. The students took personality tests at the start and were divided
randomly into two groups. Half of the students played a circuit-racing-type driving
game that included time trials on a race course similar to Formula 1 racing, for about
20 minutes, while the other group played computer solitaire, a neutral game for com-
parison. After a five-minute break, all the students took the Vienna Risk-Taking Test,
viewing 24 “risky” videotaped road-traffic situations on a computer screen presented
from the driver’s perspective, including driving up to a railway crossing whose gate
has already started lowering. How long the viewer waits to hit the “stop” key for
the manoeuvre is considered a measure of their willingness to take risks on the road.
Students who had been playing the racing game waited an average of almost 12 sec-
onds to hit the stop button compared with 10 seconds for the solitaire group. The
participants’ experience playing these types of games outside of the study did not
seem to make a di¤erence.
Answer the questions below based on this article.
4. Suppose you wish to study the smoking habits of teenagers and young adults, in order
to understand what personal factors are related to whether, and how much, a person
smokes. Briefly describe the main components of such a study, using the PPDAC
framework. Be specific about the target and study population, the sample, and the
variates you would collect.
5. Suppose you wanted to study the relationship between a person’s “resting” pulse rate
(heart beats per minute) and the amount and type of exercise they get.
(a) List some factors (including exercise) that might affect resting pulse rate. You
may wish to draw a cause and effect (fishbone) diagram to represent potential
causal factors.
(b) Describe briefly how you might study the relationship between pulse rate and
exercise using (i) an observational study, and (ii) an experimental study.
6. A large company uses photocopiers leased from two suppliers A and B. The lease
rates are slightly lower for B’s machines but there is a perception among workers
that they break down and cause disruptions in work flow substantially more often.
Describe briefly how you might design and carry out a study of this issue, with the
ultimate objective being a decision whether to continue the lease with company B.
What additional factors might affect this decision?
7. For a study like the one in Example 1.3.1, where heights x and weights y of individuals
are to be recorded, discuss sources of variability due to the measurement of x and y
on any individual.
4. ESTIMATION
(1) Where do we get our probability model? What if it is not a good description of the
population or process?
We discussed the …rst question in Chapters 1 and 2. It is important to check the
adequacy (or “fit”) of the model; some ways of doing this were discussed in Chapter
2 and more formal methods will be considered in Chapter 7. If the model used is not
satisfactory, we may not be able to use the estimates based on it. For the lifetimes of
brake pads data introduced in Example 1.3.3, a Gaussian model does not appear to
be suitable (see Chapter 2, Problem 11).
(2) The estimation of parameters or population attributes depends on data collected from
the population or process, and the likelihood function is based on the probability of
the observed data. This implies that factors associated with the selection of sample
units or the measurement of variates (e.g. measurement error) must be included in
the model. In many examples it is assumed that the variate of interest is measured
without error for a random sample of units from the population. We will typically
assume that the data come from a random sample of population units, but in any
given application we would need to design the data collection plan to ensure this
assumption is valid.
(3) Suppose in the model chosen the population mean is represented by the parameter
µ. The sample mean ȳ is an estimate of µ, but not usually equal to it. How far away
from µ is ȳ likely to be? If we take a sample of only n = 50 units, would we expect
the estimate ȳ to be as “good” as ȳ based on 150 units? (What does “good” mean?)
We focus on the third point in this chapter and assume that we can deal with the first
two points with the methods discussed in Chapters 1 and 2.
θ̂ = g(y1, ..., yn).    (4.1)
For example
θ̂ = ȳ = (1/n) Σ_{i=1}^{n} yi
is a point estimate of θ if y1, ..., yn is an observed random sample from a Poisson distribution
with mean θ.
The method of maximum likelihood provides a general method for obtaining estimates,
but other methods exist. For example, if θ = E(Y) = µ is the average (mean) value of y
in the population, then the sample mean µ̂ = ȳ is an intuitively sensible estimate; it is the
maximum likelihood estimate of µ if Y has a G(µ, σ) distribution but because of the Central
Limit Theorem it is a good estimate of µ more generally. Thus, while we will use maximum
likelihood estimation a great deal, you should remember that the discussion below applies
to estimates of any type.
The problem facing us in this chapter is how to determine or quantify the uncertainty
in an estimate. We do this using sampling distributions 20 , which are based on the following
idea. If we select random samples on repeated occasions, then the estimates θ̂ obtained from
the different samples will vary. For example, five separate random samples of n = 50 persons
from the same male population described in Example 1.3.1 gave five different estimates
µ̂ = ȳ of µ = E(Y) as:
1.723   1.743   1.734   1.752   1.736.
Estimates vary as we take repeated samples and therefore we associate a random variable
and a distribution with these estimates.
20
See the video at www.watstat.ca called “What is a sampling distribution?”
More precisely, we define this idea as follows. Let the random variables Y1, ..., Yn
represent the observations in a random sample, and associate with the estimate θ̂ given by
(4.1) a random variable
θ̃ = g(Y1, ..., Yn).
The random variable θ̃ = g(Y1, ..., Yn) is simply a rule that tells us how to process the
data to obtain a numerical value θ̂ = g(y1, ..., yn), which is an estimate of the unknown
parameter θ for a given data set y1, ..., yn. For example
θ̃ = Ȳ = (1/n) Σ_{i=1}^{n} Yi.
Example 4.2.1
Suppose we want to estimate the mean µ = E(Y) of a random variable, and that
a Gaussian distribution Y ~ G(µ, σ) describes variation in Y in the population. Let
Y1, ..., Yn represent a random sample from the population, and consider the estimator
µ̃ = Ȳ = (1/n) Σ_{i=1}^{n} Yi.
That is, we want to find the probability that |µ̃ − µ| is no more than 0.01 meters. Assuming
σ = s = 0.07 (meters), (4.2) gives the following results for sample sizes n = 50 and n = 100:
This indicates that a larger sample is “better” in the sense that the probability is higher
that µ̃ will be within 0.01 m of the true (and unknown) average height in the population.
It also allows us to express the uncertainty in an estimate µ̂ = ȳ from an observed sample
y1, ..., yn by indicating the probability that any single random sample will give an estimate
within a certain distance of µ.
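The two probabilities referred to above are easy to compute. Here is a minimal R sketch, assuming Ȳ ~ G(µ, σ/√n) with σ = 0.07, so that P(|Ȳ − µ| ≤ 0.01) = 2Φ(0.01√n/0.07) − 1:
sigma <- 0.07
for (n in c(50, 100)) {
  p <- 2 * pnorm(0.01 / (sigma / sqrt(n))) - 1
  cat("n =", n, " P(|ybar - mu| <= 0.01) =", round(p, 2), "\n")
}
# roughly 0.69 for n = 50 and 0.85 for n = 100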
Example 4.2.2
In Example 4.2.1 we were able to determine the distribution of the estimator exactly,
using properties of Gaussian random variables. Often we are not able to do this and
in this case we could use simulation to study the distribution21. For example, suppose we
have a random sample y1, ..., yn which we have assumed comes from an Exponential(θ)
distribution. The maximum likelihood estimate of θ is θ̂ = ȳ. What is the sampling
distribution for θ̃ = Ȳ? We can examine the sampling distribution by using simulation.
This involves taking repeated samples, y1, ..., yn, giving (possibly different) values of ȳ for
each sample as follows:
1. Generate a random sample y1, ..., yn of size n from the Exponential(θ) distribution.
2. Compute θ̂ = ȳ from the sample. In R this is done using the statement ybar<-mean(y).
Repeat these two steps k times. The k values ȳ1, ..., ȳk can then be considered as a
sample from the distribution of θ̃, and we can study the distribution by plotting a histogram
of the values.
The histogram in Figure 4.1 was obtained by drawing k = 10000 samples of size n = 15
from an Exponential(10) distribution, calculating the values ȳ1, ..., ȳ10000 and then plotting
the relative frequency histogram. What do you notice about the distribution particularly
with respect to symmetry? Does the distribution look like a Gaussian distribution?
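The simulation can be carried out in a few lines of R. This sketch uses rexp (with rate 1/θ so the mean is θ = 10) and replicate; it is one natural implementation of the two steps, not necessarily the code used to produce Figure 4.1.
set.seed(123)                                          # any seed, for reproducibility
k <- 10000; n <- 15; theta <- 10
ybar <- replicate(k, mean(rexp(n, rate = 1/theta)))    # k simulated sample means
hist(ybar, freq = FALSE, xlab = "sample mean")         # compare with Figure 4.1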
The approach illustrated in the preceding example can be used more generally. The
main idea is that, for a given estimator θ̃, we need to determine its sampling distribution
in order to be able to compute probabilities of the form P(|θ̃ − θ| ≤ d) so that we can
quantify the uncertainty of the estimate.
21
This approach can also be used to study sampling from a finite population of N values, {y1, ..., yN},
where we might not want to use a continuous probability distribution for Y.
Figure 4.1: Relative frequency histogram of means from 10000 samples of size 15
from an Exponential(10) distribution
The estimates and estimators we have discussed so far are often referred to as point es-
timates and point estimators. This is because they consist of a single value or “point”. The
discussion of sampling distributions shows how to address the uncertainty in an estimate.
We also usually prefer to indicate explicitly the uncertainty in the estimate. This leads to
the concept of an interval estimate 22 , which takes the form
[L(y), U(y)]
where L(y) and U(y) are functions of the observed data y. Notice that this provides an interval
with endpoints L and U both of which depend on the data. If we let L(Y) and U(Y)
represent the associated random variables then [L(Y), U(Y)] is a random interval. If we
were to draw many random samples from the same population and each time we constructed
the interval [L(y), U(y)], how often would the statement θ ∈ [L(y), U(y)] be true? The
probability that the parameter θ falls in this random interval is P[L(Y) ≤ θ ≤ U(Y)] and
hopefully this probability is large. This probability gives an indication how good the rule is
by which the interval estimate was obtained. For example if P[L(Y) ≤ θ ≤ U(Y)] = 0.95
then this means that 95% of the time (that is, for 95% of the different samples we might
draw), the true value of the parameter falls in the interval [L(y), U(y)] constructed from
the data set y. This means we can be reasonably safe in assuming, on this occasion, and
for this data set, it does so. In general, uncertainty in an estimate is explicitly stated by
giving the interval estimate along with the probability P(θ ∈ [L(Y), U(Y)]).
22
See the video What is a confidence interval? at watstat.ca
Definition 23 Suppose θ is scalar and that some observed data (say a random sample
y1, ..., yn) have given a likelihood function L(θ). The relative likelihood function R(θ) is
defined as
R(θ) = L(θ)/L(θ̂) for θ ∈ Ω
where θ̂ is the maximum likelihood estimate and Ω is the parameter space. Note that
0 ≤ R(θ) ≤ 1 for all θ ∈ Ω.
Figure 4.2: Relative likelihood function and log relative likelihood function for a
Binomial model
Figure 4.2 shows the relative likelihood functions R(θ) for two polls:
Poll 1: n = 200, y = 80
Poll 2: n = 1000, y = 400.
In each case θ̂ = 0.40, but the relative likelihood function is more “concentrated” around θ̂
for the larger poll (Poll 2). The 10% likelihood intervals also reflect this:
Table 4.1 gives rough guidelines for interpreting likelihood intervals. These are only
guidelines for this course. The interpretation of a likelihood interval must always be made
in the context of a given study.
The one apparent shortcoming of likelihood intervals so far is that we do not know how
probable it is that a given interval will contain the true parameter value. As a result we
also do not have a basis for the choice of p. Sometimes it is argued that values like p = 0.10
or p = 0.05 make sense because they rule out parameter values for which the probability
of the observed data is less than 1/10 or 1/20 of the probability when θ = θ̂. However, a
more satisfying approach is to apply the sampling distribution ideas in Section 4.2 to the
interval estimates. This leads to the concept of confidence intervals, which we describe next.
In Section 4.6 we revisit likelihood intervals and show that they are also confidence intervals.
The idea of a likelihood interval for a parameter θ can also be extended to the case of
a vector of parameters θ. In this case R(θ) ≥ p gives likelihood “regions” for θ.
23
C(θ) is called the coverage probability for the interval estimator [L(Y), U(Y)].
A few words are in order about the meaning of the probability statement in (4.3). The
parameter θ is an unknown fixed constant associated with the population. It is not a
random variable and therefore does not have a distribution. The statement (4.3) can be
interpreted in the following way. Suppose we were about to draw a random sample of the
same size from the same population and the true value of the parameter was θ. Suppose
also that we knew that we would construct an interval of the form [L(y), U(y)] once we
had collected the data. Then the probability that θ will be contained in this new interval
is C(θ)24.
How then does C(θ) assist in the evaluation of interval estimates? In practice, we try
to find intervals for which C(θ) is fairly close to 1 (values 0.90, 0.95 and 0.99 are often
used) while keeping the interval fairly narrow. Such interval estimates are called confidence
intervals.
If p = 0.95, for example, then (4.4) indicates that 95% of the samples that we would
draw from this model result in an interval which includes the true value of the parameter
(and of course 5% do not). This gives us some confidence that for a particular sample, such
as the one at hand, the true value of the parameter is contained in the interval.
The following example illustrates that the confidence coefficient sometimes does not
depend on the unknown parameter θ.
24
When we use the observed data y, L(y) and U(y) are numerical values, not random variables. We do
not know whether θ ∈ [L(y), U(y)] or not. P[L(y) ≤ θ ≤ U(y)] makes no more sense than P(1 ≤ 2 ≤ 3)
since L(y), θ, U(y) are all numerical values: there is no random variable to which the probability statement
can refer.
25
See the video at www.watstat.com called “What is a confidence interval”. See also the Java applet
http://www.math.uah.edu/stat/applets/MeanEstimateExperiment.html
where Ȳ = (1/n) Σ_{i=1}^{n} Yi is the sample mean. Since Ȳ ~ G(µ, 1/√n), then
P(Ȳ − 1.96/√n ≤ µ ≤ Ȳ + 1.96/√n)
= P(−1.96 ≤ √n(Ȳ − µ) ≤ 1.96)
= P(−1.96 ≤ Z ≤ 1.96)
= 0.95
where Z ~ G(0, 1). Thus the interval [ȳ − 1.96/√n, ȳ + 1.96/√n] is a 95% confidence
interval for the unknown mean µ. This is an example in which the confidence coefficient
does not depend on the unknown parameter, an extremely desirable feature of an interval
estimator.
We repeat the very important interpretation of a 95% confidence interval (since so many
people get the interpretation incorrect!). Suppose the experiment which was used to estimate
µ was conducted a large number of times and each time a 95% confidence interval
for µ was constructed using the observed data and the interval [ȳ − 1.96/√n, ȳ + 1.96/√n].
Then, approximately 95% of these constructed intervals would contain the true, but unknown
value of µ. Since we only have one interval [ȳ − 1.96/√n, ȳ + 1.96/√n] we do not
know whether it contains the true value of µ or not. We can only say that we are 95%
confident that the given interval [ȳ − 1.96/√n, ȳ + 1.96/√n] contains the true value of µ
since we are told it is a 95% confidence interval. In other words, we hope we were one of
the “lucky” 95% who constructed an interval containing the true value of µ. Warning:
You cannot say that the probability that the interval [ȳ − 1.96/√n, ȳ + 1.96/√n] contains
the true value of µ is 0.95!!!
If in Example 4.4.1 a particular sample of size n = 16 had observed mean ȳ = 10.4, then
the observed 95% confidence interval would be [ȳ − 1.96/4, ȳ + 1.96/4], or [9.91, 10.89]. We
cannot say that the probability that µ ∈ [9.91, 10.89] is 0.95. We can only say that
we are 95% confident that the interval [9.91, 10.89] contains µ.
Confidence intervals become narrower as the size of the sample on which they are based
increases. For example, note the effect of n in Example 4.4.1. The width of the confidence
interval is 2(1.96)/√n, which decreases as n increases. We noted this earlier for likelihood
intervals, and we will show in Section 4.6 that likelihood intervals are a type of confidence
interval.
Recall that the coverage probability for the interval in the above example did not depend
on the unknown parameter, a highly desirable property because we’d like to know the
coverage probability while not knowing the value of the unknown parameter. We next
consider a general method for finding confidence intervals which have this property.
Pivotal Quantities
De…nition 28 A pivotal quantity Q = Q(Y; ) is a function of the data Y and the un-
known parameter such that the distribution of the random variable Q is fully known. That
is, probability statements such as P (Q a) and P (Q b) depend on a and b but not on
or any other unknown information.
We now describe how a pivotal quantity can be used to construct a confidence interval.
We begin with the statement P[a ≤ Q(Y, θ) ≤ b] = p where Q(Y, θ) is a pivotal quantity
whose distribution is completely known. Suppose that we can re-express the inequality
a ≤ Q(Y, θ) ≤ b in the form L(Y) ≤ θ ≤ U(Y) for some functions L and U. Then since
P[L(Y) ≤ θ ≤ U(Y)] = p,
the interval [L(y), U(y)] is a 100p% confidence interval for θ. The confidence coefficient
for the interval [L(y), U(y)] is equal to p, which does not depend on θ. The confidence
coefficient does depend on a and b, but these are determined by the known distribution of
Q(Y, θ).
Example 4.4.2 Confidence interval for the mean of a Gaussian distribution with
known standard deviation
Suppose Y = (Y1, ..., Yn) is a random sample from the G(µ, σ0) distribution where
E(Yi) = µ is unknown but sd(Yi) = σ0 is known. Since
Q = Q(Y, µ) = (Ȳ − µ)/(σ0/√n) ~ G(0, 1)
and G(0, 1) is a completely known distribution, Q is a pivotal quantity. To obtain a 95%
confidence interval for µ we need to find values a and b such that P(a ≤ Q(Y, µ) ≤ b) = 0.95.
Now
0.95 = P(a ≤ (Ȳ − µ)/(σ0/√n) ≤ b)
     = P(Ȳ − bσ0/√n ≤ µ ≤ Ȳ − aσ0/√n),
so that
[ȳ − bσ0/√n, ȳ − aσ0/√n]
is a 95% confidence interval for µ based on the observed data y = (y1, ..., yn). Note that
there are infinitely many pairs (a, b) giving P(a ≤ Q ≤ b) = 0.95. A common choice for the
Gaussian distribution is to pick points symmetric about zero, a = −1.96, b = 1.96. This
results in the interval [ȳ − 1.96σ0/√n, ȳ + 1.96σ0/√n] or ȳ ± 1.96σ0/√n, which turns out
to be the narrowest possible 95% confidence interval.
The interval [ȳ − 1.96σ0/√n, ȳ + 1.96σ0/√n] is often referred to as a two-sided confidence
interval. Note that this interval takes the form
estimate ± 1.96 × (standard deviation of the estimator).
Many two-sided confidence intervals we will encounter in this course will take a similar form.
Another choice for a and b would be a = −∞, b = 1.645, which gives the interval
[ȳ − 1.645σ0/√n, ∞). The interval [ȳ − 1.645σ0/√n, ∞) is usually referred to as a one-sided
confidence interval. This type of interval is useful when we are interested in determining a
lower bound on the value of µ.
It turns out that for most distributions it is not possible to find exact pivotal quantities
or confidence intervals for θ whose coverage probabilities do not depend somewhat on the
true value of θ. However, in general we can find quantities Qn = Qn(Y1, ..., Yn; θ) such that
as n → ∞, the distribution of Qn ceases to depend on θ or other unknown information. We
then say that Qn is asymptotically pivotal, and in practice we treat Qn as a pivotal quantity
for sufficiently large values of n; more accurately, we call Qn an approximate pivotal quantity.
where θ̃ = Y/n, is also close to G(0, 1) for large n. Thus Qn can be used as an approximate
pivotal quantity to construct confidence intervals for θ. For example,
θ̂ ± 1.96 √[θ̂(1 − θ̂)/n]    (4.5)
gives an approximate 95% confidence interval for θ, where θ̂ = y/n and y is the observed
data.
As a numerical example, suppose we observed n = 100, y = 18. Then (4.5) gives
0.18 ± 1.96 [0.18(0.82)/100]^(1/2) or [0.105, 0.255].
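A minimal R sketch of this calculation based on (4.5):
n <- 100; y <- 18
thetahat <- y / n
thetahat + c(-1.96, 1.96) * sqrt(thetahat * (1 - thetahat) / n)   # approximately [0.105, 0.255]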
Remark: It is important to understand that confidence intervals may vary a great deal
when we take repeated samples. For example, in Example 4.4.3, ten samples of size n = 100
which were simulated for a population with θ = 0.25 gave the following approximate 95%
confidence intervals for θ:
[0.20, 0.38]  [0.14, 0.31]  [0.23, 0.42]  [0.22, 0.41]  [0.18, 0.36]
[0.14, 0.31]  [0.10, 0.26]  [0.21, 0.40]  [0.15, 0.33]  [0.19, 0.37]
For larger samples (larger n), the confidence intervals are narrower and will have better
agreement. For example, try generating a few samples of size n = 1000 and compare the
confidence intervals for θ.
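Here is one way the suggested experiment could be carried out in R; this sketch simulates five samples of size n = 1000 from a population with θ = 0.25 (rbinom generates the number of successes) and prints the approximate 95% confidence interval for each.
set.seed(1)
n <- 1000; theta <- 0.25
for (i in 1:5) {
  thetahat <- rbinom(1, n, theta) / n                            # simulated sample proportion
  print(round(thetahat + c(-1.96, 1.96) * sqrt(thetahat * (1 - thetahat) / n), 3))
}
# the intervals are noticeably narrower than those for n = 100 and agree more closely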
We have seen that confidence intervals for a parameter tend to get narrower as the sample
size n increases. When designing a study we often decide how large a sample to collect
on the basis of (i) how narrow we would like confidence intervals to be, and (ii) how much
we can afford to spend (it costs time and money to collect data). The following example
illustrates the procedure.
which was introduced in Example 4.4.3 and which has approximately a G(0, 1) distribution,
to obtain confidence intervals for θ. Here is a criterion that is widely used for choosing the
size of n: choose n large enough so that the width of a 95% confidence interval for θ is no
wider than 2(0.03). Let us see where this leads and why this rule is used.
From Example 4.4.3, we know that
θ̂ ± 1.96 √[θ̂(1 − θ̂)/n]
is an approximate 0.95 confidence interval for θ and that the width of this interval is
2(1.96) √[θ̂(1 − θ̂)/n].
To make this confidence interval narrower than 2(0.03) (or even narrower, say 2(0.025)),
we need n large enough so that
1.96 √[θ̂(1 − θ̂)/n] ≤ 0.03
or
n ≥ (1.96/0.03)² θ̂(1 − θ̂).
Of course we don’t know what θ̂ is because we have not taken a sample, but we note that
the worst case scenario occurs when θ̂ = 0.5. So to be conservative, we find n such that
n ≥ (1.96/0.03)² (0.5)² ≈ 1067.1.
Thus, choosing n = 1068 (or larger) will result in an approximate 95% confidence interval
of the form θ̂ ± c, where c ≤ 0.03. If you look or listen carefully when polling results are
announced, you’ll often hear words like “this poll is accurate to within 3 percentage points
19 times out of 20.” What this really means is that the estimator θ̃ (which is usually given
in percentage form) approximately satisfies P(|θ̃ − θ| ≤ 0.03) = 0.95, or equivalently, that
the actual estimate θ̂ is the centre of an approximate 95% confidence interval θ̂ ± c, for
which c ≤ 0.03. In practice, many polls are based on 1050–1100 people, giving “accuracy
to within 3 percent” with probability 0.95. Of course, one needs to be able to afford to
collect a sample of this size. If we were satisfied with an accuracy of 5 percent, then we’d
only need n = 385 (show this). In many situations however this might not be sufficiently
accurate for the purpose of the study.
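These sample size calculations are easily reproduced in R. The helper function below is hypothetical (not from any package); it uses the conservative bound θ̂(1 − θ̂) ≤ 0.25.
n_for_halfwidth <- function(w, z = 1.96) ceiling((z / w)^2 * 0.25)
n_for_halfwidth(0.03)    # 1068 (from 1067.1)
n_for_halfwidth(0.05)    # 385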
Exercise: Show that to ensure that the width of the approximate 95% confidence interval
is 2(0.02) = 0.04 or smaller, you need n = 2401. What should n be to make the width of a
99% confidence interval 2(0.02) = 0.04 or less?
Remark: Very large Binomial polls (n ≥ 2000) are not done very often. Although we can
in theory estimate θ very precisely with an extremely large poll, there are two problems:
2. In many settings the value of θ fluctuates over time. A poll is at best a snapshot at
one point in time.
As a result, the “real” accuracy of a poll cannot generally be made arbitrarily high.
Sample sizes can be similarly determined so as to give confidence intervals of some
desired length in other settings. We consider this topic again in Section 4.7 for the G(µ, σ)
distribution.
Conducting a complete census is usually costly and time-consuming. This example il-
lustrates how a random sample, which is less expensive, can be used to obtain “good”
information about the attributes of interest for a population.
Suppose interviewers are hired at $20 per hour to conduct door to door interviews of
adults in a municipality of 50,000 households. There are two choices:
(1) conduct a census, that is, interview a member of every household in the municipality;
(2) take a random sample of households in the municipality and then interview a member
of each household.
If a random sample is used it is estimated that each interview will take approximately
20 minutes (travel time plus interview time). If a census is used it is estimated that each
interview will take approximately 10 minutes since there is less travel time. We can
summarize the costs and precision one would obtain for one question on the form which
asks whether a person agrees/disagrees with a statement about the funding levels for higher
education. Let θ be the proportion in the population who agree. Suppose we decide that a
“good” estimate of θ is one that is accurate to within 2% of the true value 95% of the time.
For a census, six interviews can be completed in one hour. At $20 per hour the interviewer
cost for the census is approximately
(50000/6) × $20 ≈ $166,667
since there are 50,000 households.
For a random sample, three interviews can be completed in one hour. An approximate
95% confidence interval for θ of the form θ̂ ± 0.02 requires n = 2401. The cost of the random
sample of size n = 2401 is
(2401/3) × $20 ≈ $16,000
as compared to $166,667 for the census, more than ten times the cost of the random
sample!
Of course, we have also not compared the costs of processing 50,000 versus 2401 surveys
but it is obvious again that the random sample will be less costly and time consuming.
The χ² (Chi-squared) Distribution
To define the Chi-squared distribution we first recall the Gamma function and its properties:
Γ(α) = ∫₀^∞ y^(α−1) e^(−y) dy for α > 0.
k is referred to as the “degrees of freedom” (d.f.) parameter. In Figure 4.3 you see the
characteristic shapes of the Chi-squared probability density functions. For k = 2, the
probability density function is the Exponential(2) probability density function. For k > 2,
the probability density function is unimodal with maximum value at x = k − 2. For values
of k > 30, the probability density function resembles that of a N(k, 2k) probability density
function.
Figure 4.3: Chi-squared probability density functions for df = 1, 2, 4, and 8
The cumulative distribution function, F(x; k), can be given in closed algebraic form for
even values of k. In R the functions dchisq(x, k) and pchisq(x, k) give the probability density
function f(x; k) and cumulative distribution function F(x; k) for the χ²(k) distribution.
A table with selected values is given at the end of these course notes.
If X ~ χ²(k) then
E(X) = k and Var(X) = 2k.
This result follows by first showing that
E(X^j) = 2^j Γ(k/2 + j) / Γ(k/2) for j = 1, 2, ....
This is true since
E(X^j) = ∫₀^∞ x^j [1/(2^(k/2) Γ(k/2))] x^(k/2 − 1) e^(−x/2) dx
       = [1/(2^(k/2) Γ(k/2))] ∫₀^∞ x^(k/2 + j − 1) e^(−x/2) dx     (let y = x/2 or x = 2y)
       = [1/(2^(k/2) Γ(k/2))] ∫₀^∞ (2y)^(k/2 + j − 1) e^(−y) 2 dy
       = [2^j / Γ(k/2)] ∫₀^∞ y^(k/2 + j − 1) e^(−y) dy
       = 2^j Γ(k/2 + j) / Γ(k/2).
Letting j = 1 we obtain E(X) = 2 Γ(k/2 + 1)/Γ(k/2) = 2(k/2) = k. Letting j = 2 we obtain
E(X²) = 2² Γ(k/2 + 2)/Γ(k/2) = 4(k/2 + 1)(k/2) = k² + 2k, and therefore
Var(X) = E(X²) − [E(X)]² = 2k.
Proof. Suppose W = Z² where Z ~ G(0, 1). Let Φ represent the cumulative distribution
function of a G(0, 1) random variable and let φ represent the probability density function of
a G(0, 1) random variable. Then
P(W ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w) for w > 0,
and so the probability density function of W is
(d/dw)[Φ(√w) − Φ(−√w)] = [φ(√w) + φ(−√w)] (1/2) w^(−1/2)
                        = (1/√(2π)) w^(−1/2) e^(−w/2) for w > 0,
which is the χ²(1) probability density function.
Proof. Since Zi ~ G(0, 1), then by Theorem 30, Zi² ~ χ²(1) and the result follows by
Theorem 29.
Student’s t Distribution
Student’s t distribution (or more simply the t distribution) has probability density function
f(t; k) = ck (1 + t²/k)^(−(k+1)/2) for t ∈ ℝ and k = 1, 2, ...
The parameter k is called the degrees of freedom. We write T ~ t(k) to indicate that
the random variable T has a Student t distribution with k degrees of freedom. In Figure
4.4 the probability density function f(t; k) for k = 2 is plotted together with the G(0, 1)
probability density function.
Obviously the t probability density function is similar to that of the G (0; 1) distribution
in several respects: it is symmetric about the origin, it is unimodal, and indeed for large
values of k, the graph of the probability density function f (t; k) is indistinguishable from
that of the G(0, 1) probability density function. The primary difference, for small k such
as the one plotted, is in the tails of the distribution. The t probability density function has
Figure 4.4: Probability density functions for t (2) distribution (dashed red ) and
G (0; 1) distribution (solid blue)
fatter “tails” or more area in the extreme left and right tails. Problem 22 at the end of this
chapter considers some properties of f (x; k).
Probabilities for the t distribution are available from tables at the end of these notes26 or
computer software. In R, the cumulative distribution function F(t; k) = P(T ≤ t; k), where
T ~ t(k), is obtained using pt(t,k). For example, pt(1.5,10) gives P(T ≤ 1.5; 10) = 0.918.
The t distribution arises as a result of the following theorem involving the ratio of a
N (0; 1) random variable and an independent Chi-squared random variable. We will not
attempt to prove this theorem here.
Define the random variable
Λ(θ) = −2 log[L(θ)/L(θ̃)],
where θ̃ is the maximum likelihood estimator. The random variable Λ(θ) is called the
likelihood ratio statistic. The following theorem implies that Λ(θ) is an asymptotic pivotal
quantity.
This theorem means that Λ(θ) can be used as a pivotal quantity for sufficiently large n
in order to obtain approximate confidence intervals for θ. More importantly we can use this
result to show that the likelihood intervals discussed in Section 4.3 are also approximate
confidence intervals.
By Theorem 33 the confidence coefficient for this interval can be approximated by
as required.
Conversely, Theorem 33 can also be used to find an approximate 100p% likelihood based
confidence interval.
p = 2P(Z ≤ a) − 1 where Z ~ N(0, 1),
then the likelihood interval {θ : R(θ) ≥ e^(−a²/2)} is an approximate 100p% confidence interval.
Proof. The confidence coefficient corresponding to the likelihood interval {θ : R(θ) ≥ e^(−a²/2)}
is
P[L(θ)/L(θ̃) ≥ e^(−a²/2)] = P(−2 log[L(θ)/L(θ̃)] ≤ a²)
                          ≈ P(W ≤ a²) where W ~ χ²(1), by Theorem 33
                          = 2P(Z ≤ a) − 1 where Z ~ N(0, 1)
                          = p
as required.
Example:
Since
0.95 = 2P(Z ≤ 1.96) − 1 where Z ~ N(0, 1)
and
e^(−(1.96)²/2) = e^(−1.9208) ≈ 0.1465 ≈ 0.15,
a 15% likelihood interval for θ is also an approximate 95% confidence interval for θ.
Exercise: Show that a 26% likelihood interval is an approximate 90% confidence interval
and a 4% likelihood interval is an approximate 99% confidence interval.
Suppose the observed data were n = 100 and y = 40 so that θ̂ = 40/100 = 0.4. From the
graph of the relative likelihood function given in Figure 4.5 we can read off the 15% likelihood
interval, which is [0.31, 0.495]; this is also an approximate 95% confidence interval.
Figure 4.5: Relative likelihood function for Binomial with n = 100 and y = 40
Alternatively, the approximate 95% confidence interval (4.8), θ̂ ± 1.96 √[θ̂(1 − θ̂)/n], gives
the interval [0.304, 0.496]. The two intervals differ slightly (they are both based
on approximations) but are very close.
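A sketch of how both intervals can be computed in R for n = 100 and y = 40: the 15% likelihood interval is found numerically with uniroot applied to R(θ) − 0.15 on either side of θ̂, and the symmetric interval follows from (4.8). (Reading values off the graph, as done above, gives essentially the same answers.)
n <- 100; y <- 40; thetahat <- y / n
R <- function(theta) (theta / thetahat)^y * ((1 - theta) / (1 - thetahat))^(n - y)
c(uniroot(function(t) R(t) - 0.15, c(0.01, thetahat))$root,
  uniroot(function(t) R(t) - 0.15, c(thetahat, 0.99))$root)        # roughly [0.31, 0.50]
thetahat + c(-1.96, 1.96) * sqrt(thetahat * (1 - thetahat) / n)    # [0.304, 0.496]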
Suppose n = 30 and θ̂ = 0.1. From the graph of the relative likelihood function given
in Figure 4.6 we can read off the 15% likelihood interval, which is [0.03, 0.24]; this is also
an approximate 95% confidence interval.
Figure 4.6: Relative likelihood function for Binomial with n = 30 and θ̂ = 0.1
Alternatively, the interval (4.8) is 0.1 ± 1.96 √[0.1(0.9)/30], which gives the interval
[−0.0074, 0.2074]. This is quite different from the likelihood based
approximate confidence interval and also contains negative values for θ. Of course θ
can only take on values between 0 and 1. This happens because the confidence interval in
(4.8) is always symmetric about θ̂, and if θ̂ is close to 0 or 1 and n is not very large then
the interval can contain values less than 0 or bigger than 1. The likelihood
interval in Figure 4.6 is not symmetric about θ̂. In this case the 15% likelihood interval is a
better summary of the values which are supported by the data. If θ̂ is close to 0.5 or n is
large then the likelihood interval will be fairly symmetric and there will be little difference
in the two approximate confidence intervals, as we saw in the previous example in which n
was equal to 100 and θ̂ was equal to 0.4.
which differs from σ̃² only by the choice of denominator. Indeed if n is large there is very
little difference between S² and σ̃². Note that the sample variance has the advantage that
it is an “unbiased” estimator, that is, E(S²) = σ². This follows since
E[(Yi − µ)²] = Var(Yi) = σ², E[(Ȳ − µ)²] = Var(Ȳ) = σ²/n
and
E(S²) = [1/(n − 1)] E[Σ_{i=1}^{n} (Yi − Ȳ)²]
      = [1/(n − 1)] E[Σ_{i=1}^{n} (Yi − µ)² − n(Ȳ − µ)²]
      = [1/(n − 1)] [Σ_{i=1}^{n} E(Yi − µ)² − n E(Ȳ − µ)²]
      = [1/(n − 1)] [nσ² − n(σ²/n)] = [1/(n − 1)] (n − 1)σ²
      = σ².
T = (Ȳ − µ)/(S/√n)    (4.10)
Since S, unlike σ, is a random variable in (4.10), the distribution of T is no longer G(0, 1).
The random variable T actually has a t distribution, which was introduced in Section 4.5:
T = (Ȳ − µ)/(S/√n) ~ t(n − 1).    (4.11)
To see why, let Z = √n(Ȳ − µ)/σ, which has a G(0, 1) distribution, and
U = (n − 1)S²/σ².
We choose this function of S² since it can be shown that U ~ χ²(n − 1). It can also be
shown that Z and U are independent random variables27. By Theorem 36 with k = n − 1,
we have
Z/√(U/k) = [√n(Ȳ − µ)/σ] / √(S²/σ²) = (Ȳ − µ)/(S/√n) ~ t(n − 1).
In other words if we replace σ in the pivotal quantity (4.9) by its estimator S, the distribution
of the resulting pivotal quantity has a t(n − 1) distribution rather than a G(0, 1)
distribution. The degrees of freedom are inherited from the degrees of freedom of the Chi-squared
random variable U or from S².
27
The proof of the remarkable result that, for a random sample from a Normal distribution, the sample
mean and the sample variance are independent random variables, is beyond the scope of this course.
We now show how to use the t distribution to obtain a confidence interval for µ when
σ is unknown. Since (4.11) has a t distribution with n − 1 degrees of freedom, which is
a completely known distribution, we can use this pivotal quantity to construct a 100p%
confidence interval for µ. Since the t distribution is symmetric, we determine the constant
a such that P(−a ≤ T ≤ a) = p using the t tables provided in these course notes or R.
Note that, due to symmetry, P(−a ≤ T ≤ a) = p is equivalent to P(T ≤ a) = (1 + p)/2
(you should verify this) and since the t tables tabulate the cumulative distribution function
P(T ≤ t), it is easier to find a such that P(T ≤ a) = (1 + p)/2. Then since
p = P(−a ≤ T ≤ a)
  = P(−a ≤ (Ȳ − µ)/(S/√n) ≤ a)
  = P(Ȳ − aS/√n ≤ µ ≤ Ȳ + aS/√n),
the interval [ȳ − as/√n, ȳ + as/√n] is a 100p% confidence interval for µ.
(Note that if we attempted to use (4.9) to build a confidence interval we would have two
unknowns in the inequality since both µ and σ are unknown.) As usual the method used
to construct this interval implies that 100p% of the confidence intervals constructed from
samples drawn from this population contain the true value of µ.
We note that this interval is of the form ȳ ± as/√n.
Recall that a confidence interval for µ in the case of a G(µ, σ) population when σ is known
has a similar form, ȳ ± aσ/√n,
except that the standard deviation of the estimator is known in this case and the value of
a is taken from a G(0, 1) distribution rather than the t distribution.
103, 115, 97, 101, 100, 108, 111, 91, 119, 101
with Σ_{i=1}^{10} yi = 1046 and Σ_{i=1}^{10} yi² = 110072.
We wish to use these data to estimate the parameter µ which represents the mean test
score for ten year old children at this school. Since P(T ≤ 2.262) = 0.975 for T ~ t(9),
a 95% confidence interval for µ is ȳ ± 2.262 s/√10.
For the given data ȳ = 104.6 and s = 8.57, so the confidence interval is 104.6 ± 6.13 or
[98.47, 110.73].
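The same interval can be obtained directly from the ten observations in R:
y <- c(103, 115, 97, 101, 100, 108, 111, 91, 119, 101)
n <- length(y)
a <- qt(0.975, n - 1)                        # 2.262
mean(y) + c(-a, a) * sd(y) / sqrt(n)         # approximately [98.47, 110.73]
# equivalently, t.test(y)$conf.int gives the same 95% confidence interval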
Behaviour as n → ∞
As n increases, confidence intervals behave in a largely predictable fashion. First, the
estimated standard deviation s gets closer to the true standard deviation σ. Second, as
the degrees of freedom increase, the t distribution approaches the Gaussian so that the
quantiles of the t distribution approach those of the G(0, 1) distribution. For example, if
in Example 4.7.1 we knew that σ = 8.57 then we would use the 95% confidence interval
ȳ ± 1.96(8.57)/√n instead of ȳ ± 2.262(8.57)/√n with n = 10. In general for large n, the
width of the confidence interval gets narrower as n increases (but at the rate 1/√n) so the
confidence intervals shrink to include only the point ȳ.
We would usually choose n a little larger than this formula gives to accommodate the fact
that we used Normal quantiles rather than the quantiles of the t distribution which are
larger in value.
While we will not prove this result, we should at least try to explain the puzzling number
of degrees of freedom n − 1, which, at first glance, seems wrong since Σ_{i=1}^{n} (Yi − Ȳ)² is the
sum of n squared Normal random variables. Does this contradict Corollary 31? It is true
that each Wi = Yi − Ȳ is a Normally distributed random variable. However Wi does not
have a N(0, 1) distribution and, more importantly, the Wi's are not independent! (See
Problem 17.) It is easy to see that W1, W2, ..., Wn are not independent random variables
since Σ_{i=1}^{n} Wi = 0 implies Wn = −Σ_{i=1}^{n−1} Wi, so the last term can be determined using the sum
of the first n − 1 terms. Therefore in the sum Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} Wi² there are really only
n − 1 terms that are linearly independent or “free”. This is an intuitive explanation for
the n − 1 degrees of freedom both of the Chi-squared and of the t distribution. In both
cases, the degrees of freedom are inherited from S² and are related to the dimension of the
subspace inhabited by the terms in the sum for S², that is, Wi = Yi − Ȳ, i = 1, ..., n.
We will now show how we can use Theorem 37 to construct a 100p% confidence interval for the parameter σ² or σ. First note that (4.13) is a pivotal quantity since its distribution is completely known. Using Chi-squared tables or R we can find constants a and b such that

P(a ≤ U ≤ b) = p

where U ~ χ²(n − 1). Since

p = P(a ≤ U ≤ b)
  = P(a ≤ (n − 1)S²/σ² ≤ b)
  = P((n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a)
  = P(√[(n − 1)S²/b] ≤ σ ≤ √[(n − 1)S²/a])

a 100p% confidence interval for σ is

[√((n − 1)s²/b), √((n − 1)s²/a)].   (4.15)

As usual the choice for a, b is not unique. For convenience, a and b are usually chosen such that

P(U ≤ a) = P(U > b) = (1 − p)/2   (4.16)

where U ~ χ²(n − 1). Note that since the Chi-squared tables provided in these course notes tabulate the cumulative distribution function P(U ≤ u), this means using the tables to find a and b such that

P(U ≤ a) = (1 − p)/2  and  P(U ≤ b) = p + (1 − p)/2 = (1 + p)/2.
The intervals (4.14) and (4.15) are called equal-tailed confidence intervals. The choice (4.16) for a, b does not give the narrowest confidence interval. The narrowest interval must be found numerically. For large n the equal-tailed interval and the narrowest interval are nearly the same.
Note that, unlike confidence intervals for μ, the confidence interval for σ² is not symmetric about s², the estimate of σ². This happens of course because the χ²(n − 1) distribution is not a symmetric distribution.
In some applications we are interested in an upper bound on σ (because small σ is "good" in some sense). In this case we take b = ∞ and find a such that P(a ≤ U) = p or P(U ≤ a) = 1 − p, so that a one-sided 100p% confidence interval for σ is [0, √((n − 1)s²/a)].
so a = 5.63 and b = 26.12. Substituting these values along with (14)s² = 0.002347 into (4.15) we obtain

[√(0.002347/26.12), √(0.002347/5.63)] = [0.0095, 0.0204]
and the value 0.02 is not in the interval. Why are the intervals different? Both cover the true value of the parameter σ for 95% of all samples so they have the same confidence coefficient. However the one-sided interval, since it allows smaller (as small as zero) values on the left end of the interval, can achieve the same coverage with a smaller right end-point. If our primary concern was for values of σ being too large, that is, for an upper bound for the interval, then the one-sided interval is the one that should be used for this purpose.
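A short R sketch of these calculations, assuming (as in the example above) n = 15 observations with (n − 1)s² = 0.002347:

    n <- 15
    ss <- 0.002347                    # (n - 1) * s^2 from the data
    # equal-tailed two-sided 95% interval for sigma, using (4.15) and (4.16)
    a <- qchisq(0.025, df = n - 1)    # 5.63
    b <- qchisq(0.975, df = n - 1)    # 26.12
    c(sqrt(ss / b), sqrt(ss / a))     # [0.0095, 0.0204]
    # one-sided 95% interval [0, upper bound): choose a with P(U <= a) = 1 - p
    a1 <- qchisq(0.05, df = n - 1)
    sqrt(ss / a1)                     # upper bound for sigma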
Ỹ = Y − Ȳ ~ N(0, σ²(1 + 1/n)).

Also

(Y − Ȳ) / (S√(1 + 1/n)) ~ t(n − 1)
is a pivotal quantity which can be used to obtain an interval of values for Y. Let a be the value such that P(−a ≤ T ≤ a) = p or P(T ≤ a) = (1 + p)/2, which is obtained from t tables or by using R. Since

p = P(−a ≤ T ≤ a)
  = P(−a ≤ (Y − Ȳ)/(S√(1 + 1/n)) ≤ a)
  = P(Ȳ − aS√(1 + 1/n) ≤ Y ≤ Ȳ + aS√(1 + 1/n))

therefore

[ȳ − as√(1 + 1/n), ȳ + as√(1 + 1/n)]   (4.17)

is an interval of values for the future observation Y with confidence coefficient p. The interval (4.17) is called a 100p% prediction interval instead of a confidence interval since Y is not a parameter but a random variable. Note that the interval (4.17) is wider than a 100p% confidence interval for the mean μ. This makes sense since μ is an unknown constant with no variability while Y is a random variable with its own variability Var(Y) = σ².
Note that this interval is much wider than a 95% confidence interval for μ, the mean of the population of lens thicknesses produced by this manufacturing process, which is given by

[25.009 − 2.1448(0.013)/√15, 25.009 + 2.1448(0.013)/√15]
  = [25.009 − 0.007, 25.009 + 0.007]
  = [25.002, 25.016].
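A minimal R sketch comparing the two intervals, assuming (as in this example) n = 15 lens thicknesses with ȳ = 25.009 and s = 0.013:

    n <- 15; ybar <- 25.009; s <- 0.013
    a <- qt(0.975, df = n - 1)                                      # 2.1448
    c(ybar - a * s / sqrt(n), ybar + a * s / sqrt(n))               # 95% confidence interval for mu
    c(ybar - a * s * sqrt(1 + 1/n), ybar + a * s * sqrt(1 + 1/n))   # 95% prediction interval for Y

The prediction interval is wider than the confidence interval by a factor of √(n + 1), which is substantial even for moderate n.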
4.8 A Case Study: Testing Reliability of Computer Power Supplies (this section may be omitted)
Table 4.2: Lifetimes (in hours) from an accelerated life test experiment on PC power supplies, by temperature

70°C   60°C   50°C   40°C
2 1 55 78
5 20 139 211
9 40 206 297
10 47 263 556
10 56 347 600
11 58 402 600
64 63 410 600
66 88 563 600
69 92 600 600
70 103 600 600
71 108 600 600
73 125 600 600
75 155 600 600
77 177 600 600
97 209 600 600
103 224 600 600
115 295 600 600
130 298 600 600
131 352 600 600
134 392 600 600
145 441 600 600
181 489 600 600
242 600 600 600
263 600 600 600
283 600 600 600
Notes: Lifetimes are given in ascending order; entries of 600 denote censored observations.
For the censored observations we only know that the lifetime is greater than 600. Since

P(Y ≥ 600; θ) = ∫₆₀₀^∞ (1/θ) e^(−y/θ) dy = e^(−600/θ)

the contribution to the likelihood function of each observation censored at 600 is e^(−600/θ).
L(θ) = [∏ᵢ₌₁²² (1/θ) e^(−yᵢ/θ)] × [∏ᵢ₌₂₃²⁵ e^(−600/θ)] = θ^(−k) e^(−s/θ)

where k = 22 = the number of uncensored observations and s = Σᵢ₌₁²⁵ yᵢ = the sum of all lifetimes and censored times.

Question 1  Show that the maximum likelihood estimate of θ is given by θ̂ = s/k and thus θ̂₄₀ = s/k.
Question 2  Assuming that the Exponential model is correct, the likelihood function for θₜ, t = 40, 50, 60, 70 can be obtained using the method above and is given by

L(θₜ) = θₜ^(−kₜ) e^(−sₜ/θₜ).

Suppose that

θₜ = exp(α + β/(t + 273.2))   (4.18)
where t is the temperature in degrees Celsius and α and β are parameters. Plot the points (log θ̂ₜ, (t + 273.2)⁻¹) for t = 40, 50, 60, 70. If the model is correct, why should these points lie roughly along a straight line? Do they? (See the R sketch below.)

Using the graph give rough point estimates of α and β. Extrapolate the line or use your estimates of α and β to estimate θ₂₀, the mean lifetime at t = 20°C, which is the normal operating temperature.
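The following R sketch shows one way these calculations might be done, assuming every entry of 600 in Table 4.2 is a censored observation (the variable names are ours):

    # lifetimes from Table 4.2; 600 denotes a censored observation
    y70 <- c(2, 5, 9, 10, 10, 11, 64, 66, 69, 70, 71, 73, 75, 77, 97,
             103, 115, 130, 131, 134, 145, 181, 242, 263, 283)
    y60 <- c(1, 20, 40, 47, 56, 58, 63, 88, 92, 103, 108, 125, 155, 177,
             209, 224, 295, 298, 352, 392, 441, 489, rep(600, 3))
    y50 <- c(55, 139, 206, 263, 347, 402, 410, 563, rep(600, 17))
    y40 <- c(78, 211, 297, 556, rep(600, 21))
    # maximum likelihood estimate s/k for Exponential data censored at 600
    thetahat <- function(y) sum(y) / sum(y < 600)
    theta <- sapply(list(y40, y50, y60, y70), thetahat)
    temp <- c(40, 50, 60, 70)
    x <- 1 / (temp + 273.2)
    plot(x, log(theta), xlab = "1/(t + 273.2)", ylab = "log of estimated mean lifetime")
    # rough estimates of alpha (intercept) and beta (slope) from a fitted straight line
    coef(lm(log(theta) ~ x))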
Question 5  Question 4 indicates how to obtain a rough point estimate of

θ₂₀ = exp(α + β/(20 + 273.2)).
Suppose we wanted to find the maximum likelihood estimate of θ₂₀. This would require the maximum likelihood estimates of α and β, which requires the joint likelihood function of α and β. Explain why this likelihood is given by

L(α, β) = ∏ₜ θₜ^(−kₜ) e^(−sₜ/θₜ)

where θₜ is given by (4.18). (Note that the product is only over t = 40, 50, 60, 70.) Outline how you might attempt to get an interval estimate for θ₂₀ based on the likelihood function for α and β. If you obtained an interval estimate for θ₂₀, would you have any concerns about indicating to the engineers what mean lifetime could be expected at 20°C? (Explain.)
Question 6  Engineers and statisticians have to design reliability tests like the one just discussed, and considerations such as the following are often used:
Suppose that the mean lifetime at 20°C is supposed to be about 90,000 hours and that at 70°C you know from past experience that it is about 100 hours. If the model (4.18) holds, determine what α and β should be approximately, and thus what θₜ is roughly equal to at 40, 50 and 60°C. How might you use this information in deciding how long a period of time to run the life test? In particular, give the approximate expected number of uncensored lifetimes from an experiment that was terminated after 600 hours.
4.9 Chapter 4 Problems
3. Suppose that a fraction θ of a large population of persons over 18 years of age never drink alcohol. In order to estimate θ, a random sample of n persons is to be selected and the number y who do not drink determined; the maximum likelihood estimate of θ is then θ̂ = y/n. We want our estimate θ̂ to have a high probability of being close to θ, and want to know how large n should be to achieve this. Consider the random variable Y and estimator θ̃ = Y/n.

(a) Determine P(−0.03 ≤ θ̃ − θ ≤ 0.03), if n = 1000 and θ = 0.5, using the Normal approximation to the Binomial. You do not need to use a continuity correction.

(b) If θ = 0.50 determine how large n should be to ensure that
(a) Suppose a 95% confidence interval for μ, the mean time Canadians spent on the internet in this quarter, is reported to be [42.8, 47.8]. How should this interval be interpreted?

(b) Construct an approximate 95% confidence interval for the proportion of Canadians whose mobile phone is a smartphone.

(c) Since this study was conducted in March 2012 the research company has been asked to conduct a new survey to determine if the proportion of Canadians whose mobile phone is a smartphone has changed. What size sample should be used to ensure that an approximate 95% confidence interval has length less than 2(0.02)?
6. Two hundred adults are chosen at random from a population and each adult is asked
whether information about abortions should be included in high school public health
sessions. Suppose that 70% say they should.
(a) Obtain an approximate 95% confidence interval for the proportion of the population who support abortion information being included in high school public health sessions.
(b) Suppose you found out that the 200 persons interviewed consisted of 50 married
couples and 100 other persons. The 50 couples were randomly selected, as were
the other 100 persons. Discuss the validity (or non-validity) of the analysis in
(a).
7. For Chapter 2, Problem 3 (b) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.

8. For Chapter 2, Problem 5 (b) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.

9. For Chapter 2, Problem 6 (b) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of r(θ) or by using the function uniroot in R.
(a) Plot the relative likelihood function R(θ) and determine a 10% likelihood interval. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R. Is θ very accurately determined?

(b) Suppose that we can find out whether each pair of twins is identical or not, and that it is determined that of 50 pairs, 17 were identical. Obtain the likelihood function, the maximum likelihood estimate and a 10% likelihood interval for θ in this case. Plot the relative likelihood function on the same graph as the one in (a), and compare the accuracy of estimation in the two cases.

11. For Chapter 2, Problem 8 (c) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.
12. Suppose that a fraction θ of a large population of persons are infected with a certain virus. Let n and k be integers. Suppose that blood samples for nk people are to be tested to obtain information about θ. In order to save time and money, pooled testing is used, that is, samples are mixed together k at a time to give a total of n pooled
samples. A pooled sample will test negative if all k individuals in that sample are not
infected.
(a) Find the probability that y out of n samples will be negative, if the nk people are a random sample from the population. State any assumptions you make.

(b) Obtain a general expression for the maximum likelihood estimate θ̂ in terms of n, k and y.

(c) Suppose n = 100, k = 10 and y = 89. Find the maximum likelihood estimate of θ, and a 10% likelihood interval for θ.
13. A manufacturing process produces fibers of varying lengths. The length of a fiber Y is a continuous random variable with probability density function

f(y; θ) = (y/θ²) e^(−y/θ),  y ≥ 0, θ > 0
(e) Explain how you would use the statement in (c) to construct an approximate 95% confidence interval for θ.
(f) Suppose n = 18 fibers were selected at random and the lengths were:

6.19  7.92  1.23  8.13  4.29  1.04  3.67  9.87  10.34
1.41  10.76  3.69  1.34  6.80  4.21  3.44  2.51  2.08

For these data Σᵢ₌₁¹⁸ yᵢ = 88.92. Give the maximum likelihood estimate of θ and an approximate 95% confidence interval for θ using your result from (e).
14. The lifetime T (in days) of a particular type of light bulb is assumed to have a distribution with probability density function

f(t; θ) = (1/2) θ³ t² e^(−θt)  for t > 0 and θ > 0.
(a) Suppose t₁, t₂, …, tₙ is a random sample from this distribution. Find the maximum likelihood estimate θ̂ and the relative likelihood function R(θ).

(b) If n = 20 and Σᵢ₌₁²⁰ tᵢ = 996, graph R(θ) and determine the 15% likelihood interval for θ, which is also an approximate 95% confidence interval for θ. The interval can be obtained from the graph of R(θ) or by using the function uniroot in R.

(c) Suppose we wish to estimate the mean lifetime of a light bulb. Show E(T) = 3/θ. Hint: Use the Gamma function. Find an approximate 95% confidence interval for the mean.

(d) Show that the probability p that a light bulb lasts less than 50 days is

p = p(θ) = P(T ≤ 50; θ) = 1 − e^(−50θ) (1250θ² + 50θ + 1).
(a) Determine the following using χ² tables provided in the Course Notes:

(i) If X ~ χ²(10) find P(X ≤ 2.6) and P(X > 16).
(ii) If X ~ χ²(4) find P(X > 15).
(iii) If X ~ χ²(40) find P(X ≤ 24.4) and P(X ≥ 55.8). Compare these values with P(Y ≤ 24.4) and P(Y ≥ 55.8) if Y ~ N(40, 80).
(iv) If X ~ χ²(25) find a and b such that P(X ≤ a) = 0.025 and P(X > b) = 0.025.
(v) If X ~ χ²(12) find a and b such that P(X ≤ a) = 0.05 and P(X > b) = 0.05.

(b) Determine the following WITHOUT using χ² tables:

(i) If X ~ χ²(1) find P(X ≤ 2) and P(X > 1.4).
(ii) If X ~ χ²(2) find P(X ≤ 2) and P(X > 3).
(a) Show that this probability density function integrates to one for k = 1; 2; : : :
using the properties of the Gamma function.
(b) Plot the probability density function for k = 5, k = 10 and k = 25 on the same
graph. What do you notice?
(c) Show that the moment generating function of X is given by

M(t) = E(e^(tX)) = (1 − 2t)^(−k/2)  for t < 1/2

and use this to show that E(X) = k and Var(X) = 2k.
(d) Prove Theorem 29 using moment generating functions.
(a) Plot the probability density function for k = 1, 5, 25. Plot the N(0, 1) probability density function on the same graph. What do you notice?

(b) Show that f(t; k) is unimodal.

(c) Use Theorem 32 to show that E(T) = 0. Hint: If X and Y are independent random variables then E[g(X)h(Y)] = E[g(X)] E[h(Y)].

(d) Use the t tables provided in the Course Notes to answer the following:

(i) If T ~ t(10) find P(T ≤ 0.88), P(T ≥ 0.88) and P(|T| ≤ 0.88).
(ii) If T ~ t(17) find P(|T| > 2.90).
(iii) If T ~ t(30) find P(T ≤ 2.04) and P(T ≥ 0.26). Compare these values with P(Z ≤ 2.04) and P(Z ≥ 0.26) if Z ~ N(0, 1).
(iv) If T ~ t(18) find a and b such that P(T ≤ a) = 0.025 and P(T > b) = 0.025.
(v) If T ~ t(13) find a and b such that P(T ≤ a) = 0.05 and P(T > b) = 0.05.
where

cₖ = Γ((k + 1)/2) / [√(kπ) Γ(k/2)].

Show that

lim_{k→∞} f(t; k) = (1/√(2π)) exp(−t²/2)  for t ∈ ℝ

which is the probability density function of the G(0, 1) distribution. Hint: You may use the fact that lim_{k→∞} cₖ = 1/√(2π), which is a property of the Gamma function.
20. In an early study concerning survival time for patients diagnosed with Acquired Immune Deficiency Syndrome (AIDS), the survival times (i.e. times between diagnosis of AIDS and death) of 30 male patients were such that Σᵢ₌₁³⁰ yᵢ = 11,400 days.

(a) Assuming that survival times are Exponentially distributed with mean θ days, graph the relative likelihood function for these data and obtain an approximate 90% confidence interval for θ. This interval may be obtained from the graph of the relative likelihood function or by using the function uniroot in R.

(b) Show that m = θ ln 2 is the median survival time. Using the interval obtained in (a), give an approximate 90% confidence interval for m.
21.
This result implies that U is a pivotal quantity which can be used to obtain confidence intervals for θ.
(c) Refer to the data in the previous problem. Using the fact that
22. Company A leased photocopiers to the federal government, but at the end of their
recent contract the government declined to renew the arrangement and decided to
lease from a new vendor, Company B. One of the main reasons for this decision was
a perception that the reliability of Company A’s machines was poor.
(a) Over the preceding year the monthly numbers of failures requiring a service call
from Company A were
12 14 15 16 18 19 19 22 23 25 28 29
Assuming that the number of service calls needed in a one month period has a Poisson distribution with mean θ, obtain and graph the relative likelihood function R(θ) based on the data above.
(b) In the …rst year using Company B’s photocopiers, the monthly numbers of service
calls were
7 8 9 10 10 12 12 13 13 14 15 17
Under the same assumption as in part (a), obtain R(θ) for these data and graph it on the same graph as used in (a).
(c) Determine the 15% likelihood interval for θ, which is also an approximate 95% confidence interval for θ, for each company. The intervals can be obtained from the graphs of the relative likelihood functions or by using the function uniroot in R. Do you think the government's decision was a good one, as far as the reliability of the machines is concerned?

(d) What conditions would need to be satisfied to make the assumptions and analysis in (a) to (c) valid?
(e) If Y₁, …, Yₙ is a random sample from the Poisson(θ) distribution then the random variable

(Ȳ − θ) / √(Ȳ/n)

has approximately a N(0, 1) distribution. Show how this result leads to an approximate 95% confidence interval for θ given by

ȳ ± 1.96 √(ȳ/n).

Using this result, determine the approximate 95% confidence intervals for each company. Compare these intervals with the intervals obtained in (c).
23. A study on the common octopus (Octopus Vulgaris) was conducted by researchers
at the University of Vigo in Vigo, Spain. Nineteen octopi were caught in July 2008
in the Ria de Vigo (a large estuary on the northwestern coast of Spain). Several
measurements were made on each octopus including their weight in grams. These
weights are given in the table below.
680 1030 1340 1330 1260 770 830 1470 1380 1220
920 880 1020 1050 1140 960 1060 1140 860
(a) Use a qqplot to determine how reasonable the Gaussian model is for these data.
(b) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
(c) The researchers at the University of Vigo were interested in determining whether the octopi in the Ria de Vigo are healthy. For common octopi, a population mean weight of 1100 grams is considered to be a healthy population. Determine a 95% confidence interval for μ. What should the researchers conclude about the health of the octopi, in terms of weight, in the Ria de Vigo?

(d) Determine a 90% confidence interval for σ based on these data.
24. Consider the data on weights of adult males and females from Chapter 1. (The data are posted on the course webpage.)

(a) Determine whether it is reasonable to assume a Normal model for the female heights and a different Normal model for the male heights.

(b) Obtain a 95% confidence interval for the mean for the females and males separately. Does there appear to be a difference in the means for females and males? (We will see how to test this formally in Chapter 6.)

(c) Obtain a 95% confidence interval for the standard deviation for the females and males separately. Does there appear to be a difference in the standard deviations?
25. Sixteen packages are randomly selected from the production of a detergent packaging machine. Let yᵢ = weight in grams of the iᵗʰ package, i = 1, …, 16.

(a) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
26. Radon is a colourless, odourless gas that is naturally released by rocks and soils and
may concentrate in highly insulated houses. Because radon is slightly radioactive,
there is some concern that it may be a health hazard. Radon detectors are sold to
homeowners worried about this risk, but the detectors may be inaccurate. Univer-
sity researchers placed 12 detectors in a chamber where they were exposed to 105
picocuries per liter of radon over 3 days. The readings given by the detectors were:
91:9 97:8 111:4 122:3 105:4 95:0 103:8 99:6 96:6 119:3 104:8 101:7
27. A manufacturer wishes to determine the mean breaking strength (force) of a type of string to "within 0.5 kilograms", which we interpret as requiring that the 95% confidence interval for μ should have length at most 1 kilogram. If the breaking strengths Y of strings tested are G(μ, σ) and if 10 preliminary tests gave Σᵢ₌₁¹⁰ (yᵢ − ȳ)² = 45, how many additional measurements would you advise the manufacturer to take?
28. A chemist has two ways of measuring a particular quantity; one has more random error than the other. For method I, measurements X₁, X₂, …, Xₘ follow a Normal distribution with mean μ and variance σ₁², whereas for method II, measurements Y₁, Y₂, …, Yₙ have a Normal distribution with mean μ and variance σ₂².

(a) Assuming that σ₁² and σ₂² are known, find the combined likelihood function for μ based on observed data x₁, x₂, …, xₘ and y₁, y₂, …, yₙ and show that the maximum likelihood estimate of μ is

μ̂ = (w₁x̄ + w₂ȳ) / (w₁ + w₂)

where w₁ = m/σ₁² and w₂ = n/σ₂². Why does this estimate make sense?

(b) Suppose that σ₁ = 1, σ₂ = 0.5 and n = m = 10. How would you rationalize to a non-statistician why you were using the estimate (x̄ + 4ȳ)/5 instead of (x̄ + ȳ)/2?
We denote this by writing Xₙ →ᵖ c.

(a) If {Xₙ} and {Yₙ} are two sequences of random variables with Xₙ →ᵖ c₁ and Yₙ →ᵖ c₂, show that Xₙ + Yₙ →ᵖ c₁ + c₂ and XₙYₙ →ᵖ c₁c₂.

(b) Let X₁, X₂, … be independent and identically distributed random variables with probability density function f(x; θ). A point estimator θ̃ₙ based on a random sample X₁, …, Xₙ is said to be consistent for θ if θ̃ₙ →ᵖ θ as n → ∞.

(i) Let X₁, …, Xₙ be independent and identically distributed Uniform(0, θ) random variables. Show that θ̃ₙ = max(X₁, …, Xₙ) is consistent for θ.

(ii) Let X ~ Binomial(n, θ). Show that θ̃ₙ = X/n is consistent for θ.
31. Challenge Problem: Refer to the definition of consistency in Problem 27(b). Difficulties can arise when the number of parameters increases with the amount of data. Suppose that two independent measurements of blood sugar are taken on each of n individuals and consider the model

Xᵢ₁, Xᵢ₂ ~ N(μᵢ, σ²)  for i = 1, …, n

where Xᵢ₁ and Xᵢ₂ are the independent measurements. The variance σ² is to be estimated, but the μᵢ's are also unknown.

(a) Find the maximum likelihood estimator σ̃² and show that it is not consistent.

(b) Suggest an alternative way to estimate σ² by considering the differences Wᵢ = Xᵢ₁ − Xᵢ₂.
(c) What does σ represent physically if the measurements are taken very close together in time?
32. Challenge Problem: Proof of Central Limit Theorem (Special Case) Suppose Y₁, Y₂, … are independent random variables with E(Yᵢ) = μ, Var(Yᵢ) = σ² and that they have the same distribution, whose moment generating function exists.

(a) Show that (Yᵢ − μ)/σ has moment generating function of the form (1 + t²/2 + terms in t³, t⁴, …) and thus that (Yᵢ − μ)/(σ√n) has moment generating function of the form [1 + t²/(2n) + o(n)], where o(n) signifies a remainder term Rₙ with the property that Rₙ/n → 0 as n → ∞.

(b) Let

Zₙ = Σᵢ₌₁ⁿ (Yᵢ − μ)/(σ√n) = √n (Ȳ − μ)/σ

and note that its moment generating function is of the form [1 + t²/(2n) + o(n)]ⁿ. Show that as n → ∞ this approaches the limit e^(t²/2), which is the moment generating function for G(0, 1). (Hint: For any real number a, (1 + a/n)ⁿ → eᵃ as n → ∞.)
5. TESTS OF HYPOTHESES
5.1 Introduction
What does it mean to test a hypothesis in the light of observed data or information?
Suppose a statement has been formulated such as “I have extrasensory perception.” or
“This drug that I developed reduces pain better than those currently available.” and an
experiment is conducted to determine how credible the statement is in light of observed
data. How do we measure credibility? If there are two alternatives: “I have ESP.” and
“I do not have ESP.” should they both be considered a priori as equally plausible? If I
correctly guess the outcome on 53 of 100 tosses of a fair coin, would you conclude that
my gift is real since I was correct more than 50% of the time? If I develop a treatment
for pain in my basement laboratory using a mixture of seaweed and tofu, would you treat
the claims “this product is superior to aspirin”and “this product is no better than aspirin”
symmetrically?
When studying tests of hypotheses it is helpful to draw an analogy with the criminal
court system used in many places in the world, where the two hypotheses “the defendant is
innocent”and “the defendant is guilty”are not treated symmetrically. In these courts, the
court assumes a priori that the …rst hypothesis, “the defendant is innocent” is true, and
then the prosecution attempts to …nd su¢ cient evidence to show that this hypothesis of
innocence is not plausible. There is no requirement that the defendant be proved innocent.
At the end of the trial the judge or jury may conclude that there was insu¢ cient evidence
for a …nding of guilty and the defendant is then exonerated. Of course there are two types
of errors that this system can (and inevitably does) make; convict an innocent defendant or
fail to convict a guilty defendant. The two hypotheses are usually not given equal weight a
priori because these two errors have very di¤erent consequences.
Statistical tests of hypotheses are analogous to this legal example. We often begin by
specifying a single “default” hypothesis (“the defendant is innocent” in the legal context)
and then check whether the data collected is unlikely under this hypothesis. This default
hypothesis is often referred to as the “null”hypothesis and is denoted by H0 (“null”is used
because it often means a new treatment has no e¤ect). Of course, there is an alternative
(For an introduction to testing hypotheses, see the video called "A Test of Significance" at www.watstat.ca.)
hypothesis, which may not always be speci…ed. In many cases the alternative hypothesis is
simply that H0 is not true.
We will outline the logic of tests of hypotheses in the first example, the claim that I have ESP. In an effort to prove or disprove this claim, an unbiased observer tosses a fair coin 100 times and before each toss I guess the outcome of the toss. We count Y, the number of correct guesses, which we can assume has a Binomial distribution with n = 100. The probability that I guess the outcome correctly on a given toss is an unknown parameter θ. If I have no unusual ESP capacity at all, then we would assume θ = 0.5, whereas if I have some form of ESP, either a positive attraction or an aversion to the correct answer, then we expect θ ≠ 0.5. We begin by asking the following questions in this context:
(1) Which of the two possibilities, θ = 0.5 or θ ≠ 0.5, should be assigned to H₀, the null hypothesis?
(2) What observed values of Y are highly inconsistent with H0 and what observed values
of Y are compatible with H0 ?
(3) What observed values of Y would lead to us to conclude that the data provide no
evidence against H0 and what observed values of Y would lead us to conclude that
the data provide strong evidence against H0 ?
In answer to question (1), hopefully you observed that these two hypotheses, ESP and NO ESP, are not equally credible and decided that the null hypothesis should be H₀: θ = 0.5, or H₀: I do not have ESP.
To answer question (2), we note that observed values of Y that are very small (e.g. 0 to 10) or very large (e.g. 90 to 100) would clearly lead us to believe that H₀ is false, whereas values near 50 are perfectly consistent with H₀. This leads naturally to the concept of a test statistic or discrepancy measure.
Usually we define D so that D = 0 represents the best possible agreement between the data and H₀, and values of D not close to 0 indicate poor agreement. A general method for constructing test statistics will be described in Section 5.3, but in this example, it seems natural to use D(Y) = |Y − 50|.
Question (3) could be resolved easily if we could specify a threshold value for D, or equivalently some function of D. In the given example, the observed value of Y was y = 52 and so the observed value of D is d = |52 − 50| = 2. One might ask what is the probability, when H₀ is true, that the discrepancy measure results in a value less than d. Equivalently, what is the probability, assuming H₀ is true, that the discrepancy measure is greater than or equal to d? In other words we want to determine P(D ≥ d; H₀) where the notation
"; H₀" means "assuming that H₀ is true". We can compute this easily in our given example. If H₀ is true then Y ~ Binomial(100, 0.5) and

P(D ≥ 2; H₀) = P(|Y − 50| ≥ 2) = 1 − P(49 ≤ Y ≤ 51) ≈ 0.76.

How can we interpret this value in terms of the test of H₀? Roughly 76% of claimants similarly tested for ESP, who have no abilities at all but simply randomly guess, will perform as well or better (that is, result in at least as large a value of D as the observed value of 2) than I did. This does not prove I do not have ESP, but it does indicate we have failed to find any evidence in these data to support rejecting H₀. There is no evidence against H₀ in the observed value d = 2, and this was indicated by the high probability that, when H₀ is true, we obtain at least this much measured disagreement with H₀.
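A one-line R check of this calculation (a sketch; the exact value is approximately 0.76):

    # P(|Y - 50| >= 2) when Y ~ Binomial(100, 0.5)
    1 - sum(dbinom(49:51, size = 100, prob = 0.5))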
We now proceed to a more formal treatment of hypothesis tests. We will concentrate on two types of hypotheses:

(1) the hypothesis H₀: θ = θ₀ where it is assumed that the data Y have arisen from a family of distributions with probability (density) function f(y; θ) with parameter θ;

(2) the hypothesis H₀: Y ~ f₀(y) where it is assumed that the data Y have a specified probability (density) function f₀(y).

Definition 39  Suppose we use the test statistic D = D(Y) to test the hypothesis H₀. Suppose also that d = D(y) is the observed value of D. The p-value or observed significance level of the test of hypothesis H₀ using test statistic D is

p-value = P(D ≥ d; H₀).
Remarks:

(1) Note that the p-value is defined as P(D ≥ d; H₀) and not P(D = d; H₀) even though the event that has been observed is D = d. If D is a continuous random variable then P(D = d; H₀) is always equal to zero, which is not very useful. If D is a discrete random variable with many possible values then P(D = d; H₀) will be small, which is also not very useful. Therefore, to determine how unusual the observed result is, we compare it to all the other results which are as unusual or more unusual than what has been observed.

(2) The p-value is NOT the probability that H₀ is true. This is a common misinterpretation.

The following table gives a rough guideline for interpreting p-values. These are only guidelines for this course. The interpretation of p-values must always be made in the context of a given study.
which can be calculated using R or using the Normal approximation to the Binomial since n = 200 is large. Using the Normal approximation (without a continuity correction since it is not essential to have an exact value) we obtain a p-value of approximately 0.16.
p-value = P(D ≥ 14; H₀)
        = P(Y ≥ 44)  where Y ~ Binomial(180, 1/6)
        = Σ_{y=44}^{180} C(180, y) (1/6)^y (5/6)^(180−y)
        = 0.005
which provides strong evidence against H₀, and suggests that θ is bigger than 1/6. This is an example of a one-sided test, which is described in more detail below.
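In R the exact p-value can be computed directly (a sketch):

    # P(Y >= 44) when Y ~ Binomial(180, 1/6); compare with the value 0.005 above
    1 - pbinom(43, size = 180, prob = 1/6)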
= 0.18

and this probability is not especially small. Indeed almost one die in five, though fair, would show this level of discrepancy with H₀. We conclude that there is no evidence against H₀ in light of the observed data.
Note that we do not claim that H₀ is true, only that there is no evidence in light of the data that it is not true. Similarly in the legal example, if we do not find evidence against H₀: "the defendant is innocent", this does not mean we have proven he or she is innocent, only that, for the given data, the amount of evidence against H₀ was insufficient to conclude otherwise.
The approach to testing a hypothesis described above is very general and straightforward, but a few points should be stressed:

1. If the p-value is very small then, as indicated in the table, there is strong evidence against H₀ in light of the observed data; this is often termed "statistically significant" evidence against H₀. While we believe that statistical evidence is best measured when we interpret p-values as in the above table, it is common in some of the literature to adopt a threshold value for the p-value such as 0.05 and "reject H₀" whenever the p-value is below this threshold. This may be necessary when there are only two options for your decision. For example in a trial, a person is either convicted or acquitted of a crime.

2. If the p-value is not small, we do not conclude that H₀ is true. We simply say there is no evidence against H₀ in light of the observed data. The reason for this "hedging" is that in most settings a hypothesis may never be strictly "true". (For example, one might argue when testing H₀: θ = 1/6 in Example 5.1.2 that no real die ever has a probability of exactly 1/6 for side 1.) Hypotheses can be "disproved" (with a small degree of possible error) but not proved. Again, if we are limited to two possible decisions, if you fail to "reject H₀" in the language above, you may say that "H₀ is accepted" when the p-value is larger than the predetermined threshold. This does not mean that we have determined that H₀ is true, but that there is insufficient evidence on hand to reject it.
(If the untimely demise of all of the prosecution witnesses at your trial leads to your acquittal, does this prove your innocence?)
4. So far we have not refined the conclusion when we do find strong evidence against the null hypothesis. Often we have in mind an "alternative" hypothesis. For example, if the standard treatment for pain provides relief in about 50% of cases, and we test, for patients medicated with an alternative treatment, H₀: P(relief) = 0.5, we will obviously wish to know, if we find strong evidence against H₀, in what direction that evidence lies. If the probability of relief is greater than 0.5 we might consider further tests or adopting the drug, but if it is less, then the drug will be abandoned for this purpose. We will try to adapt to this type of problem with our choice of discrepancy measure D.
A drawback with the approach to testing described so far is that we do not have a general method for choosing the test statistic or discrepancy measure D. Often there are "intuitively obvious" test statistics that can be used; this was the case in the examples in this section. In Section 5.3 we will see how to use the likelihood function to construct a test statistic in more complicated situations where it is not always easy to come up with an intuitive test statistic.
A final point is that once we have specified a test statistic D, we need to be able to compute the p-value for the observed data. Calculating probabilities involving D brings us back to distribution theory. In most cases the exact p-value is difficult to determine mathematically, and we must use either an approximation or computer simulation. Fortunately, for the tests considered in Section 5.3 we can use an approximation based on the χ² distribution.
For the Gaussian model with unknown mean and standard deviation we use test statistics based on the pivotal quantities that were used in Chapter 4 for constructing confidence intervals.
μ̃ = Ȳ = (1/n) Σᵢ₌₁ⁿ Yᵢ  and  σ̃² = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

to estimate μ and σ², respectively.
Recall from Chapter 4 that

T = (Ȳ − μ) / (S/√n) ~ t(n − 1).

We use this pivotal quantity to construct a test of hypothesis for the parameter μ when the standard deviation σ is unknown.
To test H₀: μ = μ₀ we use the test statistic

D = |Ȳ − μ₀| / (S/√n).   (5.1)

Let

d = |ȳ − μ₀| / (s/√n)   (5.2)

be the value of D observed in a sample with mean ȳ and standard deviation s. Then

p-value = P(D ≥ d; H₀ is true)
        = P(|T| ≥ d) = 1 − P(−d ≤ T ≤ d)
        = 2[1 − P(T ≤ d)]  where T ~ t(n − 1).   (5.3)
new treatment follow a G(μ, σ) distribution and that the new treatment can either have no effect, represented by μ = μ₀, or a beneficial effect, represented by μ > μ₀. In this example the null hypothesis is H₀: μ = μ₀ and the alternative hypothesis is H_A: μ > μ₀. To test H₀: μ = μ₀ using this alternative we could use the test statistic

D = (Ȳ − μ₀) / (S/√n)

so that large values of D provide evidence against H₀ in the direction of the alternative μ > μ₀. Under H₀: μ = μ₀ the test statistic D has a t(n − 1) distribution. Let the observed value be

d = (ȳ − μ₀) / (s/√n).

Then

p-value = P(D ≥ d; H₀ is true)
        = P(T ≥ d)
        = 1 − P(T ≤ d)  where T ~ t(n − 1).
In Example 5.1.2, the hypothesis of interest was H₀: θ = 1/6 where θ was the probability that the upturned face was a one. If the alternative of interest is that θ is not equal to 1/6 then the alternative hypothesis is H_A: θ ≠ 1/6 and the test statistic D = |Y − n/6| is a good choice. If the alternative of interest is that θ is bigger than 1/6 then the alternative hypothesis is H_A: θ > 1/6 and the test statistic D = max[(Y − n/6), 0] is a better choice.
A: 1:026 0:998 1:017 1:045 0:978 1:004 1:018 0:965 1:010 1:000
B: 1:011 0:966 0:965 0:999 0:988 0:987 0:956 0:969 0:980 0:988
Let Y represent a single measurement on one of the scales, and let μ represent the average measurement E(Y) in repeated weighings of a single 1 kg weight. If an experiment involving n weighings is conducted then a test of H₀: μ = 1 can be based on the test statistic (5.1) with observed value (5.2) and μ₀ = 1.
The samples from scales A and B above give us
p-value = P(D ≥ 0.839; μ = 1)
        = P(|T| ≥ 0.839)  where T ~ t(9)
        = 2[1 − P(T ≤ 0.839)]
        = 2(1 − 0.7884)
        ≈ 0.42

where the probability is obtained using R. Alternatively, if we use the t tables provided in these notes we obtain P(T ≤ 0.5435) = 0.7 and P(T ≤ 0.8834) = 0.8 so 0.4 < p-value < 0.6.

In either case we have that the p-value > 0.1 and thus there is no evidence of bias, that is, there is no evidence against H₀: μ = 1 for scale A based on the observed data.
For scale B, however, we obtain

p-value = P(D ≥ 3.534; μ = 1)
        = P(|T| ≥ 3.534)  where T ~ t(9)
        = 2[1 − P(T ≤ 3.534)]
        = 0.0064

where the probability is obtained using R. Alternatively, if we use the t tables we obtain P(T ≤ 3.2498) = 0.995 and P(T ≤ 4.2968) = 0.999 so 0.002 < p-value < 0.01.

In either case we have that the p-value < 0.01 and thus there is strong evidence against H₀: μ = 1. The observed data suggest strongly that scale B is biased.
Finally, note that although there is strong evidence against H₀ for scale B, the degree of bias in its measurements is not necessarily large enough to be of practical concern. In fact, we can obtain a 95% confidence interval for μ for scale B by using the pivotal quantity

T = (Ȳ − μ) / (S/√10) ~ t(9).

From the t tables we have P(T ≤ 2.2622) = 0.975 and a 95% confidence interval for μ is given by

ȳ ± 2.2622 s/√10 = 0.981 ± 0.012  or  [0.969, 0.993].

Evidently scale B consistently understates the weight but the bias in measuring the 1 kg weight is likely fairly small (about 1% to 3%).
Remark: The function t.test in R will give confidence intervals and test hypotheses about μ; for a data set y use t.test(y).
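For instance, for the scale data of Example 5.2.1 (a sketch using the readings listed above):

    A <- c(1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000)
    B <- c(1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988)
    t.test(A, mu = 1)   # p-value about 0.42; no evidence against H0: mu = 1
    t.test(B, mu = 1)   # p-value about 0.006 and 95% confidence interval [0.969, 0.993]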
if and only if P(|T| ≥ |ȳ − μ₀|/(s/√n)) ≥ 0.05  where T ~ t(n − 1)
if and only if P(|T| ≤ |ȳ − μ₀|/(s/√n)) ≤ 0.95
if and only if |ȳ − μ₀|/(s/√n) ≤ a  where P(|T| ≤ a) = 0.95
if and only if μ₀ ∈ [ȳ − as/√n, ȳ + as/√n]

which is a 95% confidence interval for μ. In other words, the p-value for testing H₀: μ = μ₀ is greater than or equal to 0.05 if and only if the value μ = μ₀ is inside a 95% confidence interval for μ (assuming we use the same pivotal quantity).
More generally, suppose we have data y, a model f(y; θ), and we use the same pivotal quantity to construct a confidence interval for θ and a test of the hypothesis H₀: θ = θ₀. Then the parameter value θ = θ₀ is inside a 100q% confidence interval for θ if and only if the p-value for testing H₀: θ = θ₀ is greater than 1 − q.
to construct confidence intervals for the parameter σ. We may also wish to test a hypothesis such as H₀: σ = σ₀. One approach is to use a likelihood ratio test statistic, which is described in the next section. Alternatively we could use the test statistic

U = (n − 1)S² / σ₀²
for testing H₀: σ = σ₀. Large values of U and small values of U provide evidence against H₀. (Why is this?) Now U has a Chi-squared distribution when H₀ is true and the Chi-squared distribution is not symmetric, which makes the determination of "large" and "small" values somewhat problematic. The following simpler calculation approximates the p-value:

1. Let u = (n − 1)s²/σ₀² denote the observed value of U from the data.

2. If P(U ≤ u) ≤ 1/2, compute p-value = 2P(U ≤ u) where U ~ χ²(n − 1).

3. If P(U ≤ u) > 1/2, compute p-value = 2P(U ≥ u) where U ~ χ²(n − 1).

Figure 5.1 shows a picture for a large observed value of u. In this case P(U ≤ u) > 1/2 and the p-value = 2P(U ≥ u).
[Figure 5.1: probability density function of U ~ χ²(n − 1), showing the areas P(U < u) and P(U > u) for a large observed value u.]
Example 5.2.2
For the manufacturing process in Example 4.7.2, test the hypothesis H₀: σ = 0.008 (0.008 is the desired or target value of σ the manufacturer would like to achieve). Note that since the value σ = 0.008 is outside the two-sided 95% confidence interval for σ in Example 4.5.2, the p-value for a test of H₀ based on the test statistic U = (n − 1)S²/σ₀² will be less than 0.05. To find the p-value, we follow the procedure above:
1. u = (n − 1)s²/σ₀² = (14)s²/(0.008)² = 0.002347/(0.008)² = 36.67

2. The p-value is

p-value = 2P(U ≥ u) = 2P(U ≥ 36.67) = 0.0017  where U ~ χ²(14)

where the probability is obtained using R. Alternatively, if we use the Chi-squared tables provided in these notes we obtain P(U ≤ 31.319) = 0.995 so p-value < 2(1 − 0.995) = 0.01.

In either case we have that the p-value < 0.01 and thus there is strong evidence based on the observed data against H₀: σ = 0.008. Since the observed value of s = √(0.002347/14) = 0.0129 is greater than 0.008, the data suggest that σ is bigger than 0.008.
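A short R sketch of this calculation, using the summary values from the example:

    n <- 15
    ss <- 0.002347                 # (n - 1) * s^2 from the data
    u <- ss / 0.008^2              # observed value of U, about 36.67
    p <- pchisq(u, df = n - 1)     # P(U <= u), close to 1 here
    2 * min(p, 1 - p)              # p-value, about 0.0017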
R(θ₀) = L(θ₀) / L(θ̂)
where θ̂ is the maximum likelihood estimate of θ based on the observed data. The approximate p-value is then

p-value ≈ P(W ≥ λ(θ₀))  where W ~ χ²(1) and λ(θ₀) = −2 log R(θ₀).
Let us summarize the construction of a test from the likelihood function. Let the random variable (or vector of random variables) Y represent data generated from a distribution with probability function or probability density function f(y; θ) which depends on the scalar parameter θ. Let Ω be the parameter space (set of possible values) for θ. Consider a hypothesis of the form

H₀: θ = θ₀

where θ₀ is a single point (hence of dimension 0). We can test H₀ using as our test statistic the likelihood ratio test statistic Λ, defined by (5.5). Then large observed values of

(Recall that L(θ) = L(θ; y) is a function of the observed data y and therefore replacing y by the corresponding random variable Y means that L(θ; Y) is a random variable. Therefore the random variable L(θ₀)/L(θ̃) = L(θ₀; Y)/L(θ̃; Y) is a function of Y in several places including θ̃ = g(Y).)
[Figure: the relative likelihood function R(θ) (top panel) and −2 log R(θ) (bottom panel), indicating more plausible and less plausible values of θ.]
For the Binomial model the likelihood ratio test statistic for H₀: θ = θ₀ takes the observed value

λ(θ₀) = 2n[ θ̂ log(θ̂/θ₀) + (1 − θ̂) log((1 − θ̂)/(1 − θ₀)) ]

where θ̂ = y/n. If θ̂ and θ₀ are equal then λ(θ₀) = 0. If θ̂ is either much larger or much smaller than θ₀, then λ(θ₀) will be large in value.
Suppose we use the likelihood ratio test statistic to test H₀: θ = 0.5 for the ESP example and the data in Example 5.1.1, which were n = 200 and y = 110 so that θ̂ = 0.55. The observed value of the likelihood ratio statistic for testing H₀: θ = 0.5 is

λ(0.5) = 2(200)[ (0.55) log(0.55/0.5) + (1 − 0.55) log((1 − 0.55)/(1 − 0.5)) ] = 2.003

and the approximate p-value is

p-value ≈ P(W ≥ 2.003) = 0.157  where W ~ χ²(1)

and there is no evidence against H₀: θ = 0.5 based on the data. Note that the test statistic D = |Y − 100| used in Example 5.1.1 and the likelihood ratio test statistic λ(0.5) give nearly identical results. This is because n = 200 is large.
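In R (a sketch of this calculation):

    n <- 200; y <- 110; theta0 <- 0.5
    thetahat <- y / n
    lambda <- 2 * (y * log(thetahat / theta0) +
                   (n - y) * log((1 - thetahat) / (1 - theta0)))   # 2.003
    1 - pchisq(lambda, df = 1)      # approximate p-value, about 0.16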
Λ(θ₀) = 2l(θ̃) − 2l(θ₀) = 2l(Ȳ) − 2l(θ₀)
      = 2n[ −log Ȳ − Ȳ/Ȳ + log θ₀ + Ȳ/θ₀ ]
      = 2n[ Ȳ/θ₀ − 1 − log(Ȳ/θ₀) ].

Again we observe that if θ̂ and θ₀ are equal then λ(θ₀) = 0, and if θ̂ is either much larger or much smaller than θ₀, then λ(θ₀) will be large in value.
The variability in lifetimes of light bulbs (in hours, say, of operation before failure) is often well described by an Exponential(θ) distribution where θ = E(Y) > 0 is the average (mean) lifetime. A manufacturer claims that the mean lifetime of a particular brand of bulbs is 2000 hours. We can examine this claim by testing the hypothesis H₀: θ = 2000. Suppose a random sample of n = 50 light bulbs was tested over a long period and that the observed lifetimes were such that Σᵢ₌₁⁵⁰ yᵢ = 93840. For these data the maximum likelihood estimate of θ is θ̂ = ȳ = 93840/50 = 1876.8. To check whether the Exponential model is reasonable for these data we plot the empirical cumulative distribution function for these data and then superimpose the cumulative distribution function for an Exponential(1876.8) random variable. See Figure 5.3.
[Figure 5.3: empirical cumulative distribution function of the lifetimes of the light bulbs with the Exponential(1876.8) cumulative distribution function superimposed.]
Since the agreement between the empirical cumulative distribution function and the Exponential(1876.8) cumulative distribution function is reasonably good, the Exponential model seems reasonable for these data. The p-value is

p-value ≈ P(W ≥ λ(2000))  where W ~ χ²(1) and λ(2000) = 2n[ȳ/2000 − 1 − log(ȳ/2000)]

or more simply p-value ≈ 2[1 − P(Z ≤ √λ(2000))] where Z ~ G(0, 1).
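A sketch of this test in R, using the summary value Σ yᵢ = 93840 with n = 50:

    n <- 50; ybar <- 93840 / 50; theta0 <- 2000
    lambda <- 2 * n * (ybar / theta0 - 1 - log(ybar / theta0))   # about 0.20
    1 - pchisq(lambda, df = 1)   # approximate p-value, about 0.66: no evidence against H0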
L(μ) = exp[ −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² ]  for μ ∈ ℝ.

The log likelihood function is

l(μ) = −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)²  for μ ∈ ℝ.

Using the identity

Σᵢ₌₁ⁿ (yᵢ − μ)² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² + n(ȳ − μ)²

the likelihood ratio statistic is

Λ = 2l(μ̃) − 2l(μ₀)
  = (1/σ²)[ Σᵢ₌₁ⁿ (Yᵢ − μ₀)² − Σᵢ₌₁ⁿ (Yᵢ − μ̃)² ]
  = (1/σ²)[ Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² + n(Ȳ − μ₀)² − Σᵢ₌₁ⁿ (Yᵢ − μ̃)² ]
  = (1/σ²) n(Ȳ − μ₀)²   since μ̃ = Ȳ
  = [ (Ȳ − μ₀) / (σ/√n) ]².   (5.7)
The purpose of writing the likelihood ratio statistic in the form (5.7) is to draw attention to the fact that Λ is the square of the standard Normal random variable (Ȳ − μ₀)/(σ/√n) and therefore has exactly a χ²(1) distribution. Of course it is not clear in general that the likelihood ratio test statistic has an approximate χ²(1) distribution, but in this special case, the distribution of Λ is exactly χ²(1) (not only asymptotically but for all values of n).
H₀: θ ∈ Ω₀

where Ω₀ ⊂ Ω and Ω₀ is of dimension p < k. For example, H₀ might specify particular values for k − p of the components of θ but leave the remaining parameters alone. The dimensions of Ω and Ω₀ refer to the minimum number of parameters (or "coordinates") needed to specify points in them. Again we test H₀ using as our test statistic the likelihood ratio test statistic Λ, defined as follows. Let θ̂ denote the maximum likelihood estimate of θ over Ω so that, as before,

L(θ̂) = max_{θ∈Ω} L(θ).
(You should be able to verify the identity Σᵢ₌₁ⁿ (yᵢ − c)² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² + n(ȳ − c)² for any value of c.)
Similarly we let θ̂₀ denote the maximum likelihood estimate of θ over Ω₀ (i.e. we maximize the likelihood with the parameter constrained to lie in the set Ω₀) so that

L(θ̂₀) = max_{θ∈Ω₀} L(θ).

Then

p-value = P(Λ ≥ λ; H₀) ≈ P(W ≥ λ)   (5.9)

where W ~ χ²(k − p).

The likelihood ratio test covers a great many different types of examples, but we only provide a few here.
H₀: θ_A = θ_B.

Essentially we have data from two Poisson distributions with possibly different parameters. For convenience let (x₁, …, xₙ) denote the observations for Company A's photocopiers, which are assumed to be a random sample from the model

P(X = x; θ_A) = θ_A^x exp(−θ_A) / x!   for x = 0, 1, … and θ_A > 0.

Similarly let (y₁, …, yₘ) denote the observations for Company B's photocopiers, which are assumed to be a random sample from the model

P(Y = y; θ_B) = θ_B^y exp(−θ_B) / y!   for y = 0, 1, … and θ_B > 0

independently of the observations for Company A's photocopiers. In this case the parameter vector is the two-dimensional vector θ = (θ_A, θ_B) and Ω = {(θ_A, θ_B): θ_A > 0, θ_B > 0}. Note that the dimension of Ω is k = 2. Since the null hypothesis specifies that the two parameters θ_A and θ_B are equal but does not otherwise specify their values, we have Ω₀ = {(θ, θ): θ > 0}, which is a space of dimension p = 1.
To construct the likelihood ratio test of H₀: θ_A = θ_B we need the likelihood function for the parameter vector θ = (θ_A, θ_B). We first note that the likelihood function for θ_A only, based on the data (x₁, …, xₙ), is

L₁(θ_A) = ∏ᵢ₌₁ⁿ f(xᵢ; θ_A) = ∏ᵢ₌₁ⁿ θ_A^(xᵢ) exp(−θ_A) / xᵢ!   for θ_A > 0

or more simply

L₁(θ_A) = ∏ᵢ₌₁ⁿ θ_A^(xᵢ) exp(−θ_A)   for θ_A > 0.

Similarly, the likelihood function for θ_B based on the data (y₁, …, yₘ) is

L₂(θ_B) = ∏ⱼ₌₁ᵐ θ_B^(yⱼ) exp(−θ_B)   for θ_B > 0.

Since the data from A and B are independent, the likelihood function for θ = (θ_A, θ_B) is obtained as a product of the individual likelihoods:

L(θ) = L(θ_A, θ_B) = L₁(θ_A) L₂(θ_B) = [∏ᵢ₌₁ⁿ θ_A^(xᵢ) exp(−θ_A)] [∏ⱼ₌₁ᵐ θ_B^(yⱼ) exp(−θ_B)]   for (θ_A, θ_B) ∈ Ω.
The numbers of failures in twelve consecutive months for Company A's and Company B's copiers are given below; there were the same number of copiers from each company in use, so n = m = 12.

Company A: 16 14 25 19 23 12 22 28 19 15 18 29
Company B: 13  7 12  9 15 17 10 13  8 10 12 14
We note that Σᵢ₌₁¹² xᵢ = 240 and Σⱼ₌₁¹² yⱼ = 140. The log likelihood function is

l(θ) = l(θ_A, θ_B) = −12θ_A + 240 log θ_A − 12θ_B + 140 log θ_B   for (θ_A, θ_B) ∈ Ω.   (5.10)

The values of θ_A and θ_B which maximize l(θ_A, θ_B) are obtained by solving the two equations

∂l/∂θ_A = 0  and  ∂l/∂θ_B = 0

which gives two equations in two unknowns:

−12 + 240/θ_A = 0
−12 + 140/θ_B = 0.

The maximum likelihood estimates of θ_A and θ_B (unconstrained) are θ̂_A = 240/12 = 20.0 and θ̂_B = 140/12 = 11.667. That is, θ̂ = (20.0, 11.667).
To determine

L(θ̂₀) = max_{θ∈Ω₀} L(θ)

we need to find the (constrained) maximum likelihood estimate θ̂₀, which is the value of θ = (θ_A, θ_B) which maximizes l(θ_A, θ_B) under the constraint θ_A = θ_B. To do this we merely let θ = θ_A = θ_B in (5.10) to obtain

l(θ, θ) = −12θ + 240 log θ − 12θ + 140 log θ = −24θ + 380 log θ   for θ > 0.

Solving ∂l(θ, θ)/∂θ = 0, we find θ̂ = 380/24 = 15.833 (= θ̂_A = θ̂_B); that is, θ̂₀ = (15.833, 15.833).
The next step is to compute the observed value of the likelihood ratio statistic, which from (5.8) is

λ = 2l(θ̂) − 2l(θ̂₀)
  = 2l(20.0, 11.667) − 2l(15.833, 15.833)
  = 2(682.92 − 669.60)
  = 26.64.

Finally, we compute the approximate p-value for the test, which by (5.9) is

p-value = P(Λ ≥ 26.64; H₀)
        ≈ P(W ≥ 26.64)  where W ~ χ²(1)
        = 2[1 − P(Z ≤ √26.64)]  where Z ~ G(0, 1)
        ≈ 0.
(Think of this as maximizing over each parameter with the other parameter fixed.)
Our conclusion is that there is very strong evidence against the hypothesis H₀: θ_A = θ_B. The data indicate that Company B's copiers have a lower rate of failure than Company A's copiers.
Note that we could also follow up this conclusion by giving a confidence interval for the mean difference θ_A − θ_B, since this would indicate the magnitude of the difference in the two failure rates. The maximum likelihood estimates θ̂_A = 20.0 average failures per month and θ̂_B = 11.67 failures per month differ a lot, but we could also give a confidence interval in order to express the uncertainty in such estimates.
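A sketch of this test in R, using the monthly failure counts above:

    xA <- c(16, 14, 25, 19, 23, 12, 22, 28, 19, 15, 18, 29)
    yB <- c(13, 7, 12, 9, 15, 17, 10, 13, 8, 10, 12, 14)
    # Poisson log likelihood, with the constants in x! omitted as in the notes
    loglik <- function(thetaA, thetaB) {
      -12 * thetaA + sum(xA) * log(thetaA) - 12 * thetaB + sum(yB) * log(thetaB)
    }
    thetaA.hat <- mean(xA)            # 20.0
    thetaB.hat <- mean(yB)            # 11.667
    theta0.hat <- mean(c(xA, yB))     # 15.833, the constrained estimate
    lambda <- 2 * (loglik(thetaA.hat, thetaB.hat) -
                   loglik(theta0.hat, theta0.hat))   # about 26.64
    1 - pchisq(lambda, df = 1)        # approximate p-value, essentially 0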
Example 5.4.4  Likelihood ratio test of hypotheses for σ in the G(μ, σ) model with μ unknown
Consider a test of H₀: σ = σ₀ based on a random sample y₁, y₂, …, yₙ. In this case the unconstrained parameter space is Ω = {(μ, σ): −∞ < μ < ∞, σ > 0}, obviously a 2-dimensional space, but under the constraint imposed by H₀ the parameter must lie in the space Ω₀ = {(μ, σ₀): −∞ < μ < ∞}, a space of dimension 1. Thus k = 2 and p = 1.
The likelihood function is

L(θ) = L(μ, σ) = ∏ᵢ₌₁ⁿ f(yᵢ; μ, σ) = ∏ᵢ₌₁ⁿ (1/(σ√(2π))) exp[ −(1/(2σ²))(yᵢ − μ)² ]

and the log likelihood function is

l(μ, σ) = −n log σ − (1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² + c

where c = log[(2π)^(−n/2)]. The (unconstrained) maximum likelihood estimators are

μ̃ = Ȳ  and  σ̃² = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)².
Under the constraint imposed by H₀: σ = σ₀ the maximum likelihood estimator of the parameter μ is also Ȳ, so the likelihood ratio statistic is

Λ(σ₀) = 2l(Ȳ, σ̃) − 2l(Ȳ, σ₀)
      = −2n log σ̃ − (1/σ̃²) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² + 2n log σ₀ + (1/σ₀²) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²
      = 2n log(σ₀/σ̃) + (1/σ₀² − 1/σ̃²) n σ̃²
      = n[ σ̃²/σ₀² − 1 − log(σ̃²/σ₀²) ].
This is not as obviously a Chi-squared random variable. It is, as one might expect, a function of σ̃²/σ₀², the maximum likelihood estimator of the variance divided by the value of σ² under H₀. In fact the value of Λ(σ₀) increases as the quantity σ̃²/σ₀² gets further away from the value 1 in either direction.
The test proceeds by obtaining the observed value of Λ(σ₀),

λ(σ₀) = n[ σ̂²/σ₀² − 1 − log(σ̂²/σ₀²) ],

and computing the approximate p-value P(W ≥ λ(σ₀)) where W ~ χ²(1).
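A small R sketch of this test (the data vector y and the value sigma0 are placeholders to be supplied by the user):

    # likelihood ratio test of H0: sigma = sigma0 in the G(mu, sigma) model, mu unknown
    lr.test.sigma <- function(y, sigma0) {
      n <- length(y)
      sigma2.tilde <- sum((y - mean(y))^2) / n   # maximum likelihood estimate of sigma^2
      r <- sigma2.tilde / sigma0^2
      lambda <- n * (r - 1 - log(r))             # observed likelihood ratio statistic
      c(lambda = lambda, p.value = 1 - pchisq(lambda, df = 1))
    }
    # hypothetical example call: lr.test.sigma(y, sigma0 = 0.008)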
f(y₁, …, y_k; θ₁, …, θ_k) = [n!/(y₁! ⋯ y_k!)] θ₁^(y₁) θ₂^(y₂) ⋯ θ_k^(y_k)   for 0 ≤ yⱼ ≤ n with Σⱼ₌₁ᵏ yⱼ = n.

The likelihood function is

L(θ) = [n!/(y₁! ⋯ y_k!)] ∏ⱼ₌₁ᵏ θⱼ^(yⱼ)

or more simply

L(θ) = ∏ⱼ₌₁ᵏ θⱼ^(yⱼ).
where

λ = 2l(θ̂) − 2l(θ̂₀)

is the observed value of Λ. We will give specific examples of the Multinomial model in Chapter 7.
(a) Let θ be the probability the woman guesses the card correctly and let Y be the number of correct guesses in n repetitions of the procedure. Discuss why Y ~ Binomial(n, θ) would be an appropriate model. If you wanted to test the hypothesis that the woman is guessing at random, what is the appropriate null hypothesis H₀ in terms of the parameter θ?

(b) Suppose the woman guessed correctly 8 times in 20 repetitions. Using the test statistic D = |Y − E(Y)|, calculate the p-value for your hypothesis H₀ in (a) and give a conclusion about whether you think the woman has any special guessing ability.

(c) In a longer sequence of 100 repetitions over two days, the woman guessed correctly 32 times. Using the test statistic D = |Y − E(Y)|, calculate the p-value for these data. What would you conclude now?
2. The accident rate over a certain stretch of highway was about θ = 10 per year for a period of several years. In the most recent year, however, the number of accidents was 25. We want to know whether this many accidents is very probable if θ = 10; if not, we might conclude that the accident rate has increased for some reason. Investigate this question by assuming that the number of accidents in the current year follows a Poisson distribution with mean θ and then testing H₀: θ = 10. Use the test statistic D = max(0, Y − 10) where Y represents the number of accidents in the most recent year.
3. A hospital lab has just purchased a new instrument for measuring levels of dioxin
(in parts per billion). To calibrate the new instrument, 20 samples of a “standard”
water solution known to contain 45 parts per billion dioxin are measured by the new
instrument. The observed data are given below:
44:1 46:0 46:6 41:3 44:8 47:8 44:5 45:1 42:9 44:5
42:5 41:5 39:6 42:0 45:8 48:9 46:6 42:9 47:0 43:7
For these data Σᵢ₌₁²⁰ yᵢ = 888.1 and Σᵢ₌₁²⁰ yᵢ² = 395.45.
(a) Use a qqplot to check whether a G(μ, σ) model is reasonable for these data.

(b) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?

(c) Assuming a G(μ, σ) model for these data, test the hypothesis H₀: μ = 45. Determine a 95% confidence interval for μ. What would you conclude about how well the new instrument is working?

(d) The manufacturer of these instruments claims that the variability in measurements is less than two parts per billion. Test the hypothesis H₀: σ² = 4 and determine a 95% confidence interval for σ. What would you conclude about the manufacturer's claim?
5. Radon is a colourless, odourless gas that is naturally released by rocks and soils and may concentrate in highly insulated houses. Because radon is slightly radioactive, there is some concern that it may be a health hazard. Radon detectors are sold to homeowners worried about this risk, but the detectors may be inaccurate. University researchers placed 12 detectors in a chamber where they were exposed to 105 picocuries per liter of radon over 3 days. The readings given by the detectors were:

91.9  97.8  111.4  122.3  105.4  95.0  103.8  99.6  96.6  119.3  104.8  101.7

Assume Yᵢ ~ N(μ, σ²) = G(μ, σ), i = 1, …, 12 independently.

(a) Test the hypothesis H₀: μ = 105. Determine a 95% confidence interval for μ.

(b) Determine a 95% confidence interval for σ.

(c) As a statistician what would you say to the university researchers about the accuracy and precision of the detectors?

with μ = 105.
7. Between 10 a.m. on November 4, 2014 and 10 p.m. on November 6, 2014 the Federation of Students at the University of Waterloo conducted a referendum on the question "Should classes start on the first Thursday after Labour Day to allow for two additional days off in the Fall term?". All undergraduates were able to cast their ballot online. Six thousand of the 30,990 eligible voters voted. Of the 6000 who voted, 4440 answered yes to this question.
8. Data on the number of accidents at a busy intersection in Waterloo over the last 5
years indicated that the average number of accidents at the intersection was 3 acci-
dents per week. After the installation of new tra¢ c signals the number of accidents
per week for a 25 week period were recorded as follows:
4 5 0 4 2 0 1 4 1 3 1 1 2
2 2 1 1 3 2 3 2 0 2 2 3
(a) To decide whether the mean number of accidents at this intersection has changed
after the installation of the new traffic signals we wish to test the hypothesis
H0: θ = 3. Why is the discrepancy measure D = |Σ_{i=1}^{25} Yi − 75| reasonable? Calculate
the exact p-value for testing H0: θ = 3. What would you conclude?
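As an illustration only (a sketch, not part of the original exercise), an exact tail probability of this type can be computed in R using the fact that, under H0, the total Σ Yi over the 25 weeks has a Poisson(75) distribution:
y <- c(4,5,0,4,2,0,1,4,1,3,1,1,2,2,2,1,1,3,2,3,2,0,2,2,3)
t <- sum(y)                                 # observed total number of accidents
d <- abs(t - 75)                            # observed value of the discrepancy measure
ppois(75 - d, 75) + 1 - ppois(75 + d - 1, 75)   # exact two-sided p-value under Poisson(75)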
10. For Chapter 2, Problem 5 (b) test the hypothesis H0: θ = 5 using the likelihood ratio
test statistic. Is this result consistent with the approximate 95% confidence interval
for θ that you found in Chapter 4, Problem 8?
11. For Chapter 2, Problem 6 (b) test the hypothesis H0: θ = 0.1 using the likelihood
ratio test statistic. Is this result consistent with the approximate 95% confidence
interval for θ that you found in Chapter 4, Problem 9?
12. Data from the 2011 Canadian census indicate that 18% of all families in Canada
have one child. Suppose the data in Chapter 2, Problem 7 (d) represented 33 children
chosen at random from the Waterloo Region. Based on these data, test the hypothesis
that the percentage of families with one child in Waterloo Region is the same as the
national percentage using the likelihood ratio test statistic. Is this result consistent
with the approximate 95% confidence interval for θ that you found in Chapter 4,
Problem 10?
13. A company that produces power systems for personal computers has to demonstrate
a high degree of reliability for its systems. Because the systems are very reliable
under normal use conditions, it is customary to 'stress' the systems by running them
at a considerably higher temperature than they would normally encounter, and to
measure the time until the system fails. According to a contract with one personal
computer manufacturer, the average time to failure for systems run at 70°C should
be no less than 1,000 hours. From one production lot, 20 power systems were put on
test and observed until failure at 70°C. The 20 failure times y1, ..., y20 were (in hours):
374:2 544:0 1113:9 509:4 1244:3
551:9 853:2 3391:2 297:0 63:1
250:2 678:1 379:6 1818:9 1191:1
162:8 1060:1 1501:4 332:2 2382:0
(Note: Σ_{i=1}^{20} yi = 18,698.6.) Failure times are assumed to have an Exponential(θ)
distribution.
(a) Check whether the Exponential model is reasonable for these data. (See Example
5.3.2.)
(b) Use a likelihood ratio test to test H0: θ = 1000 hours. Is there any evidence
that the company's power systems do not meet the contracted standard?
14. The R function runif() generates pseudo-random Uniform(0, 1) random variables.
The command y <- runif(n) will produce a vector of n values y1, ..., yn.
(a) Suggest a test statistic which could be used to test that the yi's, i = 1, ..., n, are
consistent with a random sample from Uniform(0, 1).
(See: www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA393366)
(b) Generate 1000 yi's and carry out the test in (a).
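One possible approach (sketched here only as an illustration; many other test statistics are equally reasonable) is a chi-squared goodness of fit test comparing the counts in equal-width subintervals of (0, 1) to their expected values:
set.seed(123)                                       # for reproducibility
y <- runif(1000)
obs <- table(cut(y, breaks = seq(0, 1, by = 0.1)))  # counts in 10 equal bins
expctd <- rep(1000 / 10, 10)                        # expected counts under Uniform(0, 1)
d <- sum((obs - expctd)^2 / expctd)                 # Pearson goodness of fit statistic
1 - pchisq(d, df = 10 - 1)                          # approximate p-value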
15. The Poisson model is often used to compare rates of occurrence for certain types of
events in different geographic regions. For example, consider K regions with populations
P1, ..., PK and let λj, j = 1, ..., K be the annual expected number of events
per person for region j. By assuming that the number of events Yj for region j in a
given t-year period has a Poisson distribution with mean Pj λj t, we can estimate and
compare the λj's or test that they are equal.
(a) Under what conditions might the stated Poisson model be reasonable?
(b) Suppose you observe values y1, ..., yK for a given t-year period. Describe how
to test the hypothesis that λ1 = λ2 = ··· = λK.
(c) The data below show the numbers of children yj born with "birth defects" for 5
regions over a given five year period, along with the total numbers of births Pj
for each region. Test the hypothesis that the five rates of birth defects are equal.
16. Challenge Problem: Likelihood ratio test statistic for the Gaussian model
with μ and σ unknown: Suppose that Y1, ..., Yn are independent G(μ, σ) observations.
(a) Show that the likelihood ratio test statistic for testing H0: μ = μ0 (σ unknown)
is given by
Λ(μ0) = n log[1 + T²/(n − 1)]
where T = √n (Ȳ − μ0)/S and S is the sample standard deviation. Note: you
will want to use the identity
Σ_{i=1}^{n} (Yi − μ0)² = Σ_{i=1}^{n} (Yi − Ȳ)² + n(Ȳ − μ0)².
(b) Show that the likelihood ratio test statistic for testing H0: σ = σ0 (μ unknown)
can be written as Λ(σ0) = U − n log(U/n) − n where
U = (n − 1)S²/σ0².
17. Challenge Problem: Likelihood ratio test statistic for comparing two
Exponential means: Suppose that X1, ..., Xm is a random sample from the
Exponential(θ1) distribution and independently Y1, ..., Yn is a random sample
from the Exponential(θ2) distribution. Determine the likelihood ratio test statistic
for testing H0: θ1 = θ2.
18. In the Wintario lottery draw, six digit numbers were produced by six machines that
operate independently and which each simulate a random selection from the digits
0; 1; : : : ; 9. Of 736 numbers drawn over a period from 1980-82, the following frequen-
cies were observed for position 1 in the six digit numbers:
If the machines operate in a truly "random" fashion, then we should have θj = 0.1 for
j = 0, 1, ..., 9.
(a) Test this hypothesis using a likelihood ratio test. What do you conclude?
(b) The data above were for digits in the first position of the six digit Wintario
numbers. Suppose you were told that similar likelihood ratio tests had in fact
been carried out for each of the six positions, and that position 1 had been
singled out for presentation above because it gave the largest observed value of
the likelihood ratio statistic Λ. What would you now do to test the hypothesis
θj = 0.1, j = 0, 1, 2, ..., 9? (Hint: Find P(largest of 6 independent Λ's is ≥ λ).)
6. GAUSSIAN RESPONSE
MODELS
6.1 Introduction
A response variate Y is one whose distribution has parameters which depend on the value
of other variates. For the Gaussian models we have studied so far, we assumed that we had
a random sample Y1, Y2, ..., Yn from the same Gaussian distribution G(μ, σ). A Gaussian
response model generalizes this to permit the parameters of the Gaussian distribution for
Yi to depend on a vector xi of covariates (explanatory variates which are measured for
the response variate Yi). Gaussian models are by far the most common models used in
statistics.
Definition 40 A Gaussian response model is one for which the distribution of the response
variate Y, given the associated vector of covariates x = (x1, x2, ..., xk) for an individual
unit, is of the form
Y ~ G(μ(x), σ(x)).
In most examples we will assume σ(xi) = σ is constant. This assumption is not necessary
but it does make the models easier to analyze. The choice of μ(x) is guided by past
information and by current data from the population or process. The difference between
various Gaussian response models is in the choice of the function μ(x) and the covariates.
We often assume μ(xi) is a linear function of the covariates. These models are called
Gaussian linear models and can be written as
Yi ~ G(μ(xi), σ) with μ(xi) = β0 + β1 xi1 + β2 xi2 + ··· + βk xik   (6.1)
where xi = (xi1, xi2, ..., xik) is the vector of known covariates associated with unit i and
β0, β1, ..., βk are unknown parameters. These models are also referred to as linear regression
models, and the βj's are called the regression coefficients.
Here are some examples of settings where Gaussian response models can be used.
In this case there is no formula relating the means and standard deviations to the machines; they are simply different.
Notice that an important feature of a machine is the variability of its production, so we
have, in this case, permitted the two variance parameters to be different.
A manufacturing company was appealing the assessed market value of its property,
which included a large building. Sales records were collected on the 30 largest buildings
sold in the previous three years in the area. The data are given in Table 6.1 and plotted in
Figure 6.1. They include the size of the building x (in m²/10⁵) and the selling price y (in
$ per m²). The purpose of the analysis is to determine whether and to what extent we can
determine the value of a property from the single covariate x so that we know whether the
assessed value appears to be too high. The building in question was 4.47 × 10⁵ m², with
an assessed market value of $75 per m².
The scatterplot shows that the price y is roughly inversely proportional to the size x
but there is obviously variability in the price of buildings having the same area (size). In
this case we might consider a model where the price of a building of size xi is represented
by a random variable Yi, with
Yi ~ G(β0 + β1 xi, σ) for i = 1, ..., n independently
where β0 and β1 are unknown parameters. We assume a common standard deviation σ for
the observations.
Figure 6.1: Scatterplot of selling price ($ per m²) versus size for the 30 buildings
The data below show the breaking strengths y of six steel bolts at each of five different
bolt diameters x. The data are plotted in Figure 6.2.
The scatterplot gives a clear picture of the relationship between y and x. A reasonable
model for the breaking strength Y of a randomly selected bolt of diameter x would appear
to be Y ~ G(μ(x), σ). The variability in y values appears to be about the same for bolts of
different diameters, which again provides some justification for assuming σ to be constant.
It is not obvious what the best choice for μ(x) would be, although the relationship looks
slightly nonlinear, so we might try a quadratic function
μ(x) = β0 + β1 x + β2 x².
Figure 6.2: Scatterplot of breaking strength versus bolt diameter
The G(μ, σ) Model
In Chapters 4 and 5 we discussed estimation and testing hypotheses for samples from a
Gaussian distribution. Suppose that Y ~ G(μ, σ) models a response variate y in some
population or process. A random sample Y1, ..., Yn is selected, and we want to estimate
the model parameters and possibly to test hypotheses about them. We can write this model
in the form
Yi = μ + Ri where Ri ~ G(0, σ),   (6.2)
so this is a special case of the Gaussian response model in which the mean function is constant.
The estimator of the parameter μ that we used is the maximum likelihood estimator
Ȳ = (1/n) Σ_{i=1}^{n} Yi. This estimator is also a "least squares estimator": Ȳ has the property that
it is closer to the data than any other constant, or
min_μ Σ_{i=1}^{n} (Yi − μ)² = Σ_{i=1}^{n} (Yi − Ȳ)².
You should be able to verify this. It will turn out that the methods for estimation, constructing
confidence intervals and tests of hypotheses discussed earlier for the single Gaussian
G(μ, σ) are all special cases of the more general methods derived in Section 6.5.
In the next section we begin with a simple generalization of (6.2) to the case in which
the mean is a linear function of a single covariate.
or more simply
L(α, β, σ) = (1/σⁿ) exp[ −(1/(2σ²)) Σ_{i=1}^{n} (yi − α − βxi)² ].
The log likelihood is
l(α, β, σ) = −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − α − βxi)².
Setting the partial derivatives equal to zero gives
∂l/∂α = (1/σ²) Σ_{i=1}^{n} (yi − α − βxi) = (n/σ²)(ȳ − α − βx̄) = 0   (6.4)
∂l/∂β = (1/σ²) Σ_{i=1}^{n} (yi − α − βxi) xi = (1/σ²) [ Σ xi yi − α Σ xi − β Σ xi² ] = 0   (6.5)
∂l/∂σ = −n/σ + (1/σ³) Σ_{i=1}^{n} (yi − α − βxi)² = 0.
Solving these equations gives the maximum likelihood estimators
β̃ = Sxy / Sxx,   (6.6)
α̃ = Ȳ − β̃ x̄,   (6.7)
σ̃² = (1/n) Σ_{i=1}^{n} (Yi − α̃ − β̃ xi)²   (6.8)
where
Sxx = Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (xi − x̄) xi
Sxy = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) = Σ_{i=1}^{n} (xi − x̄) Yi
Syy = Σ_{i=1}^{n} (Yi − Ȳ)².
The alternative expressions for Sxx and Sxy⁴¹ are easy to obtain.
We will use
Se² = (1/(n − 2)) Σ_{i=1}^{n} (Yi − α̃ − β̃ xi)² = (1/(n − 2)) (Syy − β̃ Sxy)
⁴¹ Since Σ_{i=1}^{n} (xi − x̄) = 0,
Σ_{i=1}^{n} (xi − x̄)(xi − x̄) = Σ_{i=1}^{n} (xi − x̄) xi − x̄ Σ_{i=1}^{n} (xi − x̄) = Σ_{i=1}^{n} (xi − x̄) xi
and
Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) = Σ_{i=1}^{n} (xi − x̄) Yi − Ȳ Σ_{i=1}^{n} (xi − x̄) = Σ_{i=1}^{n} (xi − x̄) Yi.
as the estimator of σ² rather than the maximum likelihood estimator σ̃² given by (6.8),
since it can be shown that E(Se²) = σ². Note that Se² can be more easily calculated using
Se² = (1/(n − 2)) (Syy − β̃ Sxy)
which follows since
Σ_{i=1}^{n} (Yi − α̃ − β̃ xi)² = Σ_{i=1}^{n} (Yi − Ȳ + β̃ x̄ − β̃ xi)²
= Σ_{i=1}^{n} (Yi − Ȳ)² − 2β̃ Σ_{i=1}^{n} (Yi − Ȳ)(xi − x̄) + β̃² Σ_{i=1}^{n} (xi − x̄)²
= Syy − 2β̃ Sxy + (Sxy/Sxx) β̃ Sxx
= Syy − β̃ Sxy.
simultaneously. We note that this is equivalent to solving the maximum likelihood equations
(6.4) and (6.5). In summary, the least squares estimates and the maximum
likelihood estimates obtained assuming the model (6.3) are the same estimates. Of course
the method of least squares only provides point estimates of the unknown parameters
α and β, while assuming the model (6.3) allows us to obtain both estimates and confidence
intervals for the unknown parameters. We now show how to obtain confidence intervals
based on the model (6.3).
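Anticipating the numerical example below, here is a minimal R sketch (with made-up data vectors x and y, used only for illustration) of how the least squares/maximum likelihood estimates and se can be computed from the summary quantities and checked against lm:
x <- c(0.9, 1.4, 2.1, 2.7, 3.3)          # illustrative covariate values
y <- c(610, 550, 480, 420, 360)          # illustrative responses
n <- length(y)
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)
betahat  <- Sxy / Sxx                    # slope estimate, equation (6.6)
alphahat <- mean(y) - betahat * mean(x)  # intercept estimate, equation (6.7)
se <- sqrt((Syy - betahat * Sxy) / (n - 2))
coef(lm(y ~ x))                          # should agree with alphahat and betahat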
We can write
β̃ = Sxy/Sxx = Σ_{i=1}^{n} ai Yi where ai = (xi − x̄)/Sxx
to make it clear that β̃ is a linear combination of the Normal random variables Yi and is
therefore Normally distributed with easily obtained expected value and variance. In fact it
is easy to show that these non-random coefficients satisfy Σ ai = 0, Σ ai xi = 1 and
Σ ai² = 1/Sxx. Therefore
E(β̃) = Σ_{i=1}^{n} ai E(Yi) = Σ_{i=1}^{n} ai (α + β xi)
= β Σ ai xi   since Σ ai = 0
= β   since Σ ai xi = 1.
Similarly
Var(β̃) = Σ_{i=1}^{n} ai² Var(Yi)   since the Yi are independent random variables
= σ² Σ_{i=1}^{n} ai²
= σ²/Sxx   since Σ ai² = 1/Sxx.
In summary,
β̃ ~ G(β, σ/√Sxx).
Combining this with (6.9) and the fact that it can be shown that β̃ and Se² are independent
random variables, it follows by Theorem 32 that
(β̃ − β) / (Se/√Sxx) ~ t(n − 2).   (6.10)
This pivotal quantity can be used to obtain confidence intervals for β and to construct tests
of hypotheses about β.
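Continuing the R sketch above (same illustrative x, y and derived quantities), a 95% confidence interval for the slope based on (6.10) can be computed as:
a <- qt(0.975, df = n - 2)                 # t quantile with n - 2 degrees of freedom
c(betahat - a * se / sqrt(Sxx),
  betahat + a * se / sqrt(Sxx))            # 95% confidence interval for the slope
# equivalently: confint(lm(y ~ x))["x", ]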
To test the hypothesis H0: β = β0 the corresponding test statistic is
|β̃ − β0| / (Se/√Sxx).
Note also that (6.9) can be used to obtain confidence intervals or tests for σ, but these
are usually of less interest than inferences about β and the other quantities below.
Thus, the intercept changes if we redefine x, but the slope β does not. In the examples we consider here
we have kept the given definition of xi, for simplicity.
μ̃(x) = α̃ + β̃ x = Ȳ + β̃ (x − x̄),
since α̃ = Ȳ − β̃ x̄. Since β̃ = Sxy/Sxx = Σ_{i=1}^{n} [(xi − x̄)/Sxx] Yi, the estimator μ̃(x) is a
linear combination of the independent Normal random variables Yi and is therefore Normally
distributed, with E[μ̃(x)] = μ(x) and
Var[μ̃(x)] = σ² [ 1/n + (x − x̄)²/Sxx ],
so that
μ̃(x) ~ G( μ(x), σ [1/n + (x − x̄)²/Sxx]^{1/2} ).   (6.12)
Note that the variance of μ̃(x) is smallest in the middle of the data, that is, when x is close to
x̄, and much larger when (x − x̄)² is large.
Since (6.12) holds independently of (6.9), by Theorem 32 we obtain the pivotal
quantity
[μ̃(x) − μ(x)] / [ Se √(1/n + (x − x̄)²/Sxx) ] ~ t(n − 2)   (6.13)
which can be used to obtain confidence intervals for μ(x) in the usual manner. Using
t tables or R, find the constant a such that P(−a ≤ T ≤ a) = p where T ~ t(n − 2). Since
p = P(−a ≤ T ≤ a) = P( −a ≤ [μ̃(x) − μ(x)] / [Se √(1/n + (x − x̄)²/Sxx)] ≤ a )
= P( μ̃(x) − a Se √(1/n + (x − x̄)²/Sxx) ≤ μ(x) ≤ μ̃(x) + a Se √(1/n + (x − x̄)²/Sxx) ),
a 100p% confidence interval for μ(x) is given by
μ̂(x) ± a se √(1/n + (x − x̄)²/Sxx)   (6.14)
where μ̂(x) = α̂ + β̂ x and
se² = (1/(n − 2)) Σ_{i=1}^{n} (yi − α̂ − β̂ xi)² = (1/(n − 2)) (Syy − β̂ Sxy).
Remark: Note that since α = μ(0), a 95% confidence interval for α is given by (6.14)
with x = 0, which gives
α̂ ± a se √(1/n + x̄²/Sxx).   (6.15)
One can see from (6.15) that if x̄ is large in magnitude (which means the average xi
is large), then the confidence interval for α will be very wide. This would be disturbing if
the value x = 0 were of interest, but often it is not. In the following example it refers
to a building of area x = 0, which is nonsensical!
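A short R sketch of the interval (6.14), using the quantities from the earlier illustrative sketch (x0 is a hypothetical value of the covariate at which the mean response is to be estimated):
x0 <- 2.0                                            # hypothetical covariate value
muhat <- alphahat + betahat * x0
half <- qt(0.975, n - 2) * se * sqrt(1/n + (x0 - mean(x))^2 / Sxx)
c(muhat - half, muhat + half)                        # 95% confidence interval for mu(x0)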
Remark: The results of the analyses below can be obtained using the R function lm,
with the command lm(y ~ x). We give the detailed results below to illustrate how the
calculations are carried out.
For the building data, n = 30, x̄ = 0.9543, ȳ = 548.9700, Sxx = 22.9453, Sxy = −3316.6771, Syy = 489,624.723,
so we find β̂ = Sxy/Sxx = −144.55 and α̂ = ȳ − β̂ x̄ = 686.9.
(Note that when calculating these values using a calculator you should use as many decimal
places as possible, otherwise the values are affected by round-off error.) Since β̂ is negative,
this implies that the larger sized buildings tend to sell for less per square meter. (The
estimate β̂ = −144.55 indicates a drop in average price of $144.55 per square meter for
each increase of one unit in x; remember x's units are m² × 10⁵.)
The line y = α̂ + β̂ x is often called the fitted regression line for y on x. If we plot the
fitted line on the same graph as the scatterplot of points (xi, yi), i = 1, ..., n, as in Figure
6.3, we see the fitted line passes close to the points.
A confidence interval for α is not of major interest in the setting here, where the data
were called on to indicate a fair assessment value for a large building with x = 4.47. One
way to address this is to estimate μ(x) when x = 4.47. We get the maximum likelihood
estimate for μ(4.47) as
μ̂(4.47) = α̂ + β̂(4.47) = $40.79
which we note is much below the assessed value of $75 per square meter. However, one
can object that there is uncertainty in this estimate, and that it would be better to give a
confidence interval for μ(4.47). Using (6.14) and P(T ≤ 2.0484) = 0.975 for T ~ t(28) we
get a 95% confidence interval for μ(4.47) as
μ̂(4.47) ± 2.0484 se √(1/30 + (4.47 − x̄)²/Sxx)
= $40.79 ± $29.58
= [$11.21, $70.37].
Figure 6.3: Scatterplot and fitted line (y = 686.9 − 144.5x) for building price versus size
However (playing lawyer for the assessor), we could raise another objection: we are
considering a single building but we have constructed a confidence interval for the average
of all buildings of size x = 4.47 (× 10⁵) m². The constructed confidence interval is for a point
on the line, not a point Y generated by adding to α + β(4.47) the random error R ~ G(0, σ),
which has a non-negligible variance. This suggests that what we should do is predict the
y value for a building with x = 4.47, instead of estimating μ(4.47). We will temporarily
leave the example in order to develop a method for this.
Since R is independent of μ̃(x) (it is not connected to the existing sample), the difference
Y − μ̃(x) in (6.16) is the sum of independent Normally distributed random variables and is consequently Normally
distributed. Since E[Y − μ̃(x)] = 0 and
Var[Y − μ̃(x)] = σ² [ 1 + 1/n + (x − x̄)²/Sxx ],
we have
Y − μ̃(x) ~ G( 0, σ [1 + 1/n + (x − x̄)²/Sxx]^{1/2} ).   (6.17)
Since (6.17) holds independently of (6.9), by Theorem 32 we obtain the pivotal
quantity
[Y − μ̃(x)] / [ Se √(1 + 1/n + (x − x̄)²/Sxx) ] ~ t(n − 2).
For an interval estimate with confidence coefficient p we choose a such that
p = P(−a ≤ T ≤ a) where T ~ t(n − 2). Since
p = P( −a ≤ [Y − μ̃(x)] / [Se √(1 + 1/n + (x − x̄)²/Sxx)] ≤ a )
= P( μ̃(x) − a Se √(1 + 1/n + (x − x̄)²/Sxx) ≤ Y ≤ μ̃(x) + a Se √(1 + 1/n + (x − x̄)²/Sxx) ),
the interval
μ̂(x) ± a se √(1 + 1/n + (x − x̄)²/Sxx)   (6.18)
is an interval estimate for Y with confidence coefficient p.
This interval is usually called a 100p% prediction interval instead of a confidence interval,
since Y is not a parameter but a "future" observation.
Remark: Care must be taken in constructing prediction intervals for values of x which
lie outside the interval of observed xi's, since this assumes that the linear relationship holds
beyond the observed data. This is dangerous since there are no data to support the assumption.
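In R, both the confidence interval (6.14) for μ(x) and the prediction interval (6.18) for Y can be obtained from a fitted lm object with predict; a sketch, reusing the illustrative x and y from the earlier sketch and a hypothetical new covariate value:
fit <- lm(y ~ x)
new <- data.frame(x = 2.0)                              # hypothetical new covariate value
predict(fit, newdata = new, interval = "confidence", level = 0.95)
predict(fit, newdata = new, interval = "prediction", level = 0.95)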
The lower limit is negative, which is nonsensical. This happened because we were using
a Gaussian model (Gaussian random variables Y can be positive or negative) in a setting
where the price Y must be positive. Nonetheless, the Gaussian model fits the data reasonably
well. We might just truncate the prediction interval and take it to be [0, $89.83].
Now we find that the assessed value of $75 is inside this interval. On this basis it's
difficult to say that the assessed value is unfair (though it is towards the high end of
the prediction interval). Note also that the value x = 4.47 of interest is well outside the
interval of observed x values, which was [0.20, 3.26] in the data set of 30 buildings. Thus any
conclusions we reach are based on an assumption that the linear model E(Y | x) = α + βx
applies beyond x = 3.26, at least as far as x = 4.47. This may or may not be true, but we
have no way to check it with the data we have.
There is a slight suggestion in Figure 6.3 that Var(Y) may be smaller for larger x values.
There are not sufficient data to check this either. We mention these points because an
important companion to every statistical analysis is a qualification of the conclusions based
on a careful examination of the applicability of the assumptions underlying the analysis.
Remark: Note from (6.14) and (6.18) that the confidence interval for μ(x) and the prediction
interval for Y are wider the further away x is from x̄. Thus, as we move further away
from the "middle" of the x's in the data, we get wider and wider intervals for μ(x) and Y.
so we find
β̂ = Sx1y / Sx1x1 = 0.6368 / 0.2244 = 2.8378,
α̂ = ȳ − β̂ x̄1 = 1.979 − (2.8378)(0.11) = 1.6668,
se² = (1/(n − 2)) (Syy − β̂ Sx1y) = (1/28) [1.88147 − (2.8378)(0.6368)] = 0.002656,
and se = 0.05154.
The fitted regression line y = α̂ + β̂ x1 is shown on the scatterplot in Figure 6.4. The model
appears to fit the data well.
Figure 6.4: Scatterplot plus fitted line (y = 1.67 + 2.84x1) for strength versus diameter squared
The parameter β represents the increase in average strength μ(x1) from increasing x1 =
x² by one unit. Using the pivotal quantity (6.10) and the fact that P(T ≤ 2.0484) = 0.975
for T ~ t(28), we obtain the 95% confidence interval for β as
β̂ ± 2.0484 se/√Sx1x1 = 2.8378 ± 0.2228 = [2.6149, 3.0606].
Table 6.2
Summary of Distributions for Simple Linear Regression
β̃ = Sxy/Sxx: Gaussian, with E(β̃) = β and standard deviation σ (1/Sxx)^{1/2}
(β̃ − β)/(Se/√Sxx), where Se² = (Syy − β̃ Sxy)/(n − 2): Student t with n − 2 degrees of freedom
α̃ = Ȳ − β̃ x̄: Gaussian, with E(α̃) = α and standard deviation σ [1/n + x̄²/Sxx]^{1/2}
[Y − μ̃(x)] / [Se √(1 + 1/n + (x − x̄)²/Sxx)]: Student t with n − 2 degrees of freedom
(n − 2) Se²/σ²: Chi-squared with n − 2 degrees of freedom
(1) The assumption that Yi (given any covariates xi) is Gaussian with constant standard
deviation σ.
(2) The assumption that E(Yi) = μ(xi) is a linear combination of observed covariates
with unknown coefficients.
Models should always be checked. In problems with only one x covariate, a plot of
the fitted line superimposed on the scatterplot of the data (as in Figures 6.3 and 6.4)
shows pretty clearly how well the model fits. If there are two or more covariates in the
model, residual plots, which are described below, are very useful for checking the model
assumptions.
Residuals are defined as the difference between the observed response and the fitted
values. Consider the simple linear regression model for which Yi ~ G(μi, σ) where
μi = α + βxi and Ri = Yi − μi ~ G(0, σ), i = 1, 2, ..., n independently. The residuals are
given by
r̂i = yi − μ̂i = yi − α̂ − β̂ xi for i = 1, ..., n.
The idea behind the r̂i's is that they can be thought of as "observed" Ri's. This isn't
exactly correct since we are using μ̂i instead of μi in r̂i, but if the model is correct, then
the r̂i's should behave roughly like a random sample from the G(0, σ) distribution. The
r̂i's do have some features that can be used to check the model assumptions. Recall that
the maximum likelihood estimate of α is α̂ = ȳ − β̂ x̄, which implies that ȳ − α̂ − β̂ x̄ = 0, or
0 = ȳ − α̂ − β̂ x̄ = (1/n) Σ_{i=1}^{n} (yi − α̂ − β̂ xi) = (1/n) Σ_{i=1}^{n} r̂i,
so the residuals sum to zero.
(1) Plot the points (xi, r̂i), i = 1, ..., n. If the model is satisfactory the points should lie
more or less horizontally within a constant band around the line r̂i = 0 (see Figure
6.5).
(2) Plot the points (μ̂i, r̂i), i = 1, ..., n. If the model is satisfactory the points should lie
more or less horizontally within a constant band around the line r̂i = 0.
(3) Plot a Normal qqplot of the residuals r̂i. If the model is satisfactory the points should
lie more or less along a straight line.
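A minimal R sketch of these three checks for a fitted simple linear regression (x and y are the covariate and response vectors from the earlier illustrative sketch; the standardization by se is the simple one used in these notes):
fit <- lm(y ~ x)
rhat <- residuals(fit)                      # residuals y_i - muhat_i
sehat <- summary(fit)$sigma                 # estimate of sigma (se)
rstar <- rhat / sehat                       # standardized residuals
plot(x, rstar); abline(h = 0)               # (1) residuals versus the covariate
plot(fitted(fit), rstar); abline(h = 0)     # (2) residuals versus fitted values
qqnorm(rstar); qqline(rstar)                # (3) Normal qqplot of the residuals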
Figure 6.5: Residual plot for example in which model assumptions hold
Departures from the "expected" pattern may suggest problems with the model. For
example, the plot in Figure 6.6 suggests the mean function μi = μ(xi) is not correctly specified.
The pattern of points suggests that assuming a quadratic form for the mean, such as
μ(xi) = α + β xi + γ xi², might give a better fit to the data than μ(xi) = α + β xi.
Figure 6.7 suggests that for these data the variance is non-constant. Sometimes transforming
the response can solve this problem. Transformations such as log y and √y are
frequently used.
Figure 6.6: Residual plot suggesting that the assumed form of the mean function μ(x) is not correct
Reading these plots requires practice. You should try not to read too much into plots
particularly if the plots are based on a small number of points.
Figure 6.7: Example of residual plot which indicates that the assumption Var(Yi) = σ² is not reasonable
Figure 6.8: Standardized residuals versus diameter squared for the bolt data
A qqplot of the standardized residuals is given in Figure 6.9. Since the points lie
reasonably along a straight line the Gaussian assumption seems reasonable. Remember
that, since the quantiles of the Normal distribution change more rapidly in the tails of the
distribution, we expect the points at both ends of the line to lie further from the line.
Figure 6.9: Normal qqplot of the standardized residuals for the bolt data
and obtain the conclusions below as a special case of the linear model. Below we derive the
estimates from the likelihood directly.
The likelihood function for μ1, μ2, σ is
L(μ1, μ2, σ) = ∏_{j=1}^{2} ∏_{i=1}^{nj} [1/(√(2π) σ)] exp[ −(1/(2σ²)) (yji − μj)² ].
Maximizing this gives the maximum likelihood estimators
μ̃1 = (1/n1) Σ_{i=1}^{n1} Y1i = Ȳ1,
μ̃2 = (1/n2) Σ_{i=1}^{n2} Y2i = Ȳ2,
and σ̃² = [1/(n1 + n2)] Σ_{j=1}^{2} Σ_{i=1}^{nj} (Yji − μ̃j)².
where
Sj² = [1/(nj − 1)] Σ_{i=1}^{nj} (Yji − Ȳj)², j = 1, 2,
are the sample variances obtained from the individual samples. The estimator Sp² can be
written as a weighted average of the estimators Sj². In fact
Sp² = (w1 S1² + w2 S2²)/(w1 + w2)   (6.19)
where the weights are wj = nj 1. Although you could substitute weights other than
nj 1 in (6.19)42 , when you pool various estimators in order to obtain one that is better
than any of those being pooled, you should do so with weights that relate to a measure of
precision of the estimators. For sample variances, the number of degrees of freedom is such
an indicator.
We will use the estimator Sp² for σ² rather than σ̃², since E(Sp²) = σ².
To determine whether the two populations differ and by how much, we will need to generate
confidence intervals for the difference μ1 − μ2. First note that the maximum likelihood
estimator of this difference is Ȳ1 − Ȳ2, which has expected value
E(Ȳ1 − Ȳ2) = μ1 − μ2
and variance
Var(Ȳ1 − Ȳ2) = Var(Ȳ1) + Var(Ȳ2) = σ²/n1 + σ²/n2 = σ² (1/n1 + 1/n2).
This variance is estimated by
Sp² (1/n1 + 1/n2),
and the estimator Sp² has n1 − 1 + n2 − 1 = n1 + n2 − 2 degrees of freedom. This provides at least
an intuitive justification for the following:
Theorem 41 If Y11, Y12, ..., Y1n1 is a random sample from the G(μ1, σ) distribution and
independently Y21, Y22, ..., Y2n2 is a random sample from the G(μ2, σ) distribution then
[(Ȳ1 − Ȳ2) − (μ1 − μ2)] / [Sp √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2)
and
(n1 + n2 − 2) Sp²/σ² = (1/σ²) Σ_{j=1}^{2} Σ_{i=1}^{nj} (Yji − Ȳj)² ~ χ²(n1 + n2 − 2).
42
you would most likely be tempted to use w1 = w2 = 1=2:
estimated by
sp² = (1/22) [ Σ_{i=1}^{12} (y1i − ȳ1)² + Σ_{i=1}^{12} (y2i − ȳ2)² ].
To test H0: μ1 − μ2 = 0 we use the test statistic
D = |Ȳ1 − Ȳ2 − 0| / [Sp √(1/12 + 1/12)] = |Ȳ1 − Ȳ2| / [Sp √(1/12 + 1/12)].
This gives μ̂1 − μ̂2 = ȳ1 − ȳ2 = 1.4 and sp² = 2.3964. The observed value of the test statistic
is
d = |ȳ1 − ȳ2| / [sp √(1/12 + 1/12)] = 1.4 / √(2.3964 × (1/6)) = 2.22.
⁴³ If the sample variances differed by a great deal we would not make this assumption. Unfortunately, if
the variances are not assumed equal the problem becomes more difficult.
with
p-value = P(|T| ≥ 2.22) = 2 [1 − P(T ≤ 2.22)] = 0.038
where T ~ t(22). There is evidence based on the data against H0: μ1 = μ2.
Since ȳ1 > ȳ2, the indication is that paint A keeps its visibility better. A 95% confidence
interval for μ1 − μ2 based on (6.21) is obtained using
Remark: The R function t.test will carry out the test above and will give confidence
intervals for μ1 − μ2. This can be done with the command t.test(y1, y2, var.equal=TRUE),
where y1 and y2 are the data vectors from samples 1 and 2.
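For example, with hypothetical data vectors y1 and y2 of 12 measurements each (illustrative values only, not the paint data), the pooled analysis above could be reproduced as follows:
y1 <- c(7.3, 6.9, 8.1, 7.7, 6.5, 7.9, 8.4, 7.1, 6.8, 7.6, 8.0, 7.2)   # hypothetical sample 1
y2 <- c(6.2, 5.9, 6.8, 6.4, 5.5, 6.6, 7.1, 5.8, 5.6, 6.3, 6.7, 6.0)   # hypothetical sample 2
sp2 <- ((length(y1) - 1) * var(y1) + (length(y2) - 1) * var(y2)) /
       (length(y1) + length(y2) - 2)                      # pooled variance, as in (6.19)
d <- abs(mean(y1) - mean(y2)) / sqrt(sp2 * (1/12 + 1/12)) # observed test statistic
2 * (1 - pt(d, df = 22))                                  # p-value
t.test(y1, y2, var.equal = TRUE)                          # same test plus a 95% CI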
[Ȳ1 − Ȳ2 − (μ1 − μ2)] / √(S1²/n1 + S2²/n2)   (6.22)
small; the standard deviations s1 = 1.13 and s2 = 1.97 do not provide evidence against
the hypothesis that σ1 = σ2 if a likelihood ratio test is carried out. Nevertheless, let us
use (6.22) to obtain a 95% confidence interval for μ1 − μ2. The resulting approximate 95%
confidence interval is
ȳ1 − ȳ2 ± 1.96 √(s1²/n1 + s2²/n2).   (6.23)
For the given data this equals 1.4 ± 1.24, or [0.16, 2.64], which is not much different than
the interval obtained assuming the two Gaussian distributions have the same standard deviation.
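When the two standard deviations are not assumed equal, R's t.test uses the pivotal (6.22) by default (with an adjusted t approximation rather than the G(0, 1) approximation in (6.23)); a sketch, reusing the hypothetical y1 and y2 above:
t.test(y1, y2)          # var.equal = FALSE is the default (unequal variances)
# the large-sample interval (6.23) computed directly:
mean(y1) - mean(y2) + c(-1, 1) * 1.96 * sqrt(var(y1)/length(y1) + var(y2)/length(y2))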
The average score is somewhat higher in District 1, but is this difference statistically
significant? We will give a confidence interval for the difference in average scores in a model
representing this setting. This is done by thinking of the students in each district as a
random sample from a conceptual large population of "similar" students writing "similar"
tests. We assume that the scores in District 1 have a G(μ1, σ1) distribution and that
the scores in District 2 have a G(μ2, σ2) distribution. We can then test the hypothesis
H0: μ1 = μ2 or alternatively construct a confidence interval for the difference μ1 − μ2.
(Achievement tests are usually designed so that the scores are approximately Gaussian, so
this is a sensible procedure.)
Since n1 = 278 and n2 = 345 we use (6.23) to construct an approximate 95% confidence
interval for μ1 − μ2. We obtain
60.2 − 58.1 ± 1.96 √((10.16)²/278 + (9.02)²/345) = 2.1 ± (1.96)(0.779) or [0.57, 3.63].
Since μ1 − μ2 = 0 is outside the approximate 95% confidence interval (can you show that
it is also outside the approximate 99% confidence interval?) we can conclude there is fairly
strong evidence against the hypothesis H0: μ1 = μ2, suggesting that μ1 > μ2. We should
not rely only on a comparison of their means. It is a good idea to look carefully at the data
and the distributions suggested for the two groups using histograms or boxplots.
The mean is a little higher for District 1 and, because the sample sizes are so large, this
gives a "statistically significant" difference in a test of H0: μ1 = μ2. However, it would
be a mistake⁴⁴ to conclude that the actual difference in the two distributions is very large.
Unfortunately, "significant" tests like this are often used to make claims that one group
or class or school is "superior" to another, and such conclusions are unwarranted if, as is
often the case, the assumptions of the test are not satisfied.
and the difference μ1 − μ2. However, the heights of related persons are not independent,
so to estimate μ1 − μ2 the method in the preceding section should not be used since it
required that we have independent random samples of males and females. In fact, the
primary reason for collecting these data was to consider the joint distribution of Y1i, Y2i and
to examine their relationship. A clear picture of the relationship is obtained by plotting
the points (Y1i, Y2i) in a scatterplot.
consumptions Y1i, Y2i for the i'th car are related, because factors such as size, weight and
engine size (and perhaps the driver) affect consumption. As in the preceding example
it would not be appropriate to treat the Y1i's (i = 1, ..., 50) and Y2i's (i = 1, ..., 50)
as two independent samples from larger populations. The observations have been paired
deliberately to eliminate some factors (like driver/car size) which might otherwise affect
the conclusion. Note that in this example it may not be of much interest to consider E(Y1i)
and E(Y2i) separately, since there is only a single observation on each car type for either
fuel.
Two types of Gaussian models are used to represent settings involving paired data.
The first involves what is called a Bivariate Normal distribution for (Y1i, Y2i), and it could
be used in the fuel consumption example. This is a continuous bivariate model for which
each component has a Normal distribution and the components may be dependent. We
will not describe this model here⁴⁷ (it is studied in third year courses), except to note one
fundamental property: if (Y1i, Y2i) has a Bivariate Normal distribution then the difference
between the two is also Normally distributed;
Y1i − Y2i ~ N(μ1 − μ2, σ²)   (6.24)
The second type of model includes an unknown constant for each pair. These pair effects
represent factors specific to the different pairs, so that some pairs can have larger (smaller)
expected values than others. This model also gives a Gaussian distribution like (6.24), since
the pair effect cancels when the difference Y1i − Y2i is taken.
This model seems relevant for Example 6.3.2, where the i'th pair corresponds to the i'th car type.
Thus, whenever we encounter paired data in which the variation in variables Y1i and
Y2i is adequately modeled by Gaussian distributions, we will make inferences about μ1 − μ2
by working with the model (6.24).
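In R a paired analysis of this kind reduces to a one-sample analysis of the within-pair differences; a sketch with hypothetical paired vectors y1 and y2 of equal length (illustrative values only):
y1 <- c(10.2, 11.5, 9.8, 12.0, 10.9, 11.1)   # first measurement on each pair (hypothetical)
y2 <- c( 9.6, 10.9, 9.9, 11.2, 10.1, 10.8)   # second measurement on each pair (hypothetical)
d  <- y1 - y2                        # differences, modeled as in (6.24)
t.test(d)                            # CI and test for mu1 - mu2 based on the differences
t.test(y1, y2, paired = TRUE)        # equivalent paired analysis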
⁴⁷ For Stat 241: Let Y = (Y1, ..., Yk)ᵀ be a k × 1 random vector with E(Yi) = μi and Cov(Yi, Yj) = σij,
i, j = 1, ..., k. (Note: Cov(Yi, Yi) = σii = Var(Yi) = σi².) Let μ = (μ1, ..., μk)ᵀ be the mean vector and
Σ be the k × k symmetric covariance matrix whose (i, j) entry is σij. Suppose also that Σ⁻¹ exists. If the joint
p.d.f. of (Y1, ..., Yk) is given by
f(y1, ..., yk) = [1/((2π)^{k/2} |Σ|^{1/2})] exp[ −(1/2)(y − μ)ᵀ Σ⁻¹ (y − μ) ], y ∈ ℝᵏ,
where y = (y1, ..., yk)ᵀ, then Y is said to have a Multivariate Normal distribution. The case k = 2 is called
Bivariate Normal.
ȳ = 4.895 inches and s² = (1/1400) Σ_{i=1}^{1401} (yi − ȳ)² = 6.5480 (inches)².
The pivotal quantity (Ȳ − μ)/(S/√n) has a t(1400) distribution, so a two-sided 95% confidence interval for μ = E(Yi) is given
by ȳ ± 1.96 s/√n where n = 1401. (Note that t(1400) is indistinguishable from G(0, 1).)
This gives the 95% confidence interval 4.895 ± 0.134 inches or [4.76, 5.03] inches.
Remark: The method above assumes that the (brother, sister) pairs are a random sample
from the population of families with a living adult brother and sister. The question arises
as to whether E(Yi) also represents the difference in the average heights of all adult males
and all adult females (call them μ1′ and μ2′) in the population. Presumably μ1′ = μ1 (i.e.
the average height of all adult males equals the average height of all adult males who also
have an adult sister) and similarly μ2′ = μ2, so E(Yi) does represent this difference. This is
true provided that the males in the sibling pairs are randomly sampled from the population
of all adult males, and similarly the females, but it might be worth checking.
Recall our earlier Example 1.3.1 involving the difference in the average heights of males
and females in New Zealand. This gave the estimate μ̂1 − μ̂2 = ȳ1 − ȳ2 = 68.72 − 64.10 = 4.62
inches, which is a little less than the difference in the example above. This is likely due to
the fact that we are considering two distinct populations, but it should be noted that the
New Zealand data are not paired.
We note that it is slightly wider than the 95% confidence interval [4.76, 5.03] obtained
using the pairings.
To see why the pairing is helpful in estimating the mean difference μ1 − μ2, suppose that
Y1i ~ G(μ1, σ1) and Y2i ~ G(μ2, σ2), but that Y1i and Y2i are not necessarily independent
(i = 1, 2, ..., n). The estimator of μ1 − μ2 is
Ȳ1 − Ȳ2.
⁴⁸ from the old Stat 231 notes of MacKay and Oldford
Remark: The results here can be obtained using the R function t.test.
Exercise: Compute the p-value for the test of the hypothesis H0: μ1 − μ2 = 0, using the test
statistic (5.1).
Final Remarks: When you see data from a comparative study (that is, one whose
objective is to compare two distributions, often through their means), you have to determine
whether it involves paired data or not. Of course, a sample of Y1i's and Y2i's cannot be from
a paired study unless there are equal numbers of each, but if there are equal numbers the
study might be either "paired" or "unpaired". Note also that there is a subtle difference in
the study populations in paired and unpaired studies. In the former it is pairs of individual
units that form the population, whereas in the latter there are (conceptually at least)
separate individual units for the Y1 and Y2 measurements.
Yi ~ G(μi, σ) with μi = μ(xi) = Σ_{j=1}^{k} βj xij for i = 1, 2, ..., n independently.
(Note: To facilitate the matrix proof below we have taken β0 = 0 in (6.1). The estimator of
β0 can be obtained from the result below by letting xi1 = 1 for i = 1, ..., n and β0 = β1.)
For convenience we define the n × k (where n > k) matrix X of covariate values, whose
(i, j) element is xij, and the n × 1 vector of responses Y = (Y1, ..., Yn)ᵀ. We assume that the values xij
are non-random quantities which we observe. We now summarize some results about the
maximum likelihood estimators of the parameters β = (β1, ..., βk)ᵀ and σ.
Maximum Likelihood Estimators of β = (β1, ..., βk)ᵀ and of σ
β̃ = (XᵀX)⁻¹ Xᵀ Y   (6.25)
and
σ̃² = (1/n) Σ_{i=1}^{n} (Yi − μ̃i)² where μ̃i = Σ_{j=1}^{k} β̃j xij   (6.26)
l(β, σ) = log L(β, σ)
= −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − μi)².
Note that if we take the derivative with respect to a particular βj and set this derivative
equal to 0, we obtain
∂l/∂βj = (1/σ²) Σ_{i=1}^{n} (yi − μi) ∂μi/∂βj = 0
or
Σ_{i=1}^{n} (yi − μi) xij = 0
⁴⁹ May be omitted in Stat 231/221
for each j = 1, 2, ..., k. In terms of the matrix X and the vector y = (y1, ..., yn)ᵀ we can
rewrite this system of equations more compactly as
Xᵀ(y − Xβ) = 0 or Xᵀy = XᵀXβ.
Assuming that the k × k matrix XᵀX has an inverse we can solve these equations to obtain
the maximum likelihood estimate of β, in matrix notation, as
β̂ = (XᵀX)⁻¹ Xᵀ y
with corresponding maximum likelihood estimator
β̃ = (XᵀX)⁻¹ Xᵀ Y.
In order to find the maximum likelihood estimator of σ, we take the derivative with respect
to σ and set the derivative equal to zero, obtaining
∂l/∂σ = ∂/∂σ [ −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − μi)² ] = 0
or
−n/σ + (1/σ³) Σ_{i=1}^{n} (yi − μi)² = 0,
from which we obtain the maximum likelihood estimate of σ² as
σ̂² = (1/n) Σ_{i=1}^{n} (yi − μ̂i)²
where
μ̂i = Σ_{j=1}^{k} β̂j xij.
Recall that when we estimated the variance for a single sample from the Gaussian
distribution we considered a minor adjustment to the denominator, and with this in mind
we also define the following estimator⁵⁰ of the variance σ²:
Se² = [1/(n − k)] Σ_{i=1}^{n} (Yi − μ̃i)² = [n/(n − k)] σ̃².
Note that for large n there will be small differences between the observed values of σ̃² and
Se².
⁵⁰ It is clear why we needed to assume k < n: otherwise n − k ≤ 0 and we have no "degrees of freedom"
left for estimating the variance.
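A minimal R sketch of these matrix formulas, with a small simulated covariate matrix X whose first column is a column of ones (so that β1 plays the role of the intercept); all numbers are illustrative:
set.seed(1)
n <- 20; k <- 3
X <- cbind(1, runif(n), runif(n))                   # n x k matrix of covariate values
beta <- c(2, 1, -0.5)
y <- as.vector(X %*% beta + rnorm(n, sd = 0.3))     # simulated responses
betahat <- solve(t(X) %*% X, t(X) %*% y)            # (X'X)^{-1} X'y, equation (6.25)
muhat <- as.vector(X %*% betahat)
s2e <- sum((y - muhat)^2) / (n - k)                 # Se^2 with n - k in the denominator
betahat; s2e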
Theorem 43 1. The estimators β̃j are all Normally distributed random variables with
expected value βj and with variance given by the j'th diagonal element of the matrix
σ²(XᵀX)⁻¹, j = 1, 2, ..., k.
Proof.⁵² The estimator β̃j can be written using (6.25) as a linear combination of the
Normal random variables Yi,
β̃j = Σ_{i=1}^{n} bji Yi,
where the coefficients bji are the elements of the j'th row of the matrix B = (XᵀX)⁻¹Xᵀ, so
that β̃ = BY. Therefore
E(β̃j) = Σ_{i=1}^{n} bji E(Yi)
= Σ_{i=1}^{n} bji μi   where μi = Σ_{l=1}^{k} βl xil.
Note that μi is the i'th component of the vector Xβ, which implies that E(β̃j)
is the j'th component of the vector BXβ. But since BX is the identity matrix, this is
the j'th component of the vector β, or βj. Thus E(β̃j) = βj for all j. The calculation of
the variance is similar:
Var(β̃j) = Σ_{i=1}^{n} bji² Var(Yi)   since the Yi are independent random variables
= σ² Σ_{i=1}^{n} bji²,
and an easy matrix calculation shows, since BBᵀ = (XᵀX)⁻¹, that Σ_{i=1}^{n} bji² is the j'th
diagonal element of the matrix (XᵀX)⁻¹. We will not attempt to prove part (3) here;
it is usually proved in a subsequent statistics course.
52
This proof can be omitted for Stat 231.
Remark: The maximum likelihood estimate β̂ is also called a least squares estimate
of β, in that it is obtained by taking the sum of squared vertical distances between the
observations Yi and the corresponding fitted values μ̂i and then adjusting the values of the
estimated βj until this sum is minimized. Least squares is a method of estimation in linear
models that predates the method of maximum likelihood. Problem 16 describes the method
of least squares.
Remark:⁵³ From Theorem 39 we can obtain confidence intervals and test hypotheses for
the regression coefficients using the pivotal
(β̃j − βj) / (Se √cj) ~ t(n − k)   (6.28)
where cj is the j'th diagonal element of the matrix (XᵀX)⁻¹. Choosing a such that
P( −a ≤ (β̂j − βj)/(se √cj) ≤ a ) = p, where the probability is computed from the t(n − k)
distribution, we obtain the 100p% confidence interval
β̂j − a se √cj ≤ βj ≤ β̂j + a se √cj
where
se² = [1/(n − k)] Σ_{i=1}^{n} (yi − μ̂i)² and μ̂i = Σ_{j=1}^{k} β̂j xij.
⁵³ Recall: if Z ~ G(0, 1) and W ~ χ²(m) independently, then the random variable T = Z/√(W/m) ~ t(m).
Let Z = (β̃j − βj)/(σ√cj), W = (n − k)Se²/σ² and m = n − k to obtain this result.
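Continuing the matrix sketch above (same simulated X, y, betahat and s2e), the pivotal (6.28) gives confidence intervals for the individual coefficients; they can also be read off from confint applied to an lm fit:
cj <- diag(solve(t(X) %*% X))                       # c_j = j'th diagonal element of (X'X)^{-1}
se <- sqrt(s2e)
a  <- qt(0.975, df = n - k)
cbind(betahat - a * se * sqrt(cj),
      betahat + a * se * sqrt(cj))                  # 95% confidence intervals for each beta_j
# the same intervals from lm (the "- 1" removes lm's own intercept since X already has one):
confint(lm(y ~ X - 1))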
We now consider a special case of the Gaussian response models. We have already
seen this case in Chapter 4, but it provides a simple example to validate the more general
formulae.
(n − 1)S²/σ² ~ χ²(n − 1)
x y x y x y x y x y
46 136 37 115 58 139 48 134 59 142
36 132 45 129 50 156 35 120 54 135
62 138 39 127 41 132 42 137 57 150
26 115 28 134 31 115 27 120 60 159
53 143 32 133 51 143 34 128 38 127
x = 43:20 y = 133:56
Sxx = 2802:00 Syy = 3284:16 Sxy = 2325:20
To analyze these data assume the simple linear regression model: Yi ~ G(α + βxi, σ),
i = 1, ..., 25 independently.
(a) Give the maximum likelihood (least squares) estimates of α and β and an unbiased
estimate of σ².
(b) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(c) Construct a 95% confidence interval for β. What is the interpretation of this
interval?
(d) Construct a 90% confidence interval for the mean systolic blood pressure of
nurses aged x = 35.
(e) Construct a 99% prediction interval for the systolic blood pressure Y of a nurse
aged x = 50.
(f) Construct a 95% confidence interval for the mean amount grossed by movies for
actors whose value is x = 50. Construct a 95% confidence interval for the mean
amount grossed by movies for actors whose value is x = 100. What assumption
is being made in constructing the interval for x = 100?
(a) Construct a 95% con…dence interval for the mean breaking strength of bolts of
diameter x = 0:35, that is, x1 = (0:35)2 = 0:1225.
(b) Construct a 95% prediction interval for the breaking strength Y of a single bolt
of diameter x = 0:35. Compare this with the interval in (a).
(c) Suppose that a bolt of diameter 0:35 is exposed to a large force V that could
potentially break it. In structural reliability and safety calculations, V is treated
as a random variable and if Y represents the breaking strength of the bolt (or
some other part of a structure), then the probability of a “failure”of the bolt is
P (V > Y ). Give a point estimate of this value if V G(1:60; 0:10), where V
and Y are independent.
4. There are often both expensive (and highly accurate) and cheaper (and less accurate)
ways of measuring concentrations of various substances (e.g. glucose in human blood,
salt in a can of soup). The table below gives the actual concentration x (determined
by an expensive but very accurate procedure) and the measured concentration y
obtained by a cheap procedure, for each of 20 units.
x y x y x y x y
4:01 3:7 13:81 13:02 24:85 24:69 36:9 37:54
6:24 6:26 15:9 16 28:51 27:88 37:26 37:2
8:12 7:8 17:23 17:27 30:92 30:8 38:94 38:4
9:43 9:78 20:24 19:9 31:44 31:03 39:62 40:03
12:53 12:4 24:81 24:9 33:22 33:01 40:15 39:4
x = 23:7065 y = 23:5505
Sxx = 2818:946855 Syy = 2820:862295 Sxy = 2818:556835
To analyze these data assume the regression model: Yi ~ G(α + βxi, σ), i = 1, ..., 20
independently.
(a) Fit the model to these data. Use the plots discussed in Section 6.2 to check the
adequacy of the model.
(b) Construct a 95% confidence interval for the slope β and test the hypothesis
β = 1. Construct a 95% confidence interval for the intercept α and test the
hypothesis α = 0. Why are these hypotheses of interest?
(c) Describe brie‡y how you would characterize the cheap measurement process’s
accuracy to a lay person.
(d) If the units to be measured have true concentrations in the range 0 40, do you
think that the cheap method tends to produce a value that is lower than the true
concentration? Support your answer based on the data and the assumed model.
(a) Show that
β̂ = (Σ_{i=1}^{n} xi yi) / (Σ_{i=1}^{n} xi²)
is the maximum likelihood estimate of β and also the least squares estimate of β.
(b) Show that
β̃ = (Σ_{i=1}^{n} xi Yi) / (Σ_{i=1}^{n} xi²) ~ N( β, σ² / Σ_{i=1}^{n} xi² ).
Hint: Write β̃ in the form Σ_{i=1}^{n} ai Yi.
(c) Prove the identity
Σ_{i=1}^{n} (yi − β̂ xi)² = Σ_{i=1}^{n} yi² − (Σ_{i=1}^{n} xi yi)² / Σ_{i=1}^{n} xi².
7. The following data were recorded concerning the relationship between drinking
(x = per capita wine consumption) and y = death rate from cirrhosis of the liver in
n = 46 states of the U.S.A. (for simplicity the data has been rounded):
x y x y x y x y x y x y
5 41 12 77 7 67 4 52 7 41 16 91
4 32 7 57 18 57 16 87 13 67 2 30
3 39 14 81 6 38 9 67 8 48 6 28
7 58 12 34 31 130 6 40 28 123 3 52
11 75 10 53 13 70 6 56 23 92 8 56
9 60 10 55 20 104 21 58 22 76 13 56
6 54 14 58 19 84 15 74 23 98
3 48 9 63 10 66 17 98 7 34
x = 11:5870 y = 63:5870
Sxx = 2155:1522 Syy = 24801:1521 Sxy = 6175:1522
8. Skinfold body measurements are used to approximate the body density of individuals.
The data on n = 92 men, aged 20-25, where x = skinfold measurement and Y = body
density are given in Appendix C as well as being posted on the course website.
Note: The R function lm, with the command lm(y~x), gives the detailed calculations
for linear regression. The command summary(lm(y~x)) also gives useful output.
>Dataset<-read.table("Skinfold Data.txt",header=T,sep="",strip.white=T)
# reads data and headers from file Skinfold Data.txt
>RegModel <-lm(BodyDensity~Skinfold,data=Dataset)
# runs regression Bodydensity=a+b*Skinfold
>summary(RegModel) # summary of output on next page
Diagnostic plots.
>x<-Dataset$Skinfold
>y<-Dataset$BodyDensity
>muhat<-1.161139-0.062066*x
>plot(x,y)
>points(x,muhat,type="l")
>title(main="Scatterplot of Skinfold/BodyDensity with fitted line")
Residual Plots
>r<- RegModel$residuals
>x<- Dataset$Skinfold
>plot(x,r)
>title(main="residual plot: Skinfold vs residual")
>muhat=1.161139-0.062066*x
>plot(muhat,r)
>title(main="residual plot: fitted values vs residual")
>rstar <- r/0.007877
>plot(muhat,rstar)
(a) Run the R code given. What do the scatterplot and residual plots indicate about
the …t of the model?
(b) Do you think that the skinfold measurements provide a reasonable approximation
to the Body Density?
9. The following data, collected by a famous British botanist named Joseph Hooker in
the Himalaya Mountains between 1848 and 1850, relate atmospheric pressure to the
boiling point of water. Theory suggests that a graph of log pressure versus boiling
point should give a straight line.
(a) Let y = atmospheric pressure (in Hg) and x = boiling point of water (in °F).
Fit a simple linear regression model to the data (xi ; yi ), i = 1; : : : ; 31. Prepare
a scatterplot of y versus x and draw on the …tted line. Plot the standardized
residuals versus x. How well does the model …t these data?
(b) Let z = log y. Fit a simple linear regression model to the data (xi ; zi ), i =
1; : : : ; 31. Prepare a scatterplot of z versus x and draw on the …tted line. Plot
the standardized residuals versus x. How well does the model …t these data?
(c) Based on the results in (a) and (b) which data are best …t by a linear model?
Does this con…rm the theory’s model?
(d) Obtain a 95% con…dence interval for the mean atmospheric pressure if the boiling
point of water is 195 F .
10. An educator believes that the new directed readings activities in the classroom will
help elementary school students improve some aspects of their reading ability. She
arranges for a Grade 3 class of 21 students to take part in the activities for an 8-
week period. A control classroom of 23 Grade 3 students follows the same curriculum
without the activities. At the end of the 8-week period, all students are given a Degree
of Reading Power (DRP) test, which measures the aspects of reading ability that the
treatment is designed to improve. The data are:
24 43 58 71 43 49 61 44 67 49 53
Treatment Group:
56 59 52 62 54 57 33 46 43 57
42 43 55 26 62 37 33 41 19 54 20 85
Control Group:
46 10 17 60 53 42 37 42 55 28 48
Let y1j = the DRP test score for the treatment group, j = 1; : : : ; 21: Let y2j = the
DRP test score for the control group, j = 1; : : : ; 23: For these data
ȳ1 = 51.4762, Σ_{j=1}^{21} (y1j − ȳ1)² = 2423.2381
ȳ2 = 41.5217, Σ_{j=1}^{23} (y2j − ȳ2)² = 6469.7391
11. A study was done to compare the durability of diesel engine bearings made of two
di¤erent compounds. Ten bearings of each type were tested. The following table gives
the “times” until failure (in units of millions of cycles):
Type I: y1i 3:03 5:53 5:60 9:30 9:92 12:51 12:95 15:21 16:04 16:84
Type II: y2i 3:19 4:26 4:47 4:53 4:67 4:69 12:78 6:79 9:37 12:75
ȳ1 = 10.693, Σ_{i=1}^{10} (y1i − ȳ1)² = 209.02961; ȳ2 = 6.75, Σ_{i=1}^{10} (y2i − ȳ2)² = 116.7974
To analyze these data assume
Y1j ~ G(μ1, σ), j = 1, ..., 10 independently
Y2j ~ G(μ2, σ), j = 1, ..., 10 independently.
(a) Obtain a 90% confidence interval for the difference in the means μ1 − μ2.
(c) It has been suggested that log failure times are approximately Normally dis-
tributed, but not failure times. Assuming that the log Y ’s for the two types of
bearing are Normally distributed with the same variance, test the hypothesis
that the two distributions have the same mean. How does the answer compare
with that in part (b)?
(d) How might you check whether Y or log Y is closer to Normally distributed?
(e) Give a plot of the data which could be used to describe the data and your
analysis.
12. To compare the mathematical abilities of incoming …rst year students in Mathemat-
ics and Engineering, 30 Math students and 30 Engineering students were selected
randomly from their …rst year classes and given a mathematics aptitude test. A sum-
mary of the resulting marks xi (for the math students) and yi (for the engineering
students), i = 1; : : : ; 30, is as follows:
Math students: n = 30, ȳ1 = 120, Σ_{i=1}^{30} (y1i − ȳ1)² = 3050
Engineering students: n = 30, ȳ2 = 114, Σ_{i=1}^{30} (y2i − ȳ2)² = 2937
To analyze these data assume
Y1j ~ G(μ1, σ), j = 1, ..., 30 independently
Y2j ~ G(μ2, σ), j = 1, ..., 30 independently.
(a) Obtain a 95% confidence interval for the difference in mean scores for first year
Math and Engineering students.
(b) Test the hypothesis that the difference is zero.
13. Fourteen welded girders were cyclically stressed at 1900 pounds per square inch and
the numbers of cycles to failure were observed. The sample mean and variance of the
log failure times were y1 = 14:564 and s21 = 0:0914. Similar tests on ten additional
girders with repaired welds gave y2 = 14:291 and s22 = 0:0422. Log failure times are
assumed to be independent with a Gaussian distribution. Assuming equal variances
for the two types of girders, obtain a 90% con…dence interval for the di¤erence in
mean log failure times and test the hypothesis of no di¤erence.
14. Consider the data in Problem 9 of Chapter 1 on the lengths of male and female
coyotes.
(a) Construct a 95% confidence interval for the difference in mean lengths for the two
sexes. State your assumptions.
(b) Estimate P (Y1 > Y2 ) (give the maximum likelihood estimate), where Y1 is the
length of a randomly selected female and Y2 is the length of a randomly selected
male. Can you suggest how you might get a con…dence interval?
(c) Give separate con…dence intervals for the average length of males and females.
15. To assess the e¤ect of a low dose of alcohol on reaction time, a sample of 24 student
volunteers took part in a study. Twelve of the students (randomly chosen from the 24)
were given a …xed dose of alcohol (adjusted for body weight) and the other twelve got
a nonalcoholic drink which looked and tasted the same as the alcoholic drink. Each
student was then tested using software that ‡ashes a coloured rectangle randomly
placed on a screen; the student has to move the cursor into the rectangle and double
click the mouse. As soon as the double click occurs, the process is repeated, up to a
total of 20 times. The response variate is the total reaction time (i.e. time to complete
the experiment) over the 20 trials. The data are given below.
“Alcohol” Group:
1:33 1:55 1:43 1:35 1:17 1:35 1:17 1:80 1:68 1:19 0:96 1:46
ȳ1 = 16.44/12 = 1.370, Σ_{i=1}^{12} (y1i − ȳ1)² = 0.608
“Non-Alcohol” Group:
1:68 1:30 1:85 1:64 1:62 1:69 1:57 1:82 1:41 1:78 1:40 1:43
ȳ2 = 19.19/12 = 1.599, Σ_{i=1}^{12} (y2i − ȳ2)² = 0.35569
Analyze the data with the objective of determining whether there is any evidence
that the dose of alcohol increases reaction time. Justify any models that you use.
16. An experiment was conducted to compare gas mileages of cars using a synthetic oil
and a conventional oil. Eight cars were chosen as representative of the cars in general
use. Each car was run twice under as similar conditions as possible (same drivers,
routes, etc.), once with the synthetic oil and once with the conventional oil, the order
of use of the two oils being randomized. The average gas mileages were as follows:
Car                  1      2      3      4      5      6      7      8
Synthetic: y1i      21.2   21.4   15.9   37.0   12.1   21.1   24.5   35.7
Conventional: y2i   18.0   20.6   14.2   37.8   10.6   18.5   25.9   34.7
yi = y1i − y2i       3.2    0.8    1.7   −0.8    1.5    2.6   −1.4    1.0
ȳ1 = 23.6125, Σ_{i=1}^{8} (y1i − ȳ1)² = 535.16875
ȳ2 = 22.5375, Σ_{i=1}^{8} (y2i − ȳ2)² = 644.83875
ȳ = 1.075, Σ_{i=1}^{8} (yi − ȳ)² = 17.135
i=1
(a) Obtain a 95% con…dence interval for the di¤erence in mean gas mileage, and
state the assumptions on which your analysis depends.
(b) Repeat (a) if the natural pairing of the data is (improperly) ignored.
(c) Why is it better to take pairs of measurements on eight cars rather than taking
only one measurement on each of 16 cars?
17. The following table gives the number of staff hours per month lost due to accidents
in eight factories of similar size over a period of one year before and after the introduction
of an industrial safety program.
Factory i            1      2      3      4       5      6      7      8
After: y1i          28.7   62.2   28.9    0.0    93.5   49.6   86.3   40.2
Before: y2i         48.5   79.2   25.3   19.7   130.9   57.6   88.8   62.1
yi = y1i − y2i     −19.8  −17.0    3.6  −19.7   −37.4   −8.0   −2.5  −21.9
ȳ = −15.3375, Σ_{i=1}^{8} (yi − ȳ)² = 1148.79875
There is a natural pairing of the data by factory. Factories with the best safety records
before the safety program tend to have the best records after the safety program as
well. The analysis of the data must take this pairing into account and therefore the
model
Yi ~ G(μ, σ), i = 1, ..., 8 independently
is assumed for the differences Yi = Y1i − Y2i.
(a) The parameters μ and σ correspond to what attributes of interest in the study
population?
(b) Calculate a 95% confidence interval for μ.
(c) Test the hypothesis of no difference due to the safety program, that is, test the
hypothesis H0: μ = 0.
18. Comparing sorting algorithms: Suppose you want to compare two algorithms A
and B that will sort a set of numbers into an increasing sequence. (The R function,
sort(x), will, for example, sort the elements of the numeric vector x.) To compare
the speed of algorithms A and B, you decide to “present” A and B with random
permutations of n numbers, for several values of n. Explain exactly how you would
set up such a study, and discuss what pairing would mean in this context.
19. Sorting algorithms continued: Two sort algorithms as in the preceding problem
were each run on (the same) 20 sets of numbers (there were 500 numbers in each set).
The times taken by the two algorithms to sort each set are shown below.
Set:   1      2      3      4      5      6      7      8      9     10
A:    3.85   2.81   6.47   7.59   4.58   5.47   4.72   3.56   3.22   5.58
B:    2.66   2.98   5.35   6.43   4.28   5.06   4.36   3.91   3.28   5.19
yi:   1.19  −0.17   1.12   1.16   0.30   0.41   0.36  −0.35  −0.06   0.39
Set:  11     12     13     14     15     16     17     18     19     20
A:    4.58   5.46   3.31   4.33   4.26   6.29   5.04   5.08   5.08   3.47
B:    4.05   4.78   3.77   3.81   3.17   6.02   4.84   4.81   4.34   3.48
yi:   0.53   0.68  −0.46   0.52   1.09   0.27   0.20   0.27   0.74  −0.01
ȳ = 0.409, s² = (1/19) Σ_{i=1}^{20} (yi − ȳ)² = 0.237483
(a) Since the two algorithms are each run on the same 20 sets of numbers we analyse
the differences yi = yAi − yBi, i = 1, 2, ..., 20. Construct a 99% confidence
interval for the difference in the average time to sort with algorithms A and B,
assuming the differences have a Gaussian distribution.
(b) Use a Normal qqplot to determine if a Gaussian model is reasonable for the
di¤erences.
(c) Give a point estimate of the probability that algorithm B will sort a randomly
selected list faster than A.
(d) Another way to estimate the probability p in part (c) is to notice that of the 20
sets of numbers in the study, B sorted faster on 15 sets of numbers. Obtain an
approximate 95% con…dence interval for p. (It is also possible to get a con…dence
interval using the Gaussian model.)
(e) Suppose the study had actually been conducted using two independent samples of
size 20 each. Using the two sample Normal analysis determine a 99% con…dence
interval for the di¤erence in the average time to sort with algorithms A and B.
Note:
y1 = 4:7375 s21 = 1:4697 y2 = 4:3285 s22 = 0:9945
How much better is the paired study as compared to the two sample study?
21. Challenge Problem: Readings produced by a set of scales are independent and
Normally distributed about the true weight of the item being measured. A study
is carried out to assess whether the standard deviation of the measurements varies
according to the weight of the item.
(a) Ten weighings of a 10 kilogram weight yielded ȳ = 10.004 and s = 0.013 as the sample mean and standard deviation. Ten weighings of a 40 kilogram weight yielded ȳ = 39.989 and s = 0.034. Is there any evidence of a difference in the standard deviations for the measurements of the two weights?
(b) Suppose you had a further set of weighings of a 20 kilogram item. How could
you study the question of interest further?
22. Challenge Problem: Least squares estimation. Suppose you have a model where the mean of the response variable Yi given the covariates xi = (xi1, ..., xik) has the form
μi = E(Yi | xi) = μ(xi; β).
Show that the least squares estimate of β is the same as the maximum likelihood estimate of β in the Gaussian model Yi ~ G(μi, σ), when μi is of the form
μi = μ(xi; β) = Σ_{j=1}^{k} βj xij.
(a) Predictions take the form Ŷ = g(x), where g(·) is our “prediction” function. Show that E[(Ŷ − Y)²] is minimized by choosing g(x) = μ(x).
(b) Show that the minimum achievable value of E[(Ŷ − Y)²], that is, its value when g(x) = μ(x), is σ(x)².
This shows that if we can determine or estimate μ(x), then “optimal” prediction (in terms of Euclidean distance) is possible. Part (b) shows that we should try to find covariates x for which σ(x)² = Var(Y | x) is as small as possible.
(c) What happens when σ(x)² is close to zero? (Explain this in ordinary English.)
7. MULTINOMIAL MODELS
AND GOODNESS OF FIT TESTS
H0: θj = θj(α) for j = 1, ..., k    (7.2)

L(θ) = Π_{j=1}^{k} θj^{yj}.    (7.3)

Let Ω be the parameter space for θ. It was shown earlier that L(θ) is maximized over Ω (of dimension k − 1) by the vector θ̂ with θ̂j = yj/n, j = 1, ..., k. A likelihood ratio test of the hypothesis (7.2) is based on the likelihood ratio statistic

Λ = 2l(θ̃) − 2l(θ̃0) = −2 log[ L(θ̃0) / L(θ̃) ],    (7.4)

where θ̃0 maximizes L(θ) under the hypothesis (7.2), which restricts θ to lie in a space Ω0 of dimension p. (Note that Ω0 is the space of all (θ1(α), θ2(α), ..., θk(α)) as α varies over its possible values.) If H0 is true (that is, if θ really lies in Ω0) and n is large, the distribution of Λ is approximately χ²(k − 1 − p), so the approximate p-value for testing H0 is P(W ≥ λ) where W ~ χ²(k − 1 − p)
and λ = 2l(θ̂) − 2l(θ̂0) is the observed value of Λ. This approximation is very accurate when n is large and none of the θj's is too small. When the observed expected frequencies under H0 are all at least five, it is accurate enough for testing purposes.
The test statistic (7.4) can be written in a simple form. Let θ̃0 = (θ1(α̃), ..., θk(α̃)) denote the maximum likelihood estimator of θ under the hypothesis (7.2). Then, by (7.4), we obtain

Λ = 2l(θ̃) − 2l(θ̃0) = 2 Σ_{j=1}^{k} Yj log[ θ̃j / θj(α̃) ].

Since θ̃j = Yj/n, if we define the expected frequencies under H0 as

Ej = n θj(α̃) for j = 1, ..., k,

we can rewrite Λ as

Λ = 2 Σ_{j=1}^{k} Yj log(Yj / Ej).    (7.6)
An alternative test statistic, developed historically before the likelihood ratio test statistic, is the Pearson goodness of fit statistic

D = Σ_{j=1}^{k} (Yj − Ej)² / Ej.    (7.7)

The Pearson goodness of fit statistic has similar properties to Λ; for example, their observed values both equal zero when yj = ej = n θj(α̂) for all j = 1, ..., k and are larger when the yj's and ej's differ greatly. It turns out that, like Λ, the statistic D also has a limiting χ²(k − 1 − p) distribution when H0 is true.
The remainder of this chapter consists of the application of the general methods above
to some important testing problems.
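Both statistics are simple to compute in R. A minimal sketch, using hypothetical observed counts y and expected frequencies e computed under an H0 with p = 1 estimated parameter:

y <- c(18, 55, 27)                  # hypothetical observed frequencies
e <- c(20, 50, 30)                  # hypothetical expected frequencies under H0
lambda <- 2 * sum(y * log(y / e))   # likelihood ratio statistic (7.6)
d <- sum((y - e)^2 / e)             # Pearson statistic (7.7)
df <- length(y) - 1 - 1             # k - 1 - p
1 - pchisq(lambda, df)              # approximate p-values
1 - pchisq(d, df)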
Data collected on 100 persons gave y1 = 17, y2 = 46, y3 = 37, and we can use this to test the hypothesis H0 that (7.8) is correct. (Note that (Y1, Y2, Y3) ~ Multinomial(n; θ1, θ2, θ3) with n = 100.) The likelihood ratio test statistic is given by (7.6), but we have to find α̃ and then the Ej's. The likelihood function under (7.8) is

L1(α) = L(θ1(α), θ2(α), θ3(α))
      = c (α²)^{17} [2α(1 − α)]^{46} [(1 − α)²]^{37}
      = c α^{80} (1 − α)^{120}

where c is a constant. We easily find that α̂ = 0.40. The observed expected frequencies under (7.8) are therefore e1 = 100 α̂² = 16, e2 = 100[2α̂(1 − α̂)] = 48, e3 = 100(1 − α̂)² = 36. Clearly these are close to the observed frequencies y1 = 17, y2 = 46, y3 = 37. The observed value of the likelihood ratio statistic (7.6) is

λ = 2 Σ_{j=1}^{3} yj log(yj/ej) = 2 [ 17 log(17/16) + 46 log(46/48) + 37 log(37/36) ] = 0.17
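A minimal R sketch that reproduces this calculation and the corresponding approximate p-value (here the degrees of freedom are k − 1 − p = 1):

y <- c(17, 46, 37)
alphahat <- 0.40
e <- 100 * c(alphahat^2, 2 * alphahat * (1 - alphahat), (1 - alphahat)^2)   # 16, 48, 36
lambda <- 2 * sum(y * log(y / e))    # observed value, about 0.17
1 - pchisq(lambda, df = 1)           # approximate p-value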
Interval    0–100   100–200   200–300   300–400   400–600   600–800   >800
yj            29       22        12        10        10         9       8
ej           27.6     20.0      14.4      10.5      13.1       6.9     7.6

To calculate the expected frequencies we need an estimate of θ which is obtained by maximizing the likelihood function

L(θ) = Π_{j=1}^{7} [pj(θ)]^{yj},

so there is no evidence against the model (7.9). Note that the reason the χ² degrees of freedom are 5 is because k − 1 = 6 and p = dim(θ) = 1.
The goodness of fit test just discussed has some arbitrary elements, since we could have used different intervals and a different number of intervals. Theory has been developed on how best to choose the intervals. For this course we only give rough guidelines, which are: choose 4–10 intervals, so that the observed expected frequencies under H0 are at least 5.
so there is no evidence against the hypothesis that a Poisson model fits these data.
The observed value of the goodness of fit statistic is

d = Σ_{j=1}^{12} (fj − ej)²/ej = (57 − 54.3)²/54.3 + (203 − 210.3)²/210.3 + ··· + (6 − 5.9)²/5.9 = 12.96

so again there is no evidence against the hypothesis that a Poisson model fits these data.
The expected frequencies are all at least five so the approximate p-value is

p-value = P(Λ ≥ 50.36; H0) ≈ P(W ≥ 50.36) ≈ 0 where W ~ χ²(7)

and there is very strong evidence against the hypothesis that an Exponential model fits these data. This conclusion is not unexpected since, as we noted in Example 2.6.2, the observed and expected frequencies are not in close agreement at all. We could have chosen a different set of intervals for these continuous data but the same conclusion of a lack of fit would be obtained for any reasonable choice of intervals.
H0: θij = αi βj for i = 1, ..., a; j = 1, ..., b    (7.10)

where 0 < αi < 1, 0 < βj < 1, Σ_{i=1}^{a} αi = 1, Σ_{j=1}^{b} βj = 1. Note that

αi = P(an individual is type Ai)   and   βj = P(an individual is type Bj)

and that (7.10) is the standard definition for independent events: P(Ai ∩ Bj) = P(Ai)P(Bj).
We recognize that testing (7.10) falls into the general framework of Section 7.1, where k = ab, and the dimension of the parameter space under (7.10) is p = (a − 1) + (b − 1) = a + b − 2. All that needs to be done in order to use the statistics (7.6) or (7.7) to test H0 is to obtain the maximum likelihood estimates α̂i, β̂j under the model (7.10), and then calculate the expected frequencies eij.
Under the model (7.10), the likelihood function for the yij's is proportional to

L1(α, β) = Π_{i=1}^{a} Π_{j=1}^{b} [θij(α, β)]^{yij} = Π_{i=1}^{a} Π_{j=1}^{b} (αi βj)^{yij}.
α̂i = ri/n,   β̂j = cj/n,   i = 1, ..., a;  j = 1, ..., b,

where ri and cj are the row and column totals. The observed value of the likelihood ratio statistic (7.6) for testing the hypothesis (7.10) is then

λ = 2 Σ_{i=1}^{a} Σ_{j=1}^{b} yij log(yij / eij),

and the approximate p-value is

p-value = P(Λ ≥ λ; H0) ≈ P(W ≥ λ) where W ~ χ²((a − 1)(b − 1))
We can think of the Rh types as the A-type classification and the OAB types as the B-type classification in the general theory above. The row and column totals are also shown in the table, since they are the values needed to compute the eij's in (7.11).
To carry out the test that a person's Rh and OAB blood types are statistically independent, we merely need to compute the eij's by (7.11). For example, e11 = (244)(95)/300 = 77.3. The remaining expected frequencies can be obtained by subtraction and these are given in the table below in brackets next to the observed frequencies.

            O           A           B          AB       Total
Rh+      82 (77.3)   89 (94.4)   54 (49.6)   19 (22.8)   244
Rh−      13 (17.7)   27 (21.6)    7 (11.4)    9 (5.2)     56
Total        95         116          61          28       300

The degrees of freedom for the Chi-squared approximation are (a − 1)(b − 1) = (1)(3) = 3, which is consistent with the fact that, once we had calculated three of the expected frequencies, the remaining expected frequencies could be obtained by subtraction.
The observed value of the likelihood ratio test statistic is λ = 8.52, and the p-value is approximately P(W ≥ 8.52) = 0.036 where W ~ χ²(3), so there is evidence against the hypothesis of independence based on the data. Note that by comparing the eij's and the yij's we get some idea about the lack of independence, or relationship, between the two classifications. We see here that the degree of dependence does not appear large.
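A minimal R sketch of this calculation (the Pearson version of the test is also available directly through chisq.test):

y <- matrix(c(82, 89, 54, 19,
              13, 27,  7,  9), nrow = 2, byrow = TRUE)
e <- outer(rowSums(y), colSums(y)) / sum(y)      # expected frequencies under independence
lambda <- 2 * sum(y * log(y / e))                # observed likelihood ratio statistic, about 8.52
1 - pchisq(lambda, df = (2 - 1) * (4 - 1))       # approximate p-value, about 0.036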
θi1 + θi2 + ··· + θib = 1 for each i = 1, ..., a

H0: θ1 = θ2 = ··· = θa,    (7.12)

where n = n1 + ··· + na and y+j = Σ_{i=1}^{a} yij. Since ni = yi+ = Σ_{j=1}^{b} yij, the expected frequencies have exactly the same form as in the preceding section, when we lay out the data in a two-way table with a rows and b columns.
We can think of the persons receiving aspirin and those receiving placebo as two groups, and test the hypothesis

H0: θ11 = θ21,

where θ11 = P(stroke) for a person in the aspirin group and θ21 = P(stroke) for a person in the placebo group. The expected frequencies under H0: θ11 = θ21 are

eij = (yi+)(y+j)/476 for i = 1, 2 and j = 1, 2.

This gives the expected frequencies shown in the table in brackets. The observed value of the likelihood ratio statistic is

λ = 2 Σ_{i=1}^{2} Σ_{j=1}^{2} yij log(yij/eij) = 5.25
so there is evidence against H0 based on the data. A look at the yij's and the eij's indicates that persons receiving aspirin have had fewer strokes than expected under H0, suggesting that θ11 < θ21.
This test can be followed up with estimates for θ11 and θ21. Because each row of the table follows a Binomial distribution, the quantity

[ (θ̃11 − θ̃21) − (θ11 − θ21) ] / [ θ̃11(1 − θ̃11)/n1 + θ̃21(1 − θ̃21)/n2 ]^{1/2}

has approximately a G(0, 1) distribution, which can be used to construct an approximate confidence interval for θ11 − θ21.
Remark: This and other tests involving Binomial probabilities and contingency tables can be carried out using the R function prop.test.
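As an illustration of the remark, a sketch of a prop.test call for comparing two stroke probabilities (the counts below are made up for illustration, not the study data):

strokes <- c(12, 25)                       # hypothetical numbers of strokes in the two groups
n <- c(250, 250)                           # hypothetical group sizes
prop.test(strokes, n, correct = FALSE)     # test of H0: theta11 = theta21, with a confidence interval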
(a) Test the hypothesis that the probability of rust occurring is the same for the
rust-proofed cars as for those not rust-proofed. What do you conclude?
(b) Do you have any concerns about inferring that the rust-proofing prevents rust?
How might a better study be designed?
Test the hypothesis that the number of defective items Y in a single carton has a
Binomial(12; ) distribution. Why might the Binomial not be a suitable model?
Test whether a Poisson model for the number of interruptions Y on a single day is
consistent with these data.
5. The table below records data on 292 litters of mice classified according to litter size and number of females in the litter.
(a) For litters of size n (n = 1, 2, 3, 4) assume that the number of females in a litter of size n has a Binomial distribution with parameters n and θn = P(female). Test the Binomial model separately for each of the litter sizes n = 2, n = 3 and n = 4. (Why is it of scientific interest to do this?)
(b) Assuming that the Binomial model is appropriate for each litter size, test the hypothesis that θ1 = θ2 = θ3 = θ4.
1 1 6 8 10 22 12 15 0 0
2 26 1 20 4 2 0 10 4 19
2 3 0 5 2 8 1 6 14 2
2 2 21 4 3 0 0 7 2 4
4 7 16 18 2 13 22 7 3 5
Give an appropriate probability model for the number of digits between two successive zeros, if the pseudo random number generator is truly producing digits for which P(any digit = j) = 0.1, j = 0, 1, ..., 9, independent of any other digit. Construct a frequency table and test the goodness of fit of your model.
7. 1398 school children with tonsils present were classified according to tonsil size and absence or presence of the carrier for streptococcus pyogenes. The results were as follows:
8. The following data on heights of 210 married couples were presented by Yule in 1900.
Test the hypothesis that the heights of husbands and wives are independent.
9. In the following table, 64 sets of triplets are classified according to the age of their mother at their birth and their sex distribution:

                      3 boys   2 boys, 1 girl   1 boy, 2 girls   3 girls   Total
Mother under 30          5            8                9             7        29
Mother over 30           6           10               13             6        35
Total                   11           18               22            13        64

(a) Is there any evidence of an association between the sex distribution and the age of the mother?
(b) Suppose that the probability of a male birth is 0.5, and that the sexes of triplets are determined independently. Find the probability that there are y boys in a set of triplets, y = 0, 1, 2, 3, and test whether the column totals are consistent with this distribution.
10. A study was undertaken to determine whether there is an association between the
birth weights of infants and the smoking habits of their parents. Out of 50 infants of
above average weight, 9 had parents who both smoked, 6 had mothers who smoked
but fathers who did not, 12 had fathers who smoked but mothers who did not, and
23 had parents of whom neither smoked. The corresponding results for 50 infants of
below average weight were 21, 10, 6, and 13, respectively.
(a) Test whether these results are consistent with the hypothesis that birth weight
is independent of parental smoking habits.
(b) Are these data consistent with the hypothesis that, given the smoking habits of
the mother, the smoking habits of the father are not related to birth weight?
11. Purchase a box of smarties and count the number of each of the colours: red, green, yellow, blue, purple, brown, orange, pink. Test the hypothesis that each of the colours has the same probability, H0: θi = 1/8, i = 1, 2, ..., 8. The following R code (footnote 55) can be modified to give the two test statistics, the likelihood ratio test statistic and Pearson's Chi-squared D:
55. These are the frequencies of smarties for a large number of boxes consumed in Winter 2013.
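The code referred to above is not reproduced here; a minimal sketch along the same lines, using made-up colour counts, is:

y <- c(30, 25, 28, 33, 27, 31, 26, 29)    # hypothetical counts for the eight colours
e <- sum(y) / 8                           # expected frequencies under H0: theta_i = 1/8
lambda <- 2 * sum(y * log(y / e))         # likelihood ratio statistic
d <- sum((y - e)^2 / e)                   # Pearson statistic D
1 - pchisq(lambda, df = 7)                # approximate p-values
1 - pchisq(d, df = 7)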
other factors that affect y constant; often we don't even know what all the factors are. However, the definition serves as a useful ideal for how we should carry out studies in order to show that a causal relationship exists. We try to design studies so that alternative (to the variate x) explanations of what causes changes in attributes of y can be ruled out, leaving x as the causal agent. This is much easier to do in experimental studies, where explanatory variables may be controlled, than in observational studies. The following are brief examples.
groups. That is, the aspirin group and the placebo group both have similar variations in dietary and blood pressure values across the subjects in the group. Thus, a difference in the two groups should not be due to these factors.
8 cars of eight different types were used; each car was used for 8 test drives.
The cars were each driven twice for 600 km on the track at each of four speeds: 80, 100, 120 and 140 km/hr.
8 drivers were involved, each driving each of the 8 cars for one test, and each driving two tests at each of the four speeds.
The cars had similar initial mileages and were carefully checked and serviced so as to make them as comparable as possible; they used comparable fuels.
The drivers were instructed to drive steadily for the 600 km. Each was allowed a 30 minute rest stop after 300 km.
The order in which each driver did his or her 8 test drives was randomized. The track was large enough that all 8 drivers could be on it at the same time. (The tests were conducted over 8 days.)
The response variate was the amount of fuel consumed for each test drive. Obviously in the analysis we must deal with the fact that the cars differ in size and engine type, and their fuel consumption will depend on that as well as on driving speed. A simple approach would be to add the fuel amounts consumed for the 16 test drives at each speed, and to compare them (other methods are also possible). Then, for example, we might find that the average consumption (across the 8 cars) at 80, 100, 120 and 140 km/hr was 43.0, 44.1, 45.8 and 47.2 liters, respectively. Statistical methods of testing and estimation could then be used to test or estimate the differences in average fuel consumption at each of the four speeds. (Can you think of a way to do this?)
We want to see if females have a lower probability of admission than males. If we looked
only at the totals for Engineering plus Arts, then it would appear that the probability a
male applicant is admitted is a little higher than the probability for a female applicant.
However, if we look separately at Arts and Engineering, we see the probability for females
being admitted appears higher in each case! The reason for the reverse direction in the
totals is that Engineering has a higher admission rate than Arts, but the fraction of women
applying to Engineering is much lower than for Arts.
In cause and effect language, we would say that the faculty one applies to (i.e. Engineering or Arts) is a causative factor with respect to probability of admission. Furthermore, it is related to the sex (male or female) of an applicant, so we cannot ignore it in trying to see if sex is also a causative factor.
Remark: The feature illustrated in the example above is sometimes called Simpson's Paradox. In probabilistic terms, it says that for events A, B1, B2 and C1, ..., Ck, we can have

P(A | B1 Ci) > P(A | B2 Ci) for each i = 1, ..., k

but have

P(A | B1) < P(A | B2).

(Note that P(A | B1) = Σ_{i=1}^{k} P(A | B1 Ci) P(Ci | B1) and similarly for P(A | B2), so they depend on what P(Ci | B1) and P(Ci | B2) are.) In the example above we can take B1 = {person is female}, B2 = {person is male}, C1 = {person applies to Engineering}, C2 = {person applies to Arts}, and A = {person is admitted}.
Exercise: Write down estimated probabilities for the various events based on Example
8.3.1, and so illustrate Simpson’s paradox.
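A small numerical illustration in R of how this reversal can happen (the counts below are made up and are not those of Example 8.3.1):

admitted <- matrix(c(18, 30,    # female: Engineering, Arts
                     68,  5),   # male:   Engineering, Arts
                   nrow = 2, byrow = TRUE)
applied  <- matrix(c(20, 100,
                     80,  20),
                   nrow = 2, byrow = TRUE)
admitted / applied                       # admission rate within each faculty is higher for females
rowSums(admitted) / rowSums(applied)     # yet the overall admission rate is higher for males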
Epidemiologists (specialists in the study of disease) have developed guidelines or criteria
which should be met in order to argue that a causal association exists between a risk factor
x and a disease (represented by a response variable Y = I(person has the disease), for
example). These include
the need to account for other possible risk factors and to demonstrate that x and Y
are consistently related when these factors vary.
the demonstration that association between x and Y holds in different types of settings
57. The Coronary Drug Research Group, New England Journal of Medicine (1980), pg. 1038.
[Figure 8.1: Fishbone diagram of explanatory variates (such as weather and method of administration) affecting the occurrence of a fatal heart attack.]
Investigate the effect of clofibrate on the risk of fatal heart attack for patients with a history of a previous heart attack.
The target population consists of all individuals with a previous non-fatal heart attack who are at risk for a subsequent heart attack. The response of interest is the occurrence/non-occurrence of a fatal heart attack. This is primarily a causative problem in that the investigators are interested in determining whether the prescription of clofibrate causes a reduction in the risk of subsequent heart attack. The fishbone diagram (Figure 8.1) indicates a broad variety of factors affecting the occurrence (or not) of a heart attack.
Plan:
The study population consists of men aged 30 to 64 who had a previous heart attack not more than three months prior to initial contact. The sample consists of subjects from the study population who were contacted by participating physicians, asked to participate in the study, and provided informed consent. (All patients eligible to participate had to sign a consent form to participate in the study. The consent form usually describes the current state of knowledge regarding the best available relevant treatments, the potential advantages and disadvantages of the new treatment, and the overall purpose of the study.)
The following treatment protocol was developed:
Randomly assign eligible men to either clofibrate or placebo treatment groups. (This is an attempt to make the clofibrate and placebo groups alike with respect to most
explanatory variates other than the focal explanatory variate. See the fishbone diagram above.)
Follow patients for 5 years and record the occurrence of any fatal heart attacks experienced in either treatment group.
1,103 patients were assigned to clofibrate and 2,789 were assigned to the placebo group.
221 of the patients in the clofibrate group died and 586 of the patients in the placebo group died.
Analysis:
The proportions of patients in the two groups having subsequent fatal heart attacks (clofibrate: 221/1103 = 0.20 and placebo: 586/2789 = 0.21) are comparable.
Conclusions:
Clofibrate does not reduce mortality due to heart attacks in high risk patients.
This conclusion has several limitations. For example, study error has been introduced by restricting the study population to male subjects alone. While clofibrate might be discarded as a beneficial treatment for the target population, there is no information in this study regarding its effects on female patients at risk for secondary heart attacks.
Problem:
Investigate the occurrence of fatal heart attacks in the group of patients assigned to clofibrate who were adherers.
Plan:
Compare the occurrence of heart attacks in patients assigned to clofibrate who maintained the designated treatment schedule with the patients assigned to clofibrate who abandoned their assigned treatment schedule.
Data:
In the clofibrate group, 708 patients were adherers and 357 were non-adherers. The remaining 38 patients could not be classified as adherers or non-adherers and so were excluded from this analysis. Of the 708 adherers, 106 had a fatal heart attack during the five years of follow up. Of the 357 non-adherers, 88 had a fatal heart attack during the five years of follow up.
Analysis:
The proportion of adherers suffering from subsequent heart attack is given by 106/708 = 0.15 while this proportion for the non-adherers is 88/357 = 0.25.
Conclusions:
It would appear that clofibrate does reduce mortality due to heart attack for high risk patients if properly administered.
However, great care must be taken in interpreting the above results since they are based on an observational plan. While the data were collected based on an experimental plan, only the treatment was controlled. The comparison of the mortality rates between the adherers and non-adherers is based on an explanatory variate (adherence) that was not controlled in the original experiment. The investigators did not decide who would adhere to the protocol and who would not; the subjects decided themselves.
Now the possibility of confounding is substantial. Perhaps adherers are more health conscious and exercised more or ate a healthier diet. Detailed measurements of these variates are needed to control for them and reduce the possibility of confounding.
(a) Test the hypothesis that birth weight is independent of the mother’s smoking
habits.
(b) Explain why it is that these results do not prove that birth weights would increase
if mothers stopped smoking during pregnancy. How should a study to obtain
such proof be designed?
(c) A similar, though weaker, association exists between birth weight and the amount
smoked by the father. Explain why this is to be expected even if the father’s
smoking habits are irrelevant.
2. One hundred and fifty Statistics students took part in a study to evaluate computer-assisted instruction (CAI). Seventy-five received the standard lecture course while the other 75 received some CAI. All 150 students then wrote the same examination. Fifteen students in the standard course and 29 of those in the CAI group received a mark over 80%.
(a) Are these results consistent with the hypothesis that the probability of achieving
a mark over 80% is the same for both groups?
(b) Based on these results, the instructor concluded that CAI increases the chances
of a mark over 80%. How should the study have been carried out in order for
this conclusion to be valid?
(a) The following data were collected some years ago in a study of possible sex bias
in graduate admissions at a large university:
Test the hypothesis that admission status is independent of sex. Do these data
indicate a lower admission rate for females?
(b) The following table shows the numbers of male and female applicants and the
percentages admitted for the six largest graduate programs in (a):
Men Women
Program Applicants % Admitted Applicants % Admitted
A 825 62 108 82
B 560 63 25 68
C 325 37 593 34
D 417 33 375 35
E 191 28 393 24
F 373 6 341 7
Test the independence of admission status and sex for each program. Do any of
the programs show evidence of a bias against female applicants?
(c) Why is it that the totals in (a) seem to indicate a bias against women, but the
results for individual programs in (b) do not?
(a) Test the hypothesis that there is no difference between the mean amount of rust for rust-proofed cars as compared to non-rust-proofed cars.
(b) The manufacturer was surprised to find that the data did not show a beneficial effect of rust-proofing. Describe problems with their study and outline how you might carry out a study designed to demonstrate a causal effect of rust-proofing.
5. In randomized clinical trials that compare two (or more) medical treatments it is customary not to let either the subject or their physician know to which treatment they have been randomly assigned. These are referred to as double blind studies.
Discuss why doing a double blind study is a good idea in a causative study.
9.1 References
R.J. MacKay and R.W. Oldford (2001). Statistics 231: Empirical Problem Solving (Stat 231 Course Notes).
C.J. Wild and G.A.F. Seber (1999). Chance Encounters: A First Course in Data Analysis and Inference. John Wiley and Sons, New York.
J. Utts (2003). What Educated Citizens Should Know About Statistics and Probability. The American Statistician 57, 74–79.
Probability functions / probability density functions, means, variances and m.g.f.'s

Discrete (p.f.)

Binomial(n, p): f(y) = (n choose y) p^y q^{n−y}, y = 0, 1, ..., n; 0 < p < 1, q = 1 − p.
  Mean np;  Variance npq;  m.g.f. (pe^t + q)^n.

Bernoulli(p): f(y) = p^y (1 − p)^{1−y}, y = 0, 1; 0 < p < 1, q = 1 − p.
  Mean p;  Variance p(1 − p);  m.g.f. pe^t + q.

Negative Binomial(k, p): f(y) = (y+k−1 choose y) p^k q^y, y = 0, 1, 2, ...; 0 < p < 1, q = 1 − p.
  Mean kq/p;  Variance kq/p²;  m.g.f. [p/(1 − qe^t)]^k, t < −ln q.

Geometric(p): f(y) = p q^y, y = 0, 1, 2, ...; 0 < p < 1, q = 1 − p.
  Mean q/p;  Variance q/p²;  m.g.f. p/(1 − qe^t), t < −ln q.

Hypergeometric(N, r, n): f(y) = (r choose y)(N−r choose n−y)/(N choose n), y = 0, 1, ..., min(r, n); r ≤ N, n ≤ N.
  Mean nr/N;  Variance n(r/N)(1 − r/N)(N − n)/(N − 1);  m.g.f. intractable.

Poisson(θ): f(y) = θ^y e^{−θ}/y!, y = 0, 1, ...; θ > 0.
  Mean θ;  Variance θ;  m.g.f. e^{θ(e^t − 1)}.

Multinomial(n; θ1, ..., θk): f(y1, ..., yk) = [n!/(y1! y2! ··· yk!)] θ1^{y1} θ2^{y2} ··· θk^{yk},
  yi = 0, 1, ... with Σ_{i=1}^{k} yi = n;  θi ≥ 0 with Σ_{i=1}^{k} θi = 1.
  E(Yi) = nθi;  Var(Yi) = nθi(1 − θi).

Continuous (p.d.f.)

Uniform(a, b): f(y) = 1/(b − a), a ≤ y ≤ b.
  Mean (a + b)/2;  Variance (b − a)²/12;  m.g.f. (e^{bt} − e^{at})/[(b − a)t], t ≠ 0.

Exponential(θ): f(y) = (1/θ) e^{−y/θ}, y > 0; θ > 0.
  Mean θ;  Variance θ²;  m.g.f. 1/(1 − θt), t < 1/θ.

N(μ, σ²) or G(μ, σ): f(y) = [1/(σ√(2π))] e^{−(y−μ)²/(2σ²)}, −∞ < y < ∞; −∞ < μ < ∞, σ > 0.
  Mean μ;  Variance σ²;  m.g.f. e^{μt + σ²t²/2}.

Chi-squared(k): f(y) = [1/(2^{k/2} Γ(k/2))] y^{k/2−1} e^{−y/2}, y > 0; k > 0,
  where Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx.
  Mean k;  Variance 2k;  m.g.f. (1 − 2t)^{−k/2}, t < 1/2.

Student t(k): f(y) = c_k [1 + y²/k]^{−(k+1)/2}, −∞ < y < ∞, where c_k = Γ((k+1)/2)/[√(kπ) Γ(k/2)].
  Mean 0 if k > 1;  Variance k/(k − 2) if k > 2, undefined otherwise;  m.g.f. undefined.
Formulae

ȳ = (1/n) Σ_{i=1}^{n} yi        s² = [1/(n−1)] Σ_{i=1}^{n} (yi − ȳ)²        x̄ = (1/n) Σ_{i=1}^{n} xi        Sxx = Σ_{i=1}^{n} (xi − x̄)²

Syy = Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} yi² − nȳ²        Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = Σ_{i=1}^{n} (xi − x̄)yi

se² = [1/(n−2)] Σ_{i=1}^{n} (yi − α̂ − β̂xi)² = [1/(n−2)] (Syy − β̂Sxy)        sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)
Pivotals/Test Statistics

Random variable                                       Distribution    Mean or df          Standard Deviation
(Ȳ − μ)/(σ/√n)                                        Gaussian        0                   1
(n − 1)S²/σ²                                          Chi-squared     df = n − 1
(Ȳ − μ)/(S/√n)                                        Student t       df = n − 1
β̃ = Sxy/Sxx = (1/Sxx) Σ_{i=1}^{n} (xi − x̄)Yi          Gaussian        β                   σ (1/Sxx)^{1/2}
(β̃ − β)/(Se/√Sxx)                                     Student t       df = n − 2
α̃ = Ȳ − β̃x̄                                           Gaussian        α                   σ [1/n + x̄²/Sxx]^{1/2}
μ̃(x) = α̃ + β̃x                                         Gaussian        μ(x) = α + βx       σ [1/n + (x − x̄)²/Sxx]^{1/2}
[μ̃(x) − μ(x)] / (Se [1/n + (x − x̄)²/Sxx]^{1/2})       Student t       df = n − 2
Y − μ̃(x)                                              Gaussian        0                   σ [1 + 1/n + (x − x̄)²/Sxx]^{1/2}
[Y − μ̃(x)] / (Se [1 + 1/n + (x − x̄)²/Sxx]^{1/2})      Student t       df = n − 2
(n − 2)Se²/σ²                                         Chi-squared     df = n − 2
[Ȳ1 − Ȳ2 − (μ1 − μ2)] / (Sp √(1/n1 + 1/n2))           Student t       df = n1 + n2 − 2
(n1 + n2 − 2)Sp²/σ²                                   Chi-squared     df = n1 + n2 − 2
Approximate Pivotals

(θ̃ − θ) / [θ̃(1 − θ̃)/n]^{1/2} ~ N(0, 1) approximately, if θ̃ = Y/n and Y ~ Binomial(n, θ)

(Ȳ − θ) / (Ȳ/n)^{1/2} ~ N(0, 1) approximately, for a random sample from a Poisson(θ) distribution
Chapter 1
ū = (1/n) Σ_{i=1}^{n} (a + byi) = (1/n)(na + b Σ_{i=1}^{n} yi) = a + bȳ
on the value of y0. If y0 < y_((n+1)/2) then there are now an even number of observations and the new
1.2 (a)
su² = [1/(n−1)] Σ_{i=1}^{n} (ui − ū)² = [1/(n−1)] Σ_{i=1}^{n} [a + byi − (a + bȳ)]²
    = [1/(n−1)] Σ_{i=1}^{n} (byi − bȳ)² = b² [1/(n−1)] Σ_{i=1}^{n} (yi − ȳ)² = b² s²
so su = |b| s.
(b)
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} [yi² − 2yiȳ + ȳ²] = Σ_{i=1}^{n} yi² − 2ȳ Σ_{i=1}^{n} yi + nȳ²
                      = Σ_{i=1}^{n} yi² − 2nȳ² + nȳ² = Σ_{i=1}^{n} yi² − nȳ²
(c)
s²(y0) = (1/n) [ Σ_{i=1}^{n} yi² + y0² − (n+1) ((nȳ + y0)/(n+1))² ]
       = (1/n) [ Σ_{i=1}^{n} yi² + y0² − (1/(n+1)) (n²ȳ² + 2nȳy0 + y0²) ]
       = [1/(n(n+1))] [ (n+1) Σ_{i=1}^{n} yi² + (n+1)y0² − n²ȳ² − 2nȳy0 − y0² ]
       = [1/(n(n+1))] [ n Σ_{i=1}^{n} yi² − n²ȳ² + Σ_{i=1}^{n} yi² + ny0² − 2nȳy0 ]
       = [(n−1)/(n+1)] s² + [1/(n(n+1))] [ Σ_{i=1}^{n} yi² + ny0(y0 − 2ȳ) ]
Therefore
lim_{y0 → ±∞} s(y0) = lim_{y0 → ±∞} { [(n−1)/(n+1)] s² + [1/(n(n+1))] [ Σ_{i=1}^{n} yi² + ny0(y0 − 2ȳ) ] }^{1/2} = ∞
This means that an additional very large (or very small) observation has a large effect on the sample standard deviation.
(d) Once y0 is larger than q(0.75) or smaller than q(0.25), then y0 has little effect on the interquartile range as y0 increases or decreases.
1.3 Since
(1/n) Σ_{i=1}^{n} (ui − ū)³ / [ (1/n) Σ_{i=1}^{n} (ui − ū)² ]^{3/2}
  = (1/n) Σ_{i=1}^{n} (byi − bȳ)³ / [ (1/n) Σ_{i=1}^{n} (byi − bȳ)² ]^{3/2}
  = [ b³/(b²)^{3/2} ] (1/n) Σ_{i=1}^{n} (yi − ȳ)³ / [ (1/n) Σ_{i=1}^{n} (yi − ȳ)² ]^{3/2}
  = (b³/|b|³) g1
Since
(1/n) Σ_{i=1}^{n} (ui − ū)⁴ / [ (1/n) Σ_{i=1}^{n} (ui − ū)² ]²
  = (1/n) Σ_{i=1}^{n} (byi − bȳ)⁴ / [ (1/n) Σ_{i=1}^{n} (byi − bȳ)² ]²
  = [ b⁴/(b²)² ] (1/n) Σ_{i=1}^{n} (yi − ȳ)⁴ / [ (1/n) Σ_{i=1}^{n} (yi − ȳ)² ]² = g2
therefore the sample kurtosis is the same for both data sets.
1.5 (a) The relative frequency histogram of the piston diameters is given in Figure 12.1.
[Figure 12.1: Relative frequency histogram of the piston diameters.]
(d) P pk = 0:6184
(f)
1.6
1.7 The empirical c.d.f. is constructed by first ordering the data (smallest to largest) to obtain the order statistic: 0.01 0.39 0.43 0.45 0.52 0.63 0.72 0.76 0.85 0.88. Then the empirical c.d.f. is shown in Figure 12.2.
[Figure 12.2: Empirical c.d.f. (cumulative relative frequency) of the data.]
(b)
E(Ȳ) = E[ (1/n) Σ_{i=1}^{n} Yi ] = (1/n) Σ_{i=1}^{n} E(Yi) = (1/n)(nμ) = μ
(c)
E(S²) = [1/(n−1)] [ Σ_{i=1}^{n} E(Yi²) − n E(Ȳ²) ]
      = [1/(n−1)] [ Σ_{i=1}^{n} (σ² + μ²) − n (σ²/n + μ²) ]
      = [1/(n−1)] [ nσ² + nμ² − σ² − nμ² ]
      = [1/(n−1)] (n−1)σ² = σ²
1.9 (a) The relative frequency histograms are given in Figure 12.3.
[Figure 12.3: Relative frequency histograms of the lengths of female and male coyotes.]
(b) Five number summary for female coyotes: 71.0, 85.5, 89.75, 93.25, 102.5
    Five number summary for male coyotes: 78.0, 87.0, 92.0, 96.0, 105.0
(c) Female coyotes: x̄ = 89.24, s1 = 6.5482
    Male coyotes: ȳ = 92.06, s2 = 6.6960
1.10 (a) The two variates are Value (x) and Gross (y), where Value is the average amount the actor's movies have made (in millions of U.S. dollars), and Gross is the amount of the highest grossing movie in which the actor played as a major character (in millions of U.S. dollars). Since the goal is to study the effect of an actor's value (x) on the amount grossed in a movie (y), we choose x as the explanatory variate and y as the response variate.
(b) A scatterplot of the data is given in Figure 12.4.
[Figure 12.4: Scatterplot of gross versus value.]
r = Sxy / √(Sxx Syy)
  = [184540.93 − 20(860.6/20)(3759.5/20)] / { [43315.04 − 20(860.6/20)²]^{1/2} [971560.19 − 20(3759.5/20)²]^{1/2} }
  = 0.558
1.11 (a) Since the n people are selected at random from a large population it is reasonable
to assume that the people are independent and that the probability a randomly
(f) Since there are now four possible outcomes on each independent trial, the joint distribution of Y1 = no. of A types, Y2 = no. of B types, Y3 = no. of AB types, Y4 = no. of O types is given by the Multinomial distribution:
P(Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4) = [n!/(y1! y2! y3! y4!)] θ1^{y1} θ2^{y2} θ3^{y3} θ4^{y4}
for yi = 0, 1, ..., n, i = 1, 2, 3, 4 with Σ_{i=1}^{4} yi = n, and 0 < θi < 1, i = 1, 2, 3, 4 with Σ_{i=1}^{4} θi = 1.
(ii)
P(Ȳ − 1.96σ/√n ≤ μ ≤ Ȳ + 1.96σ/√n)
  = P( |Ȳ − μ| ≤ 1.96σ/√n ) = P(|Z| ≤ 1.96) where Z ~ G(0, 1)
  = 2P(Z ≤ 1.96) − 1
  = 2(0.975) − 1 = 0.95
(iii) We want P(|Ȳ − μ| ≤ 1.0) ≥ 0.95 where Ȳ ~ G(μ, 12/√n), or
P(|Ȳ − μ| ≤ 1.0) = P( |Ȳ − μ|/(12/√n) ≤ 1.0/(12/√n) ) = P(|Z| ≤ √n/12) ≥ 0.95 where Z ~ G(0, 1).
Since P(|Z| ≤ 1.96) = 0.95 we want √n/12 ≥ 1.96 or n ≥ (1.96)²(144) = 553.2. Therefore n = 554.
f(y) = (1/θ) e^{−y/θ} for y > 0 and θ > 0.
(i)
E(Ȳ) = (1/n) Σ_{i=1}^{n} E(Yi) = (1/n)(nθ) = θ
and Var(Ȳ) = (1/n²) Σ_{i=1}^{n} Var(Yi), since the Yi's are independent random variables,
           = (1/n²) Σ_{i=1}^{n} θ² = (1/n²)(nθ²) = θ²/n → 0 as n → ∞.
For large values of n, the sample mean Ȳ should be close to the mean θ.
(ii) By the Central Limit Theorem
P(Ȳ − 1.6449 θ/√n ≤ θ ≤ Ȳ + 1.6449 θ/√n)
  = P( |Ȳ − θ|/(θ/√n) ≤ 1.6449 ) ≈ P(|Z| ≤ 1.6449) where Z ~ N(0, 1)
  = 2P(Z ≤ 1.6449) − 1 = 2(0.95) − 1 = 0.9
P(Y1 = 0, Y2 = 2, Y3 = 0, Y4 = 1, Y5 = 3, Y6 = 1)
  = P(Y1 = 0) P(Y2 = 2) P(Y3 = 0) P(Y4 = 1) P(Y5 = 3) P(Y6 = 1)
  = (θ⁰e^{−θ}/0!)(θ²e^{−θ}/2!)(θ⁰e^{−θ}/0!)(θ¹e^{−θ}/1!)(θ³e^{−θ}/3!)(θ¹e^{−θ}/1!)
  = θ⁷ e^{−6θ} / 12 for θ > 0.
A reasonable estimate of the mean θ is the sample mean ȳ = 7/6. An estimate of the probability that there is at least one accident at this intersection next Wednesday is given by
1 − (7/6)⁰ e^{−7/6}/0! = 1 − e^{−7/6} = 0.6886.
(i)
E(Ȳ) = (1/n) Σ_{i=1}^{n} E(Yi) = (1/n)(nθ) = θ
and Var(Ȳ) = (1/n²) Σ_{i=1}^{n} Var(Yi), since the Yi's are independent random variables,
           = (1/n²) Σ_{i=1}^{n} θ = (1/n²)(nθ) = θ/n → 0 as n → ∞.
For large values of n, the sample mean Ȳ should be close to the mean θ.
(ii) By the Central Limit Theorem
P(Ȳ − 1.96√(θ/n) ≤ θ ≤ Ȳ + 1.96√(θ/n))
  = P( |Ȳ − θ|/√(θ/n) ≤ 1.96 ) ≈ P(|Z| ≤ 1.96) where Z ~ N(0, 1)
  = 2P(Z ≤ 1.96) − 1
  = 2(0.975) − 1
  = 0.95
Chapter 2
2.1 (a)
G(θ) = θ^a (1 − θ)^b, 0 < θ < 1
g(θ) = log G(θ) = a log θ + b log(1 − θ), 0 < θ < 1
g′(θ) = a/θ − b/(1 − θ) = [a(1 − θ) − bθ] / [θ(1 − θ)] = [a − (a + b)θ] / [θ(1 − θ)]
g′(θ) = 0 if θ = a/(a + b)
Since g′(θ) > 0 for 0 < θ < a/(a + b) and g′(θ) < 0 for a/(a + b) < θ < 1, then by the First Derivative Test g(θ) has a maximum value at θ = a/(a + b).
(b)
G(θ) = θ^{−a} e^{−b/θ}, θ > 0
g(θ) = log G(θ) = −a log θ − b/θ, θ > 0
g′(θ) = −a/θ + b/θ² = (−aθ + b)/θ²
g′(θ) = 0 if θ = b/a
Since g′(θ) > 0 for 0 < θ < b/a and g′(θ) < 0 for θ > b/a, then by the First Derivative Test g(θ) has a maximum value at θ = b/a.
(c)
G(θ) = θ^a e^{−bθ}, θ > 0
g(θ) = log G(θ) = a log θ − bθ, θ > 0
g′(θ) = a/θ − b = (a − bθ)/θ
g′(θ) = 0 if θ = a/b
Since g′(θ) > 0 for 0 < θ < a/b and g′(θ) < 0 for θ > a/b, then by the First Derivative Test g(θ) has a maximum value at θ = a/b.
(d)
G(θ) = e^{−a(θ − b)²}, θ ∈ ℝ
g(θ) = log G(θ) = −a(θ − b)², θ ∈ ℝ
g′(θ) = −2a(θ − b)
g′(θ) = 0 if θ = b
Since g′(θ) > 0 for θ < b and g′(θ) < 0 for θ > b, then by the First Derivative Test g(θ) has a maximum value at θ = b.
L(θ) = θ^{10} (1 − θ)^{90} for 0 < θ < 1.
Now
l′(θ) = 10/θ − 90/(1 − θ) = (10 − 100θ)/[θ(1 − θ)] = 0 if θ = 10/100 = 0.1

P(Y ≥ 10) ≈ P( Z ≥ (9.5 − 0.1n)/√(0.09n) ) where Z ~ N(0, 1).
Setting
(9.5 − 0.1n)/√(0.09n) = 1.2816
gives
n² − 204.78n + 9025 = 0.

Solving
l′(θ) = n(ȳ − 1)/θ − n/(1 − θ) = [ n(ȳ − 1)(1 − θ) − nθ ] / [θ(1 − θ)] = 0
gives the maximum likelihood estimate
θ̂ = (ȳ − 1)/ȳ.
If n = 200 and Σ_{i=1}^{200} yi = 400 then
ȳ = 400/200 = 2,  θ̂ = (2 − 1)/2 = 0.5
and R(θ) = (θ/0.5)^{200} ((1 − θ)/0.5)^{200} for 0 < θ < 1.
A graph of R(θ) is given in Figure 12.5.
(c) Since p = P(Y = 1; θ) is a function of θ, then by the Invariance property of maximum likelihood estimates the maximum likelihood estimate of p is obtained by substituting θ̂ = 0.5, which gives p̂ = 0.5.
or more simply
L(θ) = θ^{41} e^{−10θ} for θ > 0.
The log likelihood function is
l(θ) = 41 log θ − 10θ for θ > 0.
[Figure 12.5: Relative likelihood function R(θ).]
Solving
l′(θ) = 41/θ − 10 = 0
gives θ̂ = 4.1. Since
p = P(no transactions in a two minute interval; θ) = (2θ)⁰ e^{−2θ}/0! = e^{−2θ},
then by the invariance property of maximum likelihood estimates the maximum likelihood estimate of p is p̂ = e^{−2θ̂} = 0.000275.
L(θ) = Π_{i=1}^{n} f(yi; θ) = Π_{i=1}^{n} (2yi/θ) e^{−yi²/θ} for θ > 0
     = 2ⁿ ( Π_{i=1}^{n} yi ) θ^{−n} exp( −(1/θ) Σ_{i=1}^{n} yi² ),
or more simply
L(θ) = θ^{−n} exp( −(1/θ) Σ_{i=1}^{n} yi² ).
The log likelihood function is
l(θ) = −n log θ − (1/θ) Σ_{i=1}^{n} yi², θ > 0.
Solving
l′(θ) = −n/θ + (1/θ²) Σ_{i=1}^{n} yi² = (1/θ²) [ Σ_{i=1}^{n} yi² − nθ ] = 0
gives the maximum likelihood estimate
θ̂ = (1/n) Σ_{i=1}^{n} yi².
If n = 20 and Σ_{i=1}^{20} yi² = 72 then θ̂ = 72/20 = 3.6. A graph of R(θ) is given in Figure 12.6.
[Figure 12.6: Relative likelihood function R(θ).]
L(θ) = Π_{i=1}^{n} (θ + 1) yi^θ = (θ + 1)ⁿ ( Π_{i=1}^{n} yi )^θ for θ > −1.
Solving
(d/dθ) l(θ) = n/(1 + θ) + Σ_{i=1}^{n} log(yi) = 0
gives
θ̂ = −n / Σ_{i=1}^{n} log(yi) − 1.

r(θ) = l(θ) − l(θ̂) = n log[ (θ + 1)/(θ̂ + 1) ] + (θ − θ̂) Σ_{i=1}^{n} log(yi) for θ > −1.
If n = 15 and Σ_{i=1}^{15} log(yi) = −34.5 then θ̂ = 15/34.5 − 1 = −0.5652. The graph of r(θ) is given in Figure 12.7.
[Figure 12.7: Log relative likelihood function r(θ).]
2.7 (a)
P(MM) = P(FF) = (1/2)α + (1/2)(1/2)(1 − α) = (1 + α)/4
P(MF) = 1 − 2(1 + α)/4 = (1 − α)/2
where α = probability the pair is identical.
(b)
L(α) = [n!/(n1! n2! n3!)] [(1 + α)/4]^{n1} [(1 + α)/4]^{n2} [(1 − α)/2]^{n3} where n = n1 + n2 + n3,
or more simply
L(α) = (1 + α)^{n1 + n2} (1 − α)^{n3}.
Maximizing L(α) gives α̂ = (n1 + n2 − n3)/n. For n1 = 16, n2 = 16 and n3 = 18, α̂ = 0.28.
is
0 10
f0 1
X
n! 1 2 2 f2 @ yA
( ) f2 ( ymax fmax
)
f0 !f1 ! fmax !0! 1
y=yamx +1
f0 ymax
n! 1 2 Q yfy
:
f0 !f1 ! fmax ! 1 y=1
Now
2f0 f0 1
l0 ( ) = + + T
1 2 1
1 2
= 2T (f0 + 3T ) + T
(1 ) (1 2 )
and l′(θ) = 0 if
θ = { (f0 + 3T) ± [ (f0 + 3T)² − 8T² ]^{1/2} } / (4T)
and since
(c) The probability that a randomly selected family has x children is θ^x, x = 1, 2, .... Suppose for simplicity there are N different families where N is very large. Then the number of families that have x children is N × (probability a family has x children) = Nθ^x for x = 1, 2, ..., so there is a total of xNθ^x children in families of x children and a total of Σ_{x=1}^{∞} xNθ^x children altogether. Therefore the probability a randomly chosen child is in a family of x children is
xNθ^x / Σ_{x=1}^{∞} xNθ^x = c x θ^x, x = 1, 2, ....
Note that Σ_{x=0}^{∞} θ^x = 1/(1 − θ). Therefore taking derivatives
Σ_{x=1}^{∞} x θ^{x−1} = 1/(1 − θ)²  and  Σ_{x=1}^{∞} x θ^x = θ/(1 − θ)².
Solving
Σ_{x=1}^{∞} c x θ^x = 1
gives c = (1 − θ)²/θ and
P(X = x; θ) = [(1 − θ)²/θ] x θ^x = x (1 − θ)² θ^{x−1} for x = 1, 2, ... and 0 < θ ≤ 1/2.
(d) The probability of observing the given data for model (c) is
[33!/(22! 7! 3! 1!)] [(1 − θ)²]^{22} [2θ(1 − θ)²]^{7} [3θ²(1 − θ)²]^{3} [4θ³(1 − θ)²]^{1} for 0 < θ ≤ 1/2.
The likelihood function is
L(θ) = (1 − θ)^{2(22+7+3+1)} θ^{7+2(3)+3(1)}
     = θ^{16} (1 − θ)^{66} for 0 < θ ≤ 1/2
which is maximized for θ = 16/(16 + 66) = 16/82 = 8/41 = 0.1951.
Since the probability a family has no children is
P(Y = 0; θ) = (1 − 2θ)/(1 − θ) = g(θ),
then by the Invariance Property of maximum likelihood estimates the maximum likelihood estimate of g(θ) is
g(θ̂) = (1 − 2(0.1951))/(1 − 0.1951) = 0.7576.
(e) For these data f0 = 0 and T = 49, and l′(θ) = 0 has no solution. Since l′(θ) > 0 for all 0 < θ < 0.5, l(θ) is an increasing function on this interval. Thus the maximum value of l(θ) occurs at the endpoint θ = 0.5 and therefore θ̂ = 0.5.
2.9 (a) The likelihood function based on the Poisson model and the frequency table is
L(θ) = [696! / (69! 155! 171! 143! 79! 57! 14! 6! 2! 0!)]
       × (θ⁰e^{−θ}/0!)^{69} (θ¹e^{−θ}/1!)^{155} (θ²e^{−θ}/2!)^{171} (θ³e^{−θ}/3!)^{143} (θ⁴e^{−θ}/4!)^{79}
       × (θ⁵e^{−θ}/5!)^{57} (θ⁶e^{−θ}/6!)^{14} (θ⁷e^{−θ}/7!)^{6} (θ⁸e^{−θ}/8!)^{2} ( Σ_{y=9}^{∞} θ^y e^{−θ}/y! )^{0}
or more simply
L(θ) = θ^{0(69)+1(155)+2(171)+3(143)+4(79)+5(57)+6(14)+7(6)+8(2)} e^{−θ(69+155+171+143+79+57+14+6+2)}
     = θ^{1669} e^{−696θ}, θ > 0,
and
ey = 696 (1669/696)^y e^{−1669/696} / y!, y = 0, 1, ..., 8
There is quite good agreement between the observed and expected frequencies which indicates the Poisson model is very reasonable. Recall the homogeneity assumption for the Poisson process. Since a Poisson model fits the data well this suggests that Wayne was a very consistent player when he played with the Edmonton Oilers.
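A minimal R sketch of this fit, using the observed frequencies quoted above (the last cell collects y ≥ 9):

f <- c(69, 155, 171, 143, 79, 57, 14, 6, 2, 0)     # observed frequencies for y = 0,...,8 and y >= 9
n <- sum(f)                                        # 696
thetahat <- 1669 / 696                             # maximum likelihood estimate
e <- c(n * dpois(0:8, thetahat), n * (1 - ppois(8, thetahat)))   # expected frequencies
data.frame(y = c(0:8, "9+"), observed = f, expected = round(e, 1))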
2.10 (a)
θ̂ = Σ_{i=1}^{n} yi / Σ_{i=1}^{n} ti

p = P(X = 0; θ) = (θt)⁰ e^{−θt}/0! = e^{−θt}
Suppose that x intervals were observed with no particles. Since X ~ Binomial(n, p), the maximum likelihood estimate of p is p̂ = x/n. Since p = e^{−θt} implies θ = −(log p)/t, then by the Invariance Property of maximum likelihood estimates θ̂ = −(log p̂)/t.
P(Y ∈ [μ − σ, μ + σ]) = P(|Y − μ| ≤ σ) = P( |Y − μ|/σ ≤ 1 )
The proportion of observations in the interval (0.71) is slightly higher than what would be expected for Normal data (0.68).
(d)
IQR = 22 − 16.25 = 5.75.
To show that for Normally distributed data IQR = 1.349σ we need to solve
0.5 = P(|Y − μ| ≤ cσ) = P( |Y − μ|/σ ≤ c )
for c if Y ~ G(μ, σ).
line since the quantiles of the Normal distribution change more rapidly in both tails of the distribution.
The proportion of observations in the interval [ȳ − s, ȳ + s] is slightly higher than we would expect for Normally distributed data. This also agrees with the sample kurtosis value of 4.3 being larger than 3.
Overall, except for the two outliers, it seems reasonable to assume that the data are approximately Normally distributed. It would be a good idea to do any formal analyses of the data with and without the outliers to determine the effect of these outliers on the conclusions of these analyses.
2.12 (a) The sample mean ȳ = 159.77 and the sample standard deviation s = 6.03 for the data.
(b) The number of observations in the interval [ȳ − s, ȳ + s] = [153.75, 165.80] is 244 or 69.5% and the actual number of observations in the interval [ȳ − 2s, ȳ + 2s] = [147.72, 171.83] is 334 or 95.2%. If Y ~ G(μ, σ) then
P(Y ∈ [μ − σ, μ + σ]) = P(|Y − μ| ≤ σ) = P( |Y − μ|/σ ≤ 1 )
[Figure: Relative frequency histogram and empirical c.d.f. of the heights with the G(159.77, 6.03) p.d.f. and c.d.f. superimposed, together with a boxplot and Normal qqplot of the heights.]
(j) All the numerical summaries indicate good agreement with the Gaussian as-
sumption. The relative frequency histogram has the shape of a Gaussian proba-
bility density function. The empirical cumulative distribution function and the
Gaussian cumulative distribution function also have similar shapes. The box-
plot is consistent with Gaussian data and the points in the qqplot lie reasonably
along a straight line also indicating good agreement with a Gaussian model. A
Gaussian distribution seems very reasonable for these data.
2.14 See Figure 12.9. Note that the qqplot for the log yi ’s is far more linear than for the
yi ’s indicating that the Normal model is more reasonable for the transformed data.
2.14 (a) If they are independent, P(S and H) = P(S)P(H) = αβ. The others are similar.
(b) The Multinomial probability function evaluated at the observed values is
L(α, β) = [100!/(20! 15! 22! 43!)] (αβ)^{20} [α(1 − β)]^{15} [(1 − α)β]^{22} [(1 − α)(1 − β)]^{43}
[Figure: Normal qqplot of the data (sample quantiles versus N(0, 1) quantiles).]
or
( 35(42)/100, 35(58)/100, 65(42)/100, 65(58)/100 ) = (14.7, 20.3, 27.3, 37.7)
which can be compared with 20, 15, 22, 43. The observed and expected frequencies do not appear to be very close. In Chapter 7 we will see how to construct a formal test of the model.
2.16
2.17 (a) The median of the N(0, 1) distribution is m = 0. Reading from the qqplot, the sample quantile on the y axis which corresponds to 0 on the x axis is approximately equal to 1.0, so the sample median is approximately 1.0.
(b) To determine q(0.25) for these data we note that P(Z ≤ −0.6745) = 0.25 if Z ~ N(0, 1). Reading from the qqplot, the sample quantile on the y axis which corresponds to −0.67 on the x axis is approximately equal to 0.4, so q(0.25) is approximately 0.4. To determine q(0.75) for these data we note that P(Z ≥ 0.6745) = 0.25 if Z ~ N(0, 1). Reading from the qqplot, the sample quantile on the y axis which corresponds to 0.67 on the x axis is approximately equal to 1.5, so q(0.75) is approximately 1.5. The IQR for these data is approximately 1.5 − 0.4 = 1.1.
(c) The frequency histogram of the data would be approximately symmetric about
the sample mean.
(d) The frequency histogram would most resemble a Uniform probability density
function.
2.18 (a) If there is adequate mixing of the tagged animals, the number of tagged animals
caught in the second round is a random sample selected without replacement so
follows a hypergeometric distribution (see the STAT 230 Course Notes).
(b)
L(N + 1)/L(N) = (N + 1 − k)(N + 1 − n) / [ (N + 1 − k − n + y)(N + 1) ]
2.19
L(θ) = Π_{i=1}^{n} f(yi; θ) = Π_{i=1}^{n} (1/θ) if yi ≤ θ, i = 1, 2, ..., n
     = 1/θⁿ if y(n) = max(y1, ..., yn) ≤ θ,
where 1/θⁿ is a decreasing function of θ. Note also that L(θ) = 0 for 0 < θ < y(n). Therefore the maximum value of L(θ) occurs at θ = y(n) and therefore the maximum likelihood estimate of θ is θ̂ = y(n).
2.20 (a)
P(Y > C; θ) = ∫_C^∞ (1/θ) e^{−y/θ} dy = e^{−C/θ}.
(b) For the i'th piece that failed at time yi < C, the contribution to the likelihood is (1/θ) e^{−yi/θ}. For those pieces that survive past time C, the contribution to the likelihood is the probability of the event, P(Y > C; θ) = e^{−C/θ}. Therefore the likelihood is
L(θ) = [ Π_{i=1}^{k} (1/θ) e^{−yi/θ} ] ( e^{−C/θ} )^{n−k}
l(θ) = −k log θ − (1/θ) Σ_{i=1}^{k} yi − (n − k)C/θ
θ̂ = (1/k) [ Σ_{i=1}^{k} yi + (n − k)C ].
(c) When k = 0 and C > 0 the maximum likelihood estimate is θ̂ = ∞. In this case there are no failures in the time interval [0, C] and this is more likely to happen as the expected value of the Exponential gets larger and larger.
and, ignoring the terms yi! which do not contain the parameters, the log likelihood is
l(α, β) = Σ_{i=1}^{n} [ yi(α + βxi) − e^{α + βxi} ].
For a given set of data we can solve this system of equations numerically but not explicitly.
Chapter 3
3.1 (a) The Problem is to determine the proportion of eligible voters who plan to vote
and, of those, the proportion who plan to support the party. This is a descriptive
Problem since the aim of the study is to determine the attributes just mentioned
for a population of eligible voters.
(b) The target population is all eligible voters. This would include those eligible
voters in all regions and those with/without telephone numbers on the list.
(c) One variate is whether or not an eligible voter plans to vote which is a categorical
variate. Another variate is whether or not an eligible voter supports the party
which is also a categorical variate.
(d) The study population is all eligible voters on the list.
(e) The sample is the 1104 eligible voters who responded to the questions.
(f) A possible source of study error is that the polling firm only called eligible voters in urban areas. Urban eligible voters may have different views than rural eligible voters – this is a difference between the target and study populations. Eligible voters with phones may have different views than those without.
(g) A possible source of sample error is that many of the people called refused to participate in the survey. People who refuse to participate may have different voting preferences as compared to people who participate. For example, people who refuse to participate in the survey may also be less likely to vote.
(h) Attribute 1 is the proportion of units who plan to vote. An estimate of this
attribute based on the data is: 732=1104.
Attribute 2 is the proportion of those who plan to vote who also plan to support
the party. An estimate of this attribute based on the data is: 351=732.
3.2 (a) This study is an experimental study since the researchers are in control of which
schools received the regular curriculum and which schools are using the JUMP
program.
(b) The Problem is to compare the performance in math of students at Ontario
schools using the current provincial curriculum as compared to the performance
in math of students at Ontario schools using the JUMP math program.
(c) This is a causative problem since the researchers are interested in whether the JUMP program causes better student performance in math.
(d) The target population is all elementary students in Ontario public schools at the time of the study, or all elementary students in Ontario public schools at the time of the study and into the future.
(e) One variate of interest is whether a student receives the standard curriculum or the standard curriculum plus the JUMP program, which is a categorical variate. Another variate of interest is classroom test scores, which is a discrete variate since scores only take on a countable number of values.
(f) The study population is all Ontario elementary students in Grades 2 and 5 in public schools at the time of the study.
(g) The sampling protocol was to select the schools in one school board in Ontario.
The researchers did not indicate how this school board was chosen.
(h) A possible source of study error is that the ability of students in Grades 2 and 5 to learn math skills might be different than students in other grades.
(i) A possible source of sampling error is that the schools in the chosen school board may not be representative of all the elementary schools in Ontario. For example, the schools in the chosen board may have larger class sizes compared to other schools. Students in larger classes may not receive as much help to improve their math skills as students in smaller classes. Another example is that the chosen school board might be in a low income area of a city. Students from low income families may respond differently to changes in the Math curriculum as compared to students from middle class families.
(j) It is unclear from the article what type of classroom tests will be used or how they will be graded. So depending on how this is done it could lead to measurement error. For example, different schools may use different grading criteria for the same test.
(k) Randomization ensures that the difference in the learning outcome is only due to different teaching programs, and not due to other potential confounders (e.g., class size, parents' education level, parents' social economic status, etc.).
3.3 (a) This is an experimental study because the researchers controlled, using ran-
domization, which students were assigned to the racing-type game and which
students were assigned to the game of solitaire.
(b) The Problem is to determine whether playing racing games makes players more
likely to take risks in a simulated driving test.
(c) This is a causative type Problem because the researchers were interested in
whether playing racing games as compared to playing a game like solitaire caused
players to take more risks in the driving test.
(d) A suitable target population for this study is young adults living in China at the
time of the study OR students attending university in China at the time of the
study.
(e) One important variate is whether the student played the racing-type driving game or the game of solitaire. This is a categorical variate. The other important variate was how long, in seconds, the student waited to hit the “stop” key in the Vienna Risk-Taking Test. This is a continuous variate.
(f) A suitable study population for this study is students attending Xi’an Jiatong
University at the time of the study.
(g) From the article it appears that the researchers recruited volunteers for the study.
The article does not indicate how these volunteers were obtained.
(h) If the target population is young adults living in China and the study population is students attending university in China at the time of the study, then a possible source of study error is that students who attend university are more educated and more intelligent (on average) and therefore possibly different in their levels of risk-taking as compared to young adults in China not attending university.
(i) Since the sample consisted of volunteers and not a random sample of students from the Xi'an Jiatong University, a possible source of sample error is that students who volunteer for such studies are more likely to take risks than non-volunteers who might be more conservative. The risk-taking habits of the volunteers (on average) may be different than the risk-taking habits of all students at the Xi'an Jiatong University.
(j) The attribute of most interest is the average or mean difference in the time to hit the “stop” key in the Vienna Risk-Taking Test between young adults who play racing games compared to young adults who play neutral games. An estimate of this based on the given data is 12 − 10 = 2 seconds.
Chapter 4
(a)
P(|θ̃ − θ| ≤ 0.03) = P( |Y/1000 − 0.5| ≤ 0.03 )
  = P( −0.03/√(0.5(0.5)/1000) ≤ (Y/1000 − 0.5)/√(0.5(0.5)/1000) ≤ 0.03/√(0.5(0.5)/1000) )
  ≈ P(−1.90 ≤ Z ≤ 1.90) where Z ~ N(0, 1)
  = 2P(Z ≤ 1.90) − 1 = 2(0.97128) − 1
  = 0.94256
(b)
P(|θ̃ − θ| ≤ 0.03) = P( |Y/n − 0.5| ≤ 0.03 )
  = P( −0.03/√(0.5(0.5)/n) ≤ (Y/n − 0.5)/√(0.5(0.5)/n) ≤ 0.03/√(0.5(0.5)/n) )
  ≈ P(−0.06√n ≤ Z ≤ 0.06√n)
where Z ~ N(0, 1). Since P(−1.96 ≤ Z ≤ 1.96) = 0.95, we need 0.06√n ≥ 1.96 or n ≥ (1.96/0.06)² = 1067.1. Therefore n should be at least 1068.
(c)
P(|θ̃ − θ| ≤ 0.03) = P( |Y/n − θ| ≤ 0.03 )
  = P( −0.03/√(θ(1 − θ)/n) ≤ (Y/n − θ)/√(θ(1 − θ)/n) ≤ 0.03/√(θ(1 − θ)/n) )
  ≈ P( −0.03√n/√(θ(1 − θ)) ≤ Z ≤ 0.03√n/√(θ(1 − θ)) )
where Z ~ N(0, 1). Since P(−1.96 ≤ Z ≤ 1.96) = 0.95, we need
0.03√n/√(θ(1 − θ)) ≥ 1.96
or
n ≥ (1.96/0.03)² θ(1 − θ).
Since
(1.96/0.03)² θ(1 − θ) ≤ (1.96/0.03)² (0.5)² = 1067.1,
n should again be at least 1068.
4.4 (a) Suppose the experiment which was used to estimate the parameter was conducted a large number of times and each time a 95% confidence interval for the parameter was constructed using the observed data. Then, approximately 95% of these constructed intervals would contain the true, but unknown, value of the parameter. Since we only have one interval [42.8, 47.8] we do not know whether it contains the true value or not. We can only say that we are 95% confident that the given interval [42.8, 47.8] contains the true value, since we are told it is a 95% confidence interval. In other words, we hope we were one of the “lucky” 95% who constructed an interval containing the true value. Warning: You cannot say that the probability that the interval [42.8, 47.8] contains the true value of the parameter is 0.95!
(b) An approximate 95% confidence interval for the proportion of Canadians whose mobile phone is a smartphone is
θ̂ ± 1.96 √( θ̂(1 − θ̂)/n ) = 0.45 ± 1.96 √( 0.45(0.55)/1000 ) = 0.45 ± 0.03083
= [0.4192, 0.4808].
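A one-line R check of this interval, using the values quoted above:

thetahat <- 0.45; n <- 1000
thetahat + c(-1, 1) * 1.96 * sqrt(thetahat * (1 - thetahat) / n)   # gives [0.4192, 0.4808]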
4.5 Let Y = number of women who tested positive. Assume the model Y ~ Binomial(n, θ). Since P(−2.58 ≤ Z ≤ 2.58) = 0.99, an approximate 99% confidence interval is given by:
θ̂ ± 2.58 √( θ̂(1 − θ̂)/n ) = 64/29000 ± 2.58 √( (64/29000)(28936/29000)/29000 ) = 0.0022 ± 0.0007
= [0.0015, 0.0029].
The Binomial model assumes that the 29,000 women represent 29,000 independent trials and that the probability that a randomly chosen woman is HIV positive is equal to θ. The women may not represent independent trials and the probability that a randomly chosen woman is HIV positive may be higher among certain high risk women such as women who are intravenous drug users.
4.6 (a) If Y is the number who support this information then Y ~ Binomial(n, θ). An approximate 95% confidence interval is given by
0.7 ± 1.96 √( 0.7(0.3)/200 ) = 0.7 ± 0.0635
= [0.6365, 0.7635].
(b) The Binomial model assumes that the 200 people represent 200 independent
trials. If 100 of the people interviewed were 50 married couples then the two
people in a couple are probably not independent with respect to their views.
The 15% likelihood interval can be obtained from the graph of R(θ) given in Figure 12.5 or by using the R command:
uniroot(function(x)((4*x*(1-x))^200-0.15),lower=0.4,upper=0.5).
The 15% likelihood interval is [0.45, 0.55].
The 15% likelihood interval can be obtained from the graph of R(θ) given in Figure 12.6 or by using the R command:
uniroot(function(x)(((3.6/x)*exp(1-3.6/x))^20-0.15),lower=2.0,upper=3.0).
The 15% likelihood interval is [2.40, 5.76].
The 15% likelihood interval can be obtained from the graph of r(θ) given in Figure 12.7 or by using the R command:
uniroot(function(x)(15*log(2.3*(x+1))-34.5*(x+1)+15-log(0.15)),
lower=-0.8,upper=-0.7).
The 15% likelihood interval is [−0.75, −0.31].
4.10 (a) For the data n1 = 16, n2 = 16 and n3 = 18, α̂ = 0.28 and
R(α) = (1 + α)^{32} (1 − α)^{18} / [ (1 + 0.28)^{32} (1 − 0.28)^{18} ], 0 < α < 1.
(b) For the data for which 17 identical pairs were found, α̂ = 17/50 = 0.34 and the relative likelihood function is
R(α) = α^{17} (1 − α)^{33} / [ (0.34)^{17} (1 − 0.34)^{33} ], 0 < α < 1.
We use
uniroot(function(x)((x/0.34)^17*((1-x)/0.66)^33-0.1),lower=0,upper=0.3)
to obtain the 10% likelihood interval [0.21, 0.49]. This interval is much narrower than the interval in (a) which indicates that α is more accurately determined by the second model.
[Figure: Relative likelihood function R(α).]

R(θ) = θ^{16} (1 − θ)^{66} / [ (8/41)^{16} (33/41)^{66} ] for 0 < θ ≤ 1/2.
[Figure: Relative likelihood function R(θ).]
4.12 (a) The probability a group tests negative is p = (1 − θ)^k. The probability that y out of n groups test negative is
(n choose y) p^y (1 − p)^{n−y}, y = 0, 1, ..., n.
We are assuming that the nk people represent independent trials and that θ does not vary across subpopulations of the population of interest.
(b) Since L(p) = p^y (1 − p)^{n−y} is the usual Binomial likelihood we know p̂ = y/n. Solving p = (1 − θ)^k for θ we obtain θ = 1 − p^{1/k}. Therefore by the Invariance Property of maximum likelihood estimates, the maximum likelihood estimate of θ is
θ̂ = 1 − p̂^{1/k} = 1 − (y/n)^{1/k}.
therefore

E(Y) = θΓ(3) = 2θ,  E(Y²) = θ²Γ(4) = 6θ²

and

Var(Y) = E(Y²) − [E(Y)]² = 6θ² − (2θ)² = 2θ²

as required.
(b) The likelihood function is

L(θ) = ∏_{i=1}^{n} (y_i/θ²) e^{−y_i/θ} = (∏_{i=1}^{n} y_i) θ^{−2n} exp(−(1/θ) ∑_{i=1}^{n} y_i),  θ > 0

or more simply

L(θ) = θ^{−2n} exp(−(1/θ) ∑_{i=1}^{n} y_i),  θ > 0.
The log likelihood function is

l(θ) = −2n log θ − (1/θ) ∑_{i=1}^{n} y_i,  θ > 0

and

l'(θ) = −2n/θ + (1/θ²) ∑_{i=1}^{n} y_i = (1/θ²)(∑_{i=1}^{n} y_i − 2nθ),  θ > 0.

Now l'(θ) = 0 if

θ = (1/(2n)) ∑_{i=1}^{n} y_i = ȳ/2.

(Note a First Derivative Test could be used to confirm that l(θ) has an absolute
maximum at θ = ȳ/2.) The maximum likelihood estimate of θ is

θ̂ = ȳ/2.
(c)

E(Ȳ) = E((1/n) ∑_{i=1}^{n} Y_i) = (1/n) ∑_{i=1}^{n} E(Y_i) = (1/n) ∑_{i=1}^{n} 2θ = (1/n)(2nθ) = 2θ

and

Var(Ȳ) = Var((1/n) ∑_{i=1}^{n} Y_i) = (1/n²) ∑_{i=1}^{n} Var(Y_i) = (1/n²) ∑_{i=1}^{n} 2θ² = (1/n²)(2nθ²) = 2θ²/n.
(d)

(Ȳ − 2θ)/√(2θ²/n)  has approximately a N(0, 1) distribution.

If Z ∼ N(0, 1),

P(−1.96 ≤ Z ≤ 1.96) = 0.95.

Therefore

P(−1.96 ≤ (Ȳ − 2θ)/√(2θ²/n) ≤ 1.96) ≈ 0.95.

(e) Since

0.95 ≈ P(−1.96 ≤ (Ȳ − 2θ)/√(2θ²/n) ≤ 1.96)
     = P(Ȳ − 1.96 θ√(2/n) ≤ 2θ ≤ Ȳ + 1.96 θ√(2/n))
     = P(Ȳ/2 − 0.98 θ̂√(2/n) ≤ θ ≤ Ȳ/2 + 0.98 θ̂√(2/n))

where θ̂ = Ȳ/2 is substituted for θ in the standard error.
(f) For these data the maximum likelihood estimate of is
4.14 (a)

L(θ) = ∏_{i=1}^{n} (θ³/2) t_i² exp(−θ t_i) = (1/2^n)(∏_{i=1}^{n} t_i²) θ^{3n} exp(−θ ∑_{i=1}^{n} t_i)

or more simply

L(θ) = θ^{3n} exp(−θ ∑_{i=1}^{n} t_i),  θ > 0.

Solving l'(θ) = 0, we obtain the maximum likelihood estimate θ̂ = 3n / ∑_{i=1}^{n} t_i.

The relative likelihood function is

R(θ) = L(θ)/L(θ̂) = (θ/θ̂)^{3n} exp(3n(1 − θ/θ̂)),  θ > 0.
[Figure 12.12: graph of the relative likelihood function R(θ)]
(b) Since n = 20 and ∑_{i=1}^{20} t_i = 996, therefore θ̂ = 3(20)/996 = 0.06024. Reading
from the graph in Figure 12.12 or by solving R(θ) = 0.15 using the uniroot
function in R, we obtain the 15% likelihood interval [0.0463, 0.0768] which is an
approximate 95% confidence interval for θ.
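For example, the endpoints can be found numerically; a minimal sketch in R (the bracketing intervals below are assumptions chosen around θ̂ = 0.06024):

  # Relative likelihood R(theta) = (theta/thetahat)^(3n) * exp(3n*(1 - theta/thetahat))
  n <- 20
  thetahat <- 3 * n / 996
  R <- function(theta) (theta / thetahat)^(3 * n) * exp(3 * n * (1 - theta / thetahat))
  # solve R(theta) = 0.15 on each side of the maximum likelihood estimate
  lower <- uniroot(function(x) R(x) - 0.15, lower = 0.03, upper = thetahat)$root
  upper <- uniroot(function(x) R(x) - 0.15, lower = thetahat, upper = 0.10)$root
  c(lower, upper)   # approximately [0.0463, 0.0768]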
(c)

E(T) = ∫_0^∞ t (θ³/2) t² e^{−θt} dt = (θ³/2) ∫_0^∞ t³ e^{−θt} dt
     = (1/(2θ)) ∫_0^∞ x³ e^{−x} dx   (by letting x = θt)
     = (1/(2θ)) Γ(4) = (1/(2θ)) 3! = 3/θ

and a 95% approximate confidence interval for E(T) = 3/θ is

[3/0.0768, 3/0.0463] = [39.1, 64.8].
(d)

p(θ) = P(T ≤ 50) = (θ³/2) ∫_0^{50} t² e^{−θt} dt
     = 1 − (1250θ² + 50θ + 1) e^{−50θ}.

Since

p(0.0463) = 1 − [1250(0.0463)² + 50(0.0463) + 1] e^{−50(0.0463)} = 0.408

and

p(0.0768) = 1 − [1250(0.0768)² + 50(0.0768) + 1] e^{−50(0.0768)} = 0.738,

an approximate 95% confidence interval for p(θ) is [0.408, 0.738]. The Binomial model involves fewer model assumptions but gives a less precise (wider) interval.
Also,

P(Y ≤ 55.8) = P(Z ≤ (55.8 − 40)/√80)  where Z ∼ N(0, 1)
            = P(Z ≤ 1.77) = 0.96164 ≈ 0.96,

P(X > 1.4) = 1 − P(|Z| ≤ √1.4)  where Z ∼ N(0, 1)
           = 2[1 − P(Z ≤ 1.18)] = 2(1 − 0.88100) = 0.23800

and

P(X > 3) = e^{−3/2} = e^{−1.5} ≈ 0.223.
4.16 (a)

∫_0^∞ (1/(2^{k/2} Γ(k/2))) y^{k/2 − 1} e^{−y/2} dy
  = (1/Γ(k/2)) ∫_0^∞ x^{k/2 − 1} e^{−x} dx   (letting x = y/2)
  = Γ(k/2)/Γ(k/2)   since ∫_0^∞ x^{α−1} e^{−x} dx = Γ(α)
  = 1.
(b) See Figure 12.13. As k increases the probability density function becomes more
symmetric about the line y = k.
(c)

M(t) = E(e^{Yt}) = ∫_0^∞ (1/(2^{k/2} Γ(k/2))) y^{k/2 − 1} e^{−y/2} e^{yt} dy
     = (1/(2^{k/2} Γ(k/2))) ∫_0^∞ y^{k/2 − 1} e^{−(1/2 − t)y} dy   which converges for t < 1/2
     = (1/(2^{k/2} Γ(k/2) (1/2 − t)^{k/2})) ∫_0^∞ x^{k/2 − 1} e^{−x} dx   (by letting x = (1/2 − t)y)
     = [2(1/2 − t)]^{−k/2} = (1 − 2t)^{−k/2}   for t < 1/2.
[Figure 12.13: χ²(k) probability density functions for k = 5, 10, 25]
Therefore

M'(0) = E(Y) = (−k/2)(1 − 2t)^{−k/2 − 1}(−2)|_{t=0} = k

M''(0) = E(Y²) = (k/2)(k/2 + 1)(1 − 2t)^{−k/2 − 2}(−2)(−2)|_{t=0} = k² + 2k

Var(Y) = k² + 2k − k² = 2k

and

M_S(t) = ∏_{i=1}^{n} M_i(t) = (1 − 2t)^{−∑_{i=1}^{n} k_i/2}.
4.16
(a) Since

W_i = Y_i − Ȳ = Y_i − (1/n) ∑_{j=1}^{n} Y_j = (1 − 1/n) Y_i − (1/n) ∑_{j≠i} Y_j,   i = 1, 2, ..., n.

(b)

E(W_i) = E(Y_i − Ȳ) = E(Y_i) − E(Ȳ) = μ − μ = 0,   i = 1, 2, ..., n.

Now Cov(Y_i, Y_j) = 0 if i ≠ j (since the Y_i's are independent random variables)
and Cov(Y_i, Y_i) = Var(Y_i) = σ². This implies

Cov(Y_i, Ȳ) = Cov(Y_i, (1/n) ∑_{j=1}^{n} Y_j) = (1/n) ∑_{j=1}^{n} Cov(Y_i, Y_j)
            = (1/n) Cov(Y_i, Y_i) = (1/n) Var(Y_i) = σ²/n.
Therefore
Figure 12.14: Graphs of the t (k) pd.f. for k = 1; 5; 25 and the N (0; 1) p.d.f. (dashed line)
(d)
(i) If T ∼ t(10) then P(T ≤ 0.88) = 0.8,

and

P(Z ≥ 2.04) = 1 − P(Z ≤ 2.04) = 1 − 0.97932 = 0.02068.
4.19

lim_{k→∞} f(t; k) = lim_{k→∞} c_k (1 + t²/k)^{−(k+1)/2}
  = lim_{k→∞} c_k (1 + t²/k)^{−1/2} (1 + t²/k)^{−k/2}
  = (1/√(2π)) exp(−t²/2)   for t ∈ ℝ

since

lim_{k→∞} c_k = 1/√(2π),

lim_{k→∞} (1 + t²/k)^{−1/2} = 1,  and

lim_{k→∞} (1 + t²/k)^{−k/2} = exp(−t²/2)   since lim_{n→∞} (1 + a/n)^{bn} = e^{ab}.
For Exponential data,

L(θ) = θ^{−n} e^{−nȳ/θ}  for θ > 0, and θ̂ = ȳ.

Therefore

R(θ) = L(θ)/L(θ̂) = L(θ)/L(ȳ) = (θ^{−n} e^{−nȳ/θ})/(ȳ^{−n} e^{−n}) = (ȳ/θ)^n e^{n(1 − ȳ/θ)}  for θ > 0.

For n = 30 and ȳ = 380,

R(θ) = (380/θ)^{30} e^{30(1 − 380/θ)}  for θ > 0.
we obtain the interval as [285:5; 521:3]. Alternatively the likelihood interval can
be determined approximately from a graph of the relative likelihood function.
See Figure 12.15.
Figure 12.15: Relative likelihood function for survival times for AIDS patients
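As a sketch of the numerical calculation (the bracketing limits, and the use of exp(−qchisq(0.90, 1)/2) ≈ 0.2585 as the likelihood level corresponding to an approximate 90% interval, are assumptions consistent with the interval quoted above):

  R <- function(theta) (380 / theta)^30 * exp(30 * (1 - 380 / theta))
  level <- exp(-qchisq(0.90, 1) / 2)   # approximately 0.2585
  lower <- uniroot(function(x) R(x) - level, lower = 200, upper = 380)$root
  upper <- uniroot(function(x) R(x) - level, lower = 380, upper = 700)$root
  c(lower, upper)   # approximately [285.5, 521.3]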
4.21 (a) Using the cumulative distribution function of the Exponential distribution, F(y) = 1 − e^{−y/θ}, we have for w > 0

G(w) = P(W ≤ w) = P(2Y/θ ≤ w) = P(Y ≤ wθ/2) = 1 − e^{−(wθ/2)/θ} = 1 − e^{−w/2}.

Taking the derivative with respect to w gives the probability density function

g(w) = (1/2) e^{−w/2}  for w > 0,

which can easily be verified as the probability density function of a χ²(2) random variable.

(b) Let W_i = 2Y_i/θ ∼ χ²(2), i = 1, 2, ..., n. Then by Theorem 29,

U = ∑_{i=1}^{n} W_i = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(2n).

Since U depends on the data and θ only and its distribution is completely known, U is a pivotal quantity.
(c)

0.9 = P(43.19 ≤ W ≤ 79.08)  where W ∼ χ²(60)

therefore

0.9 = P(43.19 ≤ (2/θ) ∑_{i=1}^{n} Y_i ≤ 79.08)
    = P((2/79.08) ∑_{i=1}^{n} Y_i ≤ θ ≤ (2/43.19) ∑_{i=1}^{n} Y_i)

and thus

[(2/79.08) ∑_{i=1}^{n} y_i , (2/43.19) ∑_{i=1}^{n} y_i]

is a 90% confidence interval for θ. Substituting ∑_{i=1}^{30} y_i = 11400, we obtain the 90%
confidence interval for θ as [288.3, 527.9] which is very close to the approximate
90% likelihood-based confidence interval [285.5, 521.3]. The intervals are close
since n = 30 is reasonably large.
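A quick numerical check of this interval in R (a sketch; the quantile calls reproduce the tabulated values 43.19 and 79.08):

  a <- qchisq(0.05, df = 60)   # 43.19
  b <- qchisq(0.95, df = 60)   # 79.08
  sum_y <- 11400
  c(2 * sum_y / b, 2 * sum_y / a)   # approximately [288.3, 527.9]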
4.22 (a) From Example 2.2.2 the likelihood function for Poisson data is

L(θ) = θ^{nȳ} e^{−nθ}  for θ > 0.
Figure 12.16: Relative Likelihood Functions for Company A and Company B Photocopiers
(b) For Company B, n = 12 and ^ = 11:67. See Figure 12.16 for a graph of the
relative likelihood function (graph on the left).
(c) The 15% likelihood interval for Company A is: [16:2171; 24:3299] and the 15%
likelihood interval for Company B is: [9:8394; 13:7072]. It is clear from these
approximate 95% con…dence intervals that the mean number of service calls for
Company A is much larger than for Company B which implies the decision to
go with Company B is a good one.
(d) The assumptions of the Poisson process (individuality, independence and homo-
geneity) would need to hold.
(e) Since

0.95 ≈ P(−1.96 ≤ (Ȳ − θ)/√(Ȳ/n) ≤ 1.96)
     = P(Ȳ − 1.96√(Ȳ/n) ≤ θ ≤ Ȳ + 1.96√(Ȳ/n)),

therefore the interval [ȳ − 1.96√(ȳ/n), ȳ + 1.96√(ȳ/n)] is an approximate 95%
confidence interval for θ. For Company A this interval is [17.5, 22.5] and for
Company B this interval is [9.73, 13.60]. These intervals are similar but not
identical to the intervals in (c) since n = 12 is small. The intervals would be
more similar for a larger value of n.
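A minimal sketch of this calculation in R (assuming, consistent with the intervals quoted, sample means of 20 for Company A and 11.67 for Company B, each based on n = 12):

  n <- 12
  ybar <- c(A = 20, B = 11.67)
  lower <- ybar - 1.96 * sqrt(ybar / n)
  upper <- ybar + 1.96 * sqrt(ybar / n)
  rbind(lower, upper)   # approximately [17.5, 22.5] and [9.73, 13.60]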
4.23 (a) Since the points in the qqplot in Figure 12.17 lie reasonably along a straight line
the Gaussian model seems reasonable for these data.
(b) A suitable study population for this study would be common octopi in the Ria de
Vigo. The parameter μ represents the mean weight in grams of common octopi
[Figure 12.17: Normal qqplot of the weights (sample quantiles versus N(0,1) quantiles)]
in the Ria de Vigo. The parameter σ represents the standard deviation of the
weights in grams of common octopi in the Ria de Vigo.
μ̂ = ȳ = 20340/19 = 1070.526  and  s = [(1/18)(884095)]^{1/2} = 221.62.
Since the value = 1100 grams is well within this interval then the researchers
could conclude that based on these data the octopi in the Ria de Vigo are
reasonably healthy based on their mean weight.
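The interval referred to here is not printed above; a minimal sketch of the usual t-based 95% confidence interval computed from the summary statistics (assuming that is the interval intended):

  n <- 19; ybar <- 1070.526; s <- 221.62
  ybar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # roughly [964, 1177], which contains 1100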
4.24 (a) Qqplots of the weights for females and males separately are shown in Figures
12.18 and 12.19. In both cases the points lie reasonably along a straight line so
it is reasonable to assume a Normal model for each data set.
[Figure 12.18: Normal qqplot of the weights (quantiles of input sample versus standard Normal quantiles)]
Note that since the value for t (149) is not available in the t-tables we used
P (T 1:9647) = (1 + 0:95) =2 = 0:975 where T v t (100). Using R we obtain
P (T 1:976) = 0:975 where T v t (149) : The intervals will not change substan-
tially.
We note that the interval for females and the interval for males have no values in
common. The mean weight for males is higher than the mean weight for females.
(c) To obtain confidence intervals for the standard deviations we note that the pivotal quantity (n − 1)S²/σ² = 149S²/σ² has a χ²(149) distribution and the Chi-squared tables stop at 100 degrees of freedom. Since E(149S²/σ²) = 149 and Var(149S²/σ²) = 2(149) = 298, we use 149S²/σ² ∼ N(149, 298) approximately.
[Figure 12.19: Normal qqplot of the weights (quantiles of input sample versus standard Normal quantiles)]
4.25 (a) A suitable study population consists of the detergent packages produced by this
particular detergent packaging machine. The parameter corresponds to the
mean weight of the detergent packages produced by this detergent packaging
machine. The parameter is the standard deviation of the weights of the deter-
gent packages produced by this detergent packaging machine.
(b) For these data

ȳ = 4803/16 = 300.1875

s² = (1/15)[1442369 − 16(300.1875)²] = 37.89583

s = 6.155959.
4.27 Use σ² ≈ s² = 45/9 = 5 and d = 0.5. Hence n ≥ (1.96/d)² σ² = (1.96/0.5)²(5) = 76.832.
Since 10 observations have already been taken, the manufacturer should be advised
to take at least 77 − 10 = 67 additional measurements. We note that this calculation
depends on an estimate of σ from a small sample (n = 10) and the value 1.96 is from
the Normal tables rather than the t tables, so the manufacturer should be advised to
take more than 67 additional measurements. If we round 1.96 to 2 to account for
the fact that we don't actually know σ, and note that (2/0.5)²(5) = 80, then this would
suggest that, to be safe, the manufacturer should take an additional 80 − 10 = 70
measurements.
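A small sketch of this sample-size calculation in R:

  sigma2 <- 5; d <- 0.5
  n_needed <- (1.96 / d)^2 * sigma2      # 76.832, so round up to 77
  additional <- ceiling(n_needed) - 10   # 67 additional measurements
  c(n_needed, additional)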
(b) Since the observations in x are observations from a distribution with larger variability, we don't want to take just an average of x and y. We would choose an estimate that weights y more than x since y is a better estimate.

(c)

Var(μ̃) = Var((X + 4Y)/5) = (1/25)[Var(X) + 16 Var(Y)]
        = (1/25)[1/10 + 16(0.25)/10] = 0.02

and √Var(μ̃) = 0.1414. Also

Var((X + Y)/2) = (1/4) Var(X) + (1/4) Var(Y) = (1/4)(1/10) + (1/4)(0.25/10) = 0.03125

and √Var((X + Y)/2) = 0.1768. We can clearly see now that μ̃ has a smaller
standard deviation than the estimator (X + Y)/2.
Chapter 5
5.1 (a) The model Y ∼ Binomial(n, θ) is appropriate in the case in which the experiment consists of a sequence of n independent trials with two outcomes on each trial (Success and Failure) and P(Success) = θ is the same on each trial. In this experiment the trials are the guesses. Since the deck is reshuffled each time, it seems reasonable to assume the guesses are independent. It also seems reasonable to assume that the woman's ability to guess the number remains the same on each trial. To test the hypothesis that the woman is guessing at random the appropriate null hypothesis would be H0: θ = 1/5 = 0.2.

(b) For n = 20 and H0: θ = 0.2, we have Y ∼ Binomial(20, 0.2) and E(Y) = 20(0.2) = 4. We use the test statistic or discrepancy measure D = |Y − E(Y)| = |Y − 4|. The observed value of D is d = |8 − 4| = 4. Then

p-value = P(D ≥ 4; H0) = P(|Y − 4| ≥ 4; H0)
        = P(Y = 0) + P(Y ≥ 8)
        = (20 choose 0)(0.2)^0 (0.8)^20 + ∑_{y=8}^{20} (20 choose y)(0.2)^y (0.8)^(20−y)
        = 1 − ∑_{y=1}^{7} (20 choose y)(0.2)^y (0.8)^(20−y)
        = 0.04367  (using R).

There is evidence based on the data against H0: θ = 0.2. These data suggest that the woman might have some special guessing ability.
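A minimal R sketch of this exact Binomial p-value:

  # P(Y = 0) + P(Y >= 8) for Y ~ Binomial(20, 0.2)
  dbinom(0, size = 20, prob = 0.2) + (1 - pbinom(7, size = 20, prob = 0.2))   # 0.04367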
(c) For n = 100 and H0: θ = 0.2, we have Y ∼ Binomial(100, 0.2), E(Y) = 100(0.2) = 20 and Var(Y) = 100(0.2)(0.8) = 16. We use the test statistic or discrepancy measure D = |Y − E(Y)| = |Y − 20|. The observed value of D is d = |32 − 20| = 12. Then
or
[Figure 12.20: Normal qqplot (sample quantiles versus N(0,1) quantiles)]
There is strong evidence based on the data against H0 : = 0:2. These data
suggest that the woman has some special guessing ability. Note that we would
not conclude that it has been proven that she does have special guessing ability!
5.3 (a) A qqplot of the data is given in Figure 12.20. Since the points in the qqplot lie reasonably along a straight line it seems reasonable to assume a Normal model for these data.

(b) A study population is a bit difficult to define in this problem. One possible choice is to define the study population to be all measurements that could be taken on a given day by this instrument on a standard solution of 45 parts per billion dioxin. The parameter μ corresponds to the mean measurement made by this instrument on the standard solution. The parameter σ corresponds to the standard deviation of the measurements made by this instrument on the standard solution.
(c) To test H0: μ = 45 we use the test statistic

D = |Ȳ − 45| / (S/√20)   where  T = (Ȳ − 45)/(S/√20) ∼ t(19).

The observed value of D is

d = |44.405 − 45| / (2.3946/√20) = 1.11

and

p-value = P(D ≥ d; H0)
        = P(|T| ≥ 1.11)  where T ∼ t(19)
        = 2[1 − P(T ≤ 1.11)]
        = 0.2803  (calculated using R).

In either case, since the p-value is larger than 0.1, we would conclude that, based on the observed data, there is no evidence against the hypothesis H0: μ = 45. (Note: This does not imply the hypothesis is true!)
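A one-line R sketch of the p-value computation quoted above:

  d <- abs(44.405 - 45) / (2.3946 / sqrt(20))
  2 * (1 - pt(d, df = 19))   # approximately 0.28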
A 100p% confidence interval for μ based on the pivotal quantity

T = (Ȳ − μ)/(S/√20) ∼ t(19)

is given by

[ȳ − a s/√20, ȳ + a s/√20].
Based on these data it would appear that the new instrument is working as it
should be since there was not evidence against H0 : = 45. We might notice
that the value = 45 is not in the center of the 95% con…dence interval but closer
to the upper endpoint suggesting that the instrument might be under reading
the true value of 45. It would be wise to continue testing the instrument on a
regular basis on a known sample to ensure that the instrument is continuing to
work well.
(d) To test H0: σ² = σ0² we use the test statistic

U = (n − 1)S²/σ0² ∼ χ²(n − 1).

A 100p% confidence interval for σ is given by

[√((n − 1)s²/b), √((n − 1)s²/a)]

where P(U ≤ a) = (1 − p)/2 = P(U ≥ b). For n = 20 and p = 0.95 we have
P(U ≤ 8.907) = 0.025 = P(U ≥ 32.852) and the confidence interval for σ is

[√(19(5.7342)/32.852), √(19(5.7342)/8.907)] = [1.82, 3.50].
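As a sketch, the same interval from R's Chi-squared quantile function:

  s2 <- 5.7342; n <- 20
  sqrt((n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1))   # approximately [1.82, 3.50]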
D = |Ȳ − 45| / (2/√20)   where  Z = (Ȳ − 45)/(2/√20) ∼ N(0, 1)

and

p-value = P(D ≥ d; H0) = P(|Z| ≥ 1.33)  where Z ∼ N(0, 1)
        = 2[1 − P(Z ≤ 1.33)] = 2(1 − 0.90824) = 0.18352.
Based on these data there is no evidence to contradict the manufacturer’s claim that
H0 : = 45.
5.5 (a) To test the hypothesis H0: μ = 105 we use the discrepancy measure or test statistic

D = |Ȳ − 105| / (S/√12)

where

S = [(1/11) ∑_{i=1}^{12} (Y_i − Ȳ)²]^{1/2}

and the t statistic

T = (Ȳ − 105)/(S/√12) ∼ t(11)

assuming the hypothesis H0: μ = 105 is true.
For these data ȳ = 104.13, s² = 88.3115 and s = 9.3974. The observed value of the discrepancy measure D is

d = |104.13 − 105| / (9.3974/√12) = 0.3194

and

p-value = P(D ≥ d; H0)
        = P(|T| ≥ 0.3194)  where T ∼ t(11)
        = 2[1 − P(T ≤ 0.3194)] = 2(0.3777)
        = 0.7554  (calculated using R).
Alternatively, using the t tables in the Course Notes we have P(T ≤ 0.260) = 0.6 and P(T ≤ 0.540) = 0.7, so 0.6 < p-value < 0.8.

In either case, since the p-value is much larger than 0.1, we would conclude that, based on the observed data, there is no evidence against the hypothesis H0: μ = 105. (Note: This does not imply the hypothesis is true!)
From the t tables we have P(T ≤ 2.201) = (1 + 0.95)/2 = 0.975 where T ∼ t(11). A 95% confidence interval for μ is

[ȳ − 2.201 s/√12, ȳ + 2.201 s/√12] = [98.16, 110.10].
(c) Since there was no evidence against H0: μ = 105 and since the value μ = 105 is near the center of the 95% confidence interval for μ, the data support the conclusion that the detector is accurate, that is, that the detector is not giving biased readings. The confidence interval for σ, however, indicates that the precision of the detectors might be of concern. The 95% confidence interval for σ suggests that the standard deviation could be as large as 16 parts per billion. As a statistician you would need to rely on the expertise of the researchers for a decision about whether the size of σ is scientifically significant and whether the precision of the detectors is too low. You would also point out to the researchers that this evidence is based on a fairly small sample of only 12 detectors.
To test H0: σ² = 100 we use the test statistic

U = ∑_{i=1}^{12} (Y_i − μ)² / σ0² ∼ χ²(n),

which here is

U = ∑_{i=1}^{12} (Y_i − 105)² / 100 ∼ χ²(12).
Since

∑_{i=1}^{12} (y_i − 105)² = ∑_{i=1}^{12} y_i² − 2(105) ∑_{i=1}^{12} y_i + 12(105)²
                          = 131096.44 − 210(1249.6) + 12(105)² = 980.44,

the observed value of U is u = 980.44/100 = 9.8044.
Alternatively, using the Chi-squared tables in the Course Notes we have P(U ≤ 9.034) = 0.3, so p-value > 2(0.3) = 0.6. In either case, since the p-value is larger than 0.1, we would conclude that, based on the observed data, there is no evidence against the hypothesis H0: σ² = 100.
5.7 (a) The respondents to the survey are students who heard about the online referen-
dum and then decided to vote. These students may not be representative of all
students at the University of Waterloo. For example, it is possible that the stu-
dents who took the time to vote are also the students who most want a fall study
break. Students who don’t care about a fall study break probably did not bother
to vote. This is an example of sampling error. Any online survey such as this
online referendum has the disadvantage that the sample of people who choose to
vote are not necessarily a representative sample of the study population of in-
terest. The advantage of online surveys is that they are inexpensive and easy to
conduct. To obtain a representative sample you would need to select a random
sample of all students at the University of Waterloo. Unfortunately, taking such a sample would be much more time consuming and costly than conducting an online referendum.
(b) A suitable target population would be the 30; 990 eligible voters. This would
also be the study population. Note that all undergraduates were able to vote
but it is not clear how the list of undergraduates is determined.
(c) The attribute of interest is the proportion of the 30; 990 eligible voters (the
study population) who would respond yes to the question. The parameter in
the Binomial model corresponds to this attribute. A Binomial model assumes
independent trials (students) which might not be a valid assumption. For example, if groups of students, say within a specific faculty, all got together and voted, their responses may not be independent events.
(d) The maximum likelihood estimate of θ based on the observed data is

θ̂ = 4440/6000 = 0.74.
Since this estimate is not based on a random sample it is not possible to say how
accurate this estimate is.
(e) An approximate 95% confidence interval for θ is given by

0.74 ± 1.96 √(0.74(0.26)/6000) = 0.74 ± 0.01 = [0.73, 0.75].
(f) Since = 0:7 is not a value contained in the approximate 95% con…dence interval
[0:73; 0:75] for , therefore the approximate p value for testing H0 : = 0:7 is
less than 0:05. (Note that since = 0:7 is far outside the interval, the p value
would be much smaller than 0:05.)
5.8 (a) If H0: θ = 3 is true then, since Y_i has a Poisson distribution with mean 3 for i = 1, 2, ..., 25 independently, ∑_{i=1}^{25} Y_i has a Poisson distribution with mean 3(25) = 75. The discrepancy measure is

D = |∑_{i=1}^{25} Y_i − 75| = |∑_{i=1}^{25} Y_i − E(∑_{i=1}^{25} Y_i)|.

For the given data, ∑_{i=1}^{25} y_i = 51. The observed value of the discrepancy measure is

d = |∑_{i=1}^{25} y_i − 75| = |51 − 75| = 24

and

p-value = P(D ≥ d; H0)
        = P(|∑_{i=1}^{25} Y_i − 75| ≥ 24; H0)
        = ∑_{x=0}^{51} 75^x e^{−75}/x! + ∑_{x=99}^{∞} 75^x e^{−75}/x!
        = 1 − ∑_{x=52}^{98} 75^x e^{−75}/x!
        = 0.006716  (calculated using R).

Since 0.001 < 0.006716 < 0.01 we would conclude that, based on the data, there is strong evidence against the hypothesis H0: θ = 3.
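A minimal R sketch of this Poisson p-value:

  # P(|S - 75| >= 24) for S ~ Poisson(75), i.e. P(S <= 51) + P(S >= 99)
  ppois(51, lambda = 75) + (1 - ppois(98, lambda = 75))   # approximately 0.0067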
5.9 The observed value of the likelihood ratio test statistic for testing H0: θ = 3 is

Λ(3) = 2(25)[2.04 log(2.04/3) + 3 − 2.04] = 8.6624

and

p-value = P(Λ(3) ≥ 8.6624; H0)
        ≈ P(W ≥ 8.6624)  where W ∼ χ²(1)
        = P(|Z| ≥ √8.6624)  where Z ∼ N(0, 1)
        = 2[1 − P(Z ≤ 2.94)] = 0.00328.

The p-value is close to the p-values calculated in (a) and (b).
5.10 Since

R(θ) = (3.6/θ)^{20} e^{20(1 − 3.6/θ)}  for θ > 0,

then

R(5) = (3.6/5)^{20} e^{20(1 − 3.6/5)} = 0.3791

and

Λ(5) = −2 log R(5) = −2 log(0.3791) = 1.9402.

Therefore the p-value is approximately P(W ≥ 1.9402) ≈ 0.16 where W ∼ χ²(1), and since p-value > 0.1 there is no evidence, based on the data, to contradict H0: θ = 5. The approximate 95% confidence interval for θ is [2.40, 5.76] which contains the value θ = 5. This also implies that the p-value > 0.05 and so the approximate confidence interval is consistent with the test of hypothesis.
5.11 Since

r(α) = 15 log[2.3(α + 1)] − 34.5(α + 1) + 15  for α > −1,

then

r(−0.1) = 15 log[2.3(−0.1 + 1)] − 34.5(−0.1 + 1) + 15 = −5.1368

and

Λ(−0.1) = −2 r(−0.1) = −2(−5.1368) = 10.2735.

Therefore the p-value is approximately P(W ≥ 10.2735) ≈ 0.0013 where W ∼ χ²(1), and since 0.001 < p-value < 0.01 there is strong evidence, based on the data, to contradict H0: α = −0.1. The approximate 95% confidence interval for α is [−0.75, −0.31] which does not contain the value α = −0.1. This also implies that the p-value < 0.05 and so the approximate confidence interval is consistent with the test of hypothesis.
5.12 Since

R(θ) = θ^{16} (1 − θ)^{66} / [(8/41)^{16} (33/41)^{66}]  for 0 < θ ≤ 1/2,

then

R(0.18) = (0.18)^{16} (1 − 0.18)^{66} / [(8/41)^{16} (33/41)^{66}] = 0.9397

and

Λ(0.18) = −2 log R(0.18) = −2 log(0.9397) = 0.1244.

Therefore the p-value is approximately P(W ≥ 0.1244) ≈ 0.72 where W ∼ χ²(1), and since p-value > 0.1 there is no evidence, based on the data, to contradict H0: θ = 0.18. The approximate 95% confidence interval for θ is [0.12, 0.29] which contains the value θ = 0.18. This also implies that the p-value > 0.05 and so the approximate confidence interval is consistent with the test of hypothesis.
5.13 (a) The maximum likelihood estimate of θ is θ̂ = 18698.6/20 = 934.93. The agreement between the plot of the empirical cumulative distribution function and the cumulative distribution function of an Exponential(934.93) random variable given in Figure 12.21 indicates that the Exponential model is reasonable.
Figure 12.21: Empirical c.d.f. and Exponential(934:93) c.d.f. for failure times of power
systems
(b) The observed value of the likelihood ratio statistic for testing H0: θ = θ0 for Exponential data (see Example 5.3.2) is

Λ(θ0) = 2n[ȳ/θ0 − 1 − log(ȳ/θ0)],

so

Λ(1000) = 2(20)[934.93/1000 − 1 − log(934.93/1000)] = 0.0885

with p-value ≈ P(W ≥ 0.0885) ≈ 0.77 where W ∼ χ²(1), so there is no evidence against H0: θ = 1000 based on the data.
5.14 One test statistic that could be used is the mean of the generated sample. The mean should be close to 0.5 if the random number generator is working well.
5.15 (a) For each given region the assumptions of independence, individuality and homo-
geneity would need to hold for the number of events per person per year.
(b) Assume the observations y_1, y_2, ..., y_K from the different regions are independent. Since Y_j ∼ Poisson(P_j λ_j t), the likelihood function for λ = (λ_1, λ_2, ..., λ_K) is

L(λ) = ∏_{j=1}^{K} (P_j λ_j t)^{y_j} e^{−P_j λ_j t} / y_j!

or more simply

L(λ) = ∏_{j=1}^{K} λ_j^{y_j} e^{−P_j λ_j t},

so that

l(λ) = ∑_{j=1}^{K} [y_j log λ_j − P_j λ_j t].

Since

∂l/∂λ_j = y_j/λ_j − P_j t = (y_j − (P_j t) λ_j)/λ_j = 0

gives λ̂_j = y_j/(P_j t), j = 1, 2, ..., K.
Therefore

l(λ̂) = ∑_{j=1}^{K} [y_j log λ̂_j − P_j λ̂_j t] = ∑_{j=1}^{K} [y_j log(y_j/(P_j t)) − y_j]
      = ∑_{j=1}^{K} y_j [log(y_j/(P_j t)) − 1].
Under H0: λ_1 = λ_2 = ... = λ_K = λ, the likelihood function is

L(λ) = ∏_{j=1}^{K} λ^{y_j} e^{−λ P_j t}.

Since

l'(λ) = (1/λ) ∑_{j=1}^{K} y_j − t ∑_{j=1}^{K} P_j = 0

if λ = ∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j), the maximum likelihood estimate of λ assuming
H0: λ_1 = λ_2 = ... = λ_K is λ̂_0 = ∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j). So
l(λ̂_0) = ∑_{j=1}^{K} y_j log λ̂_0 − λ̂_0 t ∑_{j=1}^{K} P_j
       = (∑_{j=1}^{K} y_j) log(∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j)) − ∑_{j=1}^{K} y_j
       = (∑_{j=1}^{K} y_j) [log(∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j)) − 1].
The likelihood ratio statistic is

Λ = 2l(λ̃) − 2l(λ̃_0)
  = 2 ∑_{j=1}^{K} Y_j [log(Y_j/(P_j t)) − 1] − 2 (∑_{j=1}^{K} Y_j) [log(∑_{j=1}^{K} Y_j / (t ∑_{j=1}^{K} P_j)) − 1]

with observed value

2l(λ̂) − 2l(λ̂_0) = 2 ∑_{j=1}^{K} y_j [log(y_j/(P_j t)) − 1] − 2 (∑_{j=1}^{K} y_j) [log(∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j)) − 1].
The p-value is

P(Λ ≥ observed value; H0) ≈ P(W ≥ observed value)  where W ∼ χ²(K − 1).

For these data,

λ̂ = (27/(5(2025)), 18/(5(1116)), 41/(5(3210)), 29/(5(1687)), 31/(5(2840)))  and  λ̂_0 = 146/(5(10878)).
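A sketch of the whole calculation in R for these data (the event counts, populations, and t = 5 are taken from the estimates above; the χ²(K − 1) approximation is used for the p-value):

  y <- c(27, 18, 41, 29, 31)            # observed numbers of events by region
  P <- c(2025, 1116, 3210, 1687, 2840)  # region populations
  t <- 5
  lambda_hat <- y / (P * t)             # separate rates
  lambda0 <- sum(y) / (t * sum(P))      # common rate under H0
  lrt <- 2 * sum(y * log(lambda_hat / lambda0))   # observed likelihood ratio statistic
  c(lrt, 1 - pchisq(lrt, df = length(y) - 1))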
5.16 (a) The maximum likelihood estimators are μ̃ = Ȳ and σ̃² = (1/n) ∑_{i=1}^{n} (Y_i − Ȳ)², while under H0: μ = μ0 they are μ̃0 = μ0 and σ̃0² = (1/n) ∑_{i=1}^{n} (Y_i − μ0)², and the likelihood ratio statistic is Λ(μ0) = n log(σ̃0²/σ̃²). Since

σ̃0²/σ̃² = ∑_{i=1}^{n} (Y_i − μ0)² / ∑_{i=1}^{n} (Y_i − Ȳ)²
        = [∑_{i=1}^{n} (Y_i − Ȳ)² + n(Ȳ − μ0)²] / ∑_{i=1}^{n} (Y_i − Ȳ)²,

it follows that

Λ(μ0) = n log[1 + n(Ȳ − μ0)² / ∑_{i=1}^{n} (Y_i − Ȳ)²] = n log[1 + T²/(n − 1)]

where T is the usual t statistic.
5.17
Chapter 6
(b) The scatterplot with fitted line and the residual plots shown in Figure 12.22 show no unusual patterns. The model fits the data well.
[Figure 12.22: scatterplot with fitted line, residual plots, and Normal qqplot of the standardized residuals]
for the observed data. Then, approximately 95% of the constructed intervals would contain the true, but unknown, value of the parameter. We say that we are 95% confident that our interval contains the true value of the parameter.
(d) Since P(T ≤ 1.7139) = 0.95 where T ∼ t(23), a 90% confidence interval for the mean systolic blood pressure of nurses aged x = 35 is

α̂ + β̂(35) ± 1.7139 (7.6744) [1/25 + (35 − 43.20)²/2802.00]^{1/2}
  = 126.7553 ± 3.3274 = [123.43, 130.08].
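As a rough check, the margin in (d) can be reproduced in R from the summary quantities quoted above (μ̂(35) = 126.7553, s_e = 7.6744, n = 25, x̄ = 43.20, S_xx = 2802.00):

  se <- 7.6744; n <- 25; xbar <- 43.20; Sxx <- 2802.00
  margin <- qt(0.95, df = n - 2) * se * sqrt(1/n + (35 - xbar)^2 / Sxx)
  126.7553 + c(-1, 1) * margin   # approximately [123.43, 130.08]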
(e) Since P(T ≤ 2.8073) = 0.995 where T ∼ t(23), a 99% prediction interval for the systolic blood pressure of a nurse aged x = 50 is

α̂ + β̂(50) ± 2.8073 (7.6744) [1 + 1/25 + (50 − 43.20)²/2802.00]^{1/2}
  = 139.2029 ± 23.0108 = [116.19, 162.21].
The study population is all the actors listed at boxofficemojo.com/people/. The parameter β represents the mean increase in the amount grossed by a movie for a unit change in the value of an actor.
[Figure: scatterplot with fitted line y = 32.04 + 3.62x, residual plots, and Normal qqplot of the standardized residuals]
However, since the 20 data points were obtained by taking the first 20 actors in the list, the sample is not a random sample. If actors with last names starting with letters at the beginning of the alphabet are more successful than other actors then the estimate of β might be biased.
(e) The hypothesis of no relationship is equivalent to H0: β = 0. Since

p-value = 2[1 − P(T ≤ |β̂ − 0|/(s_e/√S_xx))] = 2[1 − P(T ≤ 2.85)] = 0.011

(using R), there is evidence based on the data against H0: β = 0. Note that this is consistent with the fact that the 95% confidence interval for β does not contain the value β = 0.
(f) Since P(T ≤ 2.1009) = 0.975 where T ∼ t(18), a 95% confidence interval for the mean amount grossed by movies for actors whose value is x = 50 is

32.0444 + (3.6238)(50) ± 2.1009 (100.6524) [1/20 + (50 − 43.03)²/6283.422]^{1/2}
  = 213.2326 ± 50.8090 = [162.4236, 264.0417].
A 95% con…dence interval for the mean amount grossed by movies for actors
6.3 (a) Recall this was a regression of the form E(Y_i) = α + β x_{1i} where x_{1i} = x_i² and x_i = bolt diameter. Now n = 30, α̂ = 1.6668, β̂ = 2.8378, s_e = 0.05154, S_xx = 0.2244 and x̄_1 = 0.11. A point estimate of the mean breaking strength at x_1 = (0.35)² = 0.1225 is α̂ + β̂ x_1 = 1.6668 + 2.8378(0.1225) = 2.014.
6.4 (a)

β̂ = S_xy / S_xx = 2818.556835 / 2818.946855 = 0.9999

α̂ = ȳ − β̂ x̄ = 23.5505 − (0.9999)(23.7065) = −0.1527

A scatterplot of the data as well as the fitted line are given in the top left panel of Figure 12.24. The straight line fits the data very well. The observed points all lie very close to the fitted line.
(b) Since P(T ≤ 2.1009) = 0.975 where T ∼ t(18) and

s_e = [(S_yy − β̂ S_xy)/(n − 2)]^{1/2}
    = [(2820.862295 − (0.9998616)(2818.556835))/18]^{1/2} = 0.3870,

a 95% confidence interval for β is

0.9999 ± 2.1009 (0.3870)/√2818.946855 = [0.9845, 1.0152].

Since the value β = 1 is inside the 95% confidence interval for β we know the p-value for testing H0: β = 1 is greater than 0.05. Alternatively

p-value = 2[1 − P(T ≤ |β̂ − 1|/(s_e/√S_xx))] = 2[1 − P(T ≤ 0.019)] = 0.99.
Figure 12.24: Scatterplot and residual plots for cheap versus expensive procedures
(d) The scatterplot plus the fitted line indicates good agreement between the cheaper way of determining concentrations and the more expensive way. The points lie quite close to the fitted line. The data suggest that the cheaper way of determining concentrations is quite accurate since the cheaper way does not appear to consistently give values which are systematically above (or below) the concentration determined by the more expensive way.
The fitted line is y = −0.1527 + 0.9999x.

Up to a constant, the likelihood function is

L(β) = exp(−(1/(2σ²)) ∑_{i=1}^{n} (y_i − βx_i)²).
Maximizing l(β) is equivalent to minimizing g(β) = ∑_{i=1}^{n} (y_i − βx_i)², which is the criterion for determining the least squares estimate of β. Solving

dg/dβ = 2 ∑_{i=1}^{n} (βx_i − y_i) x_i = 0

we obtain

β̂ = ∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i².

The corresponding estimator is

β̃ = ∑_{i=1}^{n} x_i Y_i / ∑_{i=1}^{n} x_i² = ∑_{i=1}^{n} a_i Y_i   where a_i = x_i / ∑_{j=1}^{n} x_j²,
and

Var(β̃) = ∑_{i=1}^{n} (x_i / ∑_{j=1}^{n} x_j²)² Var(Y_i) = (1/(∑_{i=1}^{n} x_i²)²) ∑_{i=1}^{n} x_i² σ² = σ² / ∑_{i=1}^{n} x_i²,

therefore

β̃ = ∑_{i=1}^{n} x_i Y_i / ∑_{i=1}^{n} x_i² ∼ N(β, σ² / ∑_{i=1}^{n} x_i²).
(c)

∑_{i=1}^{n} (y_i − β̂x_i)² = ∑_{i=1}^{n} (y_i² − 2x_i y_i β̂ + x_i² β̂²)
  = ∑_{i=1}^{n} y_i² − 2 (∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i²) ∑_{i=1}^{n} x_i y_i + (∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i²)² ∑_{i=1}^{n} x_i²
  = ∑_{i=1}^{n} y_i² − 2 (∑_{i=1}^{n} x_i y_i)² / ∑_{i=1}^{n} x_i² + (∑_{i=1}^{n} x_i y_i)² / ∑_{i=1}^{n} x_i²
  = ∑_{i=1}^{n} y_i² − (∑_{i=1}^{n} x_i y_i)² / ∑_{i=1}^{n} x_i²

as required.
(d) Find a in the t table such that P(−a ≤ T ≤ a) = 0.95 where T ∼ t(n − 1). Then since

0.95 = P(−a ≤ (β̃ − β)/(S_e/√(∑_{i=1}^{n} x_i²)) ≤ a)
     = P(β̃ − a S_e/√(∑_{i=1}^{n} x_i²) ≤ β ≤ β̃ + a S_e/√(∑_{i=1}^{n} x_i²)),

a suitable discrepancy measure is

D = |β̃ − β| / (S_e/√(∑_{i=1}^{n} x_i²)).
6.6 (a)

β̂ = ∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i² = 13984.5554/14058.9097 = 0.9947

and the fitted model is y = 0.9947x.

(b) A scatterplot of the data as well as the fitted line are given in the top left panel of Figure 12.25. The straight line fits the data very well. The observed points all lie very close to the fitted line.
(c) Since P(T ≤ 2.0930) = 0.975 where T ∼ t(19) and

s_e = [(1/(n − 1)) (∑_{i=1}^{n} y_i² − (∑_{i=1}^{n} x_i y_i)²/∑_{i=1}^{n} x_i²)]^{1/2}
    = [(13913.3833 − 13984.5554²/14058.9097)/19]^{1/2} = 0.3831.
Figure 12.25: Scatterplot and residual plots for model through the origin
(d) The scatterplot with fitted line and the residual plots shown in Figure 12.25 show no unusual patterns. The model fits the data well.

(e) Based on this analysis we would conclude that the simple model Y_i ∼ G(βx_i, σ) is an adequate model for these data as compared to the model Y_i ∼ G(α + βx_i, σ).
Figure 12.26: Scatterplot and residual plots for death rate due to cirrhosis of the liver versus
wine consumption
there is very strong evidence based on the data against H0: β = 0. Note that this is consistent with the fact that the 95% confidence interval for β does not contain the value β = 0.
6.8 (a) The scatterplot and residual plots indicate that the model …ts the data well.
6.9 (a)

x̄ = 191.7871,  ȳ = 20.0276
S_xx = 2291.3148,  S_yy = 447.8497,  S_xy = 1008.8246

β̂ = S_xy/S_xx = 1008.8246/2291.3148 = 0.44028

α̂ = ȳ − β̂ x̄ = 20.0276 − (0.44028)(191.7871) = −64.4128

The fitted line is y = −64.4128 + 0.44028x. The scatterplot and residual plot are given in the top two panels of Figure 12.27. Both graphs show a distinctive pattern. In the scatterplot, as x increases the points lie above the line, then below, then above. Correspondingly, in the residual plot, as x increases the residuals are positive, then negative, then positive. In the residual plot the points do not lie in a horizontal band about the line r̂_i = 0, which suggests that the linear model is not adequate.
Figure 12.27: Fitted lines and residual plots for atmospheric pressure data
(b)

x̄ = 191.7871,  z̄ = 2.9804
S_xx = 2291.3148,  S_zz = 1.00001,  S_xz = 47.81920

β̂ = S_xz/S_xx = 47.81920/2291.3148 = 0.02087

α̂ = z̄ − β̂ x̄ = 2.9804 − (0.02087)(191.7871) = −1.02214

The fitted line is z = −1.02214 + 0.02087x, where z = log(pressure). The scatterplot and residual plots are given in the bottom two panels of Figure 12.27. In both of these plots we do not observe any unusual patterns. There is no evidence to contradict the linear model for log(pressure) versus temperature. However, this does not "prove" that the theory's model is correct - only that there is no evidence to disprove it.
(c) Since P(T ≤ 2.0452) = 0.975 where T ∼ t(29), and

s_e = [(S_zz − β̂ S_xz)/(n − 2)]^{1/2} = [(1.00001 − (0.02087)(47.81920))/29]^{1/2} = 0.00838894,

a 95% confidence interval for the mean log atmospheric pressure at a temperature of x = 195 is

−1.02214 + (0.02087)(195) ± 2.0452 (0.008389) [1/31 + (195 − 191.7871)²/2291.3148]^{1/2}
  = 3.04747 ± 0.00329 = [3.04418, 3.05076]
which implies a 95% con…dence interval for the mean atmospheric pressure at a
temperature of x = 195 is
6.10 (a) We assume that the study population is the set of all Grade 3 students who
are being taught the same curriculum. (For example in Ontario all Grade 3
students must be taught the same Grade 3 curriculum set out by the Ontario
Government.) The parameter 1 represents the mean score on the DRP test
if all Grade 3 students in the study population took part in the new directed
readings activities for an 8-week period.
The parameter 2 represents the mean score on the DRP test for all Grade 3
students in the study population without the directed readings activities.
The parameter represents the standard deviation of the DRP scores for all
Grade 3 students in the study population which is assumed to be the same
whether the students take part in the new directed readings activities or not.
(b) The qqplot of the responses for the treatment group and the qqplot of the re-
sponses for the control group are given in Figures 12.28 and 12.29. Looking at
these plots we see that the points lie reasonably along a straight line in both plots
and so we would conclude that the normality assumptions seem reasonable.
Figure 12.28: Normal Qqplot of the Responses for the Treatment Group
Figure 12.29: Normal Qqplot for the Responses in the Control Group
(d) To test the hypothesis of no difference between the means, that is, to test the hypothesis H0: μ1 = μ2, we use the discrepancy measure

D = |Ȳ1 − Ȳ2 − 0| / (S_p √(1/n1 + 1/n2))

where

T = (Ȳ1 − Ȳ2 − 0) / (S_p √(1/n1 + 1/n2)) ∼ t(n1 + n2 − 2)

assuming H0: μ1 = μ2 is true. The observed value of D for these data is

d = |ȳ1 − ȳ2 − 0| / (s_p √(1/n1 + 1/n2)) = |51.4762 − 41.5217 − 0| / (14.5512 √(1/21 + 1/23)) = 2.2666

and p-value = 2[1 − P(T ≤ 2.2666)] ≈ 0.029 where T ∼ t(42). Since the p-value is less than 0.05 there is evidence against the hypothesis H0: μ1 = μ2 based on the data.
Although the data suggest there is a difference between the treatment group and the control group, we cannot conclude that the difference is due to the new directed readings activities. The difference could simply be due to the differences in the two Grade 3 classes. Since randomization was not used to determine which student received the treatment and which student was in the control group, the difference in the DRP scores could have existed before the treatment was applied.
with

p-value = 2[1 − P(T ≤ 2.074)] = 0.05  where T ∼ t(18)

so there is weak evidence against H0 based on the data.

(c) We repeat the above using as data Z_ij = log(Y_ij). This time the sample means are 2.248 and 1.795, and the sample variances are 0.320 and 0.240 respectively. The pooled estimate of the standard deviation is s_p = √((0.320 + 0.240)/2) = 0.529. The observed value of the discrepancy measure is

d = |2.248 − 1.795 − 0| / (0.529 √(1/10 + 1/10)) = 1.9148

with

p-value = 2[1 − P(T ≤ 1.91)] ≈ 0.07  where T ∼ t(18)

so there is even less evidence against H0 based on the data.
(d) One could check the Normality assumption with qqplots for each of the variables Y_ij and Z_ij = log(Y_ij), although with such a small sample size these will be difficult to interpret.
6.13 Let μ1 be the mean log failure time for welded girders and μ2 be the mean log failure time for repaired welded girders. The pooled estimate of the common standard deviation is

s_p = √((13(0.0914) + 9(0.0422))/22) = 0.26697.

From the t tables, P(T < 2.0739) = 0.975 where T ∼ t(22). The 95% confidence interval for μ1 − μ2 is

14.564 − 14.291 ± 2.0739 (0.26697) √(1/14 + 1/10) = 0.273 ± 0.22924 = [0.04376, 0.50224].

Since

d = |14.564 − 14.291 − 0| / (0.26697 √(1/14 + 1/10)) = 2.4698

with

p-value = 2[1 − P(T ≤ 2.4698)] = 0.02175  where T ∼ t(22),

there is evidence against the hypothesis of no difference based on the data. This is consistent with the fact that the 95% confidence interval for μ1 − μ2 did not contain the value μ1 − μ2 = 0.
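A minimal R sketch of this two-sample comparison from the summary statistics above:

  ybar1 <- 14.564; ybar2 <- 14.291; n1 <- 14; n2 <- 10
  sp <- sqrt((13 * 0.0914 + 9 * 0.0422) / 22)
  se <- sp * sqrt(1/n1 + 1/n2)
  ci <- (ybar1 - ybar2) + c(-1, 1) * qt(0.975, df = 22) * se    # approximately [0.044, 0.502]
  pval <- 2 * (1 - pt(abs(ybar1 - ybar2) / se, df = 22))        # approximately 0.022
  list(ci = ci, pval = pval)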
6.15 We assume that the observations for the "Alcohol" group are a random sample from a G(μ1, σ) distribution and that the observations for the "Non-Alcohol" group are a random sample from a G(μ2, σ) distribution. To see if there is any difference between the two groups we construct a 95% confidence interval for the mean difference in reaction times μ1 − μ2. The pooled estimate of the common standard deviation is

s_p = √((0.608 + 0.35569)/22) = 0.2093.
6.16 (a) We assume that the observed differences are a random sample from a G(μ, σ) distribution. An estimate of σ is

s = √(17.135/7) = 1.5646.

(b) If the natural pairing is ignored, an estimate of the common standard deviation is

s_p = √((535.16875 + 644.83875)/14) = 9.18075.

Since P(T ≤ 2.1448) = 0.975 where T ∼ t(14), a 95% confidence interval for μ1 − μ2 is

23.6125 − 22.5375 ± 2.1448 (9.18075) √(1/8 + 1/8) = [−8.7704, 10.9204].
We notice that although both intervals in (a) and (b) are centered at the value 1.075, the interval in (b) is very much wider.

(c) A matched pairs study allows for a more precise comparison since differences between the 8 pairs have been eliminated. That is, by analyzing the differences we do not need to worry that there may have been large differences in the 8 cars which were used in the study with respect to other explanatory variates which might affect gas mileage (the response variate), such as size of engine, make of car, etc.
6.17 (a) We assume that the study population is the set of all factories of similar size. The parameter μ represents the mean difference in the number of staff hours per month lost due to accidents before and after the introduction of an industrial safety program in the study population.
(c) The observed value of the discrepancy measure is

d = |ȳ − 0| / (s/√n) = |−15.3375 − 0| / (12.8107/√8) = 3.39

with

p-value = 2[1 − P(T ≤ 3.39)] ≈ 0.012  where T ∼ t(7).

Since the p-value is between 0.01 and 0.05 there is reasonable evidence against the hypothesis H0: μ = 0 based on the data.
Since this experimental study was conducted as a matched pairs study, an analysis of the differences, y_i = y_{1i} − y_{2i}, allows for a more precise comparison since differences between the 8 pairs have been eliminated. That is, by analyzing the differences we do not need to worry that there may have been large differences in the safety records between factories due to other variates such as differences in the management at the different factories, differences in the type of work being conducted at the factories, etc. Note however that a drawback to the study was that we were not told how the 8 factories were selected. To do the analysis above we have assumed that the 8 factories are a random sample from the study population of all similar size factories but we do not know if this is the case.
6.18 (a) Since the two algorithms are each run on the same 20 sets of numbers, we analyse the differences y_i = y_{Ai} − y_{Bi}, i = 1, ..., 20. Since P(T < 2.8609) = (1 + 0.99)/2 = 0.995 where T ∼ t(19), we obtain the confidence interval

0.409 ± 2.8609 (0.487322)/√20 = [0.097, 0.721].

These values are all positive indicating strong evidence based on the data against H0: μA − μB = 0 (p-value < 0.01), that is, the data suggest that algorithm B is faster.
(b) To check the Normality assumption we plot a qqplot of the di¤erences. See
Figure 12.30. The data lie reasonably along a straight line and therefore a
Normal model is reasonable.
[Figure 12.30: Normal qqplot of the differences]
(e)

s_p = √((1.4697 + 0.9945)/2) = 1.11.

Since P(T < 2.86) = (1 + 0.99)/2 = 0.995 where T ∼ t(38), the interval, assuming common variance, is

ȳ1 − ȳ2 ± a s_p √(1/20 + 1/20) = 0.409 ± 2.68(1.11)√(1/20 + 1/20)

or

[−0.532, 1.349].
This second interval [−0.532, 1.349] is much wider than the first interval [0.097, 0.721] based on the paired experiment and, unlike the first interval, it contains the value zero. Unlike the paired design, independent samples of the same size (20 different problems run with each algorithm) are too small to demonstrate the superiority of algorithm B. Independent samples are a less efficient way to analyse the difference. This is why, in computer simulations, it is essential to be able to run different simulations using the same random number seed.
Chapter 7
7.1 The observed value of the likelihood ratio statistic is

λ = 2[14 log(14/21) + 28 log(28/21) + 36 log(36/29) + 22 log(22/29)] = 8.1701

with p-value ≈ P(W ≥ 8.1701) ≈ 0.004 where W ∼ χ²(1), so there is strong evidence against the hypothesis that the probability of rust occurring is the same for rust-proofed and non-rust-proofed cars based on the observed data.
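A small R sketch of this likelihood ratio calculation and its χ²(1) p-value:

  obs <- c(14, 28, 36, 22)
  exp <- c(21, 21, 29, 29)   # expected frequencies under the hypothesis of equal probabilities
  lrt <- 2 * sum(obs * log(obs / exp))
  c(lrt, 1 - pchisq(lrt, df = 1))   # 8.17 and p-value approximately 0.004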
7.2 If the probability of catching the cold is the same for each group, then it is estimated as 50/200 = 0.25, in which case the expected frequencies e_j in the four categories are 25, 75, 25, 75 respectively. The observed frequencies y_j are 20, 80, 30, 70. The observed value of the likelihood ratio statistic is

λ = 2[20 log(20/25) + 80 log(80/75) + 30 log(30/25) + 70 log(70/75)] = 2.6807

with p-value ≈ P(W ≥ 2.6807) ≈ 0.10 where W ∼ χ²(1). Based on the observed data there is no evidence against the hypothesis that the probability of catching a cold during the study period was the same for each group.
7.3 The total number of defectives among the 250(12) = 3000 items inspected gives

θ̂ = 274/3000 = 0.09133.

We want to test the hypothesis that the number of defectives in a box is Binomial(12, θ). Under this hypothesis and using θ̂ = 0.09133 we obtain the expected numbers in each category

e_i = 250 (12 choose i) θ̂^i (1 − θ̂)^{12−i}  for i = 0, 1, ..., 5,

and the last category is obtained by subtraction. Since the expected numbers in the last three categories are all less than 5, we pool these categories to improve the Chi-squared approximation and obtain an observed likelihood ratio statistic of 38.8552. Under the null hypothesis we had to estimate the parameter θ, and the degrees of freedom are 4 − 1 = 3. The p-value is P(W > 38.8552) ≈ 0 where W ∼ χ²(3), so based on the data there is very strong evidence that the Binomial model does not fit. The likely reason is that the defects tend to occur in batches when packed (so that there are more cartons with no defects than one would expect).
The expected frequencies assuming a Poisson(1.15) distribution are given in the table below:

Number of interruptions:  0      1      2      3      4     ≥5    Total
f_i:                      64     71     42     18     4     1     200
e_i:                      63.33  72.83  41.88  16.05  4.61  1.30  200

where

e_i = 200 (1.15)^i e^{−1.15}/i!  for i = 0, 1, ..., 4

and the last category is obtained by subtraction. Since the expected frequency in the last category is less than 5 we combine the last two categories to obtain

Number of interruptions:  0          1          2          3          ≥4       Total
f_i (e_i):                64(63.33)  71(72.83)  42(41.88)  18(16.05)  5(5.91)  200

and p-value ≈ P(W > 0.43) = 0.93 where W ∼ χ²(3). Based on the data there is no evidence against the hypothesis that the Poisson model fits the data.
The expected frequencies assuming the Binomial model are calculated using

e_{nj} = y_{n+} (n choose j) θ̂_n^j (1 − θ̂_n)^{n−j},  j = 0, 1, ..., n;  n = 2, 3, 4,

giving the following expected numbers of litters with j females:

                    j = 0    j = 1    j = 2    j = 3    j = 4   Total (y_{n+})
Litter size n = 2:  25.3125  39.375   15.3125                   80
Litter size n = 3:  8.4280   31.6049  39.5062  16.4609          96
Litter size n = 4:  7.0643   25.9964  35.8751  22.0034  5.0608  96

For n = 2, λ = 1.11 and p-value = P(W ≥ 1.11) ≈ 0.29 where W ∼ χ²(1), so there is no evidence based on the data against the Binomial model. Similarly for n = 3, we obtain λ = 4.22 and P(W ≥ 4.22) = 0.12 where W ∼ χ²(2), and there is no evidence based on the data against the Binomial model. For n = 4, λ = 1.36 and P(W ≥ 1.36) = 0.71 where W ∼ χ²(3) and there is also no evidence based on the data against the Binomial model.
(b) The likelihood function for θ1, θ2, θ3, θ4 is

L(θ1, θ2, θ3, θ4) = θ1^{12}(1 − θ1)^{8} θ2^{70}(1 − θ2)^{90} θ3^{160}(1 − θ3)^{128} θ4^{184}(1 − θ4)^{200},  0 < θn < 1, n = 1, 2, 3, 4.

Under the hypothesis θ1 = θ2 = θ3 = θ4 = θ the likelihood function is

L(θ) = θ^{12}(1 − θ)^{8} θ^{70}(1 − θ)^{90} θ^{160}(1 − θ)^{128} θ^{184}(1 − θ)^{200}
     = θ^{12+70+160+184} (1 − θ)^{8+90+128+200}
     = θ^{426} (1 − θ)^{426},  0 < θ < 1.
Under the hypothesis θ = 0.5, the expected frequencies are

e_{nj} = y_{n+} (n choose j) (0.5)^n,  j = 0, 1, ..., n;  n = 2, 3, 4,

and the observed value of the likelihood ratio statistic is

λ = 2[8 log(8/10) + 12 log(12/10) + ⋯ + 22 log(22/24) + 5 log(5/6)] = 14.27.
7.6 This process can be thought of as an experiment in which we observe y_i = the number of non-zero digits (Failures) until the first zero (Success), for i = 1, 2, ..., 50. Therefore the Geometric(θ) distribution is an appropriate model. Since θ is unknown we estimate it using the maximum likelihood estimate. The likelihood function for θ is

L(θ) = ∏_{i=1}^{50} θ(1 − θ)^{y_i} = θ^{50} (1 − θ)^{∑_{i=1}^{50} y_i},  0 < θ < 1.

For these data ∑_{i=1}^{50} y_i = 348 and the log likelihood function is

l(θ) = 50 log θ + 348 log(1 − θ),  0 < θ < 1.

Solving

l'(θ) = 50/θ − 348/(1 − θ) = 0

gives the maximum likelihood estimate

θ̂ = 50/(50 + 348) = 0.1256
for θ. To test the fit of the model we summarize the data in a frequency table:

# between 2 zeros:  0  1  2  3  4  5  6  7  8  10  12  13  14  15  16  18  19  20  21  22  26
# of occurrences:   6  4  9  3  5  2  2  3  2  2   1   1   1   1   1   1   1   1   1   2   1

The expected frequencies are

e_j = 50 (0.1256) (1 − 0.1256)^j,  j = 0, 1, ... .

Pooling categories so that the expected frequencies are not too small gives

Observations between two 0's:  0     1     2-3   4-5   6-7   8-10  ≥11    Total
Observed frequency f_j:        6     4     12    7     5     4     12     50
Expected frequency e_j:        6.28  5.49  9.00  6.88  5.26  5.67  11.42  50

The observed value of the likelihood ratio statistic is λ = 1.96. The degrees of freedom for the Chi-squared approximation are 7 − 1 − 1 = 5 and the p-value ≈ P(W ≥ 1.96) ≈ 0.9 where W ∼ χ²(5). There is no evidence based on the data against the hypothesis that the Geometric distribution is a good model for these data.
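A sketch of the pooled goodness-of-fit calculation in R, using the observed and expected frequencies in the table above:

  obs <- c(6, 4, 12, 7, 5, 4, 12)
  exp <- c(6.28, 5.49, 9.00, 6.88, 5.26, 5.67, 11.42)
  lrt <- 2 * sum(obs * log(obs / exp))                 # approximately 1.96
  c(lrt, 1 - pchisq(lrt, df = length(obs) - 1 - 1))    # degrees of freedom 7 - 1 - 1 = 5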
so there is evidence based on the data against the hypothesis that the two classifications are independent.

The observed value of the likelihood ratio statistic is λ = 0.5587 with p-value ≈ P(W ≥ 0.5587) = 0.9058 where W ∼ χ²(3), so there is no evidence based on the data to contradict the hypothesis of no association between the sex distribution and age of the mother.

(b) The expected frequencies are:

The observed value of the likelihood ratio statistic is λ = 5.4441 with p-value ≈ P(W ≥ 5.4441) = 0.1420 where W ∼ χ²(3). There is no evidence based on the data against the Binomial model.
The observed value of the likelihood ratio statistic is λ = 10.8 with p-value ≈ P(W ≥ 10.8) = 0.013 where W ∼ χ²(3). Therefore there is evidence based on the data against the hypothesis that birth weight is independent of parental smoking habits.

(b) The expected frequencies, depending on whether the mother is a smoker or non-smoker, are:

Mother smokes (e_ij):
                 Father smokes      Father non-smoker  Total
Above average    15(30)/46 = 9.78   5.22               15
Below average    20.22              10.78              31
Total            30                 16                 46

Mother non-smoker (e_ij):
                 Father smokes       Father non-smoker  Total
Above average    18(35)/54 = 11.67   23.33              35
Below average    6.33                12.67              19
Total            18                  36                 54
For the Mother smokes table, the observed value of the likelihood ratio statistic is λ = 0.2644 with p-value ≈ P(W ≥ 0.2644) ≈ 0.61 where W ∼ χ²(1). For the Mother non-smoker table, the observed value of the likelihood ratio statistic is λ = 0.04078 with p-value ≈ P(W ≥ 0.04078) ≈ 0.84 where W ∼ χ²(1). In both cases there is no evidence based on the data against the hypothesis that, given the smoking habits of the mother, birth weight is independent of the smoking habits of the father.
Chapter 8
8.1 (a) The observed value of the likelihood ratio statistic is λ = 480.65 so the p-value is almost zero; there is very strong evidence against independence based on the data.
8.3 (a) The observed value of the likelihood ratio statistic is = 112 and p value t 0.
(b) Only Program A shows any evidence of non-independence, and that is in the
direction of a lower admission rate for males.
APPENDIX B: SAMPLE TESTS
[14] 2. Fill in the blanks below. You may use a numerical value or one of the following words
or phrases: sample skewness, sample kurtosis, sample variance, sample mean, relative
frequencies, frequencies, histogram, boxplot.
[12] 3. Researchers are interested in the relationship between a certain gene and the risk
of contracting diabetes. A gene is said to be expressed if its coded information is
converted into certain proteins. A team of researchers investigates whether there is
a relationship between a certain gene being expressed, and whether or not a person
contracts diabetes in their lifetime. The team takes a random sample of 100 people
who are aged 55 or above. For each person selected they determine (i) age, (ii)
whether or not the gene is expressed, (iii) the person’s insulin level, and (iv) if the
person has diabetes.
ii. an observational study because we are recording observations for each sampled unit.
[3] b. The “age” of the subject is an example of (check only those that apply)
i. an explanatory variate because it explains how long the subject is in the
study.
ii. an explanatory variate because it may help to explain whether a given person
will contract diabetes.
iii. a non-Normal variate because subjects may lie about their age.
[3] c. The Plan step in PPDAC for this experiment includes (check only those that
apply)
i. the question of whether or not diabetes was related to the expression of the
gene.
ii. the sampling protocol or the procedure used to select the sample.
[3] d. In the Problem step of PPDAC, we (check only those that apply)
Min 1st Quartile Median Mean 3rd Quartile Max Sample s.d.
30 40 45 44:02 49 57 6:65
[Histogram of the test scores (Baumann$post.test.3)]

Figure 11.3: Normal qq plot for test scores with superimposed line and confidence region
[Boxplot of post.test.3]
Based on these plots and statistics circle True or False for the following
statements.
[13] 5. [7] a. Suppose y1, y2, ..., y25 are the observed values in a random sample from the Poisson(θ) distribution. Find the maximum likelihood estimate of θ. Show all your steps.
[6] b. Suppose y1, y2, ..., y10 are the observed values in a random sample from the probability density function

f(y; θ) = (y/θ²) e^{−y/θ}  for y > 0

where 0 < θ < ∞. Find the maximum likelihood estimate of θ. Show all your steps.
The set of university students living in the Kitchener Waterloo region in Sep-
tember 2012.
(d) What are two variates in this problem?
There may be a difference between KW university students and the population of Ontario university students; for example, university students in Toronto and Thunder Bay may have different financial worries than KW university students.

(g) A possible source of sampling error is:

Since more males tend to go to football games there may be a difference between the proportion of males in the sample and the proportion of males in the study population.
(h) Describe an attribute of interest for the target population and provide an esti-
mate based on the given data.
[14] 2. Fill in the blanks below. You may use a numerical value or one of the following words
or phrases: sample skewness, sample kurtosis, sample variance, sample mean, relative
frequencies, frequencies, histogram, boxplot.
[2] a. A large positive value of the sample skewness indicates that the distribution
is not symmetric and the right tail is larger than the left.
[12] 3. Researchers are interested in the relationship between a certain gene and the risk
of contracting diabetes. A gene is said to be expressed if its coded information is
converted into certain proteins. A team of researchers investigates whether there is
a relationship between a certain gene being expressed, and whether or not a person
contracts diabetes in their lifetime. The team takes a random sample of 100 people
who are aged 55 or above. For each person selected they determine (i) age, (ii)
whether or not the gene is expressed, (iii) the person’s insulin level, and (iv) if the
person has diabetes.
ii. [✓] an observational study because we are recording observations for each sampled unit.
[3] b. The “age” of the subject is an example of (check only those that apply)
i. an explanatory variate because it explains how long the subject is in the
study.
ii. [✓] an explanatory variate because it may help to explain whether a given person will contract diabetes.
iii. a non-Normal variate because subjects may lie about their age.
[3] c. The Plan step in PPDAC for this experiment includes (check only those that
apply)
i. the question of whether or not diabetes was related to the expression of the
gene.
ii. [✓] the sampling protocol or the procedure used to select the sample.
iii. [✓] the specification of the sample size.
[3] d. In the Problem step of PPDAC, we (check only those that apply)
Based on these plots and statistics circle True or False for the following
statements.
L(θ) = ∏_{i=1}^{n} θ^{y_i} e^{−θ}/y_i!
     = θ^{∑_{i=1}^{n} y_i} e^{−nθ} ∏_{i=1}^{n} (1/y_i!)   (note that the term ∏_{i=1}^{n} (1/y_i!) is optional).
where 0 < θ < ∞. Find the maximum likelihood estimate of θ. Show all your steps.

L(θ) = ∏_{i=1}^{n} (y_i/θ²) e^{−y_i/θ} = (∏_{i=1}^{n} y_i) θ^{−2n} exp(−(1/θ) ∑_{i=1}^{n} y_i)  for θ > 0

or more simply

L(θ) = θ^{−2n} exp(−nȳ/θ)  for θ > 0.
(b) Between December 20, 2013 and February 7, 2014 the Kitchener City Council conducted
an online survey which was posted on the City of Kitchener’s website. The online survey
was publicized in the local newspapers, radio stations and TV news. The purpose of the
survey was to determine whether or not the citizens of Kitchener supported a proposal to
put life sized bronze statues of Canada’s past prime ministers in Victoria Park, Kitchener
as a way to celebrate Canada’s 150th. The community group that had proposed the idea
had already received 2 million dollars in pledges and was asking the city for a contribution
of $300,000 over three years.
People who took part in the survey were asked "Do you support the statue proposal in
concept, by which we mean do you like the idea even if you don’t agree with all aspects of
the proposal?" Of the 2441 who took the survey, 1920 answered no to this question.
(i) Explain clearly whether you think using the online survey was a good way for the
City of Kitchener to determine whether or not the citizens of Kitchener support the Prime
Ministers’Statues Project.
(ii) Assume the model Y v Binomial (n; ) where Y = number of people who responded
no to the question "Do you support the statue proposal in concept, by which we mean do
you like the idea even if you don’t agree with all aspects of the proposal?" What does the
parameter represent in this study?
(iv) An approximate 95% con…dence interval for based on the observed data is
_________________________.
(v) By reference to the con…dence interval, indicate what you know about the p value
for a test of the hypothesis H0 : = 0:8?
(c) Suppose a Binomial experiment is conducted and the observed 95% con…dence interval
for is [0:1; 0:2]. This means (circle the letter for the correct answer):
A : The probability that is contained in the interval [0:1; 0:2] equals 0:95.
B : If the Binomial experiment was repeated 100 times independently and a 95% confidence interval was constructed each time then approximately 95 of these intervals would contain the true value of θ.
2: [20] At the R.A.T. laboratory a large number of genetically engineered rats are raised
for conducting research. Twenty rats are selected at random and fed a special diet. The
weight gains (in grams) from birth to age 3 months of the rats fed this diet are:
63.4 68.3 52.0 64.5 62.3 55.8 59.3 62.4 75.8 72.1
55.6 73.2 63.9 60.7 63.9 60.2 60.5 67.1 66.6 66.7
Let yi = weight gain of the ith rat, i = 1, ..., 20. For these data
∑_{i=1}^{20} y_i = 1273.8 and ∑_{i=1}^{20} (y_i − ȳ)² = 665.718.
Y_i ∼ N(μ, σ²) = G(μ, σ), i = 1, ..., 20
[Normal QQ-plot of the weight gains: quantiles of input sample versus standard Normal quantiles]
(b) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
(d) Let
T = (Ȳ − μ) / (S/√20) where S² = (1/19) ∑_{i=1}^{20} (Y_i − Ȳ)².
(e) The company, R.A.T. Chow, that produces the special diet claims that the mean weight gain for rats that are fed this diet is 67 grams.
The p value for testing the hypothesis H0: μ = 67 is between _____________ and _______________.
(f) Let W = (1/σ²) ∑_{i=1}^{20} (Y_i − Ȳ)².
(g) A 90% confidence interval for σ for the given data is _____________________.
f(y; θ) = (1/θ) e^{−y/θ} for y > 0 and θ > 0.
g(w) = (1/2) e^{−w/2}, for w > 0
which is the probability density function of a random variable with a χ²(2) distribution.
(b) Suppose Y1, ..., Yn is a random sample from the Exponential(θ) distribution. Use your result from (a) and theorems that you have learned in class to prove that
U = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(2n).
(c) Explain clearly how the pivotal quantity U can be used to obtain a two-sided 100p% confidence interval for θ.
U = (2/θ) ∑_{i=1}^{25} Y_i ∼ χ²(50).
(e) Suppose y1, ..., y25 is an observed random sample from the Exponential(θ) distribution with ∑_{i=1}^{25} y_i = 560.
Justification: An approximate 90% confidence interval for θ is given by θ̂ ± 1.645 √(θ̂(1−θ̂)/n) since P(Z ≤ 1.645) = 0.95 where Z ∼ N(0, 1), which has width 2(1.645)√(θ̂(1−θ̂)/n). Therefore we need n such that
1.645 √(θ̂(1−θ̂)/n) ≤ 0.02
or n ≥ (1.645/0.02)² θ̂(1−θ̂).
Since we don't know θ̂ and the right side of the inequality takes on its largest value for θ̂ = 0.5 we choose n such that
n ≥ (1.645/0.02)² (0.5)² = 1691.3.
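As a quick numerical check of this sample-size calculation (not part of the original solution), here is a minimal Python sketch; the variable names are illustrative only.

    from scipy.stats import norm

    # 90% interval: z with P(Z <= z) = 0.95
    z = norm.ppf(0.95)            # approximately 1.645
    margin = 0.02                 # required half-width of the interval

    # The half-width z*sqrt(theta_hat*(1 - theta_hat)/n) is largest
    # when theta_hat = 0.5, so use that worst case.
    theta_hat = 0.5
    n_required = (z / margin) ** 2 * theta_hat * (1 - theta_hat)
    print(n_required)             # approximately 1691, so take n = 1692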
(b) Between December 20, 2013 and February 7, 2014 the Kitchener City Council conducted an online survey which was posted on the City of Kitchener's website. The online survey was publicized in the local newspapers, radio stations and TV news. The purpose of the survey was to determine whether or not the citizens of Kitchener supported a proposal to put life-sized bronze statues of Canada's past prime ministers in Victoria Park, Kitchener as a way to celebrate Canada's 150th birthday. The community group that had proposed the idea had already received 2 million dollars in pledges and was asking the city for a contribution of $300,000 over three years.
People who took part in the survey were asked "Do you support the statue proposal in concept, by which we mean do you like the idea even if you don't agree with all aspects of the proposal?" Of the 2441 who took the survey, 1920 answered no to this question.
(i) [3] Explain clearly whether you think using the online survey was a good way for the City of Kitchener to determine whether or not the citizens of Kitchener support the Prime Ministers' Statues Project.
This is not a good way for the City of Kitchener to determine whether or not the citizens of Kitchener support the Prime Ministers' Statues Project.
The respondents to the survey are people who heard about the survey through local
media, had access to the internet and then took the time to complete the survey. These
people are probably not representative of all citizens of Kitchener. This is an example of
sampling error.
To obtain a representative sample you would need to select a random sample of all
citizens living in Kitchener.
(ii) [2] Assume the model Y ∼ Binomial(n, θ) where Y = number of people who responded no to the question "Do you support the statue proposal in concept, by which we mean do you like the idea even if you don't agree with all aspects of the proposal?" The parameter θ corresponds to what attribute of interest in the study population? Be sure to define the study population.
The parameter θ corresponds to the proportion of people in the study population, which consists of all citizens of Kitchener, who would respond no to the question.
(iii) [2] A point estimate of θ based on the observed data is 1920/2441 = 0.7866.
(iv) [4] An approximate 95% confidence interval for θ based on the observed data is [0.7703, 0.8029]:
1920/2441 ± 1.96 √[(1920/2441)(1 − 1920/2441)/2441] = 0.7866 ± 0.0163 = [0.7703, 0.8029]
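A short Python sketch (not part of the original solution) that reproduces this interval; the variable names are illustrative only.

    import math
    from scipy.stats import norm

    y, n = 1920, 2441
    theta_hat = y / n                              # 0.7866
    z = norm.ppf(0.975)                            # approximately 1.96
    half_width = z * math.sqrt(theta_hat * (1 - theta_hat) / n)
    print(theta_hat - half_width, theta_hat + half_width)   # approximately [0.7703, 0.8029]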
(v) [2] By reference to the confidence interval, indicate what you know about the p value for a test of the hypothesis H0: θ = 0.8?
Since θ = 0.8 is a value contained in the interval [0.7703, 0.8029], the p value for testing H0: θ = 0.8 is greater than or equal to 0.05.
(Note that since θ = 0.8 is very close to the upper endpoint of the interval, the p value would be very close to 0.05.)
(c) [2] Suppose a Binomial experiment is conducted and the observed 95% confidence interval for θ is [0.1, 0.2]. This means (circle the letter for the correct answer):
A : The probability that θ is contained in the interval [0.1, 0.2] equals 0.95.
B : If the Binomial experiment was repeated 100 times independently and a 95% confidence interval was constructed each time then approximately 95 of these intervals would contain the true value of θ.
The correct answer is B.
2: [20] At the R.A.T. laboratory a large number of genetically engineered rats are raised
for conducting research. Twenty rats are selected at random and fed a special diet. The
weight gains (in grams) from birth to age 3 months of the rats fed this diet are:
63.4 68.3 52.0 64.5 62.3 55.8 59.3 62.4 75.8 72.1
55.6 73.2 63.9 60.7 63.9 60.2 60.5 67.1 66.6 66.7
Let yi = weight gain of the ith rat, i = 1, ..., 20. For these data
∑_{i=1}^{20} y_i = 1273.8 and ∑_{i=1}^{20} (y_i − ȳ)² = 665.718.
Y_i ∼ N(μ, σ²) = G(μ, σ), i = 1, ..., 20
Since the points in the qqplot lie reasonably along a straight line the Gaussian model
seems reasonable for these data.
[Normal QQ-plot of the weight gains: quantiles of input sample versus standard Normal quantiles]
(b) [4] Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
A suitable study population consists of the genetically engineered rats which are raised for conducting research at the R.A.T. laboratory.
The parameter μ corresponds to the mean weight gain of the rats fed the special diet from birth to age 3 months in the study population.
The parameter σ corresponds to the standard deviation of the weight gains of the rats fed the special diet from birth to age 3 months in the study population.
The maximum likelihood estimate of μ is ȳ = 1273.8/20 = 63.69 and the maximum likelihood estimate of σ is (665.718/20)^{1/2} = (33.2859)^{1/2} = 5.7694.
(You do not need to derive these estimates.)
T = (Ȳ − μ) / (S/√20) where S² = (1/19) ∑_{i=1}^{20} (Y_i − Ȳ)².
(e) [6] The company, R.A.T. Chow, that produces the special diet claims that the mean weight gain for rats that are fed this diet is 67 grams.
s = (665.718/19)^{1/2} = 5.9193 and the observed value of the test statistic is
d = |ȳ − 67| / (s/√20) = |63.69 − 67| / (5.9193/√20) = 2.5008.
From t tables the p value for testing H0: μ = 67 is between 0.02 and 0.05.
Since the p value is less than 0.05, there is evidence against R.A.T. Chow's claim, H0: μ = 67, based on the observed data.
(f) [3] Let W = (1/σ²) ∑_{i=1}^{20} (Y_i − Ȳ)².
(g) [2] A 90% confidence interval for σ for the given data is [4.6994, 8.1118].
" #
1=2 1=2
665:718 665:718
; = [22:0846; 65:8019] = [4:6994; 8:1118]
30:144 10:117
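For readers who want to verify the Chi-squared quantiles and the resulting interval for σ numerically, here is a minimal Python sketch (not part of the original solution).

    import math
    from scipy.stats import chi2

    ss = 665.718                    # sum of (y_i - ybar)^2 for the 20 rats
    df = 19                         # n - 1 degrees of freedom

    # 90% interval: 5% in each tail of the chi-squared(19) distribution
    a = chi2.ppf(0.05, df)          # approximately 10.117
    b = chi2.ppf(0.95, df)          # approximately 30.144

    ci_sigma = (math.sqrt(ss / b), math.sqrt(ss / a))
    print(ci_sigma)                 # approximately (4.6994, 8.1118)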
f(y; θ) = (1/θ) e^{−y/θ} for y > 0 and θ > 0.
g(w) = (1/2) e^{−w/2}, for w > 0
which is the probability density function of a random variable with a χ²(2) distribution.
For w ≥ 0,
G(w) = P(W ≤ w) = P(2Y/θ ≤ w) = P(Y ≤ θw/2) = F(θw/2)
where
F(y) = P(Y ≤ y).
Therefore
g(w) = G′(w) = f(θw/2) · (θ/2) = (1/θ) exp(−(θw/2)/θ) · (θ/2) = (1/2) e^{−w/2}, for w ≥ 0
as required.
(b) [3] Suppose Y1, ..., Yn is a random sample from the Exponential(θ) distribution. Use your result from (a) and theorems that you have learned in class to prove that
U = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(2n).
Since the sum of independent Chi-squared random variables has a Chi-squared distribution with degrees of freedom equal to the sum of the degrees of freedom of the Chi-squared random variables in the sum, therefore
U = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(∑_{i=1}^{n} 2) or χ²(2n)
as required.
(c) [4] Explain clearly how the pivotal quantity U can be used to obtain a two-sided 100p% confidence interval for θ.
Using Chi-squared tables find a and b such that P(U ≤ a) = (1 − p)/2 = P(U ≥ b) where U ∼ χ²(2n).
Since
p = P(a ≤ U ≤ b)
  = P(1/b ≤ θ / (2 ∑_{i=1}^{n} Y_i) ≤ 1/a)
  = P(2 ∑_{i=1}^{n} Y_i / b ≤ θ ≤ 2 ∑_{i=1}^{n} Y_i / a)
then
[2 ∑_{i=1}^{n} y_i / b, 2 ∑_{i=1}^{n} y_i / a]
is a two-sided 100p% confidence interval for θ.
U = (2/θ) ∑_{i=1}^{25} Y_i ∼ χ²(50).
(e) [3] Suppose y1, ..., y25 is an observed random sample from the Exponential(θ) distribution with ∑_{i=1}^{25} y_i = 560.
The maximum likelihood estimate for θ is 560/25 = 22.4. (You do not need to derive this estimate.)
[2(560)/67.505, 2(560)/34.764] = [16.5914, 32.2172]
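The quantiles 34.764 and 67.505 used above appear to be the 5% and 95% points of the χ²(50) distribution, so this interval puts 5% in each tail, i.e., a 90% confidence interval for θ. A minimal Python sketch (not part of the original solution) reproducing it:

    from scipy.stats import chi2

    n, total = 25, 560.0            # number of observations and sum of y_i
    df = 2 * n                      # U = (2/theta) * sum(Y_i) ~ chi-squared(2n)

    a = chi2.ppf(0.05, df)          # approximately 34.764
    b = chi2.ppf(0.95, df)          # approximately 67.505

    ci_theta = (2 * total / b, 2 * total / a)
    print(ci_theta)                 # approximately (16.59, 32.22)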
Cost of Advertising (x)   Number of Communities   Total Sales (y)
1.2                       5                       4.9   3.0   3.6   4.4   8.8
2.4                       5                       8.6   6.8   8.4   8.7   7.8
3.6                       5                       8.3   8.3   8.0   8.8   7.7
4.8                       5                       11.0  10.8  11.6  12.0  10.1
x̄ = 3, ȳ = 7.93, Sxx = ∑_{i=1}^{20} (x_i − x̄)² = 36,
Syy = ∑_{i=1}^{20} (y_i − ȳ)² = 125.282, Sxy = ∑_{i=1}^{20} (x_i − x̄)(y_i − ȳ) = 61.32
The model
Y_i = α + β x_i + R_i, where R_i ∼ N(0, σ²) = G(0, σ), i = 1, ..., 20 independently
is assumed where α, β and σ are unknown parameters and the x_i's are assumed to be known constants.
(a) [1] Is this an experimental or observational study? Explain.
(b) [4] Calculate the maximum likelihood estimates of α and β for these data and draw the fitted line on the scatterplot below. How well does this line fit the data? Do you notice anything unusual?
(c) [3] A Normal QQ-plot of the estimated residuals is given below.
Explain clearly how this plot is obtained. What conclusions can be drawn from this
plot about the validity of the assumed model for these data?
(d) [5] Test the hypothesis that there is no relationship between the amount of money
spent in advertising a product on local television in one week and the sales of the product
in the following week. Show all your work.
[Scatterplot of Sales (thousands of dollars) versus Advertising spending (thousands of dollars)]
[Normal probability plot of the estimated residuals]
(e) [2] Would you conclude that an increase in the amount of money spent in advertising
causes an increase in the sales of the product in the following week? Explain your answer.
(f) [3] If the amount of dollars spent on advertising a product on local television in one week is 5 thousand dollars, find a 90% prediction interval for the sales of the product (in thousands of dollars) in the following week.
2: [16] A wind farm is a group of wind turbines in the same location used for production of electric power. The number of wind farms is increasing as we try to move to more renewable forms of energy. Wind turbines are most efficient if the mean windspeed is 16 km/h or greater.
The windspeed Y at a specific location is modeled using the Rayleigh distribution which has probability density function
f(y; θ) = (2y/θ) e^{−y²/θ}, y ≥ 0, θ > 0
(b) [2] To determine whether a location called Windy Hill is a good place for a wind farm, the windspeed was measured in km/h on 14 different days as given below:
14.7  30.0  13.3  41.9  25.6  39.6  34.5  9.9  13.6  24.2  5.1  41.4  20.5  22.2
∑_{i=1}^{14} y_i = 336.5 and ∑_{i=1}^{14} y_i² = 9984.03
For these data calculate the Maximum Likelihood estimate of θ and give the Relative Likelihood function for θ.
(c) [5] If the random variable Y has a Rayleigh distribution then E(Y) = √(πθ)/2. Thus a mean of 20 km/h corresponds to θ = (40)²/π ≈ 509.3. The owner of Windy Hill claims that the average windspeed at Windy Hill is 20 km/h. Test the hypothesis H0: θ = 509.3 using the given data and the likelihood ratio test statistic. Show all your work.
If n = 14, find a and b such that P(W ≤ a) = 0.025 = P(W ≥ b). Use the pivotal quantity W and the data from Windy Hill to construct an exact 95% confidence interval for θ. Show all your work.
(e) [2] Would you recommend that a wind farm be situated at Windy Hill? Justify your answer.
3: [14] Two drugs, both in identical tablet form, were each given to 10 volunteer subjects in a pilot drug trial. The order in which each volunteer received the drugs was randomized and the drugs were administered one day apart. For each drug the antibiotic blood serum level was measured one hour after medication. The data are given below:
Subject: i              1      2      3      4      5      6      7      8      9      10
Drug A: ai              1.08   1.19   1.22   0.60   0.55   0.53   0.56   0.93   1.43   0.67
Drug B: bi              1.48   0.62   0.65   0.32   1.48   0.79   0.43   1.69   0.73   0.71
Difference: yi = ai − bi  −0.40  0.57   0.57   0.28   −0.93  −0.26  0.13   −0.76  0.70   −0.04
∑_{i=1}^{10} y_i = −0.14 and ∑_{i=1}^{10} (y_i − ȳ)² = 2.90484.
Y_i = μ + R_i, where R_i ∼ N(0, σ²) = G(0, σ), i = 1, ..., 10 independently
(b) [5] Test the hypothesis of no difference in the mean response for the two drugs, that is, test H0: μ = 0. Show all your work.
(d) [2] This experiment is a matched pairs experiment. Explain why this type of design is better than a design in which 20 volunteers are randomly divided into two groups of 10 with one group receiving drug A and the other group receiving drug B.
(e) [2] Explain the importance of randomizing the order of the drugs, the fact that the drugs were given in identical tablet form and the fact that the drugs were administered one day apart.
4: [13] Exhaust emissions produced by motor vehicles are a major source of air pollution. One of the major pollutants in vehicle exhaust is carbon monoxide (CO). An environmental group interested in studying CO emissions for light-duty engines purchased 11 light-duty engines from Manufacturer A and 12 light-duty engines from Manufacturer B. The amount of CO emitted in grams per mile for each engine was measured. The data are given below:
Manufacturer A:
5.01  8.60  4.95  7.51  14.59  11.53  5.21  9.62  15.13  3.95  4.12
∑_{j=1}^{11} y_{1j} = 90.22 and ∑_{j=1}^{11} (y_{1j} − ȳ_1)² = 166.9860
Manufacturer B:
16.67  6.42  9.24  14.30  9.98  6.10  14.10  16.97  7.04  5.38  25.53  24.92
∑_{j=1}^{12} y_{2j} = 136.65 and ∑_{j=1}^{12} (y_{2j} − ȳ_2)² = 218.7656
(b) [4] Calculate a 99% confidence interval for the difference in the means: μ1 − μ2.
(d) [2] What conclusions can the environmental group draw from this study? Justify your answer.
5: [9] In a court case challenging an Oklahoma law that differentiated the ages at which young men and women could buy 3.2% beer, the Supreme Court examined evidence from a random roadside survey that measured information on age, gender, and drinking behaviour. The table below gives the results for the drivers under 20 years of age.
(b) [5] Test the hypothesis of no relationship (independence) between the two variates: gender and whether or not the driver drank alcohol in the last 2 hours. Show all your work.
(c) [2] The Supreme Court decided to strike down the law that differentiated the ages at which young men and women could buy 3.2% beer based on the evidence presented. Do you agree with the Supreme Court's decision? Justify your answer.
6: The Survey of Study Habits and Attitudes (SSHA) is a psychological test that evaluates university students' motivation, study habits, and attitudes toward university. At a small university college 19 students are selected at random and given the SSHA test. Their scores are:
10 10 11 12 13 13 13 14 14 14
14 15 15 15 16 16 17 18 20
∑_{i=1}^{19} y_i = 270 and ∑_{i=1}^{19} y_i² = 3956.
For these data calculate the mean, median, mode, sample variance, range, and interquartile range.
7: A dataset consisting of six columns of data was collected by interviewing 100 students on the University of Waterloo campus. The columns are:
Column 1: Sex of respondent
Column 2: Age of respondent
Column 3: Weight of respondent
Column 4: Faculty of respondent
Column 5: Number of courses respondent has failed.
Column 6: Whether the respondent (i) strongly disagreed, (ii) disagreed, (iii) agreed or (iv) strongly agreed with the statement “The University of Waterloo is the best university in Ontario.”
(a) For this dataset give an example of each of the following types of data;
discrete__________
continuous___________
categorical____________
binary_______________
ordinal______________
(b) Two ways to graphically represent categorical data are ____________ and
________________.
(c) A graphical way to examine the relationship between heights and weights is a
______________.
(d) If the sample correlation between heights and weights was 0.4 you would con-
clude_____________.
(b) [4]
[Scatterplot of Sales (thousands of dollars) versus Advertising spending (thousands of dollars) with the fitted line ŷ = 2.82 + 1.703x]
Looking at the scatterplot and the fitted line we notice that for x = 2.4, 4 of the 5 data points lie above the fitted line while for x = 3.6, all 5 of the data points lie below the fitted line. This suggests that the linear model might not be the best model for these data.
(c) [3]
Calculate the estimated residuals r_i = y_i − ŷ_i = y_i − (α̂ + β̂ x_i), i = 1, ..., 20 and order the residuals from smallest to largest: r_(1), ..., r_(n).
Calculate q_i, i = 1, ..., 20 where q_i satisfies F(q_i) = (i − 0.5)/20 and F is the N(0, 1) cumulative distribution function. Plot the points (r_(i), q_i), i = 1, ..., 20.
OR:
Calculate the estimated residuals r_i = y_i − ŷ_i = y_i − (α̂ + β̂ x_i), i = 1, ..., 20 and order the residuals from smallest to largest: r_(1), ..., r_(n).
Plot the ordered residuals against the theoretical quantiles of the Normal distribution.
Since there is no obvious pattern of departure from a straight line we would conclude that there is no evidence against the Normality assumption R_i ∼ N(0, σ²), i = 1, ..., 20.
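A minimal Python sketch of the QQ-plot construction described above (not part of the original solution); the residual vector here is made up purely for illustration, since the estimated residuals would be computed from the data.

    import numpy as np
    from scipy.stats import norm

    # Illustrative residuals only; in practice r_i = y_i - (alpha_hat + beta_hat*x_i)
    r = np.array([ 0.3, -0.8, 1.1, -0.2, 0.5, -1.4, 0.9, 0.1, -0.6, 0.4,
                  -0.3, 0.7, -1.0, 0.2, 0.6, -0.5, 1.2, -0.9, 0.0, -0.4])

    r_ordered = np.sort(r)                    # r_(1) <= ... <= r_(20)
    p = (np.arange(1, 21) - 0.5) / 20         # (i - 0.5)/20
    q = norm.ppf(p)                           # q_i with F(q_i) = (i - 0.5)/20

    # Plotting the pairs (r_(i), q_i) gives the Normal QQ-plot; a roughly
    # straight-line pattern supports the Normality assumption.
    for ri, qi in zip(r_ordered, q):
        print(round(ri, 2), round(qi, 3))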
p value = P(D ≥ 9.50; H0) = P(|T| ≥ 9.50) where T ∼ t(18), which is approximately 0.
Therefore there is very strong evidence based on the data against the hypothesis of no relationship between the amount of money spent in advertising a product on local television in one week and the sales of the product in the following week.
(e) [2] Since this study was an experimental study, since there was strong evidence against H0: β = 0, and since the slope of the fitted line was β̂ = 1.703 > 0, the data suggest that an increase in the amount of money spent advertising causes an increase in the sales of the product in the following week. However we don't know if the 4 levels of spending on advertising were assigned to the communities using randomization. If the levels of advertising were not randomly applied then the differences in the sales of the product could be due to differences between the communities. For example, if the highest (lowest) level was applied to the richest (poorest) communities you might expect to see the same pattern of response as was observed.
(f) [3] From t tables P(T ≤ 1.73) = 0.95 where T ∼ t(18). A 90% prediction interval for the sales of the product (in thousands of dollars) in the following week if x = 5 is
2.82 + 1.703(5) ± (1.73)(1.0758)[1 + 1/20 + (5 − 3)²/36]^{1/2}
= 11.3367 ± 2.0055
= [9.33, 13.34]
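The following Python sketch (not part of the original solution) reproduces the fitted line, the test of H0: β = 0 and the 90% prediction interval from the summary statistics; variable names are illustrative only.

    import math
    from scipy.stats import t

    n = 20
    xbar, ybar = 3.0, 7.93
    Sxx, Syy, Sxy = 36.0, 125.282, 61.32

    beta_hat = Sxy / Sxx                               # approximately 1.703
    alpha_hat = ybar - beta_hat * xbar                 # approximately 2.82
    se = math.sqrt((Syy - beta_hat * Sxy) / (n - 2))   # approximately 1.0758

    # Test of no relationship, H0: beta = 0
    d = abs(beta_hat) / (se / math.sqrt(Sxx))          # approximately 9.50
    p_value = 2 * t.sf(d, df=n - 2)                    # essentially 0

    # 90% prediction interval for sales in a week with x = 5
    x0 = 5.0
    tq = t.ppf(0.95, df=n - 2)                         # approximately 1.73
    yhat = alpha_hat + beta_hat * x0                   # approximately 11.34
    hw = tq * se * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
    print(yhat - hw, yhat + hw)                        # approximately [9.33, 13.34]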
2: [16]
(a) [4] The likelihood function is
L(θ) = ∏_{i=1}^{n} (2y_i/θ) e^{−y_i²/θ} = (∏_{i=1}^{n} 2y_i) θ^{−n} exp(−(1/θ) ∑_{i=1}^{n} y_i²), θ > 0
The log likelihood (ignoring the additive constant) is l(θ) = −n log θ − (1/θ) ∑_{i=1}^{n} y_i², and l′(θ) = 0 if θ = (1/n) ∑_{i=1}^{n} y_i². Therefore the Maximum Likelihood estimate of θ is
θ̂ = (1/n) ∑_{i=1}^{n} y_i².
θ̂ = 9984.03/14 = 713.145
and the Relative Likelihood function is
R(θ) = L(θ)/L(θ̂) = [θ^{−14} exp(−9984.03/θ)] / [θ̂^{−14} exp(−9984.03/θ̂)] = (713.145/θ)^{14} exp(14 − 9984.03/θ), θ > 0.
For these data the observed value of the likelihood ratio test statistic for H0: θ = 509.3 is
d = −2 r(509.3) = −2 [l(509.3) − l(713.145)]
  = −2 [14 log(713.145/509.3) + 14 − 9984.03/509.3]
  = −2 (−0.8904)
  = 1.7807
and
p value = P(D ≥ 1.7807; H0)
  ≈ P(W ≥ 1.7807) where W ∼ χ²(1)
  = P(|Z| ≥ √1.7807) where Z ∼ N(0, 1)
  = 2[1 − P(Z ≤ 1.33)]
  = 2(1 − 0.9082) = 2(0.0918)
  = 0.1836
Since the p value ≈ 0.18 > 0.1, there is no evidence based on the data against H0: θ = 509.3.
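A minimal Python sketch (not part of the original solution) reproducing the maximum likelihood estimate and the likelihood ratio test above; variable names are illustrative only.

    import math
    from scipy.stats import chi2

    n = 14
    sum_y2 = 9984.03                   # sum of y_i^2 for the 14 windspeeds
    theta_hat = sum_y2 / n             # MLE, approximately 713.1

    def log_lik(theta):
        # log L(theta), ignoring the additive constant sum(log(2*y_i))
        return -n * math.log(theta) - sum_y2 / theta

    theta0 = 509.3                     # value of theta under H0 (mean 20 km/h)
    d = -2 * (log_lik(theta0) - log_lik(theta_hat))   # approximately 1.78
    p_value = chi2.sf(d, df=1)                        # approximately 0.18
    print(theta_hat, d, p_value)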
Since
0.95 = P(15.31 ≤ (2/θ) ∑_{i=1}^{n} Y_i² ≤ 44.46)
     = P(2 ∑_{i=1}^{n} Y_i² / 44.46 ≤ θ ≤ 2 ∑_{i=1}^{n} Y_i² / 15.31)
an exact 95% confidence interval for θ is
[2(9984.03)/44.46, 2(9984.03)/15.31] = [449.1, 1304.3]
which corresponds to mean windspeeds between √(π(449.1))/2 ≈ 18.8 km/h and √(π(1304.3))/2 ≈ 32.0 km/h.
Since the values of this interval are all above 16, the data seem to suggest a mean windspeed greater than 16 km/h. However we don't know how the data were collected. It would be
wise to determine how the data were collected before reaching a conclusion. Suppose that
Windy Hill is only windy at one particular time of the year and that the data were collected
only during the windy period. We would not want to make a decision only based on these
data.
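A short Python sketch (not part of the original solution) that reproduces the exact 95% confidence interval for θ and converts it to a range of mean windspeeds using E(Y) = √(πθ)/2.

    import math
    from scipy.stats import chi2

    n, sum_y2 = 14, 9984.03
    df = 2 * n                                     # (2/theta)*sum(Y_i^2) ~ chi-squared(28)

    a = chi2.ppf(0.025, df)                        # approximately 15.31
    b = chi2.ppf(0.975, df)                        # approximately 44.46

    ci_theta = (2 * sum_y2 / b, 2 * sum_y2 / a)    # approximately (449, 1304)
    ci_mean = tuple(math.sqrt(math.pi * th) / 2 for th in ci_theta)
    print(ci_theta, ci_mean)                       # mean roughly 18.8 to 32.0 km/h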
3: [14]
(a) [2] A suitable study population would consist of individuals who have volunteered
to partake in clinical trials.
The parameter μ corresponds to the mean difference in antibiotic blood serum level between drugs A and B in the study population.
The parameter σ corresponds to the standard deviation of the differences in antibiotic blood serum level between drugs A and B in the study population.
(b) [5] To test the hypothesis of no difference in the mean response for the two drugs, that is, H0: μ = 0 we use the discrepancy measure
D = |Ȳ − 0| / (S/√10)
where
T = (Ȳ − 0) / (S/√10) ∼ t(9) assuming H0: μ = 0 is true
and
S² = (1/9) ∑_{i=1}^{10} (Y_i − Ȳ)².
Since ȳ = −0.14/10 = −0.014 and
s = (2.90484/9)^{1/2} = (0.32276)^{1/2} = 0.5681
the observed value of D is
d = |−0.014 − 0| / (0.5681/√10) = 0.078
and
p value = P(D ≥ 0.078; H0) = P(|T| ≥ 0.078) where T ∼ t(9) = 2[1 − P(T ≤ 0.078)], which is close to 1.
Therefore there is no evidence based on the data against the hypothesis of no difference in the mean response for the two drugs, that is, H0: μ = 0.
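A minimal Python sketch (not part of the original solution) reproducing this paired-comparison test from the summary statistics.

    import math
    from scipy.stats import t

    n = 10
    sum_d = -0.14                    # sum of the differences y_i = a_i - b_i
    ss_d = 2.90484                   # sum of (y_i - ybar)^2

    ybar = sum_d / n                 # -0.014
    s = math.sqrt(ss_d / (n - 1))    # approximately 0.5681

    d_obs = abs(ybar - 0) / (s / math.sqrt(n))     # approximately 0.078
    p_value = 2 * t.sf(d_obs, df=n - 1)            # close to 1
    print(d_obs, p_value)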
(c) [3] From t tables P(T ≤ 2.26) = 0.975 where T ∼ t(9). A 95% confidence interval for μ based on these data is
ȳ ± 2.26(s)/√10 = −0.014 ± 2.26(0.5681)/√10 = [−0.4200, 0.3920].
(d) [2] Since this experimental study was conducted as a matched pairs study, an analysis of the differences, yi = ai − bi, allows for a more precise comparison since differences between the 10 pairs have been eliminated. That is, by analysing the differences we do not need to worry that there may have been large differences in the responses between subjects due to other variates such as age, general health, etc.
(e) [2] It is important to randomize the order of the drugs in case the order in which the drugs are taken affects the outcome.
It is important to give the drugs in identical tablet form so the subject does not know which drug he or she is taking since knowing which drug is being taken could affect the outcome.
It is important that the drugs be administered one day apart to ensure that the effects of one drug are gone before the second drug is given.
4: [13]
(a) [2] The study population would consist of light-duty engines produced by Manufacturer A and Manufacturer B. The parameter μ1 corresponds to the mean amount of CO emitted by light-duty engines produced by Manufacturer A.
The parameter μ2 corresponds to the mean amount of CO emitted by light-duty engines produced by Manufacturer B.
The parameter σ corresponds to the standard deviation of the CO emissions from light-duty engines produced by Manufacturers A and B. (Note that it has been assumed that this standard deviation is the same for both manufacturers.)
(b) [4] From t tables P(T ≤ 2.83) = 0.995 where T ∼ t(21). For these data
s = [(166.9860 + 218.7656)/21]^{1/2} = (18.3691)^{1/2} = 4.2860
and a 99% confidence interval for μ1 − μ2 is
ȳ1 − ȳ2 ± 2.83 s √(1/11 + 1/12) = −3.1857 ± 5.063 = [−8.25, 1.88].
(c) [5] To test the hypothesis of no difference in the mean CO emissions for the two manufacturers, that is, H0: μ1 = μ2 we use the discrepancy measure
D = |Ȳ1 − Ȳ2 − 0| / (S √(1/11 + 1/12))
where
T = (Ȳ1 − Ȳ2 − 0) / (S √(1/11 + 1/12)) ∼ t(21) assuming H0: μ1 = μ2 is true
and
S² = (1/21) [∑_{j=1}^{11} (Y_{1j} − Ȳ1)² + ∑_{j=1}^{12} (Y_{2j} − Ȳ2)²].
Since
ȳ1 − ȳ2 = 90.22/11 − 136.65/12 = −3.1857
and s = 4.2860 the observed value of D is
d = |−3.1857 − 0| / (4.2860 √(1/11 + 1/12)) = 1.7806
and
p value = P(D ≥ 1.7806; H0) = P(|T| ≥ 1.7806) where T ∼ t(21) = 2[1 − P(T ≤ 1.7806)]
which lies between 0.05 and 0.10, and therefore there is weak evidence based on the data against the hypothesis of no difference in the mean CO emissions for the two manufacturers, that is, H0: μ1 = μ2.
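For comparison, here is a short Python sketch (not part of the original solution) of the pooled two-sample test using the summary statistics.

    import math
    from scipy.stats import t

    n1, n2 = 11, 12
    ybar1 = 90.22 / n1                    # Manufacturer A sample mean
    ybar2 = 136.65 / n2                   # Manufacturer B sample mean
    ss1, ss2 = 166.9860, 218.7656         # within-sample sums of squares

    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))                  # approximately 4.286
    d_obs = abs(ybar1 - ybar2) / (sp * math.sqrt(1/n1 + 1/n2))   # approximately 1.78
    p_value = 2 * t.sf(d_obs, df=n1 + n2 - 2)                    # approximately 0.089
    print(d_obs, p_value)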
(d) [2] Although there is weak evidence of a difference between the mean CO emissions for the two manufacturers it is difficult to draw much of a conclusion. The sample sizes n1 = 11 and n2 = 12 are small. We also don't know whether the engines were chosen at random from the two manufacturers' production over a day, week, or month. In other words we don't know if the samples are representative of all light-duty engines produced by these manufacturers.
5: [9]
(a) [2] This is an observational study because no explanatory variates were manipulated
by the researcher.
(b) [5] Denote the frequencies as F1, F2, F3, F4 with observed values f1 = 77, f2 = 404, f3 = 16 and f4 = 122. Denote the expected frequencies as E1, E2, E3, E4. If the hypothesis of no relationship (independence) between the two variates, gender and whether or not the driver drank alcohol in the last 2 hours, is true then the expected frequency for the outcome male and drank alcohol in the last 2 hours for the given data is
e1 = (93)(481)/619 = 72.27.
The other expected frequencies e2, e3, e4 can be obtained by subtraction from the appropriate row or column total. The expected frequencies are given in brackets in the table below.
To test the hypothesis of no relationship we use the discrepancy measure (a random variable)
D = ∑_{i=1}^{4} (F_i − E_i)² / E_i
or
D = 619 [(77)(122) − (16)(404)]² / [(481)(138)(93)(526)] = 1.6366.
Since the expected frequencies are all greater than 5, D has approximately a χ²(1) distribution.
Thus
p value = P(D ≥ d; H0) = P(D ≥ 1.6366; H0)
  ≈ P(W ≥ 1.6366) where W ∼ χ²(1)
  = P(|Z| ≥ √1.6366) where Z ∼ N(0, 1)
  = 2[1 − P(Z ≤ 1.28)]
  = 2(1 − 0.8997)
  = 0.2006
Since the p value = 0.2006 > 0.1 we would conclude that there is no evidence against the hypothesis of no relationship between the two variates, gender and whether or not the driver drank alcohol in the last 2 hours.
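A minimal Python sketch (not part of the original solution) that recomputes the expected frequencies and the test statistic for this 2x2 table.

    from scipy.stats import chi2

    # Observed counts: rows = drank alcohol in last 2 hours (yes/no),
    # columns = gender (male/female), taken from the solution above.
    f = [[77, 404],
         [16, 122]]

    row = [sum(r) for r in f]                        # [481, 138]
    col = [f[0][j] + f[1][j] for j in range(2)]      # [93, 526]
    total = sum(row)                                 # 619

    d = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / total              # expected frequency under independence
            d += (f[i][j] - e) ** 2 / e

    p_value = chi2.sf(d, df=1)                       # (2-1)*(2-1) = 1 degree of freedom
    print(d, p_value)                                # approximately 1.64 and 0.20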
(c) [2] Although there is no evidence against the hypothesis of no relationship between the two variates (gender and whether or not the driver drank alcohol in the last 2 hours) based on the data, we cannot conclude that there is no relationship. Moreover, since this is an observational study, whether a causal relationship exists or not cannot be determined from the data alone. A decision to strike down the law based on these data alone is unwise.
6: The Survey of Study Habits and Attitudes (SSHA) is a psychological test that evaluates university students' motivation, study habits, and attitudes toward university. At a small university college 19 students are selected at random and given the SSHA test. Their scores are:
10 10 11 12 13 13 13 14 14 14
14 15 15 15 16 16 17 18 20
∑_{i=1}^{19} y_i = 270 and ∑_{i=1}^{19} y_i² = 3956.
For these data calculate the mean, median, mode, sample variance, range, and interquartile range.
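Since no worked answer appears here, the following Python sketch (not part of the original notes) computes the requested summary statistics; note that different quartile conventions give slightly different interquartile ranges.

    import statistics as st

    scores = [10, 10, 11, 12, 13, 13, 13, 14, 14, 14,
              14, 15, 15, 15, 16, 16, 17, 18, 20]

    mean = st.mean(scores)             # 270/19, approximately 14.2
    median = st.median(scores)         # 14
    mode = st.mode(scores)             # 14
    var = st.variance(scores)          # sample variance (divisor n - 1)
    rng = max(scores) - min(scores)    # 20 - 10 = 10

    q = st.quantiles(scores, n=4)      # one common quartile convention
    iqr = q[2] - q[0]
    print(mean, median, mode, var, rng, iqr)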
7: A data set consisting of six columns of data was collected by interviewing 100 students on the University of Waterloo campus. The columns are:
Column 1: Sex of respondent
Column 2: Age of respondent
Column 3: Weight of respondent
Column 4: Faculty of respondent
Column 5: Number of courses respondent has failed.
Column 6: Whether the respondent (i) strongly disagreed, (ii) disagreed, (iii) agreed or (iv) strongly agreed with the statement “The University of Waterloo is the best university in Ontario.”
(a) For this data set give an example of each of the following types of data;
discrete number of courses failed
continuous weight or age
categorical faculty or sex
binary sex
ordinal degree of agreement with statement
(b) Two ways to graphically represent categorical data are pie charts and bar charts.
(c) A graphical way to examine the relationship between heights and weights is a
scatterplot .
(d) If the sample correlation between heights and weights was 0.4 you would conclude
that there is a positive linear relationship between heights and weights.
APPENDIX C: DATA
Here we list the data for Example 1.5.2. In the file ch1example152.txt, there are three columns labelled hour, machine and volume. The data are (H=hour, M=Machine, V=Volume):
H M V H M V H M V H M V
1 1 357.8 11 1 357 21 1 356.5 31 1 357.7
1 2 358.7 11 2 359.6 21 2 357.3 31 2 357
2 1 356.6 12 1 357.1 22 1 356.9 32 1 356.3
2 2 358.5 12 2 357.6 22 2 356.7 32 2 357.8
3 1 357.1 13 1 356.3 23 1 357.5 33 1 356.6
3 2 357.9 13 2 358.1 23 2 356.9 33 2 357.5
4 1 357.3 14 1 356.3 24 1 356.9 34 1 356.7
4 2 358.2 14 2 356.9 24 2 357.1 34 2 356.5
5 1 356.7 15 1 356 25 1 356.9 35 1 356.8
5 2 358 15 2 356.4 25 2 356.4 35 2 357.6
6 1 356.8 16 1 357 26 1 356.4 36 1 356.6
6 2 359.1 16 2 357.5 26 2 357.5 36 2 357.2
7 1 357 17 1 357.5 27 1 356.5 37 1 356.6
7 2 357.5 17 2 357.2 27 2 357 37 2 357.6
8 1 356 18 1 355.9 28 1 356.5 38 1 356.7
8 2 356.4 18 2 357.1 28 2 358.1 38 2 356.9
9 1 355.9 19 1 356.5 29 1 357.6 39 1 356.8
9 2 357.9 19 2 358.2 29 2 357.6 39 2 357.2
10 1 357.8 20 1 355.8 30 1 357.5 40 1 356.1
10 2 358.5 20 2 359 30 2 356.4 40 2 356.4
Times (in minutes) between 300 eruptions of the Old Faithful geyser
between 1/08/85 and 15/08/85
80 71 57 80 75 77 60 86 77 56 81 50 89 54 90 73 60 83 65 82 84 54 85 58 79 57 88 68 76
78 74 85 75 65 76 58 91 50 87 48 93 54 86 53 78 52 83 60 87 49 80 60 92 43 89 60 84 69 74
71 108 50 77 57 80 61 82 48 81 73 62 79 54 80 73 81 62 81 71 79 81 74 59 81 66 87 53 80 50
87 51 82 58 81 49 92 50 88 62 93 56 89 51 79 58 82 52 88 52 78 69 75 77 53 80 55 87 53 85
61 93 54 76 80 81 59 86 78 71 77 76 94 75 50 83 82 72 77 75 65 79 72 78 77 79 75 78 64 80
49 88 54 85 51 96 50 80 78 81 72 75 78 87 69 55 83 49 82 57 84 57 84 73 78 57 79 57 90 62
87 78 52 98 48 78 79 65 84 50 83 60 80 50 88 50 84 74 76 65 89 49 88 51 78 85 65 75 77 69
92 68 87 61 81 55 93 53 84 70 73 93 50 87 77 74 72 82 74 80 49 91 53 86 49 79 89 87 76 59
80 89 45 93 72 71 54 79 74 65 78 57 87 72 84 47 84 57 87 68 86 75 73 53 82 93 77 54 96 48
89 63 84 76 62 83 50 85 78 78 81 78 76 74 81 66 84 48 93 47 87 51 78 54 87 52 85 58 88 79
Skinfold BodyDensity
1.6841 1.0613 1.9200 1.0338 1.5324 1.0696 2.0755 1.0355
1.9639 1.0478 1.6736 1.0560 1.7035 1.0449 1.4351 1.0693
1.0803 1.0854 1.7914 1.0487 1.8040 1.0411 1.7295 1.0518
1.7541 1.0629 1.7249 1.0496 1.8075 1.0426 1.5265 1.0837
1.6368 1.0652 1.5025 1.0824 1.3815 1.0715 1.7599 1.0328
1.2857 1.0813 1.6314 1.0526 1.5847 1.0602 1.4029 1.0933
1.4744 1.0683 1.3980 1.0707 1.3059 1.0807 1.2653 1.0860
1.6420 1.0575 1.7598 1.0459 1.3276 1.0536 1.2609 1.0919
2.3406 1.0126 1.3203 1.0697 1.5665 1.0602 1.6734 1.0433
2.1659 1.0264 1.3372 1.0770 1.8989 1.0536 1.5297 1.0614
1.2766 1.0829 1.3932 1.0727 1.4018 1.0655 1.5257 1.0643
2.2232 1.0296 0.9323 1.1171 1.6482 1.0668 1.8744 1.0482
1.7246 1.0670 1.8785 1.0423 1.5193 1.0700 1.6310 1.0459
1.5544 1.0688 1.6382 1.0506 1.8092 1.0485 1.6107 1.0653
1.7223 1.0525 1.4050 1.0878 1.3329 1.0804 1.9108 1.0321
1.5237 1.0721 1.8638 1.0557 1.5750 1.0503 1.3943 1.0755
1.5412 1.0672 1.1985 1.0854 1.6873 1.0557 1.7184 1.0600
1.8896 1.0350 1.5459 1.0527 1.8056 1.0625 1.7483 1.0554
1.8722 1.0528 1.5159 1.0635 1.9014 1.0438 1.5154 1.0765
1.8740 1.0473 1.6369 1.0583 1.5866 1.0632 1.6146 1.0696
1.7130 1.0560 1.6355 1.0621 1.2460 1.0782 1.3163 1.0744
1.3073 1.0848 1.3813 1.0736 1.4077 1.0739 1.3202 1.0818
1.7229 1.0564 1.5615 1.0682 1.3388 1.0805 1.5906 1.0546