STAT 231 Course Notes W16 Print
2016 Edition
Contents
4. ESTIMATION
4.1 Statistical Models and Estimation
4.2 Estimators and Sampling Distributions
4.3 Interval Estimation Using the Likelihood Function
4.4 Confidence Intervals and Pivotal Quantities
4.5 The Chi-squared and t Distributions
4.6 Likelihood-Based Confidence Intervals
Preface
These notes are a work-in-progress with contributions from those students taking the
courses and the instructors teaching them. An original version of these notes was prepared
by Jerry Lawless. Additions and revisions were made by Don McLeish, Cyntha Struthers,
Jock MacKay, and others. Richard Cook supplied the example in Chapter 8. In order to
provide improved versions of the notes for students in subsequent terms, please email lists of
errors, sections that are confusing, or additional remarks/suggestions to your instructor
or [email protected].
Specific topics in these notes also have associated video files or PowerPoint shows that
can be accessed at www.watstat.ca. Where possible we reference these videos in the text.
1. INTRODUCTION TO STATISTICAL SCIENCES
Statistical Sciences are concerned with all aspects of empirical studies including problem
formulation, planning of an experiment, data collection, analysis of the data, and the con-
clusions that can be made. An empirical study is one in which we learn by observation or
experiment. A key feature of such studies is that there is usually uncertainty in the conclu-
sions. An important task in empirical studies is to quantify this uncertainty. In disciplines
such as insurance or finance, decisions must be made about what premium to charge for
an insurance policy or whether to buy or sell a stock, on the basis of available data. The
uncertainty as to whether a policy holder will have a claim over the next year, or whether
the price of a stock will rise or fall, is the basis of …nancial risk for the insurer and the
investor. In medical research, decisions must be made about the safety and e¢ cacy of new
treatments for diseases such as cancer and HIV.
Empirical studies deal with populations and processes; both of which are collections
of individual units. In order to increase our knowledge about a process, we examine a
sample of units generated by the process. To study a population of units we examine
a sample of units carefully selected from that population. Two challenges arise since we
only see a sample from the process or population and not all of the units are the same.
For example, scientists at a pharmaceutical company may conduct a study to assess the
effect of a new drug for controlling hypertension (high blood pressure) because they do not
know how the drug will perform on different types of people, what its side effects will be,
and so on. For cost and ethical reasons, they can involve only a relatively small sample
of subjects in the study. Variability in human populations is ever-present; people have
varying degrees of hypertension, they react differently to the drug, they have different side
effects. One might similarly want to study variability in currency or stock values, variability
in sales for a company over time, or variability in the number of hits and response times
for a commercial web site. Statistical Sciences deal both with the study of variability
in processes and populations, and with good (that is, informative, cost-effective) ways to
collect and analyze data about such processes.
We can have various objectives when we collect and analyze data on a population or
process. In addition to furthering knowledge, these objectives may include decision-making
and the improvement of processes or systems. Many problems involve a combination of
objectives. For example, government scientists collect data on fish stocks in order to further
scientific knowledge and also to provide information to policy makers who must set quotas
or limits on commercial fishing.
Statistical data analysis occurs in a huge number of areas. For example, statistical
algorithms are the basis for software involved in the automated recognition of handwritten
or spoken text; statistical methods are commonly used in law cases, for example in DNA
profiling; statistical process control is used to increase the quality and productivity of
manufacturing and service processes; individuals are selected for direct mail marketing
campaigns through a statistical analysis of their characteristics. With modern information
technology, massive amounts of data are routinely collected and stored. But data do not
equal information, and it is the purpose of the Statistical Sciences to provide and analyze
data so that the maximum amount of information or knowledge may be obtained³. Poor
or improperly analyzed data may be useless or misleading. The same could be said about
poorly collected data.
We use probability models to represent many phenomena, populations, or processes
and to deal with problems that involve variability. You studied these models in your first
probability course and you have seen how they describe variability. This course will focus
on the collection, analysis and interpretation of data and the probability models studied
earlier will be used extensively. The most important material from your probability course
is the material dealing with random variables, including distributions such as the Binomial,
Poisson, Multinomial, Normal or Gaussian, Uniform and Exponential. You should review
this material.
Statistical Sciences is a large discipline and this course is only an introduction. Our
broad objective is to discuss all aspects of: problem formulation, planning of an empirical
study, formal and informal analysis of data, and the conclusions and limitations of the
analysis. We must remember that data are collected and models are constructed for a
specific reason. In any given application we should keep the big picture in mind (e.g. Why
are we studying this? What else do we know about it?) even when considering one specific
aspect of a problem. We finish this introduction with a recent quote⁴ from Hal Varian,
Google's chief economist.
“The ability to take data - to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it’s going to be a hugely important skill in the next
decades, not only at the professional level but even at the educational level for elementary
³ A brilliant example of how to create information through data visualization is found in the video by
Hans Rosling at: http://www.youtube.com/watch?v=jbkSRLYSojo
⁴ For the complete article see "How the web challenges managers", Hal Varian, The McKinsey Quarterly,
January 2009.
school kids, for high school kids, for college kids. Because now we really do have essen-
tially free and ubiquitous data. So the complemintary (sic) scarce factor is the ability to
understand that data and extract value from it.
I think statisticians are part of it, but it’s just a part. You also want to be able to
visualize the data, communicate the data, and utilize it effectively. But I do think those
skills - of being able to access, understand, and communicate the insights you get from data
analysis - are going to be extremely important. Managers need to be able to access and
understand the data themselves.”
Variates can also be complex such as an image or an open ended response to a survey
question.
We are interested in functions of the variates over the whole population; for example
the average drop in blood pressure due to a treatment for individuals with hypertension
or the proportion of a population having a certain characteristic. We call these functions
attributes of the population or process.
We represent variates by letters such as x, y, z. For example, we might define a variate
y as the size of the insurance claim or the time between claims. The values of y typically
vary across the units in a population or process. This variability generates uncertainty and
makes it necessary to study populations and processes by collecting data about them. By
data, we mean the values of the variates for a sample of units in the population or a sample
of units taken from the process.
In planning to collect data about some process or population, we must carefully specify
what the objectives are. Then, we must consider feasible methods for collecting data as
well as the extent to which it will be possible to answer questions of interest. This sounds simple
but is usually difficult to do well, especially since resources are always limited.
There are several ways in which we can obtain data. One way is purely according to
what is available: that is, data are provided by some existing source. Huge amounts of
data collected by many technological systems are of this type, for example, data on credit
card usage or on purchases made by customers in a supermarket. Sometimes it is not
clear what available data represent and they may be unsuitable for serious analysis. For
example, people who voluntarily provide data in a web survey may not be representative of
the population at large. Alternatively, we may plan and execute a sampling plan to collect
new data. Statistical Sciences stress the importance of obtaining data that will be objective
and provide maximal information at a reasonable cost. There are three broad approaches:
(i) Sample Surveys. The object of many studies is to learn about a finite population
(e.g. all persons over 19 in Ontario as of September 12 in a given year). In this case
information about the population may be obtained by selecting a “representative”
sample of units from the population and determining the variates of interest for each
unit in the sample. Obtaining such a sample can be challenging and expensive. In
a survey sample the variates of interest are most often collected using a question-
naire. Sample surveys are widely used in government statistical studies, economics,
marketing, public opinion polls, sociology, quality assurance and other areas.
(ii) Observational Studies. An observational study is one in which data are collected
about a process or population without any attempt to change the value of one or
more variates for the sampled units. For example, in studying risk factors associated
with a disease such as lung cancer, we might investigate all cases of the disease at a
particular hospital (or perhaps a sample of them) that occur over a given time period.
We would also examine a sample of individuals who did not have the disease. A dis-
tinction between a sample survey and an observational study is that for observational
The three types of studies described above are not mutually exclusive, and many studies
involve aspects of all of them. Here are some slightly more detailed examples.
Consider, for example, soft drinks sold in nominal 355 ml cans. Because of inherent
variation in the filling process, the amount of liquid y that goes into a can varies over a
small range. Note that the manufacturer would like the variability in y to be as small as
possible, and for cans to contain at least 355 ml. Suppose that the manufacturer has just
added a new filling machine to increase the plant's capacity. The process engineer wants
to compare the new machine with an old one. Here the population of interest is the cans
filled in the future by both machines. She decides to do this by sampling some filled cans
from each machine and accurately measuring the amount of liquid y in each can. This is
an observational study.
How exactly should the sample be chosen? The machines may drift over time (that is,
the average of the y values or the variability in the y values may vary systematically up or
down over time) so we should select cans over time from each machine. We have to decide
how many, over what time period, and when to collect the cans from each machine.
campaigns in which large numbers of individuals are contacted by mail and invited to
acquire a product or service. Such individuals are usually picked from a much larger number
of persons on whom the company has information. For example, in a credit card marketing
campaign a company might have data on several million persons, pertaining to demographic
(e.g. sex, age, place of residence), financial (e.g. salary, other credit cards held, spending
patterns) and other variates. Based on the data, the company wishes to select persons whom
it considers have a good chance of responding positively to the mail-out. The challenge is
to use data from previous mail campaigns, along with the current data, to achieve as high
a response rate as possible.
Numerical Summaries
We now describe some numerical summaries which are useful for describing features of
a single variate in a data set when the variate is either continuous or discrete. These
summaries fall generally into three categories: measures of location (mean, median, and
mode), measures of variability or dispersion (variance, range, and interquartile range), and
measures of shape (skewness and kurtosis).
1. Measures of location:
The (sample) mean, also called the sample average: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
The (sample) median $\hat{m}$, which is the middle value when n is odd and the sample is ordered
from smallest to largest, and the average of the two middle values when n is even.
Since the median is less affected by a few extreme observations (see Problem 1) it is
a more robust measure of location.
The (sample) mode, or the value of y which appears in the sample with the highest
frequency (not necessarily unique).
2. Measures of variability or dispersion:
The (sample) variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right]$
and the (sample) standard deviation: $s = \sqrt{s^2}$.
The range $= y_{(n)} - y_{(1)}$ where $y_{(n)} = \max(y_1, y_2, \ldots, y_n)$ and $y_{(1)} = \min(y_1, y_2, \ldots, y_n)$.
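As a quick illustration, these summaries are available as built-in functions in R; a minimal sketch (the data vector below is made up purely for the example):
y <- c(3.1, 4.7, 2.2, 5.9, 4.4, 3.8, 6.3, 2.9)   # hypothetical sample of n = 8 values
mean(y)            # sample mean
median(y)          # sample median
var(y)             # sample variance (uses the n - 1 divisor)
sd(y)              # sample standard deviation
max(y) - min(y)    # range = y(n) - y(1)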
3. Measures of shape:
Measures of shape generally indicate how the data, in terms of a relative frequency
histogram, differ from the Normal bell-shaped curve, for example whether one tail of the
relative frequency histogram is substantially larger than the other so the histogram is asym-
metric, or whether both tails of the relative frequency histogram are large so the data are
more prone to extreme values than data from a Normal distribution.
[Figure 1.1: Relative frequency histogram for data with positive skewness (skewness = 1.15)]
When the relative frequency histogram of the data is approximately symmetric then
there is an approximately equal balance between the positive and negative values in the
sum $\sum_{i=1}^{n}(y_i - \bar{y})^3$ and this results in a value for the skewness that is approximately zero. If
the relative frequency histogram of the data has a long right tail (see Figure 1.1), then the
positive values of $(y_i - \bar{y})^3$ dominate the negative values in the sum and the value of the
skewness will be positive.
Similarly if the relative frequency histogram of the data has a long left tail (see Figure
1.2) then the value of the skewness will be negative.
[Figure 1.2: Relative frequency histogram for data with negative skewness (skewness = -1.35)]
The (sample) kurtosis measures the heaviness of the tails and the peakedness of the data relative to data
that are Normally distributed. For the Normal distribution the kurtosis is equal to 3.
Since the term $(y_i - \bar{y})^4$ is always positive, the kurtosis is always positive and values
greater than three indicate heavier tails (and a more peaked center) than data that are
Normally distributed. See Figures 1.3 and 1.4. Typical financial data such as the S&P 500
index have kurtosis greater than three, because the extreme returns (both large and small)
are more frequent than one would expect for Normally distributed data.
[Figure 1.3: Relative frequency histogram for data with kurtosis > 3, with a G(0.15, 1.52) p.d.f. superimposed]
[Figure 1.4: Relative frequency histogram for data with kurtosis < 3 (skewness = 0.08, kurtosis = 1.73), with a G(0.49, 0.29) p.d.f. superimposed]
Definition 1 (sample percentiles and sample quantiles): The pth (sample) quantile (also
called the 100pth (sample) percentile) is a value, call it q(p), determined as follows:
If $m \notin \{1, 2, \ldots, n\}$ but $1 < m < n$ then determine the closest integer $j$ such that
$j < m < j + 1$ and take $q(p) = \frac{1}{2}\left[y_{(j)} + y_{(j+1)}\right]$.
Depending on the size of the data set, quantiles are not uniquely defined for all values
of p. For example, what is the median of the values $\{1, 2, 3, 4, 5, 6\}$? What is the lower
quartile? There are different conventions for defining quantiles in these cases; if the sample
size is large, the differences in the quantiles based on the various definitions are small.
Definition 2 The values q(0.5), q(0.25) and q(0.75) are called the median, the lower or
first quartile, and the upper or third quartile respectively.
We can easily understand what the sample mean, quantiles and percentiles tell us about
the variate values in a data set. The sample variance and sample standard deviation measure
the variability or spread of the variate values in a data set. We prefer the standard deviation
because it has the same scale as the original variate. Another way to measure variability
is to use the interquartile range, the difference between the upper or third quartile and the
lower or first quartile.
Since the interquartile range is less affected by a few extreme observations (see Problem
2) it is a more robust measure of variability.
Definition 4 The five number summary of a data set consists of the smallest observation,
the lower quartile, the median, the upper quartile and the largest value, that is, the five
values $y_{(1)}, q(0.25), q(0.5), q(0.75), y_{(n)}$.
The five number summary provides a concise numerical summary of a data set which
provides information about the location (through the median), the spread (through the
lower and upper quartiles) and the range (through the minimum and maximum values).
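In R the five number summary and the quartiles can be obtained directly; a minimal sketch (the data vector is made up for the example, and note the caveat above about differing quantile conventions):
y <- c(3.1, 4.7, 2.2, 5.9, 4.4, 3.8, 6.3, 2.9)   # hypothetical data vector
fivenum(y)                        # smallest value, lower quartile, median, upper quartile, largest value
quantile(y, c(0.25, 0.5, 0.75))   # quartiles (R offers several quantile conventions; see ?quantile)
IQR(y)                            # interquartile range q(0.75) - q(0.25)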
Their initial Body Mass Index (BMI) was also calculated. BMI is used to measure obesity
or severely low weight. It is defined as:
$$BMI = \frac{\text{weight (kg)}}{\text{height (m)}^2}.$$
There is some variation in what different guidelines refer to as "overweight", "under-
weight", etc. We present one such classification in Table 1.1. The BMI obesity classification
is an example of an ordinal variate.
The data are available in the file ch1example131.txt available on the course web page
and are listed in Appendix C. For statistical analysis of the data, it is convenient to record
the data in row-column format (see Table 1.2). The first row of the file gives the variate
names, in this case subject number, sex (M=male or F=female), height, weight and BMI.
Each subsequent row gives the variate values for a particular subject.
The five number summaries for the BMI data for each sex are given in Table 1.3 along
with the sample mean and standard deviation.
From the table, we see that there are only small differences in the median and the mean.
For the standard deviation, IQR and the range we notice that the values are all larger for
the females. In other words, there is more variability in the BMI measurements for females
than for males in this sample.
We can also construct a relative frequency table that gives the proportion of subjects
that fall within each obesity class by sex.
From Table 1.4, we see that the reason for the larger variability for females is that there
is a greater proportion of females in the extreme classes.
Sample Correlation
So far we have looked only at numerical summaries of a data set $\{y_1, y_2, \ldots, y_n\}$. Often
we have bivariate data of the form $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. A numerical summary
of such data is the sample correlation.
Definition 5 The sample correlation, denoted by r, for data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, \qquad S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
and
$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2.$$
The sample correlation, which takes on values between $-1$ and $1$, is a measure of the
linear relationship between the two variates x and y. If the value of r is close to $1$ then
we say that there is a strong positive linear relationship between the two variates while if
the value of r is close to $-1$ then we say that there is a strong negative linear relationship
between the two variates. If the value of r is close to $0$ then we say that there is no linear
relationship between the two variates.
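A minimal sketch of this calculation in R, using a small made-up bivariate data set (the vectors x and y below are hypothetical, not data from the notes):
x <- c(1.6, 1.7, 1.8, 1.7, 1.9, 1.6)    # hypothetical heights
y <- c(60, 72, 80, 68, 85, 55)          # hypothetical weights
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxy / sqrt(Sxx * Syy)                   # sample correlation from the definition
cor(x, y)                               # the built-in function gives the same value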
Relative Risk
Recall that categorical variates consist of group or category names that do not neces-
sarily have any ordering. If two variates of interest in a study are categorical variates then
it does not make sense to use sample correlation as a measure of the relationship between
the two variates.
What measure can be used to summarize the relationship between taking daily aspirin and
the occurrence of CHD?
One measure which is used to summarize the relationship between two categorical vari-
ates is relative risk. To define relative risk consider a generalized version of Table 1.5 (Table 1.6).
If the events A and B are independent then
$$\frac{P(A \mid B)}{P(A \mid \bar{B})} = 1$$
and otherwise the ratio is not equal to one. In the PHS if we let A = takes daily aspirin
and B = CHD then we can estimate this ratio using the ratio of the sample proportions.
Definition 6 For categorical data in the form of Table 1.6 the relative risk of event A in
group B as compared to group $\bar{B}$ is
$$\text{relative risk} = \frac{y_{11}/(y_{11} + y_{12})}{y_{21}/(y_{21} + y_{22})}.$$
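A minimal sketch of this calculation in R; the 2 x 2 counts below are made up and simply stand in for Table 1.6 (row 1 = group B, row 2 = group B-bar; column 1 = event A occurs, column 2 = event A does not occur):
tab <- matrix(c(20, 180, 40, 160), nrow = 2, byrow = TRUE)  # hypothetical counts y11, y12, y21, y22
p1 <- tab[1, 1] / sum(tab[1, ])   # estimated P(A | B)
p2 <- tab[2, 1] / sum(tab[2, ])   # estimated P(A | B-bar)
p1 / p2                           # estimated relative risk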
Graphical Summaries
We consider several types of plots for a data set $\{y_1, y_2, \ldots, y_n\}$ and one type of plot for a
data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Frequency histograms
Consider measurements $\{y_1, y_2, \ldots, y_n\}$ on a variate y. Partition the range of y into k
non-overlapping intervals $I_j = [a_{j-1}, a_j)$, $j = 1, 2, \ldots, k$, and then calculate, for $j = 1, \ldots, k$,
the frequency $f_j$, the number of sample values which fall in the interval $I_j$. We can then construct:
(a) a "standard" frequency histogram where the intervals $I_j$ are of equal length. The
height of the rectangle for $I_j$ is the frequency $f_j$ or relative frequency $f_j/n$. This type
of histogram is similar to a bar chart.
(b) a "relative" frequency histogram, where the intervals $I_j = [a_{j-1}, a_j)$ may or may not
be of equal length. The height of the rectangle for $I_j$ is chosen so that its area equals
$f_j/n$, that is, the height of the rectangle for $I_j$ is equal to
$$\frac{f_j/n}{a_j - a_{j-1}}.$$
Note that in this case the sum of the areas of the rectangles in the histogram is equal
to one.
We can make the two types of frequency histograms visually comparable by using inter-
vals of equal length for both types. If we wish to compare two groups which have different
sample sizes then a relative frequency histogram must be used. If we wish to superimpose
a probability density function on a frequency histogram to see how well the data fit the
model then a relative frequency histogram must always be used.
To construct a frequency histogram, the number and location of the intervals must be
chosen. The intervals are typically selected so that there are ten to fifteen intervals and each
interval contains at least one y-value from the sample (that is, each $f_j \ge 1$). If a software
package is used to produce the frequency histogram (see Section 1.7) then the intervals are
usually chosen automatically. An option for user specified intervals is also usually provided.
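For example, a relative frequency histogram with user specified intervals can be drawn in R roughly as follows (a minimal sketch; the data vector and break points below are invented for illustration):
y <- c(21.3, 24.8, 19.5, 27.2, 23.1, 30.4, 22.6, 25.9, 28.7, 20.2)  # hypothetical BMI values
br <- seq(18, 32, 2)                         # user specified interval endpoints a0, a1, ..., ak
hist(y, breaks = br, freq = FALSE,           # freq = FALSE makes the rectangle areas sum to one
     xlab = 'BMI', ylab = 'Relative frequency')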
[Figure: relative frequency histogram of BMI (kurtosis = 3.03)]
[Figure: relative frequency histogram of BMI]
The data are listed in Appendix C and are available in the file Brake Pad Lifetime
Data.txt which is posted on the course web page. Notice that the distribution has a very
different shape compared to the BMI histograms. The brake pad lifetimes have a long right
tail which is consistent with a skewness value which is positive and not close to zero. The
high degree of variability in lifetimes is due to the wide variety of driving conditions which
different cars are exposed to, as well as to variability in how soon car owners decide to
replace their brake pads.
[Figure: relative frequency histogram of brake pad lifetimes (skewness = 1.28)]
[Figure 1.8: Empirical cumulative distribution function of heights for males and for females]
Boxplots
In many situations, we want to compare the values of a variate for two or more groups,
as in Example 1.3.1 where we compared BMI values and heights for males versus females.
Especially when the number of groups is large (or the sample sizes within groups are small),
side-by-side boxplots are a convenient way to display the data. Boxplots are also called box
and whisker plots.
The boxplot is usually displayed vertically. The center line in each box corresponds
to the median and the lower and upper sides of the box correspond to the lower quartile
q(0.25) and the upper quartile q(0.75) respectively. The so-called whiskers extend down
and up from the box to a horizontal line. The lower line is placed at the smallest observed
data value that is larger than the value $q(0.25) - 1.5\,IQR$ where $IQR = q(0.75) - q(0.25)$
is the interquartile range. Similarly the upper line is placed at the largest observed data
value that is smaller than the value $q(0.75) + 1.5\,IQR$. Any values beyond the whiskers
(often called outliers) are plotted with special symbols.
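Side-by-side boxplots such as Figure 1.9 are produced in R with the boxplot() function; a one-line sketch, assuming the variates weight and sex from Example 1.3.1 are available (for example after read.table() and attach()):
boxplot(weight ~ sex, ylab = 'Weight (kg)')   # one box per group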
[Figure 1.9: side-by-side boxplots of weight for males and females]
Figure 1.9 displays side-by-side boxplots of male and female weights from Example 1.3.1.
We can see for this sample that males are generally heavier than females but that the spread
of the two distributions is about the same. For the males and the females, the center line
in the box, which corresponds to the median, divides the box and whiskers approximately
in half which indicates that both distributions are roughly symmetric about the median.
For the females there are two very large weights.
Boxplots are particularly useful for comparing several groups. Figure 1.10 shows a
comparison of the miles per gallon (MPG) for 100 cars by country of origin. The boxplot
makes it easy to see the differences and similarities between the cars from different countries.
The graphical summaries we have just discussed are most useful for summarizing variates
which are either continuous or discrete with many possible values. For categorical variates
[Figure 1.10: Boxplots for miles per gallon for 100 cars from six different countries]
the data can be best summarized using bar graphs and pie charts. Such graphs can be used
incorrectly. See Problems 15-18 at the end of this chapter.
The graphical summaries discussed to this point deal with a single variate. If we have
data on two variates x and y for each unit in the sample then the data set is represented
as $\{(x_i, y_i), i = 1, \ldots, n\}$. We are often interested in examining the relationships between
the two variates.
Scatterplots
A scatterplot, which is a plot of the points $(x_i, y_i)$, $i = 1, \ldots, n$, can be used to see
whether the two variates are related in some way.
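In R a scatterplot is drawn with the plot() function; a one-line sketch, assuming vectors height and weight for the sampled units are available (for instance from the data of Example 1.3.1):
plot(height, weight, xlab = 'Height (m)', ylab = 'Weight (kg)')   # one point per sampled unit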
[Figure 1.11: scatterplot of weight versus height for males (r = 0.55)]
Figures 1.11 and 1.12 give the scatterplots of weight versus height for males
and females respectively for the data in Example 1.3.1. As expected, there is a tendency
[Figure 1.12: scatterplot of weight versus height for females (r = 0.31)]
for weight to increase as height increases for both sexes. What might be surprising is the
variability in weights for a given height.
the variate values vary so random variables can describe this variation
empirical studies usually lead to inferences that involve some degree of uncertainty,
and probability is used to quantify this uncertainty
models allow us to characterize processes and to simulate them via computer experi-
ments
of the units in the population. Getting such a list would be expensive and time consuming
so the actual selection procedure is likely to be very different. We select a sample of 500
units from the list at random and count the number of smokers in the sample. We model
this selection process using a Binomial random variable Y with probability function (p.f.)
$$P(Y = y; \theta) = \binom{500}{y}\,\theta^y (1-\theta)^{500-y} \quad \text{for } y = 0, 1, \ldots, 500$$
Here the parameter $\theta$ represents the unknown proportion of smokers in the population,
one attribute of interest in the study.
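This probability function can be evaluated in R with dbinom(); a minimal sketch (the value of $\theta$ below is made up purely for illustration):
theta <- 0.3                              # hypothetical value of the unknown proportion
dbinom(153, size = 500, prob = theta)     # P(Y = 153; theta) under the Binomial model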
Here the parameter $\theta > 0$ represents the mean lifetime of the brake pads in the population
since, in the model, the expected value of Y is $E(Y) = \theta$.
To model the sampling procedure, we assume that the data $\{y_1, \ldots, y_{200}\}$ represent 200
independent realizations of the random variable Y. That is, we let $Y_i$ = the lifetime for
the ith brake pad in the sample, $i = 1, 2, \ldots, 200$, and we assume that $Y_1, Y_2, \ldots, Y_{200}$ are
independent Exponential random variables each having the same mean $\theta$.
We can use the model and the data to estimate $\theta$ and other attributes of interest such
as the proportion of brake pads that fail in the first 100,000 km of use. In terms of the
model, we can represent this proportion by
$$P(Y \le 100; \theta) = \int_0^{100} f(y; \theta)\,dy = 1 - e^{-100/\theta}$$
where lifetime is measured in thousands of kilometres.
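A rough sketch of how this estimate could be obtained in R (the placeholder lifetimes below are simulated and stand in for the 200 observed values, which are not reproduced here):
lifetime <- rexp(200, rate = 1/75)    # placeholder for the 200 observed lifetimes (in 1000s of km)
theta.hat <- mean(lifetime)           # the sample mean estimates the Exponential mean theta
1 - exp(-100 / theta.hat)             # estimated proportion failing in the first 100,000 km
pexp(100, rate = 1 / theta.hat)       # the same quantity via the built-in Exponential c.d.f.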
coffee consumption (a somewhat unlikely proposition to be sure) then CHD would be the
explanatory variate and coffee habits would be the response variate.
In some cases it is not clear which is the explanatory variate and which is the response
variate. For example, the response variable Y might be the weight (in kg) of a randomly
selected female in the age range 16-25, in some population. A person’s weight is related to
their height. We might want to study this relationship by considering females with a given
height x (say in meters), and proposing that the distribution of Y, given x, is Gaussian,
$G(\alpha + \beta x, \sigma)$. That is, we propose that the average (expected) weight of a female depends
linearly on her height x and we write this as
$$E(Y \mid x) = \alpha + \beta x.$$
However it would be possible to reverse the roles of the two variates here and consider
the weight to be an explanatory variate and height the response variate, if for example we
wished to predict height using data on individuals' weights.
Models for describing the relationships among two or more variates are considered in
more detail in Chapters 6 and 7.
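As a concrete illustration of what this model says, here is a small sketch in R; the parameter values below are invented purely for illustration and are not estimates from any data in these notes:
alpha <- -100; beta <- 95; sigma <- 9                 # hypothetical values of the parameters
x <- 1.70                                             # a given height in metres
alpha + beta * x                                      # E(Y | x), the expected weight at this height
1 - pnorm(70, mean = alpha + beta * x, sd = sigma)    # P(Y > 70 | x) under the Gaussian model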
Suppose we are interested in the question “Is the smoking rate among teenage girls higher
than the rate among teenage boys?” From the data, we see that the sample proportion of
girls who smoke is 82/250 = 0.328 or 32.8% and the sample proportion of males who smoke
is 71/250 = 0.284 or 28.4%. In the sample, the smoking rate for females is higher. But
what can we say about the whole population? To proceed, we formulate the hypothesis
that there is no difference in the population rates. Then assuming the hypothesis is true,
we construct two Binomial models as in Example 1.4.1, each with a common parameter $\theta$.
We can estimate $\theta$ using the combined data so that $\hat{\theta} = 153/500 = 0.306$ or 30.6%. Then
using the model and the estimate, we can calculate the probability of such a large difference
in the observed rates. Such a large difference occurs about 20% of the time (if we selected
samples over and over and the hypothesis of no difference is true) so such a large difference
in observed rates happens fairly often and therefore, based on the observed data, there is no
evidence of a difference in the population smoking rates. In Chapter 7 we discuss a formal
method for testing the hypothesis of no difference in rates between teenage girls and boys.
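The probability of a large difference can also be approximated by simulation; the sketch below is illustrative only (it measures "such a large difference" by the absolute difference in sample proportions, and other reasonable choices will give somewhat different numbers):
set.seed(1)
nsim <- 10000
girls <- rbinom(nsim, 250, 0.306)       # simulated numbers of smokers among 250 girls
boys  <- rbinom(nsim, 250, 0.306)       # and among 250 boys, under a common rate of 0.306
diffs <- girls/250 - boys/250
mean(abs(diffs) >= (0.328 - 0.284))     # proportion of simulations with a difference at least as large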
[Figure 1.13: Run chart of the volume for the new machine over time]
First we examine if the behaviour of the two machines is stable over time. In Figures
1.13 and 1.14, we show a run chart of the volumes over time for each machine. There is no
indication of a systematic pattern for either machine so we have some confidence that the
data can be used to predict the performance of the machines in the near future.
The sample mean and standard deviation for the new machine are 356.8 and 0.54 ml
respectively and, for the old machine, are 357.5 and 0.80 ml. Figures 1.15 and 1.16 show the
relative frequency histograms of the volumes for the new machine and the old machine re-
spectively. To see how well a Gaussian model might fit these data we superimpose Gaussian
probability density functions with the mean equal to the sample mean and the standard
deviation equal to the sample standard deviation on each histogram. The agreement is
[Figure 1.14: Run chart of the volume for the old machine over time]
[Figure 1.15: Relative frequency histogram of volumes for the new machine, with a G(356.76, 0.54) p.d.f. superimposed]
reasonable given that the sample size for both data sets is only forty. Note that it only
makes sense to compare density functions and relative frequency histograms (not standard
frequency histograms) since the areas of both equal one.
None of the 80 cans had volume less than the required 355 ml. However, we examined
only 40 cans per machine. We can use the Gaussian models to estimate the long term
proportion of cans that fall below the required volume. For the new machine, we find
that if $V \sim G(356.8, 0.54)$ then $P(V \le 355) = 0.0005$ so about 5 in 10,000 cans will be
underfilled. The corresponding rate for the old machine is about 8 in 10,000 cans. These
estimates are subject to a high degree of uncertainty because they are based on a small
sample and we have no way to test that the models are appropriate so far into the tails of
the distribution.
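These tail probabilities come directly from the fitted Gaussian models and can be computed in R with pnorm() (the arguments are the volume followed by the mean and standard deviation of the fitted model):
pnorm(355, mean = 356.8, sd = 0.54)   # P(V <= 355) for the new machine
pnorm(355, mean = 357.5, sd = 0.80)   # P(V <= 355) for the old machine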
We can also see that the new machine is superior because of its smaller sample mean
[Figure 1.16: Relative frequency histogram of volumes for the old machine (skewness = 0.54, kurtosis = 2.84), with a G(357.5, 0.80) p.d.f. superimposed]
which translates into less overfill (and hence less cost to the manufacturer). It is possible
to adjust the mean of the new machine to a lower value because of its smaller standard
deviation.
Using R
Lots of online help is available in R. You can use a search engine to find the answer to most
questions. For example, if you search for "R tutorial", you will find a number of excellent
introductions to R that explain how to carry out most tasks. Within R, you can find help
for a specific function using the command help(function name) but it is often easier to look
externally using a search engine.
Here we show how to use R on a Windows machine. You should have R open as you
read this material so you can play along.
Some R Basics
R is command-line driven. For example, if you want to define a quantity x, use the assign-
ment function <- (that is, < followed by -).
x <- 15
If you want to change x, you can up-arrow to return to the assignment and make the
change you want, followed by a carriage return.
If you are doing something more complicated, you can type the code in Notepad or
some other text editor (Word is not advised!) and cut and paste the code into R.
You can save your session and, if you choose, it will be restored the next time you
open R.
You can add comments by entering # with the comment following on the same line.
Vectors
Vectors can consist of numbers or other symbols; we will consider only numbers here.
Vectors are defined using the function c( ). For example,
x <- c(1, 3, 5, 7, 9)
defines a vector of length 5 with the elements given. You can display the vector by typing
x and carriage return. Vectors and other objects possess certain attributes. For example,
typing
length(x)
will give the length of the vector x.
You can cut and paste comma-delimited strings of data into the function c(). This is
one way to enter data into R. See below to learn how you can read a file into R.
Arithmetic
R can be used as a calculator. Enter the calculation after the prompt > and hit return as
shown below.
> 7+3
[1] 10
> 7*3
[1] 21
> 7/3
[1] 2.333333
> 2^3
[1] 8
You can save the result of the calculation by assigning it to a variable such as y<-7+3
Some Functions
Functions in R operate element-wise on vectors. For example, round(y, n) rounds the elements
of y to n decimal places. With x <- c(1, 3, 5, 7, 9) as above and y <- seq(1, 2, 0.25):
> round(exp(y), 2)
[1] 2.72 3.49 4.48 5.75 7.39
> x+2*y
[1] 3.0 5.5 8.0 10.5 13.0
We often want to compare summary statistics of variate values by group (such as sex). We
can use the by() function. For example,
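the following is a minimal sketch (it assumes the file ch1example131.txt from Example 1.3.1 is in the working directory; the column names sex and BMI follow the description of that file but the exact names are an assumption):
a <- read.table('ch1example131.txt', header=T)   # BMI data from Example 1.3.1
attach(a)
by(BMI, sex, summary)   # numerical summary of BMI separately for each sex
by(BMI, sex, mean)      # sample mean of BMI for each sex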
Graphs
Note that in R, a graphics window opens automatically when a graphical function is used. A
useful way to create several plots in the same window is the function par() so, for example,
following the command
par(mfrow=c(2,2))
the next 4 plots will be placed in a 2 x 2 array within the same window.
There are various plotting and graphical functions. Three useful ones are hist() (frequency
histograms), plot() (scatterplots) and boxplot() (boxplots).
You can control the axes of plots (especially useful when you are making comparisons) by
including xlim = c(a, b) and ylim = c(d, e) as arguments separated by commas within
the plotting function. Also you can label the axes by including xlab = "your choice" and
ylab = "your choice". A title can be added using main = "your choice". There are many
other options. Check out the Html help “An Introduction to R” for more information on
plotting.
To save a graph, you can copy and paste into a Word document for example or alternately
use the "Save as" menu to create a file in one of several formats.
Probability Distributions
There are functions which compute values of probability functions or probability density
functions, cumulative distribution functions, and quantiles for various distributions. It is
also possible to generate random samples from these distributions. Some examples follow
for the Gaussian distribution. For other distributions, type help(distributionname) or
check the “Introduction to R” in the Html help menu.
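For the Gaussian distribution, the four standard functions are dnorm (probability density function), pnorm (cumulative distribution function), qnorm (quantiles) and rnorm (random samples); a brief sketch with arbitrary numerical arguments:
dnorm(1.5, mean = 0, sd = 1)       # value of the G(0,1) probability density function at 1.5
pnorm(1.5, mean = 0, sd = 1)       # cumulative distribution function P(Y <= 1.5)
qnorm(0.975, mean = 0, sd = 1)     # 0.975 quantile (approximately 1.96)
rnorm(10, mean = 356.8, sd = 0.54) # random sample of 10 values from G(356.8, 0.54)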
R stores and retrieves data from the current working directory. You can use the command
getwd()
to determine the current working directory. To change the working directory, look in
the File menu for "Change dir..." and browse until you reach your choice.
There are many ways to read data into R. The files we used in Chapter 1 are in .txt
format with the variate labels in the first row separated by spaces and the corresponding
variate values in subsequent rows. We created the files from EXCEL and then saved the
files as text files.
To read such files, first be sure the file is in your working directory. Then use the
commands
a<-read.table(’filename.txt’,header=T) #filename in single quotes
attach(a)
The "header=T" tells R that the variate names are in the first row of the data file. The
object a is called a data frame in R and the variate names are of the form "a$v1" where
v1 is the name of the first column in the file. The R function attach(a) allows you to drop
the "a$" from the variate names.
You can cut and paste output generated by R in the sessions window although the format
is usually messed up. This approach works best for figures. You can write an R vector or
other object to a text file through
write(y,file="filename")
To see more about the write function use help(write).
In the file ch1example152.txt, there are three columns labelled hour, machine and volume.
The data are
dd1<-dnorm(w1,356.8,0.53)
points(w1,dd1,type='l') # Superimpose Gaussian p.d.f.
hist(v2,br,freq=F,xlab='volume',ylab='density',main='Old Machine')
w2<-357.5+0.799*seq(-3,3,0.01) # Values where Gaussian p.d.f. is located
dd2<-dnorm(w2,357.5,0.8)
points(w2,dd2,type='l') # Superimpose Gaussian p.d.f.
2. The sample standard deviation and the interquartile range are two different measures
of the variability of a data set $(y_1, y_2, \ldots, y_n)$. Let s be the sample standard deviation
and let IQR be the interquartile range of the data set.
(a) Suppose we transform the data so that $u_i = a + by_i$, $i = 1, \ldots, n$ where a and b are
constants and $b \ne 0$. How are the sample standard deviation and interquartile
range of $u_1, \ldots, u_n$ related to s and IQR?
(b) Show that $\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$.
(c) Suppose we add an extra observation y0 to the data set. Use the result in
(b) to write the sample standard deviation of the augmented data set in terms
of y0 and the original sample standard deviation. What happens when y0 gets
large (or small)?
3. The sample skewness and kurtosis are two different measures of the shape of a data
set $(y_1, y_2, \ldots, y_n)$. Let $g_1$ be the sample skewness and let $g_2$ be the sample kurtosis
of the data set. Suppose we transform the data so that $u_i = a + by_i$, $i = 1, \ldots, n$ where
a and b are constants and $b \ne 0$. How are the sample skewness and sample kurtosis
of $u_1, \ldots, u_n$ related to $g_1$ and $g_2$?
4. Suppose the data $c_1, c_2, \ldots, c_{24}$ represent the costs of production for a firm every
month from January 2013 to December 2014. For this data set the sample mean was
$2500, the sample standard deviation was $5500, the sample median was $2600, the sample
skewness was 1.2, the sample kurtosis was 3.9, and the range was $7500. The rela-
tionship between cost and revenue is given by $r_i = 7c_i + 1000$, $i = 1, 2, \ldots, 24$. Find
the sample mean, standard deviation, median, skewness, kurtosis and range of the
revenues.
(a) Plot a relative frequency histogram of the data. Is the process producing pistons
within the specifications?
(b) Calculate the sample mean $\bar{y}$ and the sample median of the diameters.
(c) Calculate the sample standard deviation s and the IQR.
(d) Such data are often summarized using a single performance index called Ppk
defined as
$$Ppk = \min\left(\frac{U - \bar{y}}{3s},\; \frac{\bar{y} - L}{3s}\right)$$
where $(L, U) = (-10, 10)$ are the lower and upper specification limits. Calculate
Ppk for these data.
(e) Explain why larger values of Ppk (i.e. greater than 1) are desirable.
(f) Suppose we fit a Gaussian model to the data with mean and standard deviation
equal to the corresponding sample quantities, that is, with $\mu = \bar{y}$ and $\sigma = s$. Use
the fitted model to estimate the proportion of diameters (in the process) that
are out of specification.
6. In the above problem, we saw how to estimate the performance measure Ppk based on
a sample of 50 pistons, a very small proportion of one day's production. To get an idea
of how reliable this estimate is, we can model the process output by a Gaussian random
variable Y with mean and standard deviation equal to the corresponding sample
quantities. Then we can use R to generate another 50 observations and recalculate
Ppk. We do this many times. Here is some R code. Make sure you replace x with
the appropriate values.
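A rough sketch of the kind of simulation described (ybar and s below are placeholders for the sample mean and standard deviation from Problem 5, i.e. the values the problem refers to as x):
ybar <- 0.1; s <- 3.2          # placeholders: replace with the values computed in Problem 5
L <- -10; U <- 10              # specification limits
ppk <- numeric(1000)
for (i in 1:1000) {
  y <- rnorm(50, ybar, s)      # generate a new sample of 50 diameters from the fitted model
  ppk[i] <- min((U - mean(y)) / (3 * sd(y)), (mean(y) - L) / (3 * sd(y)))
}
hist(ppk, freq = FALSE, xlab = 'Ppk')   # distribution of the recalculated Ppk values
mean(ppk)                               # average Ppk over the 1000 iterations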
(a) Compare the Ppk from the original data with the average Ppk value from the
1000 iterations. Mark the original Ppk value on the histogram of generated Ppk
values. What do you notice? What would you conclude about how good the
original estimate of Ppk was?
(b) Repeat the above exercise but this time use a sample of 300 pistons rather than
50 pistons. What conclusion would you make about using a sample of 300 versus
50 pistons?
7. Construct the empirical cumulative distribution function for the following data:
0.76 0.43 0.52 0.45 0.01 0.85 0.63 0.39 0.72 0.88
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\right].$$
9. The data below show the lengths (in cm) of 43 male coyotes and 40 female coyotes
captured in Nova Scotia. (Based on Table 2.3.2 in Wild and Seber 1999.) The data
are available in the file ch1exercise5.txt.
Females (x)
71.0 73.7 80.0 81.3 83.5 84.0 84.0 84.5 85.0 85.0 86.0 86.4
86.5 86.5 88.0 87.0 88.0 88.0 88.5 89.5 90.0 90.0 90.2 91.0
91.4 91.5 91.7 92.0 93.0 93.0 93.5 93.5 93.5 96.0 97.0 97.0
97.8 98.0 101.6 102.5
$\sum_{i=1}^{40} x_i = 3569.6 \qquad \sum_{i=1}^{40} x_i^2 = 320223.38$
Males (y)
78.0 80.0 80.0 81.3 83.8 84.5 85.0 86.0 86.4 86.5 87.0 88.0
88.0 88.9 88.9 90.0 90.5 91.0 91.0 91.0 91.4 92.0 92.5 93.0
93.5 95.0 95.0 95.0 94.0 95.5 96.0 96.0 96.0 96.0 97.0 98.5
100.0 100.5 101.0 101.6 103.0 104.1 105.0
$\sum_{i=1}^{43} y_i = 3958.4 \qquad \sum_{i=1}^{43} y_i^2 = 366276.84$
(a) Plot relative frequency histograms of the lengths for females and males sepa-
rately. Be sure to use the same bins.
(b) Determine the five number summary for each data set.
(c) Compute the sample mean $\bar{y}$ and sample standard deviation s for the lengths
of the female and male coyotes separately. Assuming $\mu = \bar{y}$ and $\sigma = s$, overlay
the corresponding $G(\mu, \sigma)$ probability density function on the histograms for the
females and males separately. Comment on how well the Normal model fits each
data set.
(d) Plot the empirical distribution function of the lengths for females and males
separately. Assuming $\mu = \bar{y}$ and $\sigma = s$, overlay the corresponding $G(\mu, \sigma)$
cumulative distribution functions. Comment on how well the Normal model fits
each data set.
10. Does the value of an actor influence the amount grossed by a movie? The "value
of an actor" will be measured by the average amount the actor's movies have made.
The "amount grossed by a movie" is measured by taking the highest grossing movie
in which that actor played a major part. For example, Tom Hanks, whose value is
103.2, had his best results with Toy Story 3 (gross 415.0). All numbers are corrected
to 2012 dollar amounts and have units "millions of U.S. dollars". Twenty actors
were selected by taking the first twenty alphabetically listed by name on the website
(http://boxofficemojo.com/people/), and the corresponding measurements (above)
were obtained for each actor. The data for 20 actors, their value (x) and the gross
(y) of their best movie are given below:
Actor 1 2 3 4 5 6 7 8 9 10
Value (x) 67 49.6 37.7 47.3 47.3 32.9 36.5 92.8 17.6 14.4
Gross (y) 177.2 201.6 183.4 55.1 154.7 182.8 277.5 415 90.8 83.9
Actor 11 12 13 14 15 16 17 18 19 20
Value (x) 51.1 54 30.5 42.1 23.6 62.4 32.9 26.9 43.7 50.3
Gross (y) 158.7 242.8 37.1 220 146.3 168.4 173.8 58.4 199 533
$\sum_{i=1}^{20} x_i = 860.6 \qquad \sum_{i=1}^{20} x_i^2 = 43315.04 \qquad \sum_{i=1}^{20} x_i y_i = 184540.93$
$\sum_{i=1}^{20} y_i = 3759.5 \qquad \sum_{i=1}^{20} y_i^2 = 971560.19$
(a) What are the two variates in this data set? Choose one variate to be an explana-
tory variate and the other to be a response variate. Justify your choice.
(b) Plot a scatterplot of the data.
(c) Calculate the sample correlation for the data $(x_i, y_i)$, $i = 1, 2, \ldots, 20$. Is there a
strong positive or negative relationship between the two variates?
(d) Is it reasonable to conclude that the explanatory variate in this problem causes
the response variate? Explain.
11. In a very large population a proportion $\theta$ of people have blood type A. Suppose n
people are selected at random. Define the random variable Y = number of people
with blood type A in the sample of size n.
(a) What is the probability function for Y? What are E(Y) and Var(Y)? What
assumptions have you made?
(b) Suppose n = 50. What is the probability of observing 20 people with blood type
A as a function of $\theta$?
12. The IQ's of UWaterloo Math students are Normally distributed with
mean $\mu$ and standard deviation $\sigma$. Define the random variable Y = IQ of a
UWaterloo Math student.
(a) What is the probability density function of Y ? What are E(Y ) and V ar(Y )?
(b) Suppose that the IQ’s for 16 students were:
127 108 127 136 125 130 127 117 123 112 129 109 109 112 91 134
$\sum_{i=1}^{16} y_i = 1916, \qquad \sum_{i=1}^{16} y_i^2 = 231618$
What is a reasonable estimate of $\mu$ based on these data? What is a reasonable
estimate of $\sigma^2$ based on these data? Estimate the probability that a randomly
chosen UWaterloo Math student will have an IQ greater than 120.
(c) Suppose $Y_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, n$ independently.
(i) What is the distribution of
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i\,?$$
Find $E(\bar{Y})$ and $Var(\bar{Y})$. What happens to $Var(\bar{Y})$ as $n \to \infty$? What does
this imply about how far $\bar{Y}$ is from $\mu$ for large n?
(ii) Find $P\left(\bar{Y} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{Y} + 1.96\,\sigma/\sqrt{n}\right)$.
(iii) If $\sigma = 12$, find the smallest value of n such that $P\left(\left|\bar{Y} - \mu\right| \le 1.0\right) \ge 0.95$.
13. The lifetimes of a certain type of battery are Exponentially distributed with parameter
$\theta$. Define the random variable Y = lifetime of a battery.
(a) What is the probability density function of Y? What are E(Y) and Var(Y)?
(b) Suppose the lifetimes (in hours) for 20 batteries were:
20.5 9.9 206.4 9.1 45.8 232.7 127.8 60.4 4.3 3.6
184.8 3.0 4.4 72.3 22.3 195.3 86.3 8.8 23.3 4.1
$\sum_{i=1}^{20} y_i = 1325.1$
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
(i) Find $E(\bar{Y})$ and $Var(\bar{Y})$. What happens to $Var(\bar{Y})$ as $n \to \infty$? What does
this imply about how far $\bar{Y}$ is from $\theta$ for large n?
(ii) Approximate $P\left(\bar{Y} - 1.6449\,\theta/\sqrt{n} \le \theta \le \bar{Y} + 1.6449\,\theta/\sqrt{n}\right)$.
(a) What is the probability function for Y? What are E(Y) and Var(Y)?
(b) Suppose on 6 consecutive Wednesdays the number of accidents observed was
0, 2, 0, 1, 3, 1. What is the probability of observing these data as a function
of $\theta$? (Remember the Poisson process assumption that the numbers of events in
non-overlapping time intervals are independent.) What is a reasonable estimate
of $\theta$ based on these data? Estimate the probability that there is at least one
accident at this intersection next Wednesday.
(c) Suppose $Y_i \sim \text{Poisson}(\theta)$, $i = 1, 2, \ldots, n$ independently. Let
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
(i) Find $E(\bar{Y})$ and $Var(\bar{Y})$. What happens to $Var(\bar{Y})$ as $n \to \infty$? What does
this imply about how far $\bar{Y}$ is from $\theta$ for large n?
(ii) Approximate $P\left(\bar{Y} - 1.96\sqrt{\theta/n} \le \theta \le \bar{Y} + 1.96\sqrt{\theta/n}\right)$. You may ignore
the continuity correction.
Figure 1.17: Pie chart for support for Republican Presidential candidates
15. The pie chart in Figure 1.17, from Fox News, shows the support for various Republican
Presidential candidates. What do you notice about this pie chart? Comment on how
effective pie charts are in general at conveying information.
16. For the graph in Figure 1.18 indicate whether you believe the graph is effective in
conveying information by giving at least one feature of the graph which is either good
or bad.
[Figure 1.18: graph comparing boys and girls across the categories candy, chips, chocolate bars, cookies, crackers, fruit, ice cream, popcorn, pretzels and vegetables]
17. The graphs in Figures 1.19 and 1.20 are two more classic Fox News graphs. What do
you notice? What political message do you think they were trying to convey to their
audience?
18. Information about the mortality from malignant neoplasms (cancer) for females living
in Ontario is given in Figures 1.21 and 1.22 for the years 1970 and 2000 respectively.
The same information displayed in these two pie charts is also displayed in the bar
graph in Figure 1.23. Which display seems to carry the most information?
[Figure 1.21: Mortality from malignant neoplasms for females in Ontario 1970 (pie chart with categories Lung, Other, Breast, Stomach, Colorectal)]
[Figure 1.22: Mortality from malignant neoplasms for females in Ontario in 2000 (pie chart with categories Lung, Other, Stomach, Breast, Colorectal)]
[Figure 1.23: Mortality from malignant neoplasms for females living in Ontario, 1970 and 2000 (bar graph with categories Lung, Leuk. & Lymph., Breast, Colorectal, Stomach, Other)]
2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION
2. Past experience with data sets from the population or process, which has shown that
certain distributions are suitable.
⁶ The material in this section is largely a review of material you have seen in a previous probability course.
This material is available in the STAT 230 Notes which are posted on the course website.
⁷ The University of Wisconsin-Madison statistician George E.P. Box (18 October 1919 - 28 March 2013)
says of statistical models that "All models are wrong but some are useful" which is to say that although
rarely do they fit very large amounts of data perfectly, they do assist in describing and drawing inferences
from real data.
In probability theory, there is a large emphasis on factor 1 above, and there are many
"families" of probability distributions that describe certain types of situations. For example,
the Binomial distribution was derived as a model for outcomes in repeated independent
trials with two possible outcomes on each trial while the Poisson distribution was derived
as a model for the random occurrence of events in time or space. The Gaussian or Normal
distribution, on the other hand, is often used to represent the distributions of continuous
measurements such as the heights or weights of individuals. This choice is based largely on
past experience that such models are suitable and on mathematical convenience.
In choosing a model we usually consider families of probability distributions. To be
specific, we suppose that for a random variable Y we have a family of probability func-
tions/probability density functions $f(y; \theta)$ indexed by the parameter $\theta$ (which may be a
vector of values). In order to apply the model to a specific problem we need a value for $\theta$.
The process of selecting a value for $\theta$ based on the observed data is referred to as "estimat-
ing" the value of $\theta$ or "fitting" the model. The next section describes the most widely used
method for estimating $\theta$.
Most applications require a sequence of steps in the formulation (the word "specifica-
tion" is also used) of a model. In particular, we often start with some family of models in
mind, but find after examining the data set and fitting the model that it is unsuitable in cer-
tain respects. (Methods for checking the suitability of a model will be discussed in Section
2.4.) We then try other models, and perhaps look at more data, in order to work towards
a satisfactory model. This is usually an iterative process, which is sometimes represented
by diagrams such as:
Statistics devotes considerable effort to the steps of this process. However, in this
course we will focus on settings in which the models are not too complicated, so that
model formulation problems are minimized. There are several distributions that you should
review before continuing since they will appear frequently in these notes. See the STAT
220/230/240 Course Notes available on the course webpage. You should also consult the
Table of Distributions in these course notes for a condensed table of properties of these
distributions including their moment generating functions and their moments.
Discrete versus continuous random variables:
c.d.f. (discrete): $F(x) = P(X \le x) = \sum_{t \le x} P(X = t)$; F is a right continuous step function for all $x \in \Re$.
c.d.f. (continuous): $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$; F is a continuous function for all $x \in \Re$.
p.f./p.d.f. (discrete): $f(x) = P(X = x)$.
p.f./p.d.f. (continuous): $f(x) = \frac{d}{dx}F(x) \ne P(X = x) = 0$.
Probability of an event (discrete): $P(X \in A) = \sum_{x \in A} P(X = x) = \sum_{x \in A} f(x)$.
Probability of an event (continuous): $P(a < X \le b) = F(b) - F(a) = \int_{a}^{b} f(x)\,dx$.
Total probability (discrete): $\sum_{\text{all } x} P(X = x) = \sum_{\text{all } x} f(x) = 1$.
Total probability (continuous): $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
Expectation (discrete): $E[g(X)] = \sum_{\text{all } x} g(x) f(x)$.
Expectation (continuous): $E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$.
Binomial Distribution
The discrete random variable (r.v.) Y has a Binomial distribution if its probability
function is of the form
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where $\theta$ is a parameter with $0 \le \theta \le 1$. We write $Y \sim \text{Binomial}(n, \theta)$. Recall that
$E(Y) = n\theta$ and $Var(Y) = n\theta(1-\theta)$.
Poisson Distribution
The discrete random variable Y has a Poisson distribution if its probability function is
of the form
$$f(y; \theta) = \frac{\theta^y e^{-\theta}}{y!} \quad \text{for } y = 0, 1, 2, \ldots$$
where $\theta$ is a parameter with $\theta > 0$. We write $Y \sim \text{Poisson}(\theta)$. Recall that $E(Y) = \theta$ and
$Var(Y) = \theta$.
Exponential Distribution
The continuous random variable Y has an Exponential distribution if its probability
density function is of the form
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \quad \text{for } y > 0$$
where $\theta$ is a parameter with $\theta > 0$. We write $Y \sim \text{Exponential}(\theta)$. Recall that $E(Y) = \theta$
and $Var(Y) = \theta^2$.
Gaussian (Normal) Distribution
The continuous random variable Y has a Gaussian (or Normal) distribution if its
probability density function is of the form
$$f(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right] \quad \text{for } y \in \Re$$
where $\mu$ and $\sigma$ are parameters, with $\mu \in \Re$ and $\sigma > 0$. Recall that $E(Y) = \mu$, $Var(Y) =
\sigma^2$, and the standard deviation of Y is $sd(Y) = \sigma$. We write either $Y \sim G(\mu, \sigma)$ or
$Y \sim N(\mu, \sigma^2)$. Note that in the former case, $G(\mu, \sigma)$, the second parameter is the stan-
dard deviation whereas in the latter, $N(\mu, \sigma^2)$, the second parameter is the variance $\sigma^2$.
Most software syntax including R requires that you input the standard deviation for the
parameter. As seen in examples in Chapter 1, the Gaussian distribution provides a suitable
model for the distribution of measurements on characteristics like the height or weight of
individuals in certain populations, but is also used in many other settings. It is particularly
useful in finance where it is the most commonly used model for asset prices, exchange rates,
interest rates, etc.
Multinomial Distribution
The Multinomial distribution is a multivariate distribution in which the discrete random variables Y1, ..., Yk (k ≥ 2) have the joint probability function
$$P(Y_{1} = y_{1}, \ldots, Y_{k} = y_{k}; \theta) = f(y_{1}, \ldots, y_{k}; \theta) = \frac{n!}{y_{1}!\,y_{2}!\cdots y_{k}!}\,\theta_{1}^{y_{1}}\theta_{2}^{y_{2}}\cdots\theta_{k}^{y_{k}} \tag{2.1}$$
where each yi, for i = 1, ..., k, is an integer between 0 and n, satisfying the condition $\sum_{i=1}^{k} y_{i} = n$. The elements of the parameter vector θ = (θ1, ..., θk) satisfy 0 < θi < 1 for i = 1, ..., k and $\sum_{i=1}^{k}\theta_{i} = 1$. This distribution is a generalization of the Binomial distribution. It arises when there are repeated independent trials, where each trial has k possible outcomes (call them outcomes 1, ..., k), and the probability that outcome i occurs is θi. If Yi, i = 1, ..., k, is the number of times that outcome i occurs in a sequence of n independent trials, then
(Y1, ..., Yk) have the joint probability function given in (2.1). We write (Y1, ..., Yk) ∼ Multinomial(n, θ).
Since $\sum_{i=1}^{k} Y_{i} = n$ we can rewrite f(y1, ..., yk; θ) using only k − 1 variables, say y1, ..., y_{k−1}, by replacing yk with n − y1 − ⋯ − y_{k−1}. We see that the Multinomial distribution with k = 2 is just the Binomial distribution, where the two possible outcomes are S (Success) and F (Failure).
We now turn to the problem of fitting a model. This requires estimating or assigning numerical values to the parameters in the model (for example, θ in an Exponential model or μ and σ in the Gaussian model). A function of the data that does not depend on any unknown parameters is called a statistic. The numerical summaries discussed in Chapter 1 are all examples of statistics. A point estimate is also a statistic.
Instead of ad hoc approaches to estimation as in (2.2), it is desirable to have a general
method for estimating parameters. The method of maximum likelihood is a very general
method, which we now describe.
Let the discrete (vector) random variable Y represent potential data that will be used to estimate θ, and let y represent the actual observed data that are obtained in a specific
application. Note that to apply the method of maximum likelihood, we must know (or
make assumptions about) how the data y were collected. It is usually assumed here that
the data set consists of measurements on a random sample of population units.
$$L(\theta) = L(\theta; y) = P(Y = y; \theta) \quad \text{for } \theta \in \Omega$$
Note that the likelihood function is a function of the parameter θ and the given data y. For convenience we usually write just L(θ). Also, the likelihood function is the probability that we observe at random the observation y, considered as a function of the parameter θ. Obviously values of the parameter that make our observation y more probable would seem more credible or likely than those that make it less probable. Therefore values of θ for which L(θ) is large are more consistent with the observed data y. This seems like a “sensible” approach, and it turns out to have very good properties.
Definition 9 The value of θ which maximizes L(θ) for given data y is called the maximum likelihood estimate (m.l. estimate) of θ. It is the value of θ which maximizes the probability of observing the data y. The value is denoted by θ̂.
We are surrounded by polls. They guide the policies of political leaders, the products
that are developed by manufacturers, and increasingly the content of the media. The fol-
lowing is an example of a public opinion poll.
The poll described in the article was conducted in November 2010. Harris/Decima uses a telephone poll of 2000 “representative” adults. Figure 2.1 shows the results for the polls conducted in fall 2009 and 2010. In 2009 and 2010, 26% of respondents agreed and 48% disagreed with the statement: “University and college teachers earn too much”.
Figure 2.1: Harris/Decima poll. The two bars are from polls conducted in Nov. 9, 2009 (left bar) and Nov. 10, 2010 (right bar)
Harris/Decima declared their result to be accurate to within ±2.2%, 19 times out of 20 (the margin of error for regional, demographic or other subgroups is larger). What does this mean and how were these estimates and intervals obtained?
Suppose that the random variable Y represents the number of individuals who, in a randomly selected group of n persons, agreed with the statement. Suppose we assume that Y is closely modelled by a Binomial distribution with probability function
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where θ represents the proportion of the Canadian adult population that agree with the statement. If n people are selected and y people agree with the statement then the likelihood function is given by
$$L(\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 < \theta < 1. \tag{2.3}$$
It is easy to see that (2.3) is maximized by the value θ = θ̂ = y/n. (You should show this.) The estimate θ̂ = y/n is called the sample proportion. For the Harris/Decima poll conducted in 2010, y = 520 people out of n = 2000 people agreed with the statement so the likelihood function is
$$L(\theta) = \binom{2000}{520}\theta^{520}(1-\theta)^{1480} \quad \text{for } 0 < \theta < 1 \tag{2.4}$$
and the maximum likelihood estimate is 520/2000 = 0.26 or 26%. This is also easily seen from a graph of the likelihood function (2.4) given in Figure 2.2.
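A plot like Figure 2.2 is easy to reproduce in R by evaluating (2.4) over a grid of θ values; the sketch below is illustrative only (the variable names are not from the notes).
n <- 2000                                  # number of people polled in 2010
y <- 520                                   # number who agreed with the statement
theta <- seq(0.22, 0.30, by = 0.0001)      # grid of parameter values
L <- dbinom(y, size = n, prob = theta)     # choose(n,y)*theta^y*(1-theta)^(n-y)
plot(theta, L, type = "l", xlab = "theta", ylab = "L(theta)")
theta[which.max(L)]                        # grid value maximizing L, approximately 0.26 = y/n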
Figure 2.2: Likelihood function for the Harris/Decima poll and corresponding interval estimate for θ
The interval suggested by the pollsters was 26% ± 2.2%, or [23.8, 28.2]. Looking at Figure 2.2 we see that the interval [0.238, 0.282] is a reasonable interval for the parameter θ since it seems to contain most of the values of θ with large values of the likelihood L(θ). We will return to the construction of such interval estimates in Chapter 4.
Note that the likelihood function’s shape and where its maximum occurs are not affected if L(θ) is multiplied by a constant. Indeed it is not the absolute value of the likelihood function that is important but the relative values at two different values of the parameter, e.g. L(θ1)/L(θ2). You might think of this ratio as how much more or less consistent the data are with the parameter θ1 versus θ2. The ratio L(θ1)/L(θ2) is also unaffected if L(θ) is multiplied by a constant. In view of this the likelihood may be defined as P(Y = y; θ) or as any constant multiple of it, so, for example, we could drop the term $\binom{n}{y}$ in (2.3) and define L(θ) = θ^y(1 − θ)^{n−y}. This function and (2.3) are maximized by the same value θ̂ = y/n and have the same shape. Indeed we might rescale the likelihood function by dividing through by its maximum value L(θ̂) so that the new function has a maximum value equal to one.
Sometimes it is easier to work with the log (log = ln in these course notes) of the likelihood function.
Figure 2.3: The functions L(θ) (upper graph) and l(θ) (lower graph) are both maximized at the same value θ = θ̂
$$l(\theta) = \log L(\theta) \quad \text{for } \theta \in \Omega.$$
Note that θ̂ also maximizes l(θ). In fact in Figure 2.3 we see that l(θ), the lower of the two curves, is a monotone function of L(θ) so they increase together and decrease together. This implies that both functions have a maximum at the same value θ = θ̂.
Because functions are often (but not always!) maximized by setting their derivatives equal to zero (see footnote 11 below), we can usually obtain θ̂ by solving the equation
$$\frac{dl}{d\theta} = 0.$$
For example, from L(θ) = θ^y(1 − θ)^{n−y} we get l(θ) = y log θ + (n − y) log(1 − θ) and
$$\frac{dl}{d\theta} = \frac{y}{\theta} - \frac{n-y}{1-\theta}.$$
Solving dl/dθ = 0 gives θ = y/n. The First Derivative Test can be used to verify that this corresponds to a maximum value so the maximum likelihood estimate of θ is θ̂ = y/n.
Footnote 11: Can you think of an example of a continuous function f(x) defined on the interval [0, 1] for which the maximum $\max_{0 \le x \le 1} f(x)$ is NOT found by setting f′(x) = 0?
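As a purely numerical check on this calculus, the log likelihood can also be maximized directly in R with optimize(); a small sketch using the poll values n = 2000 and y = 520 for concreteness (everything else here is illustrative, not from the notes):
n <- 2000; y <- 520
loglik <- function(theta) y*log(theta) + (n - y)*log(1 - theta)   # l(theta) for the Binomial model
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum  # close to y/n = 0.26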
(You should recall from probability that if Y1 ; : : : ; Yn are independent random variables
then their joint probability function is the product of their individual probability functions.)
or more simply
$$L(\theta) = \theta^{n\bar{y}}e^{-n\theta} \quad \text{for } \theta > 0.$$
The log likelihood is
$$l(\theta) = n(\bar{y}\log\theta - \theta) \quad \text{for } \theta > 0$$
with derivative
$$\frac{d}{d\theta}l(\theta) = n\left(\frac{\bar{y}}{\theta} - 1\right) = \frac{n}{\theta}(\bar{y} - \theta).$$
A first derivative test easily verifies that the value θ = ȳ maximizes l(θ) and so θ̂ = ȳ is the maximum likelihood estimate of θ.
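The Poisson calculation is easy to check numerically in R; the data vector below is made up for illustration, and the log likelihood is written up to an additive constant (the term not involving θ).
y <- c(3, 5, 2, 4, 6, 3, 4, 5, 2, 4)          # hypothetical Poisson counts
n <- length(y); ybar <- mean(y)
l <- function(theta) n*(ybar*log(theta) - theta)   # log likelihood, up to an additive constant
optimize(l, c(0.01, 20), maximum = TRUE)$maximum   # agrees with the sample mean ybar = 3.8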
$$L(\theta) = L_{1}(\theta)\,L_{2}(\theta) \quad \text{for } \theta \in \Omega$$
$$L(\theta) = \theta^{1060}(1-\theta)^{2940} \quad \text{for } 0 < \theta < 1.$$
Sometimes the likelihood function for a given set of data can be constructed in more
than one way as the following example illustrates.
Example 2.2.3
Suppose that the random variable Y represents the number of persons infected with the human immunodeficiency virus (HIV) in a randomly selected group of n persons. We assume the data are reasonably modeled by Y ∼ Binomial(n, θ) with probability function
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where θ represents the proportion of the population that are infected. In this case, if we select a random sample of n persons and test them for HIV, we have Y = Y and y = y as the observed number infected. Thus
$$L(\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 < \theta < 1$$
or more simply
$$L(\theta) = \theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 < \theta < 1 \tag{2.5}$$
and again L(θ) is maximized by the value θ̂ = y/n.
For this random sample of n persons who are tested for HIV, we could also define the indicator random variable
Figure: plot of the log likelihood function l(θ) against θ
vi (ml) 8 4 2 1
no. of samples 10 10 10 10
no. with zi = 1 10 8 7 3
This gives
$$l(\theta) = 10\log(1 - e^{-8\theta}) + 8\log(1 - e^{-4\theta}) + 7\log(1 - e^{-2\theta}) + 3\log(1 - e^{-\theta}) - 21\theta \quad \text{for } \theta > 0.$$
In R the function nlm() is powerful and easy to use. In addition, statistical software packages contain special functions for fitting and analyzing a large number of statistical models. The R package MASS (which can be accessed by the command library(MASS)) has a function fitdistr that will fit many common models.
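As an illustration of these numerical tools, the sketch below fits an Exponential model to simulated data with fitdistr(); the data and the true mean of 10 are made up for illustration, and note that fitdistr parameterizes the Exponential by its rate 1/θ, so the reciprocal of the fitted rate is compared with the sample mean (the m.l. estimate of θ).
library(MASS)                      # provides fitdistr()
y <- rexp(200, rate = 1/10)        # simulated lifetimes with mean theta = 10
fit <- fitdistr(y, "exponential")  # numerical maximum likelihood fit
1/fit$estimate["rate"]             # reciprocal of the fitted rate ...
mean(y)                            # ... should essentially equal the sample mean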
with derivative
$$\frac{d}{d\theta}l(\theta) = n\left(\frac{\bar{y}}{\theta^{2}} - \frac{1}{\theta}\right) = \frac{n}{\theta^{2}}(\bar{y} - \theta).$$
A first derivative test easily verifies that the value θ = ȳ maximizes l(θ) and so θ̂ = ȳ is the maximum likelihood estimate of θ.
$$f(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^{2}}(y-\mu)^{2}\right] \quad \text{for } y \in \mathbb{R}.$$
$$L(\theta) = L(\mu, \sigma) = \prod_{i=1}^{n} f(y_{i}; \mu, \sigma) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^{2}}(y_{i}-\mu)^{2}\right] = (2\pi)^{-n/2}\sigma^{-n}\exp\left[-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu)^{2}\right] \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0$$
or more simply
$$L(\theta) = L(\mu, \sigma) = \sigma^{-n}\exp\left[-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu)^{2}\right] \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0.$$
$$l(\theta) = l(\mu, \sigma) = -n\log\sigma - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu)^{2} \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0.$$
To maximize l(μ, σ) with respect to both parameters μ and σ we solve the two equations (see footnotes 12 and 13)
$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mu) = \frac{n}{\sigma^{2}}(\bar{y}-\mu) = 0$$
$$\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^{3}}\sum_{i=1}^{n}(y_{i}-\mu)^{2} = 0,$$
which give
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}y_{i} = \bar{y} \quad \text{and} \quad \hat{\sigma} = \left[\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}\right]^{1/2}.$$
Footnote 12: To maximize a function of two variables, set the derivative with respect to each variable equal to zero. Of course finding values at which the derivatives are zero does not prove this is a maximum. Showing it is a maximum is another exercise in calculus.
Footnote 13: In case you have not met partial derivatives, the notation ∂l/∂μ means we are taking the derivative with respect to μ while holding the other parameter σ constant. Similarly ∂l/∂σ is the derivative with respect to σ while holding μ constant.
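The Gaussian maximum likelihood estimates just derived are one-line calculations in R; note that σ̂ divides by n, not n − 1, so it differs slightly from the value returned by sd(). A minimal sketch with simulated data (the sample size and true parameter values are illustrative):
y <- rnorm(50, mean = 10, sd = 2)        # simulated sample from G(10, 2)
mu.hat <- mean(y)                        # maximum likelihood estimate of mu
sigma.hat <- sqrt(mean((y - mu.hat)^2))  # m.l. estimate of sigma; divides by n, unlike sd(y)
c(mu.hat, sigma.hat, sd(y))              # compare sigma.hat with the usual sample standard deviation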
$$P(Y_{1} = y_{1}, Y_{2} = y_{2}, Y_{3} = y_{3}) = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,\theta_{1}^{y_{1}}\theta_{2}^{y_{2}}\theta_{3}^{y_{3}}$$
with
$$\theta_{1} = \theta^{2}, \quad \theta_{2} = 2\theta(1-\theta), \quad \theta_{3} = (1-\theta)^{2}$$
so that
$$P(Y_{1} = y_{1}, Y_{2} = y_{2}, Y_{3} = y_{3}) = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,[\theta^{2}]^{y_{1}}[2\theta(1-\theta)]^{y_{2}}[(1-\theta)^{2}]^{y_{3}}.$$
The likelihood function is therefore
$$L(\theta) = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,[\theta^{2}]^{y_{1}}[2\theta(1-\theta)]^{y_{2}}[(1-\theta)^{2}]^{y_{3}} = \frac{n!}{y_{1}!\,y_{2}!\,y_{3}!}\,2^{y_{2}}\,\theta^{2y_{1}+y_{2}}(1-\theta)^{y_{2}+2y_{3}} \quad \text{for } 0 < \theta < 1$$
or more simply
$$L(\theta) = \theta^{2y_{1}+y_{2}}(1-\theta)^{y_{2}+2y_{3}} \quad \text{for } 0 < \theta < 1.$$
The log likelihood is l(θ) = (2y1 + y2) log θ + (y2 + 2y3) log(1 − θ) and
$$\frac{dl}{d\theta} = \frac{2y_{1}+y_{2}}{\theta} - \frac{y_{2}+2y_{3}}{1-\theta}$$
and
$$\frac{dl}{d\theta} = 0 \quad \text{if} \quad \theta = \frac{2y_{1}+y_{2}}{2y_{1}+2y_{2}+2y_{3}} = \frac{2y_{1}+y_{2}}{2n}$$
so
$$\hat{\theta} = \frac{2y_{1}+y_{2}}{2n}$$
is the maximum likelihood estimate of θ.
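With observed counts (y1, y2, y3) this estimate is a one-line calculation; the counts below are hypothetical and used only to illustrate the formula.
y1 <- 42; y2 <- 46; y3 <- 12        # hypothetical category counts, n = 100
n <- y1 + y2 + y3
theta.hat <- (2*y1 + y2)/(2*n)      # m.l. estimate (2*y1 + y2)/(2n)
theta.hat                           # here (84 + 46)/200 = 0.65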
Example 2.5.1
Suppose we want to estimate attributes associated with BMI for some population of individuals (for example, Canadian males age 21-35). If the distribution of BMI values in the population is well described by a Gaussian model, Y ∼ G(μ, σ), then by estimating μ and σ we can estimate any attribute associated with the BMI distribution. For example:
(i) The mean BMI in the population corresponds to μ = E(Y) for the Gaussian distribution.
(ii) The median BMI in the population corresponds to the median of the Gaussian distribution, which equals μ since the Gaussian distribution is symmetric about its mean.
(iii) For the BMI population, the 0.1 (population) quantile is Q(0.1) = μ − 1.28σ. (To see this, note that P(Y ≤ μ − 1.28σ) = P(Z ≤ −1.28) = 0.1, where Z = (Y − μ)/σ has a G(0, 1) distribution.)
(iv) The fraction of the population with BMI over 35.0 is given by
$$p = 1 - P(Y \le 35.0) = 1 - \Phi\!\left(\frac{35.0-\mu}{\sigma}\right)$$
where Φ denotes the standard Normal cumulative distribution function.
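By the invariance property, each of these attributes is estimated by substituting the maximum likelihood estimates μ̂ and σ̂ into the expressions above; a small sketch in R with purely illustrative values of the estimates:
mu.hat <- 27.1; sigma.hat <- 4.6         # hypothetical m.l. estimates of mu and sigma
mu.hat - 1.28*sigma.hat                  # estimate of the 0.1 quantile Q(0.1)
qnorm(0.1, mu.hat, sigma.hat)            # essentially the same quantity, computed with qnorm
1 - pnorm(35.0, mu.hat, sigma.hat)       # estimate of p = P(Y > 35.0)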
Example 2.6.1 Rutherford and Geiger study of alpha-particles and the Poisson model
In 1910 the physicists Ernest Rutherford and Hans Geiger conducted an experiment in which they recorded the number of alpha particles emitted from a polonium source (as detected by a Geiger counter) during 2608 time intervals each of length 1/8 minute. The number of particles j detected in the time interval and the frequency fj of that number of particles is given in Table 2.1.
We can see whether a Poisson model fits these data by comparing the observed frequencies with the expected frequencies calculated assuming a Poisson model. To calculate these expected frequencies we need to specify the mean θ of the Poisson model. We estimate θ using the sample mean for the data, which is
$$\hat{\theta} = \frac{1}{2608}\sum_{j=0}^{14} j f_{j} = \frac{1}{2608}(10097) = 3.8715.$$
The expected frequencies are then
$$e_{j} = 2608\,\frac{(3.8715)^{j}e^{-3.8715}}{j!}, \quad j = 0, 1, \ldots$$
The expected frequencies are also given in Table 2.1.
Since the observed and expected frequencies are reasonably close, the Poisson model seems to fit these data well. Of course, we have not specified how close the expected and observed frequencies need to be in order to conclude that the model is reasonable. We will look at a formal method for doing this in Chapter 7.
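The expected frequencies are a direct calculation in R once θ̂ is known; only the totals n = 2608 and θ̂ = 3.8715 from the text are used here, so the output can be compared against Table 2.1.
theta.hat <- 3.8715
n <- 2608
j <- 0:14
e <- n * dpois(j, theta.hat)   # expected frequency e_j = n * P(Y = j; theta.hat)
round(e, 1)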
This comparison of observed and expected frequencies to check the fit of a model can also be used for data that have arisen from a continuous model. The following is an example.
$$e_{j} = 200\int_{a_{j-1}}^{a_{j}}\frac{1}{49.0275}e^{-y/49.0275}\,dy = 200\left[e^{-a_{j-1}/49.0275} - e^{-a_{j}/49.0275}\right].$$
The expected frequencies are also given in Table 2.2. We notice that the observed and
expected frequencies are not close in this case and therefore the Exponential model does
not seem to be a good model for these data.
The drawback of this method for continuous data is that the intervals must be selected
and this adds a degree of arbitrariness to the method. The following graphical methods
provide better techniques for checking the fit of the model for continuous data.
Graphical Checks of Models
We may also use graphical techniques for checking the fit of a model. These methods are particularly useful for continuous data.
The first graphical method is to superimpose the probability density function on the relative frequency histogram of the data as we did in Figures 1.15 and 1.16 for the data from the can filler study.
A second graphical procedure is to plot the empirical cumulative distribution function F̂(y) and then to superimpose on this a plot of the model-based cumulative distribution function, P(Y ≤ y; θ) = F(y; θ). We saw an example of such a plot in Chapter 1 but we provide more detail here. The objective is to compare two cumulative distribution functions, one that we hypothesized is the cumulative distribution function for the population, and the other obtained from the sample. If they differ a great deal, this would suggest that the hypothesized distribution is a poor fit.
Footnote 15: See the video at www.watstat.ca called "The empirical c.d.f. and the qqplot" on the material in this section.
0.76  0.43  0.52  0.45  0.01  0.85  0.63  0.39  0.72  0.88
The first step in constructing the empirical cumulative distribution function is to order the observations from smallest to largest (see footnote 16), obtaining
0.01  0.39  0.43  0.45  0.52  0.63  0.72  0.76  0.85  0.88
If you were then asked, purely on the basis of this data, what you thought the probability is that a random value in the population falls below a given value y, you would probably respond with the proportion in the sample that falls below y. For example, since four of the values 0.01, 0.39, 0.43, 0.45 are less than 0.5, we would estimate the cumulative distribution function at 0.5 using 4/10. Thus, we define the empirical cumulative distribution function for all real numbers y as the proportion of the sample less than or equal to y, that is,
$$\hat{F}(y) = \frac{\text{number of values in the sample which are} \le y}{n}.$$
More generally for a sample of size n we first order the yi's, i = 1, ..., n, to obtain the ordered values y(1) ≤ y(2) ≤ ... ≤ y(n). F̂(y) is a step function with a jump at each of the ordered observed values y(i). If y(1), y(2), ..., y(n) are all different values, then F̂(y(j)) = j/n and the jumps are all of size 1/n. In general the size of a jump at a particular point y is the number of values in the sample that are equal to y, divided by n.
Footnote 16: We usually denote the ordered values y(1) ≤ y(2) ≤ ... ≤ y(n), where y(1) is the smallest and y(n) is the largest. In this case y(n) = 0.88.
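In R the empirical cumulative distribution function is computed with ecdf(); a minimal sketch for the ten values above:
y <- c(0.76, 0.43, 0.52, 0.45, 0.01, 0.85, 0.63, 0.39, 0.72, 0.88)
Fhat <- ecdf(y)    # step function with jumps of size 1/10 at each observation
Fhat(0.5)          # proportion of the sample <= 0.5, equal to 4/10
plot(Fhat)         # plot of the empirical c.d.f.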
Figure 2.5: The empirical cumulative distribution function for n = 10 data values and a superimposed Uniform(0, 1) cumulative distribution function.
As an example we consider data (see Appendix C) for the time between 300 eruptions, between the first and the fifteenth of August 1985, of the geyser Old Faithful in Yellowstone National Park. One might hypothesize that the random distribution of times between consecutive eruptions follows a Normal distribution. We plot the empirical cumulative distribution function in Figure 2.6 together with the cumulative distribution function of a Gaussian distribution. Of course we don’t know the parameters of the appropriate Gaussian distribution so we use the sample mean 72.3 and sample standard deviation 13.9 in order to approximate these parameters. Are the differences between the two curves in Figure 2.6 sufficient that we would have to conclude a distribution other than the Gaussian? There are two ways of trying to get another view of the magnitude of these differences. The first way is to plot the relative frequency histogram of the data and then superimpose the Gaussian curve. The second way is to use a qqplot which will be discussed in the next section.
Figure 2.6: Empirical c.d.f. of times between eruptions of Old Faithful and superimposed G(72.3, 13.9) c.d.f.
Figure 2.7 seems to indicate that the distribution of the times between eruptions is not very Normal because it appears to have two modes. The plot of the empirical cumulative distribution function did not show the shape of the distribution as clearly as the histogram. The empirical cumulative distribution function does allow us to determine the pth quantile or 100pth percentile (the left-most value yp on the horizontal axis where F̂(yp) = p). For example, from the empirical cumulative distribution function of the Old Faithful data, we see that the median time (F̂(m̂) = 0.5) between eruptions is around m̂ = 78.
Figure 2.7: Relative frequency histogram for times between eruptions of Old Faithful and superimposed G(72.3, 13.9) p.d.f.
Figure 2.8: Empirical c.d.f. of female heights and G(1.62, 0.064) c.d.f.
Figure 2.9: Relative frequency histogram of female heights and G(1.62, 0.064) p.d.f.
Figure 2.9 shows a relative frequency histogram for these data with the G(1.62, 0.0637) probability density function superimposed. The two types of plots give complementary but consistent pictures. An advantage of the distribution function comparison is that the exact heights in the sample are used, whereas in the histogram plot the data are grouped into intervals to form the histogram. However, the histogram and probability density function show the distribution of heights more clearly. Both graphs indicate that a Normal model seems reasonable for these data.
Qqplots
An alternative view, which is really just another method of graphing the empirical cumulative distribution function, tailored to the Normal distribution, is a graph called a qqplot. Suppose the data Yi, i = 1, ..., n were in fact drawn from the G(μ, σ) distribution so that the standardized variables, after we order them from smallest Y(1) to largest Y(n), are
$$Z_{(i)} = \frac{Y_{(i)} - \mu}{\sigma}.$$
These behave like the ordered values from a sample of the same size taken from the G(0, 1) distribution. Approximately what value do we expect Z(i) to take? If Φ denotes the standard Normal cumulative distribution function then for 0 < u < 1
$$P(\Phi(Z) \le u) = P(Z \le \Phi^{-1}(u)) = \Phi(\Phi^{-1}(u)) = u$$
so that Φ(Z) has a Uniform distribution. It is easy to check that the expected value of the i'th smallest value in a random sample of size n from a Uniform(0, 1) distribution is equal to i/(n + 1) (see footnote 17), so we expect Φ(Z(i)) to be close to i/(n + 1). In other words we expect $Z_{(i)} = \frac{Y_{(i)}-\mu}{\sigma}$ to be approximately $\Phi^{-1}\!\left(\frac{i}{n+1}\right)$, or Y(i) to be roughly $\mu + \sigma\,\Phi^{-1}\!\left(\frac{i}{n+1}\right)$, a linear function of $\Phi^{-1}\!\left(\frac{i}{n+1}\right)$. This is the basic argument underlying the qqplot. If the distribution is actually Normal, then a plot of the points $\left(\Phi^{-1}\!\left(\frac{i}{n+1}\right),\; Y_{(i)}\right)$, i = 1, ..., n, should be approximately linear (subject to the usual randomness).
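The construction just described can be carried out by hand in R and compared with the built-in qqnorm(); a short sketch with simulated Normal data (the sample size and parameter values are illustrative):
y <- rnorm(100, mean = -2, sd = 3)                 # simulated G(-2, 3) sample
n <- length(y)
plot(qnorm((1:n)/(n + 1)), sort(y),                # theoretical quantiles vs ordered sample
     xlab = "Standard Normal Quantiles", ylab = "Sample Quantiles")
qqnorm(y); qqline(y)                               # built-in version with a reference line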
Since reading qqplots is an art acquired from experience, it is a good idea to generate similar plots where we know the answer. This can be done by generating data from a known distribution and then plotting a qqplot. See the R code below and Chapter 2, Problem 14. A qqplot of 100 observations randomly generated from a G(−2, 3) distribution is given in Figure 2.10. The theoretical quantiles are plotted on the horizontal axis and the empirical quantiles are plotted on the vertical axis. Since the quantiles of the Normal distribution change more rapidly in the tails of the distribution, we expect the points at both ends of the plot to lie further from the line.
Footnote 17: This is intuitively obvious since the n values Y(i) break the interval into n + 1 spacings, and it makes sense that each should have the same expected length. For empirical evidence see http://www.math.uah.edu/stat/applets/OrderStatisticExperiment.html. More formally we must first show that the p.d.f. of Y(i) is $\frac{n!}{(i-1)!(n-i)!}u^{i-1}(1-u)^{n-i}$ for 0 < u < 1. Then find the integral $E(Y_{(i)}) = \int_{0}^{1}\frac{n!}{(i-1)!(n-i)!}u^{i}(1-u)^{n-i}\,du = \frac{i}{n+1}$.
Figure 2.10: Qqplot of 100 observations generated from a G(−2, 3) distribution (Standard Normal quantiles on the horizontal axis, sample quantiles on the vertical axis)
A qqplot of the female heights is given in Figure 2.11. Overall the points lie reasonably
along a straight line. The qqplot has a staircase look because the heights are rounded to the
closest centimeter. As was the case for the relative frequency histogram and the empirical
cumulative distribution function, the qqplot indicates that the Normal model is reasonable
for these data.
Figure 2.11: Qqplot of the female heights (Standard Normal quantiles on the horizontal axis, sample quantiles on the vertical axis)
A qqplot of the times between eruptions of Old Faithful is given in Figure 2.12. The
points do not lie along a straight line which indicates as we saw before that the Normal is
not a reasonable model for these data. The two places at which the shape of the points
changes direction correspond to the two modes of these data that we observed previously.
Figure 2.12: Qqplot of the times between eruptions of Old Faithful
A qqplot of the lifetimes of brake pads (Example 1.3.3) is given in Figure 2.13. The
points form a U-shaped curve. This pattern is consistent with the long right tail and
positive skewness that we observed before. The Normal is not a reasonable model for these
data.
Figure 2.13: Qqplot of the lifetimes of brake pads
A qqplot of the data in Figure 1.4 is given in Figure 2.14. These points form an S-shaped
curve which is consistent with the fact that the data are reasonably symmetric but the data
do not have tails like the Normal distribution.
Figure 2.14: Qqplot of the data in Figure 1.4
R Code for Checking Models Using Histograms, Empirical c.d.f.’s and Qqplots
# Normal Data Example
y<-rnorm(100,5,2)     # generate 100 observations from a G(5,2) distribution
mn<-mean(y)           # find the sample mean
s<-sd(y)              # find the sample standard deviation
summary(y)            # five number summary
# skewness() is not in base R; it is assumed here to come from a package such as e1071
skewness(y,type=1)    # find the sample skewness as given in the Course Notes
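The lines above stop at the numerical summaries; a minimal continuation covering the histogram, empirical c.d.f. and qqplot checks described in this section (assuming y, mn and s from the lines above) might look like:
hist(y, freq = FALSE, main = "Relative frequency histogram")  # histogram on the density scale
curve(dnorm(x, mn, s), add = TRUE)                            # superimposed Gaussian p.d.f.
plot(ecdf(y), main = "Empirical c.d.f.")                      # empirical c.d.f.
curve(pnorm(x, mn, s), add = TRUE)                            # superimposed Gaussian c.d.f.
qqnorm(y); qqline(y)                                          # qqplot with reference line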
(a) $G(\theta) = \theta^{a}(1-\theta)^{b}$,  $0 < \theta < 1$
(b) $G(\theta) = \theta^{a}e^{-b/\theta}$,  $\theta > 0$
(c) $G(\theta) = \theta^{a}e^{-b\theta}$,  $\theta > 0$
(d) $G(\theta) = e^{-a(\theta-b)^{2}}$,  $\theta \in \mathbb{R}$
2. Consider the following two experiments whose purpose was to estimate θ, the fraction of a large population with blood type B.
Experiment 1: Individuals were selected at random until 10 with blood type B were
found. The total number of people examined was 100.
Experiment 2: One hundred individuals were selected at random and it was found
that 10 of them have blood type B.
(a) Find the probability of the observed results (as a function of θ) for the two experiments. Thus obtain the likelihood function for θ for each experiment and show that they are proportional. Show that the maximum likelihood estimate θ̂ is the same in each case. What is the maximum likelihood estimate of θ?
(b) Suppose n people came to a blood donor clinic. Assuming θ = 0.10, use the Normal approximation to the Binomial distribution (remember to use a continuity correction) to determine how large n should be to ensure that the probability of getting 10 or more donors with blood type B is at least 0.90. Use the R functions gbinom() or pbinom() to determine the exact value of n.
3. Specimens of a new high-impact plastic are tested by repeatedly striking them with a hammer until they fracture. Let Y = the number of blows required to fracture a specimen. If the specimen has a constant probability θ of surviving a blow, independently of the number of previous blows received, then the probability function for Y is
$$f(y; \theta) = P(Y = y; \theta) = \theta^{y-1}(1-\theta) \quad \text{for } y = 1, 2, \ldots, \; 0 < \theta < 1.$$
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the relative likelihood function R(θ). If n = 200 and $\sum_{i=1}^{200} y_{i} = 400$ then plot R(θ).
(c) Estimate the probability that a specimen fractures on the first blow.
(a) The numbers of transactions received in 10 separate one minute intervals were 8, 3, 2, 4, 5, 3, 6, 5, 4, 1. Write down the likelihood function for θ and find the maximum likelihood estimate θ̂.
(b) Estimate the probability that during a two-minute interval, no transactions arrive.
(c) Use the R function rpois() with the value θ = 4.1 to simulate the number of transactions received in 100 one minute intervals. Calculate the sample mean and variance; are they approximately the same? (Note that E(Y) = Var(Y) = θ for the Poisson model.)
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the relative likelihood function R(θ). If n = 20 and $\sum_{i=1}^{20} y_{i}^{2} = 72$ then plot R(θ).
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the log relative likelihood function r(θ). If n = 15 and $\sum_{i=1}^{15}\log y_{i} = 34.5$ then plot r(θ).
7. Suppose that in a population of twins, males (M) and females (F) are equally likely to occur and that the probability that a pair of twins is identical is θ. If twins are not identical, their sexes are independent.
8. The following model has been proposed for the distribution of Y = the number of children in a family, for a large population of families:
$$P(Y = 0; \theta) = \frac{1-2\theta}{1-\theta}$$
and
$$P(Y = y; \theta) = \theta^{y} \quad \text{for } y = 1, 2, \ldots \text{ and } 0 < \theta < \tfrac{1}{2}.$$
(a) What does the parameter θ represent?
(b) Suppose that n families are selected at random and the observed data were
y     0    1   ...   ymax   > ymax   Total
fy    f0   f1  ...   fmax   0        n
where fy = the observed number of families with y children and ymax = maximum
number of children observed in a family. Find the probability of observing these
data and thus determine the maximum likelihood estimate of θ.
(c) Consider a different type of sampling in which a single child is selected at random
and then the number of o¤spring in that child’s family is determined. Let X
represent the number of children in the family of a randomly chosen child. Show
that
$$P(X = x; \theta) = c\,x\,\theta^{x} \quad \text{for } x = 1, 2, \ldots \text{ and } 0 < \theta < \tfrac{1}{2}$$
and determine c.
(d) Suppose that the type of sampling in part (c) was used and that with n = 33
the following data were obtained:
x     1    2    3    4    > 4   Total
fx    22   7    3    1    0     33
Find the probability of observing these data and thus determine the maximum
likelihood estimate of θ. Estimate the probability a couple has no children using
these data.
(e) Suppose the sample in (d) was incorrectly assumed to have arisen from the sampling plan in (b). What would θ̂ be found to be? This problem shows that the way the data have been collected can affect the model.
9. When Wayne Gretzky played for the Edmonton Oilers he scored an incredible 1669
points in 696 games. The data are given in the frequency table below:
The Poisson(θ) model has been proposed for the random variable Y = number of points Wayne scores in a game.
(a) Show that the likelihood function for θ based on the Poisson model and the data in the frequency table simplifies to
$$L(\theta) = \theta^{1669}e^{-696\theta}, \quad \theta > 0.$$
10. Radioactive particles are emitted randomly over time from a source at an average rate of θ per second. In n time periods of varying lengths t1, t2, ..., tn (seconds), the numbers of particles emitted (as determined by an automatic counter) were y1, y2, ..., yn respectively.
(a) Determine an estimate of θ from these data. What assumptions have you made to do this?
(b) Suppose that the intervals are all of equal length (t1 = t2 = ... = tn = t) and that instead of knowing the yi's, we know only whether or not there were one or more particles emitted in each time interval of length t. Find the likelihood function for θ based on these data, and determine the maximum likelihood estimate of θ.
11. The marks for 100 students on a tutorial test in STAT 231 were:
Frequency histogram of the Marks
Qqplot of Marks (N(0,1) quantiles on the horizontal axis, sample quantiles on the vertical axis)
(a) Determine the sample mean ȳ and the sample standard deviation s for these data.
(b) Determine the proportion of observations in the intervals [ȳ − s, ȳ + s] and [ȳ − 2s, ȳ + 2s]. Compare these proportions with P(Y ∈ [μ − σ, μ + σ]) and P(Y ∈ [μ − 2σ, μ + 2σ]) where Y ∼ G(μ, σ).
(c) Find the sample skewness and sample kurtosis for these data. Are these values
close to what you would expect for Normally distributed data?
(d) Find the …ve-number summary for these data.
(e) Find the IQR for these data. Does the IQR agree with what you expect for Normally distributed data?
(f) Construct a relative frequency histogram and superimpose a Gaussian probability density function with μ = ȳ and σ = s.
(g) Construct an empirical distribution function for these data and superimpose a Gaussian cumulative distribution function with μ = ȳ and σ = s.
(h) Draw a boxplot for these data.
(i) Plot a qqplot for these data. Do you observe anything unusual about the qqplot? What might cause this?
(j) Based on the above information indicate whether it is reasonable to assume a Gaussian distribution for these data.
13. Consider the data on heights of adult males and females from Chapter 1. (The data are posted on the course webpage.)
(a) Assuming that for each sex the heights Y in the population from which the samples were drawn are adequately represented by Y ∼ G(μ, σ), obtain the maximum likelihood estimates μ̂ and σ̂ in each case.
(b) Give the maximum likelihood estimates for q(0.1) and q(0.9), the 10th and 90th percentiles of the height distribution for males and for females.
(c) Give the maximum likelihood estimate for the probability P(Y > 1.83) for males and females (i.e. the fraction of the population over 1.83 m, or 6 ft).
(d) A simpler estimate of P(Y > 1.83) that doesn't use the Gaussian model is
14. The qqplot of the brake pad data in Figure 2.13 indicates that the Normal distribution
is not a reasonable model for these data. Sometimes transforming the data gives a
data set for which the Normal model is more reasonable. A log transformation is often
used. Plot a qqplot of the log lifetimes and indicate whether the Normal distribution
is a reasonable model for these data. The data are posted on the course webpage.
15. In a large population of males ages 40-50, the proportion who are regular smokers is α where 0 < α < 1 and the proportion who have hypertension (high blood pressure) is β where 0 < β < 1. If the events S (a person is a smoker) and H (a person has hypertension) are independent, then for a man picked at random from the population the probabilities he falls into the four categories $SH$, $S\bar{H}$, $\bar{S}H$, $\bar{S}\bar{H}$ are respectively αβ, α(1 − β), (1 − α)β, (1 − α)(1 − β). Explain why this is true.
(a) Suppose that 100 men are selected and the numbers in each of the four categories are as follows:
Category     $SH$   $S\bar{H}$   $\bar{S}H$   $\bar{S}\bar{H}$
Frequency    20     15           22           43
Assuming that S and H are independent events, determine the likelihood function for α and β based on the Multinomial distribution, and find the maximum likelihood estimates of α and β.
(b) Compute the expected frequencies for each of the four categories using the max-
imum likelihood estimates. Do you think the model used is appropriate? Why
might it be inappropriate?
(a) y<-rnorm(100)
(b) y<-runif(100)
(c) y<-rexp(100)
(d) y<-rgamma(100,4,1)
(e) y<-rt(100,3)
(f) y<-rcauchy(100)
17. A qqplot was generated for 100 values of a variate. See Figure 2.17. Based on this
qqplot, answer the following questions:
(a) What is the approximate value of the sample median of these data?
(b) What is the approximate value of the IQR of these data?
(c) Would the frequency histogram of these data be reasonably symmetric about
the sample mean?
(d) The frequency histogram for these data would most resemble a Normal probabil-
ity density function, an Exponential probability density function or a Uniform
probability density function?
Figure 2.17: Qqplot of 100 values of a variate (N(0,1) quantiles on the horizontal axis, sample quantiles on the vertical axis)
(b) For observed k, n and y find the value N̂ that maximizes the probability in part (a). Does this ever differ much from the intuitive estimate Ñ = kn/y? (Hint: The likelihood L(N) depends on the discrete parameter N, and a good way to find where L(N) is maximized over {1, 2, 3, ...} is to examine the ratios L(N + 1)/L(N).)
(c) When might the model in part (a) be unsatisfactory?
20. Censored lifetime data: Consider the Exponential distribution as a model for the
lifetimes of equipment. In experiments, it is often not feasible to run the study long
enough that all the pieces of equipment fail. For example, suppose that n pieces of
equipment are each tested for a maximum of C hours (C is called a “censoring time”).
The observed data are: k (where 0 k n) pieces fail, at times y1 ; : : : ; yk and n k
pieces are still working after time C.
$$\hat{\theta} = \frac{1}{k}\left[\sum_{i=1}^{k} y_{i} + (n-k)C\right].$$
(c) What does part (b) give when k = 0? Explain this intuitively.
(d) A standard test for the reliability of electronic components is to subject them to large fluctuations in temperature inside specially designed ovens. For one particular type of component, 50 units were tested and k = 5 failed before 400 hours, when the test was terminated, with $\sum_{i=1}^{5} y_{i} = 450$ hours. Find the maximum likelihood estimate of θ.
21. Poisson model with a covariate: Let Y represent the number of claims in a given
year for a single general insurance policy holder. Each policy holder has a numerical
“risk score” x assigned by the company, based on available information. The risk score
may be used as a covariate (explanatory variable) when modeling the distribution of
Y , and it has been found that models of the form
[ (x)]y (x)
P (Y = yjx) = e for y = 0; 1; : : :
y!
(a) Suppose that n randomly chosen policy holders with risk scores x1, x2, ..., xn had y1, y2, ..., yn claims, respectively, in a given year. Determine the likelihood function for the unknown parameters based on these data.
(b) Can the maximum likelihood estimates be found explicitly?
3. PLANNING AND CONDUCTING EMPIRICAL STUDIES
Problem: a clear statement of the study's objectives, usually involving one or more questions.
Plan: the procedures used to carry out the study, including how we will collect the data.
Data: the collection of the data according to the Plan.
Analysis: the analysis of the data collected in light of the Problem and the Plan.
Conclusion: the conclusions that are drawn about the Problem and their limitations.
PPDAC has been designed to emphasize the statistical aspects of empirical studies. We develop each of the five steps in more detail below. Several examples of the use of PPDAC in an empirical study will be given. We identify the steps in the following example.
Example 3.1
The following news item was published by the University of Sussex, UK on February 16, 2015. It describes an empirical investigation in the field of psychology.
Campaigns to get young people to drink less should focus on the benefits of not
drinking and how it can be achieved:
Pointing out the advantages and achievability of staying sober is more effective than traditional
approaches that warn of the risks of heavy drinking, according to the research carried out at the
University of Sussex by researcher Dr Dominic Conroy. The study, published this week in the British
Journal of Health Psychology, found that university students were more likely to reduce their overall
drinking levels if they focused on the benefits of abstaining, such as more money and better health.
They were also less likely to binge drink if they had imagined strategies for how non-drinking might
be achieved – for example, being direct but polite when declining a drink, or choosing to spend
time with supportive friends. Typical promotions around healthy drinking focus on the risks of high
alcohol consumption and encourage people to monitor their drinking behaviour (e.g. by keeping a
drinks diary). However, the current study found that completing a drinks diary was less effective in
encouraging safer drinking behaviour than completing an exercise relating to non-drinking.
Dr Conroy says: “We focused on students because, in the UK, they remain a group who drink
heavily relative to their non-student peers of the same age. Similarly, attitudes about the acceptabil-
ity of heavy drinking are relatively lenient among students. “Recent campaigns, such as the NHS
Change4Life initiative, give good online guidance as to how many units you should be drinking
and how many units are in speci…c drinks. “Our research contributes to existing health promotion
advice, which seeks to encourage young people to consider taking ’dry days’ yet does not always
indicate the range of benefits nor suggest how non-drinking can be more successfully ‘managed’ in
social situations.”
Dr Conroy studied 211 English university students aged 18-25 over the course of a month. Par-
ticipants in the study completed one of four exercises involving either: imagining positive outcomes
of non-drinking during a social occasion; imagining strategies required to successfully not drink
during a social occasion; imagining both positive outcomes and required strategies; or completing a
drinks diary task.
At the start of the study, participants in the outcome group were asked to list positive outcomes
of not drinking and those in the process group listed what strategies they might use to reduce their
drinking. Those in the combined group did both. They were reminded of their answers via email
during the one month course of the study and asked to continue practising this mental simulation.
All groups completed an online survey at various points, indicating how much they had drunk
the previous week. Over the course of one month, Dr Conroy found that students who imagined
positive outcomes of non-drinking reduced their weekly alcohol consumption from 20 units to 14
units on average. Similarly, students who imagined required strategies for non-drinking reduced the
frequency of binge drinking episodes – classified as six or more units in one session for women, and eight or more units for men – from 1.05 episodes a week to 0.73 episodes a week on average.
Interestingly, the research indicates that perceptions of non-drinkers were also more favourable
after taking part in the study. Dr Conroy says this could not be directly linked to the intervention
but was an interesting additional feature of the study. He says: “Studies have suggested that holding
negative views of non-drinkers may be closely linked to personal drinking behaviour and we were
interested to see in the current study that these views may have improved as a result of taking
part in a non-drinking exercise. “I think this shows that health campaigns need to be targeted
and easy to …t into daily life but also help support people to accomplish changes in behaviour that
might sometimes involve ‘going against the grain’, such as periodically not drinking even when in
the company of other people who are drinking.”
Plan: Recruit 211 university students aged 18-25 in the United Kingdom and assign the students to one of the four mental exercises. (The article in the British Journal of
Health Psychology indicated that academic departments across English universities
were asked to forward a pre-prepared recruitment message to their students containing
a URL to an online survey.) Collect information from the students via online surveys
at various points including how much alcohol they had drunk the previous week.
Data: The data collected included which mental exercise group the student was in and information about their alcohol consumption in the week before they completed the various online surveys.
Conclusion: The study found that completing mental exercises relating to non-
drinking was more effective in encouraging safer drinking behaviour than completing
a drinks diary alone.
Note that in the Problem step, we describe what we are trying to learn or what
questions we want to answer. The Plan step describes how the data are to be measured
and collected. In the Data step, the Plan is executed. The Analysis step corresponds
to what many people think Statistics is all about. We carry out both simple and complex
calculations to process the data into information. Finally, in the Conclusion step, we answer
the questions formulated at the Problem step.
PPDAC can be used in two ways - first to actively formulate, plan and carry out investigations and second as a framework to critically scrutinize reported empirical investigations. These reports include articles in the popular press (as in the above example), scientific
papers, government policy statements and various business reports. If you see the phrase
“evidence based decision” or “evidence based management”, look for an empirical study.
To discuss the steps of PPDAC in more detail we need to introduce a number of technical
terms. Every subject has its own jargon, i.e. words with special meaning, and you need to
learn the terms describing the details of PPDAC to be successful in this course.
1. Problem
The elements of the Problem address questions starting with “What”
Types of Problems
Three common types of statistical problems that are encountered are described below.
“Does taking a low dose of aspirin reduce the risk of heart disease among men over the
age of 50?”
“Does changing from assignments to multiple term tests improve student learning in
STAT 231?”
“Does second-hand smoke from parents cause asthma in their children?”
“Does compulsory driver training reduce the incidence of accidents among new drivers?”
Predictive: The problem is to predict the response of a variate for a given unit. This is often the case in finance or in economics. For example, financial institutions need to predict the price of a stock or interest rates in a week or a month because this affects the value of their investments.
In the second type of problem, the experimenter is interested in whether one variate
x tends to cause an increase or a decrease in another variate Y . Where possible this is
conducted in a controlled experiment in which x is increased or decreased while holding
everything else in the experiment constant and we observe the changes in Y: As indicated
in Chapter 1, an experiment in which the experimenter manipulates the values of the ex-
planatory variates is referred to as an experimental study. On the other hand in the study
of whether second-hand smoke causes asthma, it is unlikely that the experimenter would
be able to manipulate the explanatory variate and so the experimenter needs to rely on
a potentially less informative observational study, one that depends on data that is col-
lected without the ability to control explanatory variates. We will see in Chapter 8 how
an empirical study must be carefully designed in order to answer such causative questions.
Important considerations in an observational study are the design of the survey and ques-
tionnaire, who to ask, what to ask, how many to ask, where to sample etc.
Definition 14 The target population or process is the collection of units to which the experimenters conducting the empirical study wish the conclusions to apply.
For each teenager (unit) in the target population, the variate of primary interest is
whether or not the teenager smokes. Other variates of interest de…ned for each unit might
be age and sex. In the can-filling example, the volume of liquid in each can is a variate. The machine that filled the can is another variate. A key point to notice is that the values
of the variates change from unit to unit in the population. There are usually many variates
associated with each unit. At this stage, we will be interested in only those that help specify
the questions of interest.
We specify the questions of interest in the Problem in terms of attributes of the target population. In the smoking example, one important attribute is the proportion of teenagers in the target population who smoke. In the can-filling example, the attributes of interest were the average volume and the variability of the volumes for all cans filled by each machine under current conditions. Possible questions of interest (among others) are:
“What proportion of teenagers in Ontario smoke?”
“Is the standard deviation of volumes of cans filled by the new machine less than that of the old machine?”
We can also ask questions about graphical attributes of the target population such as
the population histogram or a scatterplot of one variate versus another over the whole
population.
It is very important that the Problem step contain clear questions about one or more
attributes of the target population.
2. Plan
In most cases, we cannot calculate the attributes of interest for the target population directly
because we can only examine a subset of the units in the target population. This may be
due to lack of resources and time, as in the smoking survey or a physical impossibility as
in the can-filling study where we can only look at cans available now and not in the future. Or, in an even more difficult situation, we may be forced to carry out a clinical trial using
mice because it is unethical to use humans and so we do not examine any units in the target
population. Obviously there will be uncertainty in our answers. The purpose of the Plan
step is to decide what units we will examine (the sample), what data we will collect and
how we will do so. The Plan depends on the questions posed in the Problem step.
Definition 17 The study population or study process is the collection of units available to be included in the study.
Often the study population is a subset of the target population (as in the teenage
smoking survey). However, in many medical applications, the study population consists of
laboratory animals whereas the target population consists of humans. In this case the units
in the study population are laboratory animals and the units in the target population are
humans. In the development of new products, we may want to draw conclusions about a
production process in the future but we can only look at units produced in a laboratory
in a pilot process. In this case, the study units are not part of the target population. In
many surveys, the study population is a list of people de…ned by their telephone number.
The sample is selected by calling a subset of the telephone numbers. Therefore the study
population excludes those people without telephones or with unlisted numbers.
The study population is often not identical to the target population.
Definition 18 If the attributes in the study population differ from the attributes in the target population then the difference is called study error.
We cannot quantify study error but must rely on context experts to know, for example,
that conclusions from an investigation using mice will be relevant to the human target
population. We can however warn the context experts of the possibility of such error,
especially when the study population is very different from the target population.
Definition 19 The sampling protocol is the procedure used to select a sample of units from the study population. The number of units sampled is called the sample size.
In Chapter 2, we discussed modeling the data and often claimed that we had a “random sample” so that our model was simple. In practice, it is exceedingly difficult and expensive to select a random sample of units from the study population and so other less rigorous
methods are used. Often we “take what we can get”. Sample size is usually driven by
economics or availability. We will show in later chapters how we can use the model to help
with sample size determination.
Definition 20 If the attributes in the sample differ from the attributes in the study population the difference is called sample error or sampling error.
Even with random sampling, we are looking at only a subset of the units in the study
population. Differing sampling protocols are likely to produce different sample errors. Also,
since we do not know the values of the study population attributes, we cannot know the
sampling error. However, we can use the model to get an idea of how large this error might
be. These ideas are discussed in Chapter 4.
We must decide which variates we are going to measure or determine for the units in
the sample. For any attributes of interest, as defined in the Problem step, we will certainly measure the corresponding variates for the units in the sample. As we shall see, we may also decide to measure other variates that can aid the analysis. In the smoking survey, we will try to determine whether each teenager in the sample smokes or not (this requires a careful definition) and also many demographic variates such as age and sex so that we
can compare the smoking rate across age groups, sex etc. In experimental studies, the
experimenters assign the value of a variate to each unit in the sample. For example, in a
clinical trial, sampled units can be assigned to the treatment group or the placebo group
by the experimenters. When the value of a variate is determined for a given unit, errors
are often introduced by the measurement system which determines the value.
Definition 21 If the measured value and the true value of a variate are not identical the difference is called measurement error.
Measurement errors are usually unknown. In practice, we need to ensure that the
measurement systems used do not contribute substantial error to the conclusions. We may
have to study the measurement systems which are used in separate studies to ensure that
this is so.
The figure below shows the steps in the Plan and the sources of error:
Target Population
   ↓  (study error)
Study Population
   ↓  (sample error)
Sample
   ↓  (measurement error)
Measured variate values
A person using PPDAC for an empirical study should, by the end of the Plan step, have
a good understanding of the study population, the sampling protocol, the variates which
are to be measured, and the quality of the measurement systems that are intended for use.
In this course you will most often use PPDAC to critically examine the Conclusions from
a study done by someone else. You should examine each step in the Plan (you may have
to ask to see the Plan since many reports omit it) for strengths and weaknesses. You must
also pay attention to the various types of error that may occur and how they might impact
the conclusions.
3. Data
The object of the Data step is to collect the data according to the Plan. Any deviations
from the Plan should be noted. The data must be stored in a way that facilitates the
Analysis.
The previous sections noted the need to define variates clearly and to have satisfactory methods of measuring them. It is difficult to discuss the Data step except in the context of specific examples, but we mention a few relevant points.
Mistakes can occur in recording or entering data into a data base. For complex
investigations, it is useful to put checks in place to avoid these mistakes. For example,
if a field is missed, the data base should prompt the data entry person to complete
the record if possible.
In many studies the units must be tracked and measured over a long period of time
(e.g. consider a study examining the ability of aspirin to reduce strokes in which
persons are followed for 3 to 5 years). This requires careful management.
When data are recorded over time or in di¤erent locations, the time and place for
each measurement should be recorded.
There may be departures from the study Plan that arise over time (e.g. persons may
drop out of a long term medical study because of adverse reactions to a treatment; it
may take longer than anticipated to collect the data so the number of units sampled
must be reduced). Departures from the Plan should be recorded since they may have
an important impact on the Analysis and Conclusion.
In some studies the amount of data may be extremely large, so data base design and
management is important.
4. Analysis
In Chapter 1 we discussed different methods of summarizing the data using numerical and graphical summaries. A key step in formal analyses is the selection of an appropriate model that can describe the data and how it was collected. In Chapter 2 we discussed methods for checking the fit of the model. We also need to describe the Problem in terms of the
model parameters and properties. You will see many more formal analyses in subsequent
chapters.
5. Conclusions
The purpose of the Conclusion step is to answer the questions posed in the Problem. In
other words, the Conclusion is directed by the Problem. An attempt should be made
Footnote 18: http://www.youtube.com/watch?v=0A7ojjsmSsY
Footnote 19: http://www.cbc.ca/news/technology/story/2010/10/20/long-form-census-world-statistics-day.html
to quantify (or at least discuss) potential errors as described in the Plan step and any
limitations to the conclusions.
Background
An automatic in-line gauge measures the diameter of a crankshaft journal on 100% of
the 500 parts produced per shift. The measurement system does not involve an operator
directly except for calibration and maintenance. Figure 3.1 shows the diameter in question.
The journal is a “cylindrical” part of the crankshaft. The diameter of the journal must
be defined since the cross-section of the journal is not perfectly round and there may be
taper along the axis of the cylinder. The gauge measures the maximum diameter as the
crankshaft is rotated at a fixed distance from the end of the cylinder.
The specification for the diameter is −10 to +10 units with a target of 0. The measurements
are re-scaled automatically by the gauge to make it easier to see deviations from
the target. If the measured diameter is less than −10, the crankshaft is scrapped and a
cost is incurred. If the diameter exceeds +10, the crankshaft can be reworked, again at
considerable cost. Otherwise, the crankshaft is judged acceptable.
Overall Project
A project is planned to reduce scrap/rework by reducing part-to-part variation in the
diameter. A first step involves an investigation of the measurement system itself. There
is some speculation that the measurement system contributes substantially to the overall
process variation and that bias in the measurement system is resulting in the scrapping
and reworking of good parts. To decide if the measurement system is making a substantial
contribution to the overall process variability, we also need a measure of this attribute for
the current and future population of crankshafts. Since there are three different attributes
of interest, it is convenient to split the project into three separate applications of PPDAC.
Study 1
In this application of PPDAC, we estimate the properties of the errors produced by the
measurement system. In terms of the model, we will estimate the bias and variability due
to the measurement system. We hope that these estimates can be used to predict the future
performance of the system.
Problem
The target process is all future measurements made by the gauge on crankshafts to be
produced. The response variate is the measured diameter associated with each unit. The
attributes of interest are the average measurement error and the population standard de-
viation of these errors. We can quantify these concepts using a model (see below). A
detailed fishbone diagram for the measurement system is also shown in Figure 3.2. In such
a diagram, we list explanatory variates organized by the major “bones” that might be re-
sponsible for variation in the response variate, here the measured journal diameter. We can
use the diagram in formulating the Plan.
Note that the measurement system includes the gauge itself, the way the part is loaded
into the gauge, who loads the part, the calibration procedure (every two hours, a master
part is put through the gauge and adjustments are made based on the measured diameter
of the master part; that is “the gauge is zeroed”), and so on.
Plan
To determine the properties of the measurement errors we must measure crankshafts with
known diameters. “Known” implies that the diameters were measured by an off-line mea-
surement system that is very reliable. For any measurement system study in which bias is
an issue, there must be a reference measurement system which is known to have negligible
bias and variability which is much smaller than the system under study.
There are many issues in establishing a study process or a study population. For con-
venience, we want to conduct the study quickly using only a few parts. However, this
restriction may lead to study error if the bias and variability of the measurement system
change as other explanatory variates change over time or parts.
Figure 3.2: Fishbone diagram for the measurement system (major bones: Gauge, Measurement, Calibration, Operator, Environment)
We guard against this latter possibility by using three crankshafts with known diameters as part of the definition
of the study process. Since the units are the taking of measurements, we define the study
population as all measurements that can be taken in one day on the three selected crank-
shafts. These crankshafts were selected so that the known diameters were spread out over
the range of diameters Normally seen. This will allow us see if the attributes of the system
depend on the size of the diameter being measured. The known diameters which were used
were: 10, 0, and +10: Remember the diameters have been rescaled so that a diameter of
10 is okay.
No other explanatory variates were measured. To de…ne the sampling protocol, it
was proposed to measure the three crankshafts ten times each in a random order. Each
measurement involved the loading of the crankshaft into the gauge. Note that this was to
be done quickly to avoid delay of production of the crankshafts. The whole procedure took
only a few minutes.
The preparation for the data collection was very simple. One operator was instructed
to follow the sampling protocol and write down the measured diameters in the order that
they were collected.
Data
The repeated measurements on the three crankshafts are shown below. Note that due to
poor explanation of the sampling protocol, the operator measured each part ten times in
a row and did not use a random ordering. (Unfortunately non-adherence to the sampling
protocol often happens when real data are collected and it is important to consider the
effects of this in the Analysis and Conclusion steps.)
Analysis
A reasonable model for these data is
Yij = µi + Rij where Rij ~ G(0, σm) independently,    (3.1)
where i = 1, 2, 3 indexes the three crankshafts and j = 1, ..., 10 indexes the ten repeated
measurements. The parameter µi represents the long term average measurement for crankshaft i.
The random variables Rij (called the residuals) represent the variability of the
measurement system, while σm quantifies this variability. Note that we have assumed, for
simplicity, that the variability σm is the same for all three crankshafts in the study.
We can rewrite the model in terms of the random variables Yij so that Yij ~ G(µi, σm).
Now we can write the likelihood as in Example 2.3.2 and maximize it with respect to the
four parameters µ1, µ2, µ3, and σm (the trick is to solve ∂ℓ/∂µi = 0, i = 1, 2, 3 first). Not
surprisingly the maximum likelihood estimates for µ1, µ2, µ3 are the sample averages for
each crankshaft so that
µ̂i = ȳi = (1/10) Σ_{j=1}^{10} yij for i = 1, 2, 3.
To examine the assumption that σm is the same for all three crankshafts we can calculate
the sample standard deviation for each of the three crankshafts. Let
si = √[(1/9) Σ_{j=1}^{10} (yij − ȳi)²] for i = 1, 2, 3.
                 ȳi       si
Crankshaft 1   −10.3     1.49
Crankshaft 2     0.6     1.17
Crankshaft 3    10.3     1.42
The estimate of the bias for crankshaft 1 is the difference between the observed average
ȳ1 and the known diameter value, which is equal to −10 for crankshaft 1; that is, the
estimated bias is −10.3 − (−10) = −0.3. For crankshafts 2 and 3 the estimated biases are
0.6 − 0 = 0.6 and 10.3 − 10 = 0.3 respectively, so the estimated biases in this study are all
small.
Note that the sample standard deviations s1, s2, s3 are all about the same size and
our assumption about a common value seems reasonable. (Note: it is possible to test this
assumption more formally.) An estimate of σm is given by
sm = √[(s1² + s2² + s3²)/3] = 1.37.
Note that this estimate is not the average of the three sample standard deviations but the
square root of the average of the three sample variances. (Why does this estimate make
sense? Is it the maximum likelihood estimate of σm? What if the number of measurements
for each crankshaft were not equal?)
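The pooled estimate sm can be reproduced with a few lines of R. This is a minimal sketch using only the three sample standard deviations quoted in the table above (the 30 raw measurements are not listed in these notes); the object names are arbitrary.
s <- c(1.49, 1.17, 1.42)            # sample standard deviations for the three crankshafts
s_m <- sqrt(mean(s^2))              # square root of the average of the three sample variances
s_m                                 # approximately 1.37
# If the raw measurements were available as a 3 x 10 matrix y (one row per crankshaft),
# the table entries could be obtained with apply(y, 1, mean) and apply(y, 1, sd).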
Conclusion
The observed biases −0.3, 0.6, 0.3 appear to be small, especially when measured against
the estimate of σm, and there is no apparent dependence of bias on crankshaft diameter.
To interpret the variability, we can use the model (3.1). Recall that if Yij ~ G(µi, σm)
then
P(µi − 2σm ≤ Yij ≤ µi + 2σm) = 0.95.
Therefore if we repeatedly measure the same journal diameter, then about 95% of the time
we would expect to see the observations vary by about ±2(1.37) = ±2.74.
There are several limitations to these conclusions. Because we have carried out the
study on one day only and used only three crankshafts, the conclusion may not apply to
all future measurements (study error). The fact that the measurements were taken within
a few minutes on one day might be misleading if something special was happening at that
time (sampling error). Since the measurements were not taken in random order, another
source of sampling error is the possible drift of the gauge over time.
We could recommend that, if the study were to be repeated, more than three known-
value crankshafts could be used, that the time frame for taking the measurements could be
extended and that more measurements be taken on each crankshaft. Of course, we would
also note that these recommendations would add to the cost and complexity of the study.
We would also insist that the operator be better informed about the Plan.
Study 2
The second study is designed to estimate the overall population standard deviation of the
diameters of current and future crankshafts (the target population). We need to estimate
this attribute to determine what variation is due to the process and what is due to the mea-
surement system. A cause-and-effect or fishbone diagram listing some possible explanatory
variates for the variability in journal diameter is given in Figure 3.3. Note that there are
many explanatory variates other than the measurement system. Variability in the response
variate is induced by changes in the explanatory variates, including those associated with
the measurement system.
Figure 3.3: Fishbone diagram of explanatory variates for journal diameter (major bones: Method, Machine, Measurement, Environment, Material, Operator)
Plan
The study population is de…ned as those crankshafts available over the next week, about
7500 parts (500 per shift times 15 shifts). No other explanatory variates were measured.
Initially it was proposed to select a sample of 150 parts over the week (ten from each
shift). However, when it was learned that the gauge software stores the measurements for
the most recent 2000 crankshafts measured, it was decided to select a point in time near
the end of the week and use the 2000 measured values from the gauge memory to be the
sample. One could easily criticize this choice (sampling error), but the data were easily
available and inexpensive.
Data
The individual observed measurements are too numerous to list but a histogram of the data
is shown in Figure 3.4. From this, we can see that the measured diameters vary from −14
to +16.
Analysis
A reasonable model for these data is Yi = µ + Ri, where Ri ~ G(0, σ), i = 1, ..., 2000,
where Yi represents the distribution of the measurement of the ith diameter, µ represents
the study population mean diameter and the residual Ri represents the variability due to
sampling and the measurement system. We let σ quantify this variability. We have not
included a bias term in the model because we assume, based on our results from Study 1,
Figure 3.4: Histogram of 2000 measured values from the gauge memory
that the measurement system bias is small. As well we assume that the sampling protocol
does not contribute substantial bias.
The histogram of the 2000 measured diameters shows that there is considerable spread in
the measured diameters. About 4.2% of the parts require reworking and 1.8% are scrapped.
The shape of the histogram is approximately symmetrical and centred close to zero. The
sample mean is
ȳ = (1/2000) Σ_{i=1}^{2000} yi = 0.82
which gives us an estimate of µ (the maximum likelihood estimate) and the sample standard
deviation is
s = √[(1/1999) Σ_{i=1}^{2000} (yi − ȳ)²] = 5.17
which gives us an estimate of σ (not quite the maximum likelihood estimate).
Conclusion
The overall process variation is estimated by s. Since the sample contained 2000 parts
measured consecutively, many of the explanatory variates did not have time to change as
they would in the study population. Thus, there is a danger of sampling error producing
an estimate of the variation that is too small.
The variability due to the measurement system, estimated to be 1.37 in Study 1, is much
less than the overall variability, which is estimated to be 5.17. One way to compare the two
standard deviations σm and σ is to separate the total variability σ into the variability due
to the measurement system σm and that due to all other sources. In other words, we are
interested in estimating the variability that would be present if there were no variability
in the measurement system (σm = 0). If we assume that the total variability arises from
two independent sources, the measurement system and all other sources, then we have
σ² = σm² + σp², or
σp² = σ² − σm²,
where σp quantifies the variability due to all other uncontrollable variates (sampling variability).
An estimate of σp is given by
√(s² − sm²) = √[(5.17)² − (1.37)²] = 4.99.
Hence, eliminating all of the variability due to the measurement system would produce an
estimated variability of 4.99, which is a small reduction from 5.17. The measurement system
seems to be performing well and not contributing substantially to the overall variation.
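A minimal R sketch of this variance decomposition, using the two estimated standard deviations from Studies 1 and 2:
s_total <- 5.17                     # overall standard deviation estimated in Study 2
s_m <- 1.37                         # measurement system standard deviation from Study 1
s_p <- sqrt(s_total^2 - s_m^2)      # estimated variability from all other sources
s_p                                 # approximately 4.99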
Comments
Study 3 revealed that the measurement system had a serious long term problem. At first,
it was suspected that the cause of the variability was the fact that the gauge was not
calibrated over the course of the study. Study 3 was repeated with a calibration before
each measurement. A pattern similar to that for Study 3 was seen. A detailed examination
of the gauge by a repairperson from the manufacturer revealed that one of the electronic
components was not working properly. This was repaired and Study 3 was repeated. This
study showed variation similar to the variation of the short term study (Study 1) so that
the overall project could continue. When Study 2 was repeated, the overall variation and
the number of scrap and reworked crankshafts was substantially reduced. The project was
considered complete and long term monitoring showed that the scrap rate was reduced to
about 0.7% which produced an annual savings of more than $100,000.
As well, three similar gauges that were used in the factory were put through the “long
term” test. All were working well.
Summary
An important part of any Plan is the choice and assessment of the measurement
system.
The measurement system may contribute substantial error that can result in poor
decisions (e.g. scrapping good parts, accepting bad parts).
We represent systematic measurement error by bias in the model. The bias can be
assessed only by measuring units with known values, taken from another reference
measurement system. The bias may be constant or depend on the size of the unit
being measured, the person making the measurements, and so on.
Variability can be assessed by repeatedly measuring the same unit. The variability
may depend on the unit being measured or any other explanatory variates.
Both bias and variability may be a function of time. This can be assessed by examining
these attributes over a sufficiently long time span as in Study 3.
3.4 Chapter 3 Problems
                         Support Party
                         YES     NO
 Plan to Vote   YES      351     381
                NO       107     265
(a) De…ne the Problem for this study. What type of Problem is this and why?
(b) What is the target population?
(c) Identify the variates and their types for this study.
(d) What is the study population?
(e) What is the sample?
(f) Describe one possible source of study error.
(g) Describe one possible source of sampling error.
(h) There are two attributes of interest in the target population. In each case,
describe the attribute and provide an estimate based on the given data.
2. U.S. to fund study of Ontario math curriculum, Globe & Mail, January 17,
2014, Caroline Alphonso - Education Reporter (article has been condensed)
The U.S. Department of Education has funded a $2.7-million (U.S.) project, led by
a team of Canadian researchers at Toronto’s Hospital for Sick Children. The study
will look at how elementary students at several Ontario schools fare in math using
the current provincial curriculum as compared to the JUMP math program, which
combines the conventional way of learning the subject with so-called discovery learn-
ing. Math teaching has come under scrutiny since OECD results that measured the
scholastic abilities of 15-year-olds in 65 countries showed an increasing percentage of
Canadian students failing the math test in nearly all provinces. Dr. Tracy Solomon
and her team are collecting and analyzing two years of data on students in primary
and junior grades from one school board, which she declined to name. The students
were in Grades 2 and 5 when the study began, and are now in Grades 3 and 6, which
means they will participate in Ontario’s standardized testing program this year. The
research team randomly assigned some schools to teach math according to the Ontario
curriculum, which allows open-ended student investigations and problem-solving. The
other schools are using the JUMP program. Dr. Solomon said the research team is
using classroom testing data, lab tests on how children learn and other measures to
study the impact of the two programs on student learning.
Answer the questions below based on this article.
3. Playing racing games may encourage risky driving, study …nds, Globe &
Mail, January 8, 2015 (article has been condensed)
Playing an intense racing game makes players more likely to take risks such as speed-
ing, passing on the wrong side, running red lights or using a cellphone in a simulated
driving task shortly afterwards, according to a new study. Young adults with more
adventurous personalities were more inclined to take risks, and more intense games
led to greater risk-taking, the authors write in the journal Injury Prevention. Other
research has found a connection between racing games and inclination to risk-taking
while driving, so the new results broaden that evidence base, said lead author of the
new study, Mingming Deng of the School of Management at Xi’an Jiaotong University
in Xi’an, China. “I think racing gamers should be [paying] more attention in their
real driving,” Deng said.
The researchers recruited 40 student volunteers at Xi’an Jiaotong University, mostly
men, for the study. The students took personality tests at the start and were divided
randomly into two groups. Half of the students played a circuit-racing-type driving
game that included time trials on a race course similar to Formula 1 racing, for about
20 minutes, while the other group played computer solitaire, a neutral game for com-
parison. After a five-minute break, all the students took the Vienna Risk-Taking Test,
viewing 24 “risky” videotaped road-traffic situations on a computer screen presented
from the driver’s perspective, including driving up to a railway crossing whose gate
has already started lowering. How long the viewer waits to hit the “stop” key for
the manoeuvre is considered a measure of their willingness to take risks on the road.
Students who had been playing the racing game waited an average of almost 12 sec-
onds to hit the stop button compared with 10 seconds for the solitaire group. The
participants’ experience playing these types of games outside of the study did not
seem to make a di¤erence.
Answer the questions below based on this article.
4. Suppose you wish to study the smoking habits of teenagers and young adults, in order
to understand what personal factors are related to whether, and how much, a person
smokes. Briefly describe the main components of such a study, using the PPDAC
framework. Be specific about the target and study population, the sample, and the
variates you would collect.
5. Suppose you wanted to study the relationship between a person’s “resting” pulse rate
(heart beats per minute) and the amount and type of exercise they get.
(a) List some factors (including exercise) that might affect resting pulse rate. You
may wish to draw a cause and effect (fishbone) diagram to represent potential
causal factors.
(b) Describe briefly how you might study the relationship between pulse rate and
exercise using (i) an observational study, and (ii) an experimental study.
6. A large company uses photocopiers leased from two suppliers A and B. The lease
rates are slightly lower for B’s machines but there is a perception among workers
that they break down and cause disruptions in work flow substantially more often.
Describe briefly how you might design and carry out a study of this issue, with the
ultimate objective being a decision whether to continue the lease with company B.
What additional factors might affect this decision?
7. For a study like the one in Example 1.3.1, where heights x and weights y of individuals
are to be recorded, discuss sources of variability due to the measurement of x and y
on any individual.
4. ESTIMATION
(1) Where do we get our probability model? What if it is not a good description of the
population or process?
We discussed the …rst question in Chapters 1 and 2. It is important to check the
adequacy (or “fit”) of the model; some ways of doing this were discussed in Chapter
2 and more formal methods will be considered in Chapter 7. If the model used is not
satisfactory, we may not be able to use the estimates based on it. For the lifetimes of
brake pads data introduced in Example 1.3.3, a Gaussian model does not appear to
be suitable (see Chapter 2, Problem 11).
(2) The estimation of parameters or population attributes depends on data collected from
the population or process, and the likelihood function is based on the probability of
the observed data. This implies that factors associated with the selection of sample
units or the measurement of variates (e.g. measurement error) must be included in
the model. In many examples it is assumed that the variate of interest is measured
without error for a random sample of units from the population. We will typically
assume that the data come from a random sample of population units, but in any
given application we would need to design the data collection plan to ensure this
assumption is valid.
(3) Suppose in the model chosen the population mean is represented by the parameter
µ. The sample mean ȳ is an estimate of µ, but not usually equal to it. How far away
from µ is ȳ likely to be? If we take a sample of only n = 50 units, would we expect
the estimate ȳ to be as “good” as ȳ based on 150 units? (What does “good” mean?)
We focus on the third point in this chapter and assume that we can deal with the first
two points with the methods discussed in Chapters 1 and 2.
θ̂ = g(y1, ..., yn).    (4.1)
For example
θ̂ = ȳ = (1/n) Σ_{i=1}^{n} yi
is a point estimate of θ if y1, ..., yn is an observed random sample from a Poisson distribution
with mean θ.
The method of maximum likelihood provides a general method for obtaining estimates,
but other methods exist. For example, if θ = E(Y) = µ is the average (mean) value of y
in the population, then the sample mean µ̂ = ȳ is an intuitively sensible estimate; it is the
maximum likelihood estimate of µ if Y has a G(µ, σ) distribution but because of the Central
Limit Theorem it is a good estimate of µ more generally. Thus, while we will use maximum
likelihood estimation a great deal, you should remember that the discussion below applies
to estimates of any type.
The problem facing us in this chapter is how to determine or quantify the uncertainty
in an estimate. We do this using sampling distributions 20 , which are based on the following
idea. If we select random samples on repeated occasions, then the estimates θ̂ obtained from
the different samples will vary. For example, five separate random samples of n = 50 persons
from the same male population described in Example 1.3.1 gave five different estimates
µ̂ = ȳ of µ = E(Y) as:
1.723   1.743   1.734   1.752   1.736.
Estimates vary as we take repeated samples and therefore we associate a random variable
and a distribution with these estimates.
20
See the video at www.watstat.ca called “What is a sampling distribution?”
More precisely, we define this idea as follows. Let the random variables Y1, ..., Yn
represent the observations in a random sample, and associate with the estimate θ̂ given by
(4.1) a random variable
θ̃ = g(Y1, ..., Yn).
The random variable θ̃ = g(Y1, ..., Yn) is simply a rule that tells us how to process the
data to obtain a numerical value θ̂ = g(y1, ..., yn), which is an estimate of the unknown
parameter θ for a given data set y1, ..., yn. For example
θ̃ = Ȳ = (1/n) Σ_{i=1}^{n} Yi.
Example 4.2.1
Suppose we want to estimate the mean µ = E(Y) of a random variable, and that
a Gaussian distribution Y ~ G(µ, σ) describes variation in Y in the population. Let
Y1, ..., Yn represent a random sample from the population, and consider the estimator
µ̃ = Ȳ = (1/n) Σ_{i=1}^{n} Yi.
That is, we want to find the probability that |µ̃ − µ| is no more than 0.01 meters. Assuming
σ = s = 0.07 (meters), (4.2) gives the following results for sample sizes n = 50 and n = 100:
This indicates that a larger sample is “better” in the sense that the probability is higher
that µ̃ will be within 0.01 m of the true (and unknown) average height in the population.
It also allows us to express the uncertainty in an estimate µ̂ = ȳ from an observed sample
y1, ..., yn by indicating the probability that any single random sample will give an estimate
within a certain distance of µ.
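The two probabilities referred to above are easy to compute. Here is a minimal R sketch, assuming Ȳ ~ G(µ, σ/√n) with σ = 0.07, so that P(|Ȳ − µ| ≤ 0.01) = 2Φ(0.01√n/0.07) − 1:
sigma <- 0.07
for (n in c(50, 100)) {
  p <- 2 * pnorm(0.01 / (sigma / sqrt(n))) - 1
  cat("n =", n, " P(|ybar - mu| <= 0.01) =", round(p, 2), "\n")
}
# roughly 0.69 for n = 50 and 0.85 for n = 100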
Example 4.2.2
In Example 4.2.1 we were able to determine the distribution of the estimator exactly,
using properties of Gaussian random variables. Often we are not able to do this and
in this case we could use simulation to study the distribution21. For example, suppose we
have a random sample y1, ..., yn which we have assumed comes from an Exponential(θ)
distribution. The maximum likelihood estimate of θ is θ̂ = ȳ. What is the sampling
distribution for θ̃ = Ȳ? We can examine the sampling distribution by using simulation.
This involves taking repeated samples, y1, ..., yn, giving (possibly different) values of ȳ for
each sample as follows:
1. Generate a random sample y1, ..., yn of size n from the Exponential(θ) distribution.
2. Compute θ̂ = ȳ from the sample. In R this is done using the statement ybar<-mean(y).
Repeat these two steps k times. The k values ȳ1, ..., ȳk can then be considered as a
sample from the distribution of θ̃, and we can study the distribution by plotting a histogram
of the values.
The histogram in Figure 4.1 was obtained by drawing k = 10000 samples of size n = 15
from an Exponential(10) distribution, calculating the values ȳ1, ..., ȳ10000 and then plotting
the relative frequency histogram. What do you notice about the distribution particularly
with respect to symmetry? Does the distribution look like a Gaussian distribution?
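The simulation can be carried out in a few lines of R. This sketch uses rexp (with rate 1/θ so the mean is θ = 10) and replicate; it is one natural implementation of the two steps, not necessarily the code used to produce Figure 4.1.
set.seed(123)                                          # any seed, for reproducibility
k <- 10000; n <- 15; theta <- 10
ybar <- replicate(k, mean(rexp(n, rate = 1/theta)))    # k simulated sample means
hist(ybar, freq = FALSE, xlab = "sample mean")         # compare with Figure 4.1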
The approach illustrated in the preceding example can be used more generally. The
main idea is that, for a given estimator θ̃, we need to determine its sampling distribution
in order to be able to compute probabilities of the form P(|θ̃ − θ| ≤ d) so that we can
quantify the uncertainty of the estimate.
21
This approach can also be used to study sampling from a finite population of N values, {y1, ..., yN},
where we might not want to use a continuous probability distribution for Y.
Figure 4.1: Relative frequency histogram of means from 10000 samples of size 15
from an Exponential(10) distribution
The estimates and estimators we have discussed so far are often referred to as point es-
timates and point estimators. This is because they consist of a single value or “point”. The
discussion of sampling distributions shows how to address the uncertainty in an estimate.
We also usually prefer to indicate explicitly the uncertainty in the estimate. This leads to
the concept of an interval estimate 22 , which takes the form
[L(y), U(y)]
where L(y) and U(y) are functions of the observed data y. Notice that this provides an interval
with endpoints L and U both of which depend on the data. If we let L(Y) and U(Y)
represent the associated random variables then [L(Y), U(Y)] is a random interval. If we
were to draw many random samples from the same population and each time we constructed
the interval [L(y), U(y)], how often would the statement θ ∈ [L(y), U(y)] be true? The
probability that the parameter θ falls in this random interval is P[L(Y) ≤ θ ≤ U(Y)] and
hopefully this probability is large. This probability gives an indication how good the rule is
by which the interval estimate was obtained. For example if P[L(Y) ≤ θ ≤ U(Y)] = 0.95
then this means that 95% of the time (that is, for 95% of the different samples we might
draw), the true value of the parameter falls in the interval [L(y), U(y)] constructed from
the data set y. This means we can be reasonably safe in assuming, on this occasion, and
for this data set, it does so. In general, uncertainty in an estimate is explicitly stated by
giving the interval estimate along with the probability P(θ ∈ [L(Y), U(Y)]).
22
See the video What is a confidence interval? at watstat.ca
Definition 23 Suppose θ is scalar and that some observed data (say a random sample
y1, ..., yn) have given a likelihood function L(θ). The relative likelihood function R(θ) is
defined as
R(θ) = L(θ)/L(θ̂) for θ ∈ Ω
where θ̂ is the maximum likelihood estimate and Ω is the parameter space. Note that
0 ≤ R(θ) ≤ 1 for all θ ∈ Ω.
Figure 4.2: Relative likelihood function and log relative likelihood function for a
Binomial model
Figure 4.2 shows the relative likelihood functions R(θ) for two polls:
Poll 1: n = 200, y = 80
Poll 2: n = 1000, y = 400.
In each case θ̂ = 0.40, but the relative likelihood function is more “concentrated” around θ̂
for the larger poll (Poll 2). The 10% likelihood intervals also reflect this:
Table 4.1 gives rough guidelines for interpreting likelihood intervals. These are only
guidelines for this course. The interpretation of a likelihood interval must always be made
in the context of a given study.
The one apparent shortcoming of likelihood intervals so far is that we do not know how
probable it is that a given interval will contain the true parameter value. As a result we
also do not have a basis for the choice of p. Sometimes it is argued that values like p = 0.10
or p = 0.05 make sense because they rule out parameter values for which the probability
of the observed data is less than 1/10 or 1/20 of the probability when θ = θ̂. However, a
more satisfying approach is to apply the sampling distribution ideas in Section 4.2 to the
interval estimates. This leads to the concept of confidence intervals, which we describe next.
In Section 4.6 we revisit likelihood intervals and show that they are also confidence intervals.
The idea of a likelihood interval for a parameter θ can also be extended to the case of
a vector of parameters θ. In this case R(θ) ≥ p gives likelihood “regions” for θ.
23
C(θ) is called the coverage probability for the interval estimator [L(Y), U(Y)].
A few words are in order about the meaning of the probability statement in (4.3). The
parameter θ is an unknown fixed constant associated with the population. It is not a
random variable and therefore does not have a distribution. The statement (4.3) can be
interpreted in the following way. Suppose we were about to draw a random sample of the
same size from the same population and the true value of the parameter was θ. Suppose
also that we knew that we would construct an interval of the form [L(y), U(y)] once we
had collected the data. Then the probability that θ will be contained in this new interval
is C(θ)24.
How then does C(θ) assist in the evaluation of interval estimates? In practice, we try
to find intervals for which C(θ) is fairly close to 1 (values 0.90, 0.95 and 0.99 are often
used) while keeping the interval fairly narrow. Such interval estimates are called confidence
intervals.
If p = 0.95, for example, then (4.4) indicates that 95% of the samples that we would
draw from this model result in an interval which includes the true value of the parameter
(and of course 5% do not). This gives us some confidence that for a particular sample, such
as the one at hand, the true value of the parameter is contained in the interval.
The following example illustrates that the confidence coefficient sometimes does not
depend on the unknown parameter θ.
24
When we use the observed data y, L(y) and U(y) are numerical values, not random variables. We do
not know whether θ ∈ [L(y), U(y)] or not. P[L(y) ≤ θ ≤ U(y)] makes no more sense than P(1 ≤ 2 ≤ 3)
since L(y), θ, U(y) are all numerical values: there is no random variable to which the probability statement
can refer.
25
See the video at www.watstat.com called “What is a confidence interval”. See also the Java applet
http://www.math.uah.edu/stat/applets/MeanEstimateExperiment.html
where Ȳ = (1/n) Σ_{i=1}^{n} Yi is the sample mean. Since Ȳ ~ G(µ, 1/√n), then
P(Ȳ − 1.96/√n ≤ µ ≤ Ȳ + 1.96/√n)
= P(−1.96 ≤ √n(Ȳ − µ) ≤ 1.96)
= P(−1.96 ≤ Z ≤ 1.96)
= 0.95
where Z ~ G(0, 1). Thus the interval [ȳ − 1.96/√n, ȳ + 1.96/√n] is a 95% confidence
interval for the unknown mean µ. This is an example in which the confidence coefficient
does not depend on the unknown parameter, an extremely desirable feature of an interval
estimator.
We repeat the very important interpretation of a 95% confidence interval (since so many
people get the interpretation incorrect!). Suppose the experiment which was used to estimate
µ was conducted a large number of times and each time a 95% confidence interval
for µ was constructed using the observed data and the interval [ȳ − 1.96/√n, ȳ + 1.96/√n].
Then, approximately 95% of these constructed intervals would contain the true, but unknown
value of µ. Since we only have one interval [ȳ − 1.96/√n, ȳ + 1.96/√n] we do not
know whether it contains the true value of µ or not. We can only say that we are 95%
confident that the given interval [ȳ − 1.96/√n, ȳ + 1.96/√n] contains the true value of µ
since we are told it is a 95% confidence interval. In other words, we hope we were one of
the “lucky” 95% who constructed an interval containing the true value of µ. Warning:
You cannot say that the probability that the interval [ȳ − 1.96/√n, ȳ + 1.96/√n] contains
the true value of µ is 0.95!!!
If in Example 4.4.1 a particular sample of size n = 16 had observed mean ȳ = 10.4, then
the observed 95% confidence interval would be [ȳ − 1.96/4, ȳ + 1.96/4], or [9.91, 10.89]. We
cannot say that the probability that µ ∈ [9.91, 10.89] is 0.95. We can only say that
we are 95% confident that the interval [9.91, 10.89] contains µ.
Confidence intervals become narrower as the size of the sample on which they are based
increases. For example, note the effect of n in Example 4.4.1. The width of the confidence
interval is 2(1.96)/√n, which decreases as n increases. We noted this earlier for likelihood
intervals, and we will show in Section 4.6 that likelihood intervals are a type of confidence
interval.
Recall that the coverage probability for the interval in the above example did not depend
on the unknown parameter, a highly desirable property because we’d like to know the
coverage probability while not knowing the value of the unknown parameter. We next
consider a general method for finding confidence intervals which have this property.
Pivotal Quantities
De…nition 28 A pivotal quantity Q = Q(Y; ) is a function of the data Y and the un-
known parameter such that the distribution of the random variable Q is fully known. That
is, probability statements such as P (Q a) and P (Q b) depend on a and b but not on
or any other unknown information.
We now describe how a pivotal quantity can be used to construct a confidence interval.
We begin with the statement P[a ≤ Q(Y, θ) ≤ b] = p where Q(Y, θ) is a pivotal quantity
whose distribution is completely known. Suppose that we can re-express the inequality
a ≤ Q(Y, θ) ≤ b in the form L(Y) ≤ θ ≤ U(Y) for some functions L and U. Then since
P[L(Y) ≤ θ ≤ U(Y)] = p,
the interval [L(y), U(y)] is a 100p% confidence interval for θ. The confidence coefficient
for the interval [L(y), U(y)] is equal to p, which does not depend on θ. The confidence
coefficient does depend on a and b, but these are determined by the known distribution of
Q(Y, θ).
Example 4.4.2 Confidence interval for the mean of a Gaussian distribution with
known standard deviation
Suppose Y = (Y1, ..., Yn) is a random sample from the G(µ, σ0) distribution where
E(Yi) = µ is unknown but sd(Yi) = σ0 is known. Since
Q = Q(Y, µ) = (Ȳ − µ)/(σ0/√n) ~ G(0, 1)
and G(0, 1) is a completely known distribution, Q is a pivotal quantity. To obtain a 95%
confidence interval for µ we need to find values a and b such that P(a ≤ Q(Y, µ) ≤ b) = 0.95.
Now
0.95 = P(a ≤ (Ȳ − µ)/(σ0/√n) ≤ b)
     = P(Ȳ − bσ0/√n ≤ µ ≤ Ȳ − aσ0/√n),
so that
[ȳ − bσ0/√n, ȳ − aσ0/√n]
is a 95% confidence interval for µ based on the observed data y = (y1, ..., yn). Note that
there are infinitely many pairs (a, b) giving P(a ≤ Q ≤ b) = 0.95. A common choice for the
Gaussian distribution is to pick points symmetric about zero, a = −1.96, b = 1.96. This
results in the interval [ȳ − 1.96σ0/√n, ȳ + 1.96σ0/√n] or ȳ ± 1.96σ0/√n, which turns out
to be the narrowest possible 95% confidence interval.
The interval [ȳ − 1.96σ0/√n, ȳ + 1.96σ0/√n] is often referred to as a two-sided confidence
interval. Note that this interval takes the form
estimate ± 1.96 × (standard deviation of the estimator).
Many two-sided confidence intervals we will encounter in this course will take a similar form.
Another choice for a and b would be a = −∞, b = 1.645, which gives the interval
[ȳ − 1.645σ0/√n, ∞). The interval [ȳ − 1.645σ0/√n, ∞) is usually referred to as a one-sided
confidence interval. This type of interval is useful when we are interested in determining a
lower bound on the value of µ.
It turns out that for most distributions it is not possible to find exact pivotal quantities
or confidence intervals for θ whose coverage probabilities do not depend somewhat on the
true value of θ. However, in general we can find quantities Qn = Qn(Y1, ..., Yn; θ) such that
as n → ∞, the distribution of Qn ceases to depend on θ or other unknown information. We
then say that Qn is asymptotically pivotal, and in practice we treat Qn as a pivotal quantity
for sufficiently large values of n; more accurately, we call Qn an approximate pivotal quantity.
where θ̃ = Y/n, is also close to G(0, 1) for large n. Thus Qn can be used as an approximate
pivotal quantity to construct confidence intervals for θ. For example,
θ̂ ± 1.96 √[θ̂(1 − θ̂)/n]    (4.5)
gives an approximate 95% confidence interval for θ, where θ̂ = y/n and y is the observed
data.
As a numerical example, suppose we observed n = 100, y = 18. Then (4.5) gives
0.18 ± 1.96 [0.18(0.82)/100]^(1/2) or [0.105, 0.255].
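A minimal R sketch of this calculation based on (4.5):
n <- 100; y <- 18
thetahat <- y / n
thetahat + c(-1.96, 1.96) * sqrt(thetahat * (1 - thetahat) / n)   # approximately [0.105, 0.255]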
Remark: It is important to understand that confidence intervals may vary a great deal
when we take repeated samples. For example, in Example 4.4.3, ten samples of size n = 100
which were simulated for a population with θ = 0.25 gave the following approximate 95%
confidence intervals for θ:
[0.20, 0.38]  [0.14, 0.31]  [0.23, 0.42]  [0.22, 0.41]  [0.18, 0.36]
[0.14, 0.31]  [0.10, 0.26]  [0.21, 0.40]  [0.15, 0.33]  [0.19, 0.37]
For larger samples (larger n), the confidence intervals are narrower and will have better
agreement. For example, try generating a few samples of size n = 1000 and compare the
confidence intervals for θ.
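Here is one way the suggested experiment could be carried out in R; this sketch simulates five samples of size n = 1000 from a population with θ = 0.25 (rbinom generates the number of successes) and prints the approximate 95% confidence interval for each.
set.seed(1)
n <- 1000; theta <- 0.25
for (i in 1:5) {
  thetahat <- rbinom(1, n, theta) / n                            # simulated sample proportion
  print(round(thetahat + c(-1.96, 1.96) * sqrt(thetahat * (1 - thetahat) / n), 3))
}
# the intervals are noticeably narrower than those for n = 100 and agree more closely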
We have seen that confidence intervals for a parameter tend to get narrower as the sample
size n increases. When designing a study we often decide how large a sample to collect
on the basis of (i) how narrow we would like confidence intervals to be, and (ii) how much
we can afford to spend (it costs time and money to collect data). The following example
illustrates the procedure.
which was introduced in Example 4.4.3 and which has approximately a G(0, 1) distribution,
to obtain confidence intervals for θ. Here is a criterion that is widely used for choosing the
size of n: choose n large enough so that the width of a 95% confidence interval for θ is no
wider than 2(0.03). Let us see where this leads and why this rule is used.
From Example 4.4.3, we know that
θ̂ ± 1.96 √[θ̂(1 − θ̂)/n]
is an approximate 0.95 confidence interval for θ and that the width of this interval is
2(1.96) √[θ̂(1 − θ̂)/n].
To make this confidence interval narrower than 2(0.03) (or even narrower, say 2(0.025)),
we need n large enough so that
1.96 √[θ̂(1 − θ̂)/n] ≤ 0.03
or
n ≥ (1.96/0.03)² θ̂(1 − θ̂).
Of course we don’t know what θ̂ is because we have not taken a sample, but we note that
the worst case scenario occurs when θ̂ = 0.5. So to be conservative, we find n such that
n ≥ (1.96/0.03)² (0.5)² ≈ 1067.1.
Thus, choosing n = 1068 (or larger) will result in an approximate 95% confidence interval
of the form θ̂ ± c, where c ≤ 0.03. If you look or listen carefully when polling results are
announced, you’ll often hear words like “this poll is accurate to within 3 percentage points
19 times out of 20.” What this really means is that the estimator θ̃ (which is usually given
in percentage form) approximately satisfies P(|θ̃ − θ| ≤ 0.03) = 0.95, or equivalently, that
the actual estimate θ̂ is the centre of an approximate 95% confidence interval θ̂ ± c, for
which c ≤ 0.03. In practice, many polls are based on 1050–1100 people, giving “accuracy
to within 3 percent” with probability 0.95. Of course, one needs to be able to afford to
collect a sample of this size. If we were satisfied with an accuracy of 5 percent, then we’d
only need n = 385 (show this). In many situations however this might not be sufficiently
accurate for the purpose of the study.
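These sample size calculations are easily reproduced in R. The helper function below is hypothetical (not from any package); it uses the conservative bound θ̂(1 − θ̂) ≤ 0.25.
n_for_halfwidth <- function(w, z = 1.96) ceiling((z / w)^2 * 0.25)
n_for_halfwidth(0.03)    # 1068 (from 1067.1)
n_for_halfwidth(0.05)    # 385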
Exercise: Show that to ensure that the width of the approximate 95% confidence interval
is 2(0.02) = 0.04 or smaller, you need n = 2401. What should n be to make the width of a
99% confidence interval 2(0.02) = 0.04 or less?
Remark: Very large Binomial polls (n ≥ 2000) are not done very often. Although we can
in theory estimate θ very precisely with an extremely large poll, there are two problems:
2. In many settings the value of θ fluctuates over time. A poll is at best a snapshot at
one point in time.
As a result, the “real” accuracy of a poll cannot generally be made arbitrarily high.
Sample sizes can be similarly determined so as to give confidence intervals of some
desired length in other settings. We consider this topic again in Section 4.7 for the G(µ, σ)
distribution.
Conducting a complete census is usually costly and time-consuming. This example il-
lustrates how a random sample, which is less expensive, can be used to obtain “good”
information about the attributes of interest for a population.
Suppose interviewers are hired at $20 per hour to conduct door to door interviews of
adults in a municipality of 50,000 households. There are two choices:
(1) conduct a census, that is, interview a member of every household in the municipality;
(2) take a random sample of households in the municipality and then interview a member
of each household.
If a random sample is used it is estimated that each interview will take approximately
20 minutes (travel time plus interview time). If a census is used it is estimated that each
interview will take approximately 10 minutes since there is less travel time. We can
summarize the costs and precision one would obtain for one question on the form which
asks whether a person agrees/disagrees with a statement about the funding levels for higher
education. Let θ be the proportion in the population who agree. Suppose we decide that a
“good” estimate of θ is one that is accurate to within 2% of the true value 95% of the time.
For a census, six interviews can be completed in one hour. At $20 per hour the interviewer
cost for the census is approximately
(50000/6) × $20 ≈ $166,667
since there are 50,000 households.
For a random sample, three interviews can be completed in one hour. An approximate
95% confidence interval for θ of the form θ̂ ± 0.02 requires n = 2401. The cost of the random
sample of size n = 2401 is
(2401/3) × $20 ≈ $16,000
as compared to $166,667 for the census, more than ten times the cost of the random
sample!
Of course, we have also not compared the costs of processing 50,000 versus 2401 surveys
but it is obvious again that the random sample will be less costly and time consuming.
The χ² (Chi-squared) Distribution
To define the Chi-squared distribution we first recall the Gamma function and its properties:
Γ(α) = ∫₀^∞ y^(α−1) e^(−y) dy for α > 0.
k is referred to as the “degrees of freedom” (d.f.) parameter. In Figure 4.3 you see the
characteristic shapes of the Chi-squared probability density functions. For k = 2, the
probability density function is the Exponential(2) probability density function. For k > 2,
the probability density function is unimodal with maximum value at x = k − 2. For values
of k > 30, the probability density function resembles that of a N(k, 2k) probability density
function.
Figure 4.3: Chi-squared probability density functions for df = 1, 2, 4, and 8
The cumulative distribution function, F(x; k), can be given in closed algebraic form for
even values of k. In R the functions dchisq(x, k) and pchisq(x, k) give the probability density
function f(x; k) and cumulative distribution function F(x; k) for the χ²(k) distribution.
A table with selected values is given at the end of these course notes.
If X ~ χ²(k) then
E(X) = k and Var(X) = 2k.
This result follows by first showing that
E(X^j) = 2^j Γ(k/2 + j) / Γ(k/2) for j = 1, 2, ....
This is true since
E(X^j) = ∫₀^∞ x^j [1/(2^(k/2) Γ(k/2))] x^(k/2 − 1) e^(−x/2) dx
       = [1/(2^(k/2) Γ(k/2))] ∫₀^∞ x^(k/2 + j − 1) e^(−x/2) dx     (let y = x/2 or x = 2y)
       = [1/(2^(k/2) Γ(k/2))] ∫₀^∞ (2y)^(k/2 + j − 1) e^(−y) 2 dy
       = [2^j / Γ(k/2)] ∫₀^∞ y^(k/2 + j − 1) e^(−y) dy
       = 2^j Γ(k/2 + j) / Γ(k/2).
Letting j = 1 we obtain E(X) = 2 Γ(k/2 + 1)/Γ(k/2) = 2(k/2) = k. Letting j = 2 we obtain
E(X²) = 2² Γ(k/2 + 2)/Γ(k/2) = 4(k/2 + 1)(k/2) = k² + 2k, and therefore
Var(X) = E(X²) − [E(X)]² = 2k.
Proof. Suppose W = Z² where Z ~ G(0, 1). Let Φ represent the cumulative distribution
function of a G(0, 1) random variable and let φ represent the probability density function of
a G(0, 1) random variable. Then
P(W ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w) for w > 0,
and so the probability density function of W is
(d/dw)[Φ(√w) − Φ(−√w)] = [φ(√w) + φ(−√w)] (1/2) w^(−1/2)
                        = (1/√(2π)) w^(−1/2) e^(−w/2) for w > 0,
which is the χ²(1) probability density function.
Proof. Since Zi ~ G(0, 1), then by Theorem 30, Zi² ~ χ²(1) and the result follows by
Theorem 29.
Student’s t Distribution
Student’s t distribution (or more simply the t distribution) has probability density function
f(t; k) = ck (1 + t²/k)^(−(k+1)/2) for t ∈ ℝ and k = 1, 2, ...
The parameter k is called the degrees of freedom. We write T ~ t(k) to indicate that
the random variable T has a Student t distribution with k degrees of freedom. In Figure
4.4 the probability density function f(t; k) for k = 2 is plotted together with the G(0, 1)
probability density function.
Obviously the t probability density function is similar to that of the G (0; 1) distribution
in several respects: it is symmetric about the origin, it is unimodal, and indeed for large
values of k, the graph of the probability density function f (t; k) is indistinguishable from
that of the G(0, 1) probability density function. The primary difference, for small k such
as the one plotted, is in the tails of the distribution. The t probability density function has
Figure 4.4: Probability density functions for t (2) distribution (dashed red ) and
G (0; 1) distribution (solid blue)
fatter “tails” or more area in the extreme left and right tails. Problem 22 at the end of this
chapter considers some properties of f (x; k).
Probabilities for the t distribution are available from tables at the end of these notes26 or
computer software. In R, the cumulative distribution function F(t; k) = P(T ≤ t; k), where
T ~ t(k), is obtained using pt(t,k). For example, pt(1.5,10) gives P(T ≤ 1.5; 10) = 0.918.
The t distribution arises as a result of the following theorem involving the ratio of a
N (0; 1) random variable and an independent Chi-squared random variable. We will not
attempt to prove this theorem here.
Define the random variable
Λ(θ) = −2 log[L(θ)/L(θ̃)],
where θ̃ is the maximum likelihood estimator. The random variable Λ(θ) is called the
likelihood ratio statistic. The following theorem implies that Λ(θ) is an asymptotic pivotal
quantity.
This theorem means that Λ(θ) can be used as a pivotal quantity for sufficiently large n
in order to obtain approximate confidence intervals for θ. More importantly we can use this
result to show that the likelihood intervals discussed in Section 4.3 are also approximate
confidence intervals.
By Theorem 33 the confidence coefficient for this interval can be approximated by
as required.
Conversely, Theorem 33 can also be used to find an approximate 100p% likelihood based
confidence interval.
p = 2P(Z ≤ a) − 1 where Z ~ N(0, 1),
then the likelihood interval {θ : R(θ) ≥ e^(−a²/2)} is an approximate 100p% confidence interval.
Proof. The confidence coefficient corresponding to the likelihood interval {θ : R(θ) ≥ e^(−a²/2)}
is
P[L(θ)/L(θ̃) ≥ e^(−a²/2)] = P(−2 log[L(θ)/L(θ̃)] ≤ a²)
                          ≈ P(W ≤ a²) where W ~ χ²(1), by Theorem 33
                          = 2P(Z ≤ a) − 1 where Z ~ N(0, 1)
                          = p
as required.
Example:
Since
0.95 = 2P(Z ≤ 1.96) − 1 where Z ~ N(0, 1)
and
e^(−(1.96)²/2) = e^(−1.9208) ≈ 0.1465 ≈ 0.15,
a 15% likelihood interval for θ is also an approximate 95% confidence interval for θ.
Exercise: Show that a 26% likelihood interval is an approximate 90% confidence interval
and a 4% likelihood interval is an approximate 99% confidence interval.
Suppose the observed data were n = 100 and y = 40 so that θ̂ = 40/100 = 0.4. From the
graph of the relative likelihood function given in Figure 4.5 we can read off the 15% likelihood
interval, which is [0.31, 0.495]; this is also an approximate 95% confidence interval.
Figure 4.5: Relative likelihood function for Binomial with n = 100 and y = 40
Alternatively, the approximate 95% confidence interval (4.8), θ̂ ± 1.96 √[θ̂(1 − θ̂)/n], gives
the interval [0.304, 0.496]. The two intervals differ slightly (they are both based
on approximations) but are very close.
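A sketch of how both intervals can be computed in R for n = 100 and y = 40: the 15% likelihood interval is found numerically with uniroot applied to R(θ) − 0.15 on either side of θ̂, and the symmetric interval follows from (4.8). (Reading values off the graph, as done above, gives essentially the same answers.)
n <- 100; y <- 40; thetahat <- y / n
R <- function(theta) (theta / thetahat)^y * ((1 - theta) / (1 - thetahat))^(n - y)
c(uniroot(function(t) R(t) - 0.15, c(0.01, thetahat))$root,
  uniroot(function(t) R(t) - 0.15, c(thetahat, 0.99))$root)        # roughly [0.31, 0.50]
thetahat + c(-1.96, 1.96) * sqrt(thetahat * (1 - thetahat) / n)    # [0.304, 0.496]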
Suppose n = 30 and θ̂ = 0.1. From the graph of the relative likelihood function given
in Figure 4.6 we can read off the 15% likelihood interval, which is [0.03, 0.24]; this is also
an approximate 95% confidence interval.
Figure 4.6: Relative likelihood function for Binomial with n = 30 and θ̂ = 0.1
Alternatively, the interval (4.8) is 0.1 ± 1.96 √[0.1(0.9)/30], which gives the interval
[−0.0074, 0.2074]. This is quite different from the likelihood based
approximate confidence interval and also contains negative values for θ. Of course θ
can only take on values between 0 and 1. This happens because the confidence interval in
(4.8) is always symmetric about θ̂, and if θ̂ is close to 0 or 1 and n is not very large then
the interval can contain values less than 0 or bigger than 1. The likelihood
interval in Figure 4.6 is not symmetric about θ̂. In this case the 15% likelihood interval is a
better summary of the values which are supported by the data. If θ̂ is close to 0.5 or n is
large then the likelihood interval will be fairly symmetric and there will be little difference
in the two approximate confidence intervals, as we saw in the previous example in which n
was equal to 100 and θ̂ was equal to 0.4.
which differs from σ̃² only by the choice of denominator. Indeed if n is large there is very
little difference between S² and σ̃². Note that the sample variance has the advantage that
it is an “unbiased” estimator, that is, E(S²) = σ². This follows since
E[(Yi − µ)²] = Var(Yi) = σ², E[(Ȳ − µ)²] = Var(Ȳ) = σ²/n
and
E(S²) = [1/(n − 1)] E[Σ_{i=1}^{n} (Yi − Ȳ)²]
      = [1/(n − 1)] E[Σ_{i=1}^{n} (Yi − µ)² − n(Ȳ − µ)²]
      = [1/(n − 1)] [Σ_{i=1}^{n} E(Yi − µ)² − n E(Ȳ − µ)²]
      = [1/(n − 1)] [nσ² − n(σ²/n)] = [1/(n − 1)] (n − 1)σ²
      = σ².
T = (Ȳ − µ)/(S/√n)    (4.10)
Since S, unlike σ, is a random variable in (4.10), the distribution of T is no longer G(0, 1).
The random variable T actually has a t distribution, which was introduced in Section 4.5:
T = (Ȳ − µ)/(S/√n) ~ t(n − 1).    (4.11)
To see why, let Z = √n(Ȳ − µ)/σ, which has a G(0, 1) distribution, and
U = (n − 1)S²/σ².
We choose this function of S² since it can be shown that U ~ χ²(n − 1). It can also be
shown that Z and U are independent random variables27. By Theorem 36 with k = n − 1,
we have
Z/√(U/k) = [√n(Ȳ − µ)/σ] / √(S²/σ²) = (Ȳ − µ)/(S/√n) ~ t(n − 1).
In other words if we replace σ in the pivotal quantity (4.9) by its estimator S, the distribution
of the resulting pivotal quantity has a t(n − 1) distribution rather than a G(0, 1)
distribution. The degrees of freedom are inherited from the degrees of freedom of the Chi-squared
random variable U or from S².
27
The proof of the remarkable result that, for a random sample from a Normal distribution, the sample
mean and the sample variance are independent random variables, is beyond the scope of this course.
We now show how to use the t distribution to obtain a confidence interval for µ when
σ is unknown. Since (4.11) has a t distribution with n − 1 degrees of freedom, which is
a completely known distribution, we can use this pivotal quantity to construct a 100p%
confidence interval for µ. Since the t distribution is symmetric, we determine the constant
a such that P(−a ≤ T ≤ a) = p using the t tables provided in these course notes or R.
Note that, due to symmetry, P(−a ≤ T ≤ a) = p is equivalent to P(T ≤ a) = (1 + p)/2
(you should verify this) and since the t tables tabulate the cumulative distribution function
P(T ≤ t), it is easier to find a such that P(T ≤ a) = (1 + p)/2. Then since
p = P(−a ≤ T ≤ a)
  = P(−a ≤ (Ȳ − µ)/(S/√n) ≤ a)
  = P(Ȳ − aS/√n ≤ µ ≤ Ȳ + aS/√n),
the interval [ȳ − as/√n, ȳ + as/√n] is a 100p% confidence interval for µ.
(Note that if we attempted to use (4.9) to build a confidence interval we would have two
unknowns in the inequality since both µ and σ are unknown.) As usual the method used
to construct this interval implies that 100p% of the confidence intervals constructed from
samples drawn from this population contain the true value of µ.
We note that this interval is of the form ȳ ± as/√n.
Recall that a confidence interval for µ in the case of a G(µ, σ) population when σ is known
has a similar form, ȳ ± aσ/√n,
except that the standard deviation of the estimator is known in this case and the value of
a is taken from a G(0, 1) distribution rather than the t distribution.
103, 115, 97, 101, 100, 108, 111, 91, 119, 101
with Σ_{i=1}^{10} yi = 1046 and Σ_{i=1}^{10} yi² = 110072.
We wish to use these data to estimate the parameter µ which represents the mean test
score for ten year old children at this school. Since P(T ≤ 2.262) = 0.975 for T ~ t(9),
a 95% confidence interval for µ is ȳ ± 2.262 s/√10.
For the given data ȳ = 104.6 and s = 8.57, so the confidence interval is 104.6 ± 6.13 or
[98.47, 110.73].
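The same interval can be obtained directly from the ten observations in R:
y <- c(103, 115, 97, 101, 100, 108, 111, 91, 119, 101)
n <- length(y)
a <- qt(0.975, n - 1)                        # 2.262
mean(y) + c(-a, a) * sd(y) / sqrt(n)         # approximately [98.47, 110.73]
# equivalently, t.test(y)$conf.int gives the same 95% confidence interval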
Behaviour as n → ∞
As n increases, confidence intervals behave in a largely predictable fashion. First, the
estimated standard deviation s gets closer to the true standard deviation σ. Second, as
the degrees of freedom increase, the t distribution approaches the Gaussian so that the
quantiles of the t distribution approach those of the G(0, 1) distribution. For example, if
in Example 4.7.1 we knew that σ = 8.57 then we would use the 95% confidence interval
ȳ ± 1.96(8.57)/√n instead of ȳ ± 2.262(8.57)/√n with n = 10. In general for large n, the
width of the confidence interval gets narrower as n increases (but at the rate 1/√n) so the
confidence intervals shrink to include only the point ȳ.
We would usually choose n a little larger than this formula gives to accommodate the fact
that we used Normal quantiles rather than the quantiles of the t distribution which are
larger in value.
While we will not prove this result, we should at least try to explain the puzzling number
of degrees of freedom n − 1, which, at first glance, seems wrong since Σ_{i=1}^{n} (Yi − Ȳ)² is the
sum of n squared Normal random variables. Does this contradict Corollary 31? It is true
that each Wi = Yi − Ȳ is a Normally distributed random variable. However Wi does not
have a N(0, 1) distribution and, more importantly, the Wi's are not independent! (See
Problem 17.) It is easy to see that W1, W2, ..., Wn are not independent random variables
since Σ_{i=1}^{n} Wi = 0 implies Wn = −Σ_{i=1}^{n−1} Wi, so the last term can be determined using the sum
of the first n − 1 terms. Therefore in the sum Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} Wi² there are really only
n − 1 terms that are linearly independent or “free”. This is an intuitive explanation for
the n − 1 degrees of freedom both of the Chi-squared and of the t distribution. In both
cases, the degrees of freedom are inherited from S² and are related to the dimension of the
subspace inhabited by the terms in the sum for S², that is, Wi = Yi − Ȳ, i = 1, ..., n.
We will now show how we can use Theorem 37 to construct a 100p% confidence interval for the parameter σ² or σ. First note that (4.13) is a pivotal quantity since its distribution is completely known. Using Chi-squared tables or R we can find constants a and b such that

P(a ≤ U ≤ b) = p

where U ~ χ²(n − 1). Since

p = P(a ≤ U ≤ b)
  = P(a ≤ (n − 1)S²/σ² ≤ b)
  = P((n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a)
  = P(√[(n − 1)S²/b] ≤ σ ≤ √[(n − 1)S²/a])

a 100p% confidence interval for σ is

[√((n − 1)s²/b), √((n − 1)s²/a)].   (4.15)

As usual the choice for a, b is not unique. For convenience, a and b are usually chosen such that

P(U ≤ a) = P(U > b) = (1 − p)/2   (4.16)

where U ~ χ²(n − 1). Note that since the Chi-squared tables provided in these course notes tabulate the cumulative distribution function P(U ≤ u), this means using the tables to find a and b such that

P(U ≤ a) = (1 − p)/2  and  P(U ≤ b) = p + (1 − p)/2 = (1 + p)/2.
The intervals (4.14) and (4.15) are called equal-tailed confidence intervals. The choice (4.16) for a, b does not give the narrowest confidence interval. The narrowest interval must be found numerically. For large n the equal-tailed interval and the narrowest interval are nearly the same.
Note that, unlike confidence intervals for μ, the confidence interval for σ² is not symmetric about s², the estimate of σ². This happens of course because the χ²(n − 1) distribution is not a symmetric distribution.
In some applications we are interested in an upper bound on σ (because small σ is "good" in some sense). In this case we take b = ∞ and find a such that P(a ≤ U) = p or P(U ≤ a) = 1 − p, so that a one-sided 100p% confidence interval for σ is [0, √((n − 1)s²/a)].
so a = 5.63 and b = 26.12. Substituting these values along with (14)s² = 0.002347 into (4.15) we obtain

[√(0.002347/26.12), √(0.002347/5.63)] = [0.0095, 0.0204]
and the value 0.02 is not in the interval. Why are the intervals different? Both cover the true value of the parameter σ for 95% of all samples so they have the same confidence coefficient. However the one-sided interval, since it allows smaller (as small as zero) values on the left end of the interval, can achieve the same coverage with a smaller right end-point. If our primary concern was for values of σ being too large, that is, for an upper bound for the interval, then the one-sided interval is the one that should be used for this purpose.
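A short R sketch of these calculations, assuming (as in the example above) n = 15 observations with (n − 1)s² = 0.002347:

    n <- 15
    ss <- 0.002347                    # (n - 1) * s^2 from the data
    # equal-tailed two-sided 95% interval for sigma, using (4.15) and (4.16)
    a <- qchisq(0.025, df = n - 1)    # 5.63
    b <- qchisq(0.975, df = n - 1)    # 26.12
    c(sqrt(ss / b), sqrt(ss / a))     # [0.0095, 0.0204]
    # one-sided 95% interval [0, upper bound): choose a with P(U <= a) = 1 - p
    a1 <- qchisq(0.05, df = n - 1)
    sqrt(ss / a1)                     # upper bound for sigma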
Ỹ = Y − Ȳ ~ N(0, σ²(1 + 1/n)).

Also

(Y − Ȳ) / (S√(1 + 1/n)) ~ t(n − 1)
is a pivotal quantity which can be used to obtain an interval of values for Y. Let a be the value such that P(−a ≤ T ≤ a) = p or P(T ≤ a) = (1 + p)/2, which is obtained from t tables or by using R. Since

p = P(−a ≤ T ≤ a)
  = P(−a ≤ (Y − Ȳ)/(S√(1 + 1/n)) ≤ a)
  = P(Ȳ − aS√(1 + 1/n) ≤ Y ≤ Ȳ + aS√(1 + 1/n))

therefore

[ȳ − as√(1 + 1/n), ȳ + as√(1 + 1/n)]   (4.17)

is an interval of values for the future observation Y with confidence coefficient p. The interval (4.17) is called a 100p% prediction interval instead of a confidence interval since Y is not a parameter but a random variable. Note that the interval (4.17) is wider than a 100p% confidence interval for the mean μ. This makes sense since μ is an unknown constant with no variability while Y is a random variable with its own variability Var(Y) = σ².
Note that this interval is much wider than a 95% confidence interval for μ, the mean of the population of lens thicknesses produced by this manufacturing process, which is given by

[25.009 − 2.1448(0.013)/√15, 25.009 + 2.1448(0.013)/√15]
  = [25.009 − 0.007, 25.009 + 0.007]
  = [25.002, 25.016].
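A minimal R sketch comparing the two intervals, assuming (as in this example) n = 15 lens thicknesses with ȳ = 25.009 and s = 0.013:

    n <- 15; ybar <- 25.009; s <- 0.013
    a <- qt(0.975, df = n - 1)                                      # 2.1448
    c(ybar - a * s / sqrt(n), ybar + a * s / sqrt(n))               # 95% confidence interval for mu
    c(ybar - a * s * sqrt(1 + 1/n), ybar + a * s * sqrt(1 + 1/n))   # 95% prediction interval for Y

The prediction interval is wider than the confidence interval by a factor of √(n + 1), which is substantial even for moderate n.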
4.8 A Case Study: Testing Reliability of Computer Power Supplies (this section may be omitted)
Table 4.2: Lifetimes (in hours) from an accelerated life test experiment on PC power supplies, by temperature

70°C   60°C   50°C   40°C
2 1 55 78
5 20 139 211
9 40 206 297
10 47 263 556
10 56 347 600
11 58 402 600
64 63 410 600
66 88 563 600
69 92 600 600
70 103 600 600
71 108 600 600
73 125 600 600
75 155 600 600
77 177 600 600
97 209 600 600
103 224 600 600
115 295 600 600
130 298 600 600
131 352 600 600
134 392 600 600
145 441 600 600
181 489 600 600
242 600 600 600
263 600 600 600
283 600 600 600
Notes: Lifetimes are given in ascending order; entries of 600 denote censored observations.
For the censored observations we only know that the lifetime is greater than 600. Since

P(Y ≥ 600; θ) = ∫₆₀₀^∞ (1/θ) e^(−y/θ) dy = e^(−600/θ)

the contribution to the likelihood function of each observation censored at 600 is e^(−600/θ).
L(θ) = [∏ᵢ₌₁²² (1/θ) e^(−yᵢ/θ)] × [∏ᵢ₌₂₃²⁵ e^(−600/θ)] = θ^(−k) e^(−s/θ)

where k = 22 = the number of uncensored observations and s = Σᵢ₌₁²⁵ yᵢ = the sum of all lifetimes and censored times.

Question 1  Show that the maximum likelihood estimate of θ is given by θ̂ = s/k and thus θ̂₄₀ = s/k.
Question 2  Assuming that the Exponential model is correct, the likelihood function for θₜ, t = 40, 50, 60, 70 can be obtained using the method above and is given by

L(θₜ) = θₜ^(−kₜ) e^(−sₜ/θₜ).

Suppose that

θₜ = exp(α + β/(t + 273.2))   (4.18)
where t is the temperature in degrees Celsius and α and β are parameters. Plot the points (log θ̂ₜ, (t + 273.2)⁻¹) for t = 40, 50, 60, 70. If the model is correct, why should these points lie roughly along a straight line? Do they? (See the R sketch below.)

Using the graph give rough point estimates of α and β. Extrapolate the line or use your estimates of α and β to estimate θ₂₀, the mean lifetime at t = 20°C, which is the normal operating temperature.
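The following R sketch shows one way these calculations might be done, assuming every entry of 600 in Table 4.2 is a censored observation (the variable names are ours):

    # lifetimes from Table 4.2; 600 denotes a censored observation
    y70 <- c(2, 5, 9, 10, 10, 11, 64, 66, 69, 70, 71, 73, 75, 77, 97,
             103, 115, 130, 131, 134, 145, 181, 242, 263, 283)
    y60 <- c(1, 20, 40, 47, 56, 58, 63, 88, 92, 103, 108, 125, 155, 177,
             209, 224, 295, 298, 352, 392, 441, 489, rep(600, 3))
    y50 <- c(55, 139, 206, 263, 347, 402, 410, 563, rep(600, 17))
    y40 <- c(78, 211, 297, 556, rep(600, 21))
    # maximum likelihood estimate s/k for Exponential data censored at 600
    thetahat <- function(y) sum(y) / sum(y < 600)
    theta <- sapply(list(y40, y50, y60, y70), thetahat)
    temp <- c(40, 50, 60, 70)
    x <- 1 / (temp + 273.2)
    plot(x, log(theta), xlab = "1/(t + 273.2)", ylab = "log of estimated mean lifetime")
    # rough estimates of alpha (intercept) and beta (slope) from a fitted straight line
    coef(lm(log(theta) ~ x))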
Question 5  Question 4 indicates how to obtain a rough point estimate of

θ₂₀ = exp(α + β/(20 + 273.2)).
Suppose we wanted to find the maximum likelihood estimate of θ₂₀. This would require the maximum likelihood estimates of α and β, which requires the joint likelihood function of α and β. Explain why this likelihood is given by

L(α, β) = ∏ₜ θₜ^(−kₜ) e^(−sₜ/θₜ)

where θₜ is given by (4.18). (Note that the product is only over t = 40, 50, 60, 70.) Outline how you might attempt to get an interval estimate for θ₂₀ based on the likelihood function for α and β. If you obtained an interval estimate for θ₂₀, would you have any concerns about indicating to the engineers what mean lifetime could be expected at 20°C? (Explain.)
Question 6  Engineers and statisticians have to design reliability tests like the one just discussed, and considerations such as the following are often used:
Suppose that the mean lifetime at 20°C is supposed to be about 90,000 hours and that at 70°C you know from past experience that it is about 100 hours. If the model (4.18) holds, determine what α and β should be approximately, and thus what θₜ is roughly equal to at 40, 50 and 60°C. How might you use this information in deciding how long a period of time to run the life test? In particular, give the approximate expected number of uncensored lifetimes from an experiment that was terminated after 600 hours.
4.9 Chapter 4 Problems
3. Suppose that a fraction θ of a large population of persons over 18 years of age never drink alcohol. In order to estimate θ, a random sample of n persons is to be selected and the number y who do not drink determined; the maximum likelihood estimate of θ is then θ̂ = y/n. We want our estimate θ̂ to have a high probability of being close to θ, and want to know how large n should be to achieve this. Consider the random variable Y and estimator θ̃ = Y/n.

(a) Determine P(−0.03 ≤ θ̃ − θ ≤ 0.03), if n = 1000 and θ = 0.5, using the Normal approximation to the Binomial. You do not need to use a continuity correction.

(b) If θ = 0.50 determine how large n should be to ensure that
(a) Suppose a 95% confidence interval for μ, the mean time Canadians spent on the internet in this quarter, is reported to be [42.8, 47.8]. How should this interval be interpreted?

(b) Construct an approximate 95% confidence interval for the proportion of Canadians whose mobile phone is a smartphone.

(c) Since this study was conducted in March 2012 the research company has been asked to conduct a new survey to determine if the proportion of Canadians whose mobile phone is a smartphone has changed. What size sample should be used to ensure that an approximate 95% confidence interval has length less than 2(0.02)?
6. Two hundred adults are chosen at random from a population and each adult is asked
whether information about abortions should be included in high school public health
sessions. Suppose that 70% say they should.
(a) Obtain an approximate 95% confidence interval for the proportion of the population who support abortion information being included in high school public health sessions.
(b) Suppose you found out that the 200 persons interviewed consisted of 50 married
couples and 100 other persons. The 50 couples were randomly selected, as were
the other 100 persons. Discuss the validity (or non-validity) of the analysis in
(a).
7. For Chapter 2, Problem 3 (b) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.

8. For Chapter 2, Problem 5 (b) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.

9. For Chapter 2, Problem 6 (b) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of r(θ) or by using the function uniroot in R.
(a) Plot the relative likelihood function R(θ) and determine a 10% likelihood interval. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R. Is θ very accurately determined?

(b) Suppose that we can find out whether each pair of twins is identical or not, and that it is determined that of 50 pairs, 17 were identical. Obtain the likelihood function, the maximum likelihood estimate and a 10% likelihood interval for θ in this case. Plot the relative likelihood function on the same graph as the one in (a), and compare the accuracy of estimation in the two cases.

11. For Chapter 2, Problem 8 (c) determine an approximate 95% confidence interval for θ by using a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.
12. Suppose that a fraction θ of a large population of persons are infected with a certain virus. Let n and k be integers. Suppose that blood samples for nk people are to be tested to obtain information about θ. In order to save time and money, pooled testing is used, that is, samples are mixed together k at a time to give a total of n pooled
samples. A pooled sample will test negative if all k individuals in that sample are not
infected.
(a) Find the probability that y out of n samples will be negative, if the nk people are a random sample from the population. State any assumptions you make.

(b) Obtain a general expression for the maximum likelihood estimate θ̂ in terms of n, k and y.

(c) Suppose n = 100, k = 10 and y = 89. Find the maximum likelihood estimate of θ, and a 10% likelihood interval for θ.
13. A manufacturing process produces fibers of varying lengths. The length of a fiber Y is a continuous random variable with probability density function

f(y; θ) = (y/θ²) e^(−y/θ),  y ≥ 0, θ > 0
(e) Explain how you would use the statement in (c) to construct an approximate 95% confidence interval for θ.
(f) Suppose n = 18 fibers were selected at random and the lengths were:

6.19  7.92  1.23  8.13  4.29  1.04  3.67  9.87  10.34
1.41  10.76  3.69  1.34  6.80  4.21  3.44  2.51  2.08

For these data Σᵢ₌₁¹⁸ yᵢ = 88.92. Give the maximum likelihood estimate of θ and an approximate 95% confidence interval for θ using your result from (e).
14. The lifetime T (in days) of a particular type of light bulb is assumed to have a distribution with probability density function

f(t; θ) = (1/2) θ³ t² e^(−θt)  for t > 0 and θ > 0.
(a) Suppose t₁, t₂, …, tₙ is a random sample from this distribution. Find the maximum likelihood estimate θ̂ and the relative likelihood function R(θ).

(b) If n = 20 and Σᵢ₌₁²⁰ tᵢ = 996, graph R(θ) and determine the 15% likelihood interval for θ, which is also an approximate 95% confidence interval for θ. The interval can be obtained from the graph of R(θ) or by using the function uniroot in R.

(c) Suppose we wish to estimate the mean lifetime of a light bulb. Show E(T) = 3/θ. Hint: Use the Gamma function. Find an approximate 95% confidence interval for the mean.

(d) Show that the probability p that a light bulb lasts less than 50 days is

p = p(θ) = P(T ≤ 50; θ) = 1 − e^(−50θ) (1250θ² + 50θ + 1).
(a) Determine the following using χ² tables provided in the Course Notes:

(i) If X ~ χ²(10) find P(X ≤ 2.6) and P(X > 16).
(ii) If X ~ χ²(4) find P(X > 15).
(iii) If X ~ χ²(40) find P(X ≤ 24.4) and P(X ≥ 55.8). Compare these values with P(Y ≤ 24.4) and P(Y ≥ 55.8) if Y ~ N(40, 80).
(iv) If X ~ χ²(25) find a and b such that P(X ≤ a) = 0.025 and P(X > b) = 0.025.
(v) If X ~ χ²(12) find a and b such that P(X ≤ a) = 0.05 and P(X > b) = 0.05.

(b) Determine the following WITHOUT using χ² tables:

(i) If X ~ χ²(1) find P(X ≤ 2) and P(X > 1.4).
(ii) If X ~ χ²(2) find P(X ≤ 2) and P(X > 3).
(a) Show that this probability density function integrates to one for k = 1; 2; : : :
using the properties of the Gamma function.
(b) Plot the probability density function for k = 5, k = 10 and k = 25 on the same
graph. What do you notice?
(c) Show that the moment generating function of X is given by

M(t) = E(e^(tX)) = (1 − 2t)^(−k/2)  for t < 1/2

and use this to show that E(X) = k and Var(X) = 2k.
(d) Prove Theorem 29 using moment generating functions.
(a) Plot the probability density function for k = 1, 5, 25. Plot the N(0, 1) probability density function on the same graph. What do you notice?

(b) Show that f(t; k) is unimodal.

(c) Use Theorem 32 to show that E(T) = 0. Hint: If X and Y are independent random variables then E[g(X)h(Y)] = E[g(X)] E[h(Y)].

(d) Use the t tables provided in the Course Notes to answer the following:

(i) If T ~ t(10) find P(T ≤ 0.88), P(T ≥ 0.88) and P(|T| ≤ 0.88).
(ii) If T ~ t(17) find P(|T| > 2.90).
(iii) If T ~ t(30) find P(T ≤ 2.04) and P(T ≥ 0.26). Compare these values with P(Z ≤ 2.04) and P(Z ≥ 0.26) if Z ~ N(0, 1).
(iv) If T ~ t(18) find a and b such that P(T ≤ a) = 0.025 and P(T > b) = 0.025.
(v) If T ~ t(13) find a and b such that P(T ≤ a) = 0.05 and P(T > b) = 0.05.
where

cₖ = Γ((k + 1)/2) / [√(kπ) Γ(k/2)].

Show that

lim_{k→∞} f(t; k) = (1/√(2π)) exp(−t²/2)  for t ∈ ℝ

which is the probability density function of the G(0, 1) distribution. Hint: You may use the fact that lim_{k→∞} cₖ = 1/√(2π), which is a property of the Gamma function.
20. In an early study concerning survival time for patients diagnosed with Acquired Immune Deficiency Syndrome (AIDS), the survival times (i.e. times between diagnosis of AIDS and death) of 30 male patients were such that Σᵢ₌₁³⁰ yᵢ = 11,400 days.

(a) Assuming that survival times are Exponentially distributed with mean θ days, graph the relative likelihood function for these data and obtain an approximate 90% confidence interval for θ. This interval may be obtained from the graph of the relative likelihood function or by using the function uniroot in R.

(b) Show that m = θ ln 2 is the median survival time. Using the interval obtained in (a), give an approximate 90% confidence interval for m.
21.
This result implies that U is a pivotal quantity which can be used to obtain confidence intervals for θ.
(c) Refer to the data in the previous problem. Using the fact that
22. Company A leased photocopiers to the federal government, but at the end of their
recent contract the government declined to renew the arrangement and decided to
lease from a new vendor, Company B. One of the main reasons for this decision was
a perception that the reliability of Company A’s machines was poor.
(a) Over the preceding year the monthly numbers of failures requiring a service call
from Company A were
12 14 15 16 18 19 19 22 23 25 28 29
Assuming that the number of service calls needed in a one month period has a Poisson distribution with mean θ, obtain and graph the relative likelihood function R(θ) based on the data above.
(b) In the …rst year using Company B’s photocopiers, the monthly numbers of service
calls were
7 8 9 10 10 12 12 13 13 14 15 17
Under the same assumption as in part (a), obtain R(θ) for these data and graph it on the same graph as used in (a).
(c) Determine the 15% likelihood interval for θ, which is also an approximate 95% confidence interval for θ, for each company. The intervals can be obtained from the graphs of the relative likelihood functions or by using the function uniroot in R. Do you think the government's decision was a good one, as far as the reliability of the machines is concerned?

(d) What conditions would need to be satisfied to make the assumptions and analysis in (a) to (c) valid?
(e) If Y₁, …, Yₙ is a random sample from the Poisson(θ) distribution then the random variable

(Ȳ − θ) / √(Ȳ/n)

has approximately a N(0, 1) distribution. Show how this result leads to an approximate 95% confidence interval for θ given by

ȳ ± 1.96 √(ȳ/n).

Using this result, determine the approximate 95% confidence intervals for each company. Compare these intervals with the intervals obtained in (c).
23. A study on the common octopus (Octopus Vulgaris) was conducted by researchers
at the University of Vigo in Vigo, Spain. Nineteen octopi were caught in July 2008
in the Ria de Vigo (a large estuary on the northwestern coast of Spain). Several
measurements were made on each octopus including their weight in grams. These
weights are given in the table below.
680 1030 1340 1330 1260 770 830 1470 1380 1220
920 880 1020 1050 1140 960 1060 1140 860
(a) Use a qqplot to determine how reasonable the Gaussian model is for these data.
(b) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
(c) The researchers at the University of Vigo were interested in determining whether the octopi in the Ria de Vigo are healthy. For common octopi, a population mean weight of 1100 grams is considered to be a healthy population. Determine a 95% confidence interval for μ. What should the researchers conclude about the health of the octopi, in terms of weight, in the Ria de Vigo?

(d) Determine a 90% confidence interval for σ based on these data.
24. Consider the data on weights of adult males and females from Chapter 1. (The data are posted on the course webpage.)

(a) Determine whether it is reasonable to assume a Normal model for the female heights and a different Normal model for the male heights.

(b) Obtain a 95% confidence interval for the mean for the females and males separately. Does there appear to be a difference in the means for females and males? (We will see how to test this formally in Chapter 6.)

(c) Obtain a 95% confidence interval for the standard deviation for the females and males separately. Does there appear to be a difference in the standard deviations?
25. Sixteen packages are randomly selected from the production of a detergent packaging machine. Let yᵢ = weight in grams of the iᵗʰ package, i = 1, …, 16.

(a) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
26. Radon is a colourless, odourless gas that is naturally released by rocks and soils and
may concentrate in highly insulated houses. Because radon is slightly radioactive,
there is some concern that it may be a health hazard. Radon detectors are sold to
homeowners worried about this risk, but the detectors may be inaccurate. Univer-
sity researchers placed 12 detectors in a chamber where they were exposed to 105
picocuries per liter of radon over 3 days. The readings given by the detectors were:
91:9 97:8 111:4 122:3 105:4 95:0 103:8 99:6 96:6 119:3 104:8 101:7
27. A manufacturer wishes to determine the mean breaking strength (force) of a type of string to "within 0.5 kilograms", which we interpret as requiring that the 95% confidence interval for μ should have length at most 1 kilogram. If the breaking strengths Y of strings tested are G(μ, σ) and if 10 preliminary tests gave Σᵢ₌₁¹⁰ (yᵢ − ȳ)² = 45, how many additional measurements would you advise the manufacturer to take?
28. A chemist has two ways of measuring a particular quantity; one has more random error than the other. For method I, measurements X₁, X₂, …, Xₘ follow a Normal distribution with mean μ and variance σ₁², whereas for method II, measurements Y₁, Y₂, …, Yₙ have a Normal distribution with mean μ and variance σ₂².

(a) Assuming that σ₁² and σ₂² are known, find the combined likelihood function for μ based on observed data x₁, x₂, …, xₘ and y₁, y₂, …, yₙ and show that the maximum likelihood estimate of μ is

μ̂ = (w₁x̄ + w₂ȳ) / (w₁ + w₂)

where w₁ = m/σ₁² and w₂ = n/σ₂². Why does this estimate make sense?

(b) Suppose that σ₁ = 1, σ₂ = 0.5 and n = m = 10. How would you rationalize to a non-statistician why you were using the estimate (x̄ + 4ȳ)/5 instead of (x̄ + ȳ)/2?
We denote this by writing Xₙ →ᵖ c.

(a) If {Xₙ} and {Yₙ} are two sequences of random variables with Xₙ →ᵖ c₁ and Yₙ →ᵖ c₂, show that Xₙ + Yₙ →ᵖ c₁ + c₂ and XₙYₙ →ᵖ c₁c₂.

(b) Let X₁, X₂, … be independent and identically distributed random variables with probability density function f(x; θ). A point estimator θ̃ₙ based on a random sample X₁, …, Xₙ is said to be consistent for θ if θ̃ₙ →ᵖ θ as n → ∞.

(i) Let X₁, …, Xₙ be independent and identically distributed Uniform(0, θ) random variables. Show that θ̃ₙ = max(X₁, …, Xₙ) is consistent for θ.

(ii) Let X ~ Binomial(n, θ). Show that θ̃ₙ = X/n is consistent for θ.
31. Challenge Problem: Refer to the definition of consistency in Problem 27(b). Difficulties can arise when the number of parameters increases with the amount of data. Suppose that two independent measurements of blood sugar are taken on each of n individuals and consider the model

Xᵢ₁, Xᵢ₂ ~ N(μᵢ, σ²)  for i = 1, …, n

where Xᵢ₁ and Xᵢ₂ are the independent measurements. The variance σ² is to be estimated, but the μᵢ's are also unknown.

(a) Find the maximum likelihood estimator σ̃² and show that it is not consistent.

(b) Suggest an alternative way to estimate σ² by considering the differences Wᵢ = Xᵢ₁ − Xᵢ₂.
(c) What does σ represent physically if the measurements are taken very close together in time?
32. Challenge Problem: Proof of Central Limit Theorem (Special Case) Suppose Y₁, Y₂, … are independent random variables with E(Yᵢ) = μ, Var(Yᵢ) = σ² and that they have the same distribution, whose moment generating function exists.

(a) Show that (Yᵢ − μ)/σ has moment generating function of the form (1 + t²/2 + terms in t³, t⁴, …) and thus that (Yᵢ − μ)/(σ√n) has moment generating function of the form [1 + t²/(2n) + o(n)], where o(n) signifies a remainder term Rₙ with the property that Rₙ/n → 0 as n → ∞.

(b) Let

Zₙ = Σᵢ₌₁ⁿ (Yᵢ − μ)/(σ√n) = √n (Ȳ − μ)/σ

and note that its moment generating function is of the form [1 + t²/(2n) + o(n)]ⁿ. Show that as n → ∞ this approaches the limit e^(t²/2), which is the moment generating function for G(0, 1). (Hint: For any real number a, (1 + a/n)ⁿ → eᵃ as n → ∞.)
5. TESTS OF HYPOTHESES
5.1 Introduction
What does it mean to test a hypothesis in the light of observed data or information?
Suppose a statement has been formulated such as “I have extrasensory perception.” or
“This drug that I developed reduces pain better than those currently available.” and an
experiment is conducted to determine how credible the statement is in light of observed
data. How do we measure credibility? If there are two alternatives: “I have ESP.” and
“I do not have ESP.” should they both be considered a priori as equally plausible? If I
correctly guess the outcome on 53 of 100 tosses of a fair coin, would you conclude that
my gift is real since I was correct more than 50% of the time? If I develop a treatment
for pain in my basement laboratory using a mixture of seaweed and tofu, would you treat
the claims “this product is superior to aspirin”and “this product is no better than aspirin”
symmetrically?
When studying tests of hypotheses it is helpful to draw an analogy with the criminal
court system used in many places in the world, where the two hypotheses “the defendant is
innocent”and “the defendant is guilty”are not treated symmetrically. In these courts, the
court assumes a priori that the …rst hypothesis, “the defendant is innocent” is true, and
then the prosecution attempts to …nd su¢ cient evidence to show that this hypothesis of
innocence is not plausible. There is no requirement that the defendant be proved innocent.
At the end of the trial the judge or jury may conclude that there was insu¢ cient evidence
for a …nding of guilty and the defendant is then exonerated. Of course there are two types
of errors that this system can (and inevitably does) make; convict an innocent defendant or
fail to convict a guilty defendant. The two hypotheses are usually not given equal weight a
priori because these two errors have very di¤erent consequences.
Statistical tests of hypotheses are analogous to this legal example. We often begin by
specifying a single “default” hypothesis (“the defendant is innocent” in the legal context)
and then check whether the data collected is unlikely under this hypothesis. This default
hypothesis is often referred to as the “null”hypothesis and is denoted by H0 (“null”is used
because it often means a new treatment has no e¤ect). Of course, there is an alternative
(For an introduction to testing hypotheses, see the video called "A Test of Significance" at www.watstat.ca.)
hypothesis, which may not always be speci…ed. In many cases the alternative hypothesis is
simply that H0 is not true.
We will outline the logic of tests of hypotheses in the first example, the claim that I have ESP. In an effort to prove or disprove this claim, an unbiased observer tosses a fair coin 100 times and before each toss I guess the outcome of the toss. We count Y, the number of correct guesses, which we can assume has a Binomial distribution with n = 100. The probability that I guess the outcome correctly on a given toss is an unknown parameter θ. If I have no unusual ESP capacity at all, then we would assume θ = 0.5, whereas if I have some form of ESP, either a positive attraction or an aversion to the correct answer, then we expect θ ≠ 0.5. We begin by asking the following questions in this context:
(1) Which of the two possibilities, θ = 0.5 or θ ≠ 0.5, should be assigned to H₀, the null hypothesis?
(2) What observed values of Y are highly inconsistent with H0 and what observed values
of Y are compatible with H0 ?
(3) What observed values of Y would lead to us to conclude that the data provide no
evidence against H0 and what observed values of Y would lead us to conclude that
the data provide strong evidence against H0 ?
In answer to question (1), hopefully you observed that these two hypotheses, ESP and NO ESP, are not equally credible and decided that the null hypothesis should be H₀: θ = 0.5, or H₀: I do not have ESP.
To answer question (2), we note that observed values of Y that are very small (e.g. 0 to 10) or very large (e.g. 90 to 100) would clearly lead us to believe that H₀ is false, whereas values near 50 are perfectly consistent with H₀. This leads naturally to the concept of a test statistic or discrepancy measure.
Usually we define D so that D = 0 represents the best possible agreement between the data and H₀, and values of D not close to 0 indicate poor agreement. A general method for constructing test statistics will be described in Section 5.3, but in this example, it seems natural to use D(Y) = |Y − 50|.
Question (3) could be resolved easily if we could specify a threshold value for D, or equivalently some function of D. In the given example, the observed value of Y was y = 52 and so the observed value of D is d = |52 − 50| = 2. One might ask what is the probability, when H₀ is true, that the discrepancy measure results in a value less than d. Equivalently, what is the probability, assuming H₀ is true, that the discrepancy measure is greater than or equal to d? In other words we want to determine P(D ≥ d; H₀) where the notation
"; H₀" means "assuming that H₀ is true". We can compute this easily in our given example. If H₀ is true then Y ~ Binomial(100, 0.5) and

P(D ≥ 2; H₀) = P(|Y − 50| ≥ 2) = 1 − P(49 ≤ Y ≤ 51) ≈ 0.76.

How can we interpret this value in terms of the test of H₀? Roughly 76% of claimants similarly tested for ESP, who have no abilities at all but simply randomly guess, will perform as well or better (that is, result in at least as large a value of D as the observed value of 2) than I did. This does not prove I do not have ESP, but it does indicate we have failed to find any evidence in these data to support rejecting H₀. There is no evidence against H₀ in the observed value d = 2, and this was indicated by the high probability that, when H₀ is true, we obtain at least this much measured disagreement with H₀.
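A one-line R check of this calculation (a sketch; the exact value is approximately 0.76):

    # P(|Y - 50| >= 2) when Y ~ Binomial(100, 0.5)
    1 - sum(dbinom(49:51, size = 100, prob = 0.5))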
We now proceed to a more formal treatment of hypothesis tests. We will concentrate on two types of hypotheses:

(1) the hypothesis H₀: θ = θ₀ where it is assumed that the data Y have arisen from a family of distributions with probability (density) function f(y; θ) with parameter θ;

(2) the hypothesis H₀: Y ~ f₀(y) where it is assumed that the data Y have a specified probability (density) function f₀(y).

Definition 39  Suppose we use the test statistic D = D(Y) to test the hypothesis H₀. Suppose also that d = D(y) is the observed value of D. The p-value or observed significance level of the test of hypothesis H₀ using test statistic D is

p-value = P(D ≥ d; H₀).
Remarks:

(1) Note that the p-value is defined as P(D ≥ d; H₀) and not P(D = d; H₀) even though the event that has been observed is D = d. If D is a continuous random variable then P(D = d; H₀) is always equal to zero, which is not very useful. If D is a discrete random variable with many possible values then P(D = d; H₀) will be small, which is also not very useful. Therefore, to determine how unusual the observed result is, we compare it to all the other results which are as unusual or more unusual than what has been observed.

(2) The p-value is NOT the probability that H₀ is true. This is a common misinterpretation.

The following table gives a rough guideline for interpreting p-values. These are only guidelines for this course. The interpretation of p-values must always be made in the context of a given study.
which can be calculated using R or using the Normal approximation to the Binomial since n = 200 is large. Using the Normal approximation (without a continuity correction since it is not essential to have an exact value) we obtain a p-value of approximately 0.16.
p-value = P(D ≥ 14; H₀)
        = P(Y ≥ 44)  where Y ~ Binomial(180, 1/6)
        = Σ_{y=44}^{180} C(180, y) (1/6)^y (5/6)^(180−y)
        = 0.005
which provides strong evidence against H₀, and suggests that θ is bigger than 1/6. This is an example of a one-sided test, which is described in more detail below.
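In R the exact p-value can be computed directly (a sketch):

    # P(Y >= 44) when Y ~ Binomial(180, 1/6); compare with the value 0.005 above
    1 - pbinom(43, size = 180, prob = 1/6)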
= 0.18

and this probability is not especially small. Indeed almost one die in five, though fair, would show this level of discrepancy with H₀. We conclude that there is no evidence against H₀ in light of the observed data.
Note that we do not claim that H₀ is true, only that there is no evidence in light of the data that it is not true. Similarly in the legal example, if we do not find evidence against H₀: "the defendant is innocent", this does not mean we have proven he or she is innocent, only that, for the given data, the amount of evidence against H₀ was insufficient to conclude otherwise.
The approach to testing a hypothesis described above is very general and straightforward, but a few points should be stressed:

1. If the p-value is very small then, as indicated in the table, there is strong evidence against H₀ in light of the observed data; this is often termed "statistically significant" evidence against H₀. While we believe that statistical evidence is best measured when we interpret p-values as in the above table, it is common in some of the literature to adopt a threshold value for the p-value such as 0.05 and "reject H₀" whenever the p-value is below this threshold. This may be necessary when there are only two options for your decision. For example in a trial, a person is either convicted or acquitted of a crime.

2. If the p-value is not small, we do not conclude that H₀ is true. We simply say there is no evidence against H₀ in light of the observed data. The reason for this "hedging" is that in most settings a hypothesis may never be strictly "true". (For example, one might argue when testing H₀: θ = 1/6 in Example 5.1.2 that no real die ever has a probability of exactly 1/6 for side 1.) Hypotheses can be "disproved" (with a small degree of possible error) but not proved. Again, if we are limited to two possible decisions, if you fail to "reject H₀" in the language above, you may say that "H₀ is accepted" when the p-value is larger than the predetermined threshold. This does not mean that we have determined that H₀ is true, but that there is insufficient evidence on hand to reject it.
(If the untimely demise of all of the prosecution witnesses at your trial leads to your acquittal, does this prove your innocence?)
4. So far we have not refined the conclusion when we do find strong evidence against the null hypothesis. Often we have in mind an "alternative" hypothesis. For example, if the standard treatment for pain provides relief in about 50% of cases, and we test, for patients medicated with an alternative treatment, H₀: P(relief) = 0.5, we will obviously wish to know, if we find strong evidence against H₀, in what direction that evidence lies. If the probability of relief is greater than 0.5 we might consider further tests or adopting the drug, but if it is less, then the drug will be abandoned for this purpose. We will try to adapt to this type of problem with our choice of discrepancy measure D.
A drawback with the approach to testing described so far is that we do not have a general method for choosing the test statistic or discrepancy measure D. Often there are "intuitively obvious" test statistics that can be used; this was the case in the examples in this section. In Section 5.3 we will see how to use the likelihood function to construct a test statistic in more complicated situations where it is not always easy to come up with an intuitive test statistic.
A final point is that once we have specified a test statistic D, we need to be able to compute the p-value for the observed data. Calculating probabilities involving D brings us back to distribution theory. In most cases the exact p-value is difficult to determine mathematically, and we must use either an approximation or computer simulation. Fortunately, for the tests considered in Section 5.3 we can use an approximation based on the χ² distribution.
For the Gaussian model with unknown mean and standard deviation we use test statistics based on the pivotal quantities that were used in Chapter 4 for constructing confidence intervals.
μ̃ = Ȳ = (1/n) Σᵢ₌₁ⁿ Yᵢ  and  σ̃² = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

to estimate μ and σ², respectively.
Recall from Chapter 4 that

T = (Ȳ − μ) / (S/√n) ~ t(n − 1).

We use this pivotal quantity to construct a test of hypothesis for the parameter μ when the standard deviation σ is unknown.
To test H₀: μ = μ₀ we use the test statistic

D = |Ȳ − μ₀| / (S/√n).   (5.1)

Let

d = |ȳ − μ₀| / (s/√n)   (5.2)

be the value of D observed in a sample with mean ȳ and standard deviation s. Then

p-value = P(D ≥ d; H₀ is true)
        = P(|T| ≥ d) = 1 − P(−d ≤ T ≤ d)
        = 2[1 − P(T ≤ d)]  where T ~ t(n − 1).   (5.3)
new treatment follow a G(μ, σ) distribution and that the new treatment can either have no effect, represented by μ = μ₀, or a beneficial effect, represented by μ > μ₀. In this example the null hypothesis is H₀: μ = μ₀ and the alternative hypothesis is H_A: μ > μ₀. To test H₀: μ = μ₀ using this alternative we could use the test statistic

D = (Ȳ − μ₀) / (S/√n)

so that large values of D provide evidence against H₀ in the direction of the alternative μ > μ₀. Under H₀: μ = μ₀ the test statistic D has a t(n − 1) distribution. Let the observed value be

d = (ȳ − μ₀) / (s/√n).

Then

p-value = P(D ≥ d; H₀ is true)
        = P(T ≥ d)
        = 1 − P(T ≤ d)  where T ~ t(n − 1).
In Example 5.1.2, the hypothesis of interest was H₀: θ = 1/6 where θ was the probability that the upturned face was a one. If the alternative of interest is that θ is not equal to 1/6 then the alternative hypothesis is H_A: θ ≠ 1/6 and the test statistic D = |Y − n/6| is a good choice. If the alternative of interest is that θ is bigger than 1/6 then the alternative hypothesis is H_A: θ > 1/6 and the test statistic D = max[(Y − n/6), 0] is a better choice.
A: 1:026 0:998 1:017 1:045 0:978 1:004 1:018 0:965 1:010 1:000
B: 1:011 0:966 0:965 0:999 0:988 0:987 0:956 0:969 0:980 0:988
Let Y represent a single measurement on one of the scales, and let μ represent the average measurement E(Y) in repeated weighings of a single 1 kg weight. If an experiment involving n weighings is conducted then a test of H₀: μ = 1 can be based on the test statistic (5.1) with observed value (5.2) and μ₀ = 1.
The samples from scales A and B above give us
p-value = P(D ≥ 0.839; μ = 1)
        = P(|T| ≥ 0.839)  where T ~ t(9)
        = 2[1 − P(T ≤ 0.839)]
        = 2(1 − 0.7884)
        ≈ 0.42

where the probability is obtained using R. Alternatively, if we use the t tables provided in these notes we obtain P(T ≤ 0.5435) = 0.7 and P(T ≤ 0.8834) = 0.8 so 0.4 < p-value < 0.6.

In either case we have that the p-value > 0.1 and thus there is no evidence of bias, that is, there is no evidence against H₀: μ = 1 for scale A based on the observed data.
For scale B, however, we obtain

p-value = P(D ≥ 3.534; μ = 1)
        = P(|T| ≥ 3.534)  where T ~ t(9)
        = 2[1 − P(T ≤ 3.534)]
        = 0.0064

where the probability is obtained using R. Alternatively, if we use the t tables we obtain P(T ≤ 3.2498) = 0.995 and P(T ≤ 4.2968) = 0.999 so 0.002 < p-value < 0.01.

In either case we have that the p-value < 0.01 and thus there is strong evidence against H₀: μ = 1. The observed data suggest strongly that scale B is biased.
Finally, note that although there is strong evidence against H₀ for scale B, the degree of bias in its measurements is not necessarily large enough to be of practical concern. In fact, we can obtain a 95% confidence interval for μ for scale B by using the pivotal quantity

T = (Ȳ − μ) / (S/√10) ~ t(9).

From the t tables we have P(T ≤ 2.2622) = 0.975 and a 95% confidence interval for μ is given by

ȳ ± 2.2622 s/√10 = 0.981 ± 0.012  or  [0.969, 0.993].

Evidently scale B consistently understates the weight but the bias in measuring the 1 kg weight is likely fairly small (about 1% to 3%).
Remark: The function t.test in R will give confidence intervals and test hypotheses about μ; for a data set y use t.test(y).
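For instance, for the scale data of Example 5.2.1 (a sketch using the readings listed above):

    A <- c(1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000)
    B <- c(1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988)
    t.test(A, mu = 1)   # p-value about 0.42; no evidence against H0: mu = 1
    t.test(B, mu = 1)   # p-value about 0.006 and 95% confidence interval [0.969, 0.993]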
if and only if P(|T| ≥ |ȳ − μ₀|/(s/√n)) ≥ 0.05  where T ~ t(n − 1)
if and only if P(|T| ≤ |ȳ − μ₀|/(s/√n)) ≤ 0.95
if and only if |ȳ − μ₀|/(s/√n) ≤ a  where P(|T| ≤ a) = 0.95
if and only if μ₀ ∈ [ȳ − as/√n, ȳ + as/√n]

which is a 95% confidence interval for μ. In other words, the p-value for testing H₀: μ = μ₀ is greater than or equal to 0.05 if and only if the value μ = μ₀ is inside a 95% confidence interval for μ (assuming we use the same pivotal quantity).
More generally, suppose we have data y, a model f(y; θ), and we use the same pivotal quantity to construct a confidence interval for θ and a test of the hypothesis H₀: θ = θ₀. Then the parameter value θ = θ₀ is inside a 100q% confidence interval for θ if and only if the p-value for testing H₀: θ = θ₀ is greater than 1 − q.
to construct confidence intervals for the parameter σ. We may also wish to test a hypothesis such as H₀: σ = σ₀. One approach is to use a likelihood ratio test statistic, which is described in the next section. Alternatively we could use the test statistic

U = (n − 1)S² / σ₀²
for testing H₀: σ = σ₀. Large values of U and small values of U provide evidence against H₀. (Why is this?) Now U has a Chi-squared distribution when H₀ is true and the Chi-squared distribution is not symmetric, which makes the determination of "large" and "small" values somewhat problematic. The following simpler calculation approximates the p-value:

1. Let u = (n − 1)s²/σ₀² denote the observed value of U from the data.

2. If P(U ≤ u) ≤ 1/2, compute p-value = 2P(U ≤ u) where U ~ χ²(n − 1).

3. If P(U ≤ u) > 1/2, compute p-value = 2P(U ≥ u) where U ~ χ²(n − 1).

Figure 5.1 shows a picture for a large observed value of u. In this case P(U ≤ u) > 1/2 and the p-value = 2P(U ≥ u).
[Figure 5.1: probability density function of U ~ χ²(n − 1), showing the areas P(U < u) and P(U > u) for a large observed value u.]
Example 5.2.2
For the manufacturing process in Example 4.7.2, test the hypothesis H₀: σ = 0.008 (0.008 is the desired or target value of σ the manufacturer would like to achieve). Note that since the value σ = 0.008 is outside the two-sided 95% confidence interval for σ in Example 4.5.2, the p-value for a test of H₀ based on the test statistic U = (n − 1)S²/σ₀² will be less than 0.05. To find the p-value, we follow the procedure above:
1. u = (n − 1)s²/σ₀² = (14)s²/(0.008)² = 0.002347/(0.008)² = 36.67

2. The p-value is

p-value = 2P(U ≥ u) = 2P(U ≥ 36.67) = 0.0017  where U ~ χ²(14)

where the probability is obtained using R. Alternatively, if we use the Chi-squared tables provided in these notes we obtain P(U ≤ 31.319) = 0.995 so p-value < 2(1 − 0.995) = 0.01.

In either case we have that the p-value < 0.01 and thus there is strong evidence based on the observed data against H₀: σ = 0.008. Since the observed value of s = √(0.002347/14) = 0.0129 is greater than 0.008, the data suggest that σ is bigger than 0.008.
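A short R sketch of this calculation, using the summary values from the example:

    n <- 15
    ss <- 0.002347                 # (n - 1) * s^2 from the data
    u <- ss / 0.008^2              # observed value of U, about 36.67
    p <- pchisq(u, df = n - 1)     # P(U <= u), close to 1 here
    2 * min(p, 1 - p)              # p-value, about 0.0017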
R(θ₀) = L(θ₀) / L(θ̂)
where θ̂ is the maximum likelihood estimate of θ based on the observed data. The approximate p-value is then

p-value ≈ P(W ≥ λ(θ₀))  where W ~ χ²(1) and λ(θ₀) = −2 log R(θ₀).
Let us summarize the construction of a test from the likelihood function. Let the random variable (or vector of random variables) Y represent data generated from a distribution with probability function or probability density function f(y; θ) which depends on the scalar parameter θ. Let Ω be the parameter space (set of possible values) for θ. Consider a hypothesis of the form

H₀: θ = θ₀

where θ₀ is a single point (hence of dimension 0). We can test H₀ using as our test statistic the likelihood ratio test statistic Λ, defined by (5.5). Then large observed values of

(Recall that L(θ) = L(θ; y) is a function of the observed data y and therefore replacing y by the corresponding random variable Y means that L(θ; Y) is a random variable. Therefore the random variable L(θ₀)/L(θ̃) = L(θ₀; Y)/L(θ̃; Y) is a function of Y in several places including θ̃ = g(Y).)
[Figure: the relative likelihood function R(θ) (top panel) and −2 log R(θ) (bottom panel), indicating more plausible and less plausible values of θ.]
For the Binomial model the likelihood ratio test statistic for H₀: θ = θ₀ takes the observed value

λ(θ₀) = 2n[ θ̂ log(θ̂/θ₀) + (1 − θ̂) log((1 − θ̂)/(1 − θ₀)) ]

where θ̂ = y/n. If θ̂ and θ₀ are equal then λ(θ₀) = 0. If θ̂ is either much larger or much smaller than θ₀, then λ(θ₀) will be large in value.
Suppose we use the likelihood ratio test statistic to test H₀: θ = 0.5 for the ESP example and the data in Example 5.1.1, which were n = 200 and y = 110 so that θ̂ = 0.55. The observed value of the likelihood ratio statistic for testing H₀: θ = 0.5 is

λ(0.5) = 2(200)[ (0.55) log(0.55/0.5) + (1 − 0.55) log((1 − 0.55)/(1 − 0.5)) ] = 2.003

and the approximate p-value is

p-value ≈ P(W ≥ 2.003) = 0.157  where W ~ χ²(1)

and there is no evidence against H₀: θ = 0.5 based on the data. Note that the test statistic D = |Y − 100| used in Example 5.1.1 and the likelihood ratio test statistic λ(0.5) give nearly identical results. This is because n = 200 is large.
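In R (a sketch of this calculation):

    n <- 200; y <- 110; theta0 <- 0.5
    thetahat <- y / n
    lambda <- 2 * (y * log(thetahat / theta0) +
                   (n - y) * log((1 - thetahat) / (1 - theta0)))   # 2.003
    1 - pchisq(lambda, df = 1)      # approximate p-value, about 0.16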
Λ(θ₀) = 2l(θ̃) − 2l(θ₀) = 2l(Ȳ) − 2l(θ₀)
      = 2n[ −log Ȳ − Ȳ/Ȳ + log θ₀ + Ȳ/θ₀ ]
      = 2n[ Ȳ/θ₀ − 1 − log(Ȳ/θ₀) ].

Again we observe that if θ̂ and θ₀ are equal then λ(θ₀) = 0, and if θ̂ is either much larger or much smaller than θ₀, then λ(θ₀) will be large in value.
The variability in lifetimes of light bulbs (in hours, say, of operation before failure) is often well described by an Exponential(θ) distribution where θ = E(Y) > 0 is the average (mean) lifetime. A manufacturer claims that the mean lifetime of a particular brand of bulbs is 2000 hours. We can examine this claim by testing the hypothesis H₀: θ = 2000. Suppose a random sample of n = 50 light bulbs was tested over a long period and that the observed lifetimes were such that Σᵢ₌₁⁵⁰ yᵢ = 93840. For these data the maximum likelihood estimate of θ is θ̂ = ȳ = 93840/50 = 1876.8. To check whether the Exponential model is reasonable for these data we plot the empirical cumulative distribution function for these data and then superimpose the cumulative distribution function for an Exponential(1876.8) random variable. See Figure 5.3.
[Figure 5.3: empirical cumulative distribution function of the lifetimes of the light bulbs with the Exponential(1876.8) cumulative distribution function superimposed.]
Since the agreement between the empirical cumulative distribution function and the Exponential(1876.8) cumulative distribution function is reasonably good, the Exponential model seems reasonable for these data. The p-value is

p-value ≈ P(W ≥ λ(2000))  where W ~ χ²(1) and λ(2000) = 2n[ȳ/2000 − 1 − log(ȳ/2000)]

or more simply p-value ≈ 2[1 − P(Z ≤ √λ(2000))] where Z ~ G(0, 1).
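A sketch of this test in R, using the summary value Σ yᵢ = 93840 with n = 50:

    n <- 50; ybar <- 93840 / 50; theta0 <- 2000
    lambda <- 2 * n * (ybar / theta0 - 1 - log(ybar / theta0))   # about 0.20
    1 - pchisq(lambda, df = 1)   # approximate p-value, about 0.66: no evidence against H0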
L(μ) = exp[ −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² ]  for μ ∈ ℝ.

The log likelihood function is

l(μ) = −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)²  for μ ∈ ℝ.

Using the identity

Σᵢ₌₁ⁿ (yᵢ − μ)² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² + n(ȳ − μ)²

the likelihood ratio statistic is

Λ = 2l(μ̃) − 2l(μ₀)
  = (1/σ²)[ Σᵢ₌₁ⁿ (Yᵢ − μ₀)² − Σᵢ₌₁ⁿ (Yᵢ − μ̃)² ]
  = (1/σ²)[ Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² + n(Ȳ − μ₀)² − Σᵢ₌₁ⁿ (Yᵢ − μ̃)² ]
  = (1/σ²) n(Ȳ − μ₀)²   since μ̃ = Ȳ
  = [ (Ȳ − μ₀) / (σ/√n) ]².   (5.7)
The purpose of writing the likelihood ratio statistic in the form (5.7) is to draw attention to the fact that Λ is the square of the standard Normal random variable (Ȳ − μ₀)/(σ/√n) and therefore has exactly a χ²(1) distribution. Of course it is not clear in general that the likelihood ratio test statistic has an approximate χ²(1) distribution, but in this special case, the distribution of Λ is exactly χ²(1) (not only asymptotically but for all values of n).
H₀: θ ∈ Ω₀

where Ω₀ ⊂ Ω and Ω₀ is of dimension p < k. For example, H₀ might specify particular values for k − p of the components of θ but leave the remaining parameters alone. The dimensions of Ω and Ω₀ refer to the minimum number of parameters (or "coordinates") needed to specify points in them. Again we test H₀ using as our test statistic the likelihood ratio test statistic Λ, defined as follows. Let θ̂ denote the maximum likelihood estimate of θ over Ω so that, as before,

L(θ̂) = max_{θ∈Ω} L(θ).
(You should be able to verify the identity Σᵢ₌₁ⁿ (yᵢ − c)² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² + n(ȳ − c)² for any value of c.)
Similarly we let θ̂₀ denote the maximum likelihood estimate of θ over Ω₀ (i.e. we maximize the likelihood with the parameter constrained to lie in the set Ω₀) so that

L(θ̂₀) = max_{θ∈Ω₀} L(θ).

Then

p-value = P(Λ ≥ λ; H₀) ≈ P(W ≥ λ)   (5.9)

where W ~ χ²(k − p).

The likelihood ratio test covers a great many different types of examples, but we only provide a few here.
H₀: θ_A = θ_B.

Essentially we have data from two Poisson distributions with possibly different parameters. For convenience let (x₁, …, xₙ) denote the observations for Company A's photocopiers, which are assumed to be a random sample from the model

P(X = x; θ_A) = θ_A^x exp(−θ_A) / x!   for x = 0, 1, … and θ_A > 0.

Similarly let (y₁, …, yₘ) denote the observations for Company B's photocopiers, which are assumed to be a random sample from the model

P(Y = y; θ_B) = θ_B^y exp(−θ_B) / y!   for y = 0, 1, … and θ_B > 0

independently of the observations for Company A's photocopiers. In this case the parameter vector is the two-dimensional vector θ = (θ_A, θ_B) and Ω = {(θ_A, θ_B): θ_A > 0, θ_B > 0}. Note that the dimension of Ω is k = 2. Since the null hypothesis specifies that the two parameters θ_A and θ_B are equal but does not otherwise specify their values, we have Ω₀ = {(θ, θ): θ > 0}, which is a space of dimension p = 1.
To construct the likelihood ratio test of H₀: θ_A = θ_B we need the likelihood function for the parameter vector θ = (θ_A, θ_B). We first note that the likelihood function for θ_A only, based on the data (x₁, …, xₙ), is

L₁(θ_A) = ∏ᵢ₌₁ⁿ f(xᵢ; θ_A) = ∏ᵢ₌₁ⁿ θ_A^(xᵢ) exp(−θ_A) / xᵢ!   for θ_A > 0

or more simply

L₁(θ_A) = ∏ᵢ₌₁ⁿ θ_A^(xᵢ) exp(−θ_A)   for θ_A > 0.

Similarly, the likelihood function for θ_B based on the data (y₁, …, yₘ) is

L₂(θ_B) = ∏ⱼ₌₁ᵐ θ_B^(yⱼ) exp(−θ_B)   for θ_B > 0.

Since the data from A and B are independent, the likelihood function for θ = (θ_A, θ_B) is obtained as a product of the individual likelihoods:

L(θ) = L(θ_A, θ_B) = L₁(θ_A) L₂(θ_B) = [∏ᵢ₌₁ⁿ θ_A^(xᵢ) exp(−θ_A)] [∏ⱼ₌₁ᵐ θ_B^(yⱼ) exp(−θ_B)]   for (θ_A, θ_B) ∈ Ω.
The numbers of failures in twelve consecutive months for Company A's and Company B's copiers are given below; there were the same number of copiers from each company in use, so n = m = 12.

Company A: 16 14 25 19 23 12 22 28 19 15 18 29
Company B: 13  7 12  9 15 17 10 13  8 10 12 14
We note that Σᵢ₌₁¹² xᵢ = 240 and Σⱼ₌₁¹² yⱼ = 140. The log likelihood function is

l(θ) = l(θ_A, θ_B) = −12θ_A + 240 log θ_A − 12θ_B + 140 log θ_B   for (θ_A, θ_B) ∈ Ω.   (5.10)

The values of θ_A and θ_B which maximize l(θ_A, θ_B) are obtained by solving the two equations

∂l/∂θ_A = 0  and  ∂l/∂θ_B = 0

which gives two equations in two unknowns:

−12 + 240/θ_A = 0
−12 + 140/θ_B = 0.

The maximum likelihood estimates of θ_A and θ_B (unconstrained) are θ̂_A = 240/12 = 20.0 and θ̂_B = 140/12 = 11.667. That is, θ̂ = (20.0, 11.667).
To determine

L(θ̂₀) = max_{θ∈Ω₀} L(θ)

we need to find the (constrained) maximum likelihood estimate θ̂₀, which is the value of θ = (θ_A, θ_B) which maximizes l(θ_A, θ_B) under the constraint θ_A = θ_B. To do this we merely let θ = θ_A = θ_B in (5.10) to obtain

l(θ, θ) = −12θ + 240 log θ − 12θ + 140 log θ = −24θ + 380 log θ   for θ > 0.

Solving ∂l(θ, θ)/∂θ = 0, we find θ̂ = 380/24 = 15.833 (= θ̂_A = θ̂_B); that is, θ̂₀ = (15.833, 15.833).
The next step is to compute the observed value of the likelihood ratio statistic, which from (5.8) is

λ = 2l(θ̂) − 2l(θ̂₀)
  = 2l(20.0, 11.667) − 2l(15.833, 15.833)
  = 2(682.92 − 669.60)
  = 26.64.

Finally, we compute the approximate p-value for the test, which by (5.9) is

p-value = P(Λ ≥ 26.64; H₀)
        ≈ P(W ≥ 26.64)  where W ~ χ²(1)
        = 2[1 − P(Z ≤ √26.64)]  where Z ~ G(0, 1)
        ≈ 0.
(Think of this as maximizing over each parameter with the other parameter fixed.)
Our conclusion is that there is very strong evidence against the hypothesis H₀: θ_A = θ_B. The data indicate that Company B's copiers have a lower rate of failure than Company A's copiers.
Note that we could also follow up this conclusion by giving a confidence interval for the mean difference θ_A − θ_B, since this would indicate the magnitude of the difference in the two failure rates. The maximum likelihood estimates θ̂_A = 20.0 average failures per month and θ̂_B = 11.67 failures per month differ a lot, but we could also give a confidence interval in order to express the uncertainty in such estimates.
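A sketch of this test in R, using the monthly failure counts above:

    xA <- c(16, 14, 25, 19, 23, 12, 22, 28, 19, 15, 18, 29)
    yB <- c(13, 7, 12, 9, 15, 17, 10, 13, 8, 10, 12, 14)
    # Poisson log likelihood, with the constants in x! omitted as in the notes
    loglik <- function(thetaA, thetaB) {
      -12 * thetaA + sum(xA) * log(thetaA) - 12 * thetaB + sum(yB) * log(thetaB)
    }
    thetaA.hat <- mean(xA)            # 20.0
    thetaB.hat <- mean(yB)            # 11.667
    theta0.hat <- mean(c(xA, yB))     # 15.833, the constrained estimate
    lambda <- 2 * (loglik(thetaA.hat, thetaB.hat) -
                   loglik(theta0.hat, theta0.hat))   # about 26.64
    1 - pchisq(lambda, df = 1)        # approximate p-value, essentially 0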
Example 5.4.4  Likelihood ratio test of hypotheses for σ in the G(μ, σ) model with μ unknown
Consider a test of H₀: σ = σ₀ based on a random sample y₁, y₂, …, yₙ. In this case the unconstrained parameter space is Ω = {(μ, σ): −∞ < μ < ∞, σ > 0}, obviously a 2-dimensional space, but under the constraint imposed by H₀ the parameter must lie in the space Ω₀ = {(μ, σ₀): −∞ < μ < ∞}, a space of dimension 1. Thus k = 2 and p = 1.
The likelihood function is

L(θ) = L(μ, σ) = ∏ᵢ₌₁ⁿ f(yᵢ; μ, σ) = ∏ᵢ₌₁ⁿ (1/(σ√(2π))) exp[ −(1/(2σ²))(yᵢ − μ)² ]

and the log likelihood function is

l(μ, σ) = −n log σ − (1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² + c

where c = log[(2π)^(−n/2)]. The (unconstrained) maximum likelihood estimators are

μ̃ = Ȳ  and  σ̃² = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)².
Under the constraint imposed by H₀: σ = σ₀ the maximum likelihood estimator of the parameter μ is also Ȳ, so the likelihood ratio statistic is

Λ(σ₀) = 2l(Ȳ, σ̃) − 2l(Ȳ, σ₀)
      = −2n log σ̃ − (1/σ̃²) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² + 2n log σ₀ + (1/σ₀²) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²
      = 2n log(σ₀/σ̃) + (1/σ₀² − 1/σ̃²) n σ̃²
      = n[ σ̃²/σ₀² − 1 − log(σ̃²/σ₀²) ].
This is not as obviously a Chi-squared random variable. It is, as one might expect, a function of σ̃²/σ₀², the maximum likelihood estimator of the variance divided by the value of σ² under H₀. In fact the value of Λ(σ₀) increases as the quantity σ̃²/σ₀² gets further away from the value 1 in either direction.
The test proceeds by obtaining the observed value of Λ(σ₀),

λ(σ₀) = n[ σ̂²/σ₀² − 1 − log(σ̂²/σ₀²) ],

and computing the approximate p-value P(W ≥ λ(σ₀)) where W ~ χ²(1).
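A small R sketch of this test (the data vector y and the value sigma0 are placeholders to be supplied by the user):

    # likelihood ratio test of H0: sigma = sigma0 in the G(mu, sigma) model, mu unknown
    lr.test.sigma <- function(y, sigma0) {
      n <- length(y)
      sigma2.tilde <- sum((y - mean(y))^2) / n   # maximum likelihood estimate of sigma^2
      r <- sigma2.tilde / sigma0^2
      lambda <- n * (r - 1 - log(r))             # observed likelihood ratio statistic
      c(lambda = lambda, p.value = 1 - pchisq(lambda, df = 1))
    }
    # hypothetical example call: lr.test.sigma(y, sigma0 = 0.008)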
f(y₁, …, y_k; θ₁, …, θ_k) = [n!/(y₁! ⋯ y_k!)] θ₁^(y₁) θ₂^(y₂) ⋯ θ_k^(y_k)   for 0 ≤ yⱼ ≤ n with Σⱼ₌₁ᵏ yⱼ = n.

The likelihood function is

L(θ) = [n!/(y₁! ⋯ y_k!)] ∏ⱼ₌₁ᵏ θⱼ^(yⱼ)

or more simply

L(θ) = ∏ⱼ₌₁ᵏ θⱼ^(yⱼ).
where

λ = 2l(θ̂) − 2l(θ̂₀)

is the observed value of Λ. We will give specific examples of the Multinomial model in Chapter 7.
(a) Let θ be the probability the woman guesses the card correctly and let Y be the number of correct guesses in n repetitions of the procedure. Discuss why Y ~ Binomial(n, θ) would be an appropriate model. If you wanted to test the hypothesis that the woman is guessing at random, what is the appropriate null hypothesis H₀ in terms of the parameter θ?

(b) Suppose the woman guessed correctly 8 times in 20 repetitions. Using the test statistic D = |Y − E(Y)|, calculate the p-value for your hypothesis H₀ in (a) and give a conclusion about whether you think the woman has any special guessing ability.

(c) In a longer sequence of 100 repetitions over two days, the woman guessed correctly 32 times. Using the test statistic D = |Y − E(Y)|, calculate the p-value for these data. What would you conclude now?
2. The accident rate over a certain stretch of highway was about θ = 10 per year for a period of several years. In the most recent year, however, the number of accidents was 25. We want to know whether this many accidents is very probable if θ = 10; if not, we might conclude that the accident rate has increased for some reason. Investigate this question by assuming that the number of accidents in the current year follows a Poisson distribution with mean θ and then testing H₀: θ = 10. Use the test statistic D = max(0, Y − 10) where Y represents the number of accidents in the most recent year.
3. A hospital lab has just purchased a new instrument for measuring levels of dioxin
(in parts per billion). To calibrate the new instrument, 20 samples of a “standard”
water solution known to contain 45 parts per billion dioxin are measured by the new
instrument. The observed data are given below:
44:1 46:0 46:6 41:3 44:8 47:8 44:5 45:1 42:9 44:5
42:5 41:5 39:6 42:0 45:8 48:9 46:6 42:9 47:0 43:7
For these data Σᵢ₌₁²⁰ yᵢ = 888.1 and Σᵢ₌₁²⁰ yᵢ² = 395.45.
(a) Use a qqplot to check whether a G(μ, σ) model is reasonable for these data.

(b) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?

(c) Assuming a G(μ, σ) model for these data, test the hypothesis H₀: μ = 45. Determine a 95% confidence interval for μ. What would you conclude about how well the new instrument is working?

(d) The manufacturer of these instruments claims that the variability in measurements is less than two parts per billion. Test the hypothesis H₀: σ² = 4 and determine a 95% confidence interval for σ. What would you conclude about the manufacturer's claim?
5. Radon is a colourless, odourless gas that is naturally released by rocks and soils and may concentrate in highly insulated houses. Because radon is slightly radioactive, there is some concern that it may be a health hazard. Radon detectors are sold to homeowners worried about this risk, but the detectors may be inaccurate. University researchers placed 12 detectors in a chamber where they were exposed to 105 picocuries per liter of radon over 3 days. The readings given by the detectors were:

91.9  97.8  111.4  122.3  105.4  95.0  103.8  99.6  96.6  119.3  104.8  101.7

Assume Yᵢ ~ N(μ, σ²) = G(μ, σ), i = 1, …, 12 independently.

(a) Test the hypothesis H₀: μ = 105. Determine a 95% confidence interval for μ.

(b) Determine a 95% confidence interval for σ.

(c) As a statistician what would you say to the university researchers about the accuracy and precision of the detectors?

with μ = 105.
7. Between 10 a.m. on November 4, 2014 and 10 p.m. on November 6, 2014 the Federation of Students at the University of Waterloo conducted a referendum on the question "Should classes start on the first Thursday after Labour Day to allow for two additional days off in the Fall term?". All undergraduates were able to cast their ballot online. Six thousand of the 30,990 eligible voters voted. Of the 6000 who voted, 4440 answered yes to this question.
8. Data on the number of accidents at a busy intersection in Waterloo over the last 5
years indicated that the average number of accidents at the intersection was 3 acci-
dents per week. After the installation of new tra¢ c signals the number of accidents
per week for a 25 week period were recorded as follows:
4 5 0 4 2 0 1 4 1 3 1 1 2
2 2 1 1 3 2 3 2 0 2 2 3
(a) To decide whether the mean number of accidents at this intersection has changed
after the installation of the new traffic signals we wish to test the hypothesis
H0: θ = 3. Why is the discrepancy measure D = |Σ_{i=1}^{25} Yi − 75| reasonable? Calculate
the exact p-value for testing H0: θ = 3. What would you conclude?
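As an illustration only (a sketch, not part of the original exercise), an exact tail probability of this type can be computed in R using the fact that, under H0, the total Σ Yi over the 25 weeks has a Poisson(75) distribution:
y <- c(4,5,0,4,2,0,1,4,1,3,1,1,2,2,2,1,1,3,2,3,2,0,2,2,3)
t <- sum(y)                                 # observed total number of accidents
d <- abs(t - 75)                            # observed value of the discrepancy measure
ppois(75 - d, 75) + 1 - ppois(75 + d - 1, 75)   # exact two-sided p-value under Poisson(75)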
10. For Chapter 2, Problem 5 (b) test the hypothesis H0: θ = 5 using the likelihood ratio
test statistic. Is this result consistent with the approximate 95% confidence interval
for θ that you found in Chapter 4, Problem 8?
11. For Chapter 2, Problem 6 (b) test the hypothesis H0: θ = 0.1 using the likelihood
ratio test statistic. Is this result consistent with the approximate 95% confidence
interval for θ that you found in Chapter 4, Problem 9?
12. Data from the 2011 Canadian census indicate that 18% of all families in Canada
have one child. Suppose the data in Chapter 2, Problem 7 (d) represented 33 children
chosen at random from the Waterloo Region. Based on these data, test the hypothesis
that the percentage of families with one child in Waterloo Region is the same as the
national percentage using the likelihood ratio test statistic. Is this result consistent
with the approximate 95% confidence interval for θ that you found in Chapter 4,
Problem 10?
13. A company that produces power systems for personal computers has to demonstrate
a high degree of reliability for its systems. Because the systems are very reliable
under normal use conditions, it is customary to 'stress' the systems by running them
at a considerably higher temperature than they would normally encounter, and to
measure the time until the system fails. According to a contract with one personal
computer manufacturer, the average time to failure for systems run at 70°C should
be no less than 1,000 hours. From one production lot, 20 power systems were put on
test and observed until failure at 70°C. The 20 failure times y1, ..., y20 were (in hours):
374:2 544:0 1113:9 509:4 1244:3
551:9 853:2 3391:2 297:0 63:1
250:2 678:1 379:6 1818:9 1191:1
162:8 1060:1 1501:4 332:2 2382:0
(Note: Σ_{i=1}^{20} yi = 18,698.6.) Failure times are assumed to have an Exponential(θ)
distribution.
(a) Check whether the Exponential model is reasonable for these data. (See Example
5.3.2.)
(b) Use a likelihood ratio test to test H0: θ = 1000 hours. Is there any evidence
that the company's power systems do not meet the contracted standard?
14. The R function runif() generates pseudo-random Uniform(0, 1) random variables.
The command y <- runif(n) will produce a vector of n values y1, ..., yn.
(a) Suggest a test statistic which could be used to test that the yi's, i = 1, ..., n, are
consistent with a random sample from Uniform(0, 1).
(See: www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA393366)
(b) Generate 1000 yi's and carry out the test in (a).
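One possible approach (sketched here only as an illustration; many other test statistics are equally reasonable) is a chi-squared goodness of fit test comparing the counts in equal-width subintervals of (0, 1) to their expected values:
set.seed(123)                                       # for reproducibility
y <- runif(1000)
obs <- table(cut(y, breaks = seq(0, 1, by = 0.1)))  # counts in 10 equal bins
expctd <- rep(1000 / 10, 10)                        # expected counts under Uniform(0, 1)
d <- sum((obs - expctd)^2 / expctd)                 # Pearson goodness of fit statistic
1 - pchisq(d, df = 10 - 1)                          # approximate p-value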
15. The Poisson model is often used to compare rates of occurrence for certain types of
events in different geographic regions. For example, consider K regions with populations
P1, ..., PK and let λj, j = 1, ..., K be the annual expected number of events
per person for region j. By assuming that the number of events Yj for region j in a
given t-year period has a Poisson distribution with mean Pj λj t, we can estimate and
compare the λj's or test that they are equal.
(a) Under what conditions might the stated Poisson model be reasonable?
(b) Suppose you observe values y1, ..., yK for a given t-year period. Describe how
to test the hypothesis that λ1 = λ2 = ··· = λK.
(c) The data below show the numbers of children yj born with "birth defects" for 5
regions over a given five year period, along with the total numbers of births Pj
for each region. Test the hypothesis that the five rates of birth defects are equal.
16. Challenge Problem: Likelihood ratio test statistic for the Gaussian model
with μ and σ unknown: Suppose that Y1, ..., Yn are independent G(μ, σ) observations.
(a) Show that the likelihood ratio test statistic for testing H0: μ = μ0 (σ unknown)
is given by
Λ(μ0) = n log[1 + T²/(n − 1)]
where T = √n (Ȳ − μ0)/S and S is the sample standard deviation. Note: you
will want to use the identity
Σ_{i=1}^{n} (Yi − μ0)² = Σ_{i=1}^{n} (Yi − Ȳ)² + n(Ȳ − μ0)².
(b) Show that the likelihood ratio test statistic for testing H0: σ = σ0 (μ unknown)
can be written as Λ(σ0) = U − n log(U/n) − n where
U = (n − 1)S²/σ0².
17. Challenge Problem: Likelihood ratio test statistic for comparing two
Exponential means: Suppose that X1, ..., Xm is a random sample from the
Exponential(θ1) distribution and independently Y1, ..., Yn is a random sample
from the Exponential(θ2) distribution. Determine the likelihood ratio test statistic
for testing H0: θ1 = θ2.
18. In the Wintario lottery draw, six digit numbers were produced by six machines that
operate independently and which each simulate a random selection from the digits
0; 1; : : : ; 9. Of 736 numbers drawn over a period from 1980-82, the following frequen-
cies were observed for position 1 in the six digit numbers:
If the machines operate in a truly "random" fashion, then we should have θj = 0.1 for
j = 0, 1, ..., 9.
(a) Test this hypothesis using a likelihood ratio test. What do you conclude?
(b) The data above were for digits in the first position of the six digit Wintario
numbers. Suppose you were told that similar likelihood ratio tests had in fact
been carried out for each of the six positions, and that position 1 had been
singled out for presentation above because it gave the largest observed value of
the likelihood ratio statistic Λ. What would you now do to test the hypothesis
θj = 0.1, j = 0, 1, 2, ..., 9? (Hint: Find P(largest of 6 independent Λ's is ≥ λ).)
6. GAUSSIAN RESPONSE
MODELS
6.1 Introduction
A response variate Y is one whose distribution has parameters which depend on the value
of other variates. For the Gaussian models we have studied so far, we assumed that we had
a random sample Y1, Y2, ..., Yn from the same Gaussian distribution G(μ, σ). A Gaussian
response model generalizes this to permit the parameters of the Gaussian distribution for
Yi to depend on a vector xi of covariates (explanatory variates which are measured for
the response variate Yi). Gaussian models are by far the most common models used in
statistics.
Definition 40 A Gaussian response model is one for which the distribution of the response
variate Y, given the associated vector of covariates x = (x1, x2, ..., xk) for an individual
unit, is of the form
Y ~ G(μ(x), σ(x)).
In most examples we will assume σ(xi) = σ is constant. This assumption is not necessary
but it does make the models easier to analyze. The choice of μ(x) is guided by past
information and by current data from the population or process. The difference between
various Gaussian response models is in the choice of the function μ(x) and the covariates.
We often assume μ(xi) is a linear function of the covariates. These models are called
Gaussian linear models and can be written as
Yi ~ G(μ(xi), σ) with μ(xi) = β0 + β1 xi1 + β2 xi2 + ··· + βk xik   (6.1)
where xi = (xi1, xi2, ..., xik) is the vector of known covariates associated with unit i and
β0, β1, ..., βk are unknown parameters. These models are also referred to as linear regression
models, and the βj's are called the regression coefficients.
Here are some examples of settings where Gaussian response models can be used.
In this case there is no formula relating the means and standard deviations to the machines; they are simply different.
Notice that an important feature of a machine is the variability of its production, so we
have, in this case, permitted the two variance parameters to be different.
A manufacturing company was appealing the assessed market value of its property,
which included a large building. Sales records were collected on the 30 largest buildings
sold in the previous three years in the area. The data are given in Table 6.1 and plotted in
Figure 6.1. They include the size of the building x (in m²/10⁵) and the selling price y (in
$ per m²). The purpose of the analysis is to determine whether and to what extent we can
determine the value of a property from the single covariate x so that we know whether the
assessed value appears to be too high. The building in question was 4.47 × 10⁵ m², with
an assessed market value of $75 per m².
The scatterplot shows that the price y is roughly inversely proportional to the size x
but there is obviously variability in the price of buildings having the same area (size). In
this case we might consider a model where the price of a building of size xi is represented
by a random variable Yi, with
Yi ~ G(β0 + β1 xi, σ) for i = 1, ..., n independently
where β0 and β1 are unknown parameters. We assume a common standard deviation σ for
the observations.
Figure 6.1: Scatterplot of selling price ($ per m²) versus size for the 30 buildings
The data below show the breaking strengths y of six steel bolts at each of five different
bolt diameters x. The data are plotted in Figure 6.2.
The scatterplot gives a clear picture of the relationship between y and x. A reasonable
model for the breaking strength Y of a randomly selected bolt of diameter x would appear
to be Y ~ G(μ(x), σ). The variability in y values appears to be about the same for bolts of
different diameters, which again provides some justification for assuming σ to be constant.
It is not obvious what the best choice for μ(x) would be, although the relationship looks
slightly nonlinear, so we might try a quadratic function
μ(x) = β0 + β1 x + β2 x².
Figure 6.2: Scatterplot of breaking strength versus bolt diameter
The G(μ, σ) Model
In Chapters 4 and 5 we discussed estimation and testing hypotheses for samples from a
Gaussian distribution. Suppose that Y ~ G(μ, σ) models a response variate y in some
population or process. A random sample Y1, ..., Yn is selected, and we want to estimate
the model parameters and possibly to test hypotheses about them. We can write this model
in the form
Yi = μ + Ri where Ri ~ G(0, σ),   (6.2)
so this is a special case of the Gaussian response model in which the mean function is constant.
The estimator of the parameter μ that we used is the maximum likelihood estimator
Ȳ = (1/n) Σ_{i=1}^{n} Yi. This estimator is also a "least squares estimator": Ȳ has the property that
it is closer to the data than any other constant, or
min_μ Σ_{i=1}^{n} (Yi − μ)² = Σ_{i=1}^{n} (Yi − Ȳ)².
You should be able to verify this. It will turn out that the methods for estimation, constructing
confidence intervals and tests of hypotheses discussed earlier for the single Gaussian
G(μ, σ) are all special cases of the more general methods derived in Section 6.5.
In the next section we begin with a simple generalization of (6.2) to the case in which
the mean is a linear function of a single covariate.
or more simply
L(α, β, σ) = (1/σⁿ) exp[ −(1/(2σ²)) Σ_{i=1}^{n} (yi − α − βxi)² ].
The log likelihood is
l(α, β, σ) = −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − α − βxi)².
Setting the partial derivatives equal to zero gives
∂l/∂α = (1/σ²) Σ_{i=1}^{n} (yi − α − βxi) = (n/σ²)(ȳ − α − βx̄) = 0   (6.4)
∂l/∂β = (1/σ²) Σ_{i=1}^{n} (yi − α − βxi) xi = (1/σ²) [ Σ xi yi − α Σ xi − β Σ xi² ] = 0   (6.5)
∂l/∂σ = −n/σ + (1/σ³) Σ_{i=1}^{n} (yi − α − βxi)² = 0.
Solving these equations gives the maximum likelihood estimators
β̃ = Sxy / Sxx,   (6.6)
α̃ = Ȳ − β̃ x̄,   (6.7)
σ̃² = (1/n) Σ_{i=1}^{n} (Yi − α̃ − β̃ xi)²   (6.8)
where
Sxx = Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (xi − x̄) xi
Sxy = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) = Σ_{i=1}^{n} (xi − x̄) Yi
Syy = Σ_{i=1}^{n} (Yi − Ȳ)².
The alternative expressions for Sxx and Sxy⁴¹ are easy to obtain.
We will use
Se² = (1/(n − 2)) Σ_{i=1}^{n} (Yi − α̃ − β̃ xi)² = (1/(n − 2)) (Syy − β̃ Sxy)
⁴¹ Since Σ_{i=1}^{n} (xi − x̄) = 0,
Σ_{i=1}^{n} (xi − x̄)(xi − x̄) = Σ_{i=1}^{n} (xi − x̄) xi − x̄ Σ_{i=1}^{n} (xi − x̄) = Σ_{i=1}^{n} (xi − x̄) xi
and
Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) = Σ_{i=1}^{n} (xi − x̄) Yi − Ȳ Σ_{i=1}^{n} (xi − x̄) = Σ_{i=1}^{n} (xi − x̄) Yi.
as the estimator of σ² rather than the maximum likelihood estimator σ̃² given by (6.8),
since it can be shown that E(Se²) = σ². Note that Se² can be more easily calculated using
Se² = (1/(n − 2)) (Syy − β̃ Sxy)
which follows since
Σ_{i=1}^{n} (Yi − α̃ − β̃ xi)² = Σ_{i=1}^{n} (Yi − Ȳ + β̃ x̄ − β̃ xi)²
= Σ_{i=1}^{n} (Yi − Ȳ)² − 2β̃ Σ_{i=1}^{n} (Yi − Ȳ)(xi − x̄) + β̃² Σ_{i=1}^{n} (xi − x̄)²
= Syy − 2β̃ Sxy + (Sxy/Sxx) β̃ Sxx
= Syy − β̃ Sxy.
simultaneously. We note that this is equivalent to solving the maximum likelihood equations
(6.4) and (6.5). In summary, the least squares estimates and the maximum
likelihood estimates obtained assuming the model (6.3) are the same estimates. Of course
the method of least squares only provides point estimates of the unknown parameters
α and β, while assuming the model (6.3) allows us to obtain both estimates and confidence
intervals for the unknown parameters. We now show how to obtain confidence intervals
based on the model (6.3).
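Anticipating the numerical example below, here is a minimal R sketch (with made-up data vectors x and y, used only for illustration) of how the least squares/maximum likelihood estimates and se can be computed from the summary quantities and checked against lm:
x <- c(0.9, 1.4, 2.1, 2.7, 3.3)          # illustrative covariate values
y <- c(610, 550, 480, 420, 360)          # illustrative responses
n <- length(y)
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)
betahat  <- Sxy / Sxx                    # slope estimate, equation (6.6)
alphahat <- mean(y) - betahat * mean(x)  # intercept estimate, equation (6.7)
se <- sqrt((Syy - betahat * Sxy) / (n - 2))
coef(lm(y ~ x))                          # should agree with alphahat and betahat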
We can write
β̃ = Sxy/Sxx = Σ_{i=1}^{n} ai Yi where ai = (xi − x̄)/Sxx
to make it clear that β̃ is a linear combination of the Normal random variables Yi and is
therefore Normally distributed with easily obtained expected value and variance. In fact it
is easy to show that these non-random coefficients satisfy Σ ai = 0, Σ ai xi = 1 and
Σ ai² = 1/Sxx. Therefore
E(β̃) = Σ_{i=1}^{n} ai E(Yi) = Σ_{i=1}^{n} ai (α + β xi)
= β Σ ai xi   since Σ ai = 0
= β   since Σ ai xi = 1.
Similarly
Var(β̃) = Σ_{i=1}^{n} ai² Var(Yi)   since the Yi are independent random variables
= σ² Σ_{i=1}^{n} ai²
= σ²/Sxx   since Σ ai² = 1/Sxx.
In summary,
β̃ ~ G(β, σ/√Sxx).
Combining this with (6.9) and the fact that it can be shown that β̃ and Se² are independent
random variables, it follows by Theorem 32 that
(β̃ − β) / (Se/√Sxx) ~ t(n − 2).   (6.10)
This pivotal quantity can be used to obtain confidence intervals for β and to construct tests
of hypotheses about β.
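Continuing the R sketch above (same illustrative x, y and derived quantities), a 95% confidence interval for the slope based on (6.10) can be computed as:
a <- qt(0.975, df = n - 2)                 # t quantile with n - 2 degrees of freedom
c(betahat - a * se / sqrt(Sxx),
  betahat + a * se / sqrt(Sxx))            # 95% confidence interval for the slope
# equivalently: confint(lm(y ~ x))["x", ]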
To test the hypothesis H0: β = β0 the corresponding test statistic is
|β̃ − β0| / (Se/√Sxx).
Note also that (6.9) can be used to obtain confidence intervals or tests for σ, but these
are usually of less interest than inferences about β and the other quantities below.
Thus, the intercept changes if we redefine x, but the slope β does not. In the examples we consider here
we have kept the given definition of xi, for simplicity.
μ̃(x) = α̃ + β̃ x = Ȳ + β̃ (x − x̄),
since α̃ = Ȳ − β̃ x̄. Since β̃ = Sxy/Sxx = Σ_{i=1}^{n} [(xi − x̄)/Sxx] Yi, the estimator μ̃(x) is a
linear combination of the independent Normal random variables Yi and is therefore Normally
distributed, with E[μ̃(x)] = μ(x) and
Var[μ̃(x)] = σ² [ 1/n + (x − x̄)²/Sxx ],
so that
μ̃(x) ~ G( μ(x), σ [1/n + (x − x̄)²/Sxx]^{1/2} ).   (6.12)
Note that the variance of μ̃(x) is smallest in the middle of the data, that is, when x is close to
x̄, and much larger when (x − x̄)² is large.
Since (6.12) holds independently of (6.9), by Theorem 32 we obtain the pivotal
quantity
[μ̃(x) − μ(x)] / [ Se √(1/n + (x − x̄)²/Sxx) ] ~ t(n − 2)   (6.13)
which can be used to obtain confidence intervals for μ(x) in the usual manner. Using
t tables or R, find the constant a such that P(−a ≤ T ≤ a) = p where T ~ t(n − 2). Since
p = P(−a ≤ T ≤ a) = P( −a ≤ [μ̃(x) − μ(x)] / [Se √(1/n + (x − x̄)²/Sxx)] ≤ a )
= P( μ̃(x) − a Se √(1/n + (x − x̄)²/Sxx) ≤ μ(x) ≤ μ̃(x) + a Se √(1/n + (x − x̄)²/Sxx) ),
a 100p% confidence interval for μ(x) is given by
μ̂(x) ± a se √(1/n + (x − x̄)²/Sxx)   (6.14)
where μ̂(x) = α̂ + β̂ x and
se² = (1/(n − 2)) Σ_{i=1}^{n} (yi − α̂ − β̂ xi)² = (1/(n − 2)) (Syy − β̂ Sxy).
Remark: Note that since α = μ(0), a 95% confidence interval for α is given by (6.14)
with x = 0, which gives
α̂ ± a se √(1/n + x̄²/Sxx).   (6.15)
One can see from (6.15) that if x̄ is large in magnitude (which means the average xi
is large), then the confidence interval for α will be very wide. This would be disturbing if
the value x = 0 were of interest, but often it is not. In the following example it refers
to a building of area x = 0, which is nonsensical!
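A short R sketch of the interval (6.14), using the quantities from the earlier illustrative sketch (x0 is a hypothetical value of the covariate at which the mean response is to be estimated):
x0 <- 2.0                                            # hypothetical covariate value
muhat <- alphahat + betahat * x0
half <- qt(0.975, n - 2) * se * sqrt(1/n + (x0 - mean(x))^2 / Sxx)
c(muhat - half, muhat + half)                        # 95% confidence interval for mu(x0)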
Remark: The results of the analyses below can be obtained using the R function lm,
with the command lm(y ~ x). We give the detailed results below to illustrate how the
calculations are carried out.
For the building data, n = 30, x̄ = 0.9543, ȳ = 548.9700, Sxx = 22.9453, Sxy = −3316.6771, Syy = 489,624.723,
so we find β̂ = Sxy/Sxx = −144.55 and α̂ = ȳ − β̂ x̄ = 686.9.
(Note that when calculating these values using a calculator you should use as many decimal
places as possible, otherwise the values are affected by round-off error.) Since β̂ is negative,
this implies that the larger sized buildings tend to sell for less per square meter. (The
estimate β̂ = −144.55 indicates a drop in average price of $144.55 per square meter for
each increase of one unit in x; remember x's units are m² × 10⁵.)
The line y = α̂ + β̂ x is often called the fitted regression line for y on x. If we plot the
fitted line on the same graph as the scatterplot of points (xi, yi), i = 1, ..., n, as in Figure
6.3, we see the fitted line passes close to the points.
A confidence interval for α is not of major interest in the setting here, where the data
were called on to indicate a fair assessment value for a large building with x = 4.47. One
way to address this is to estimate μ(x) when x = 4.47. We get the maximum likelihood
estimate for μ(4.47) as
μ̂(4.47) = α̂ + β̂(4.47) = $40.79
which we note is much below the assessed value of $75 per square meter. However, one
can object that there is uncertainty in this estimate, and that it would be better to give a
confidence interval for μ(4.47). Using (6.14) and P(T ≤ 2.0484) = 0.975 for T ~ t(28) we
get a 95% confidence interval for μ(4.47) as
μ̂(4.47) ± 2.0484 se √(1/30 + (4.47 − x̄)²/Sxx)
= $40.79 ± $29.58
= [$11.21, $70.37].
Figure 6.3: Scatterplot and fitted line (y = 686.9 − 144.5x) for building price versus size
However (playing lawyer for the assessor), we could raise another objection: we are
considering a single building but we have constructed a confidence interval for the average
of all buildings of size x = 4.47 (× 10⁵) m². The constructed confidence interval is for a point
on the line, not a point Y generated by adding to α + β(4.47) the random error R ~ G(0, σ),
which has a non-negligible variance. This suggests that what we should do is predict the
y value for a building with x = 4.47, instead of estimating μ(4.47). We will temporarily
leave the example in order to develop a method for this.
Since R is independent of μ̃(x) (it is not connected to the existing sample), the difference
Y − μ̃(x) in (6.16) is the sum of independent Normally distributed random variables and is consequently Normally
distributed. Since E[Y − μ̃(x)] = 0 and
Var[Y − μ̃(x)] = σ² [ 1 + 1/n + (x − x̄)²/Sxx ],
we have
Y − μ̃(x) ~ G( 0, σ [1 + 1/n + (x − x̄)²/Sxx]^{1/2} ).   (6.17)
Since (6.17) holds independently of (6.9), by Theorem 32 we obtain the pivotal
quantity
[Y − μ̃(x)] / [ Se √(1 + 1/n + (x − x̄)²/Sxx) ] ~ t(n − 2).
For an interval estimate with confidence coefficient p we choose a such that
p = P(−a ≤ T ≤ a) where T ~ t(n − 2). Since
p = P( −a ≤ [Y − μ̃(x)] / [Se √(1 + 1/n + (x − x̄)²/Sxx)] ≤ a )
= P( μ̃(x) − a Se √(1 + 1/n + (x − x̄)²/Sxx) ≤ Y ≤ μ̃(x) + a Se √(1 + 1/n + (x − x̄)²/Sxx) ),
the interval
μ̂(x) ± a se √(1 + 1/n + (x − x̄)²/Sxx)   (6.18)
is an interval estimate for Y with confidence coefficient p.
This interval is usually called a 100p% prediction interval instead of a confidence interval,
since Y is not a parameter but a "future" observation.
Remark: Care must be taken in constructing prediction intervals for values of x which
lie outside the interval of observed xi's, since this assumes that the linear relationship holds
beyond the observed data. This is dangerous since there are no data to support the assumption.
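In R, both the confidence interval (6.14) for μ(x) and the prediction interval (6.18) for Y can be obtained from a fitted lm object with predict; a sketch, reusing the illustrative x and y from the earlier sketch and a hypothetical new covariate value:
fit <- lm(y ~ x)
new <- data.frame(x = 2.0)                              # hypothetical new covariate value
predict(fit, newdata = new, interval = "confidence", level = 0.95)
predict(fit, newdata = new, interval = "prediction", level = 0.95)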
The lower limit is negative, which is nonsensical. This happened because we were using
a Gaussian model (Gaussian random variables Y can be positive or negative) in a setting
where the price Y must be positive. Nonetheless, the Gaussian model fits the data reasonably
well. We might just truncate the prediction interval and take it to be [0, $89.83].
Now we find that the assessed value of $75 is inside this interval. On this basis it's
difficult to say that the assessed value is unfair (though it is towards the high end of
the prediction interval). Note also that the value x = 4.47 of interest is well outside the
interval of observed x values, which was [0.20, 3.26] in the data set of 30 buildings. Thus any
conclusions we reach are based on an assumption that the linear model E(Y | x) = α + βx
applies beyond x = 3.26, at least as far as x = 4.47. This may or may not be true, but we
have no way to check it with the data we have.
There is a slight suggestion in Figure 6.3 that Var(Y) may be smaller for larger x values.
There are not sufficient data to check this either. We mention these points because an
important companion to every statistical analysis is a qualification of the conclusions based
on a careful examination of the applicability of the assumptions underlying the analysis.
Remark: Note from (6.14) and (6.18) that the confidence interval for μ(x) and the prediction
interval for Y are wider the further away x is from x̄. Thus, as we move further away
from the "middle" of the x's in the data, we get wider and wider intervals for μ(x) and Y.
so we find
β̂ = Sx1y / Sx1x1 = 0.6368 / 0.2244 = 2.8378,
α̂ = ȳ − β̂ x̄1 = 1.979 − (2.8378)(0.11) = 1.6668,
se² = (1/(n − 2)) (Syy − β̂ Sx1y) = (1/28) [1.88147 − (2.8378)(0.6368)] = 0.002656,
and se = 0.05154.
The fitted regression line y = α̂ + β̂ x1 is shown on the scatterplot in Figure 6.4. The model
appears to fit the data well.
Figure 6.4: Scatterplot plus fitted line (y = 1.67 + 2.84x1) for strength versus diameter squared
The parameter β represents the increase in average strength μ(x1) from increasing x1 =
x² by one unit. Using the pivotal quantity (6.10) and the fact that P(T ≤ 2.0484) = 0.975
for T ~ t(28), we obtain the 95% confidence interval for β as
β̂ ± 2.0484 se/√Sx1x1 = 2.8378 ± 0.2228 = [2.6149, 3.0606].
Table 6.2
Summary of Distributions for Simple Linear Regression
β̃ = Sxy/Sxx: Gaussian, with E(β̃) = β and standard deviation σ (1/Sxx)^{1/2}
(β̃ − β)/(Se/√Sxx), where Se² = (Syy − β̃ Sxy)/(n − 2): Student t with n − 2 degrees of freedom
α̃ = Ȳ − β̃ x̄: Gaussian, with E(α̃) = α and standard deviation σ [1/n + x̄²/Sxx]^{1/2}
[Y − μ̃(x)] / [Se √(1 + 1/n + (x − x̄)²/Sxx)]: Student t with n − 2 degrees of freedom
(n − 2) Se²/σ²: Chi-squared with n − 2 degrees of freedom
(1) The assumption that Yi (given any covariates xi) is Gaussian with constant standard
deviation σ.
(2) The assumption that E(Yi) = μ(xi) is a linear combination of observed covariates
with unknown coefficients.
Models should always be checked. In problems with only one x covariate, a plot of
the fitted line superimposed on the scatterplot of the data (as in Figures 6.3 and 6.4)
shows pretty clearly how well the model fits. If there are two or more covariates in the
model, residual plots, which are described below, are very useful for checking the model
assumptions.
Residuals are defined as the difference between the observed response and the fitted
values. Consider the simple linear regression model for which Yi ~ G(μi, σ) where
μi = α + βxi and Ri = Yi − μi ~ G(0, σ), i = 1, 2, ..., n independently. The residuals are
given by
r̂i = yi − μ̂i = yi − α̂ − β̂ xi for i = 1, ..., n.
The idea behind the r̂i's is that they can be thought of as "observed" Ri's. This isn't
exactly correct since we are using μ̂i instead of μi in r̂i, but if the model is correct, then
the r̂i's should behave roughly like a random sample from the G(0, σ) distribution. The
r̂i's do have some features that can be used to check the model assumptions. Recall that
the maximum likelihood estimate of α is α̂ = ȳ − β̂ x̄, which implies that ȳ − α̂ − β̂ x̄ = 0, or
0 = ȳ − α̂ − β̂ x̄ = (1/n) Σ_{i=1}^{n} (yi − α̂ − β̂ xi) = (1/n) Σ_{i=1}^{n} r̂i,
so the residuals sum to zero.
(1) Plot the points (xi, r̂i), i = 1, ..., n. If the model is satisfactory the points should lie
more or less horizontally within a constant band around the line r̂i = 0 (see Figure
6.5).
(2) Plot the points (μ̂i, r̂i), i = 1, ..., n. If the model is satisfactory the points should lie
more or less horizontally within a constant band around the line r̂i = 0.
(3) Plot a Normal qqplot of the residuals r̂i. If the model is satisfactory the points should
lie more or less along a straight line.
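A minimal R sketch of these three checks for a fitted simple linear regression (x and y are the covariate and response vectors from the earlier illustrative sketch; the standardization by se is the simple one used in these notes):
fit <- lm(y ~ x)
rhat <- residuals(fit)                      # residuals y_i - muhat_i
sehat <- summary(fit)$sigma                 # estimate of sigma (se)
rstar <- rhat / sehat                       # standardized residuals
plot(x, rstar); abline(h = 0)               # (1) residuals versus the covariate
plot(fitted(fit), rstar); abline(h = 0)     # (2) residuals versus fitted values
qqnorm(rstar); qqline(rstar)                # (3) Normal qqplot of the residuals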
Figure 6.5: Residual plot for example in which model assumptions hold
Departures from the "expected" pattern may suggest problems with the model. For
example, the plot in Figure 6.6 suggests the mean function μi = μ(xi) is not correctly specified.
The pattern of points suggests that assuming a quadratic form for the mean, such as
μ(xi) = α + β xi + γ xi², might give a better fit to the data than μ(xi) = α + β xi.
Figure 6.7 suggests that for these data the variance is non-constant. Sometimes transforming
the response can solve this problem. Transformations such as log y and √y are
frequently used.
Figure 6.6: Residual plot suggesting that the assumed form of the mean function μ(x) is not correct
Reading these plots requires practice. You should try not to read too much into plots
particularly if the plots are based on a small number of points.
Figure 6.7: Example of residual plot which indicates that the assumption Var(Yi) = σ² is not reasonable
Figure 6.8: Standardized residuals versus diameter squared for the bolt data
A qqplot of the standardized residuals is given in Figure 6.9. Since the points lie
reasonably along a straight line the Gaussian assumption seems reasonable. Remember
that, since the quantiles of the Normal distribution change more rapidly in the tails of the
distribution, we expect the points at both ends of the line to lie further from the line.
Figure 6.9: Normal qqplot of the standardized residuals for the bolt data
and obtain the conclusions below as a special case of the linear model. Below we derive the
estimates from the likelihood directly.
The likelihood function for μ1, μ2, σ is
L(μ1, μ2, σ) = ∏_{j=1}^{2} ∏_{i=1}^{nj} [1/(√(2π) σ)] exp[ −(1/(2σ²)) (yji − μj)² ].
Maximizing this gives the maximum likelihood estimators
μ̃1 = (1/n1) Σ_{i=1}^{n1} Y1i = Ȳ1,
μ̃2 = (1/n2) Σ_{i=1}^{n2} Y2i = Ȳ2,
and σ̃² = [1/(n1 + n2)] Σ_{j=1}^{2} Σ_{i=1}^{nj} (Yji − μ̃j)².
where
Sj² = [1/(nj − 1)] Σ_{i=1}^{nj} (Yji − Ȳj)², j = 1, 2,
are the sample variances obtained from the individual samples. The estimator Sp² can be
written as a weighted average of the estimators Sj². In fact
Sp² = (w1 S1² + w2 S2²)/(w1 + w2)   (6.19)
where the weights are wj = nj 1. Although you could substitute weights other than
nj 1 in (6.19)42 , when you pool various estimators in order to obtain one that is better
than any of those being pooled, you should do so with weights that relate to a measure of
precision of the estimators. For sample variances, the number of degrees of freedom is such
an indicator.
We will use the estimator Sp² for σ² rather than σ̃², since E(Sp²) = σ².
To determine whether the two populations differ and by how much, we will need to generate
confidence intervals for the difference μ1 − μ2. First note that the maximum likelihood
estimator of this difference is Ȳ1 − Ȳ2, which has expected value
E(Ȳ1 − Ȳ2) = μ1 − μ2
and variance
Var(Ȳ1 − Ȳ2) = Var(Ȳ1) + Var(Ȳ2) = σ²/n1 + σ²/n2 = σ² (1/n1 + 1/n2).
This variance is estimated by
Sp² (1/n1 + 1/n2),
and the estimator Sp² has n1 − 1 + n2 − 1 = n1 + n2 − 2 degrees of freedom. This provides at least
an intuitive justification for the following:
Theorem 41 If Y11, Y12, ..., Y1n1 is a random sample from the G(μ1, σ) distribution and
independently Y21, Y22, ..., Y2n2 is a random sample from the G(μ2, σ) distribution then
[(Ȳ1 − Ȳ2) − (μ1 − μ2)] / [Sp √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2)
and
(n1 + n2 − 2) Sp²/σ² = (1/σ²) Σ_{j=1}^{2} Σ_{i=1}^{nj} (Yji − Ȳj)² ~ χ²(n1 + n2 − 2).
42
you would most likely be tempted to use w1 = w2 = 1=2:
estimated by
sp² = (1/22) [ Σ_{i=1}^{12} (y1i − ȳ1)² + Σ_{i=1}^{12} (y2i − ȳ2)² ].
To test H0: μ1 − μ2 = 0 we use the test statistic
D = |Ȳ1 − Ȳ2 − 0| / [Sp √(1/12 + 1/12)] = |Ȳ1 − Ȳ2| / [Sp √(1/12 + 1/12)].
This gives μ̂1 − μ̂2 = ȳ1 − ȳ2 = 1.4 and sp² = 2.3964. The observed value of the test statistic
is
d = |ȳ1 − ȳ2| / [sp √(1/12 + 1/12)] = 1.4 / √(2.3964 × (1/6)) = 2.22.
⁴³ If the sample variances differed by a great deal we would not make this assumption. Unfortunately, if
the variances are not assumed equal the problem becomes more difficult.
with
p-value = P(|T| ≥ 2.22) = 2 [1 − P(T ≤ 2.22)] = 0.038
where T ~ t(22). There is evidence based on the data against H0: μ1 = μ2.
Since ȳ1 > ȳ2, the indication is that paint A keeps its visibility better. A 95% confidence
interval for μ1 − μ2 based on (6.21) is obtained using
Remark: The R function t.test will carry out the test above and will give confidence
intervals for μ1 − μ2. This can be done with the command t.test(y1, y2, var.equal=TRUE),
where y1 and y2 are the data vectors from samples 1 and 2.
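For example, with hypothetical data vectors y1 and y2 of 12 measurements each (illustrative values only, not the paint data), the pooled analysis above could be reproduced as follows:
y1 <- c(7.3, 6.9, 8.1, 7.7, 6.5, 7.9, 8.4, 7.1, 6.8, 7.6, 8.0, 7.2)   # hypothetical sample 1
y2 <- c(6.2, 5.9, 6.8, 6.4, 5.5, 6.6, 7.1, 5.8, 5.6, 6.3, 6.7, 6.0)   # hypothetical sample 2
sp2 <- ((length(y1) - 1) * var(y1) + (length(y2) - 1) * var(y2)) /
       (length(y1) + length(y2) - 2)                      # pooled variance, as in (6.19)
d <- abs(mean(y1) - mean(y2)) / sqrt(sp2 * (1/12 + 1/12)) # observed test statistic
2 * (1 - pt(d, df = 22))                                  # p-value
t.test(y1, y2, var.equal = TRUE)                          # same test plus a 95% CI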
[Ȳ1 − Ȳ2 − (μ1 − μ2)] / √(S1²/n1 + S2²/n2)   (6.22)
small; the standard deviations s1 = 1.13 and s2 = 1.97 do not provide evidence against
the hypothesis that σ1 = σ2 if a likelihood ratio test is carried out. Nevertheless, let us
use (6.22) to obtain a 95% confidence interval for μ1 − μ2. The resulting approximate 95%
confidence interval is
ȳ1 − ȳ2 ± 1.96 √(s1²/n1 + s2²/n2).   (6.23)
For the given data this equals 1.4 ± 1.24, or [0.16, 2.64], which is not much different than
the interval obtained assuming the two Gaussian distributions have the same standard deviation.
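When the two standard deviations are not assumed equal, R's t.test uses the pivotal (6.22) by default (with an adjusted t approximation rather than the G(0, 1) approximation in (6.23)); a sketch, reusing the hypothetical y1 and y2 above:
t.test(y1, y2)          # var.equal = FALSE is the default (unequal variances)
# the large-sample interval (6.23) computed directly:
mean(y1) - mean(y2) + c(-1, 1) * 1.96 * sqrt(var(y1)/length(y1) + var(y2)/length(y2))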
The average score is somewhat higher in District 1, but is this difference statistically
significant? We will give a confidence interval for the difference in average scores in a model
representing this setting. This is done by thinking of the students in each district as a
random sample from a conceptual large population of "similar" students writing "similar"
tests. We assume that the scores in District 1 have a G(μ1, σ1) distribution and that
the scores in District 2 have a G(μ2, σ2) distribution. We can then test the hypothesis
H0: μ1 = μ2 or alternatively construct a confidence interval for the difference μ1 − μ2.
(Achievement tests are usually designed so that the scores are approximately Gaussian, so
this is a sensible procedure.)
Since n1 = 278 and n2 = 345 we use (6.23) to construct an approximate 95% confidence
interval for μ1 − μ2. We obtain
60.2 − 58.1 ± 1.96 √((10.16)²/278 + (9.02)²/345) = 2.1 ± (1.96)(0.779) or [0.57, 3.63].
Since μ1 − μ2 = 0 is outside the approximate 95% confidence interval (can you show that
it is also outside the approximate 99% confidence interval?) we can conclude there is fairly
strong evidence against the hypothesis H0: μ1 = μ2, suggesting that μ1 > μ2. We should
not rely only on a comparison of their means. It is a good idea to look carefully at the data
and the distributions suggested for the two groups using histograms or boxplots.
The mean is a little higher for District 1 and, because the sample sizes are so large, this
gives a "statistically significant" difference in a test of H0: μ1 = μ2. However, it would
be a mistake⁴⁴ to conclude that the actual difference in the two distributions is very large.
Unfortunately, "significant" tests like this are often used to make claims that one group
or class or school is "superior" to another, and such conclusions are unwarranted if, as is
often the case, the assumptions of the test are not satisfied.
and the difference μ1 − μ2. However, the heights of related persons are not independent,
so to estimate μ1 − μ2 the method in the preceding section should not be used since it
required that we have independent random samples of males and females. In fact, the
primary reason for collecting these data was to consider the joint distribution of Y1i, Y2i and
to examine their relationship. A clear picture of the relationship is obtained by plotting
the points (Y1i, Y2i) in a scatterplot.
consumptions Y1i, Y2i for the i'th car are related, because factors such as size, weight and
engine size (and perhaps the driver) affect consumption. As in the preceding example
it would not be appropriate to treat the Y1i's (i = 1, ..., 50) and Y2i's (i = 1, ..., 50)
as two independent samples from larger populations. The observations have been paired
deliberately to eliminate some factors (like driver/car size) which might otherwise affect
the conclusion. Note that in this example it may not be of much interest to consider E(Y1i)
and E(Y2i) separately, since there is only a single observation on each car type for either
fuel.
Two types of Gaussian models are used to represent settings involving paired data.
The first involves what is called a Bivariate Normal distribution for (Y1i, Y2i), and it could
be used in the fuel consumption example. This is a continuous bivariate model for which
each component has a Normal distribution and the components may be dependent. We
will not describe this model here⁴⁷ (it is studied in third year courses), except to note one
fundamental property: if (Y1i, Y2i) has a Bivariate Normal distribution then the difference
between the two is also Normally distributed;
Y1i − Y2i ~ N(μ1 − μ2, σ²)   (6.24)
The second type of model includes an unknown constant for each pair. These pair effects
represent factors specific to the different pairs, so that some pairs can have larger (smaller)
expected values than others. This model also gives a Gaussian distribution like (6.24), since
the pair effect cancels when the difference Y1i − Y2i is taken.
This model seems relevant for Example 6.3.2, where the i'th pair corresponds to the i'th car type.
Thus, whenever we encounter paired data in which the variation in variables Y1i and
Y2i is adequately modeled by Gaussian distributions, we will make inferences about μ1 − μ2
by working with the model (6.24).
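In R a paired analysis of this kind reduces to a one-sample analysis of the within-pair differences; a sketch with hypothetical paired vectors y1 and y2 of equal length (illustrative values only):
y1 <- c(10.2, 11.5, 9.8, 12.0, 10.9, 11.1)   # first measurement on each pair (hypothetical)
y2 <- c( 9.6, 10.9, 9.9, 11.2, 10.1, 10.8)   # second measurement on each pair (hypothetical)
d  <- y1 - y2                        # differences, modeled as in (6.24)
t.test(d)                            # CI and test for mu1 - mu2 based on the differences
t.test(y1, y2, paired = TRUE)        # equivalent paired analysis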
⁴⁷ For Stat 241: Let Y = (Y1, ..., Yk)ᵀ be a k × 1 random vector with E(Yi) = μi and Cov(Yi, Yj) = σij,
i, j = 1, ..., k. (Note: Cov(Yi, Yi) = σii = Var(Yi) = σi².) Let μ = (μ1, ..., μk)ᵀ be the mean vector and
Σ be the k × k symmetric covariance matrix whose (i, j) entry is σij. Suppose also that Σ⁻¹ exists. If the joint
p.d.f. of (Y1, ..., Yk) is given by
f(y1, ..., yk) = [1/((2π)^{k/2} |Σ|^{1/2})] exp[ −(1/2)(y − μ)ᵀ Σ⁻¹ (y − μ) ], y ∈ ℝᵏ,
where y = (y1, ..., yk)ᵀ, then Y is said to have a Multivariate Normal distribution. The case k = 2 is called
Bivariate Normal.
ȳ = 4.895 inches and s² = (1/1400) Σ_{i=1}^{1401} (yi − ȳ)² = 6.5480 (inches)².
The pivotal quantity (Ȳ − μ)/(S/√n) has a t(1400) distribution, so a two-sided 95% confidence interval for μ = E(Yi) is given
by ȳ ± 1.96 s/√n where n = 1401. (Note that t(1400) is indistinguishable from G(0, 1).)
This gives the 95% confidence interval 4.895 ± 0.134 inches or [4.76, 5.03] inches.
Remark: The method above assumes that the (brother, sister) pairs are a random sample
from the population of families with a living adult brother and sister. The question arises
as to whether E(Yi) also represents the difference in the average heights of all adult males
and all adult females (call them μ1′ and μ2′) in the population. Presumably μ1′ = μ1 (i.e.
the average height of all adult males equals the average height of all adult males who also
have an adult sister) and similarly μ2′ = μ2, so E(Yi) does represent this difference. This is
true provided that the males in the sibling pairs are randomly sampled from the population
of all adult males, and similarly the females, but it might be worth checking.
Recall our earlier Example 1.3.1 involving the difference in the average heights of males
and females in New Zealand. This gave the estimate μ̂1 − μ̂2 = ȳ1 − ȳ2 = 68.72 − 64.10 = 4.62
inches, which is a little less than the difference in the example above. This is likely due to
the fact that we are considering two distinct populations, but it should be noted that the
New Zealand data are not paired.
We note that it is slightly wider than the 95% confidence interval [4.76, 5.03] obtained
using the pairings.
To see why the pairing is helpful in estimating the mean difference μ1 − μ2, suppose that
Y1i ~ G(μ1, σ1) and Y2i ~ G(μ2, σ2), but that Y1i and Y2i are not necessarily independent
(i = 1, 2, ..., n). The estimator of μ1 − μ2 is
Ȳ1 − Ȳ2.
⁴⁸ from the old Stat 231 notes of MacKay and Oldford
Remark: The results here can be obtained using the R function t.test.
Exercise: Compute the p-value for the test of the hypothesis H0: μ1 − μ2 = 0, using the test
statistic (5.1).
Final Remarks: When you see data from a comparative study (that is, one whose
objective is to compare two distributions, often through their means), you have to determine
whether it involves paired data or not. Of course, a sample of Y1i's and Y2i's cannot be from
a paired study unless there are equal numbers of each, but if there are equal numbers the
study might be either "paired" or "unpaired". Note also that there is a subtle difference in
the study populations in paired and unpaired studies. In the former it is pairs of individual
units that form the population, whereas in the latter there are (conceptually at least)
separate individual units for the Y1 and Y2 measurements.
Yi ~ G(μi, σ) with μi = μ(xi) = Σ_{j=1}^{k} βj xij for i = 1, 2, ..., n independently.
(Note: To facilitate the matrix proof below we have taken β0 = 0 in (6.1). The estimator of
β0 can be obtained from the result below by letting xi1 = 1 for i = 1, ..., n and β0 = β1.)
For convenience we define the n × k (where n > k) matrix X of covariate values, whose
(i, j) element is xij, and the n × 1 vector of responses Y = (Y1, ..., Yn)ᵀ. We assume that the values xij
are non-random quantities which we observe. We now summarize some results about the
maximum likelihood estimators of the parameters β = (β1, ..., βk)ᵀ and σ.
Maximum Likelihood Estimators of β = (β1, ..., βk)ᵀ and of σ
β̃ = (XᵀX)⁻¹ Xᵀ Y   (6.25)
and
σ̃² = (1/n) Σ_{i=1}^{n} (Yi − μ̃i)² where μ̃i = Σ_{j=1}^{k} β̃j xij   (6.26)
l(β, σ) = log L(β, σ)
= −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − μi)².
Note that if we take the derivative with respect to a particular βj and set this derivative
equal to 0, we obtain
∂l/∂βj = (1/σ²) Σ_{i=1}^{n} (yi − μi) ∂μi/∂βj = 0
or
Σ_{i=1}^{n} (yi − μi) xij = 0
⁴⁹ May be omitted in Stat 231/221
for each j = 1, 2, ..., k. In terms of the matrix X and the vector y = (y1, ..., yn)ᵀ we can
rewrite this system of equations more compactly as
Xᵀ(y − Xβ) = 0 or Xᵀy = XᵀXβ.
Assuming that the k × k matrix XᵀX has an inverse we can solve these equations to obtain
the maximum likelihood estimate of β, in matrix notation, as
β̂ = (XᵀX)⁻¹ Xᵀ y
with corresponding maximum likelihood estimator
β̃ = (XᵀX)⁻¹ Xᵀ Y.
In order to find the maximum likelihood estimator of σ, we take the derivative with respect
to σ and set the derivative equal to zero, obtaining
∂l/∂σ = ∂/∂σ [ −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − μi)² ] = 0
or
−n/σ + (1/σ³) Σ_{i=1}^{n} (yi − μi)² = 0,
from which we obtain the maximum likelihood estimate of σ² as
σ̂² = (1/n) Σ_{i=1}^{n} (yi − μ̂i)²
where
μ̂i = Σ_{j=1}^{k} β̂j xij.
Recall that when we estimated the variance for a single sample from the Gaussian
distribution we considered a minor adjustment to the denominator, and with this in mind
we also define the following estimator⁵⁰ of the variance σ²:
Se² = [1/(n − k)] Σ_{i=1}^{n} (Yi − μ̃i)² = [n/(n − k)] σ̃².
Note that for large n there will be small differences between the observed values of σ̃² and
Se².
⁵⁰ It is clear why we needed to assume k < n: otherwise n − k ≤ 0 and we have no "degrees of freedom"
left for estimating the variance.
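A minimal R sketch of these matrix formulas, with a small simulated covariate matrix X whose first column is a column of ones (so that β1 plays the role of the intercept); all numbers are illustrative:
set.seed(1)
n <- 20; k <- 3
X <- cbind(1, runif(n), runif(n))                   # n x k matrix of covariate values
beta <- c(2, 1, -0.5)
y <- as.vector(X %*% beta + rnorm(n, sd = 0.3))     # simulated responses
betahat <- solve(t(X) %*% X, t(X) %*% y)            # (X'X)^{-1} X'y, equation (6.25)
muhat <- as.vector(X %*% betahat)
s2e <- sum((y - muhat)^2) / (n - k)                 # Se^2 with n - k in the denominator
betahat; s2e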
Theorem 43 1. The estimators β̃j are all Normally distributed random variables with
expected value βj and with variance given by the j'th diagonal element of the matrix
σ²(XᵀX)⁻¹, j = 1, 2, ..., k.
Proof.⁵² The estimator β̃j can be written using (6.25) as a linear combination of the
Normal random variables Yi,
β̃j = Σ_{i=1}^{n} bji Yi,
where the coefficients bji are the elements of the j'th row of the matrix B = (XᵀX)⁻¹Xᵀ, so
that β̃ = BY. Therefore
E(β̃j) = Σ_{i=1}^{n} bji E(Yi)
= Σ_{i=1}^{n} bji μi   where μi = Σ_{l=1}^{k} βl xil.
Note that μi is the i'th component of the vector Xβ, which implies that E(β̃j)
is the j'th component of the vector BXβ. But since BX is the identity matrix, this is
the j'th component of the vector β, or βj. Thus E(β̃j) = βj for all j. The calculation of
the variance is similar:
Var(β̃j) = Σ_{i=1}^{n} bji² Var(Yi)   since the Yi are independent random variables
= σ² Σ_{i=1}^{n} bji²,
and an easy matrix calculation shows, since BBᵀ = (XᵀX)⁻¹, that Σ_{i=1}^{n} bji² is the j'th
diagonal element of the matrix (XᵀX)⁻¹. We will not attempt to prove part (3) here;
it is usually proved in a subsequent statistics course.
52
This proof can be omitted for Stat 231.
Remark: The maximum likelihood estimate β̂ is also called a least squares estimate
of β, in that it is obtained by taking the sum of squared vertical distances between the
observations Yi and the corresponding fitted values μ̂i and then adjusting the values of the
estimated βj until this sum is minimized. Least squares is a method of estimation in linear
models that predates the method of maximum likelihood. Problem 16 describes the method
of least squares.
Remark:⁵³ From Theorem 39 we can obtain confidence intervals and test hypotheses for
the regression coefficients using the pivotal
(β̃j − βj) / (Se √cj) ~ t(n − k)   (6.28)
where cj is the j'th diagonal element of the matrix (XᵀX)⁻¹. Choosing a such that
P( −a ≤ (β̂j − βj)/(se √cj) ≤ a ) = p, where the probability is computed from the t(n − k)
distribution, we obtain the 100p% confidence interval
β̂j − a se √cj ≤ βj ≤ β̂j + a se √cj
where
se² = [1/(n − k)] Σ_{i=1}^{n} (yi − μ̂i)² and μ̂i = Σ_{j=1}^{k} β̂j xij.
⁵³ Recall: if Z ~ G(0, 1) and W ~ χ²(m) independently, then the random variable T = Z/√(W/m) ~ t(m).
Let Z = (β̃j − βj)/(σ√cj), W = (n − k)Se²/σ² and m = n − k to obtain this result.
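Continuing the matrix sketch above (same simulated X, y, betahat and s2e), the pivotal (6.28) gives confidence intervals for the individual coefficients; they can also be read off from confint applied to an lm fit:
cj <- diag(solve(t(X) %*% X))                       # c_j = j'th diagonal element of (X'X)^{-1}
se <- sqrt(s2e)
a  <- qt(0.975, df = n - k)
cbind(betahat - a * se * sqrt(cj),
      betahat + a * se * sqrt(cj))                  # 95% confidence intervals for each beta_j
# the same intervals from lm (the "- 1" removes lm's own intercept since X already has one):
confint(lm(y ~ X - 1))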
We now consider a special case of the Gaussian response models. We have already
seen this case in Chapter 4, but it provides a simple example to validate the more general
formulae.
(n − 1)S²/σ² ~ χ²(n − 1)
x y x y x y x y x y
46 136 37 115 58 139 48 134 59 142
36 132 45 129 50 156 35 120 54 135
62 138 39 127 41 132 42 137 57 150
26 115 28 134 31 115 27 120 60 159
53 143 32 133 51 143 34 128 38 127
x = 43:20 y = 133:56
Sxx = 2802:00 Syy = 3284:16 Sxy = 2325:20
To analyze these data assume the simple linear regression model: Yi ~ G(α + βxi, σ),
i = 1, ..., 25 independently.
(a) Give the maximum likelihood (least squares) estimates of α and β and an unbiased
estimate of σ².
(b) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(c) Construct a 95% confidence interval for β. What is the interpretation of this
interval?
(d) Construct a 90% confidence interval for the mean systolic blood pressure of
nurses aged x = 35.
(e) Construct a 99% prediction interval for the systolic blood pressure Y of a nurse
aged x = 50.
(f) Construct a 95% confidence interval for the mean amount grossed by movies for
actors whose value is x = 50. Construct a 95% confidence interval for the mean
amount grossed by movies for actors whose value is x = 100. What assumption
is being made in constructing the interval for x = 100?
(a) Construct a 95% con…dence interval for the mean breaking strength of bolts of
diameter x = 0:35, that is, x1 = (0:35)2 = 0:1225.
(b) Construct a 95% prediction interval for the breaking strength Y of a single bolt
of diameter x = 0:35. Compare this with the interval in (a).
(c) Suppose that a bolt of diameter 0:35 is exposed to a large force V that could
potentially break it. In structural reliability and safety calculations, V is treated
as a random variable and if Y represents the breaking strength of the bolt (or
some other part of a structure), then the probability of a “failure”of the bolt is
P (V > Y ). Give a point estimate of this value if V G(1:60; 0:10), where V
and Y are independent.
4. There are often both expensive (and highly accurate) and cheaper (and less accurate)
ways of measuring concentrations of various substances (e.g. glucose in human blood,
salt in a can of soup). The table below gives the actual concentration x (determined
by an expensive but very accurate procedure) and the measured concentration y
obtained by a cheap procedure, for each of 20 units.
x y x y x y x y
4:01 3:7 13:81 13:02 24:85 24:69 36:9 37:54
6:24 6:26 15:9 16 28:51 27:88 37:26 37:2
8:12 7:8 17:23 17:27 30:92 30:8 38:94 38:4
9:43 9:78 20:24 19:9 31:44 31:03 39:62 40:03
12:53 12:4 24:81 24:9 33:22 33:01 40:15 39:4
x = 23:7065 y = 23:5505
Sxx = 2818:946855 Syy = 2820:862295 Sxy = 2818:556835
To analyze these data assume the regression model: Yi ~ G(α + βxi, σ), i = 1, ..., 20
independently.
(a) Fit the model to these data. Use the plots discussed in Section 6.2 to check the
adequacy of the model.
(b) Construct a 95% confidence interval for the slope β and test the hypothesis
β = 1. Construct a 95% confidence interval for the intercept α and test the
hypothesis α = 0. Why are these hypotheses of interest?
(c) Describe brie‡y how you would characterize the cheap measurement process’s
accuracy to a lay person.
(d) If the units to be measured have true concentrations in the range 0 40, do you
think that the cheap method tends to produce a value that is lower than the true
concentration? Support your answer based on the data and the assumed model.
(a) Show that
β̂ = (Σ_{i=1}^{n} xi yi) / (Σ_{i=1}^{n} xi²)
is the maximum likelihood estimate of β and also the least squares estimate of β.
(b) Show that
β̃ = (Σ_{i=1}^{n} xi Yi) / (Σ_{i=1}^{n} xi²) ~ N( β, σ² / Σ_{i=1}^{n} xi² ).
Hint: Write β̃ in the form Σ_{i=1}^{n} ai Yi.
(c) Prove the identity
Σ_{i=1}^{n} (yi − β̂ xi)² = Σ_{i=1}^{n} yi² − (Σ_{i=1}^{n} xi yi)² / Σ_{i=1}^{n} xi².
7. The following data were recorded concerning the relationship between drinking
(x = per capita wine consumption) and y = death rate from cirrhosis of the liver in
n = 46 states of the U.S.A. (for simplicity the data has been rounded):
x y x y x y x y x y x y
5 41 12 77 7 67 4 52 7 41 16 91
4 32 7 57 18 57 16 87 13 67 2 30
3 39 14 81 6 38 9 67 8 48 6 28
7 58 12 34 31 130 6 40 28 123 3 52
11 75 10 53 13 70 6 56 23 92 8 56
9 60 10 55 20 104 21 58 22 76 13 56
6 54 14 58 19 84 15 74 23 98
3 48 9 63 10 66 17 98 7 34
x = 11:5870 y = 63:5870
Sxx = 2155:1522 Syy = 24801:1521 Sxy = 6175:1522
8. Skinfold body measurements are used to approximate the body density of individuals.
The data on n = 92 men, aged 20-25, where x = skinfold measurement and Y = body
density are given in Appendix C as well as being posted on the course website.
Note: The R function lm, with the command lm(y~x), gives the detailed calculations
for linear regression. The command summary(lm(y~x)) also gives useful output.
>Dataset<-read.table("Skinfold Data.txt",header=T,sep="",strip.white=T)
# reads data and headers from file Skinfold Data.txt
>RegModel <-lm(BodyDensity~Skinfold,data=Dataset)
# runs regression Bodydensity=a+b*Skinfold
>summary(RegModel) # summary of output on next page
Diagnostic plots.
>x<-Dataset$Skinfold
>y<-Dataset$BodyDensity
>muhat<-1.161139-0.062066*x
>plot(x,y)
>points(x,muhat,type="l")
>title(main="Scatterplot of Skinfold/BodyDensity with fitted line")
Residual Plots
>r<- RegModel$residuals
>x<- Dataset$Skinfold
>plot(x,r)
>title(main="residual plot: Skinfold vs residual")
>muhat=1.161139-0.062066*x
>plot(muhat,r)
>title(main="residual plot: fitted values vs residual")
>rstar <- r/0.007877
>plot(muhat,rstar)
(a) Run the R code given. What do the scatterplot and residual plots indicate about
the …t of the model?
(b) Do you think that the skinfold measurements provide a reasonable approximation
to the Body Density?
9. The following data, collected by a famous British botanist named Joseph Hooker in
the Himalaya Mountains between 1848 and 1850, relate atmospheric pressure to the
boiling point of water. Theory suggests that a graph of log pressure versus boiling
point should give a straight line.
(a) Let y = atmospheric pressure (in Hg) and x = boiling point of water (in °F).
Fit a simple linear regression model to the data (xi ; yi ), i = 1; : : : ; 31. Prepare
a scatterplot of y versus x and draw on the …tted line. Plot the standardized
residuals versus x. How well does the model …t these data?
(b) Let z = log y. Fit a simple linear regression model to the data (xi ; zi ), i =
1; : : : ; 31. Prepare a scatterplot of z versus x and draw on the …tted line. Plot
the standardized residuals versus x. How well does the model …t these data?
(c) Based on the results in (a) and (b) which data are best …t by a linear model?
Does this con…rm the theory’s model?
(d) Obtain a 95% con…dence interval for the mean atmospheric pressure if the boiling
point of water is 195 F .
10. An educator believes that the new directed readings activities in the classroom will
help elementary school students improve some aspects of their reading ability. She
arranges for a Grade 3 class of 21 students to take part in the activities for an 8-
week period. A control classroom of 23 Grade 3 students follows the same curriculum
without the activities. At the end of the 8-week period, all students are given a Degree
of Reading Power (DRP) test, which measures the aspects of reading ability that the
treatment is designed to improve. The data are:
24 43 58 71 43 49 61 44 67 49 53
Treatment Group:
56 59 52 62 54 57 33 46 43 57
42 43 55 26 62 37 33 41 19 54 20 85
Control Group:
46 10 17 60 53 42 37 42 55 28 48
Let y1j = the DRP test score for the treatment group, j = 1; : : : ; 21: Let y2j = the
DRP test score for the control group, j = 1; : : : ; 23: For these data
ȳ1 = 51.4762, Σ_{j=1}^{21} (y1j − ȳ1)² = 2423.2381
ȳ2 = 41.5217, Σ_{j=1}^{23} (y2j − ȳ2)² = 6469.7391
11. A study was done to compare the durability of diesel engine bearings made of two
di¤erent compounds. Ten bearings of each type were tested. The following table gives
the “times” until failure (in units of millions of cycles):
Type I: y1i 3:03 5:53 5:60 9:30 9:92 12:51 12:95 15:21 16:04 16:84
Type II: y2i 3:19 4:26 4:47 4:53 4:67 4:69 12:78 6:79 9:37 12:75
ȳ1 = 10.693, Σ_{i=1}^{10} (y1i − ȳ1)² = 209.02961; ȳ2 = 6.75, Σ_{i=1}^{10} (y2i − ȳ2)² = 116.7974
To analyze these data assume
Y1j ~ G(μ1, σ), j = 1, ..., 10 independently
Y2j ~ G(μ2, σ), j = 1, ..., 10 independently.
(a) Obtain a 90% confidence interval for the difference in the means μ1 − μ2.
(c) It has been suggested that log failure times are approximately Normally dis-
tributed, but not failure times. Assuming that the log Y ’s for the two types of
bearing are Normally distributed with the same variance, test the hypothesis
that the two distributions have the same mean. How does the answer compare
with that in part (b)?
(d) How might you check whether Y or log Y is closer to Normally distributed?
(e) Give a plot of the data which could be used to describe the data and your
analysis.
12. To compare the mathematical abilities of incoming …rst year students in Mathemat-
ics and Engineering, 30 Math students and 30 Engineering students were selected
randomly from their …rst year classes and given a mathematics aptitude test. A sum-
mary of the resulting marks xi (for the math students) and yi (for the engineering
students), i = 1; : : : ; 30, is as follows:
Math students: n = 30, ȳ1 = 120, Σ_{i=1}^{30} (y1i − ȳ1)² = 3050
Engineering students: n = 30, ȳ2 = 114, Σ_{i=1}^{30} (y2i − ȳ2)² = 2937
To analyze these data assume
Y1j ~ G(μ1, σ), j = 1, ..., 30 independently
Y2j ~ G(μ2, σ), j = 1, ..., 30 independently.
(a) Obtain a 95% confidence interval for the difference in mean scores for first year
Math and Engineering students.
(b) Test the hypothesis that the difference is zero.
13. Fourteen welded girders were cyclically stressed at 1900 pounds per square inch and
the numbers of cycles to failure were observed. The sample mean and variance of the
log failure times were y1 = 14:564 and s21 = 0:0914. Similar tests on ten additional
girders with repaired welds gave y2 = 14:291 and s22 = 0:0422. Log failure times are
assumed to be independent with a Gaussian distribution. Assuming equal variances
for the two types of girders, obtain a 90% con…dence interval for the di¤erence in
mean log failure times and test the hypothesis of no di¤erence.
14. Consider the data in Problem 9 of Chapter 1 on the lengths of male and female
coyotes.
(a) Construct a 95% confidence interval for the difference in mean lengths for the two
sexes. State your assumptions.
(b) Estimate P (Y1 > Y2 ) (give the maximum likelihood estimate), where Y1 is the
length of a randomly selected female and Y2 is the length of a randomly selected
male. Can you suggest how you might get a con…dence interval?
(c) Give separate con…dence intervals for the average length of males and females.
15. To assess the e¤ect of a low dose of alcohol on reaction time, a sample of 24 student
volunteers took part in a study. Twelve of the students (randomly chosen from the 24)
were given a …xed dose of alcohol (adjusted for body weight) and the other twelve got
a nonalcoholic drink which looked and tasted the same as the alcoholic drink. Each
student was then tested using software that ‡ashes a coloured rectangle randomly
placed on a screen; the student has to move the cursor into the rectangle and double
click the mouse. As soon as the double click occurs, the process is repeated, up to a
total of 20 times. The response variate is the total reaction time (i.e. time to complete
the experiment) over the 20 trials. The data are given below.
“Alcohol” Group:
1:33 1:55 1:43 1:35 1:17 1:35 1:17 1:80 1:68 1:19 0:96 1:46
ȳ1 = 16.44/12 = 1.370, Σ_{i=1}^{12} (y1i − ȳ1)² = 0.608
“Non-Alcohol” Group:
1:68 1:30 1:85 1:64 1:62 1:69 1:57 1:82 1:41 1:78 1:40 1:43
ȳ2 = 19.19/12 = 1.599, Σ_{i=1}^{12} (y2i − ȳ2)² = 0.35569
Analyze the data with the objective of determining whether there is any evidence
that the dose of alcohol increases reaction time. Justify any models that you use.
16. An experiment was conducted to compare gas mileages of cars using a synthetic oil
and a conventional oil. Eight cars were chosen as representative of the cars in general
use. Each car was run twice under as similar conditions as possible (same drivers,
routes, etc.), once with the synthetic oil and once with the conventional oil, the order
of use of the two oils being randomized. The average gas mileages were as follows:
Car                  1      2      3      4      5      6      7      8
Synthetic: y1i      21.2   21.4   15.9   37.0   12.1   21.1   24.5   35.7
Conventional: y2i   18.0   20.6   14.2   37.8   10.6   18.5   25.9   34.7
yi = y1i − y2i       3.2    0.8    1.7   −0.8    1.5    2.6   −1.4    1.0
ȳ1 = 23.6125, Σ_{i=1}^{8} (y1i − ȳ1)² = 535.16875
ȳ2 = 22.5375, Σ_{i=1}^{8} (y2i − ȳ2)² = 644.83875
ȳ = 1.075, Σ_{i=1}^{8} (yi − ȳ)² = 17.135
i=1
(a) Obtain a 95% con…dence interval for the di¤erence in mean gas mileage, and
state the assumptions on which your analysis depends.
(b) Repeat (a) if the natural pairing of the data is (improperly) ignored.
(c) Why is it better to take pairs of measurements on eight cars rather than taking
only one measurement on each of 16 cars?
17. The following table gives the number of staff hours per month lost due to accidents
in eight factories of similar size over a period of one year before and after the introduction
of an industrial safety program.
Factory i            1      2      3      4       5      6      7      8
After: y1i          28.7   62.2   28.9    0.0    93.5   49.6   86.3   40.2
Before: y2i         48.5   79.2   25.3   19.7   130.9   57.6   88.8   62.1
yi = y1i − y2i     −19.8  −17.0    3.6  −19.7   −37.4   −8.0   −2.5  −21.9
ȳ = −15.3375, Σ_{i=1}^{8} (yi − ȳ)² = 1148.79875
There is a natural pairing of the data by factory. Factories with the best safety records
before the safety program tend to have the best records after the safety program as
well. The analysis of the data must take this pairing into account and therefore the
model
Yi ~ G(μ, σ), i = 1, ..., 8 independently
is assumed for the differences Yi = Y1i − Y2i.
(a) The parameters μ and σ correspond to what attributes of interest in the study
population?
(b) Calculate a 95% confidence interval for μ.
(c) Test the hypothesis of no difference due to the safety program, that is, test the
hypothesis H0: μ = 0.
18. Comparing sorting algorithms: Suppose you want to compare two algorithms A
and B that will sort a set of numbers into an increasing sequence. (The R function,
sort(x), will, for example, sort the elements of the numeric vector x.) To compare
the speed of algorithms A and B, you decide to “present” A and B with random
permutations of n numbers, for several values of n. Explain exactly how you would
set up such a study, and discuss what pairing would mean in this context.
19. Sorting algorithms continued: Two sort algorithms as in the preceding problem
were each run on (the same) 20 sets of numbers (there were 500 numbers in each set).
The times taken by the two algorithms to sort each set are shown below.
Set:   1      2      3      4      5      6      7      8      9     10
A:    3.85   2.81   6.47   7.59   4.58   5.47   4.72   3.56   3.22   5.58
B:    2.66   2.98   5.35   6.43   4.28   5.06   4.36   3.91   3.28   5.19
yi:   1.19  −0.17   1.12   1.16   0.30   0.41   0.36  −0.35  −0.06   0.39
Set:  11     12     13     14     15     16     17     18     19     20
A:    4.58   5.46   3.31   4.33   4.26   6.29   5.04   5.08   5.08   3.47
B:    4.05   4.78   3.77   3.81   3.17   6.02   4.84   4.81   4.34   3.48
yi:   0.53   0.68  −0.46   0.52   1.09   0.27   0.20   0.27   0.74  −0.01
ȳ = 0.409, s² = (1/19) Σ_{i=1}^{20} (yi − ȳ)² = 0.237483
(a) Since the two algorithms are each run on the same 20 sets of numbers we analyse
the differences yi = yAi − yBi, i = 1, 2, ..., 20. Construct a 99% confidence
interval for the difference in the average time to sort with algorithms A and B,
assuming the differences have a Gaussian distribution.
(b) Use a Normal qqplot to determine if a Gaussian model is reasonable for the
di¤erences.
(c) Give a point estimate of the probability that algorithm B will sort a randomly
selected list faster than A.
(d) Another way to estimate the probability p in part (c) is to notice that of the 20
sets of numbers in the study, B sorted faster on 15 sets of numbers. Obtain an
approximate 95% con…dence interval for p. (It is also possible to get a con…dence
interval using the Gaussian model.)
(e) Suppose the study had actually been conducted using two independent samples of
size 20 each. Using the two sample Normal analysis determine a 99% con…dence
interval for the di¤erence in the average time to sort with algorithms A and B.
Note:
y1 = 4:7375 s21 = 1:4697 y2 = 4:3285 s22 = 0:9945
How much better is the paired study as compared to the two sample study?
21. Challenge Problem: Readings produced by a set of scales are independent and
Normally distributed about the true weight of the item being measured. A study
is carried out to assess whether the standard deviation of the measurements varies
according to the weight of the item.
(a) Ten weighings of a 10 kilogram weight yielded ȳ = 10.004 and s = 0.013 as the sample mean and standard deviation. Ten weighings of a 40 kilogram weight yielded ȳ = 39.989 and s = 0.034. Is there any evidence of a difference in the standard deviations for the measurements of the two weights?
(b) Suppose you had a further set of weighings of a 20 kilogram item. How could
you study the question of interest further?
22. Challenge Problem: Least squares estimation. Suppose you have a model where the mean of the response variable Yi given the covariates xi = (xi1, ..., xik) has the form
μi = E(Yi | xi) = μ(xi; β).
Show that the least squares estimate of β is the same as the maximum likelihood estimate of β in the Gaussian model Yi ~ G(μi, σ), when μi is of the form
μi = μ(xi; β) = Σ_{j=1}^{k} βj xij.
(a) Predictions take the form Ŷ = g(x), where g(·) is our “prediction” function. Show that E[(Ŷ − Y)²] is minimized by choosing g(x) = μ(x).
(b) Show that the minimum achievable value of E[(Ŷ − Y)²], that is, its value when g(x) = μ(x), is σ(x)².
This shows that if we can determine or estimate μ(x), then “optimal” prediction (in terms of Euclidean distance) is possible. Part (b) shows that we should try to find covariates x for which σ(x)² = Var(Y | x) is as small as possible.
(c) What happens when σ(x)² is close to zero? (Explain this in ordinary English.)
7. MULTINOMIAL MODELS
AND GOODNESS OF FIT TESTS
H0: θj = θj(α) for j = 1, ..., k    (7.2)

L(θ) = Π_{j=1}^{k} θj^{yj}.    (7.3)

Let Ω be the parameter space for θ. It was shown earlier that L(θ) is maximized over Ω (of dimension k − 1) by the vector θ̂ with θ̂j = yj/n, j = 1, ..., k. A likelihood ratio test of the hypothesis (7.2) is based on the likelihood ratio statistic

Λ = 2l(θ̃) − 2l(θ̃0) = −2 log[ L(θ̃0) / L(θ̃) ],    (7.4)

where θ̃0 maximizes L(θ) under the hypothesis (7.2), which restricts θ to lie in a space Ω0 of dimension p. (Note that Ω0 is the space of all (θ1(α), θ2(α), ..., θk(α)) as α varies over its possible values.) If H0 is true (that is, if θ really lies in Ω0) and n is large, the distribution of Λ is approximately χ²(k − 1 − p), so the approximate p-value for testing H0 is P(W ≥ λ) where W ~ χ²(k − 1 − p)
and λ = 2l(θ̂) − 2l(θ̂0) is the observed value of Λ. This approximation is very accurate when n is large and none of the θj's is too small. When the observed expected frequencies under H0 are all at least five, it is accurate enough for testing purposes.
The test statistic (7.4) can be written in a simple form. Let θ̃0 = (θ1(α̃), ..., θk(α̃)) denote the maximum likelihood estimator of θ under the hypothesis (7.2). Then, by (7.4), we obtain

Λ = 2l(θ̃) − 2l(θ̃0) = 2 Σ_{j=1}^{k} Yj log[ θ̃j / θj(α̃) ].

Since θ̃j = Yj/n, if we define the expected frequencies under H0 as

Ej = n θj(α̃) for j = 1, ..., k,

we can rewrite Λ as

Λ = 2 Σ_{j=1}^{k} Yj log(Yj / Ej).    (7.6)
An alternative test statistic, developed historically before the likelihood ratio test statistic, is the Pearson goodness of fit statistic

D = Σ_{j=1}^{k} (Yj − Ej)² / Ej.    (7.7)

The Pearson goodness of fit statistic has similar properties to Λ; for example, their observed values both equal zero when yj = ej = n θj(α̂) for all j = 1, ..., k and are larger when the yj's and ej's differ greatly. It turns out that, like Λ, the statistic D also has a limiting χ²(k − 1 − p) distribution when H0 is true.
The remainder of this chapter consists of the application of the general methods above
to some important testing problems.
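Both statistics are simple to compute in R. A minimal sketch, using hypothetical observed counts y and expected frequencies e computed under an H0 with p = 1 estimated parameter:

y <- c(18, 55, 27)                  # hypothetical observed frequencies
e <- c(20, 50, 30)                  # hypothetical expected frequencies under H0
lambda <- 2 * sum(y * log(y / e))   # likelihood ratio statistic (7.6)
d <- sum((y - e)^2 / e)             # Pearson statistic (7.7)
df <- length(y) - 1 - 1             # k - 1 - p
1 - pchisq(lambda, df)              # approximate p-values
1 - pchisq(d, df)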
Data collected on 100 persons gave y1 = 17, y2 = 46, y3 = 37, and we can use this to test the hypothesis H0 that (7.8) is correct. (Note that (Y1, Y2, Y3) ~ Multinomial(n; θ1, θ2, θ3) with n = 100.) The likelihood ratio test statistic is given by (7.6), but we have to find α̃ and then the Ej's. The likelihood function under (7.8) is

L1(α) = L(θ1(α), θ2(α), θ3(α))
      = c (α²)^{17} [2α(1 − α)]^{46} [(1 − α)²]^{37}
      = c α^{80} (1 − α)^{120}

where c is a constant. We easily find that α̂ = 0.40. The observed expected frequencies under (7.8) are therefore e1 = 100 α̂² = 16, e2 = 100[2α̂(1 − α̂)] = 48, e3 = 100(1 − α̂)² = 36. Clearly these are close to the observed frequencies y1 = 17, y2 = 46, y3 = 37. The observed value of the likelihood ratio statistic (7.6) is

λ = 2 Σ_{j=1}^{3} yj log(yj/ej) = 2 [ 17 log(17/16) + 46 log(46/48) + 37 log(37/36) ] = 0.17
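A minimal R sketch that reproduces this calculation and the corresponding approximate p-value (here the degrees of freedom are k − 1 − p = 1):

y <- c(17, 46, 37)
alphahat <- 0.40
e <- 100 * c(alphahat^2, 2 * alphahat * (1 - alphahat), (1 - alphahat)^2)   # 16, 48, 36
lambda <- 2 * sum(y * log(y / e))    # observed value, about 0.17
1 - pchisq(lambda, df = 1)           # approximate p-value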
Interval    0–100   100–200   200–300   300–400   400–600   600–800   >800
yj            29       22        12        10        10         9       8
ej           27.6     20.0      14.4      10.5      13.1       6.9     7.6

To calculate the expected frequencies we need an estimate of θ which is obtained by maximizing the likelihood function

L(θ) = Π_{j=1}^{7} [pj(θ)]^{yj},

so there is no evidence against the model (7.9). Note that the reason the χ² degrees of freedom are 5 is because k − 1 = 6 and p = dim(θ) = 1.
The goodness of fit test just discussed has some arbitrary elements, since we could have used different intervals and a different number of intervals. Theory has been developed on how best to choose the intervals. For this course we only give rough guidelines, which are: choose 4–10 intervals, so that the observed expected frequencies under H0 are at least 5.
so there is no evidence against the hypothesis that a Poisson model fits these data.
The observed value of the goodness of fit statistic is

d = Σ_{j=1}^{12} (fj − ej)²/ej = (57 − 54.3)²/54.3 + (203 − 210.3)²/210.3 + ··· + (6 − 5.9)²/5.9 = 12.96

so again there is no evidence against the hypothesis that a Poisson model fits these data.
The expected frequencies are all at least five so the approximate p-value is

p-value = P(Λ ≥ 50.36; H0) ≈ P(W ≥ 50.36) ≈ 0 where W ~ χ²(7)

and there is very strong evidence against the hypothesis that an Exponential model fits these data. This conclusion is not unexpected since, as we noted in Example 2.6.2, the observed and expected frequencies are not in close agreement at all. We could have chosen a different set of intervals for these continuous data but the same conclusion of a lack of fit would be obtained for any reasonable choice of intervals.
H0: θij = αi βj for i = 1, ..., a; j = 1, ..., b    (7.10)

where 0 < αi < 1, 0 < βj < 1, Σ_{i=1}^{a} αi = 1, Σ_{j=1}^{b} βj = 1. Note that

αi = P(an individual is type Ai)   and   βj = P(an individual is type Bj)

and that (7.10) is the standard definition for independent events: P(Ai ∩ Bj) = P(Ai)P(Bj).
We recognize that testing (7.10) falls into the general framework of Section 7.1, where k = ab, and the dimension of the parameter space under (7.10) is p = (a − 1) + (b − 1) = a + b − 2. All that needs to be done in order to use the statistics (7.6) or (7.7) to test H0 is to obtain the maximum likelihood estimates α̂i, β̂j under the model (7.10), and then calculate the expected frequencies eij.
Under the model (7.10), the likelihood function for the yij's is proportional to

L1(α, β) = Π_{i=1}^{a} Π_{j=1}^{b} [θij(α, β)]^{yij} = Π_{i=1}^{a} Π_{j=1}^{b} (αi βj)^{yij}.
α̂i = ri/n,   β̂j = cj/n,   i = 1, ..., a;  j = 1, ..., b,

where ri and cj are the row and column totals. The observed value of the likelihood ratio statistic (7.6) for testing the hypothesis (7.10) is then

λ = 2 Σ_{i=1}^{a} Σ_{j=1}^{b} yij log(yij / eij),

and the approximate p-value is

p-value = P(Λ ≥ λ; H0) ≈ P(W ≥ λ) where W ~ χ²((a − 1)(b − 1))
We can think of the Rh types as the A-type classification and the OAB types as the B-type classification in the general theory above. The row and column totals are also shown in the table, since they are the values needed to compute the eij's in (7.11).
To carry out the test that a person's Rh and OAB blood types are statistically independent, we merely need to compute the eij's by (7.11). For example, e11 = (244)(95)/300 = 77.3. The remaining expected frequencies can be obtained by subtraction and these are given in the table below in brackets next to the observed frequencies.

            O           A           B          AB       Total
Rh+      82 (77.3)   89 (94.4)   54 (49.6)   19 (22.8)   244
Rh−      13 (17.7)   27 (21.6)    7 (11.4)    9 (5.2)     56
Total        95         116          61          28       300

The degrees of freedom for the Chi-squared approximation are (a − 1)(b − 1) = (1)(3) = 3, which is consistent with the fact that, once we had calculated three of the expected frequencies, the remaining expected frequencies could be obtained by subtraction.
The observed value of the likelihood ratio test statistic is λ = 8.52, and the p-value is approximately P(W ≥ 8.52) = 0.036 where W ~ χ²(3), so there is evidence against the hypothesis of independence based on the data. Note that by comparing the eij's and the yij's we get some idea about the lack of independence, or relationship, between the two classifications. We see here that the degree of dependence does not appear large.
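A minimal R sketch of this calculation (the Pearson version of the test is also available directly through chisq.test):

y <- matrix(c(82, 89, 54, 19,
              13, 27,  7,  9), nrow = 2, byrow = TRUE)
e <- outer(rowSums(y), colSums(y)) / sum(y)      # expected frequencies under independence
lambda <- 2 * sum(y * log(y / e))                # observed likelihood ratio statistic, about 8.52
1 - pchisq(lambda, df = (2 - 1) * (4 - 1))       # approximate p-value, about 0.036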
θi1 + θi2 + ··· + θib = 1 for each i = 1, ..., a

H0: θ1 = θ2 = ··· = θa,    (7.12)

where n = n1 + ··· + na and y+j = Σ_{i=1}^{a} yij. Since ni = yi+ = Σ_{j=1}^{b} yij, the expected frequencies have exactly the same form as in the preceding section, when we lay out the data in a two-way table with a rows and b columns.
We can think of the persons receiving aspirin and those receiving placebo as two groups, and test the hypothesis

H0: θ11 = θ21,

where θ11 = P(stroke) for a person in the aspirin group and θ21 = P(stroke) for a person in the placebo group. The expected frequencies under H0: θ11 = θ21 are

eij = (yi+)(y+j)/476 for i = 1, 2 and j = 1, 2.

This gives the expected frequencies shown in the table in brackets. The observed value of the likelihood ratio statistic is

λ = 2 Σ_{i=1}^{2} Σ_{j=1}^{2} yij log(yij/eij) = 5.25
so there is evidence against H0 based on the data. A look at the yij's and the eij's indicates that persons receiving aspirin have had fewer strokes than expected under H0, suggesting that θ11 < θ21.
This test can be followed up with estimates for θ11 and θ21. Because each row of the table follows a Binomial distribution, the quantity

[ (θ̃11 − θ̃21) − (θ11 − θ21) ] / [ θ̃11(1 − θ̃11)/n1 + θ̃21(1 − θ̃21)/n2 ]^{1/2}

has approximately a G(0, 1) distribution, which can be used to construct an approximate confidence interval for θ11 − θ21.
Remark: This and other tests involving Binomial probabilities and contingency tables can be carried out using the R function prop.test.
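As an illustration of the remark, a sketch of a prop.test call for comparing two stroke probabilities (the counts below are made up for illustration, not the study data):

strokes <- c(12, 25)                       # hypothetical numbers of strokes in the two groups
n <- c(250, 250)                           # hypothetical group sizes
prop.test(strokes, n, correct = FALSE)     # test of H0: theta11 = theta21, with a confidence interval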
(a) Test the hypothesis that the probability of rust occurring is the same for the
rust-proofed cars as for those not rust-proofed. What do you conclude?
(b) Do you have any concerns about inferring that the rust-proofing prevents rust?
How might a better study be designed?
Test the hypothesis that the number of defective items Y in a single carton has a
Binomial(12; ) distribution. Why might the Binomial not be a suitable model?
Test whether a Poisson model for the number of interruptions Y on a single day is
consistent with these data.
5. The table below records data on 292 litters of mice classified according to litter size and number of females in the litter.
(a) For litters of size n (n = 1, 2, 3, 4) assume that the number of females in a litter of size n has a Binomial distribution with parameters n and θn = P(female). Test the Binomial model separately for each of the litter sizes n = 2, n = 3 and n = 4. (Why is it of scientific interest to do this?)
(b) Assuming that the Binomial model is appropriate for each litter size, test the hypothesis that θ1 = θ2 = θ3 = θ4.
1 1 6 8 10 22 12 15 0 0
2 26 1 20 4 2 0 10 4 19
2 3 0 5 2 8 1 6 14 2
2 2 21 4 3 0 0 7 2 4
4 7 16 18 2 13 22 7 3 5
Give an appropriate probability model for the number of digits between two successive zeros, if the pseudo random number generator is truly producing digits for which P(any digit = j) = 0.1, j = 0, 1, ..., 9, independent of any other digit. Construct a frequency table and test the goodness of fit of your model.
7. 1398 school children with tonsils present were classified according to tonsil size and absence or presence of the carrier for streptococcus pyogenes. The results were as follows:
8. The following data on heights of 210 married couples were presented by Yule in 1900.
Test the hypothesis that the heights of husbands and wives are independent.
9. In the following table, 64 sets of triplets are classified according to the age of their mother at their birth and their sex distribution:

                      3 boys   2 boys, 1 girl   1 boy, 2 girls   3 girls   Total
Mother under 30          5            8                9             7        29
Mother over 30           6           10               13             6        35
Total                   11           18               22            13        64

(a) Is there any evidence of an association between the sex distribution and the age of the mother?
(b) Suppose that the probability of a male birth is 0.5, and that the sexes of triplets are determined independently. Find the probability that there are y boys in a set of triplets, y = 0, 1, 2, 3, and test whether the column totals are consistent with this distribution.
10. A study was undertaken to determine whether there is an association between the
birth weights of infants and the smoking habits of their parents. Out of 50 infants of
above average weight, 9 had parents who both smoked, 6 had mothers who smoked
but fathers who did not, 12 had fathers who smoked but mothers who did not, and
23 had parents of whom neither smoked. The corresponding results for 50 infants of
below average weight were 21, 10, 6, and 13, respectively.
(a) Test whether these results are consistent with the hypothesis that birth weight
is independent of parental smoking habits.
(b) Are these data consistent with the hypothesis that, given the smoking habits of
the mother, the smoking habits of the father are not related to birth weight?
11. Purchase a box of smarties and count the number of each of the colours: red, green, yellow, blue, purple, brown, orange, pink. Test the hypothesis that each of the colours has the same probability, H0: θi = 1/8, i = 1, 2, ..., 8. The following R code (footnote 55) can be modified to give the two test statistics, the likelihood ratio test statistic and Pearson's Chi-squared D:
55. These are the frequencies of smarties for a large number of boxes consumed in Winter 2013.
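The code referred to above is not reproduced here; a minimal sketch along the same lines, using made-up colour counts, is:

y <- c(30, 25, 28, 33, 27, 31, 26, 29)    # hypothetical counts for the eight colours
e <- sum(y) / 8                           # expected frequencies under H0: theta_i = 1/8
lambda <- 2 * sum(y * log(y / e))         # likelihood ratio statistic
d <- sum((y - e)^2 / e)                   # Pearson statistic D
1 - pchisq(lambda, df = 7)                # approximate p-values
1 - pchisq(d, df = 7)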
other factors that affect y constant; often we don't even know what all the factors are. However, the definition serves as a useful ideal for how we should carry out studies in order to show that a causal relationship exists. We try to design studies so that alternative (to the variate x) explanations of what causes changes in attributes of y can be ruled out, leaving x as the causal agent. This is much easier to do in experimental studies, where explanatory variables may be controlled, than in observational studies. The following are brief examples.
groups. That is, the aspirin group and the placebo group both have similar variations in dietary and blood pressure values across the subjects in the group. Thus, a difference in the two groups should not be due to these factors.
8 cars of eight different types were used; each car was used for 8 test drives.
The cars were each driven twice for 600 km on the track at each of four speeds: 80, 100, 120 and 140 km/hr.
8 drivers were involved, each driving each of the 8 cars for one test, and each driving two tests at each of the four speeds.
The cars had similar initial mileages and were carefully checked and serviced so as to make them as comparable as possible; they used comparable fuels.
The drivers were instructed to drive steadily for the 600 km. Each was allowed a 30 minute rest stop after 300 km.
The order in which each driver did his or her 8 test drives was randomized. The track was large enough that all 8 drivers could be on it at the same time. (The tests were conducted over 8 days.)
The response variate was the amount of fuel consumed for each test drive. Obviously in the analysis we must deal with the fact that the cars differ in size and engine type, and their fuel consumption will depend on that as well as on driving speed. A simple approach would be to add the fuel amounts consumed for the 16 test drives at each speed, and to compare them (other methods are also possible). Then, for example, we might find that the average consumption (across the 8 cars) at 80, 100, 120 and 140 km/hr was 43.0, 44.1, 45.8 and 47.2 liters, respectively. Statistical methods of testing and estimation could then be used to test or estimate the differences in average fuel consumption at each of the four speeds. (Can you think of a way to do this?)
We want to see if females have a lower probability of admission than males. If we looked
only at the totals for Engineering plus Arts, then it would appear that the probability a
male applicant is admitted is a little higher than the probability for a female applicant.
However, if we look separately at Arts and Engineering, we see the probability for females
being admitted appears higher in each case! The reason for the reverse direction in the
totals is that Engineering has a higher admission rate than Arts, but the fraction of women
applying to Engineering is much lower than for Arts.
In cause and effect language, we would say that the faculty one applies to (i.e. Engineering or Arts) is a causative factor with respect to probability of admission. Furthermore, it is related to the sex (male or female) of an applicant, so we cannot ignore it in trying to see if sex is also a causative factor.
Remark: The feature illustrated in the example above is sometimes called Simpson's Paradox. In probabilistic terms, it says that for events A, B1, B2 and C1, ..., Ck, we can have

P(A | B1 Ci) > P(A | B2 Ci) for each i = 1, ..., k

but have

P(A | B1) < P(A | B2).

(Note that P(A | B1) = Σ_{i=1}^{k} P(A | B1 Ci) P(Ci | B1) and similarly for P(A | B2), so they depend on what P(Ci | B1) and P(Ci | B2) are.) In the example above we can take B1 = {person is female}, B2 = {person is male}, C1 = {person applies to Engineering}, C2 = {person applies to Arts}, and A = {person is admitted}.
Exercise: Write down estimated probabilities for the various events based on Example
8.3.1, and so illustrate Simpson’s paradox.
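A small numerical illustration in R of how this reversal can happen (the counts below are made up and are not those of Example 8.3.1):

admitted <- matrix(c(18, 30,    # female: Engineering, Arts
                     68,  5),   # male:   Engineering, Arts
                   nrow = 2, byrow = TRUE)
applied  <- matrix(c(20, 100,
                     80,  20),
                   nrow = 2, byrow = TRUE)
admitted / applied                       # admission rate within each faculty is higher for females
rowSums(admitted) / rowSums(applied)     # yet the overall admission rate is higher for males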
Epidemiologists (specialists in the study of disease) have developed guidelines or criteria
which should be met in order to argue that a causal association exists between a risk factor
x and a disease (represented by a response variable Y = I(person has the disease), for
example). These include
the need to account for other possible risk factors and to demonstrate that x and Y
are consistently related when these factors vary.
the demonstration that association between x and Y holds in different types of settings
57. The Coronary Drug Research Group, New England Journal of Medicine (1980), pg. 1038.
[Figure 8.1: Fishbone diagram of explanatory variates (such as weather and method of administration) affecting the occurrence of a fatal heart attack.]
Investigate the effect of clofibrate on the risk of fatal heart attack for patients with a history of a previous heart attack.
The target population consists of all individuals with a previous non-fatal heart attack who are at risk for a subsequent heart attack. The response of interest is the occurrence/non-occurrence of a fatal heart attack. This is primarily a causative problem in that the investigators are interested in determining whether the prescription of clofibrate causes a reduction in the risk of subsequent heart attack. The fishbone diagram (Figure 8.1) indicates a broad variety of factors affecting the occurrence (or not) of a heart attack.
Plan:
The study population consists of men aged 30 to 64 who had a previous heart attack not more than three months prior to initial contact. The sample consists of subjects from the study population who were contacted by participating physicians, asked to participate in the study, and provided informed consent. (All patients eligible to participate had to sign a consent form to participate in the study. The consent form usually describes the current state of knowledge regarding the best available relevant treatments, the potential advantages and disadvantages of the new treatment, and the overall purpose of the study.)
The following treatment protocol was developed:
Randomly assign eligible men to either clofibrate or placebo treatment groups. (This is an attempt to make the clofibrate and placebo groups alike with respect to most
explanatory variates other than the focal explanatory variate. See the fishbone diagram above.)
Follow patients for 5 years and record the occurrence of any fatal heart attacks experienced in either treatment group.
1,103 patients were assigned to clofibrate and 2,789 were assigned to the placebo group.
221 of the patients in the clofibrate group died and 586 of the patients in the placebo group died.
Analysis:
The proportions of patients in the two groups having subsequent fatal heart attacks (clofibrate: 221/1103 = 0.20 and placebo: 586/2789 = 0.21) are comparable.
Conclusions:
Clofibrate does not reduce mortality due to heart attacks in high risk patients.
This conclusion has several limitations. For example, study error has been introduced by restricting the study population to male subjects alone. While clofibrate might be discarded as a beneficial treatment for the target population, there is no information in this study regarding its effects on female patients at risk for secondary heart attacks.
Problem:
Investigate the occurrence of fatal heart attacks in the group of patients assigned to clofibrate who were adherers.
Plan:
Compare the occurrence of heart attacks in patients assigned to clofibrate who maintained the designated treatment schedule with the patients assigned to clofibrate who abandoned their assigned treatment schedule.
Data:
In the clofibrate group, 708 patients were adherers and 357 were non-adherers. The remaining 38 patients could not be classified as adherers or non-adherers and so were excluded from this analysis. Of the 708 adherers, 106 had a fatal heart attack during the five years of follow up. Of the 357 non-adherers, 88 had a fatal heart attack during the five years of follow up.
Analysis:
The proportion of adherers suffering from subsequent heart attack is given by 106/708 = 0.15 while this proportion for the non-adherers is 88/357 = 0.25.
Conclusions:
It would appear that clofibrate does reduce mortality due to heart attack for high risk patients if properly administered.
However, great care must be taken in interpreting the above results since they are based on an observational plan. While the data were collected based on an experimental plan, only the treatment was controlled. The comparison of the mortality rates between the adherers and non-adherers is based on an explanatory variate (adherence) that was not controlled in the original experiment. The investigators did not decide who would adhere to the protocol and who would not; the subjects decided themselves.
Now the possibility of confounding is substantial. Perhaps adherers are more health conscious and exercised more or ate a healthier diet. Detailed measurements of these variates are needed to control for them and reduce the possibility of confounding.
(a) Test the hypothesis that birth weight is independent of the mother’s smoking
habits.
(b) Explain why it is that these results do not prove that birth weights would increase
if mothers stopped smoking during pregnancy. How should a study to obtain
such proof be designed?
(c) A similar, though weaker, association exists between birth weight and the amount
smoked by the father. Explain why this is to be expected even if the father’s
smoking habits are irrelevant.
2. One hundred and fifty Statistics students took part in a study to evaluate computer-assisted instruction (CAI). Seventy-five received the standard lecture course while the other 75 received some CAI. All 150 students then wrote the same examination. Fifteen students in the standard course and 29 of those in the CAI group received a mark over 80%.
(a) Are these results consistent with the hypothesis that the probability of achieving
a mark over 80% is the same for both groups?
(b) Based on these results, the instructor concluded that CAI increases the chances
of a mark over 80%. How should the study have been carried out in order for
this conclusion to be valid?
(a) The following data were collected some years ago in a study of possible sex bias
in graduate admissions at a large university:
Test the hypothesis that admission status is independent of sex. Do these data
indicate a lower admission rate for females?
(b) The following table shows the numbers of male and female applicants and the
percentages admitted for the six largest graduate programs in (a):
Men Women
Program Applicants % Admitted Applicants % Admitted
A 825 62 108 82
B 560 63 25 68
C 325 37 593 34
D 417 33 375 35
E 191 28 393 24
F 373 6 341 7
Test the independence of admission status and sex for each program. Do any of
the programs show evidence of a bias against female applicants?
(c) Why is it that the totals in (a) seem to indicate a bias against women, but the
results for individual programs in (b) do not?
(a) Test the hypothesis that there is no difference between the mean amount of rust for rust-proofed cars as compared to non-rust-proofed cars.
(b) The manufacturer was surprised to find that the data did not show a beneficial effect of rust-proofing. Describe problems with their study and outline how you might carry out a study designed to demonstrate a causal effect of rust-proofing.
5. In randomized clinical trials that compare two (or more) medical treatments it is customary not to let either the subject or their physician know to which treatment they have been randomly assigned. These are referred to as double blind studies.
Discuss why doing a double blind study is a good idea in a causative study.
9.1 References
R.J. MacKay and R.W. Oldford (2001). Statistics 231: Empirical Problem Solving (Stat 231 Course Notes).
C.J. Wild and G.A.F. Seber (1999). Chance Encounters: A First Course in Data Analysis and Inference. John Wiley and Sons, New York.
J. Utts (2003). What Educated Citizens Should Know About Statistics and Probability. The American Statistician 57, 74–79.
Probability functions / probability density functions, means, variances and m.g.f.'s

Discrete (p.f.)

Binomial(n, p): f(y) = (n choose y) p^y q^{n−y}, y = 0, 1, ..., n; 0 < p < 1, q = 1 − p.
  Mean np;  Variance npq;  m.g.f. (pe^t + q)^n.

Bernoulli(p): f(y) = p^y (1 − p)^{1−y}, y = 0, 1; 0 < p < 1, q = 1 − p.
  Mean p;  Variance p(1 − p);  m.g.f. pe^t + q.

Negative Binomial(k, p): f(y) = (y+k−1 choose y) p^k q^y, y = 0, 1, 2, ...; 0 < p < 1, q = 1 − p.
  Mean kq/p;  Variance kq/p²;  m.g.f. [p/(1 − qe^t)]^k, t < −ln q.

Geometric(p): f(y) = p q^y, y = 0, 1, 2, ...; 0 < p < 1, q = 1 − p.
  Mean q/p;  Variance q/p²;  m.g.f. p/(1 − qe^t), t < −ln q.

Hypergeometric(N, r, n): f(y) = (r choose y)(N−r choose n−y)/(N choose n), y = 0, 1, ..., min(r, n); r ≤ N, n ≤ N.
  Mean nr/N;  Variance n(r/N)(1 − r/N)(N − n)/(N − 1);  m.g.f. intractable.

Poisson(θ): f(y) = θ^y e^{−θ}/y!, y = 0, 1, ...; θ > 0.
  Mean θ;  Variance θ;  m.g.f. e^{θ(e^t − 1)}.

Multinomial(n; θ1, ..., θk): f(y1, ..., yk) = [n!/(y1! y2! ··· yk!)] θ1^{y1} θ2^{y2} ··· θk^{yk},
  yi = 0, 1, ... with Σ_{i=1}^{k} yi = n;  θi ≥ 0 with Σ_{i=1}^{k} θi = 1.
  E(Yi) = nθi;  Var(Yi) = nθi(1 − θi).

Continuous (p.d.f.)

Uniform(a, b): f(y) = 1/(b − a), a ≤ y ≤ b.
  Mean (a + b)/2;  Variance (b − a)²/12;  m.g.f. (e^{bt} − e^{at})/[(b − a)t], t ≠ 0.

Exponential(θ): f(y) = (1/θ) e^{−y/θ}, y > 0; θ > 0.
  Mean θ;  Variance θ²;  m.g.f. 1/(1 − θt), t < 1/θ.

N(μ, σ²) or G(μ, σ): f(y) = [1/(σ√(2π))] e^{−(y−μ)²/(2σ²)}, −∞ < y < ∞; −∞ < μ < ∞, σ > 0.
  Mean μ;  Variance σ²;  m.g.f. e^{μt + σ²t²/2}.

Chi-squared(k): f(y) = [1/(2^{k/2} Γ(k/2))] y^{k/2−1} e^{−y/2}, y > 0; k > 0,
  where Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx.
  Mean k;  Variance 2k;  m.g.f. (1 − 2t)^{−k/2}, t < 1/2.

Student t(k): f(y) = c_k [1 + y²/k]^{−(k+1)/2}, −∞ < y < ∞, where c_k = Γ((k+1)/2)/[√(kπ) Γ(k/2)].
  Mean 0 if k > 1;  Variance k/(k − 2) if k > 2, undefined otherwise;  m.g.f. undefined.
Formulae

ȳ = (1/n) Σ_{i=1}^{n} yi        s² = [1/(n−1)] Σ_{i=1}^{n} (yi − ȳ)²        x̄ = (1/n) Σ_{i=1}^{n} xi        Sxx = Σ_{i=1}^{n} (xi − x̄)²

Syy = Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} yi² − nȳ²        Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = Σ_{i=1}^{n} (xi − x̄)yi

se² = [1/(n−2)] Σ_{i=1}^{n} (yi − α̂ − β̂xi)² = [1/(n−2)] (Syy − β̂Sxy)        sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)
Pivotals/Test Statistics

Random variable                                       Distribution    Mean or df          Standard Deviation
(Ȳ − μ)/(σ/√n)                                        Gaussian        0                   1
(n − 1)S²/σ²                                          Chi-squared     df = n − 1
(Ȳ − μ)/(S/√n)                                        Student t       df = n − 1
β̃ = Sxy/Sxx = (1/Sxx) Σ_{i=1}^{n} (xi − x̄)Yi          Gaussian        β                   σ (1/Sxx)^{1/2}
(β̃ − β)/(Se/√Sxx)                                     Student t       df = n − 2
α̃ = Ȳ − β̃x̄                                           Gaussian        α                   σ [1/n + x̄²/Sxx]^{1/2}
μ̃(x) = α̃ + β̃x                                         Gaussian        μ(x) = α + βx       σ [1/n + (x − x̄)²/Sxx]^{1/2}
[μ̃(x) − μ(x)] / (Se [1/n + (x − x̄)²/Sxx]^{1/2})       Student t       df = n − 2
Y − μ̃(x)                                              Gaussian        0                   σ [1 + 1/n + (x − x̄)²/Sxx]^{1/2}
[Y − μ̃(x)] / (Se [1 + 1/n + (x − x̄)²/Sxx]^{1/2})      Student t       df = n − 2
(n − 2)Se²/σ²                                         Chi-squared     df = n − 2
[Ȳ1 − Ȳ2 − (μ1 − μ2)] / (Sp √(1/n1 + 1/n2))           Student t       df = n1 + n2 − 2
(n1 + n2 − 2)Sp²/σ²                                   Chi-squared     df = n1 + n2 − 2
Approximate Pivotals

(θ̃ − θ) / [θ̃(1 − θ̃)/n]^{1/2} ~ N(0, 1) approximately, if θ̃ = Y/n and Y ~ Binomial(n, θ)

(Ȳ − θ) / (Ȳ/n)^{1/2} ~ N(0, 1) approximately, for a random sample from a Poisson(θ) distribution
Chapter 1
ū = (1/n) Σ_{i=1}^{n} (a + byi) = (1/n)(na + b Σ_{i=1}^{n} yi) = a + bȳ
on the value of y0. If y0 < y_((n+1)/2) then there are now an even number of observations and the new
1.2 (a)
su² = [1/(n−1)] Σ_{i=1}^{n} (ui − ū)² = [1/(n−1)] Σ_{i=1}^{n} [a + byi − (a + bȳ)]²
    = [1/(n−1)] Σ_{i=1}^{n} (byi − bȳ)² = b² [1/(n−1)] Σ_{i=1}^{n} (yi − ȳ)² = b² s²
so su = |b| s.
(b)
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} [yi² − 2yiȳ + ȳ²] = Σ_{i=1}^{n} yi² − 2ȳ Σ_{i=1}^{n} yi + nȳ²
                      = Σ_{i=1}^{n} yi² − 2nȳ² + nȳ² = Σ_{i=1}^{n} yi² − nȳ²
(c)
s²(y0) = (1/n) [ Σ_{i=1}^{n} yi² + y0² − (n+1) ((nȳ + y0)/(n+1))² ]
       = (1/n) [ Σ_{i=1}^{n} yi² + y0² − (1/(n+1)) (n²ȳ² + 2nȳy0 + y0²) ]
       = [1/(n(n+1))] [ (n+1) Σ_{i=1}^{n} yi² + (n+1)y0² − n²ȳ² − 2nȳy0 − y0² ]
       = [1/(n(n+1))] [ n Σ_{i=1}^{n} yi² − n²ȳ² + Σ_{i=1}^{n} yi² + ny0² − 2nȳy0 ]
       = [(n−1)/(n+1)] s² + [1/(n(n+1))] [ Σ_{i=1}^{n} yi² + ny0(y0 − 2ȳ) ]
Therefore
lim_{y0 → ±∞} s(y0) = lim_{y0 → ±∞} { [(n−1)/(n+1)] s² + [1/(n(n+1))] [ Σ_{i=1}^{n} yi² + ny0(y0 − 2ȳ) ] }^{1/2} = ∞
This means that an additional very large (or very small) observation has a large effect on the sample standard deviation.
(d) Once y0 is larger than q(0.75) or smaller than q(0.25), then y0 has little effect on the interquartile range as y0 increases or decreases.
1.3 Since
(1/n) Σ_{i=1}^{n} (ui − ū)³ / [ (1/n) Σ_{i=1}^{n} (ui − ū)² ]^{3/2}
  = (1/n) Σ_{i=1}^{n} (byi − bȳ)³ / [ (1/n) Σ_{i=1}^{n} (byi − bȳ)² ]^{3/2}
  = [ b³/(b²)^{3/2} ] (1/n) Σ_{i=1}^{n} (yi − ȳ)³ / [ (1/n) Σ_{i=1}^{n} (yi − ȳ)² ]^{3/2}
  = (b³/|b|³) g1
Since
(1/n) Σ_{i=1}^{n} (ui − ū)⁴ / [ (1/n) Σ_{i=1}^{n} (ui − ū)² ]²
  = (1/n) Σ_{i=1}^{n} (byi − bȳ)⁴ / [ (1/n) Σ_{i=1}^{n} (byi − bȳ)² ]²
  = [ b⁴/(b²)² ] (1/n) Σ_{i=1}^{n} (yi − ȳ)⁴ / [ (1/n) Σ_{i=1}^{n} (yi − ȳ)² ]² = g2
therefore the sample kurtosis is the same for both data sets.
1.5 (a) The relative frequency histogram of the piston diameters is given in Figure 12.1.
[Figure 12.1: Relative frequency histogram of the piston diameters.]
(d) P pk = 0:6184
(f)
1.6
1.7 The empirical c.d.f. is constructed by first ordering the data (smallest to largest) to obtain the order statistic: 0.01 0.39 0.43 0.45 0.52 0.63 0.72 0.76 0.85 0.88. Then the empirical c.d.f. is shown in Figure 12.2.
[Figure 12.2: Empirical c.d.f. (cumulative relative frequency) of the data.]
(b)
E(Ȳ) = E[ (1/n) Σ_{i=1}^{n} Yi ] = (1/n) Σ_{i=1}^{n} E(Yi) = (1/n)(nμ) = μ
(c)
E(S²) = [1/(n−1)] [ Σ_{i=1}^{n} E(Yi²) − n E(Ȳ²) ]
      = [1/(n−1)] [ Σ_{i=1}^{n} (σ² + μ²) − n (σ²/n + μ²) ]
      = [1/(n−1)] [ nσ² + nμ² − σ² − nμ² ]
      = [1/(n−1)] (n−1)σ² = σ²
1.9 (a) The relative frequency histograms are given in Figure 12.3.
[Figure 12.3: Relative frequency histograms of the lengths of female and male coyotes.]
(b) Five number summary for female coyotes: 71.0, 85.5, 89.75, 93.25, 102.5
    Five number summary for male coyotes: 78.0, 87.0, 92.0, 96.0, 105.0
(c) Female coyotes: x̄ = 89.24, s1 = 6.5482
    Male coyotes: ȳ = 92.06, s2 = 6.6960
1.10 (a) The two variates are Value (x) and Gross (y), where Value is the average amount the actor's movies have made (in millions of U.S. dollars), and Gross is the amount of the highest grossing movie in which the actor played as a major character (in millions of U.S. dollars). Since the goal is to study the effect of an actor's value (x) on the amount grossed in a movie (y), we choose x as the explanatory variate and y as the response variate.
(b) A scatterplot of the data is given in Figure 12.4.
[Figure 12.4: Scatterplot of gross versus value.]
r = Sxy / √(Sxx Syy)
  = [184540.93 − 20(860.6/20)(3759.5/20)] / { [43315.04 − 20(860.6/20)²]^{1/2} [971560.19 − 20(3759.5/20)²]^{1/2} }
  = 0.558
1.11 (a) Since the n people are selected at random from a large population it is reasonable
to assume that the people are independent and that the probability a randomly
(f) Since there are now four possible outcomes on each independent trial, the joint distribution of Y1 = no. of A types, Y2 = no. of B types, Y3 = no. of AB types, Y4 = no. of O types is given by the Multinomial distribution:
P(Y1 = y1, Y2 = y2, Y3 = y3, Y4 = y4) = [n!/(y1! y2! y3! y4!)] θ1^{y1} θ2^{y2} θ3^{y3} θ4^{y4}
for yi = 0, 1, ..., n, i = 1, 2, 3, 4 with Σ_{i=1}^{4} yi = n, and 0 < θi < 1, i = 1, 2, 3, 4 with Σ_{i=1}^{4} θi = 1.
(ii)
P(Ȳ − 1.96σ/√n ≤ μ ≤ Ȳ + 1.96σ/√n)
  = P( |Ȳ − μ| ≤ 1.96σ/√n ) = P(|Z| ≤ 1.96) where Z ~ G(0, 1)
  = 2P(Z ≤ 1.96) − 1
  = 2(0.975) − 1 = 0.95
(iii) We want P(|Ȳ − μ| ≤ 1.0) ≥ 0.95 where Ȳ ~ G(μ, 12/√n), or
P(|Ȳ − μ| ≤ 1.0) = P( |Ȳ − μ|/(12/√n) ≤ 1.0/(12/√n) ) = P(|Z| ≤ √n/12) ≥ 0.95 where Z ~ G(0, 1).
Since P(|Z| ≤ 1.96) = 0.95 we want √n/12 ≥ 1.96 or n ≥ (1.96)²(144) = 553.2. Therefore n = 554.
f(y) = (1/θ) e^{−y/θ} for y > 0 and θ > 0.
(i)
E(Ȳ) = (1/n) Σ_{i=1}^{n} E(Yi) = (1/n)(nθ) = θ
and Var(Ȳ) = (1/n²) Σ_{i=1}^{n} Var(Yi), since the Yi's are independent random variables,
           = (1/n²) Σ_{i=1}^{n} θ² = (1/n²)(nθ²) = θ²/n → 0 as n → ∞.
For large values of n, the sample mean Ȳ should be close to the mean θ.
(ii) By the Central Limit Theorem
P(Ȳ − 1.6449 θ/√n ≤ θ ≤ Ȳ + 1.6449 θ/√n)
  = P( |Ȳ − θ|/(θ/√n) ≤ 1.6449 ) ≈ P(|Z| ≤ 1.6449) where Z ~ N(0, 1)
  = 2P(Z ≤ 1.6449) − 1 = 2(0.95) − 1 = 0.9
P(Y1 = 0, Y2 = 2, Y3 = 0, Y4 = 1, Y5 = 3, Y6 = 1)
  = P(Y1 = 0) P(Y2 = 2) P(Y3 = 0) P(Y4 = 1) P(Y5 = 3) P(Y6 = 1)
  = (θ⁰e^{−θ}/0!)(θ²e^{−θ}/2!)(θ⁰e^{−θ}/0!)(θ¹e^{−θ}/1!)(θ³e^{−θ}/3!)(θ¹e^{−θ}/1!)
  = θ⁷ e^{−6θ} / 12 for θ > 0.
A reasonable estimate of the mean θ is the sample mean ȳ = 7/6. An estimate of the probability that there is at least one accident at this intersection next Wednesday is given by
1 − (7/6)⁰ e^{−7/6}/0! = 1 − e^{−7/6} = 0.6886.
(i)
E(Ȳ) = (1/n) Σ_{i=1}^{n} E(Yi) = (1/n)(nθ) = θ
and Var(Ȳ) = (1/n²) Σ_{i=1}^{n} Var(Yi), since the Yi's are independent random variables,
           = (1/n²) Σ_{i=1}^{n} θ = (1/n²)(nθ) = θ/n → 0 as n → ∞.
For large values of n, the sample mean Ȳ should be close to the mean θ.
(ii) By the Central Limit Theorem
P(Ȳ − 1.96√(θ/n) ≤ θ ≤ Ȳ + 1.96√(θ/n))
  = P( |Ȳ − θ|/√(θ/n) ≤ 1.96 ) ≈ P(|Z| ≤ 1.96) where Z ~ N(0, 1)
  = 2P(Z ≤ 1.96) − 1
  = 2(0.975) − 1
  = 0.95
Chapter 2
2.1 (a)
G(θ) = θ^a (1 − θ)^b, 0 < θ < 1
g(θ) = log G(θ) = a log θ + b log(1 − θ), 0 < θ < 1
g′(θ) = a/θ − b/(1 − θ) = [a(1 − θ) − bθ] / [θ(1 − θ)] = [a − (a + b)θ] / [θ(1 − θ)]
g′(θ) = 0 if θ = a/(a + b)
Since g′(θ) > 0 for 0 < θ < a/(a + b) and g′(θ) < 0 for a/(a + b) < θ < 1, then by the First Derivative Test g(θ) has a maximum value at θ = a/(a + b).
(b)
G(θ) = θ^{−a} e^{−b/θ}, θ > 0
g(θ) = log G(θ) = −a log θ − b/θ, θ > 0
g′(θ) = −a/θ + b/θ² = (−aθ + b)/θ²
g′(θ) = 0 if θ = b/a
Since g′(θ) > 0 for 0 < θ < b/a and g′(θ) < 0 for θ > b/a, then by the First Derivative Test g(θ) has a maximum value at θ = b/a.
(c)
G(θ) = θ^a e^{−bθ}, θ > 0
g(θ) = log G(θ) = a log θ − bθ, θ > 0
g′(θ) = a/θ − b = (a − bθ)/θ
g′(θ) = 0 if θ = a/b
Since g′(θ) > 0 for 0 < θ < a/b and g′(θ) < 0 for θ > a/b, then by the First Derivative Test g(θ) has a maximum value at θ = a/b.
(d)
G(θ) = e^{−a(θ − b)²}, θ ∈ ℝ
g(θ) = log G(θ) = −a(θ − b)², θ ∈ ℝ
g′(θ) = −2a(θ − b)
g′(θ) = 0 if θ = b
Since g′(θ) > 0 for θ < b and g′(θ) < 0 for θ > b, then by the First Derivative Test g(θ) has a maximum value at θ = b.
L(θ) = θ^{10} (1 − θ)^{90} for 0 < θ < 1.
Now
l′(θ) = 10/θ − 90/(1 − θ) = (10 − 100θ)/[θ(1 − θ)] = 0 if θ = 10/100 = 0.1

P(Y ≥ 10) ≈ P( Z ≥ (9.5 − 0.1n)/√(0.09n) ) where Z ~ N(0, 1).
Setting
(9.5 − 0.1n)/√(0.09n) = 1.2816
gives
n² − 204.78n + 9025 = 0.

Solving
l′(θ) = n(ȳ − 1)/θ − n/(1 − θ) = [ n(ȳ − 1)(1 − θ) − nθ ] / [θ(1 − θ)] = 0
gives the maximum likelihood estimate
θ̂ = (ȳ − 1)/ȳ.
If n = 200 and Σ_{i=1}^{200} yi = 400 then
ȳ = 400/200 = 2,  θ̂ = (2 − 1)/2 = 0.5
and R(θ) = (θ/0.5)^{200} ((1 − θ)/0.5)^{200} for 0 < θ < 1.
A graph of R(θ) is given in Figure 12.5.
(c) Since p = P(Y = 1; θ) is a function of θ, then by the Invariance property of maximum likelihood estimates the maximum likelihood estimate of p is obtained by substituting θ̂ = 0.5, which gives p̂ = 0.5.
or more simply
L(θ) = θ^{41} e^{−10θ} for θ > 0.
The log likelihood function is
l(θ) = 41 log θ − 10θ for θ > 0.
[Figure 12.5: Relative likelihood function R(θ).]
Solving
l′(θ) = 41/θ − 10 = 0
gives θ̂ = 4.1. Since
p = P(no transactions in a two minute interval; θ) = (2θ)⁰ e^{−2θ}/0! = e^{−2θ},
then by the invariance property of maximum likelihood estimates the maximum likelihood estimate of p is p̂ = e^{−2θ̂} = 0.000275.
L(θ) = Π_{i=1}^{n} f(yi; θ) = Π_{i=1}^{n} (2yi/θ) e^{−yi²/θ} for θ > 0
     = 2ⁿ ( Π_{i=1}^{n} yi ) θ^{−n} exp( −(1/θ) Σ_{i=1}^{n} yi² ),
or more simply
L(θ) = θ^{−n} exp( −(1/θ) Σ_{i=1}^{n} yi² ).
The log likelihood function is
l(θ) = −n log θ − (1/θ) Σ_{i=1}^{n} yi², θ > 0.
Solving
l′(θ) = −n/θ + (1/θ²) Σ_{i=1}^{n} yi² = (1/θ²) [ Σ_{i=1}^{n} yi² − nθ ] = 0
gives the maximum likelihood estimate
θ̂ = (1/n) Σ_{i=1}^{n} yi².
If n = 20 and Σ_{i=1}^{20} yi² = 72 then θ̂ = 72/20 = 3.6. A graph of R(θ) is given in Figure 12.6.
[Figure 12.6: Relative likelihood function R(θ).]
L(θ) = Π_{i=1}^{n} (θ + 1) yi^θ = (θ + 1)ⁿ ( Π_{i=1}^{n} yi )^θ for θ > −1.
Solving
(d/dθ) l(θ) = n/(1 + θ) + Σ_{i=1}^{n} log(yi) = 0
gives
θ̂ = −n / Σ_{i=1}^{n} log(yi) − 1.

r(θ) = l(θ) − l(θ̂) = n log[ (θ + 1)/(θ̂ + 1) ] + (θ − θ̂) Σ_{i=1}^{n} log(yi) for θ > −1.
If n = 15 and Σ_{i=1}^{15} log(yi) = −34.5 then θ̂ = 15/34.5 − 1 = −0.5652. The graph of r(θ) is given in Figure 12.7.
[Figure 12.7: Log relative likelihood function r(θ).]
2.7 (a)
P(MM) = P(FF) = (1/2)α + (1/2)(1/2)(1 − α) = (1 + α)/4
P(MF) = 1 − 2(1 + α)/4 = (1 − α)/2
where α = probability the pair is identical.
(b)
L(α) = [n!/(n1! n2! n3!)] [(1 + α)/4]^{n1} [(1 + α)/4]^{n2} [(1 − α)/2]^{n3} where n = n1 + n2 + n3,
or more simply
L(α) = (1 + α)^{n1 + n2} (1 − α)^{n3}.
Maximizing L(α) gives α̂ = (n1 + n2 − n3)/n. For n1 = 16, n2 = 16 and n3 = 18, α̂ = 0.28.
is
0 10
f0 1
X
n! 1 2 2 f2 @ yA
( ) f2 ( ymax fmax
)
f0 !f1 ! fmax !0! 1
y=yamx +1
f0 ymax
n! 1 2 Q yfy
:
f0 !f1 ! fmax ! 1 y=1
Now
2f0 f0 1
l0 ( ) = + + T
1 2 1
1 2
= 2T (f0 + 3T ) + T
(1 ) (1 2 )
and l′(θ) = 0 if
θ = { (f0 + 3T) ± [ (f0 + 3T)² − 8T² ]^{1/2} } / (4T)
and since
(c) The probability that a randomly selected family has x children is θ^x, x = 1, 2, .... Suppose for simplicity there are N different families where N is very large. Then the number of families that have x children is N × (probability a family has x children) = Nθ^x for x = 1, 2, ..., so there is a total of xNθ^x children in families of x children and a total of Σ_{x=1}^{∞} xNθ^x children altogether. Therefore the probability a randomly chosen child is in a family of x children is
xNθ^x / Σ_{x=1}^{∞} xNθ^x = c x θ^x, x = 1, 2, ....
Note that Σ_{x=0}^{∞} θ^x = 1/(1 − θ). Therefore taking derivatives
Σ_{x=1}^{∞} x θ^{x−1} = 1/(1 − θ)²  and  Σ_{x=1}^{∞} x θ^x = θ/(1 − θ)².
Solving
Σ_{x=1}^{∞} c x θ^x = 1
gives c = (1 − θ)²/θ and
P(X = x; θ) = [(1 − θ)²/θ] x θ^x = x (1 − θ)² θ^{x−1} for x = 1, 2, ... and 0 < θ ≤ 1/2.
(d) The probability of observing the given data for model (c) is
[33!/(22! 7! 3! 1!)] [(1 − θ)²]^{22} [2θ(1 − θ)²]^{7} [3θ²(1 − θ)²]^{3} [4θ³(1 − θ)²]^{1} for 0 < θ ≤ 1/2.
The likelihood function is
L(θ) = (1 − θ)^{2(22+7+3+1)} θ^{7+2(3)+3(1)}
     = θ^{16} (1 − θ)^{66} for 0 < θ ≤ 1/2
which is maximized for θ = 16/(16 + 66) = 16/82 = 8/41 = 0.1951.
Since the probability a family has no children is
P(Y = 0; θ) = (1 − 2θ)/(1 − θ) = g(θ),
then by the Invariance Property of maximum likelihood estimates the maximum likelihood estimate of g(θ) is
g(θ̂) = (1 − 2(0.1951))/(1 − 0.1951) = 0.7576.
(e) For these data f0 = 0 and T = 49, and l′(θ) = 0 has no solution. Since l′(θ) > 0 for all 0 < θ < 0.5, l(θ) is an increasing function on this interval. Thus the maximum value of l(θ) occurs at the endpoint θ = 0.5 and therefore θ̂ = 0.5.
2.9 (a) The likelihood function based on the Poisson model and the frequency table is
L(θ) = [696! / (69! 155! 171! 143! 79! 57! 14! 6! 2! 0!)]
       × (θ⁰e^{−θ}/0!)^{69} (θ¹e^{−θ}/1!)^{155} (θ²e^{−θ}/2!)^{171} (θ³e^{−θ}/3!)^{143} (θ⁴e^{−θ}/4!)^{79}
       × (θ⁵e^{−θ}/5!)^{57} (θ⁶e^{−θ}/6!)^{14} (θ⁷e^{−θ}/7!)^{6} (θ⁸e^{−θ}/8!)^{2} ( Σ_{y=9}^{∞} θ^y e^{−θ}/y! )^{0}
or more simply
L(θ) = θ^{0(69)+1(155)+2(171)+3(143)+4(79)+5(57)+6(14)+7(6)+8(2)} e^{−θ(69+155+171+143+79+57+14+6+2)}
     = θ^{1669} e^{−696θ}, θ > 0,
and
ey = 696 (1669/696)^y e^{−1669/696} / y!, y = 0, 1, ..., 8
There is quite good agreement between the observed and expected frequencies which indicates the Poisson model is very reasonable. Recall the homogeneity assumption for the Poisson process. Since a Poisson model fits the data well this suggests that Wayne was a very consistent player when he played with the Edmonton Oilers.
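A minimal R sketch of this fit, using the observed frequencies quoted above (the last cell collects y ≥ 9):

f <- c(69, 155, 171, 143, 79, 57, 14, 6, 2, 0)     # observed frequencies for y = 0,...,8 and y >= 9
n <- sum(f)                                        # 696
thetahat <- 1669 / 696                             # maximum likelihood estimate
e <- c(n * dpois(0:8, thetahat), n * (1 - ppois(8, thetahat)))   # expected frequencies
data.frame(y = c(0:8, "9+"), observed = f, expected = round(e, 1))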
2.10 (a)
θ̂ = Σ_{i=1}^{n} yi / Σ_{i=1}^{n} ti

p = P(X = 0; θ) = (θt)⁰ e^{−θt}/0! = e^{−θt}
Suppose that x intervals were observed with no particles. Since X ~ Binomial(n, p), the maximum likelihood estimate of p is p̂ = x/n. Since p = e^{−θt} implies θ = −(log p)/t, then by the Invariance Property of maximum likelihood estimates θ̂ = −(log p̂)/t.
P(Y ∈ [μ − σ, μ + σ]) = P(|Y − μ| ≤ σ) = P( |Y − μ|/σ ≤ 1 )
The proportion of observations in the interval (0.71) is slightly higher than what would be expected for Normal data (0.68).
(d)
IQR = 22 − 16.25 = 5.75.
To show that for Normally distributed data IQR = 1.349σ we need to solve
0.5 = P(|Y − μ| ≤ cσ) = P( |Y − μ|/σ ≤ c )
for c if Y ~ G(μ, σ).
line since the quantiles of the Normal distribution change more rapidly in both tails of the distribution.
The proportion of observations in the interval [ȳ − s, ȳ + s] is slightly higher than we would expect for Normally distributed data. This also agrees with the sample kurtosis value of 4.3 being larger than 3.
Overall, except for the two outliers, it seems reasonable to assume that the data are approximately Normally distributed. It would be a good idea to do any formal analyses of the data with and without the outliers to determine the effect of these outliers on the conclusions of these analyses.
2.12 (a) The sample mean ȳ = 159.77 and the sample standard deviation s = 6.03 for the data.
(b) The number of observations in the interval [ȳ − s, ȳ + s] = [153.75, 165.80] is 244 or 69.5% and the actual number of observations in the interval [ȳ − 2s, ȳ + 2s] = [147.72, 171.83] is 334 or 95.2%. If Y ~ G(μ, σ) then
P(Y ∈ [μ − σ, μ + σ]) = P(|Y − μ| ≤ σ) = P( |Y − μ|/σ ≤ 1 )
[Figure: Relative frequency histogram and empirical c.d.f. of the heights with the G(159.77, 6.03) p.d.f. and c.d.f. superimposed, together with a boxplot and Normal qqplot of the heights.]
(j) All the numerical summaries indicate good agreement with the Gaussian as-
sumption. The relative frequency histogram has the shape of a Gaussian proba-
bility density function. The empirical cumulative distribution function and the
Gaussian cumulative distribution function also have similar shapes. The box-
plot is consistent with Gaussian data and the points in the qqplot lie reasonably
along a straight line also indicating good agreement with a Gaussian model. A
Gaussian distribution seems very reasonable for these data.
2.14 See Figure 12.9. Note that the qqplot for the log yi ’s is far more linear than for the
yi ’s indicating that the Normal model is more reasonable for the transformed data.
2.14 (a) If they are independent, P(S and H) = P(S)P(H) = αβ. The others are similar.
(b) The Multinomial probability function evaluated at the observed values is
L(α, β) = [100!/(20! 15! 22! 43!)] (αβ)^{20} [α(1 − β)]^{15} [(1 − α)β]^{22} [(1 − α)(1 − β)]^{43}
[Figure: Normal qqplot of the data (sample quantiles versus N(0, 1) quantiles).]
or
( 35(42)/100, 35(58)/100, 65(42)/100, 65(58)/100 ) = (14.7, 20.3, 27.3, 37.7)
which can be compared with 20, 15, 22, 43. The observed and expected frequencies do not appear to be very close. In Chapter 7 we will see how to construct a formal test of the model.
2.16
2.17 (a) The median of the N(0, 1) distribution is m = 0. Reading from the qqplot, the sample quantile on the y axis which corresponds to 0 on the x axis is approximately equal to 1.0, so the sample median is approximately 1.0.
(b) To determine q(0.25) for these data we note that P(Z ≤ −0.6745) = 0.25 if Z ~ N(0, 1). Reading from the qqplot, the sample quantile on the y axis which corresponds to −0.67 on the x axis is approximately equal to 0.4, so q(0.25) is approximately 0.4. To determine q(0.75) for these data we note that P(Z ≥ 0.6745) = 0.25 if Z ~ N(0, 1). Reading from the qqplot, the sample quantile on the y axis which corresponds to 0.67 on the x axis is approximately equal to 1.5, so q(0.75) is approximately 1.5. The IQR for these data is approximately 1.5 − 0.4 = 1.1.
(c) The frequency histogram of the data would be approximately symmetric about
the sample mean.
(d) The frequency histogram would most resemble a Uniform probability density
function.
2.18 (a) If there is adequate mixing of the tagged animals, the number of tagged animals
caught in the second round is a random sample selected without replacement so
follows a hypergeometric distribution (see the STAT 230 Course Notes).
(b)
L(N + 1)/L(N) = (N + 1 − k)(N + 1 − n) / [ (N + 1 − k − n + y)(N + 1) ]
2.19
L(θ) = Π_{i=1}^{n} f(yi; θ) = Π_{i=1}^{n} (1/θ) if yi ≤ θ, i = 1, 2, ..., n
     = 1/θⁿ if y(n) = max(y1, ..., yn) ≤ θ,
where 1/θⁿ is a decreasing function of θ. Note also that L(θ) = 0 for 0 < θ < y(n). Therefore the maximum value of L(θ) occurs at θ = y(n) and therefore the maximum likelihood estimate of θ is θ̂ = y(n).
2.20 (a)
P(Y > C; θ) = ∫_C^∞ (1/θ) e^{−y/θ} dy = e^{−C/θ}.
(b) For the i'th piece that failed at time yi < C, the contribution to the likelihood is (1/θ) e^{−yi/θ}. For those pieces that survive past time C, the contribution to the likelihood is the probability of the event, P(Y > C; θ) = e^{−C/θ}. Therefore the likelihood is
L(θ) = [ Π_{i=1}^{k} (1/θ) e^{−yi/θ} ] ( e^{−C/θ} )^{n−k}
l(θ) = −k log θ − (1/θ) Σ_{i=1}^{k} yi − (n − k)C/θ
θ̂ = (1/k) [ Σ_{i=1}^{k} yi + (n − k)C ].
(c) When k = 0 and C > 0 the maximum likelihood estimate is θ̂ = ∞. In this case there are no failures in the time interval [0, C] and this is more likely to happen as the expected value of the Exponential gets larger and larger.
and, ignoring the terms yi! which do not contain the parameters, the log likelihood is
l(α, β) = Σ_{i=1}^{n} [ yi(α + βxi) − e^{α + βxi} ].
For a given set of data we can solve this system of equations numerically but not explicitly.
Chapter 3
3.1 (a) The Problem is to determine the proportion of eligible voters who plan to vote
and, of those, the proportion who plan to support the party. This is a descriptive
Problem since the aim of the study is to determine the attributes just mentioned
for a population of eligible voters.
(b) The target population is all eligible voters. This would include those eligible
voters in all regions and those with/without telephone numbers on the list.
(c) One variate is whether or not an eligible voter plans to vote which is a categorical
variate. Another variate is whether or not an eligible voter supports the party
which is also a categorical variate.
(d) The study population is all eligible voters on the list.
(e) The sample is the 1104 eligible voters who responded to the questions.
(f) A possible source of study error is that the polling firm only called eligible voters in urban areas. Urban eligible voters may have different views than rural eligible voters – this is a difference between the target and study populations. Eligible voters with phones may have different views than those without.
(g) A possible source of sample error is that many of the people called refused to participate in the survey. People who refuse to participate may have different voting preferences as compared to people who participate. For example, people who refuse to participate in the survey may also be less likely to vote.
(h) Attribute 1 is the proportion of units who plan to vote. An estimate of this
attribute based on the data is: 732=1104.
Attribute 2 is the proportion of those who plan to vote who also plan to support
the party. An estimate of this attribute based on the data is: 351=732.
3.2 (a) This study is an experimental study since the researchers are in control of which
schools received the regular curriculum and which schools are using the JUMP
program.
(b) The Problem is to compare the performance in math of students at Ontario
schools using the current provincial curriculum as compared to the performance
in math of students at Ontario schools using the JUMP math program.
(c) This is a causative problem since the researchers are interested in whether the JUMP program causes better student performance in math.
(d) The target population is all elementary students in Ontario public schools at the time of the study, or all elementary students in Ontario public schools at the time of the study and into the future.
(e) One variate of interest is whether a student receives the standard curriculum or the standard curriculum plus the JUMP program, which is a categorical variate. Another variate of interest is classroom test scores, which is a discrete variate since scores only take on a countable number of values.
(f) The study population is all Ontario elementary students in Grades 2 and 5 in public schools at the time of the study.
(g) The sampling protocol was to select the schools in one school board in Ontario.
The researchers did not indicate how this school board was chosen.
(h) A possible source of study error is that the ability of students in Grades 2 and 5 to learn math skills might be different than students in other grades.
(i) A possible source of sampling error is that the schools in the chosen school board may not be representative of all the elementary schools in Ontario. For example, the schools in the chosen board may have larger class sizes compared to other schools. Students in larger classes may not receive as much help to improve their math skills as students in smaller classes. Another example is that the chosen school board might be in a low income area of a city. Students from low income families may respond differently to changes in the Math curriculum as compared to students from middle class families.
(j) It is unclear from the article what type of classroom tests will be used or how they will be graded. So depending on how this is done it could lead to measurement error. For example, different schools may use different grading criteria for the same test.
(k) Randomization ensures that the difference in the learning outcome is only due to different teaching programs, and not due to other potential confounders (e.g., class size, parents' education level, parents' social economic status, etc.).
3.3 (a) This is an experimental study because the researchers controlled, using ran-
domization, which students were assigned to the racing-type game and which
students were assigned to the game of solitaire.
(b) The Problem is to determine whether playing racing games makes players more
likely to take risks in a simulated driving test.
(c) This is a causative type Problem because the researchers were interested in
whether playing racing games as compared to playing a game like solitaire caused
players to take more risks in the driving test.
(d) A suitable target population for this study is young adults living in China at the
time of the study OR students attending university in China at the time of the
study.
(e) One important variate is whether the student played the racing-type driving game or the game of solitaire. This is a categorical variate. The other important variate was how long, in seconds, the student waited to hit the “stop” key in the Vienna Risk-Taking Test. This is a continuous variate.
(f) A suitable study population for this study is students attending Xi’an Jiatong
University at the time of the study.
(g) From the article it appears that the researchers recruited volunteers for the study.
The article does not indicate how these volunteers were obtained.
(h) If the target population is young adults living in China and the study population is students attending university in China at the time of the study, then a possible source of study error is that students who attend university are more educated and more intelligent (on average) and therefore possibly different in their levels of risk-taking as compared to young adults in China not attending university.
(i) Since the sample consisted of volunteers and not a random sample of students from the Xi'an Jiatong University, a possible source of sample error is that students who volunteer for such studies are more likely to take risks than non-volunteers who might be more conservative. The risk-taking habits of the volunteers (on average) may be different than the risk-taking habits of all students at the Xi'an Jiatong University.
(j) The attribute of most interest is the average or mean difference in the time to hit the “stop” key in the Vienna Risk-Taking Test between young adults who play racing games compared to young adults who play neutral games. An estimate of this based on the given data is 12 − 10 = 2 seconds.
Chapter 4
(a)
P(|θ̃ − θ| ≤ 0.03) = P( |Y/1000 − 0.5| ≤ 0.03 )
  = P( −0.03/√(0.5(0.5)/1000) ≤ (Y/1000 − 0.5)/√(0.5(0.5)/1000) ≤ 0.03/√(0.5(0.5)/1000) )
  ≈ P(−1.90 ≤ Z ≤ 1.90) where Z ~ N(0, 1)
  = 2P(Z ≤ 1.90) − 1 = 2(0.97128) − 1
  = 0.94256
(b)
P(|θ̃ − θ| ≤ 0.03) = P( |Y/n − 0.5| ≤ 0.03 )
  = P( −0.03/√(0.5(0.5)/n) ≤ (Y/n − 0.5)/√(0.5(0.5)/n) ≤ 0.03/√(0.5(0.5)/n) )
  ≈ P(−0.06√n ≤ Z ≤ 0.06√n)
where Z ~ N(0, 1). Since P(−1.96 ≤ Z ≤ 1.96) = 0.95, we need 0.06√n ≥ 1.96 or n ≥ (1.96/0.06)² = 1067.1. Therefore n should be at least 1068.
(c)
P(|θ̃ − θ| ≤ 0.03) = P( |Y/n − θ| ≤ 0.03 )
  = P( −0.03/√(θ(1 − θ)/n) ≤ (Y/n − θ)/√(θ(1 − θ)/n) ≤ 0.03/√(θ(1 − θ)/n) )
  ≈ P( −0.03√n/√(θ(1 − θ)) ≤ Z ≤ 0.03√n/√(θ(1 − θ)) )
where Z ~ N(0, 1). Since P(−1.96 ≤ Z ≤ 1.96) = 0.95, we need
0.03√n/√(θ(1 − θ)) ≥ 1.96
or
n ≥ (1.96/0.03)² θ(1 − θ).
Since
(1.96/0.03)² θ(1 − θ) ≤ (1.96/0.03)² (0.5)² = 1067.1,
n should again be at least 1068.
4.4 (a) Suppose the experiment which was used to estimate the parameter was conducted a large number of times and each time a 95% confidence interval for the parameter was constructed using the observed data. Then, approximately 95% of these constructed intervals would contain the true, but unknown, value of the parameter. Since we only have one interval [42.8, 47.8] we do not know whether it contains the true value or not. We can only say that we are 95% confident that the given interval [42.8, 47.8] contains the true value, since we are told it is a 95% confidence interval. In other words, we hope we were one of the “lucky” 95% who constructed an interval containing the true value. Warning: You cannot say that the probability that the interval [42.8, 47.8] contains the true value of the parameter is 0.95!
(b) An approximate 95% confidence interval for the proportion of Canadians whose mobile phone is a smartphone is
θ̂ ± 1.96 √( θ̂(1 − θ̂)/n ) = 0.45 ± 1.96 √( 0.45(0.55)/1000 ) = 0.45 ± 0.03083
= [0.4192, 0.4808].
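A one-line R check of this interval, using the values quoted above:

thetahat <- 0.45; n <- 1000
thetahat + c(-1, 1) * 1.96 * sqrt(thetahat * (1 - thetahat) / n)   # gives [0.4192, 0.4808]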
4.5 Let Y = number of women who tested positive. Assume the model Y ~ Binomial(n, θ). Since P(−2.58 ≤ Z ≤ 2.58) = 0.99, an approximate 99% confidence interval is given by:
θ̂ ± 2.58 √( θ̂(1 − θ̂)/n ) = 64/29000 ± 2.58 √( (64/29000)(28936/29000)/29000 ) = 0.0022 ± 0.0007
= [0.0015, 0.0029].
The Binomial model assumes that the 29,000 women represent 29,000 independent trials and that the probability that a randomly chosen woman is HIV positive is equal to θ. The women may not represent independent trials and the probability that a randomly chosen woman is HIV positive may be higher among certain high risk women such as women who are intravenous drug users.
4.6 (a) If Y is the number who support this information then Y ~ Binomial(n, θ). An approximate 95% confidence interval is given by
0.7 ± 1.96 √( 0.7(0.3)/200 ) = 0.7 ± 0.0635
= [0.6365, 0.7635].
(b) The Binomial model assumes that the 200 people represent 200 independent
trials. If 100 of the people interviewed were 50 married couples then the two
people in a couple are probably not independent with respect to their views.
The 15% likelihood interval can be obtained from the graph of R(θ) given in Figure 12.5 or by using the R command:
uniroot(function(x)((4*x*(1-x))^200-0.15),lower=0.4,upper=0.5).
The 15% likelihood interval is [0.45, 0.55].
The 15% likelihood interval can be obtained from the graph of R(θ) given in Figure 12.6 or by using the R command:
uniroot(function(x)(((3.6/x)*exp(1-3.6/x))^20-0.15),lower=2.0,upper=3.0).
The 15% likelihood interval is [2.40, 5.76].
The 15% likelihood interval can be obtained from the graph of r(θ) given in Figure 12.7 or by using the R command:
uniroot(function(x)(15*log(2.3*(x+1))-34.5*(x+1)+15-log(0.15)),
lower=-0.8,upper=-0.7).
The 15% likelihood interval is [−0.75, −0.31].
4.10 (a) For the data n1 = 16, n2 = 16 and n3 = 18, α̂ = 0.28 and
R(α) = (1 + α)^{32} (1 − α)^{18} / [ (1 + 0.28)^{32} (1 − 0.28)^{18} ], 0 < α < 1.
(b) For the data for which 17 identical pairs were found, α̂ = 17/50 = 0.34 and the relative likelihood function is
R(α) = α^{17} (1 − α)^{33} / [ (0.34)^{17} (1 − 0.34)^{33} ], 0 < α < 1.
We use
uniroot(function(x)((x/0.34)^17*((1-x)/0.66)^33-0.1),lower=0,upper=0.3)
to obtain the 10% likelihood interval [0.21, 0.49]. This interval is much narrower than the interval in (a) which indicates that α is more accurately determined by the second model.
[Figure: Relative likelihood function R(α).]

R(θ) = θ^{16} (1 − θ)^{66} / [ (8/41)^{16} (33/41)^{66} ] for 0 < θ ≤ 1/2.
[Figure: Relative likelihood function R(θ).]
4.12 (a) The probability a group tests negative is p = (1 − θ)^k. The probability that y out of n groups test negative is
(n choose y) p^y (1 − p)^{n−y}, y = 0, 1, ..., n.
We are assuming that the nk people represent independent trials and that θ does not vary across subpopulations of the population of interest.
(b) Since L(p) = p^y (1 − p)^{n−y} is the usual Binomial likelihood we know p̂ = y/n. Solving p = (1 − θ)^k for θ we obtain θ = 1 − p^{1/k}. Therefore by the Invariance Property of maximum likelihood estimates, the maximum likelihood estimate of θ is
θ̂ = 1 − p̂^{1/k} = 1 − (y/n)^{1/k}.
therefore

E(Y) = θΓ(3) = 2θ,  E(Y²) = θ²Γ(4) = 6θ²

and

Var(Y) = E(Y²) − [E(Y)]² = 6θ² − (2θ)² = 2θ²

as required.
(b) The likelihood function is

L(θ) = ∏_{i=1}^{n} (y_i/θ²) e^{−y_i/θ} = (∏_{i=1}^{n} y_i) θ^{−2n} exp(−(1/θ) ∑_{i=1}^{n} y_i),  θ > 0

or more simply

L(θ) = θ^{−2n} exp(−(1/θ) ∑_{i=1}^{n} y_i),  θ > 0.
The log likelihood function is

l(θ) = −2n log θ − (1/θ) ∑_{i=1}^{n} y_i,  θ > 0

and

l'(θ) = −2n/θ + (1/θ²) ∑_{i=1}^{n} y_i = (1/θ²)(∑_{i=1}^{n} y_i − 2nθ),  θ > 0.

Now l'(θ) = 0 if

θ = (1/(2n)) ∑_{i=1}^{n} y_i = ȳ/2.

(Note a First Derivative Test could be used to confirm that l(θ) has an absolute
maximum at θ = ȳ/2.) The maximum likelihood estimate of θ is

θ̂ = ȳ/2.
(c)

E(Ȳ) = E((1/n) ∑_{i=1}^{n} Y_i) = (1/n) ∑_{i=1}^{n} E(Y_i) = (1/n) ∑_{i=1}^{n} 2θ = (1/n)(2nθ) = 2θ

and

Var(Ȳ) = Var((1/n) ∑_{i=1}^{n} Y_i) = (1/n²) ∑_{i=1}^{n} Var(Y_i) = (1/n²) ∑_{i=1}^{n} 2θ² = (1/n²)(2nθ²) = 2θ²/n.
(d)

(Ȳ − 2θ)/√(2θ²/n)  has approximately a N(0, 1) distribution.

If Z ∼ N(0, 1),

P(−1.96 ≤ Z ≤ 1.96) = 0.95.

Therefore

P(−1.96 ≤ (Ȳ − 2θ)/√(2θ²/n) ≤ 1.96) ≈ 0.95.

(e) Since

0.95 ≈ P(−1.96 ≤ (Ȳ − 2θ)/√(2θ²/n) ≤ 1.96)
     = P(Ȳ − 1.96 θ√(2/n) ≤ 2θ ≤ Ȳ + 1.96 θ√(2/n))
     = P(Ȳ/2 − 0.98 θ̂√(2/n) ≤ θ ≤ Ȳ/2 + 0.98 θ̂√(2/n))

where θ̂ = Ȳ/2 is substituted for θ in the standard error.
(f) For these data the maximum likelihood estimate of is
4.14 (a)

L(θ) = ∏_{i=1}^{n} (θ³/2) t_i² exp(−θ t_i) = (1/2^n)(∏_{i=1}^{n} t_i²) θ^{3n} exp(−θ ∑_{i=1}^{n} t_i)

or more simply

L(θ) = θ^{3n} exp(−θ ∑_{i=1}^{n} t_i),  θ > 0.

Solving l'(θ) = 0, we obtain the maximum likelihood estimate θ̂ = 3n / ∑_{i=1}^{n} t_i.

The relative likelihood function is

R(θ) = L(θ)/L(θ̂) = (θ/θ̂)^{3n} exp(3n(1 − θ/θ̂)),  θ > 0.
[Figure 12.12: graph of the relative likelihood function R(θ)]
(b) Since n = 20 and ∑_{i=1}^{20} t_i = 996, therefore θ̂ = 3(20)/996 = 0.06024. Reading
from the graph in Figure 12.12 or by solving R(θ) = 0.15 using the uniroot
function in R, we obtain the 15% likelihood interval [0.0463, 0.0768] which is an
approximate 95% confidence interval for θ.
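For example, the endpoints can be found numerically; a minimal sketch in R (the bracketing intervals below are assumptions chosen around θ̂ = 0.06024):

  # Relative likelihood R(theta) = (theta/thetahat)^(3n) * exp(3n*(1 - theta/thetahat))
  n <- 20
  thetahat <- 3 * n / 996
  R <- function(theta) (theta / thetahat)^(3 * n) * exp(3 * n * (1 - theta / thetahat))
  # solve R(theta) = 0.15 on each side of the maximum likelihood estimate
  lower <- uniroot(function(x) R(x) - 0.15, lower = 0.03, upper = thetahat)$root
  upper <- uniroot(function(x) R(x) - 0.15, lower = thetahat, upper = 0.10)$root
  c(lower, upper)   # approximately [0.0463, 0.0768]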
(c)

E(T) = ∫_0^∞ t (θ³/2) t² e^{−θt} dt = (θ³/2) ∫_0^∞ t³ e^{−θt} dt
     = (1/(2θ)) ∫_0^∞ x³ e^{−x} dx   (by letting x = θt)
     = (1/(2θ)) Γ(4) = (1/(2θ)) 3! = 3/θ

and a 95% approximate confidence interval for E(T) = 3/θ is

[3/0.0768, 3/0.0463] = [39.1, 64.8].
(d)

p(θ) = P(T ≤ 50) = (θ³/2) ∫_0^{50} t² e^{−θt} dt
     = 1 − (1250θ² + 50θ + 1) e^{−50θ}.

Since

p(0.0463) = 1 − [1250(0.0463)² + 50(0.0463) + 1] e^{−50(0.0463)} = 0.408

and

p(0.0768) = 1 − [1250(0.0768)² + 50(0.0768) + 1] e^{−50(0.0768)} = 0.738,

an approximate 95% confidence interval for p(θ) is [0.408, 0.738]. The Binomial model involves fewer model assumptions but gives a less precise (wider) interval.
Also,

P(Y ≤ 55.8) = P(Z ≤ (55.8 − 40)/√80)  where Z ∼ N(0, 1)
            = P(Z ≤ 1.77) = 0.96164 ≈ 0.96,

P(X > 1.4) = 1 − P(|Z| ≤ √1.4)  where Z ∼ N(0, 1)
           = 2[1 − P(Z ≤ 1.18)] = 2(1 − 0.88100) = 0.23800

and

P(X > 3) = e^{−3/2} = e^{−1.5} ≈ 0.223.
4.16 (a)

∫_0^∞ (1/(2^{k/2} Γ(k/2))) y^{k/2 − 1} e^{−y/2} dy
  = (1/Γ(k/2)) ∫_0^∞ x^{k/2 − 1} e^{−x} dx   (letting x = y/2)
  = Γ(k/2)/Γ(k/2)   since ∫_0^∞ x^{α−1} e^{−x} dx = Γ(α)
  = 1.
(b) See Figure 12.13. As k increases the probability density function becomes more
symmetric about the line y = k.
(c)

M(t) = E(e^{Yt}) = ∫_0^∞ (1/(2^{k/2} Γ(k/2))) y^{k/2 − 1} e^{−y/2} e^{yt} dy
     = (1/(2^{k/2} Γ(k/2))) ∫_0^∞ y^{k/2 − 1} e^{−(1/2 − t)y} dy   which converges for t < 1/2
     = (1/(2^{k/2} Γ(k/2) (1/2 − t)^{k/2})) ∫_0^∞ x^{k/2 − 1} e^{−x} dx   (by letting x = (1/2 − t)y)
     = [2(1/2 − t)]^{−k/2} = (1 − 2t)^{−k/2}   for t < 1/2.
[Figure 12.13: χ²(k) probability density functions for k = 5, 10, 25]
Therefore

M'(0) = E(Y) = (−k/2)(1 − 2t)^{−k/2 − 1}(−2)|_{t=0} = k

M''(0) = E(Y²) = (k/2)(k/2 + 1)(1 − 2t)^{−k/2 − 2}(−2)(−2)|_{t=0} = k² + 2k

Var(Y) = k² + 2k − k² = 2k

and

M_S(t) = ∏_{i=1}^{n} M_i(t) = (1 − 2t)^{−∑_{i=1}^{n} k_i/2}.
4.16
(a) Since

W_i = Y_i − Ȳ = Y_i − (1/n) ∑_{j=1}^{n} Y_j = (1 − 1/n) Y_i − (1/n) ∑_{j≠i} Y_j,   i = 1, 2, ..., n.

(b)

E(W_i) = E(Y_i − Ȳ) = E(Y_i) − E(Ȳ) = μ − μ = 0,   i = 1, 2, ..., n.

Now Cov(Y_i, Y_j) = 0 if i ≠ j (since the Y_i's are independent random variables)
and Cov(Y_i, Y_i) = Var(Y_i) = σ². This implies

Cov(Y_i, Ȳ) = Cov(Y_i, (1/n) ∑_{j=1}^{n} Y_j) = (1/n) ∑_{j=1}^{n} Cov(Y_i, Y_j)
            = (1/n) Cov(Y_i, Y_i) = (1/n) Var(Y_i) = σ²/n.
Therefore
Figure 12.14: Graphs of the t (k) pd.f. for k = 1; 5; 25 and the N (0; 1) p.d.f. (dashed line)
(d)
(i) If T ∼ t(10) then P(T ≤ 0.88) = 0.8,

and

P(Z ≥ 2.04) = 1 − P(Z ≤ 2.04) = 1 − 0.97932 = 0.02068.
4.19

lim_{k→∞} f(t; k) = lim_{k→∞} c_k (1 + t²/k)^{−(k+1)/2}
  = lim_{k→∞} c_k (1 + t²/k)^{−1/2} (1 + t²/k)^{−k/2}
  = (1/√(2π)) exp(−t²/2)   for t ∈ ℝ

since

lim_{k→∞} c_k = 1/√(2π),

lim_{k→∞} (1 + t²/k)^{−1/2} = 1,  and

lim_{k→∞} (1 + t²/k)^{−k/2} = exp(−t²/2)   since lim_{n→∞} (1 + a/n)^{bn} = e^{ab}.
For Exponential data,

L(θ) = θ^{−n} e^{−nȳ/θ}  for θ > 0, and θ̂ = ȳ.

Therefore

R(θ) = L(θ)/L(θ̂) = L(θ)/L(ȳ) = (θ^{−n} e^{−nȳ/θ})/(ȳ^{−n} e^{−n}) = (ȳ/θ)^n e^{n(1 − ȳ/θ)}  for θ > 0.

For n = 30 and ȳ = 380,

R(θ) = (380/θ)^{30} e^{30(1 − 380/θ)}  for θ > 0.
we obtain the interval as [285:5; 521:3]. Alternatively the likelihood interval can
be determined approximately from a graph of the relative likelihood function.
See Figure 12.15.
Figure 12.15: Relative likelihood function for survival times for AIDS patients
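As a sketch of the numerical calculation (the bracketing limits, and the use of exp(−qchisq(0.90, 1)/2) ≈ 0.2585 as the likelihood level corresponding to an approximate 90% interval, are assumptions consistent with the interval quoted above):

  R <- function(theta) (380 / theta)^30 * exp(30 * (1 - 380 / theta))
  level <- exp(-qchisq(0.90, 1) / 2)   # approximately 0.2585
  lower <- uniroot(function(x) R(x) - level, lower = 200, upper = 380)$root
  upper <- uniroot(function(x) R(x) - level, lower = 380, upper = 700)$root
  c(lower, upper)   # approximately [285.5, 521.3]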
4.21 (a) Using the cumulative distribution function of the Exponential distribution, F(y) = 1 − e^{−y/θ}, we have for w > 0

G(w) = P(W ≤ w) = P(2Y/θ ≤ w) = P(Y ≤ wθ/2) = 1 − e^{−(wθ/2)/θ} = 1 − e^{−w/2}.

Taking the derivative with respect to w gives the probability density function

g(w) = (1/2) e^{−w/2}  for w > 0,

which can easily be verified as the probability density function of a χ²(2) random variable.

(b) Let W_i = 2Y_i/θ ∼ χ²(2), i = 1, 2, ..., n. Then by Theorem 29,

U = ∑_{i=1}^{n} W_i = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(2n).

Since U depends on the data and θ only and its distribution is completely known, U is a pivotal quantity.
(c)

0.9 = P(43.19 ≤ W ≤ 79.08)  where W ∼ χ²(60)

therefore

0.9 = P(43.19 ≤ (2/θ) ∑_{i=1}^{n} Y_i ≤ 79.08)
    = P((2/79.08) ∑_{i=1}^{n} Y_i ≤ θ ≤ (2/43.19) ∑_{i=1}^{n} Y_i)

and thus

[(2/79.08) ∑_{i=1}^{n} y_i , (2/43.19) ∑_{i=1}^{n} y_i]

is a 90% confidence interval for θ. Substituting ∑_{i=1}^{30} y_i = 11400, we obtain the 90%
confidence interval for θ as [288.3, 527.9] which is very close to the approximate
90% likelihood-based confidence interval [285.5, 521.3]. The intervals are close
since n = 30 is reasonably large.
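A quick numerical check of this interval in R (a sketch; the quantile calls reproduce the tabulated values 43.19 and 79.08):

  a <- qchisq(0.05, df = 60)   # 43.19
  b <- qchisq(0.95, df = 60)   # 79.08
  sum_y <- 11400
  c(2 * sum_y / b, 2 * sum_y / a)   # approximately [288.3, 527.9]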
4.22 (a) From Example 2.2.2 the likelihood function for Poisson data is

L(θ) = θ^{nȳ} e^{−nθ}  for θ > 0.
Figure 12.16: Relative Likelihood Functions for Company A and Company B Photocopiers
(b) For Company B, n = 12 and ^ = 11:67. See Figure 12.16 for a graph of the
relative likelihood function (graph on the left).
(c) The 15% likelihood interval for Company A is: [16:2171; 24:3299] and the 15%
likelihood interval for Company B is: [9:8394; 13:7072]. It is clear from these
approximate 95% con…dence intervals that the mean number of service calls for
Company A is much larger than for Company B which implies the decision to
go with Company B is a good one.
(d) The assumptions of the Poisson process (individuality, independence and homo-
geneity) would need to hold.
(e) Since

0.95 ≈ P(−1.96 ≤ (Ȳ − θ)/√(Ȳ/n) ≤ 1.96)
     = P(Ȳ − 1.96√(Ȳ/n) ≤ θ ≤ Ȳ + 1.96√(Ȳ/n)),

therefore the interval [ȳ − 1.96√(ȳ/n), ȳ + 1.96√(ȳ/n)] is an approximate 95%
confidence interval for θ. For Company A this interval is [17.5, 22.5] and for
Company B this interval is [9.73, 13.60]. These intervals are similar but not
identical to the intervals in (c) since n = 12 is small. The intervals would be
more similar for a larger value of n.
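A minimal sketch of this calculation in R (assuming, consistent with the intervals quoted, sample means of 20 for Company A and 11.67 for Company B, each based on n = 12):

  n <- 12
  ybar <- c(A = 20, B = 11.67)
  lower <- ybar - 1.96 * sqrt(ybar / n)
  upper <- ybar + 1.96 * sqrt(ybar / n)
  rbind(lower, upper)   # approximately [17.5, 22.5] and [9.73, 13.60]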
4.23 (a) Since the points in the qqplot in Figure 12.17 lie reasonably along a straight line
the Gaussian model seems reasonable for these data.
(b) A suitable study population for this study would be common octopi in the Ria de
Vigo. The parameter μ represents the mean weight in grams of common octopi
[Figure 12.17: Normal qqplot of the weights (sample quantiles versus N(0,1) quantiles)]
in the Ria de Vigo. The parameter σ represents the standard deviation of the
weights in grams of common octopi in the Ria de Vigo.
μ̂ = ȳ = 20340/19 = 1070.526  and  s = [(1/18)(884095)]^{1/2} = 221.62.
Since the value = 1100 grams is well within this interval then the researchers
could conclude that based on these data the octopi in the Ria de Vigo are
reasonably healthy based on their mean weight.
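The interval referred to here is not printed above; a minimal sketch of the usual t-based 95% confidence interval computed from the summary statistics (assuming that is the interval intended):

  n <- 19; ybar <- 1070.526; s <- 221.62
  ybar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # roughly [964, 1177], which contains 1100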
4.24 (a) Qqplots of the weights for females and males separately are shown in Figures
12.18 and 12.19. In both cases the points lie reasonably along a straight line so
it is reasonable to assume a Normal model for each data set.
[Figure 12.18: Normal qqplot of the weights (quantiles of input sample versus standard Normal quantiles)]
Note that since the value for t (149) is not available in the t-tables we used
P (T 1:9647) = (1 + 0:95) =2 = 0:975 where T v t (100). Using R we obtain
P (T 1:976) = 0:975 where T v t (149) : The intervals will not change substan-
tially.
We note that the interval for females and the interval for males have no values in
common. The mean weight for males is higher than the mean weight for females.
(c) To obtain confidence intervals for the standard deviations we note that the pivotal quantity (n − 1)S²/σ² = 149S²/σ² has a χ²(149) distribution and the Chi-squared tables stop at 100 degrees of freedom. Since E(149S²/σ²) = 149 and Var(149S²/σ²) = 2(149) = 298, we use 149S²/σ² ∼ N(149, 298) approximately.
[Figure 12.19: Normal qqplot of the weights (quantiles of input sample versus standard Normal quantiles)]
4.25 (a) A suitable study population consists of the detergent packages produced by this
particular detergent packaging machine. The parameter corresponds to the
mean weight of the detergent packages produced by this detergent packaging
machine. The parameter is the standard deviation of the weights of the deter-
gent packages produced by this detergent packaging machine.
(b) For these data

ȳ = 4803/16 = 300.1875

s² = (1/15)[1442369 − 16(300.1875)²] = 37.89583

s = 6.155959.
4.27 Use σ² ≈ s² = 45/9 = 5 and d = 0.5. Hence n ≥ (1.96/d)² σ² = (1.96/0.5)²(5) = 76.832.
Since 10 observations have already been taken, the manufacturer should be advised
to take at least 77 − 10 = 67 additional measurements. We note that this calculation
depends on an estimate of σ from a small sample (n = 10) and the value 1.96 is from
the Normal tables rather than the t tables, so the manufacturer should be advised to
take more than 67 additional measurements. If we round 1.96 to 2 to account for
the fact that we don't actually know σ, and note that (2/0.5)²(5) = 80, then this would
suggest that, to be safe, the manufacturer should take an additional 80 − 10 = 70
measurements.
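A small sketch of this sample-size calculation in R:

  sigma2 <- 5; d <- 0.5
  n_needed <- (1.96 / d)^2 * sigma2      # 76.832, so round up to 77
  additional <- ceiling(n_needed) - 10   # 67 additional measurements
  c(n_needed, additional)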
(b) Since the observations in x are observations from a distribution with larger variability, we don't want to take just an average of x and y. We would choose an estimate that weights y more than x since y is a better estimate.

(c)

Var(μ̃) = Var((X + 4Y)/5) = (1/25)[Var(X) + 16 Var(Y)]
        = (1/25)[1/10 + 16(0.25)/10] = 0.02

and √Var(μ̃) = 0.1414. Also

Var((X + Y)/2) = (1/4) Var(X) + (1/4) Var(Y) = (1/4)(1/10) + (1/4)(0.25/10) = 0.03125

and √Var((X + Y)/2) = 0.1768. We can clearly see now that μ̃ has a smaller
standard deviation than the estimator (X + Y)/2.
Chapter 5
5.1 (a) The model Y ∼ Binomial(n, θ) is appropriate in the case in which the experiment consists of a sequence of n independent trials with two outcomes on each trial (Success and Failure) and P(Success) = θ is the same on each trial. In this experiment the trials are the guesses. Since the deck is reshuffled each time, it seems reasonable to assume the guesses are independent. It also seems reasonable to assume that the woman's ability to guess the number remains the same on each trial. To test the hypothesis that the woman is guessing at random the appropriate null hypothesis would be H0: θ = 1/5 = 0.2.

(b) For n = 20 and H0: θ = 0.2, we have Y ∼ Binomial(20, 0.2) and E(Y) = 20(0.2) = 4. We use the test statistic or discrepancy measure D = |Y − E(Y)| = |Y − 4|. The observed value of D is d = |8 − 4| = 4. Then

p-value = P(D ≥ 4; H0) = P(|Y − 4| ≥ 4; H0)
        = P(Y = 0) + P(Y ≥ 8)
        = (20 choose 0)(0.2)^0 (0.8)^20 + ∑_{y=8}^{20} (20 choose y)(0.2)^y (0.8)^(20−y)
        = 1 − ∑_{y=1}^{7} (20 choose y)(0.2)^y (0.8)^(20−y)
        = 0.04367  (using R).

There is evidence based on the data against H0: θ = 0.2. These data suggest that the woman might have some special guessing ability.
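A minimal R sketch of this exact Binomial p-value:

  # P(Y = 0) + P(Y >= 8) for Y ~ Binomial(20, 0.2)
  dbinom(0, size = 20, prob = 0.2) + (1 - pbinom(7, size = 20, prob = 0.2))   # 0.04367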
(c) For n = 100 and H0: θ = 0.2, we have Y ∼ Binomial(100, 0.2), E(Y) = 100(0.2) = 20 and Var(Y) = 100(0.2)(0.8) = 16. We use the test statistic or discrepancy measure D = |Y − E(Y)| = |Y − 20|. The observed value of D is d = |32 − 20| = 12. Then
or
[Figure 12.20: Normal qqplot (sample quantiles versus N(0,1) quantiles)]
There is strong evidence based on the data against H0 : = 0:2. These data
suggest that the woman has some special guessing ability. Note that we would
not conclude that it has been proven that she does have special guessing ability!
5.3 (a) A qqplot of the data is given in Figure 12.20. Since the points in the qqplot lie reasonably along a straight line it seems reasonable to assume a Normal model for these data.

(b) A study population is a bit difficult to define in this problem. One possible choice is to define the study population to be all measurements that could be taken on a given day by this instrument on a standard solution of 45 parts per billion dioxin. The parameter μ corresponds to the mean measurement made by this instrument on the standard solution. The parameter σ corresponds to the standard deviation of the measurements made by this instrument on the standard solution.
(c) To test H0: μ = 45 we use the test statistic

D = |Ȳ − 45| / (S/√20)   where  T = (Ȳ − 45)/(S/√20) ∼ t(19).

The observed value of D is

d = |44.405 − 45| / (2.3946/√20) = 1.11

and

p-value = P(D ≥ d; H0)
        = P(|T| ≥ 1.11)  where T ∼ t(19)
        = 2[1 − P(T ≤ 1.11)]
        = 0.2803  (calculated using R).

In either case, since the p-value is larger than 0.1, we would conclude that, based on the observed data, there is no evidence against the hypothesis H0: μ = 45. (Note: This does not imply the hypothesis is true!)
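A one-line R sketch of the p-value computation quoted above:

  d <- abs(44.405 - 45) / (2.3946 / sqrt(20))
  2 * (1 - pt(d, df = 19))   # approximately 0.28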
A 100p% confidence interval for μ based on the pivotal quantity

T = (Ȳ − μ)/(S/√20) ∼ t(19)

is given by

[ȳ − a s/√20, ȳ + a s/√20].
Based on these data it would appear that the new instrument is working as it
should be since there was not evidence against H0 : = 45. We might notice
that the value = 45 is not in the center of the 95% con…dence interval but closer
to the upper endpoint suggesting that the instrument might be under reading
the true value of 45. It would be wise to continue testing the instrument on a
regular basis on a known sample to ensure that the instrument is continuing to
work well.
(d) To test H0: σ² = σ0² we use the test statistic

U = (n − 1)S²/σ0² ∼ χ²(n − 1).

A 100p% confidence interval for σ is given by

[√((n − 1)s²/b), √((n − 1)s²/a)]

where P(U ≤ a) = (1 − p)/2 = P(U ≥ b). For n = 20 and p = 0.95 we have
P(U ≤ 8.907) = 0.025 = P(U ≥ 32.852) and the confidence interval for σ is

[√(19(5.7342)/32.852), √(19(5.7342)/8.907)] = [1.82, 3.50].
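As a sketch, the same interval from R's Chi-squared quantile function:

  s2 <- 5.7342; n <- 20
  sqrt((n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1))   # approximately [1.82, 3.50]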
D = |Ȳ − 45| / (2/√20)   where  Z = (Ȳ − 45)/(2/√20) ∼ N(0, 1)

and

p-value = P(D ≥ d; H0) = P(|Z| ≥ 1.33)  where Z ∼ N(0, 1)
        = 2[1 − P(Z ≤ 1.33)] = 2(1 − 0.90824) = 0.18352.
Based on these data there is no evidence to contradict the manufacturer’s claim that
H0 : = 45.
5.5 (a) To test the hypothesis H0: μ = 105 we use the discrepancy measure or test statistic

D = |Ȳ − 105| / (S/√12)

where

S = [(1/11) ∑_{i=1}^{12} (Y_i − Ȳ)²]^{1/2}

and the t statistic

T = (Ȳ − 105)/(S/√12) ∼ t(11)

assuming the hypothesis H0: μ = 105 is true.
For these data ȳ = 104.13, s² = 88.3115 and s = 9.3974. The observed value of the discrepancy measure D is

d = |104.13 − 105| / (9.3974/√12) = 0.3194

and

p-value = P(D ≥ d; H0)
        = P(|T| ≥ 0.3194)  where T ∼ t(11)
        = 2[1 − P(T ≤ 0.3194)] = 2(0.3777)
        = 0.7554  (calculated using R).
Alternatively, using the t tables in the Course Notes we have P(T ≤ 0.260) = 0.6 and P(T ≤ 0.540) = 0.7, so 0.6 < p-value < 0.8.

In either case, since the p-value is much larger than 0.1, we would conclude that, based on the observed data, there is no evidence against the hypothesis H0: μ = 105. (Note: This does not imply the hypothesis is true!)
From the t tables we have P(T ≤ 2.201) = (1 + 0.95)/2 = 0.975 where T ∼ t(11). A 95% confidence interval for μ is

[ȳ − 2.201 s/√12, ȳ + 2.201 s/√12] = [98.16, 110.10].
(c) Since there was no evidence against H0: μ = 105 and since the value μ = 105 is near the center of the 95% confidence interval for μ, the data support the conclusion that the detector is accurate, that is, that the detector is not giving biased readings. The confidence interval for σ, however, indicates that the precision of the detectors might be of concern. The 95% confidence interval for σ suggests that the standard deviation could be as large as 16 parts per billion. As a statistician you would need to rely on the expertise of the researchers for a decision about whether the size of σ is scientifically significant and whether the precision of the detectors is too low. You would also point out to the researchers that this evidence is based on a fairly small sample of only 12 detectors.
To test H0: σ² = 100 we use the test statistic

U = ∑_{i=1}^{12} (Y_i − μ)² / σ0² ∼ χ²(n),

which here is

U = ∑_{i=1}^{12} (Y_i − 105)² / 100 ∼ χ²(12).
Since

∑_{i=1}^{12} (y_i − 105)² = ∑_{i=1}^{12} y_i² − 2(105) ∑_{i=1}^{12} y_i + 12(105)²
                          = 131096.44 − 210(1249.6) + 12(105)² = 980.44,

the observed value of U is u = 980.44/100 = 9.8044.
Alternatively, using the Chi-squared tables in the Course Notes we have P(U ≤ 9.034) = 0.3, so p-value > 2(0.3) = 0.6. In either case, since the p-value is larger than 0.1, we would conclude that, based on the observed data, there is no evidence against the hypothesis H0: σ² = 100.
5.7 (a) The respondents to the survey are students who heard about the online referen-
dum and then decided to vote. These students may not be representative of all
students at the University of Waterloo. For example, it is possible that the stu-
dents who took the time to vote are also the students who most want a fall study
break. Students who don’t care about a fall study break probably did not bother
to vote. This is an example of sampling error. Any online survey such as this
online referendum has the disadvantage that the sample of people who choose to
vote are not necessarily a representative sample of the study population of in-
terest. The advantage of online surveys is that they are inexpensive and easy to
conduct. To obtain a representative sample you would need to select a random
sample of all students at the University of Waterloo. Unfortunately, taking such a sample would be much more time consuming and costly than conducting an online referendum.
(b) A suitable target population would be the 30; 990 eligible voters. This would
also be the study population. Note that all undergraduates were able to vote
but it is not clear how the list of undergraduates is determined.
(c) The attribute of interest is the proportion of the 30; 990 eligible voters (the
study population) who would respond yes to the question. The parameter in
the Binomial model corresponds to this attribute. A Binomial model assumes
independent trials (students) which might not be a valid assumption. For example, if groups of students, say within a specific faculty, all got together and voted, their responses may not be independent events.
(d) The maximum likelihood estimate of θ based on the observed data is

θ̂ = 4440/6000 = 0.74.
Since this estimate is not based on a random sample it is not possible to say how
accurate this estimate is.
(e) An approximate 95% confidence interval for θ is given by

0.74 ± 1.96 √(0.74(0.26)/6000) = 0.74 ± 0.01 = [0.73, 0.75].
(f) Since = 0:7 is not a value contained in the approximate 95% con…dence interval
[0:73; 0:75] for , therefore the approximate p value for testing H0 : = 0:7 is
less than 0:05. (Note that since = 0:7 is far outside the interval, the p value
would be much smaller than 0:05.)
5.8 (a) If H0: θ = 3 is true then, since Y_i has a Poisson distribution with mean 3 for i = 1, 2, ..., 25 independently, ∑_{i=1}^{25} Y_i has a Poisson distribution with mean 3(25) = 75. The discrepancy measure is

D = |∑_{i=1}^{25} Y_i − 75| = |∑_{i=1}^{25} Y_i − E(∑_{i=1}^{25} Y_i)|.

For the given data, ∑_{i=1}^{25} y_i = 51. The observed value of the discrepancy measure is

d = |∑_{i=1}^{25} y_i − 75| = |51 − 75| = 24

and

p-value = P(D ≥ d; H0)
        = P(|∑_{i=1}^{25} Y_i − 75| ≥ 24; H0)
        = ∑_{x=0}^{51} 75^x e^{−75}/x! + ∑_{x=99}^{∞} 75^x e^{−75}/x!
        = 1 − ∑_{x=52}^{98} 75^x e^{−75}/x!
        = 0.006716  (calculated using R).

Since 0.001 < 0.006716 < 0.01 we would conclude that, based on the data, there is strong evidence against the hypothesis H0: θ = 3.
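A minimal R sketch of this Poisson p-value:

  # P(|S - 75| >= 24) for S ~ Poisson(75), i.e. P(S <= 51) + P(S >= 99)
  ppois(51, lambda = 75) + (1 - ppois(98, lambda = 75))   # approximately 0.0067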
5.9 The observed value of the likelihood ratio test statistic for testing H0: θ = 3 is

Λ(3) = 2(25)[2.04 log(2.04/3) + 3 − 2.04] = 8.6624

and

p-value = P(Λ(3) ≥ 8.6624; H0)
        ≈ P(W ≥ 8.6624)  where W ∼ χ²(1)
        = P(|Z| ≥ √8.6624)  where Z ∼ N(0, 1)
        = 2[1 − P(Z ≤ 2.94)] = 0.00328.

The p-value is close to the p-values calculated in (a) and (b).
5.10 Since

R(θ) = (3.6/θ)^{20} e^{20(1 − 3.6/θ)}  for θ > 0,

then

R(5) = (3.6/5)^{20} e^{20(1 − 3.6/5)} = 0.3791

and

Λ(5) = −2 log R(5) = −2 log(0.3791) = 1.9402.

Therefore the p-value is approximately P(W ≥ 1.9402) ≈ 0.16 where W ∼ χ²(1), and since p-value > 0.1 there is no evidence, based on the data, to contradict H0: θ = 5. The approximate 95% confidence interval for θ is [2.40, 5.76] which contains the value θ = 5. This also implies that the p-value > 0.05 and so the approximate confidence interval is consistent with the test of hypothesis.
5.11 Since

r(α) = 15 log[2.3(α + 1)] − 34.5(α + 1) + 15  for α > −1,

then

r(−0.1) = 15 log[2.3(−0.1 + 1)] − 34.5(−0.1 + 1) + 15 = −5.1368

and

Λ(−0.1) = −2 r(−0.1) = −2(−5.1368) = 10.2735.

Therefore the p-value is approximately P(W ≥ 10.2735) ≈ 0.0013 where W ∼ χ²(1), and since 0.001 < p-value < 0.01 there is strong evidence, based on the data, to contradict H0: α = −0.1. The approximate 95% confidence interval for α is [−0.75, −0.31] which does not contain the value α = −0.1. This also implies that the p-value < 0.05 and so the approximate confidence interval is consistent with the test of hypothesis.
5.12 Since

R(θ) = θ^{16} (1 − θ)^{66} / [(8/41)^{16} (33/41)^{66}]  for 0 < θ ≤ 1/2,

then

R(0.18) = (0.18)^{16} (1 − 0.18)^{66} / [(8/41)^{16} (33/41)^{66}] = 0.9397

and

Λ(0.18) = −2 log R(0.18) = −2 log(0.9397) = 0.1244.

Therefore the p-value is approximately P(W ≥ 0.1244) ≈ 0.72 where W ∼ χ²(1), and since p-value > 0.1 there is no evidence, based on the data, to contradict H0: θ = 0.18. The approximate 95% confidence interval for θ is [0.12, 0.29] which contains the value θ = 0.18. This also implies that the p-value > 0.05 and so the approximate confidence interval is consistent with the test of hypothesis.
5.13 (a) The maximum likelihood estimate of θ is θ̂ = 18698.6/20 = 934.93. The agreement between the plot of the empirical cumulative distribution function and the cumulative distribution function of an Exponential(934.93) random variable given in Figure 12.21 indicates that the Exponential model is reasonable.
Figure 12.21: Empirical c.d.f. and Exponential(934:93) c.d.f. for failure times of power
systems
(b) The observed value of the likelihood ratio statistic for testing H0: θ = θ0 for Exponential data (see Example 5.3.2) is

Λ(θ0) = 2n[ȳ/θ0 − 1 − log(ȳ/θ0)],

so

Λ(1000) = 2(20)[934.93/1000 − 1 − log(934.93/1000)] = 0.0885

with p-value ≈ P(W ≥ 0.0885) ≈ 0.77 where W ∼ χ²(1), so there is no evidence against H0: θ = 1000 based on the data.
5.14 One test statistic that could be used is the mean of the generated sample. The mean should be close to 0.5 if the random number generator is working well.
5.15 (a) For each given region the assumptions of independence, individuality and homo-
geneity would need to hold for the number of events per person per year.
(b) Assume the observations y_1, y_2, ..., y_K from the different regions are independent. Since Y_j ∼ Poisson(P_j λ_j t), the likelihood function for λ = (λ_1, λ_2, ..., λ_K) is

L(λ) = ∏_{j=1}^{K} (P_j λ_j t)^{y_j} e^{−P_j λ_j t} / y_j!

or more simply

L(λ) = ∏_{j=1}^{K} λ_j^{y_j} e^{−P_j λ_j t},

so that

l(λ) = ∑_{j=1}^{K} [y_j log λ_j − P_j λ_j t].

Since

∂l/∂λ_j = y_j/λ_j − P_j t = (y_j − (P_j t) λ_j)/λ_j = 0

gives λ̂_j = y_j/(P_j t), j = 1, 2, ..., K.
Therefore

l(λ̂) = ∑_{j=1}^{K} [y_j log λ̂_j − P_j λ̂_j t] = ∑_{j=1}^{K} [y_j log(y_j/(P_j t)) − y_j]
      = ∑_{j=1}^{K} y_j [log(y_j/(P_j t)) − 1].
Under H0: λ_1 = λ_2 = ... = λ_K = λ, the likelihood function is

L(λ) = ∏_{j=1}^{K} λ^{y_j} e^{−λ P_j t}.

Since

l'(λ) = (1/λ) ∑_{j=1}^{K} y_j − t ∑_{j=1}^{K} P_j = 0

if λ = ∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j), the maximum likelihood estimate of λ assuming
H0: λ_1 = λ_2 = ... = λ_K is λ̂_0 = ∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j). So
l(λ̂_0) = ∑_{j=1}^{K} y_j log λ̂_0 − λ̂_0 t ∑_{j=1}^{K} P_j
       = (∑_{j=1}^{K} y_j) log(∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j)) − ∑_{j=1}^{K} y_j
       = (∑_{j=1}^{K} y_j) [log(∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j)) − 1].
The likelihood ratio statistic is

Λ = 2l(λ̃) − 2l(λ̃_0)
  = 2 ∑_{j=1}^{K} Y_j [log(Y_j/(P_j t)) − 1] − 2 (∑_{j=1}^{K} Y_j) [log(∑_{j=1}^{K} Y_j / (t ∑_{j=1}^{K} P_j)) − 1]

with observed value

2l(λ̂) − 2l(λ̂_0) = 2 ∑_{j=1}^{K} y_j [log(y_j/(P_j t)) − 1] − 2 (∑_{j=1}^{K} y_j) [log(∑_{j=1}^{K} y_j / (t ∑_{j=1}^{K} P_j)) − 1].
The p-value is

P(Λ ≥ observed value; H0) ≈ P(W ≥ observed value)  where W ∼ χ²(K − 1).

For these data,

λ̂ = (27/(5(2025)), 18/(5(1116)), 41/(5(3210)), 29/(5(1687)), 31/(5(2840)))  and  λ̂_0 = 146/(5(10878)).
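A sketch of the whole calculation in R for these data (the event counts, populations, and t = 5 are taken from the estimates above; the χ²(K − 1) approximation is used for the p-value):

  y <- c(27, 18, 41, 29, 31)            # observed numbers of events by region
  P <- c(2025, 1116, 3210, 1687, 2840)  # region populations
  t <- 5
  lambda_hat <- y / (P * t)             # separate rates
  lambda0 <- sum(y) / (t * sum(P))      # common rate under H0
  lrt <- 2 * sum(y * log(lambda_hat / lambda0))   # observed likelihood ratio statistic
  c(lrt, 1 - pchisq(lrt, df = length(y) - 1))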
5.16 (a) The maximum likelihood estimators are μ̃ = Ȳ and σ̃² = (1/n) ∑_{i=1}^{n} (Y_i − Ȳ)², while under H0: μ = μ0 they are μ̃0 = μ0 and σ̃0² = (1/n) ∑_{i=1}^{n} (Y_i − μ0)², and the likelihood ratio statistic is Λ(μ0) = n log(σ̃0²/σ̃²). Since

σ̃0²/σ̃² = ∑_{i=1}^{n} (Y_i − μ0)² / ∑_{i=1}^{n} (Y_i − Ȳ)²
        = [∑_{i=1}^{n} (Y_i − Ȳ)² + n(Ȳ − μ0)²] / ∑_{i=1}^{n} (Y_i − Ȳ)²,

it follows that

Λ(μ0) = n log[1 + n(Ȳ − μ0)² / ∑_{i=1}^{n} (Y_i − Ȳ)²] = n log[1 + T²/(n − 1)]

where T is the usual t statistic.
5.17
Chapter 6
(b) The scatterplot with fitted line and the residual plots shown in Figure 12.22 show no unusual patterns. The model fits the data well.
[Figure 12.22: scatterplot with fitted line, residual plots, and Normal qqplot of the standardized residuals]
for the observed data. Then, approximately 95% of the constructed intervals would contain the true, but unknown, value of the parameter. We say that we are 95% confident that our interval contains the true value of the parameter.
(d) Since P(T ≤ 1.7139) = 0.95 where T ∼ t(23), a 90% confidence interval for the mean systolic blood pressure of nurses aged x = 35 is

α̂ + β̂(35) ± 1.7139 (7.6744) [1/25 + (35 − 43.20)²/2802.00]^{1/2}
  = 126.7553 ± 3.3274 = [123.43, 130.08].
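As a rough check, the margin in (d) can be reproduced in R from the summary quantities quoted above (μ̂(35) = 126.7553, s_e = 7.6744, n = 25, x̄ = 43.20, S_xx = 2802.00):

  se <- 7.6744; n <- 25; xbar <- 43.20; Sxx <- 2802.00
  margin <- qt(0.95, df = n - 2) * se * sqrt(1/n + (35 - xbar)^2 / Sxx)
  126.7553 + c(-1, 1) * margin   # approximately [123.43, 130.08]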
(e) Since P(T ≤ 2.8073) = 0.995 where T ∼ t(23), a 99% prediction interval for the systolic blood pressure of a nurse aged x = 50 is

α̂ + β̂(50) ± 2.8073 (7.6744) [1 + 1/25 + (50 − 43.20)²/2802.00]^{1/2}
  = 139.2029 ± 23.0108 = [116.19, 162.21].
The study population is all the actors listed at boxofficemojo.com/people/. The parameter β represents the mean increase in the amount grossed by a movie for a unit change in the value of an actor.
[Figure: scatterplot with fitted line y = 32.04 + 3.62x, residual plots, and Normal qqplot of the standardized residuals]
However, since the 20 data points were obtained by taking the first 20 actors in the list, the sample is not a random sample. If actors with last names starting with letters at the beginning of the alphabet are more successful than other actors then the estimate of β might be biased.
(e) The hypothesis of no relationship is equivalent to H0: β = 0. Since

p-value = 2[1 − P(T ≤ |β̂ − 0|/(s_e/√S_xx))] = 2[1 − P(T ≤ 2.85)] = 0.011

(using R), there is evidence based on the data against H0: β = 0. Note that this is consistent with the fact that the 95% confidence interval for β does not contain the value β = 0.
(f) Since P(T ≤ 2.1009) = 0.975 where T ∼ t(18), a 95% confidence interval for the mean amount grossed by movies for actors whose value is x = 50 is

32.0444 + (3.6238)(50) ± 2.1009 (100.6524) [1/20 + (50 − 43.03)²/6283.422]^{1/2}
  = 213.2326 ± 50.8090 = [162.4236, 264.0417].
A 95% con…dence interval for the mean amount grossed by movies for actors
6.3 (a) Recall this was a regression of the form E(Y_i) = α + β x_{1i} where x_{1i} = x_i² and x_i = bolt diameter. Now n = 30, α̂ = 1.6668, β̂ = 2.8378, s_e = 0.05154, S_xx = 0.2244 and x̄_1 = 0.11. A point estimate of the mean breaking strength at x_1 = (0.35)² = 0.1225 is α̂ + β̂ x_1 = 1.6668 + 2.8378(0.1225) = 2.014.
6.4 (a)

β̂ = S_xy / S_xx = 2818.556835 / 2818.946855 = 0.9999

α̂ = ȳ − β̂ x̄ = 23.5505 − (0.9999)(23.7065) = −0.1527

A scatterplot of the data as well as the fitted line are given in the top left panel of Figure 12.24. The straight line fits the data very well. The observed points all lie very close to the fitted line.
(b) Since P(T ≤ 2.1009) = 0.975 where T ∼ t(18) and

s_e = [(S_yy − β̂ S_xy)/(n − 2)]^{1/2}
    = [(2820.862295 − (0.9998616)(2818.556835))/18]^{1/2} = 0.3870,

a 95% confidence interval for β is

0.9999 ± 2.1009 (0.3870)/√2818.946855 = [0.9845, 1.0152].

Since the value β = 1 is inside the 95% confidence interval for β we know the p-value for testing H0: β = 1 is greater than 0.05. Alternatively

p-value = 2[1 − P(T ≤ |β̂ − 1|/(s_e/√S_xx))] = 2[1 − P(T ≤ 0.019)] = 0.99.
Figure 12.24: Scatterplot and residual plots for cheap versus expensive procedures
(d) The scatterplot plus the fitted line indicates good agreement between the cheaper way of determining concentrations and the more expensive way. The points lie quite close to the fitted line. The data suggest that the cheaper way of determining concentrations is quite accurate since the cheaper way does not appear to consistently give values which are systematically above (or below) the concentration determined by the more expensive way.
The fitted line is y = −0.1527 + 0.9999x.

Up to a constant, the likelihood function is

L(β) = exp(−(1/(2σ²)) ∑_{i=1}^{n} (y_i − βx_i)²).
Maximizing l(β) is equivalent to minimizing g(β) = ∑_{i=1}^{n} (y_i − βx_i)², which is the criterion for determining the least squares estimate of β. Solving

dg/dβ = 2 ∑_{i=1}^{n} (βx_i − y_i) x_i = 0

we obtain

β̂ = ∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i².

The corresponding estimator is

β̃ = ∑_{i=1}^{n} x_i Y_i / ∑_{i=1}^{n} x_i² = ∑_{i=1}^{n} a_i Y_i   where a_i = x_i / ∑_{j=1}^{n} x_j²,
and

Var(β̃) = ∑_{i=1}^{n} (x_i / ∑_{j=1}^{n} x_j²)² Var(Y_i) = (1/(∑_{i=1}^{n} x_i²)²) ∑_{i=1}^{n} x_i² σ² = σ² / ∑_{i=1}^{n} x_i²,

therefore

β̃ = ∑_{i=1}^{n} x_i Y_i / ∑_{i=1}^{n} x_i² ∼ N(β, σ² / ∑_{i=1}^{n} x_i²).
(c)

∑_{i=1}^{n} (y_i − β̂x_i)² = ∑_{i=1}^{n} (y_i² − 2x_i y_i β̂ + x_i² β̂²)
  = ∑_{i=1}^{n} y_i² − 2 (∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i²) ∑_{i=1}^{n} x_i y_i + (∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i²)² ∑_{i=1}^{n} x_i²
  = ∑_{i=1}^{n} y_i² − 2 (∑_{i=1}^{n} x_i y_i)² / ∑_{i=1}^{n} x_i² + (∑_{i=1}^{n} x_i y_i)² / ∑_{i=1}^{n} x_i²
  = ∑_{i=1}^{n} y_i² − (∑_{i=1}^{n} x_i y_i)² / ∑_{i=1}^{n} x_i²

as required.
(d) Find a in the t table such that P(−a ≤ T ≤ a) = 0.95 where T ∼ t(n − 1). Then since

0.95 = P(−a ≤ (β̃ − β)/(S_e/√(∑_{i=1}^{n} x_i²)) ≤ a)
     = P(β̃ − a S_e/√(∑_{i=1}^{n} x_i²) ≤ β ≤ β̃ + a S_e/√(∑_{i=1}^{n} x_i²)),

a suitable discrepancy measure is

D = |β̃ − β| / (S_e/√(∑_{i=1}^{n} x_i²)).
6.6 (a)

β̂ = ∑_{i=1}^{n} x_i y_i / ∑_{i=1}^{n} x_i² = 13984.5554/14058.9097 = 0.9947

and the fitted model is y = 0.9947x.

(b) A scatterplot of the data as well as the fitted line are given in the top left panel of Figure 12.25. The straight line fits the data very well. The observed points all lie very close to the fitted line.
(c) Since P(T ≤ 2.0930) = 0.975 where T ∼ t(19) and

s_e = [(1/(n − 1)) (∑_{i=1}^{n} y_i² − (∑_{i=1}^{n} x_i y_i)²/∑_{i=1}^{n} x_i²)]^{1/2}
    = [(13913.3833 − 13984.5554²/14058.9097)/19]^{1/2} = 0.3831.
Figure 12.25: Scatterplot and residual plots for model through the origin
(d) The scatterplot with fitted line and the residual plots shown in Figure 12.25 show no unusual patterns. The model fits the data well.

(e) Based on this analysis we would conclude that the simple model Y_i ∼ G(βx_i, σ) is an adequate model for these data as compared to the model Y_i ∼ G(α + βx_i, σ).
Figure 12.26: Scatterplot and residual plots for death rate due to cirrhosis of the liver versus
wine consumption
there is very strong evidence based on the data against H0: β = 0. Note that this is consistent with the fact that the 95% confidence interval for β does not contain the value β = 0.
6.8 (a) The scatterplot and residual plots indicate that the model …ts the data well.
6.9 (a)

x̄ = 191.7871,  ȳ = 20.0276
S_xx = 2291.3148,  S_yy = 447.8497,  S_xy = 1008.8246

β̂ = S_xy/S_xx = 1008.8246/2291.3148 = 0.44028

α̂ = ȳ − β̂ x̄ = 20.0276 − (0.44028)(191.7871) = −64.4128

The fitted line is y = −64.4128 + 0.44028x. The scatterplot and residual plot are given in the top two panels of Figure 12.27. Both graphs show a distinctive pattern. In the scatterplot, as x increases the points lie above the line, then below, then above. Correspondingly, in the residual plot, as x increases the residuals are positive, then negative, then positive. In the residual plot the points do not lie in a horizontal band about the line r̂_i = 0, which suggests that the linear model is not adequate.
Figure 12.27: Fitted lines and residual plots for atmospheric pressure data
(b)

x̄ = 191.7871,  z̄ = 2.9804
S_xx = 2291.3148,  S_zz = 1.00001,  S_xz = 47.81920

β̂ = S_xz/S_xx = 47.81920/2291.3148 = 0.02087

α̂ = z̄ − β̂ x̄ = 2.9804 − (0.02087)(191.7871) = −1.02214

The fitted line is z = −1.02214 + 0.02087x, where z = log(pressure). The scatterplot and residual plots are given in the bottom two panels of Figure 12.27. In both of these plots we do not observe any unusual patterns. There is no evidence to contradict the linear model for log(pressure) versus temperature. However, this does not "prove" that the theory's model is correct - only that there is no evidence to disprove it.
(c) Since P(T ≤ 2.0452) = 0.975 where T ∼ t(29), and

s_e = [(S_zz − β̂ S_xz)/(n − 2)]^{1/2} = [(1.00001 − (0.02087)(47.81920))/29]^{1/2} = 0.00838894,

a 95% confidence interval for the mean log atmospheric pressure at a temperature of x = 195 is

−1.02214 + (0.02087)(195) ± 2.0452 (0.008389) [1/31 + (195 − 191.7871)²/2291.3148]^{1/2}
  = 3.04747 ± 0.00329 = [3.04418, 3.05076]
which implies a 95% con…dence interval for the mean atmospheric pressure at a
temperature of x = 195 is
6.10 (a) We assume that the study population is the set of all Grade 3 students who
are being taught the same curriculum. (For example in Ontario all Grade 3
students must be taught the same Grade 3 curriculum set out by the Ontario
Government.) The parameter 1 represents the mean score on the DRP test
if all Grade 3 students in the study population took part in the new directed
readings activities for an 8-week period.
The parameter 2 represents the mean score on the DRP test for all Grade 3
students in the study population without the directed readings activities.
The parameter represents the standard deviation of the DRP scores for all
Grade 3 students in the study population which is assumed to be the same
whether the students take part in the new directed readings activities or not.
(b) The qqplot of the responses for the treatment group and the qqplot of the re-
sponses for the control group are given in Figures 12.28 and 12.29. Looking at
these plots we see that the points lie reasonably along a straight line in both plots
and so we would conclude that the normality assumptions seem reasonable.
Figure 12.28: Normal Qqplot of the Responses for the Treatment Group
Figure 12.29: Normal Qqplot for the Responses in the Control Group
(d) To test the hypothesis of no difference between the means, that is, to test the hypothesis H0: μ1 = μ2, we use the discrepancy measure

D = |Ȳ1 − Ȳ2 − 0| / (S_p √(1/n1 + 1/n2))

where

T = (Ȳ1 − Ȳ2 − 0) / (S_p √(1/n1 + 1/n2)) ∼ t(n1 + n2 − 2)

assuming H0: μ1 = μ2 is true. The observed value of D for these data is

d = |ȳ1 − ȳ2 − 0| / (s_p √(1/n1 + 1/n2)) = |51.4762 − 41.5217 − 0| / (14.5512 √(1/21 + 1/23)) = 2.2666

and p-value = 2[1 − P(T ≤ 2.2666)] ≈ 0.029 where T ∼ t(42). Since the p-value is less than 0.05 there is evidence against the hypothesis H0: μ1 = μ2 based on the data.
Although the data suggest there is a difference between the treatment group and the control group, we cannot conclude that the difference is due to the new directed readings activities. The difference could simply be due to the differences in the two Grade 3 classes. Since randomization was not used to determine which student received the treatment and which student was in the control group, the difference in the DRP scores could have existed before the treatment was applied.
with

p-value = 2[1 − P(T ≤ 2.074)] = 0.05  where T ∼ t(18)

so there is weak evidence against H0 based on the data.

(c) We repeat the above using as data Z_ij = log(Y_ij). This time the sample means are 2.248 and 1.795, and the sample variances are 0.320 and 0.240 respectively. The pooled estimate of the standard deviation is s_p = √((0.320 + 0.240)/2) = 0.529. The observed value of the discrepancy measure is

d = |2.248 − 1.795 − 0| / (0.529 √(1/10 + 1/10)) = 1.9148

with

p-value = 2[1 − P(T ≤ 1.91)] ≈ 0.07  where T ∼ t(18)

so there is even less evidence against H0 based on the data.
(d) One could check the Normality assumption with qqplots for each of the variables Y_ij and Z_ij = log(Y_ij), although with such a small sample size these will be difficult to interpret.
6.13 Let μ1 be the mean log failure time for welded girders and μ2 be the mean log failure time for repaired welded girders. The pooled estimate of the common standard deviation is

s_p = √((13(0.0914) + 9(0.0422))/22) = 0.26697.

From the t tables, P(T < 2.0739) = 0.975 where T ∼ t(22). The 95% confidence interval for μ1 − μ2 is

14.564 − 14.291 ± 2.0739 (0.26697) √(1/14 + 1/10) = 0.273 ± 0.22924 = [0.04376, 0.50224].

Since

d = |14.564 − 14.291 − 0| / (0.26697 √(1/14 + 1/10)) = 2.4698

with

p-value = 2[1 − P(T ≤ 2.4698)] = 0.02175  where T ∼ t(22),

there is evidence against the hypothesis of no difference based on the data. This is consistent with the fact that the 95% confidence interval for μ1 − μ2 did not contain the value μ1 − μ2 = 0.
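A minimal R sketch of this two-sample comparison from the summary statistics above:

  ybar1 <- 14.564; ybar2 <- 14.291; n1 <- 14; n2 <- 10
  sp <- sqrt((13 * 0.0914 + 9 * 0.0422) / 22)
  se <- sp * sqrt(1/n1 + 1/n2)
  ci <- (ybar1 - ybar2) + c(-1, 1) * qt(0.975, df = 22) * se    # approximately [0.044, 0.502]
  pval <- 2 * (1 - pt(abs(ybar1 - ybar2) / se, df = 22))        # approximately 0.022
  list(ci = ci, pval = pval)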
6.15 We assume that the observations for the "Alcohol" group are a random sample from a G(μ1, σ) distribution and that the observations for the "Non-Alcohol" group are a random sample from a G(μ2, σ) distribution. To see if there is any difference between the two groups we construct a 95% confidence interval for the mean difference in reaction times μ1 − μ2. The pooled estimate of the common standard deviation is

s_p = √((0.608 + 0.35569)/22) = 0.2093.
6.16 (a) We assume that the observed differences are a random sample from a G(μ, σ) distribution. An estimate of σ is

s = √(17.135/7) = 1.5646.

(b) If the natural pairing is ignored, an estimate of the common standard deviation is

s_p = √((535.16875 + 644.83875)/14) = 9.18075.

Since P(T ≤ 2.1448) = 0.975 where T ∼ t(14), a 95% confidence interval for μ1 − μ2 is

23.6125 − 22.5375 ± 2.1448 (9.18075) √(1/8 + 1/8) = [−8.7704, 10.9204].
We notice that although both intervals in (a) and (b) are centered at the value 1.075, the interval in (b) is very much wider.

(c) A matched pairs study allows for a more precise comparison since differences between the 8 pairs have been eliminated. That is, by analyzing the differences we do not need to worry that there may have been large differences in the 8 cars which were used in the study with respect to other explanatory variates which might affect gas mileage (the response variate), such as size of engine, make of car, etc.
6.17 (a) We assume that the study population is the set of all factories of similar size. The parameter μ represents the mean difference in the number of staff hours per month lost due to accidents before and after the introduction of an industrial safety program in the study population.
(c) The observed value of the discrepancy measure is

d = |ȳ − 0| / (s/√n) = |−15.3375 − 0| / (12.8107/√8) = 3.39

with

p-value = 2[1 − P(T ≤ 3.39)] ≈ 0.012  where T ∼ t(7).

Since the p-value is between 0.01 and 0.05 there is reasonable evidence against the hypothesis H0: μ = 0 based on the data.
Since this experimental study was conducted as a matched pairs study, an analysis of the differences, y_i = y_{1i} − y_{2i}, allows for a more precise comparison since differences between the 8 pairs have been eliminated. That is, by analyzing the differences we do not need to worry that there may have been large differences in the safety records between factories due to other variates such as differences in the management at the different factories, differences in the type of work being conducted at the factories, etc. Note however that a drawback to the study was that we were not told how the 8 factories were selected. To do the analysis above we have assumed that the 8 factories are a random sample from the study population of all similar size factories but we do not know if this is the case.
6.18 (a) Since the two algorithms are each run on the same 20 sets of numbers, we analyse the differences y_i = y_{Ai} − y_{Bi}, i = 1, ..., 20. Since P(T < 2.8609) = (1 + 0.99)/2 = 0.995 where T ∼ t(19), we obtain the confidence interval

0.409 ± 2.8609 (0.487322)/√20 = [0.097, 0.721].

These values are all positive indicating strong evidence based on the data against H0: μA − μB = 0 (p-value < 0.01), that is, the data suggest that algorithm B is faster.
(b) To check the Normality assumption we plot a qqplot of the di¤erences. See
Figure 12.30. The data lie reasonably along a straight line and therefore a
Normal model is reasonable.
[Figure 12.30: Normal qqplot of the differences]
(e)

s_p = √((1.4697 + 0.9945)/2) = 1.11.

Since P(T < 2.86) = (1 + 0.99)/2 = 0.995 where T ∼ t(38), the interval, assuming common variance, is

ȳ1 − ȳ2 ± a s_p √(1/20 + 1/20) = 0.409 ± 2.68(1.11)√(1/20 + 1/20)

or

[−0.532, 1.349].
This second interval [−0.532, 1.349] is much wider than the first interval [0.097, 0.721] based on the paired experiment and, unlike the first interval, it contains the value zero. Unlike the paired design, independent samples of the same size (20 different problems run with each algorithm) are too small to demonstrate the superiority of algorithm B. Independent samples are a less efficient way to analyse the difference. This is why, in computer simulations, it is essential to be able to run different simulations using the same random number seed.
Chapter 7
7.1 The observed value of the likelihood ratio statistic is

λ = 2[14 log(14/21) + 28 log(28/21) + 36 log(36/29) + 22 log(22/29)] = 8.1701

with p-value ≈ P(W ≥ 8.1701) ≈ 0.004 where W ∼ χ²(1), so there is strong evidence against the hypothesis that the probability of rust occurring is the same for rust-proofed and non-rust-proofed cars based on the observed data.
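A small R sketch of this likelihood ratio calculation and its χ²(1) p-value:

  obs <- c(14, 28, 36, 22)
  exp <- c(21, 21, 29, 29)   # expected frequencies under the hypothesis of equal probabilities
  lrt <- 2 * sum(obs * log(obs / exp))
  c(lrt, 1 - pchisq(lrt, df = 1))   # 8.17 and p-value approximately 0.004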
7.2 If the probability of catching the cold is the same for each group, then it is estimated as 50/200 = 0.25, in which case the expected frequencies e_j in the four categories are 25, 75, 25, 75 respectively. The observed frequencies y_j are 20, 80, 30, 70. The observed value of the likelihood ratio statistic is

λ = 2[20 log(20/25) + 80 log(80/75) + 30 log(30/25) + 70 log(70/75)] = 2.6807

with p-value ≈ P(W ≥ 2.6807) ≈ 0.10 where W ∼ χ²(1). Based on the observed data there is no evidence against the hypothesis that the probability of catching a cold during the study period was the same for each group.
7.3 The total number of defectives among the 250(12) = 3000 items inspected gives

θ̂ = 274/3000 = 0.09133.

We want to test the hypothesis that the number of defectives in a box is Binomial(12, θ). Under this hypothesis and using θ̂ = 0.09133 we obtain the expected numbers in each category

e_i = 250 (12 choose i) θ̂^i (1 − θ̂)^{12−i}  for i = 0, 1, ..., 5,

and the last category is obtained by subtraction. Since the expected numbers in the last three categories are all less than 5, we pool these categories to improve the Chi-squared approximation and obtain an observed likelihood ratio statistic of 38.8552. Under the null hypothesis we had to estimate the parameter θ, and the degrees of freedom are 4 − 1 = 3. The p-value is P(W > 38.8552) ≈ 0 where W ∼ χ²(3), so based on the data there is very strong evidence that the Binomial model does not fit. The likely reason is that the defects tend to occur in batches when packed (so that there are more cartons with no defects than one would expect).
The expected frequencies assuming a Poisson(1.15) distribution are given in the table below:

Number of interruptions:  0      1      2      3      4     ≥5    Total
f_i:                      64     71     42     18     4     1     200
e_i:                      63.33  72.83  41.88  16.05  4.61  1.30  200

where

e_i = 200 (1.15)^i e^{−1.15}/i!  for i = 0, 1, ..., 4

and the last category is obtained by subtraction. Since the expected frequency in the last category is less than 5 we combine the last two categories to obtain

Number of interruptions:  0          1          2          3          ≥4       Total
f_i (e_i):                64(63.33)  71(72.83)  42(41.88)  18(16.05)  5(5.91)  200

and p-value ≈ P(W > 0.43) = 0.93 where W ∼ χ²(3). Based on the data there is no evidence against the hypothesis that the Poisson model fits the data.
The expected frequencies assuming the Binomial model are calculated using

e_{nj} = y_{n+} (n choose j) θ̂_n^j (1 − θ̂_n)^{n−j},  j = 0, 1, ..., n;  n = 2, 3, 4,

giving the following expected numbers of litters with j females:

                    j = 0    j = 1    j = 2    j = 3    j = 4   Total (y_{n+})
Litter size n = 2:  25.3125  39.375   15.3125                   80
Litter size n = 3:  8.4280   31.6049  39.5062  16.4609          96
Litter size n = 4:  7.0643   25.9964  35.8751  22.0034  5.0608  96

For n = 2, λ = 1.11 and p-value = P(W ≥ 1.11) ≈ 0.29 where W ∼ χ²(1), so there is no evidence based on the data against the Binomial model. Similarly for n = 3, we obtain λ = 4.22 and P(W ≥ 4.22) = 0.12 where W ∼ χ²(2), and there is no evidence based on the data against the Binomial model. For n = 4, λ = 1.36 and P(W ≥ 1.36) = 0.71 where W ∼ χ²(3) and there is also no evidence based on the data against the Binomial model.
(b) The likelihood function for θ1, θ2, θ3, θ4 is

L(θ1, θ2, θ3, θ4) = θ1^{12}(1 − θ1)^{8} θ2^{70}(1 − θ2)^{90} θ3^{160}(1 − θ3)^{128} θ4^{184}(1 − θ4)^{200},  0 < θn < 1, n = 1, 2, 3, 4.

Under the hypothesis θ1 = θ2 = θ3 = θ4 = θ the likelihood function is

L(θ) = θ^{12}(1 − θ)^{8} θ^{70}(1 − θ)^{90} θ^{160}(1 − θ)^{128} θ^{184}(1 − θ)^{200}
     = θ^{12+70+160+184} (1 − θ)^{8+90+128+200}
     = θ^{426} (1 − θ)^{426},  0 < θ < 1.
Under the hypothesis θ = 0.5, the expected frequencies are

e_{nj} = y_{n+} (n choose j) (0.5)^n,  j = 0, 1, ..., n;  n = 2, 3, 4,

and the observed value of the likelihood ratio statistic is

λ = 2[8 log(8/10) + 12 log(12/10) + ⋯ + 22 log(22/24) + 5 log(5/6)] = 14.27.
7.6 This process can be thought of as an experiment in which we observe y_i = the number of non-zero digits (Failures) until the first zero (Success), for i = 1, 2, ..., 50. Therefore the Geometric(θ) distribution is an appropriate model. Since θ is unknown we estimate it using the maximum likelihood estimate. The likelihood function for θ is

L(θ) = ∏_{i=1}^{50} θ(1 − θ)^{y_i} = θ^{50} (1 − θ)^{∑_{i=1}^{50} y_i},  0 < θ < 1.

For these data ∑_{i=1}^{50} y_i = 348 and the log likelihood function is

l(θ) = 50 log θ + 348 log(1 − θ),  0 < θ < 1.

Solving

l'(θ) = 50/θ − 348/(1 − θ) = 0

gives the maximum likelihood estimate

θ̂ = 50/(50 + 348) = 0.1256
for θ. To test the fit of the model we summarize the data in a frequency table:

# between 2 zeros:  0  1  2  3  4  5  6  7  8  10  12  13  14  15  16  18  19  20  21  22  26
# of occurrences:   6  4  9  3  5  2  2  3  2  2   1   1   1   1   1   1   1   1   1   2   1

The expected frequencies are

e_j = 50 (0.1256) (1 − 0.1256)^j,  j = 0, 1, ... .

Pooling categories so that the expected frequencies are not too small gives

Observations between two 0's:  0     1     2-3   4-5   6-7   8-10  ≥11    Total
Observed frequency f_j:        6     4     12    7     5     4     12     50
Expected frequency e_j:        6.28  5.49  9.00  6.88  5.26  5.67  11.42  50

The observed value of the likelihood ratio statistic is λ = 1.96. The degrees of freedom for the Chi-squared approximation are 7 − 1 − 1 = 5 and the p-value ≈ P(W ≥ 1.96) ≈ 0.9 where W ∼ χ²(5). There is no evidence based on the data against the hypothesis that the Geometric distribution is a good model for these data.
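A sketch of the pooled goodness-of-fit calculation in R, using the observed and expected frequencies in the table above:

  obs <- c(6, 4, 12, 7, 5, 4, 12)
  exp <- c(6.28, 5.49, 9.00, 6.88, 5.26, 5.67, 11.42)
  lrt <- 2 * sum(obs * log(obs / exp))                 # approximately 1.96
  c(lrt, 1 - pchisq(lrt, df = length(obs) - 1 - 1))    # degrees of freedom 7 - 1 - 1 = 5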
so there is evidence based on the data against the hypothesis that the two classifications are independent.

The observed value of the likelihood ratio statistic is λ = 0.5587 with p-value ≈ P(W ≥ 0.5587) = 0.9058 where W ∼ χ²(3), so there is no evidence based on the data to contradict the hypothesis of no association between the sex distribution and age of the mother.

(b) The expected frequencies are:

The observed value of the likelihood ratio statistic is λ = 5.4441 with p-value ≈ P(W ≥ 5.4441) = 0.1420 where W ∼ χ²(3). There is no evidence based on the data against the Binomial model.
The observed value of the likelihood ratio statistic is λ = 10.8 with p-value ≈ P(W ≥ 10.8) = 0.013 where W ∼ χ²(3). Therefore there is evidence based on the data against the hypothesis that birth weight is independent of parental smoking habits.

(b) The expected frequencies, depending on whether the mother is a smoker or non-smoker, are:

Mother smokes (e_ij):
                 Father smokes      Father non-smoker  Total
Above average    15(30)/46 = 9.78   5.22               15
Below average    20.22              10.78              31
Total            30                 16                 46

Mother non-smoker (e_ij):
                 Father smokes       Father non-smoker  Total
Above average    18(35)/54 = 11.67   23.33              35
Below average    6.33                12.67              19
Total            18                  36                 54
For the Mother smokes table, the observed value of the likelihood ratio statistic is λ = 0.2644 with p-value ≈ P(W ≥ 0.2644) ≈ 0.61 where W ∼ χ²(1). For the Mother non-smoker table, the observed value of the likelihood ratio statistic is λ = 0.04078 with p-value ≈ P(W ≥ 0.04078) ≈ 0.84 where W ∼ χ²(1). In both cases there is no evidence based on the data against the hypothesis that, given the smoking habits of the mother, birth weight is independent of the smoking habits of the father.
Chapter 8
8.1 (a) The observed value of the likelihood ratio statistic is λ = 480.65 so the p-value is almost zero; there is very strong evidence against independence based on the data.
8.3 (a) The observed value of the likelihood ratio statistic is = 112 and p value t 0.
(b) Only Program A shows any evidence of non-independence, and that is in the
direction of a lower admission rate for males.
APPENDIX B: SAMPLE TESTS
[14] 2. Fill in the blanks below. You may use a numerical value or one of the following words
or phrases: sample skewness, sample kurtosis, sample variance, sample mean, relative
frequencies, frequencies, histogram, boxplot.
[12] 3. Researchers are interested in the relationship between a certain gene and the risk
of contracting diabetes. A gene is said to be expressed if its coded information is
converted into certain proteins. A team of researchers investigates whether there is
a relationship between a certain gene being expressed, and whether or not a person
contracts diabetes in their lifetime. The team takes a random sample of 100 people
who are aged 55 or above. For each person selected they determine (i) age, (ii)
whether or not the gene is expressed, (iii) the person’s insulin level, and (iv) if the
person has diabetes.
ii. an observational study because we are recording observations for each sampled unit.
[3] b. The “age” of the subject is an example of (check only those that apply)
i. an explanatory variate because it explains how long the subject is in the
study.
ii. an explanatory variate because it may help to explain whether a given person
will contract diabetes.
iii. a non-Normal variate because subjects may lie about their age.
[3] c. The Plan step in PPDAC for this experiment includes (check only those that
apply)
i. the question of whether or not diabetes was related to the expression of the
gene.
ii. the sampling protocol or the procedure used to select the sample.
[3] d. In the Problem step of PPDAC, we (check only those that apply)
Min 1st Quartile Median Mean 3rd Quartile Max Sample s.d.
30 40 45 44:02 49 57 6:65
[Histogram of the test scores (Baumann$post.test.3)]

Figure 11.3: Normal qq plot for test scores with superimposed line and confidence region
[Boxplot of post.test.3]
Based on these plots and statistics circle True or False for the following
statements.
[13] 5. [7] a. Suppose y1, y2, ..., y25 are the observed values in a random sample from the Poisson(θ) distribution. Find the maximum likelihood estimate of θ. Show all your steps.
[6] b. Suppose y1, y2, ..., y10 are the observed values in a random sample from the probability density function

f(y; θ) = (y/θ²) e^{−y/θ}  for y > 0

where 0 < θ < ∞. Find the maximum likelihood estimate of θ. Show all your steps.
The set of university students living in the Kitchener Waterloo region in Sep-
tember 2012.
(d) What are two variates in this problem?
There may be a difference between KW university students and the population of Ontario university students; for example, university students in Toronto and Thunder Bay may have different financial worries than KW university students.

(g) A possible source of sampling error is:

Since more males tend to go to football games there may be a difference between the proportion of males in the sample and the proportion of males in the study population.
(h) Describe an attribute of interest for the target population and provide an esti-
mate based on the given data.
[14] 2. Fill in the blanks below. You may use a numerical value or one of the following words
or phrases: sample skewness, sample kurtosis, sample variance, sample mean, relative
frequencies, frequencies, histogram, boxplot.
[2] a. A large positive value of the sample skewness indicates that the distribution
is not symmetric and the right tail is larger than the left.
[12] 3. Researchers are interested in the relationship between a certain gene and the risk
of contracting diabetes. A gene is said to be expressed if its coded information is
converted into certain proteins. A team of researchers investigates whether there is
a relationship between a certain gene being expressed, and whether or not a person
contracts diabetes in their lifetime. The team takes a random sample of 100 people
who are aged 55 or above. For each person selected they determine (i) age, (ii)
whether or not the gene is expressed, (iii) the person’s insulin level, and (iv) if the
person has diabetes.
ii. [✓] an observational study because we are recording observations for each sampled unit.
[3] b. The “age” of the subject is an example of (check only those that apply)
i. an explanatory variate because it explains how long the subject is in the
study.
ii. [✓] an explanatory variate because it may help to explain whether a given person will contract diabetes.
iii. a non-Normal variate because subjects may lie about their age.
[3] c. The Plan step in PPDAC for this experiment includes (check only those that
apply)
i. the question of whether or not diabetes was related to the expression of the
gene.
ii. [✓] the sampling protocol or the procedure used to select the sample.
iii. [✓] the specification of the sample size.
[3] d. In the Problem step of PPDAC, we (check only those that apply)
Based on these plots and statistics circle True or False for the following
statements.
L(θ) = ∏_{i=1}^{n} θ^{y_i} e^{−θ}/y_i!
     = θ^{∑_{i=1}^{n} y_i} e^{−nθ} ∏_{i=1}^{n} (1/y_i!)   (note that the term ∏_{i=1}^{n} (1/y_i!) is optional).
where 0 < θ < ∞. Find the maximum likelihood estimate of θ. Show all your steps.

L(θ) = ∏_{i=1}^{n} (y_i/θ²) e^{−y_i/θ} = (∏_{i=1}^{n} y_i) θ^{−2n} exp(−(1/θ) ∑_{i=1}^{n} y_i)  for θ > 0

or more simply

L(θ) = θ^{−2n} exp(−nȳ/θ)  for θ > 0.
(b) Between December 20, 2013 and February 7, 2014 the Kitchener City Council conducted
an online survey which was posted on the City of Kitchener’s website. The online survey
was publicized in the local newspapers, radio stations and TV news. The purpose of the
survey was to determine whether or not the citizens of Kitchener supported a proposal to
put life sized bronze statues of Canada’s past prime ministers in Victoria Park, Kitchener
as a way to celebrate Canada’s 150th. The community group that had proposed the idea
had already received 2 million dollars in pledges and was asking the city for a contribution
of $300,000 over three years.
People who took part in the survey were asked "Do you support the statue proposal in
concept, by which we mean do you like the idea even if you don’t agree with all aspects of
the proposal?" Of the 2441 who took the survey, 1920 answered no to this question.
(i) Explain clearly whether you think using the online survey was a good way for the
City of Kitchener to determine whether or not the citizens of Kitchener support the Prime
Ministers’Statues Project.
(ii) Assume the model Y v Binomial (n; ) where Y = number of people who responded
no to the question "Do you support the statue proposal in concept, by which we mean do
you like the idea even if you don’t agree with all aspects of the proposal?" What does the
parameter represent in this study?
(iv) An approximate 95% con…dence interval for based on the observed data is
_________________________.
(v) By reference to the con…dence interval, indicate what you know about the p value
for a test of the hypothesis H0 : = 0:8?
(c) Suppose a Binomial experiment is conducted and the observed 95% con…dence interval
for is [0:1; 0:2]. This means (circle the letter for the correct answer):
A : The probability that is contained in the interval [0:1; 0:2] equals 0:95.
B : If the Binomial experiment was repeated 100 times independently and a 95% confidence interval was constructed each time then approximately 95 of these intervals would contain the true value of θ.
2: [20] At the R.A.T. laboratory a large number of genetically engineered rats are raised
for conducting research. Twenty rats are selected at random and fed a special diet. The
weight gains (in grams) from birth to age 3 months of the rats fed this diet are:
63.4 68.3 52.0 64.5 62.3 55.8 59.3 62.4 75.8 72.1
55.6 73.2 63.9 60.7 63.9 60.2 60.5 67.1 66.6 66.7
Let yi = weight gain of the ith rat, i = 1, ..., 20. For these data
∑_{i=1}^{20} y_i = 1273.8 and ∑_{i=1}^{20} (y_i − ȳ)² = 665.718.
Y_i ∼ N(μ, σ²) = G(μ, σ), i = 1, ..., 20
[Normal QQ-plot of the weight gains: quantiles of input sample versus standard Normal quantiles]
(b) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
(d) Let
T = (Ȳ − μ) / (S/√20) where S² = (1/19) ∑_{i=1}^{20} (Y_i − Ȳ)².
(e) The company, R.A.T. Chow, that produces the special diet claims that the mean weight gain for rats that are fed this diet is 67 grams.
The p value for testing the hypothesis H0: μ = 67 is between _____________ and _______________.
(f) Let W = (1/σ²) ∑_{i=1}^{20} (Y_i − Ȳ)².
(g) A 90% confidence interval for σ for the given data is _____________________.
f(y; θ) = (1/θ) e^{−y/θ} for y > 0 and θ > 0.
g(w) = (1/2) e^{−w/2}, for w > 0
which is the probability density function of a random variable with a χ²(2) distribution.
(b) Suppose Y1, ..., Yn is a random sample from the Exponential(θ) distribution. Use your result from (a) and theorems that you have learned in class to prove that
U = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(2n).
(c) Explain clearly how the pivotal quantity U can be used to obtain a two-sided 100p% confidence interval for θ.
U = (2/θ) ∑_{i=1}^{25} Y_i ∼ χ²(50).
(e) Suppose y1, ..., y25 is an observed random sample from the Exponential(θ) distribution with ∑_{i=1}^{25} y_i = 560.
Justification: An approximate 90% confidence interval for θ is given by θ̂ ± 1.645 √(θ̂(1−θ̂)/n) since P(Z ≤ 1.645) = 0.95 where Z ∼ N(0, 1), which has width 2(1.645)√(θ̂(1−θ̂)/n). Therefore we need n such that
1.645 √(θ̂(1−θ̂)/n) ≤ 0.02
or n ≥ (1.645/0.02)² θ̂(1−θ̂).
Since we don't know θ̂ and the right side of the inequality takes on its largest value for θ̂ = 0.5 we choose n such that
n ≥ (1.645/0.02)² (0.5)² = 1691.3.
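As a quick numerical check of this sample-size calculation (not part of the original solution), here is a minimal Python sketch; the variable names are illustrative only.

    from scipy.stats import norm

    # 90% interval: z with P(Z <= z) = 0.95
    z = norm.ppf(0.95)            # approximately 1.645
    margin = 0.02                 # required half-width of the interval

    # The half-width z*sqrt(theta_hat*(1 - theta_hat)/n) is largest
    # when theta_hat = 0.5, so use that worst case.
    theta_hat = 0.5
    n_required = (z / margin) ** 2 * theta_hat * (1 - theta_hat)
    print(n_required)             # approximately 1691, so take n = 1692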
(b) Between December 20, 2013 and February 7, 2014 the Kitchener City Council conducted an online survey which was posted on the City of Kitchener's website. The online survey was publicized in the local newspapers, radio stations and TV news. The purpose of the survey was to determine whether or not the citizens of Kitchener supported a proposal to put life-sized bronze statues of Canada's past prime ministers in Victoria Park, Kitchener as a way to celebrate Canada's 150th birthday. The community group that had proposed the idea had already received 2 million dollars in pledges and was asking the city for a contribution of $300,000 over three years.
People who took part in the survey were asked "Do you support the statue proposal in concept, by which we mean do you like the idea even if you don't agree with all aspects of the proposal?" Of the 2441 who took the survey, 1920 answered no to this question.
(i) [3] Explain clearly whether you think using the online survey was a good way for the City of Kitchener to determine whether or not the citizens of Kitchener support the Prime Ministers' Statues Project.
This is not a good way for the City of Kitchener to determine whether or not the citizens of Kitchener support the Prime Ministers' Statues Project.
The respondents to the survey are people who heard about the survey through local
media, had access to the internet and then took the time to complete the survey. These
people are probably not representative of all citizens of Kitchener. This is an example of
sampling error.
To obtain a representative sample you would need to select a random sample of all
citizens living in Kitchener.
(ii) [2] Assume the model Y ∼ Binomial(n, θ) where Y = number of people who responded no to the question "Do you support the statue proposal in concept, by which we mean do you like the idea even if you don't agree with all aspects of the proposal?" The parameter θ corresponds to what attribute of interest in the study population? Be sure to define the study population.
The parameter θ corresponds to the proportion of people in the study population, which consists of all citizens of Kitchener, who would respond no to the question.
(iii) [2] A point estimate of θ based on the observed data is 1920/2441 = 0.7866.
(iv) [4] An approximate 95% confidence interval for θ based on the observed data is [0.7703, 0.8029]:
1920/2441 ± 1.96 √[(1920/2441)(1 − 1920/2441)/2441] = 0.7866 ± 0.0163 = [0.7703, 0.8029]
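A short Python sketch (not part of the original solution) that reproduces this interval; the variable names are illustrative only.

    import math
    from scipy.stats import norm

    y, n = 1920, 2441
    theta_hat = y / n                              # 0.7866
    z = norm.ppf(0.975)                            # approximately 1.96
    half_width = z * math.sqrt(theta_hat * (1 - theta_hat) / n)
    print(theta_hat - half_width, theta_hat + half_width)   # approximately [0.7703, 0.8029]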
(v) [2] By reference to the confidence interval, indicate what you know about the p value for a test of the hypothesis H0: θ = 0.8?
Since θ = 0.8 is a value contained in the interval [0.7703, 0.8029], the p value for testing H0: θ = 0.8 is greater than or equal to 0.05.
(Note that since θ = 0.8 is very close to the upper endpoint of the interval, the p value would be very close to 0.05.)
(c) [2] Suppose a Binomial experiment is conducted and the observed 95% confidence interval for θ is [0.1, 0.2]. This means (circle the letter for the correct answer):
A : The probability that θ is contained in the interval [0.1, 0.2] equals 0.95.
B : If the Binomial experiment was repeated 100 times independently and a 95% confidence interval was constructed each time then approximately 95 of these intervals would contain the true value of θ.
The correct answer is B.
2: [20] At the R.A.T. laboratory a large number of genetically engineered rats are raised
for conducting research. Twenty rats are selected at random and fed a special diet. The
weight gains (in grams) from birth to age 3 months of the rats fed this diet are:
63.4 68.3 52.0 64.5 62.3 55.8 59.3 62.4 75.8 72.1
55.6 73.2 63.9 60.7 63.9 60.2 60.5 67.1 66.6 66.7
Let yi = weight gain of the ith rat, i = 1, ..., 20. For these data
∑_{i=1}^{20} y_i = 1273.8 and ∑_{i=1}^{20} (y_i − ȳ)² = 665.718.
Y_i ∼ N(μ, σ²) = G(μ, σ), i = 1, ..., 20
Since the points in the qqplot lie reasonably along a straight line the Gaussian model
seems reasonable for these data.
[Normal QQ-plot of the weight gains: quantiles of input sample versus standard Normal quantiles]
(b) [4] Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
A suitable study population consists of the genetically engineered rats which are raised for conducting research at the R.A.T. laboratory.
The parameter μ corresponds to the mean weight gain of the rats fed the special diet from birth to age 3 months in the study population.
The parameter σ corresponds to the standard deviation of the weight gains of the rats fed the special diet from birth to age 3 months in the study population.
The maximum likelihood estimate of μ is ȳ = 1273.8/20 = 63.69 and the maximum likelihood estimate of σ is (665.718/20)^{1/2} = (33.2859)^{1/2} = 5.7694.
(You do not need to derive these estimates.)
T = (Ȳ − μ) / (S/√20) where S² = (1/19) ∑_{i=1}^{20} (Y_i − Ȳ)².
(e) [6] The company, R.A.T. Chow, that produces the special diet claims that the mean weight gain for rats that are fed this diet is 67 grams.
s = (665.718/19)^{1/2} = 5.9193 and the observed value of the test statistic is
d = |ȳ − 67| / (s/√20) = |63.69 − 67| / (5.9193/√20) = 2.5008.
From t tables the p value for testing H0: μ = 67 is between 0.02 and 0.05.
Since the p value is less than 0.05, there is evidence against R.A.T. Chow's claim, H0: μ = 67, based on the observed data.
(f) [3] Let W = (1/σ²) ∑_{i=1}^{20} (Y_i − Ȳ)².
(g) [2] A 90% confidence interval for σ for the given data is [4.6994, 8.1118].
" #
1=2 1=2
665:718 665:718
; = [22:0846; 65:8019] = [4:6994; 8:1118]
30:144 10:117
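For readers who want to verify the Chi-squared quantiles and the resulting interval for σ numerically, here is a minimal Python sketch (not part of the original solution).

    import math
    from scipy.stats import chi2

    ss = 665.718                    # sum of (y_i - ybar)^2 for the 20 rats
    df = 19                         # n - 1 degrees of freedom

    # 90% interval: 5% in each tail of the chi-squared(19) distribution
    a = chi2.ppf(0.05, df)          # approximately 10.117
    b = chi2.ppf(0.95, df)          # approximately 30.144

    ci_sigma = (math.sqrt(ss / b), math.sqrt(ss / a))
    print(ci_sigma)                 # approximately (4.6994, 8.1118)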
f(y; θ) = (1/θ) e^{−y/θ} for y > 0 and θ > 0.
g(w) = (1/2) e^{−w/2}, for w > 0
which is the probability density function of a random variable with a χ²(2) distribution.
For w ≥ 0,
G(w) = P(W ≤ w) = P(2Y/θ ≤ w) = P(Y ≤ θw/2) = F(θw/2)
where
F(y) = P(Y ≤ y).
Therefore
g(w) = G′(w) = f(θw/2) · (θ/2) = (1/θ) exp(−(θw/2)/θ) · (θ/2) = (1/2) e^{−w/2}, for w ≥ 0
as required.
(b) [3] Suppose Y1, ..., Yn is a random sample from the Exponential(θ) distribution. Use your result from (a) and theorems that you have learned in class to prove that
U = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(2n).
Since the sum of independent Chi-squared random variables has a Chi-squared distribution with degrees of freedom equal to the sum of the degrees of freedom of the Chi-squared random variables in the sum, therefore
U = (2/θ) ∑_{i=1}^{n} Y_i ∼ χ²(∑_{i=1}^{n} 2) or χ²(2n)
as required.
(c) [4] Explain clearly how the pivotal quantity U can be used to obtain a two-sided 100p% confidence interval for θ.
Using Chi-squared tables find a and b such that P(U ≤ a) = (1 − p)/2 = P(U ≥ b) where U ∼ χ²(2n).
Since
p = P(a ≤ U ≤ b)
  = P(1/b ≤ θ / (2 ∑_{i=1}^{n} Y_i) ≤ 1/a)
  = P(2 ∑_{i=1}^{n} Y_i / b ≤ θ ≤ 2 ∑_{i=1}^{n} Y_i / a)
then
[2 ∑_{i=1}^{n} y_i / b, 2 ∑_{i=1}^{n} y_i / a]
is a two-sided 100p% confidence interval for θ.
U = (2/θ) ∑_{i=1}^{25} Y_i ∼ χ²(50).
(e) [3] Suppose y1, ..., y25 is an observed random sample from the Exponential(θ) distribution with ∑_{i=1}^{25} y_i = 560.
The maximum likelihood estimate for θ is 560/25 = 22.4. (You do not need to derive this estimate.)
[2(560)/67.505, 2(560)/34.764] = [16.5914, 32.2172]
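The quantiles 34.764 and 67.505 used above appear to be the 5% and 95% points of the χ²(50) distribution, so this interval puts 5% in each tail, i.e., a 90% confidence interval for θ. A minimal Python sketch (not part of the original solution) reproducing it:

    from scipy.stats import chi2

    n, total = 25, 560.0            # number of observations and sum of y_i
    df = 2 * n                      # U = (2/theta) * sum(Y_i) ~ chi-squared(2n)

    a = chi2.ppf(0.05, df)          # approximately 34.764
    b = chi2.ppf(0.95, df)          # approximately 67.505

    ci_theta = (2 * total / b, 2 * total / a)
    print(ci_theta)                 # approximately (16.59, 32.22)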
Cost of Advertising (x)   Number of Communities   Total Sales (y)
1.2                       5                       4.9   3.0   3.6   4.4   8.8
2.4                       5                       8.6   6.8   8.4   8.7   7.8
3.6                       5                       8.3   8.3   8.0   8.8   7.7
4.8                       5                       11.0  10.8  11.6  12.0  10.1
x̄ = 3, ȳ = 7.93, Sxx = ∑_{i=1}^{20} (x_i − x̄)² = 36,
Syy = ∑_{i=1}^{20} (y_i − ȳ)² = 125.282, Sxy = ∑_{i=1}^{20} (x_i − x̄)(y_i − ȳ) = 61.32
The model
Y_i = α + β x_i + R_i, where R_i ∼ N(0, σ²) = G(0, σ), i = 1, ..., 20 independently
is assumed where α, β and σ are unknown parameters and the x_i's are assumed to be known constants.
(a) [1] Is this an experimental or observational study? Explain.
(b) [4] Calculate the maximum likelihood estimates of α and β for these data and draw the fitted line on the scatterplot below. How well does this line fit the data? Do you notice anything unusual?
(c) [3] A Normal QQ-plot of the estimated residuals is given below.
Explain clearly how this plot is obtained. What conclusions can be drawn from this
plot about the validity of the assumed model for these data?
(d) [5] Test the hypothesis that there is no relationship between the amount of money
spent in advertising a product on local television in one week and the sales of the product
in the following week. Show all your work.
[Scatterplot of Sales (thousands of dollars) versus Advertising spending (thousands of dollars)]
[Normal probability plot of the estimated residuals]
(e) [2] Would you conclude that an increase in the amount of money spent in advertising
causes an increase in the sales of the product in the following week? Explain your answer.
(f) [3] If the amount of dollars spent on advertising a product on local television in one week is 5 thousand dollars, find a 90% prediction interval for the sales of the product (in thousands of dollars) in the following week.
2: [16] A wind farm is a group of wind turbines in the same location used for production of electric power. The number of wind farms is increasing as we try to move to more renewable forms of energy. Wind turbines are most efficient if the mean windspeed is 16 km/h or greater.
The windspeed Y at a specific location is modeled using the Rayleigh distribution which has probability density function
f(y; θ) = (2y/θ) e^{−y²/θ}, y ≥ 0, θ > 0
(b) [2] To determine whether a location called Windy Hill is a good place for a wind farm, the windspeed was measured in km/h on 14 different days as given below:
14.7  30.0  13.3  41.9  25.6  39.6  34.5  9.9  13.6  24.2  5.1  41.4  20.5  22.2
∑_{i=1}^{14} y_i = 336.5 and ∑_{i=1}^{14} y_i² = 9984.03
For these data calculate the Maximum Likelihood estimate of θ and give the Relative Likelihood function for θ.
(c) [5] If the random variable Y has a Rayleigh distribution then E(Y) = √(πθ)/2. Thus a mean of 20 km/h corresponds to θ = (40)²/π ≈ 509.3. The owner of Windy Hill claims that the average windspeed at Windy Hill is 20 km/h. Test the hypothesis H0: θ = 509.3 using the given data and the likelihood ratio test statistic. Show all your work.
If n = 14, find a and b such that P(W ≤ a) = 0.025 = P(W ≥ b). Use the pivotal quantity W and the data from Windy Hill to construct an exact 95% confidence interval for θ. Show all your work.
(e) [2] Would you recommend that a wind farm be situated at Windy Hill? Justify your answer.
3: [14] Two drugs, both in identical tablet form, were each given to 10 volunteer subjects in a pilot drug trial. The order in which each volunteer received the drugs was randomized and the drugs were administered one day apart. For each drug the antibiotic blood serum level was measured one hour after medication. The data are given below:
Subject: i              1      2      3      4      5      6      7      8      9      10
Drug A: ai              1.08   1.19   1.22   0.60   0.55   0.53   0.56   0.93   1.43   0.67
Drug B: bi              1.48   0.62   0.65   0.32   1.48   0.79   0.43   1.69   0.73   0.71
Difference: yi = ai − bi  −0.40  0.57   0.57   0.28   −0.93  −0.26  0.13   −0.76  0.70   −0.04
∑_{i=1}^{10} y_i = −0.14 and ∑_{i=1}^{10} (y_i − ȳ)² = 2.90484.
Y_i = μ + R_i, where R_i ∼ N(0, σ²) = G(0, σ), i = 1, ..., 10 independently
(b) [5] Test the hypothesis of no difference in the mean response for the two drugs, that is, test H0: μ = 0. Show all your work.
(d) [2] This experiment is a matched pairs experiment. Explain why this type of design is better than a design in which 20 volunteers are randomly divided into two groups of 10 with one group receiving drug A and the other group receiving drug B.
(e) [2] Explain the importance of randomizing the order of the drugs, the fact that the drugs were given in identical tablet form and the fact that the drugs were administered one day apart.
4: [13] Exhaust emissions produced by motor vehicles are a major source of air pollution. One of the major pollutants in vehicle exhaust is carbon monoxide (CO). An environmental group interested in studying CO emissions for light-duty engines purchased 11 light-duty engines from Manufacturer A and 12 light-duty engines from Manufacturer B. The amount of CO emitted in grams per mile for each engine was measured. The data are given below:
Manufacturer A:
5.01  8.60  4.95  7.51  14.59  11.53  5.21  9.62  15.13  3.95  4.12
∑_{j=1}^{11} y_{1j} = 90.22 and ∑_{j=1}^{11} (y_{1j} − ȳ_1)² = 166.9860
Manufacturer B:
16.67  6.42  9.24  14.30  9.98  6.10  14.10  16.97  7.04  5.38  25.53  24.92
∑_{j=1}^{12} y_{2j} = 136.65 and ∑_{j=1}^{12} (y_{2j} − ȳ_2)² = 218.7656
(b) [4] Calculate a 99% confidence interval for the difference in the means: μ1 − μ2.
(d) [2] What conclusions can the environmental group draw from this study? Justify your answer.
5: [9] In a court case challenging an Oklahoma law that differentiated the ages at which young men and women could buy 3.2% beer, the Supreme Court examined evidence from a random roadside survey that measured information on age, gender, and drinking behaviour. The table below gives the results for the drivers under 20 years of age.
(b) [5] Test the hypothesis of no relationship (independence) between the two variates: gender and whether or not the driver drank alcohol in the last 2 hours. Show all your work.
(c) [2] The Supreme Court decided to strike down the law that differentiated the ages at which young men and women could buy 3.2% beer based on the evidence presented. Do you agree with the Supreme Court's decision? Justify your answer.
6: The Survey of Study Habits and Attitudes (SSHA) is a psychological test that evaluates university students' motivation, study habits, and attitudes toward university. At a small university college 19 students are selected at random and given the SSHA test. Their scores are:
10 10 11 12 13 13 13 14 14 14
14 15 15 15 16 16 17 18 20
∑_{i=1}^{19} y_i = 270 and ∑_{i=1}^{19} y_i² = 3956.
For these data calculate the mean, median, mode, sample variance, range, and interquartile range.
7: A dataset consisting of six columns of data was collected by interviewing 100 students on the University of Waterloo campus. The columns are:
Column 1: Sex of respondent
Column 2: Age of respondent
Column 3: Weight of respondent
Column 4: Faculty of respondent
Column 5: Number of courses respondent has failed.
Column 6: Whether the respondent (i) strongly disagreed, (ii) disagreed, (iii) agreed or (iv) strongly agreed with the statement “The University of Waterloo is the best university in Ontario.”
(a) For this dataset give an example of each of the following types of data;
discrete__________
continuous___________
categorical____________
binary_______________
ordinal______________
(b) Two ways to graphically represent categorical data are ____________ and
________________.
(c) A graphical way to examine the relationship between heights and weights is a
______________.
(d) If the sample correlation between heights and weights was 0.4 you would con-
clude_____________.
(b) [4]
[Scatterplot of Sales (thousands of dollars) versus Advertising spending (thousands of dollars) with the fitted line ŷ = 2.82 + 1.703x]
Looking at the scatterplot and the fitted line we notice that for x = 2.4, 4 of the 5 data points lie above the fitted line while for x = 3.6, all 5 of the data points lie below the fitted line. This suggests that the linear model might not be the best model for these data.
(c) [3]
Calculate the estimated residuals r_i = y_i − ŷ_i = y_i − (α̂ + β̂ x_i), i = 1, ..., 20 and order the residuals from smallest to largest: r_(1), ..., r_(n).
Calculate q_i, i = 1, ..., 20 where q_i satisfies F(q_i) = (i − 0.5)/20 and F is the N(0, 1) cumulative distribution function. Plot the points (r_(i), q_i), i = 1, ..., 20.
OR:
Calculate the estimated residuals r_i = y_i − ŷ_i = y_i − (α̂ + β̂ x_i), i = 1, ..., 20 and order the residuals from smallest to largest: r_(1), ..., r_(n).
Plot the ordered residuals against the theoretical quantiles of the Normal distribution.
Since there is no obvious pattern of departure from a straight line we would conclude that there is no evidence against the Normality assumption R_i ∼ N(0, σ²), i = 1, ..., 20.
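A minimal Python sketch of the QQ-plot construction described above (not part of the original solution); the residual vector here is made up purely for illustration, since the estimated residuals would be computed from the data.

    import numpy as np
    from scipy.stats import norm

    # Illustrative residuals only; in practice r_i = y_i - (alpha_hat + beta_hat*x_i)
    r = np.array([ 0.3, -0.8, 1.1, -0.2, 0.5, -1.4, 0.9, 0.1, -0.6, 0.4,
                  -0.3, 0.7, -1.0, 0.2, 0.6, -0.5, 1.2, -0.9, 0.0, -0.4])

    r_ordered = np.sort(r)                    # r_(1) <= ... <= r_(20)
    p = (np.arange(1, 21) - 0.5) / 20         # (i - 0.5)/20
    q = norm.ppf(p)                           # q_i with F(q_i) = (i - 0.5)/20

    # Plotting the pairs (r_(i), q_i) gives the Normal QQ-plot; a roughly
    # straight-line pattern supports the Normality assumption.
    for ri, qi in zip(r_ordered, q):
        print(round(ri, 2), round(qi, 3))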
p value = P(D ≥ 9.50; H0) = P(|T| ≥ 9.50) where T ∼ t(18), which is approximately 0.
Therefore there is very strong evidence based on the data against the hypothesis of no relationship between the amount of money spent in advertising a product on local television in one week and the sales of the product in the following week.
(e) [2] Since this study was an experimental study, since there was strong evidence against H0: β = 0, and since the slope of the fitted line was β̂ = 1.703 > 0, the data suggest that an increase in the amount of money spent advertising causes an increase in the sales of the product in the following week. However we don't know if the 4 levels of spending on advertising were assigned to the communities using randomization. If the levels of advertising were not randomly applied then the differences in the sales of the product could be due to differences between the communities. For example, if the highest (lowest) level was applied to the richest (poorest) communities you might expect to see the same pattern of response as was observed.
(f) [3] From t tables P(T ≤ 1.73) = 0.95 where T ∼ t(18). A 90% prediction interval for the sales of the product (in thousands of dollars) in the following week if x = 5 is
2.82 + 1.703(5) ± (1.73)(1.0758)[1 + 1/20 + (5 − 3)²/36]^{1/2}
= 11.3367 ± 2.0055
= [9.33, 13.34]
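The following Python sketch (not part of the original solution) reproduces the fitted line, the test of H0: β = 0 and the 90% prediction interval from the summary statistics; variable names are illustrative only.

    import math
    from scipy.stats import t

    n = 20
    xbar, ybar = 3.0, 7.93
    Sxx, Syy, Sxy = 36.0, 125.282, 61.32

    beta_hat = Sxy / Sxx                               # approximately 1.703
    alpha_hat = ybar - beta_hat * xbar                 # approximately 2.82
    se = math.sqrt((Syy - beta_hat * Sxy) / (n - 2))   # approximately 1.0758

    # Test of no relationship, H0: beta = 0
    d = abs(beta_hat) / (se / math.sqrt(Sxx))          # approximately 9.50
    p_value = 2 * t.sf(d, df=n - 2)                    # essentially 0

    # 90% prediction interval for sales in a week with x = 5
    x0 = 5.0
    tq = t.ppf(0.95, df=n - 2)                         # approximately 1.73
    yhat = alpha_hat + beta_hat * x0                   # approximately 11.34
    hw = tq * se * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
    print(yhat - hw, yhat + hw)                        # approximately [9.33, 13.34]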
2: [16]
(a) [4] The likelihood function is
L(θ) = ∏_{i=1}^{n} (2y_i/θ) e^{−y_i²/θ} = (∏_{i=1}^{n} 2y_i) θ^{−n} exp(−(1/θ) ∑_{i=1}^{n} y_i²), θ > 0
The log likelihood (ignoring the additive constant) is l(θ) = −n log θ − (1/θ) ∑_{i=1}^{n} y_i², and l′(θ) = 0 if θ = (1/n) ∑_{i=1}^{n} y_i². Therefore the Maximum Likelihood estimate of θ is
θ̂ = (1/n) ∑_{i=1}^{n} y_i².
θ̂ = 9984.03/14 = 713.145
and the Relative Likelihood function is
R(θ) = L(θ)/L(θ̂) = [θ^{−14} exp(−9984.03/θ)] / [θ̂^{−14} exp(−9984.03/θ̂)] = (713.145/θ)^{14} exp(14 − 9984.03/θ), θ > 0.
For these data the observed value of the likelihood ratio test statistic for H0: θ = 509.3 is
d = −2 r(509.3) = −2 [l(509.3) − l(713.145)]
  = −2 [14 log(713.145/509.3) + 14 − 9984.03/509.3]
  = −2 (−0.8904)
  = 1.7807
and
p value = P(D ≥ 1.7807; H0)
  ≈ P(W ≥ 1.7807) where W ∼ χ²(1)
  = P(|Z| ≥ √1.7807) where Z ∼ N(0, 1)
  = 2[1 − P(Z ≤ 1.33)]
  = 2(1 − 0.9082) = 2(0.0918)
  = 0.1836
Since the p value ≈ 0.18 > 0.1, there is no evidence based on the data against H0: θ = 509.3.
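A minimal Python sketch (not part of the original solution) reproducing the maximum likelihood estimate and the likelihood ratio test above; variable names are illustrative only.

    import math
    from scipy.stats import chi2

    n = 14
    sum_y2 = 9984.03                   # sum of y_i^2 for the 14 windspeeds
    theta_hat = sum_y2 / n             # MLE, approximately 713.1

    def log_lik(theta):
        # log L(theta), ignoring the additive constant sum(log(2*y_i))
        return -n * math.log(theta) - sum_y2 / theta

    theta0 = 509.3                     # value of theta under H0 (mean 20 km/h)
    d = -2 * (log_lik(theta0) - log_lik(theta_hat))   # approximately 1.78
    p_value = chi2.sf(d, df=1)                        # approximately 0.18
    print(theta_hat, d, p_value)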
Since
0.95 = P(15.31 ≤ (2/θ) ∑_{i=1}^{n} Y_i² ≤ 44.46)
     = P(2 ∑_{i=1}^{n} Y_i² / 44.46 ≤ θ ≤ 2 ∑_{i=1}^{n} Y_i² / 15.31)
an exact 95% confidence interval for θ is
[2(9984.03)/44.46, 2(9984.03)/15.31] = [449.1, 1304.3]
which corresponds to mean windspeeds between √(π(449.1))/2 ≈ 18.8 km/h and √(π(1304.3))/2 ≈ 32.0 km/h.
Since the values of this interval are all above 16, the data seem to suggest a mean windspeed greater than 16 km/h. However we don't know how the data were collected. It would be
wise to determine how the data were collected before reaching a conclusion. Suppose that
Windy Hill is only windy at one particular time of the year and that the data were collected
only during the windy period. We would not want to make a decision only based on these
data.
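A short Python sketch (not part of the original solution) that reproduces the exact 95% confidence interval for θ and converts it to a range of mean windspeeds using E(Y) = √(πθ)/2.

    import math
    from scipy.stats import chi2

    n, sum_y2 = 14, 9984.03
    df = 2 * n                                     # (2/theta)*sum(Y_i^2) ~ chi-squared(28)

    a = chi2.ppf(0.025, df)                        # approximately 15.31
    b = chi2.ppf(0.975, df)                        # approximately 44.46

    ci_theta = (2 * sum_y2 / b, 2 * sum_y2 / a)    # approximately (449, 1304)
    ci_mean = tuple(math.sqrt(math.pi * th) / 2 for th in ci_theta)
    print(ci_theta, ci_mean)                       # mean roughly 18.8 to 32.0 km/h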
3: [14]
(a) [2] A suitable study population would consist of individuals who have volunteered
to partake in clinical trials.
The parameter μ corresponds to the mean difference in antibiotic blood serum level between drugs A and B in the study population.
The parameter σ corresponds to the standard deviation of the differences in antibiotic blood serum level between drugs A and B in the study population.
(b) [5] To test the hypothesis of no difference in the mean response for the two drugs, that is, H0: μ = 0 we use the discrepancy measure
D = |Ȳ − 0| / (S/√10)
where
T = (Ȳ − 0) / (S/√10) ∼ t(9) assuming H0: μ = 0 is true
and
S² = (1/9) ∑_{i=1}^{10} (Y_i − Ȳ)².
Since ȳ = −0.14/10 = −0.014 and
s = (2.90484/9)^{1/2} = (0.32276)^{1/2} = 0.5681
the observed value of D is
d = |−0.014 − 0| / (0.5681/√10) = 0.078
and
p value = P(D ≥ 0.078; H0) = P(|T| ≥ 0.078) where T ∼ t(9) = 2[1 − P(T ≤ 0.078)], which is close to 1.
Therefore there is no evidence based on the data against the hypothesis of no difference in the mean response for the two drugs, that is, H0: μ = 0.
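A minimal Python sketch (not part of the original solution) reproducing this paired-comparison test from the summary statistics.

    import math
    from scipy.stats import t

    n = 10
    sum_d = -0.14                    # sum of the differences y_i = a_i - b_i
    ss_d = 2.90484                   # sum of (y_i - ybar)^2

    ybar = sum_d / n                 # -0.014
    s = math.sqrt(ss_d / (n - 1))    # approximately 0.5681

    d_obs = abs(ybar - 0) / (s / math.sqrt(n))     # approximately 0.078
    p_value = 2 * t.sf(d_obs, df=n - 1)            # close to 1
    print(d_obs, p_value)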
(c) [3] From t tables P(T ≤ 2.26) = 0.975 where T ∼ t(9). A 95% confidence interval for μ based on these data is
ȳ ± 2.26(s)/√10 = −0.014 ± 2.26(0.5681)/√10 = [−0.4200, 0.3920].
(d) [2] Since this experimental study was conducted as a matched pairs study, an analysis of the differences, yi = ai − bi, allows for a more precise comparison since differences between the 10 pairs have been eliminated. That is, by analysing the differences we do not need to worry that there may have been large differences in the responses between subjects due to other variates such as age, general health, etc.
(e) [2] It is important to randomize the order of the drugs in case the order in which the drugs are taken affects the outcome.
It is important to give the drugs in identical tablet form so the subject does not know which drug he or she is taking since knowing which drug is being taken could affect the outcome.
It is important that the drugs be administered one day apart to ensure that the effects of one drug are gone before the second drug is given.
4: [13]
(a) [2] The study population would consist of light-duty engines produced by Manufacturer A and Manufacturer B. The parameter μ1 corresponds to the mean amount of CO emitted by light-duty engines produced by Manufacturer A.
The parameter μ2 corresponds to the mean amount of CO emitted by light-duty engines produced by Manufacturer B.
The parameter σ corresponds to the standard deviation of the CO emissions from light-duty engines produced by Manufacturers A and B. (Note that it has been assumed that this standard deviation is the same for both manufacturers.)
(b) [4] From t tables P(T ≤ 2.83) = 0.995 where T ∼ t(21). For these data
s = [(166.9860 + 218.7656)/21]^{1/2} = (18.3691)^{1/2} = 4.2860
and a 99% confidence interval for μ1 − μ2 is
ȳ1 − ȳ2 ± 2.83 s √(1/11 + 1/12) = −3.1857 ± 5.063 = [−8.25, 1.88].
(c) [5] To test the hypothesis of no difference in the mean CO emissions for the two manufacturers, that is, H0: μ1 = μ2 we use the discrepancy measure
D = |Ȳ1 − Ȳ2 − 0| / (S √(1/11 + 1/12))
where
T = (Ȳ1 − Ȳ2 − 0) / (S √(1/11 + 1/12)) ∼ t(21) assuming H0: μ1 = μ2 is true
and
S² = (1/21) [∑_{j=1}^{11} (Y_{1j} − Ȳ1)² + ∑_{j=1}^{12} (Y_{2j} − Ȳ2)²].
Since
ȳ1 − ȳ2 = 90.22/11 − 136.65/12 = −3.1857
and s = 4.2860 the observed value of D is
d = |−3.1857 − 0| / (4.2860 √(1/11 + 1/12)) = 1.7806
and
p value = P(D ≥ 1.7806; H0) = P(|T| ≥ 1.7806) where T ∼ t(21) = 2[1 − P(T ≤ 1.7806)]
which lies between 0.05 and 0.10, and therefore there is weak evidence based on the data against the hypothesis of no difference in the mean CO emissions for the two manufacturers, that is, H0: μ1 = μ2.
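For comparison, here is a short Python sketch (not part of the original solution) of the pooled two-sample test using the summary statistics.

    import math
    from scipy.stats import t

    n1, n2 = 11, 12
    ybar1 = 90.22 / n1                    # Manufacturer A sample mean
    ybar2 = 136.65 / n2                   # Manufacturer B sample mean
    ss1, ss2 = 166.9860, 218.7656         # within-sample sums of squares

    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))                  # approximately 4.286
    d_obs = abs(ybar1 - ybar2) / (sp * math.sqrt(1/n1 + 1/n2))   # approximately 1.78
    p_value = 2 * t.sf(d_obs, df=n1 + n2 - 2)                    # approximately 0.089
    print(d_obs, p_value)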
(d) [2] Although there is weak evidence of a difference between the mean CO emissions for the two manufacturers it is difficult to draw much of a conclusion. The sample sizes n1 = 11 and n2 = 12 are small. We also don't know whether the engines were chosen at random from the two manufacturers' production over a day, week, or month. In other words we don't know if the samples are representative of all light-duty engines produced by these manufacturers.
5: [9]
(a) [2] This is an observational study because no explanatory variates were manipulated
by the researcher.
(b) [5] Denote the frequencies as F1, F2, F3, F4 with observed values f1 = 77, f2 = 404, f3 = 16 and f4 = 122. Denote the expected frequencies as E1, E2, E3, E4. If the hypothesis of no relationship (independence) between the two variates, gender and whether or not the driver drank alcohol in the last 2 hours, is true then the expected frequency for the outcome male and drank alcohol in the last 2 hours for the given data is
e1 = (93)(481)/619 = 72.27.
The other expected frequencies e2, e3, e4 can be obtained by subtraction from the appropriate row or column total. The expected frequencies are given in brackets in the table below.
To test the hypothesis of no relationship we use the discrepancy measure (a random variable)
D = ∑_{i=1}^{4} (F_i − E_i)² / E_i
or
D = 619 [(77)(122) − (16)(404)]² / [(481)(138)(93)(526)] = 1.6366.
Since the expected frequencies are all greater than 5, D has approximately a χ²(1) distribution.
Thus
p value = P(D ≥ d; H0) = P(D ≥ 1.6366; H0)
  ≈ P(W ≥ 1.6366) where W ∼ χ²(1)
  = P(|Z| ≥ √1.6366) where Z ∼ N(0, 1)
  = 2[1 − P(Z ≤ 1.28)]
  = 2(1 − 0.8997)
  = 0.2006
Since the p value = 0.2006 > 0.1 we would conclude that there is no evidence against the hypothesis of no relationship between the two variates, gender and whether or not the driver drank alcohol in the last 2 hours.
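A minimal Python sketch (not part of the original solution) that recomputes the expected frequencies and the test statistic for this 2x2 table.

    from scipy.stats import chi2

    # Observed counts: rows = drank alcohol in last 2 hours (yes/no),
    # columns = gender (male/female), taken from the solution above.
    f = [[77, 404],
         [16, 122]]

    row = [sum(r) for r in f]                        # [481, 138]
    col = [f[0][j] + f[1][j] for j in range(2)]      # [93, 526]
    total = sum(row)                                 # 619

    d = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / total              # expected frequency under independence
            d += (f[i][j] - e) ** 2 / e

    p_value = chi2.sf(d, df=1)                       # (2-1)*(2-1) = 1 degree of freedom
    print(d, p_value)                                # approximately 1.64 and 0.20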
(c) [2] Although there is no evidence against the hypothesis of no relationship between the two variates (gender and whether or not the driver drank alcohol in the last 2 hours) based on the data, we cannot conclude that there is no relationship. Moreover, since this is an observational study, whether a causal relationship exists or not cannot be determined from the data alone. A decision to strike down the law based on these data alone is unwise.
6: The Survey of Study Habits and Attitudes (SSHA) is a psychological test that evaluates university students' motivation, study habits, and attitudes toward university. At a small university college 19 students are selected at random and given the SSHA test. Their scores are:
10 10 11 12 13 13 13 14 14 14
14 15 15 15 16 16 17 18 20
∑_{i=1}^{19} y_i = 270 and ∑_{i=1}^{19} y_i² = 3956.
For these data calculate the mean, median, mode, sample variance, range, and interquartile range.
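Since no worked answer appears here, the following Python sketch (not part of the original notes) computes the requested summary statistics; note that different quartile conventions give slightly different interquartile ranges.

    import statistics as st

    scores = [10, 10, 11, 12, 13, 13, 13, 14, 14, 14,
              14, 15, 15, 15, 16, 16, 17, 18, 20]

    mean = st.mean(scores)             # 270/19, approximately 14.2
    median = st.median(scores)         # 14
    mode = st.mode(scores)             # 14
    var = st.variance(scores)          # sample variance (divisor n - 1)
    rng = max(scores) - min(scores)    # 20 - 10 = 10

    q = st.quantiles(scores, n=4)      # one common quartile convention
    iqr = q[2] - q[0]
    print(mean, median, mode, var, rng, iqr)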
7: A data set consisting of six columns of data was collected by interviewing 100 students on the University of Waterloo campus. The columns are:
Column 1: Sex of respondent
Column 2: Age of respondent
Column 3: Weight of respondent
Column 4: Faculty of respondent
Column 5: Number of courses respondent has failed.
Column 6: Whether the respondent (i) strongly disagreed, (ii) disagreed, (iii) agreed or (iv) strongly agreed with the statement “The University of Waterloo is the best university in Ontario.”
(a) For this data set give an example of each of the following types of data;
discrete number of courses failed
continuous weight or age
categorical faculty or sex
binary sex
ordinal degree of agreement with statement
(b) Two ways to graphically represent categorical data are pie charts and bar charts.
(c) A graphical way to examine the relationship between heights and weights is a
scatterplot .
(d) If the sample correlation between heights and weights was 0.4 you would conclude
that there is a positive linear relationship between heights and weights.
APPENDIX C: DATA
Here we list the data for Example 1.5.2. In the file ch1example152.txt, there are three columns labelled hour, machine and volume. The data are (H=hour, M=Machine, V=Volume):
H M V H M V H M V H M V
1 1 357.8 11 1 357 21 1 356.5 31 1 357.7
1 2 358.7 11 2 359.6 21 2 357.3 31 2 357
2 1 356.6 12 1 357.1 22 1 356.9 32 1 356.3
2 2 358.5 12 2 357.6 22 2 356.7 32 2 357.8
3 1 357.1 13 1 356.3 23 1 357.5 33 1 356.6
3 2 357.9 13 2 358.1 23 2 356.9 33 2 357.5
4 1 357.3 14 1 356.3 24 1 356.9 34 1 356.7
4 2 358.2 14 2 356.9 24 2 357.1 34 2 356.5
5 1 356.7 15 1 356 25 1 356.9 35 1 356.8
5 2 358 15 2 356.4 25 2 356.4 35 2 357.6
6 1 356.8 16 1 357 26 1 356.4 36 1 356.6
6 2 359.1 16 2 357.5 26 2 357.5 36 2 357.2
7 1 357 17 1 357.5 27 1 356.5 37 1 356.6
7 2 357.5 17 2 357.2 27 2 357 37 2 357.6
8 1 356 18 1 355.9 28 1 356.5 38 1 356.7
8 2 356.4 18 2 357.1 28 2 358.1 38 2 356.9
9 1 355.9 19 1 356.5 29 1 357.6 39 1 356.8
9 2 357.9 19 2 358.2 29 2 357.6 39 2 357.2
10 1 357.8 20 1 355.8 30 1 357.5 40 1 356.1
10 2 358.5 20 2 359 30 2 356.4 40 2 356.4
Times (in minutes) between 300 eruptions of the Old Faithful geyser
between 1/08/85 and 15/08/85
80 71 57 80 75 77 60 86 77 56 81 50 89 54 90 73 60 83 65 82 84 54 85 58 79 57 88 68 76
78 74 85 75 65 76 58 91 50 87 48 93 54 86 53 78 52 83 60 87 49 80 60 92 43 89 60 84 69 74
71 108 50 77 57 80 61 82 48 81 73 62 79 54 80 73 81 62 81 71 79 81 74 59 81 66 87 53 80 50
87 51 82 58 81 49 92 50 88 62 93 56 89 51 79 58 82 52 88 52 78 69 75 77 53 80 55 87 53 85
61 93 54 76 80 81 59 86 78 71 77 76 94 75 50 83 82 72 77 75 65 79 72 78 77 79 75 78 64 80
49 88 54 85 51 96 50 80 78 81 72 75 78 87 69 55 83 49 82 57 84 57 84 73 78 57 79 57 90 62
87 78 52 98 48 78 79 65 84 50 83 60 80 50 88 50 84 74 76 65 89 49 88 51 78 85 65 75 77 69
92 68 87 61 81 55 93 53 84 70 73 93 50 87 77 74 72 82 74 80 49 91 53 86 49 79 89 87 76 59
80 89 45 93 72 71 54 79 74 65 78 57 87 72 84 47 84 57 87 68 86 75 73 53 82 93 77 54 96 48
89 63 84 76 62 83 50 85 78 78 81 78 76 74 81 66 84 48 93 47 87 51 78 54 87 52 85 58 88 79
Skinfold BodyDensity
1.6841 1.0613 1.9200 1.0338 1.5324 1.0696 2.0755 1.0355
1.9639 1.0478 1.6736 1.0560 1.7035 1.0449 1.4351 1.0693
1.0803 1.0854 1.7914 1.0487 1.8040 1.0411 1.7295 1.0518
1.7541 1.0629 1.7249 1.0496 1.8075 1.0426 1.5265 1.0837
1.6368 1.0652 1.5025 1.0824 1.3815 1.0715 1.7599 1.0328
1.2857 1.0813 1.6314 1.0526 1.5847 1.0602 1.4029 1.0933
1.4744 1.0683 1.3980 1.0707 1.3059 1.0807 1.2653 1.0860
1.6420 1.0575 1.7598 1.0459 1.3276 1.0536 1.2609 1.0919
2.3406 1.0126 1.3203 1.0697 1.5665 1.0602 1.6734 1.0433
2.1659 1.0264 1.3372 1.0770 1.8989 1.0536 1.5297 1.0614
1.2766 1.0829 1.3932 1.0727 1.4018 1.0655 1.5257 1.0643
2.2232 1.0296 0.9323 1.1171 1.6482 1.0668 1.8744 1.0482
1.7246 1.0670 1.8785 1.0423 1.5193 1.0700 1.6310 1.0459
1.5544 1.0688 1.6382 1.0506 1.8092 1.0485 1.6107 1.0653
1.7223 1.0525 1.4050 1.0878 1.3329 1.0804 1.9108 1.0321
1.5237 1.0721 1.8638 1.0557 1.5750 1.0503 1.3943 1.0755
1.5412 1.0672 1.1985 1.0854 1.6873 1.0557 1.7184 1.0600
1.8896 1.0350 1.5459 1.0527 1.8056 1.0625 1.7483 1.0554
1.8722 1.0528 1.5159 1.0635 1.9014 1.0438 1.5154 1.0765
1.8740 1.0473 1.6369 1.0583 1.5866 1.0632 1.6146 1.0696
1.7130 1.0560 1.6355 1.0621 1.2460 1.0782 1.3163 1.0744
1.3073 1.0848 1.3813 1.0736 1.4077 1.0739 1.3202 1.0818
1.7229 1.0564 1.5615 1.0682 1.3388 1.0805 1.5906 1.0546