Applied statistics is more than data analysis, but it is easy to lose sight of the big
picture. David Cox and Christl Donnelly draw on decades of scientific experience to
describe usable principles for the successful application of statistics, showing how
good statistical strategy shapes every stage of an investigation. As one advances from
research or policy questions, to study design, through modelling and interpretation,
and finally to meaningful conclusions, this book will be a valuable guide. Over 100
illustrations from a wide variety of real applications make the conceptual points
concrete, illuminating and deepening understanding. This book is essential reading for
anyone who makes extensive use of statistical methods in their work.
Principles of Applied Statistics
D. R. COX
Nuffield College, Oxford
CHRISTL A. DONNELLY
MRC Centre for Outbreak Analysis and Modelling,
Imperial College London
Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town,
Singapore, São Paulo, Delhi, Tokyo, Mexico City
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9781107013599
© D. R. Cox and C. A. Donnelly 2011
A catalogue record for this publication is available from the British Library
Contents

Preface
2 Design of studies
2.1 Introduction
2.2 Unit of analysis
2.3 Types of study
2.4 Avoidance of systematic error
2.5 Control and estimation of random error
2.6 Scale of effort
2.7 Factorial principle
4 Principles of measurement
4.1 Criteria for measurements
4.2 Classification of measurements
4.3 Scale properties
4.4 Classification by purpose
4.5 Censoring
4.6 Derived variables
4.7 Latent variables
5 Preliminary analysis
5.1 Introduction
5.2 Data auditing
5.3 Data screening
5.4 Preliminary graphical analysis
5.5 Preliminary tabular analysis
5.6 More specialized measurement
5.7 Discussion
6 Model formulation
6.1 Preliminaries
6.2 Nature of probability models
6.3 Types of model
6.4 Interpretation of probability
6.5 Empirical models
9 Interpretation
9.1 Introduction
9.2 Statistical causality
9.3 Generality and specificity
9.4 Data-generating models
10 Epilogue
10.1 Historical development
10.2 Some strategic issues
10.3 Some tactical considerations
10.4 Conclusion
References
Index
1
Some general concepts
1.1 Preliminaries
This short chapter gives a general account of the issues to be discussed
in the book, namely those connected with situations in which appreciable
unexplained and haphazard variation is present. We outline in idealized
form the main phases of this kind of scientific investigation and the stages
of statistical analysis likely to be needed.
It would be arid to attempt a precise definition of statistical analysis
as contrasted with other forms of analysis. The need for statistical analy-
sis typically arises from the presence of unexplained and haphazard varia-
tion. Such variability may be some combination of natural variability and
measurement or other error. The former is potentially of intrinsic inter-
est whereas the latter is in principle just a nuisance, although it may need
careful consideration owing to its potential effect on the interpretation of
results.
Illustration: Variability and error The fact that features of biological or-
ganisms vary between nominally similar individuals may, as in studies
of inheritance, be a crucial part of the phenomenon being studied. That,
say, repeated measurements of the height of the same individual vary er-
ratically is not of intrinsic interest although it may under some circum-
stances need consideration. That measurements of the blood pressure of
a subject, apparently with stable health, vary over a few minutes, hours
or days typically arises from a combination of measurement error and
natural variation; the latter part of the variation, but not the former, may
be of direct interest for interpretation.
Furthermore, while the sequence from question to answer set out above
is in principle desirable, from the perspective of the individual investiga-
tor, and in particular the individual statistician, the actual sequence may
be very different. For example, an individual research worker destined to
analyse the data may enter an investigation only at the analysis phase. It
will then be important to identify the key features of design and data col-
lection actually employed, since these may have an important impact on
the methods of analysis needed. It may be difficult to ascertain retrospec-
tively aspects that were in fact critical but were not initially recognized
as such. For example, departures from the design protocol of the study
may have occurred and it is very desirable to detect these in order to avoid
misinterpretation.
Where a method of analysis has been specified in advance it may be obligatory to follow and report that analysis, but this should not preclude
alternative analyses if these are clearly more appropriate in the light of the
data actually obtained.
In some contexts it is reasonable not only to specify in advance the
method of analysis in some detail but also to be hopeful that the proposed
method will be satisfactory with at most minor modification. Past experi-
ence of similar investigations may well justify this.
In other situations, however, while the broad approach to analysis should
be set out in advance, if only as an assurance that analysis and interpreta-
tion will be possible, it is unrealistic and indeed potentially dangerous to
follow an initial plan unswervingly. Experience in collecting the data, and
the data themselves, may suggest changes in specification such as a trans-
formation of variables before detailed analysis. More importantly, it may be
a crucial part of the analysis to clarify the research objectives; these should
be guided, in part at least, by the initial phases of analysis. A distinction
is sometimes drawn between pre-set analyses, called confirmatory, and ex-
ploratory analyses, but in many fields virtually all analyses have elements
of both aspects.
Especially in major studies in which follow-up investigations are not
feasible or will take a long time to complete, it will be wise to list the pos-
sible data configurations, likely or unlikely, that might arise and to check
that data will be available for the interpretation of unanticipated effects.
A study is observational if, although the choice of individuals for study and
of measurements to be obtained may be made by the investigator, key ele-
ments have to be accepted as they already exist and cannot be manipulated
by the investigator. It often, however, aids the interpretation of an obser-
vational study to consider the question: what would have been done in a
comparable experiment?
Data auditing and screening, which should take place as soon as possible
after data collection, include inspection for anomalous values as well as for
internal inconsistencies. Other relatively common sources of concern are
sticking instruments, repeatedly returning the same value, for example zero
rainfall, as well as the confusion of zero values and missing or irrelevant
values. Sometimes, especially when extensive data are being collected in a
novel context, formal auditing of the whole process of data collection and
entry may be appropriate. Typically this will involve detailed study of all
aspects of a sample of study individuals, and the application of ideas from
sampling theory and industrial inspection may be valuable.
1.10 Prediction
Most of this book is devoted to the use of statistical methods to analyse
and interpret data with the object of enhancing understanding. There is
sometimes a somewhat different aim, that of empirical prediction. We take
that to mean the prediction of as yet unobserved features of new study
individuals, where the criterion of success is close agreement between the
prediction and the ultimately realized value. Obvious examples are time se-
ries forecasting, for example of sales of a product or of the occurrence and
amount of precipitation, and so on. In discussing these the interpretation of
the parameters in any model fitted to the data is judged irrelevant, and the
choice between equally well-fitting models may be based on convenience
or cost.
Such examples are a special case of decision problems, in particular
problems of a repetitive nature, such as in industrial inspection where each
unit of production has to be accepted or not. The assessment of any pre-
diction method has to be judged by its empirical success. In principle this
should be based on success with data independent of those used to set up
the prediction method. If the same data are used directly for both purposes,
the assessment is likely to be misleadingly optimistic, quite possibly seri-
ously so.
There are some broader issues involved. Many investigations have some
form of prediction as an ultimate aim, for example whether, for a particular
patient or patient group, the use of such-and-such a surgical procedure will
improve survival and health-related quality of life. Yet the primary focus of
discussion in the present book is on obtaining and understanding relevant
data. The ultimate use of the conclusions from an investigation has to be
borne in mind but typically will not be the immediate focus of the analysis.
Even in situations with a clear predictive objective, the question may
arise whether the direct study of predictions should be preceded by a more
analytical investigation of the usefulness of the latter. Is it better for short-
term economic forecasting to be based on elaborate models relying on
possibly suspect economic theory or directly on simple extrapolation?
1.11 Synthesis
It is reasonable to ask: what are the principles of applied statistics? The
difficulty in giving a simple answer stems from the tremendous variety,
both in subject-matter and level of statistical involvement, of applications
of statistical ideas. Nevertheless, the following aspects of applied statistics
are of wide importance:
Notes
Detailed references for most of the material are given in later chap-
ters. Broad introductory accounts of scientific research from respectively
physical and biological perspectives are given by Wilson (1952) and by
Beveridge (1950). For a brief introduction to the formal field called the phi-
losophy of science, see Chalmers (1999). Any direct impact of explicitly
philosophical aspects seems to be largely confined to the social sciences.
Mayo (1996) emphasized the role of severe statistical tests in justifying
scientific conclusions. For an introduction to data mining, see Hand et al.
(2001). Box (1976) and Chatfield (1998) gave general discussions of the
role of statistical methods.
2
Design of studies
This, the first of two chapters on design issues, describes the common
features of, and distinctions between, observational and experimental
investigations. The main types of observational study, cross-sectional,
prospective and retrospective, are presented and simple features of
experimental design outlined.
2.1 Introduction
In principle an investigation begins with the formulation of a research ques-
tion or questions, or sometimes more specifically a research hypothesis.
In practice, clarification of the issues to be addressed is likely to evolve
during the design phase, especially when rather new or complex ideas
are involved. Research questions may arise from a need to clarify and
extend previous work in a field or to test theoretical predictions, or they
may stem from a matter of public policy or other decision-making con-
cern. In the latter type of application the primary feature tends to be to
establish directly relevant conclusions, in as objective a way as possible.
Does culling wildlife reduce disease incidence in farm animals? Does a
particular medical procedure decrease the chance of heart disease? These
are examples of precisely posed questions. In other contexts the objective
may be primarily to gain understanding of the underlying processes. While
the specific objectives of each individual study always need careful con-
sideration, we aim to present ideas in as generally applicable a form as
possible.
We describe a number of distinct types of study, each raising rather dif-
ferent needs for analysis and interpretation. These range from the sampling
of a static population in order to determine its properties to a controlled ex-
periment involving a complex mixture of conditions studied over time. In
There are two further aspects, very important in some contexts, but
whose applicability is a bit more limited. These are:
• ‘Explaining the unexpected’ (p. 6). If new data can be obtained relatively
speedily, this aspect of planning is of less concern.
care. Outcome data are then collected for each family. The primary unit
of analysis is the community, with the implication that outcome mea-
sures should be based in some way on the aggregate properties of each
community and a comparison of policies made by comparing commu-
nities randomized to different policies. This does not exclude a further
analysis in which distinct families within the same community are com-
pared, but the status of such comparisons is different and receives no
support from the randomization.
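The implication can be made concrete with a small simulation. The sketch below is illustrative only: the numbers of communities and families, the assumed policy effect and all variable names are invented for the example. The point is simply that the policy comparison and its standard error are computed from community-level summaries, not from the individual families.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical set-up: 20 communities randomized between two policies,
# with a varying number of families observed within each community.
n_communities = 20
policy = rng.permutation(np.repeat([0, 1], n_communities // 2))
community_effect = rng.normal(0.0, 0.5, n_communities)   # community-level variation

community_means = []
for c in range(n_communities):
    n_families = rng.integers(15, 40)
    # Family-level outcomes; the assumed policy effect (0.3) acts at community level.
    y = 0.3 * policy[c] + community_effect[c] + rng.normal(0.0, 1.0, n_families)
    community_means.append(y.mean())
community_means = np.array(community_means)

# The community is the unit of analysis: the policy contrast and its standard
# error are based on the 20 community means, not on the individual families.
m1 = community_means[policy == 1]
m0 = community_means[policy == 0]
diff = m1.mean() - m0.mean()
se = np.sqrt(m1.var(ddof=1) / len(m1) + m0.var(ddof=1) / len(m0))
print(f"estimated policy effect {diff:.2f} (SE {se:.2f}, {n_communities} communities)")
```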
(a)
Day        1  2  3  4  5  6  7  8
morning    T  T  T  T  T  T  T  T
afternoon  C  C  C  C  C  C  C  C

(b)
Day        1  2  3  4  5  6  7  8
morning    T  T  T  C  T  T  C  T
afternoon  C  C  C  T  C  C  T  C

(c)
Day        1  2  3  4  5  6  7  8
morning    T  T  C  T  C  T  C  C
afternoon  C  C  T  C  T  C  T  T
y_{ij} = µ + τ a_{ij} + δ b_j + ε_{ij},    (2.1)
Illustration: Comparing like with like The use of twins in a paired design
for some kinds of animal experiment may greatly enhance precision by
ensuring that conclusions are based largely on comparisons between the
twins in a pair, according to the second principle for precision enhance-
ment. The general disadvantage of using artificially uniform material
is the possibility that the conclusions do not extend to more general
situations.
Block 1:   p    nk   1    k    npk  n    pk   np
Block 2:   npk  n    nk   np   1    pk   p    k
Block 3:   n    k    nk   pk   1    p    npk  np
• Eight plots are used to test N, eight to test P and eight to test K. In the
plots used to test, say, N, the other two factors, P and K, are assigned
at the low levels.
• The eight combinations of N, P and K at the two possible levels are
each tested three times.
The second possibility, which may be described as three replicates of the
2^3 factorial system, is illustrated in randomized block form in Table 2.2.
There are two reasons for preferring the factorial arrangement in this
context. One is that if, say, P and K have no effect (or more generally
have a simple additive effect) then the estimated effect of N can be based
on all 24 plots rather than merely eight plots and hence appreciably
higher precision is achieved. More importantly in some ways, if there
were to be a departure from additivity of effect there is some possibility
of detecting this. An extreme instance would occur if any increase in
yield requires all three factors to be at their higher level.
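A small simulation may help fix the idea of the precision gain. The sketch below is illustrative only, with invented effect sizes and error variance rather than anything taken from the text; each main effect in the factorial arrangement is estimated from all 24 plots, and a simple interaction contrast uses the same plots.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Three replicates of the 2^3 factorial in N, P and K: 24 plots in all.
combos = np.array(list(product([0, 1], repeat=3)))
design = np.tile(combos, (3, 1))
n, p, k = design.T

# Hypothetical additive yields (illustrative effect sizes).
y = 5.0 + 1.0 * n + 0.3 * p + 0.0 * k + rng.normal(0.0, 0.5, len(design))

# In the factorial arrangement each main effect is estimated from all 24 plots,
# twelve at the high level against twelve at the low level.
for name, factor in (("N", n), ("P", p), ("K", k)):
    est = y[factor == 1].mean() - y[factor == 0].mean()
    print(f"estimated main effect of {name}: {est:.2f}")

# A departure from additivity can also be examined, e.g. the N x P interaction.
nxp = (y[(n == 1) & (p == 1)].mean() - y[(n == 1) & (p == 0)].mean()
       - y[(n == 0) & (p == 1)].mean() + y[(n == 0) & (p == 0)].mean())
print(f"estimated N x P interaction: {nxp:.2f}")
```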
While special designs have their uses, complex arrangements can have major dis-
advantages, in particular the difficulties of administering an intricate de-
sign and of slowness in obtaining complete results for analysis. Finally, we
emphasize that the central principles of design apply as much to simple
experiments as to complex ones.
Notes
Section 2.3. The terminology used for these different types of study varies
between fields of work.
3
Special types of study
3.1 Preliminaries
We now discuss in a little more detail the main types of study listed in
Section 2.3. The distinctions between them are important, notably the con-
trast between observational and experimental investigations. Nevertheless
the broad objectives set out in the previous chapter are largely common to
all types of study.
The simplest investigations involve the sampling of explicit populations,
and we discuss these first. Such methods are widely used by government
agencies to estimate population characteristics but the ideas apply much
more generally. Thus, sampling techniques are often used within other
types of work. For example the quality, rather than quantity, of crops in
an agricultural field trial might be assessed partly by chemical analysis of
small samples of material taken from each plot or even from a sub-set of
plots.
By contrast, the techniques of experimental design are concentrated on
achieving secure conclusions, sometimes in relatively complicated situa-
tions, but in contexts where the investigator has control over the main fea-
tures of the system under study. We discuss these as our second theme in
this chapter, partly because they provide a basis that should be emulated in
observational studies.
[Figure: schematic of sampling at equal intervals h of cumulative value, measured from the sampling origin.]
See Figure 3.2. In all these cases the analysis must take account of the
particular sampling method used.
3.3 Experiments
3.3.1 Primary formulation
The experiments to be considered here are comparative in the sense that
their objective is to investigate the difference between the effects produced
by different conditions or exposures, that is, treatments. The general for-
mulation used for discussing the design of such experiments is as follows.
Experimental units and treatments are chosen. An experiment in its sim-
plest form consists of assigning one treatment to each experimental unit
and observing one or more responses. The objective is to estimate the dif-
ferences between the treatments in their effect on the response.
The formal definition of an experimental unit is that it is the smallest
subdivision of the experimental material such that any two different units
may experience different treatments.
T60  T50  T70  T70  T60  T50  T50  T70  T60
It can be shown that for this particular design a linear trend in time
has no effect on treatment comparison and that a quadratic trend in time
has no effect on the comparison of T70 with T50; for other comparisons
of treatments there is some mixing of effects, which may be resolved by
a least-squares analysis.
Source DF
mean 1
blocks b−1
treatments t−1
residual (b − 1)(t − 1)
y_{is} = ȳ_{..} + (ȳ_{i.} − ȳ_{..}) + (ȳ_{.s} − ȳ_{..}) + (y_{is} − ȳ_{i.} − ȳ_{.s} + ȳ_{..}),    (3.6)
[Table 3.3: the extended sum-of-squares decomposition, with columns Source, DF and SS.]
It follows, on squaring (3.6) and summing over all i and s, noting that
all cross-product terms vanish, that we can supplement the decomposition
of comparisons listed above as shown in Table 3.3.
Readers familiar with standard accounts of analysis of variance should
note that the sums are over both suffices, so that, for example, Σ_{i,s} ȳ²_{..} = bt ȳ²_{..}.
At this point the sum of squares decomposition (Table 3.3) is a sim-
ple identity with a vague qualitative interpretation; it has no probabilistic
content.
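The identity can be checked numerically. The following sketch, with an arbitrary simulated block-by-treatment table, forms the four components of (3.6), squares and sums them over both suffices and confirms that they add to the total sum of squares; like the identity itself it carries no probabilistic content, and the table dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
b, t = 4, 3                               # b blocks, t treatments (illustrative sizes)
y = rng.normal(10.0, 1.0, (b, t))         # y[i, s]: observation in block i under treatment s

grand = y.mean()                                   # overall mean
block_means = y.mean(axis=1, keepdims=True)        # block means
treat_means = y.mean(axis=0, keepdims=True)        # treatment means
residual = y - block_means - treat_means + grand   # residual term of (3.6)

# Sums of squares, each summed over both suffices i and s.
ss_mean   = b * t * grand ** 2                     # DF 1
ss_blocks = (t * (block_means - grand) ** 2).sum() # DF b - 1
ss_treat  = (b * (treat_means - grand) ** 2).sum() # DF t - 1
ss_resid  = (residual ** 2).sum()                  # DF (b - 1)(t - 1)

# The decomposition is an algebraic identity: the components add to the total.
total = (y ** 2).sum()
assert np.isclose(total, ss_mean + ss_blocks + ss_treat + ss_resid)
print(ss_mean, ss_blocks, ss_treat, ss_resid)
```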
The simplest form of statistical analysis is based on the assumption that
a random variable corresponding to y_{is} can be written in the symmetrical
form
Period 1 2 3 4
Square 1
subject 1 B C A D
subject 2 A D C B
subject 3 D A B C
subject 4 C B D A
Square 2
subject 5 D A C B
subject 6 C B D A
subject 7 B D A C
subject 8 A C B D
Source DF
mean 1
Latin squares 1
subjects within Latin squares 3+3
periods 3
periods × Latin squares 3
treatments 3
treatments × Latin squares 3
residual within Latin squares 6+6
The first subject in one section has, for example, no special connection with the first subject in the second section. Subjects
are said to be nested within sections. However, the treatments, and indeed
the periods, are meaningfully defined across sections; treatments are said
to be crossed with sections. Thus the variation associated with subjects is
combined into a between-subject within-section component. The treatment
variation is shown as a treatment main effect, essentially the average effect
across the two sections, and an interaction component, essentially showing
how the treatment effects differ between the two sections.
3.3.4 Developments
There are many developments of the above ideas. They call for more spe-
cialized discussion than we can give here. Some of the main themes are:
• development of fractional replication as a way of studying many factors
in a relatively small number of experimental runs;
• response-surface designs suitable when the factors have quantitative lev-
els and interest focuses on the shape of the functional relationship be-
tween expected response and the factor levels;
• designs for use when the effect of a treatment applied to an individual
in one period may carry over into the subsequent period, in which a
different primary treatment is used;
• extensions of the randomized block principle for use when the number
of units per block is smaller than the number of treatments (incomplete
block designs);
• special designs for the study of nonlinear dependencies; and
• designs for sequential studies when the treatment allocation at one phase
depends directly on outcomes in the immediate past in a pre-planned
way.
The data in the above illustration were obtained over an extended time
period. In a rather different type of cross-sectional study a series of in-
dependent cross-sectional samples of a population is taken at, say, yearly
intervals. The primary object here is to identify changes across time. More
precise comparisons across time would be obtained by using the same
study individuals at the different time points but that may be administra-
tively difficult; also, if a subsidiary objective is to estimate overall popula-
tion properties then independent sampling at the different time points may
be preferable.
Notes
Section 3.1. There are considerable, and largely disjoint, literatures on
sampling and on the design of experiments. The separateness of these lit-
eratures indicates the specialization of fields of application within mod-
ern statistical work; it is noteworthy that two pioneers, F. Yates and W. G.
Cochran, had strong interests in both topics. There are smaller literatures
on the design of observational analytical studies.
4
Principles of measurement
4.1 Criteria for measurements

Desirable criteria for a measurement procedure include:
• relevance;
• adequate precision;
• economy; and
• absence of distortion of the features studied.
The careful use of sampling may give higher quality than the study of a complete popula-
tion of individuals.
education and years for which the patient has had the disease in effect
define the individuals concerned. An interesting feature of this example
is that the biochemical test and the psychometric measurements were
obtained virtually simultaneously, so that the labelling of the biochem-
ical measure as a response to knowledge is a subject-matter working
hypothesis and may be false. That is, success at controlling the disease
might be an encouragement to learn more about it.
Variables representing the conditions or exposures whose effects are of direct concern are called primary explanatory variables. Conceptually, for each study in-
dividual a primary explanatory variable might have been different from the
choice actually realized.
A third kind of explanatory variable, a particular type of intrinsic vari-
able, may be called nonspecific. Examples are the names of countries in
an international study or the names of schools in an educational study in-
volving a number of schools. The reason for the term is that, depending of
course on context, a clear difference in some outcome between two coun-
tries might well have many different explanations, demographic, climate-
related, health-policy-related etc.
Other variables are intermediate in the sense that they are responses to
the primary explanatory variables and explanatory to the response vari-
ables.
At different stages of analysis the roles of different variables may
change as different research questions are considered or as understanding
evolves.
4.5 Censoring
When failure time data (or survival data) are collected, often some units of
analysis do not have fully observed time-to-fail data. Other kinds of data
too may have measurements constrained to lie in a particular range by the
nature of the measuring process. The most common occurrence is right
censoring, in which at the end of the period of data collection some units
of analysis have not yet failed and thus, at least theoretically, would fail at
some future point in time if observation were continued.
The next most common type of censoring is interval censoring, in which
the exact failure time is not observed but is known to have occurred within a
particular interval of time. Such data are called interval-censored or doubly
censored.
Finally, left censoring occurs when, for some units of observation, the
time to fail is known only to have occurred before a particular time (or
age).
Survival analysis is very widely used for right-censored data on the time
to death, the time to disease onset (such as in AIDS diagnosis) or the du-
ration of a health event such as a coma. Interval-censored data frequently
arise in clinical trials in which the outcome of interest is assessed only at
particular observation times; such data might be for example a biomarker
dropping below a particular value or the appearance of a medical feature
not readily recognized by the patient which might be assessed at a visit to
the healthcare provider.
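As an illustration of how right censoring enters an analysis, the sketch below computes the product-limit (Kaplan–Meier) estimate of the survivor function directly in numpy. The times and censoring indicators are invented for the example, and ties are avoided to keep the loop simple.

```python
import numpy as np

# Illustrative right-censored data: observed time and an event indicator
# (1 = failure observed, 0 = right-censored at that time); all times distinct.
time  = np.array([2.0, 3.5, 4.0, 5.0, 5.5, 6.0, 7.5, 8.0, 9.0, 10.0])
event = np.array([1,   1,   0,   1,   1,   0,   1,   0,   1,   0])

order = np.argsort(time)
time, event = time[order], event[order]

# Product-limit estimate: the survivor function drops only at observed failures,
# but censored individuals still reduce the number at risk thereafter.
n_at_risk = len(time)
surv = 1.0
for t_i, d_i in zip(time, event):
    if d_i == 1:
        surv *= 1.0 - 1.0 / n_at_risk
        print(f"S({t_i:g}) = {surv:.3f}   (number at risk {n_at_risk})")
    n_at_risk -= 1
```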
this overlap may not take place; then special precautions may be needed in
analysis.
P(Y = 1) = exp(α_l + β_l x) / {1 + exp(α_l + β_l x)}.    (4.2)
With these forms the relationships (4.2) and (4.3) between the observed
binary response and x are obtained. See Figure 4.2.
This representation has greater appeal if the latent tolerance variable Ξ
is in principle realizable.
One use of such a representation is to motivate more elaborate models.
For example, with an ordinal response it may be fruitful to use a number of
cut-off points of the form α_k + βx. Thus, for three cut-off levels an ordinal
response model with four levels would be defined; see Figure 4.3.
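A short simulation shows the construction. The cut-off values, the slope and the logistic form of the latent tolerance below are all assumed for illustration; the ordinal level is simply one plus the number of cut-offs of the form α_k + βx that the latent variable exceeds.

```python
import numpy as np

rng = np.random.default_rng(3)

beta = 1.5
alphas = np.array([-1.0, 0.0, 1.2])      # three cut-offs give four ordinal levels
x = rng.uniform(-2.0, 2.0, 500)

# Latent tolerance with a standard logistic distribution.
latent = rng.logistic(loc=0.0, scale=1.0, size=x.shape)

# Ordinal response: 1 + number of cut-offs alpha_k + beta * x exceeded.
cutpoints = alphas[None, :] + beta * x[:, None]
y = 1 + (latent[:, None] > cutpoints).sum(axis=1)

print("counts of levels 1-4:", np.bincount(y, minlength=5)[1:])
```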
Multivariate binary and similar responses can be related similarly to the
multivariate normal distribution, although the intractability of that distribu-
tion limits the approach to quite a small number of dimensions.
[Figure 4.2: distribution of the unobserved tolerance ξ relative to the ‘stimulus’ level α + βx; the areas on either side give P(Y = 1) and P(Y = 0).]

[Figure 4.3: distribution of a latent response cut at exposure-dependent thresholds to define the ordinal levels Y = 1, 2, 3, 4.]
[Figure 4.4: regression of Y on the true value Xt, with slope βt, compared with regression on the measured value Xm, where the slope is attenuated to βm and the relationship may appear curved.]
That is, the formula corresponding to (4.7) for a single explanatory vari-
able becomes more complicated and requires knowledge of the measure-
ment variances (and in principle covariances) for all the components.
Somewhat similar conclusions apply to other forms of regression, such
as that for binary responses.
In addition to the technical statistical assumptions made in correcting
for measurement error, there are two important conceptual issues. First, it
is sometimes argued that the relationship of interest is that with Xm , not
that with Xt , making the discussion of measurement error irrelevant for di-
rect purposes. This may be correct, for example for predictive purposes,
but only if the range of future values over which prediction is required is
broadly comparable with that used to fit the regression model. Note, for ex-
ample, that if in Figure 4.4(b) the values to be predicted lay predominantly
in the upper part of the range then the predicted values would be system-
atically biased downwards by use of the regression on Xm . Second, it is
important that the variation used to estimate σ2 and hence the correction
for attenuation should be an appropriate measure of error.
The above discussion applies to the classical error model, in which the
measurement error is statistically independent of the true value of X. There
is a complementary situation, involving the so-called Berkson error, in
which the error is independent of the measured value, that is,
X_t = X_m + ε*,    (4.8)

where ε* is independent of Xm, implying in particular that the true values
are more variable than the measured values (Reeves et al., 1998). There
are two quite different situations where this may happen, one experimental
and one observational. Both are represented schematically in Figure 4.5. In
the experimental context a number of levels of X are pre-set, correspond-
ing perhaps to equally spaced doses, temperatures etc. These are the mea-
sured values. The realized values deviate from the target levels by random
amounts, making the formulation (4.8) appropriate.
An illustration of an observational context for Berkson error is the fol-
lowing.
Figure 4.5 illustrates in the case of a linear regression why the impact
of Berkson error is so different from that of classical error: the regression of Y on the measured value Xm retains essentially the same slope as the regression on the true value Xt, so that there is no attenuation, although the variation about the fitted line is increased.
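The contrast is easy to demonstrate by simulation. In the sketch below, which uses invented parameter values, regression on a value measured with classical error attenuates the fitted slope towards zero, whereas under Berkson error the fitted slope is essentially unchanged and only the scatter about the line increases.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta, sigma_err = 10_000, 2.0, 1.0

def slope(x, y):
    return np.polyfit(x, y, 1)[0]   # least-squares slope of y on x

# Classical error: X_m = X_t + error, error independent of the true value.
x_t = rng.normal(0.0, 1.0, n)
x_m = x_t + rng.normal(0.0, sigma_err, n)
y = beta * x_t + rng.normal(0.0, 0.5, n)
print("classical error, fitted slope:", round(slope(x_m, y), 2))   # about beta/2 here

# Berkson error: X_t = X_m + error, error independent of the measured value.
x_m_b = rng.normal(0.0, 1.0, n)
x_t_b = x_m_b + rng.normal(0.0, sigma_err, n)
y_b = beta * x_t_b + rng.normal(0.0, 0.5, n)
print("Berkson error, fitted slope:  ", round(slope(x_m_b, y_b), 2))  # about beta, no attenuation
```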
Notes
Section 4.1. For a broad discussion of the principles of measurement see
Hand (2004). For a discussion of methods for assessing health-related qual-
ity of life, see Cox et al. (1992).
Section 4.2. For a brief account of content analysis, see Jackson (2008).
5
Preliminary analysis
5.1 Introduction
While it is always preferable to start with a thoughtful and systematic
exploration of any new set of data, pressure of time may tempt those
analysing such data to launch into the ‘interesting’ aspects straight away.
With complicated data, or even just complicated data collection processes,
this usually represents a false time economy as complications then come to
light only at a late stage. As a result, analyses have to be rerun and results
adjusted.
In this chapter we consider aspects of data auditing, data screening, data
cleaning and preliminary analysis. Much of this work can be described as
forms of data exploration, and as such can be regarded as belonging to a
continuum that includes, at the other extreme, complex statistical analy-
sis and modelling. Owing to the fundamental importance of data screening
and cleaning, guidance on ethical statistical practice, aimed perhaps par-
ticularly at official statisticians, has included the recommendation that the
data cleaning and screening procedures used should be reported in pub-
lications and testimony (American Statistical Association Committee on
Professional Ethics, 1999).
Of course the detailed nature of these preliminary procedures depends
on the context and, not least, on the novelty of what is involved. For ex-
ample, even the use of well-established measurement techniques in the
research laboratory needs some broad checks of quality control, but these
requirements take on much more urgency when novel methods are used
in the field by possibly inexperienced observers. Moreover, for large stud-
ies breaking new ground for the investigators, especially those requiring
the development of new standardized operating procedures (SOPs), some
form of pilot study is very desirable; data from this may or may not be in-
corporated into the data for final analysis. More generally any changes in
procedure that may be inevitable in the course of a long investigation need
careful documenting and may require special treatment in analysis. For ex-
ample, in certain kinds of study based on official statistics key definitions
may have changed over time.
Complex issues may well not be discovered until considerable analysis has
been undertaken.
The first thing to explore is the pedigree of the data. How were the data
collected? How were they entered into the current electronic dataset or
database? Were many people involved in data collection and/or data entry?
If so, were guidelines agreed and recorded? How were dates recorded and
entered? After all, 05/06/2003 means 5 June to some and May 6 to others.
A starting point in dealing with any set of data is a clear understanding
of the data coding structure and of any units associated with the variables.
Coding any missing values as −99 may not be confusing if the variable in
question is age in years. However, coding them as ‘Unk’ when the variable
denotes names of individuals, for example, of animals, may be confusing
particularly when many individuals have multiple entries in the dataset.
Thus, ‘Unk’ might be interpreted incorrectly as the name, albeit an unusual
one, of an individual.
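In practice such codes are best declared explicitly when the data are read, so that values like −99 and ‘Unk’ cannot leak into later calculations as apparent observations. The pandas sketch below is illustrative only; the file contents and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical raw file in which -99 and 'Unk' are codes for missing values.
raw = io.StringIO(
    "animal,age_years,weight_kg\n"
    "Unk,-99,12.4\n"
    "Bella,4,-99\n"
    "Bella,5,13.1\n"
)

# Declaring the codes per column at read-in turns them into proper missing values
# and avoids accidentally treating a legitimate value in another column as missing.
df = pd.read_csv(raw, na_values={"animal": ["Unk"], "age_years": [-99], "weight_kg": [-99]})
print(df)
print(df.describe())   # plausibility and out-of-range checks on the numeric columns
```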
Calculations of means, standard deviations, minima and maxima for all
the quantitative variables allow plausibility checks and out-of-range checks
to be made. An error could be as simple as a misinterpretation of units,
pounds being mistaken for kilograms, for example. Input and computa-
tional errors may become apparent at this stage.
It may help to consider multiple variables simultaneously when search-
ing for outliers (outlying individuals), the variables being selected if possi-
ble only after detailed consideration of the form in which the data were
originally recorded. For example, it may become clear that height and
weight variables were switched at data entry. Furthermore, from a plot
of two variables that are strongly related it may be easy to detect outly-
ing individuals that are not anomalous in either variable on its own. Such
individuals may have a major, often unwanted, effect on the study of rela-
tionships between variables.
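A minimal sketch of this kind of check is given below, with simulated heights and weights and one record made jointly inconsistent although unremarkable in each variable separately; the residuals from the regression of one variable on the other flag it, whereas single-variable range checks would not. The flagging threshold is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

height_cm = rng.normal(170.0, 10.0, 200)
weight_kg = 0.9 * (height_cm - 100.0) + rng.normal(0.0, 4.0, 200)

# One record is unremarkable in each variable separately but inconsistent
# as a pair (tall but very light), e.g. through a data-entry error.
height_cm[17], weight_kg[17] = 190.0, 48.0

# Flag large residuals from the regression of weight on height.
slope, intercept = np.polyfit(height_cm, weight_kg, 1)
resid = weight_kg - (intercept + slope * height_cm)
z = (resid - resid.mean()) / resid.std()
print("records flagged for checking:", np.flatnonzero(np.abs(z) > 4))
```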
Illustration: Outlier detection to detect foul play In some settings, the de-
tection of outliers may be one main purpose of the analysis. An example
of this is financial fraud detection based on the detection of anomalous
transaction data and/or user behaviour (Edge and Sampaio, 2009). Simi-
larly, possible intrusions into or misuses of computer systems can be de-
tected by analysing computer use patterns and assuming that instances
of computer misuse are both rare and different from patterns of legiti-
mate computer use (Wu and Banzhaf, 2010).
Illustration: Graphical depiction of complex data structure  Graphical
representation can be particularly helpful in understanding networks. A
study was undertaken to describe a cluster of HIV transmission in South
Wales (Knapper et al., 2008). The identification and notification of part-
ners following diagnosis of the index case allowed reconstruction of the
sexual network through which infection was spread. While a plot of the
One implication of these points is that there is at most a limited and spe-
cialized role for smoothing methods other than simple binning. The reason
is that smoothness which is artefactually generated is virtually impossible
to distinguish from real smoothness implicit in the data. Put differently, un-
dersmoothed data are easily smoothed by eye, or more formally, but over-
smoothed data are vulnerable to misinterpretation and can be unsmoothed
only, if at all, by delicate analysis.
A common method, in some epidemiological studies, of presenting the
analysis of a series of related investigations, sometimes called a forest plot,
is less than ideal in some respects. Usually the primary outcome plotted is
a log relative risk, that is, the log of the ratios of the estimated probabilities
of death for individuals exposed or unexposed to a particular risk factor.
Figure 5.3 shows a typical forest plot, in which each study provides an
estimate and a 95% confidence interval.
Imperfections in this are as follows:
• they invite the misinterpretation that two studies are mutually consistent
if and only if the confidence bands overlap;
• they misleadingly suggest that the studies in a set are mutually consistent
if and only if there is a single point intersecting all the study intervals
shown; and
• to obtain the standard error and confidence intervals for comparison of
pairs or groups of studies is not as direct as it might be.
Perhaps the most difficult aspect to assess is that a large amount of pre-
liminary processing may have taken place before the primary data become
available for analysis.
In other contexts where the evaluation and scoring of complex data are
involved, blind scoring of a subsample by independent experts is desirable.
When data are collected over a substantial time period, calibration checks
are important and also checks for sticking instruments, that is, instruments
that repeatedly record the same value, often at the extreme of the feasible
range for the instrument concerned. Thus a long sequence of zero rain-
falls recorded in a tipping-bucket rain gauge may be strong evidence of a
defective gauge.
5.7 Discussion
It is impossible to draw a clear line between analyses which are exploratory
and those which form the main body of an analytical study. Diagnostic
investigations into the fit of a particular model may lead back to further
exploratory analyses or even to further data screening and cleaning. The
process is usually, and indeed should be, iterative as further insights are
gained both into the phenomenon under study and into other processes
which have contributed to the generation of the observed data.
The documentation of methods and results is important throughout to
avoid confusion at later stages and in publications about the work. More
generally, for data of an especially expensive kind and for data likely to
be of broad subsequent interest, early consideration should be given to the
subsequent archiving of information. Material to be recorded should in-
clude: the raw data, not summaries such as means and variances; a clear
record of the data collection methods and definitions of variable coding
schemes (including the coding of missing values); any computer programs
used to retrieve the study data from a complex database.
6
Model formulation
6.1 Preliminaries
Simple methods of graphical and tabular analysis are of great value. They
are essential in the preliminary checking of data quality and in some cases
may lead to clear and convincing explanations. They play a role too in pre-
senting the conclusions even of quite complex analyses. In many contexts
it is desirable that the conclusions of an analysis can be regarded, in part at
least, as summary descriptions of the data as well as interpretable in terms
of a probability model.
Nevertheless careful analysis often hinges on the use of an explicit prob-
ability model for the data. Such models have a number of aspects:
In connection with this last point note that, while it is appealing to use
methods that are in a reasonable sense fully efficient, that is, extract all
relevant information in the data, nevertheless any such notion is within
the framework of an assumed model. Ideally, methods should have this
efficiency property while preserving good behaviour (especially stability
of interpretation) when the model is perturbed.
Essentially a model translates a subject-matter question into a mathe-
matical or statistical one and, if that translation is seriously defective, the
analysis will address a wrong or inappropriate question, an ultimate sin.
The very word ‘model’ implies that an idealized representation is in-
volved. It may be argued that it is rarely possible to think about complex
situations without some element of simplification and in that sense models
of some sort are ubiquitous. Here by a model we always mean a probability
model.
Most discussion in this book concerns the role of probability models as
an aid to the interpretation of data. A related but somewhat different use of
such models is to provide a theoretical basis for studying a phenomenon.
Furthermore, in certain cases they can be used to predict consequences in
situations for which there is very little or no direct empirical data. The
illustration concerning Sordaria discussed below in Section 6.3 is a simple
example of a model built to understand a biological phenomenon. In yet
other contexts the role of a probability model may be to adjust for special
features of the data collection process.
Stereology is concerned with inferring the properties of structures in sev-
eral dimensions from data on probes in a lower number of dimensions; see
Baddeley and Jensen (2005). In particular, many medical applications in-
volve inferring properties of three-dimensional structures in the body from
scans producing two-dimensional cross-sections.
Often, more detailed models involve progression in time expressed by
differential equations or progression in discrete time expressed by differ-
ence equations. Sometimes these are deterministic, that is, they do not
involve probability directly, and provide guidance on the systematic varia-
tion to be expected. Such initially deterministic models may have a ran-
dom component attached as an empirical representation of unexplained
variation. Other models are intrinsically probabilistic. Data-generating
processes nearly always evolve over time, and in physics this manifests
itself in the near-ubiquity of differential equations. In other fields essen-
tially the same idea may appear in representations in terms of a sequence
of empirical dependencies that may be suggestive of a data-generating
process.
Depending on the nature of the unknown parameter θ, a model may be:
• parametric;
• nonparametric; or
• semiparametric.
We will concentrate largely on parametric formulations in which θ has a
finite, sometimes relatively small, number of components. Familiar exam-
ples are the standard forms of generalized linear regression, in particular
those with independent normally distributed deviations; in this case the pa-
rameter θ consists of the unknown regression coefficients and the variance
of the associated normal distribution. One of the simplest such forms is
a straight-line relationship between the explanatory variables x_j and the
response variables Y_j, namely

Y_j = β_0 + β_1 x_j + ε_j,    (6.1)

where for j = 1, . . . , n the deviations ε_j are independently normally dis-
tributed with mean zero and unknown variance σ², and each value of j
corresponds to a distinct study individual. The vector parameter is θ =
(β_0, β_1, σ²), although in particular instances one or more of the components
might be known. The systematic component of the model can be written as

E(Y_j) = β_0 + β_1 x_j.    (6.2)
Many widely used models are essentially generalizations of this.
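As a concrete instance, the sketch below simulates data from the straight-line model (6.1) and recovers estimates of the components of θ = (β_0, β_1, σ²) by ordinary least squares; statsmodels is assumed to be available and the parameter values are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)

beta0, beta1, sigma = 1.0, 2.5, 0.7          # illustrative values of the components of theta
x = rng.uniform(0.0, 10.0, 50)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, 50)   # data generated from model (6.1)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)       # estimates of beta_0 and beta_1
print(fit.scale)        # estimate of the error variance sigma^2
print(fit.conf_int())   # assessment of the precision of estimation
```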
By contrast a nonparametric formulation of such a relationship might be
Y_j = φ(x_j) + ε_j,    (6.3)

where φ(x) is an unknown function of x constrained only by some smooth-
ness conditions or by being monotonic, and the ε_j are mutually independent
random variables with median zero and with otherwise unknown and arbi-
trary distribution.
There are two broad forms of semiparametric formulation. In one the
distribution of the j remains nonparametric whereas the regression func-
tion is linear or has some other simply parameterized form. In the other
the roles are reversed. The function φ(x) remains as in the nonparametric
Checking the adequacy of a model is best regarded not as asking whether the model is true but as asking whether there is clear evidence of
a specific kind of departure implying a need to change the model so as to
avoid distortion of the final conclusions.
More detailed discussion of the choice of parameters is deferred to
Section 7.1, and for the moment we make a distinction only between the
parameters of interest and the nuisance parameters. The former address
directly the research question of concern, whereas typically the values of
the nuisance parameters are not of much direct concern; these parameters
are needed merely to complete the specification.
In the linear regression model (6.1) the parameter β_1, determining the
slope of the regression line, would often be the parameter of interest. There
are other possibilities, however. For example interest might lie in the inter-
cept of the line at x = 0, i.e. in β_0, in particular in whether the line passes
through the origin. Yet another possible focus of interest is the value of
x at which the expected response takes a preassigned value y*, namely
(y* − β_0)/β_1.
For a binary response Y depending on an explanatory variable x, the direct linear form

P(Y = 1) = α + βx    (6.8)
incurs the constraint on its validity that for some values of x probabilities
that are either negative or above 1 are implied.
For this reason (6.8) is commonly replaced by, for example, the linear
logistic form
log{P(Y = 1)/P(Y = 0)} = α′ + β′x,    (6.9)
which avoids any such constraint. The form (6.8) does, however, have the
major advantage that the parameter β, which specifies the change in proba-
bility per unit change in x, is more directly understandable than the param-
eter β′, interpreted as the change in log odds, the left-hand side of (6.9),
per unit x. If the values of x of interest span probabilities that are restricted,
say to the interval (0.2, 0.8), the two models give essentially identical con-
clusions and the use of (6.8) may be preferred.
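The near-equivalence over a restricted range can be seen directly. The sketch below, with invented parameter values chosen so that the probabilities stay roughly within (0.2, 0.8), fits both the linear form (6.8) and the logistic form (6.9) to the same simulated binary data and compares the fitted probabilities; statsmodels is assumed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Probabilities confined to roughly (0.2, 0.8) over the range of x considered.
x = rng.uniform(-1.0, 1.0, 2000)
p_true = 1.0 / (1.0 + np.exp(-1.3 * x))
y = rng.binomial(1, p_true)

X = sm.add_constant(x)
logistic_fit = sm.Logit(y, X).fit(disp=False)   # the linear logistic form (6.9)
linear_fit = sm.OLS(y, X).fit()                  # the direct linear form (6.8)

grid = sm.add_constant(np.linspace(-1.0, 1.0, 5))
print(np.round(logistic_fit.predict(grid), 3))   # fitted P(Y = 1) from (6.9)
print(np.round(linear_fit.predict(grid), 3))     # fitted P(Y = 1) from (6.8): very close here
```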
If the relationship between a response variable and a single continuous
variable x is involved then, given suitable data, the fitting of quite compli-
cated equations may occasionally be needed and justified. For example, the
growth curves of individuals passing through puberty may be quite com-
plex in form.
In many common applications, especially but not only in the social sci-
ences, the relationship between Y and several or indeed many variables
x_1, . . . , x_p is involved. In this situation, possibly after transformation of
some of or all the variables and preliminary graphical analysis, the fol-
lowing general ideas often apply:
• it is unlikely that a complex social system, say, can be treated as wholly
linear in its behaviour;
• it is impracticable to study directly nonlinear systems of unknown form
in many variables;
• therefore it is reasonable to begin by considering a model linear in all or
nearly all of the x_1, . . . , x_p;
• having completed the previous step, a search should be made for iso-
lated nonlinearities in the form of curved relationships with individual
variables x_i and interactions between pairs of variables.
The last step is feasible even with quite large numbers of explanatory
variables.
Thus a key role is played in such applications by the fitting of linear re-
lationships of which the simplest is the multiple linear regression, where
for individual j the response Y_j is related to the explanatory variables
x_{j1}, . . . , x_{jp} by

E(Y_j) = β_0 + β_1 x_{j1} + · · · + β_p x_{jp}.    (6.10)

Here β_1, say, specifies the change in E(Y) per unit change in x_1 with the
remaining explanatory variables x_2, . . . , x_p held fixed. It is very important
to appreciate the deficiencies of this notation; in general β_1 is influenced
by which other variables are in the defining equation and indeed in a better,
if cumbersome, notation, β_1 is replaced by β_{y1|2...p}. If, say, x_2 were omit-
ted from the equation then the resulting coefficient of x_1, now denoted by
β_{y1|3...p}, would include any effect on E(Y) of the change in x_2 induced by
the implied change in x_1. It is crucial, in any use of a relationship such as
(6.10) to assess the effect of, say, x_1, that only the appropriate variables are
included along with x_1.
[Figure: concentration–time curve for a single animal, showing the peak concentration, the time of the peak and the area under the curve; time in hours.]
Two general issues are that the assessment of precision comes primarily
from comparisons between units of analysis and that the modelling of vari-
ation within units is necessary only if such internal variation is of intrinsic
interest.
To compare the treatments the complex response for each rat must
be reduced to a simpler form. One way that does not involve a spe-
cific model is to replace each set of values by, say, the peak concen-
tration recorded, the time at which that peak is recorded and the total
area under the concentration–time curve. Comparison of the treatments
would then hopefully be accomplished, for example by showing the dif-
ferences in some and the constancy of others of these distinct features.
An alternative approach would be to model the variation of concentra-
tion with time either by an empirical equation or perhaps using the un-
derlying differential equations. The treatment comparison would then
be based on estimates of relevant parameters in the relationship thus
fitted.
The general issue involved here arises when relatively complex re-
sponses are collected on each study individual. The simplest and often the
most secure way of condensing these responses is through a number of
summary descriptive measures. The alternative is by formal modelling of
the response patterns. Of course, this may be of considerable intrinsic in-
terest but inessential in addressing the initially specified research question.
In other situations it may be necessary to represent explicitly the differ-
ent hierarchies of variation.
Illustration: Hazards of ignoring hierarchical data structure  An analysis of
the success of in vitro fertilization pre-embryo transfer (IVF-ET) used
hierarchical logistic regression models to analyse the effect of fallopian
tube blockage, specifically hydrosalpinx, on the probability of embryo
implantation success. Variation was modelled both at the individual-
woman level and the embryo level, allowing for variability in women’s
unobservable probability of implantation success (Hogan and Blazar).
Notes
The role of empirical statistical models is discussed in many books on sta-
tistical methods. For a general discussion of different types of model with
more examples, see Lehmann (1990) and Cox (1990).
7
Model choice
7.1 Criteria for parameters
variation not of direct interest. The roles are reversed, however, when
attention is focused on the haphazard variation.
The choice of parameters involves, especially for the parameters of inter-
est, their interpretability and their statistical and computational properties.
Ideally, simple and efficient methods should be available for estimation and
for the assessment of the precision of estimation.
in the variables being studied but which are of no, or limited, direct
concern.
One general term for such features is that they are nonspecific. Two dif-
ferent centres in a trial may vary in patient mix, procedures commonly
used, management style and so on. Even if clear differences in outcome
are present between clinics, specific detailed interpretation will be at best
hazardous.
It may be necessary to take account of such features in one of two differ-
ent ways. The simpler is that, on an appropriate scale, there is a parameter
representing a shift in level of outcome. The second and more challenging
possibility is that the primary contrasts of concern, treatments, say, them-
selves vary across centres; that is, there is a treatment–centre interaction.
We will deal first with the simpler case.
The possibility arises of representing the effects of such nonspecific vari-
ables by random variables rather than by fixed unknown parameters. In a
simple example, within each of a fairly large number of centres, individuals
are studied with a binary response and a number of explanatory variables,
including one representing the treatment of primary concern. For individu-
als in centre m the binary response Y_{mi} might be assumed to have a logistic
regression with

log{P(Y_{mi} = 1)/P(Y_{mi} = 0)} = α_m + β^T x_{mi},    (7.9)

where x_{mi} is a vector of explanatory variables, the vector of parameters β is
assumed to be the same for all centres and the constant α_m characterizes the
centre effect. A key question concerns the circumstances under which the
α_m should be treated as unknown constants, that is, fixed effects, and those
in which it is constructive to treat the α_m as random variables representing
what are therefore called random effects.
(a)  Block   1   2   3   4   5
            T2  T1  T3  T3  T1
            T1  T2  T1  T2  T3
            T3  T3  T2  T1  T2

(b)  Block   1   2   3   4   5
            T1  T2  T1  T2  T3
            T1  T3  T2  T2  T2
            T2  T3  T3  T3  T1
A model in which each observation is the sum of a treatment constant and a block constant, and
also has a random error, leads to estimation of the parameters via respec-
tively the treatment means and the block means. It is fairly clear on general
grounds, and can be confirmed by formal analysis, that estimation of the
treatment effects from the corresponding marginal means is unaffected by
whether the block effects are:
For the first two possibilities, but not the third, estimation of the residual
variance is also unaffected. Thus the only difference between treating the
block constants as random rather than fixed lies in a notional generalization
of the conclusions from the specific blocks used to a usually hypothetical
population of blocks.
If, however, the complete separation of block and treatment effects in-
duced by the balance of Table 7.1(a) fails, the situation is different. See
Table 7.1(b) for a simple special case with an unrealistically small number
of blocks! The situation is unrealistic as the outcome of an experiment but
is representative of the kinds of imbalance inherent in much comparable
observational data.
If now the block parameters are treated as independent and identically
distributed with variance, say, σ_B² and the residuals in the previous model
have variance σ_W² then all observations have variance σ_B² + σ_W² and two
observations have covariance σ_B² if and only if they are in the same block
and are otherwise uncorrelated. The so-called method of generalized least
squares may now be used to estimate the parameters using empirical es-
timates of the two variances. Unless σ_B²/σ_W² is large the resulting treat-
ment effects are not those given by the fixed-effect analysis but exploit
the fact that the block parameters are implicitly assumed not to be greatly
different from one another, so that some of the apparent empirical varia-
tion between the block means is explicable as evidence about treatment
effects. It is assumed that reasonable estimates of the two variances are
available. If in fact the ratio of the variances is very small, so that inter-
block variation is negligible, then, as might be expected, the block struc-
ture can be ignored and the treatment effects again estimated from marginal
means.
Indeed in the older experimental design literature the above proce-
dure, presented slightly differently, was called the recovery of inter-block
information.
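The two treatments of the block effects can be compared directly in software. The sketch below simulates data for the unbalanced allocation of Table 7.1(b), fits fixed block effects by least squares and then a mixed model with the blocks as random effects; the effect sizes are invented, statsmodels is assumed, and with so few observations the mixed-model fit may issue convergence warnings.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)

# Unbalanced allocation of treatments T1-T3 to five blocks, as in Table 7.1(b).
alloc = {1: "112", 2: "233", 3: "123", 4: "223", 5: "321"}
rows = [{"block": b, "treat": f"T{t}"} for b, treats in alloc.items() for t in treats]
df = pd.DataFrame(rows)

treat_eff = {"T1": 0.0, "T2": 0.5, "T3": 1.0}                   # illustrative treatment effects
block_eff = dict(zip(alloc, rng.normal(0.0, 1.0, len(alloc))))  # block-to-block variation
df["y"] = (df["treat"].map(treat_eff) + df["block"].map(block_eff)
           + rng.normal(0.0, 0.5, len(df)))

# Fixed block effects: treatment contrasts rest only on within-block information.
fixed = smf.ols("y ~ C(treat) + C(block)", data=df).fit()

# Random block effects: some inter-block information is recovered as well.
mixed = smf.mixedlm("y ~ C(treat)", data=df, groups=df["block"]).fit()

print(fixed.params.filter(like="treat"))
print(mixed.params.filter(like="treat"))
```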
We now move away from the relatively simple situations typified by a
randomized block design to consider the broad implications for so-called
mixed-model analyses and representations. The supposition throughout is
that in the model there are explanatory variables of direct and indirect inter-
est and also one or more nonspecific variables that are needed but are typi-
cally not of intrinsic interest. The issue is whether such variables should be
represented by unknown parameters regarded as unknown constants or by
random variables of relatively simple structure, typically, for each source
independent and identically distributed. The following considerations are
relevant.
event rates that is stable rather than the ratio? Or does the ratio depend
in a systematic way on the socio-economic make-up of the patient pop-
ulation at each centre? While any such explanation is typically tentative
it will usually be preferable to the other main possibility, that of treating
interaction effects as random.
• The value, the standard error and most importantly the interpretation of
the regression coefficient of y on x∗ in general depends on which other
explanatory variables enter the fitted model.
• The explanatory variables prior to x∗ in a data-generating process should
be included in the model unless either they are conditionally indepen-
dent of y given x∗ and other variables in the model or are conditionally
independent of x∗ given those other variables. These independencies are
to be consistent with the data and preferably are plausible a priori.
• The variables intermediate between x∗ and y are to be omitted in an
initial assessment of the effect of x∗ on y but may be helpful in a later
stage of interpretation in studying pathways of dependence between x∗
and y.
• Relatively mechanical methods of choosing which explanatory variables
to use in a regression equation may be helpful in preliminary explo-
ration, especially if p is quite large, but are insecure as a basis for a final
interpretation.
• Explanatory variables not of direct interest but known to have a sub-
stantial effect should be included; they serve also as positive controls.
Occasionally it is helpful to include explanatory variables that should
have no effect; these serve as negative controls, that is, if they do not
have the anticipated positive or null impact then this warns of possible
misspecification.
• It may be essential to recognize that several different models are essen-
tially equally effective.
• If there are several potential explanatory variables on an equal footing,
interpretation is particularly difficult in observational contexts.
When the dimension of x∗ is greater than one then, even though choos-
ing an appropriate model may raise no special difficulties, interpretation of
the estimated parameters may be ambiguous, especially in observational
studies.
A particular issue that may arise at the initial stage of an analysis con-
cerns situations in which a number of conceptually closely related mea-
sures are recorded about the same aspect of the study individuals.
for inclusion in the model may have been omitted. It will be wise to add
them back into the model, one (or a very small number) at a time, and
to check whether any major change in the conclusions is indicated. Sec-
ond, and very important, the initial discussion was predicated on a linear
fit. Now, especially in relatively complex situations it is unlikely that lin-
earity of response is other than a crude first approximation. While specific
nonlinear effects can, of course, be included from the start, a general form
of nonlinearity with many explanatory variables is not a feasible base for
analysis, unless p is small. A compromise is that, probably at the end of the
first phase of analysis but perhaps earlier, nonlinear and interaction effects,
such as are expressed by $x_j^2$ and $x_j x_k$, are explored one at a time by their
addition to the model as formal explanatory variables to describe curvature
of the response to the explanatory variables. The resulting test statistics, as-
sessed graphically, provide a warning process rather than a direct interpre-
tation; for example, the detection of interaction signals the need for detailed
interpretation.
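A minimal sketch of this one-term-at-a-time exploration (simulated data; the variable names and the use of ordinary least squares here are illustrative assumptions, not prescriptions from the text):

    # Sketch: add candidate squared and product terms to a linear fit one at a
    # time and record the resulting t-statistics as a warning device.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, p = 200, 4
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + 0.4 * X[:, 0] * X[:, 1] + rng.normal(size=n)

    base = sm.add_constant(X)                       # the initial linear fit
    candidates = {f"x{j}^2": X[:, j] ** 2 for j in range(p)}
    candidates.update({f"x{j}*x{k}": X[:, j] * X[:, k]
                       for j in range(p) for k in range(j + 1, p)})

    for name, term in candidates.items():
        fit = sm.OLS(y, np.column_stack([base, term])).fit()
        print(f"{name:8s}  t for added term = {fit.tvalues[-1]:6.2f}")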
The choice of a regression model is sometimes presented as a search for
a model with as few explanatory variables as reasonably necessary to give
an adequate empirical fit. That is, explanatory variables that are in some
sense unnecessary are to be excluded, regardless of interpretation. This
approach, which we do not consider further, nor in general recommend, may
sometimes be appropriate for developing simple empirical prediction equations,
although even then the important aspect of the stability of the prediction
equation is not directly addressed.
A topic that has attracted much interest recently is that of regression-like
studies in which the number, p, of explanatory variables exceeds the num-
ber, n, of independent observations. In these applications typically both n
and p are large. Some key points can be seen in the simple case studied
in the industrial experimental design literature under the name ‘supersat-
urated design’. In this one might wish to study, for example, 16 two-level
factors with only 12 distinct experimental runs feasible. If it can be safely
assumed that only one of the 16 factors has an important effect on response
then, especially if a suitable design (Booth and Cox, 1962) is chosen, there
is a good chance that this factor can be identified and its effect estimated.
If, however, several factors have appreciable effects then it is likely that
confusing and potentially misleading results will be obtained. That is, suc-
cess depends crucially on the sparsity of the effects. Whether it is ever wise
in practice to use such a design is another matter!
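A minimal simulation sketch of this point (a randomly generated two-level design and a lasso fit are used purely for illustration; they are not the systematic designs of Booth and Cox (1962), nor a recommended analysis):

    # Sketch: 16 two-level factors in 12 runs. With one large effect a sparse
    # fit often identifies it; with several appreciable effects the results
    # are liable to be confusing (simulated data).
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(2)
    n_runs, n_factors = 12, 16
    X = rng.choice([-1.0, 1.0], size=(n_runs, n_factors))   # a random two-level design

    def fitted_effects(true_beta):
        y = X @ true_beta + 0.5 * rng.normal(size=n_runs)
        return Lasso(alpha=0.3).fit(X, y).coef_

    one_active = np.zeros(n_factors)
    one_active[3] = 2.0
    several_active = np.zeros(n_factors)
    several_active[[1, 5, 9, 13]] = 2.0

    print("one active factor:     ", np.round(fitted_effects(one_active), 1))
    print("several active factors:", np.round(fitted_effects(several_active), 1))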
It has been shown recently that for general linear regression problems
with p > n and with sparse effects, that is, most regression coefficients
Notes
Section 7.1. For more discussion of the choice of parameters, see Ross
(1990). Dimensional analysis is described in textbooks on classical
physics. For a general discussion of the use of floating parameters, see
Firth and de Menezes (2004). There is an extensive literature on the anal-
ysis of survival data, variously with engineering, medical or social science
emphasis. Kalbfleisch and Prentice (2002) gave a thorough account in a
largely medical context. For the comparison of different parametric forms,
see Cox and Oakes (1984).
Section 7.2. For recovery of inter-block information, see Cox and Reid
(2000).
8
Techniques of formal inference
8.1 Preliminaries
The details of specific statistical methods and the associated mathemati-
cal theory will not be discussed here. We shall, however, outline the main
forms in which statistical conclusions are presented, because understand-
ing of the strengths and limitations of these forms is essential if misunder-
standing is to be avoided.
We discuss first analyses in which interpretation centres on individual
parameters of interest; that is, in general we investigate component by
component. We denote a single such parameter by ψ. This can represent
some property of interest, such as the number of individual animals of a
specific wildlife species in a particular area, or it can represent contrasts
between groups of individuals in an outcome of interest or, in a linear
regression application, ψ can be the slope of the relationship.
It is desirable that ψ is considered in ‘sensible’ units, chosen to give
answers within or not too far outside the range from 0.1 to 10. For example,
a slope of a relationship of length against time, that is a speed, might be in
mm per hour or km per day, etc., as appropriate and an incidence rate might
be in the number of cases per 100 000 person-years or in cases per 100
person-days. It is good practice always to state and repeat units explicitly.
Then conclusions about ψ can be presented as:
• find a function v of the data such that under H0 the corresponding ran-
dom variable V has, to an adequate approximation, a known distribution,
for example the standard normal distribution, or, in complicated cases, a
distribution that can be found numerically;
• collect the data and calculate v;
• if the value of v is in the lower or central part of the distribution, the data
are consistent with H0 ; or
• if v is in the extreme upper tail of the distribution then this is evidence
against H0 .
The last step is made more specific by defining the p-value of the data y,
yielding the value $v_{\rm obs}$, as

p_{\rm obs} = P(V \ge v_{\rm obs}; H_0).
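A minimal numerical sketch of these steps, assuming that under H0 the statistic V is approximately standard normal (simulated data; the particular test statistic is illustrative only):

    # Sketch: compute an observed test statistic v and the associated p-value,
    # taking V to be approximately standard normal under H0.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    y = rng.normal(loc=0.2, scale=1.0, size=50)       # simulated data

    # H0: population mean is zero; v is the standardized sample mean.
    v_obs = np.sqrt(len(y)) * y.mean() / y.std(ddof=1)
    p_value = stats.norm.sf(v_obs)                    # P(V >= v_obs) under H0

    print(f"v = {v_obs:.2f}, one-sided p = {p_value:.3f}")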
choices such as the level 0.05. The critical level that is sensible depends
strongly on context and often on convention.
• The p-value, although based on probability calculations, is in no sense
the probability that H0 is false. Calculating that would require an ex-
tended notion of probability and a full specification of the prior proba-
bility distribution, not only of the null hypothesis but also of the distribu-
tion of unknown parameters in both null and alternative specifications.
• The p-value assesses the data, in fact the function v, via a comparison
with that anticipated if H0 were true. If in two different situations the
test of a relevant null hypothesis gives approximately the same p-value,
it does not follow that the overall strengths of the evidence in favour of
the relevant H0 are the same in the two cases. In one case many plausible
alternative explanations may be consistent with the data, in the other
very few.
• With large amounts of informative data, a departure from H0 may be
highly significant statistically yet the departure is too small to be of
subject-matter interest. More commonly, with relatively small amounts
of data the significance test may show reasonable consistency with
H0 while at the same time it is possible that important departures are
present. In most cases, the consideration of confidence intervals is desir-
able whenever a fuller model is available. For dividing null hypotheses,
the finding of a relatively large value of p warns that the direction of
departure from H0 is not firmly established by the data under analysis.
model families will be available. Then the most fruitful approach is usually
to take each model as the null hypothesis in turn, leading to a conclusion
that both are consistent with the data, or that one but not the other is consis-
tent or that neither is consistent with the data. Another possibility is to form
a single model encompassing the two. Methods are sometimes used that
compare in the Bayesian setting the probabilities of ‘correctness’ of two or
more models but they may require, especially if the different models have
very different numbers of parameters, assumptions about the prior distri-
butions involved that are suspect. Moreover such methods do not cover the
possibility that all the models are seriously inadequate.
More commonly, however, there is initially just one model family for
assessment. A common approach in regression-like problems is the graph-
ical or numerical study of residuals, that is, the differences between the ob-
served and the fitted response variables, unexpected patterns among which
may be suggestive of the need for model modification. An alternative, and
often more productive, approach is the use of test statistics suggestive of ex-
plicit changes. For example, linear regression on a variable x can be tested
by adding a term in x2 and taking the zero value of its coefficient as a
null hypothesis. If evidence against linearity is found then an alternative
nonlinear representation is formulated, though it need not necessarily be
quadratic in form.
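A minimal sketch of this check (simulated data; ordinary least squares via the statsmodels package is an assumption made for illustration):

    # Sketch: test linearity of the regression of y on x by adding an x^2 term
    # and taking a zero coefficient for that term as the null hypothesis.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 3, size=120)
    y = 1.0 + 0.8 * x + 0.3 * x**2 + rng.normal(scale=0.5, size=120)   # mildly nonlinear

    fit = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
    print("t-statistic for the x^2 term:", round(fit.tvalues[2], 2))
    print("p-value for the x^2 term:    ", round(fit.pvalues[2], 4))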
The status of a null hypothesis model as dividing or atomic depends on
the context. Some models are reasonably well established, perhaps by pre-
vious experience in the field or perhaps by some theoretical argument. Such
a model may be regarded as an atomic null hypothesis. In other situations
there may be few arguments other than simplicity for the model chosen,
and then the main issue is whether the direction in which the model might
be modified is reasonably firmly established.
cells in which the observed counts exceed the background. A very large
number of individual cells may need to be examined unless the energy
associated with the events of specific interest is known.
In all these examples, except probably the last, the issue may appear to
be how to interpret the results of a large number of significance tests. It is,
however, important to be clear about objectives. In the last illustration, in
particular, a decision process is involved in which only the properties of
the varieties finally selected are of concern.
8.5.2 Formulation
In fact, despite the apparent similarity of the situations illustrated above, it
is helpful to distinguish between them. In each situation there is a possibly
large number r of notional null hypotheses, some or none of which may be
false; the general supposition is that there are at most a small number of
interesting effects present. Within that framework we can distinguish the
following possibilities.
• It may be that no real effects are present, that is that all the null hypothe-
ses are simultaneously true.
• It may be virtually certain that a small number of the null hypotheses
are false, and so the requirement is to specify a set containing the false
null hypotheses by a procedure with specified statistical properties.
• The previous situation may hold except that one requires to attach indi-
vidual assessments of uncertainty to the selected cases.
• The whole procedure is defined in terms of a series of stages and only
the properties of the overall procedure are of concern.
important effect will be found even if all null hypotheses hold. If p∗ corre-
sponds to the smallest of r independent tests then the appropriate probabil-
ity of false rejection is
1 - (1 - p^*)^r.    (8.4)
In the special case this becomes $1 - 0.98^{20}$, which is approximately 0.33.
That is, there is nothing particularly surprising in finding p∗ = 0.02 when
20 independent tests are considered. In some contexts, such as discovery
problems in particle physics and genome-wide association scans in genet-
ics, very much larger values of r, possibly as high as $10^5$, will arise.
If rp∗ is small then the corrected p-value approximately equals the Bon-
ferroni bound rp∗ . Moreover this provides an upper limit to the appropriate
adjusted significance level even if the tests are not independent.
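A minimal sketch of the adjustment (8.4) and of the Bonferroni bound for the numerical example above:

    # Sketch: adjust the smallest of r p-values for the selection involved.
    def adjusted_p(p_star, r):
        exact = 1 - (1 - p_star) ** r        # (8.4), assuming independent tests
        bonferroni = min(r * p_star, 1.0)    # upper bound, valid without independence
        return exact, bonferroni

    exact, bound = adjusted_p(0.02, 20)
    print(f"adjusted p, independent tests: {exact:.2f}")   # about 0.33
    print(f"Bonferroni bound:              {bound:.2f}")   # 0.40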
It is important to recognize that adjustment of the significance level is
required not so much because multiple tests are used as because a highly
selected value is being considered. Thus in a factorial experiment several or sometimes
many effects are studied from the same data. Provided each component of
analysis addresses a separate research question, confidence intervals and
tests need no allowance for multiple testing. It is only when the effect for
detailed study is selected, in the light of the data, out of a number of possi-
bilities that special adjustments may be needed.
Of course, when trying out several tests of the same issue, choosing one
just because it gives the smallest p-value amounts to the misuse of a tool.
Notes
Section 8.1. Cox (2006) gives a general introduction to the concepts un-
derlying statistical inference.
Section 8.4. Tippett (1927) was probably the first to produce tables of
random digits to aid empirical random sampling studies. The bootstrap
approach was introduced by Efron (1979); for a thorough account, see
Davison and Hinkley (1997). For a general account of simulation-based
methods, see Ripley (1987) and for MCMC, see Robert and Casella (2004).
9
Interpretation
9.1 Introduction
We may draw a broad distinction between two different roles of scien-
tific investigation. One is to describe an aspect of the physical, biological,
social or other world as accurately as possible within some given frame
of reference. The other is to understand phenomena, typically by relating
conclusions at one level of detail to processes at some deeper level.
In line with that distinction, we have made an important, if rather vague,
distinction in earlier chapters between analysis and interpretation. In the
latter, the subject-matter meaning and consequences of the data are em-
phasized, and it is obvious that specific subject-matter considerations must
figure strongly and that in some contexts the process at work is intrinsically
more speculative. Here we discuss some general issues.
Specific topics involve the following interrelated points.
• To what extent can we understand why the data are as they are rather
than just describe patterns of variability?
• How generally applicable are such conclusions from a study?
• Given that statistical conclusions are intrinsically about aggregates, to
what extent are the conclusions applicable in specific instances?
• What is meant by causal statements in the context in question and to
what extent are such statements justified?
• How can the conclusions from the study be integrated best with the
knowledge base of the field and what are the implications for further
work?
The intermediate variables are ignored in the first direct analysis of the
effect of primary explanatory variables on the response. As mentioned
above, however, the intermediate variables may themselves be response
variables in further analyses. In other situations it is helpful to decompose
the dependence of Y on T into a path through I, sometimes called an indi-
rect effect, and a remainder, sometimes misleadingly called a direct effect;
a better name for this remainder would be an unexplained effect. In terms
of linear regression this decomposition is expressed as
at essentially the same time and it is not impossible that the incentive
to learn more about the disease is influenced by the level of success
achieved in controlling it.
still hold. Conversely, a check that they are all satisfied is no guarantee of
causality.
The reason behind the first point is that the larger the effect the less
likely it is to be the consequence of an unobserved confounder. Point 5 has
greater cogency if the explanation is both clearly evidence-based and avail-
able beforehand. Retrospective explanations may be convincing if based on
firmly established theory but otherwise need to be treated with special cau-
tion. It is well known in many fields that ingenious explanations can be
constructed retrospectively for almost any finding.
In point 6, a natural experiment means a large intervention into a system,
unfortunately often harmful.
The most difficult of the guidelines to assess is the last, about the speci-
ficity of effects. In most contexts the pathways between the proposed cause
and its effect are quite narrowly defined; a particular physical, biological
or sociological process is involved. If this is the case then the proposed re-
lationship should be seen as holding only in quite restricted circumstances.
If the relationship in fact holds very broadly there are presumably many
different processes involved all pointing in the same direction. This is not
quite in the spirit of causal understanding, certainly not in the sense of the
detailed understanding of an underlying process.
[Figure: diagram relating baseline or intrinsic variables, intermediate
variables and the response; panel (b).]

Table 9.1 Factor effects that combine multiplicatively rather than additively

Factor 2
level 1     4     6     7
level 2     8    12    14
9.5 Interaction
9.5.1 Preliminaries
Interaction may seem to be more a detailed issue of statistical analysis than
one of interpretation, but in some contexts it has a valuable interpretive
role.
First, the term interaction itself is in some ways a misnomer. There is
no necessary implication of interaction in the physical sense or synergy in
a biological context. Rather, interaction means a departure from additivity
of the effects of two or more explanatory variables. Additivity, that is the
absence of interaction, is expressed most explicitly by the requirement that,
apart from random fluctuations, the difference in outcome between any two
levels of one factor is the same at all levels of the other factor.
In Table 9.1 the factor effects combine multiplicatively rather than ad-
ditively and so the interaction is removed by log transformation. In more
general situations the interaction may be removable by a less familiar trans-
formation, although if this is to an uninterpretable scale of measurement it
may be best to retain the representation with interaction especially if the
original scale is extensive in the sense of Section 4.3. Table 9.2 shows a
complementary empirical case where the data are consistent with no inter-
action on an additive scale but not with a simple multiplicative effect.
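A minimal numerical sketch using the cell means of Table 9.1 as reconstructed above: on the original scale the difference between the two rows changes across columns, whereas on the log scale it is constant, so the interaction is removed.

    # Sketch: interaction on the additive scale removed by log transformation.
    # Rows are the two levels of factor 2; columns are levels of the other factor.
    import numpy as np

    table = np.array([[4.0, 6.0, 7.0],
                      [8.0, 12.0, 14.0]])

    print("row differences, original scale:", table[1] - table[0])                  # 4, 6, 7: not constant
    print("row differences, log scale:     ", np.log(table[1]) - np.log(table[0]))  # all log 2: constant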
The most directly interpretable form of interaction, certainly not remov-
able by transformation, is effect reversal. This, while relatively rare in most
contexts, may have strong implications. Table 9.3 shows an example where
Table 9.2 An example where the data are
consistent with no additive interaction but
inconsistent with no multiplicative interaction.
The data give the relative risk of lung cancer.
The risks are approximately additive but far
from multiplicative (Gustavsson et al., 2002)
Asbestos exposure
Smoking No Yes
no 1.0 10.2
yes 21.7 43.1
Table 9.3 (extract), illustrating effect reversal: low 1.54, medium 1.07, high 0.57.
P(W = i, X = j, Y = k) = p_{ijk},    (9.4)
The second kind of data are relatively short series collected on a large
number of essentially independent individuals. Thus a panel of individu-
als may be interviewed every year for a few years to enquire about their
voting intentions, attitudes to key political and social issues and so on.
In another type of investigation, following injection of a monitoring sub-
stance the blood concentration may be recorded at regular intervals for, say,
a 24 hr period. The object is to assess the resulting concentration-time re-
sponse curves, possibly in order to compare different groups of individuals.
Both types of data lead to a wide variety of statistical methods, which
we shall not attempt to summarize. Instead we give a few broad principles
that may help an initial approach to such data. An important general precept
applying to both types of data, and indeed more generally, is that qualitative
inspection of the data, or of a sample of it, is crucial before more formal
methods of analysis are applied.
• Summarize the response curve of each individual by one or a small number
of derived variables and base the analysis on these summaries.
• Analyse the data in two phases, first fitting a suitable model to the data
for each individual and then in a second phase analysing inter-individual
variation.
• Fit a single multi-level model in which all variation is represented con-
sistently in a single form.
The choice clearly depends on the focus of interest. We have already
discussed the first approach in connection with the illustration ‘Summariz-
ing response curves’ (p. 115) and so now we discuss briefly just the second
and third possibilities.
Probably the simplest possibility is that for each individual there are
observations of a response y at a series of time points and that a broadly
linear relationship holds for each individual. If the time points are fairly
widely spaced then a linear regression model for individual i,
y_{it} = \alpha_i + \beta_i (t - t_0) + \epsilon_{it},    (9.12)
with independent errors $\epsilon_{it}$ might be taken; otherwise time series models for
the correlations between errors at different time points might be needed.
In the simple version, for the first stage of analysis we would have an
estimated linear regression coefficient bi and an intercept ai , the latter cal-
culated at the reference time point t0 common to all individuals. In the
second phase of analysis the estimates (ai , bi ) are studied. How variable
are they? Are they related to other observed features of the individuals? In
a study in which the data for all individuals are taken at the same set of
time points the (ai , bi ) will be of equal precision, but in many applications
the configuration of time points will not be the same and any need to allow
for differing precisions may complicate this essentially simple procedure.
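A minimal sketch of the two-phase procedure (simulated data with a common, equally spaced set of time points, so that the (a_i, b_i) are of equal precision):

    # Sketch: phase 1 fits a straight line (9.12) to each individual's series;
    # phase 2 examines the variation of the estimates (a_i, b_i).
    import numpy as np

    rng = np.random.default_rng(5)
    n_indiv, times, t0 = 40, np.arange(8.0), 0.0
    a_true = 10 + rng.normal(scale=1.0, size=n_indiv)
    b_true = 0.5 + rng.normal(scale=0.2, size=n_indiv)
    Y = (a_true[:, None] + b_true[:, None] * (times - t0)
         + rng.normal(scale=0.5, size=(n_indiv, len(times))))

    # Phase 1: least-squares fit per individual; columns of ab are (a_i, b_i).
    ab = np.array([np.polyfit(times - t0, y, deg=1)[::-1] for y in Y])

    # Phase 2: how variable are the estimates, and how are they related?
    print("mean of (a_i, b_i):", np.round(ab.mean(axis=0), 2))
    print("covariance matrix:\n", np.round(np.cov(ab.T), 3))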
In the third approach we use a multi-level model in which
a_i = \mu_a + \eta_i,  b_i = \mu_b + \zeta_i,    (9.13)
where $\eta_i$ and $\zeta_i$ are random terms of mean zero having an unknown co-
variance matrix that is independent of i. The representation (9.13) can be
augmented, for example by terms representing systematic dependencies on
explanatory variables defined at the individual level. The random variation
is now described by three variances and a covariance; software is available
for estimating the full model (9.12), (9.13).
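As one example of such software, the mixed-model routine in the Python statsmodels package can fit (9.12) and (9.13) directly; the sketch below uses simulated long-format data with hypothetical column names.

    # Sketch: multi-level model with random intercepts and slopes, fitted by
    # statsmodels MixedLM (simulated data; column names are illustrative).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    n_indiv, times = 40, np.arange(8.0)
    a = 10 + rng.normal(scale=1.0, size=n_indiv)    # a_i = mu_a + eta_i
    b = 0.5 + rng.normal(scale=0.2, size=n_indiv)   # b_i = mu_b + zeta_i
    rows = [{"id": i, "time": t, "y": a[i] + b[i] * t + rng.normal(scale=0.5)}
            for i in range(n_indiv) for t in times]
    df = pd.DataFrame(rows)

    model = smf.mixedlm("y ~ time", data=df, groups=df["id"], re_formula="~time")
    result = model.fit()
    print(result.summary())   # fixed effects near (10, 0.5); random-effect covariance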
The effect, recognized for many years (Sterling, 1959; Dawid and Dickey,
1977), was labelled the “file drawer effect” in 1979 (Rosenthal, 1979) and,
although files are now more likely to be computer- rather than drawer-
based, the phenomenon remains. There are other related biases that make
significant results more readily available than their non-significant counter-
parts. The Cochrane Collaboration (no date) listed four additional biases:
statistically significant results are more likely to be published rapidly (time-
lag bias); they are more likely to be published in English (language bias);
they are more likely to be published more than once (multiple publication
bias); and they are more likely to be cited (citation bias). Furthermore, sig-
nificant results are more likely to be published in journals with high impact
factors. These effects have been well documented in the clinical literature
but they have also been examined in areas as diverse as studies of political
behaviour (Gerber et al., 2010), of the potentially violent effects of video
games (Ferguson, 2007) and of climate change (Michaels, 2008).
Efforts to reduce such biases include trial registries, such as the US clin-
icaltrials.gov, which aim to document all clinical trials undertaken regard-
less of the significance or direction of their findings. Funnel plots are often
used to detect publication bias in studies bringing together results from
previously published studies. The aim of these is to recognize asymmetry
in the published results such as might arise from publication bias, as well
as from time-lag bias and, potentially, from English-language bias depend-
ing on the form of the literature search. Regression methods are sometimes
available to adjust for the effects of publication bias.
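A minimal sketch of such a plot (simulated study results with a crude publication filter; in the absence of publication bias the points scatter symmetrically about the overall effect and the scatter narrows as precision increases):

    # Sketch: a funnel plot of estimated effects against precision, with small
    # studies published mainly when 'significant' (all data simulated).
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    n_studies, true_effect = 80, 0.3
    se = rng.uniform(0.05, 0.5, size=n_studies)            # standard errors
    estimate = true_effect + se * rng.normal(size=n_studies)
    published = (np.abs(estimate / se) > 1.96) | (se < 0.15) | (rng.random(n_studies) < 0.3)

    plt.scatter(estimate[published], 1 / se[published])
    plt.axvline(true_effect, linestyle="--")
    plt.xlabel("estimated effect")
    plt.ylabel("precision (1 / standard error)")
    plt.title("Funnel plot: asymmetry suggests publication bias")
    plt.show()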
Notes
Section 9.5. For a general discussion of interaction, see Cox (1984) and
Berrington de Gonzáles and Cox (2007). The analysis of variance is de-
scribed at length in some older books on statistical methods, for example
Snedecor and Cochran (1956).
Section 9.6. For methods for time series analysis see Brockwell et al.
(2009), for longitudinal data see Diggle et al. (2002) and for multi-level
modelling see Snijders and Bosker (2011). There are extensive literatures
on all these topics.
10
Epilogue
machine learning. The characteristics of some, although not all, such ap-
plications are the following.
• Large or very large amounts of data are involved.
• The primary object is typically empirical prediction, usually assessed by
setting aside a portion of the data for independent checks of the success
of empirical formulae, often judged by mean square prediction error.
• Often there is no explicit research question, at least initially.
• There is little explicit interest in interpretation or in data-generating pro-
cesses, as contrasted with prediction.
• Any statistical assessment of errors of estimation usually involves strong
independence assumptions and is likely to seriously underestimate real
errors.
• In data mining there is relatively little explicit use of probability models.
• Methods of analysis are specified by algorithms, often involving nu-
merical optimization of plausible criteria, rather than by their statistical
properties.
• There is little explicit discussion of data quality.
A broader issue, especially in machine learning, is the desire for wholly,
or largely, automatic procedures. An instance is a preference for neural
nets over the careful semi-exploratory use of logistic regression. This pref-
erence for automatic procedures is in partial contrast with the broad ap-
proach of the present book. Here, once a model has been specified, the
analysis is largely automatic and in a sense objective, as indeed is desir-
able. Sometimes the whole procedure is then relatively straightforward.
We have chosen, however, to emphasize the need for care in formulation
and interpretation.
Breiman (2001) argued eloquently that algorithm-based methods tend
to be more flexible and are to be preferred to the mechanical use of standard
techniques such as least squares or logistic regression.
10.4 Conclusion
In the light of the title of the book, the reader may reasonably ask: What
then really are the principles of applied statistics? Or, more sceptically,
and equally reasonably: in the light of the great variety of current and po-
tential applications of statistical ideas, can there possibly be any universal
principles?
It is clear that any such principles can be no more than broad guide-
lines on how to approach issues with a statistical content; they cannot be
prescriptive about how such issues are to be resolved in detail.
The overriding general principle, difficult to achieve, is that there should
be a seamless flow between statistical and subject-matter considerations.
This flow should extend to all phases of what we have called the ideal se-
quence: formulation, design, measurement, the phases of analysis, the pre-
sentation of conclusions and finally their interpretation, bearing in mind
that these phases perhaps only rarely arise in such a sharply defined
sequence. To put this in slightly more personal terms, in principle seam-
lessness requires an individual statistician to have views on subject-matter
interpretation and subject-matter specialists to be interested in issues of
statistical analysis.
No doubt often a rather idealized state of affairs. But surely something
to aim for!
References
Ahern, M. J., Hall, N. D., Case, K., and Maddison, P. J. (1984). D-penicillamine with-
drawal in rheumatoid arthritis. Ann. Rheum. Dis. 43, 213–217. (Cited on p. 85.)
Amemiya, T. (1985). Advanced Econometrics. Harvard University Press. (Cited on
p. 52.)
American Statistical Association Committee on Professional Ethics (1999). Ethical
guidelines for statistical practice. (Cited on p. 75.)
Baddeley, A., and Jensen, E. B. V. (2005). Stereology for Statisticians. Chapman and
Hall. (Cited on pp. 51 and 91.)
Bailey, R. A. (2008). Design of Comparative Experiments. Cambridge University Press.
(Cited on p. 51.)
Ben-Akiva, M., and Lerman, S. R. (1985). Discrete Choice Analysis: Theory and Appli-
cation to Travel Demand. MIT Press. (Cited on p. 49.)
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical
and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300. (Cited
on p. 158.)
Beral, V., Bull, D., Reeves, G., and Million Women Study Collaborators (2005). En-
dometrial cancer and hormone-replacement therapy in the Million Women Study.
Lancet 365, 1543–1551. (Cited on p. 172.)
Berrington de Gonzáles, A., and Cox, D. R. (2007). Interpretation of interaction: a
review. Ann. Appl. Statist. 1, 371–385. (Cited on p. 183.)
Beveridge, W. I. B. (1950). The Art of Scientific Investigation. First edition. Heinemann,
(Third edition published in 1957 by Heinemann and reprinted in 1979.) (Cited on
p. 13.)
Bird, S. M., Cox, D., Farewell, V. T., Goldstein, H., Holt, T., and Smith, P. C. (2005).
Performance indicators: good, bad, and ugly. J. R. Statist. Soc. A 168, 1–27. (Cited
on p. 55.)
Bollen, K. A. (1989). Structural Equations with Latent Variables. Wiley-Interscience.
(Cited on p. 74.)
Booth, K. H. V., and Cox, D. R. (1962). Some systematic supersaturated designs. Tech-
nometrics 4, 489–495. (Cited on p. 138.)
Bourne, J., Donnelly, C., Cox, D., et al. (2007). Bovine TB: the scientific evidence.
A science base for a sustainable policy to control TB in cattle. Final report of the
independent scientific group on cattle TB. Defra. (Cited on p. 17.)
Box, G. E. P. (1976). Science and statistics. J. Am. Statist. Assoc. 71, 791–799. (Cited
on p. 13.)
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters:
Design, Innovation, and Discovery. John Wiley & Sons. (Second edition published
in 2005 by Wiley-Interscience.) (Cited on p. 51.)
Breiman, L. (2001). Statistical modeling: the two cultures. Statist. Sci. 16, 199–231.
(Cited on p. 185.)
Breslow, N. E., and Day, N. E. (1980). Statistical Methods in Cancer Research. Vol. I:
The Analysis of Case-Control Studies. International Agency for Research on Cancer.
(Cited on p. 52.)
Breslow, N. E., and Day, N. E. (1987). Statistical Methods in Cancer Research. Vol. II:
The Design and Analysis of Cohort Studies. International Agency for Research on
Cancer. (Cited on p. 52.)
Brockwell, P., Fienberg, S. E., and Davis, R. A. (2009). Time Series: Theory and Meth-
ods. Springer. (Cited on p. 183.)
Brody, H., Rip, M. R., Vinten-Johansen, P., Paneth, N., and Rachman, S. (2000). Map-
making and myth-making in Broad Street: the London cholera epidemic, 1854.
Lancet 356, 64–68. (Cited on p. 82.)
Bunting, C., Chan, T. W., Goldthorpe, J., Keaney, E., and Oskala, A. (2008). From
Indifference to Enthusiasm: Patterns of Arts Attendance in England. Arts Council
England. (Cited on p. 68.)
Büttner, T., and Rässler, S. (2008). Multiple imputation of right-censored wages in the
German IAB Employment Sample considering heteroscedasticity. IAB discussion
paper 2000 844. (Cited on p. 62.)
Carpenter, L. M., Maconochie, N. E. S., Roman, E., and Cox, D. R. (1997). Examin-
ing associations between occupation and health using routinely collected data. J. R.
Statist. Soc. A 160, 507–521. (Cited on p. 4.)
Carpenter, L. M., Linsell, L., Brooks, C., et al. (2009). Cancer morbidity in British
military veterans included in chemical warfare agent experiments at Porton Down:
cohort study. Br. Med. J. 338, 754–757. (Cited on p. 48.)
Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2006). Measure-
ment Error in Nonlinear Models: A Modern Perspective. Second edition. Chapman
& Hall/CRC. (Cited on p. 74.)
Chalmers, A. F. (1999). What is This Thing Called Science? Third edition. Open Uni-
versity Press. (Cited on p. 13.)
Champion, D. J., and Sear, A. M. (1969). Questionnaire response rate: a methodological
analysis. Soc. Forces 47, 335–339. (Cited on p. 54.)
Chatfield, C. (1998). Problem-solving. Second edition. Chapman and Hall. (Cited on
p. 13.)
Choy, S. L., O’Leary, R., and Mengersen, K. (2009). Elicitation by design in ecology:
using expert opinion to inform priors for Bayesian statistical models. Ecology 90,
265–277. (Cited on p. 143.)
Cochran, W. G. (1965). The planning of observational studies of human populations
(with discussion). J. R. Statist. Soc. A 128, 234–266. (Cited on p. 165.)
Cochrane Collaboration (no date). What is publication bias? http://www.cochrane-net.org/openlearning/HTML/mod15-2.htm.
Coleman, J. S., and James, J. (1961). The equilibrium size distribution of freely-forming
groups. Sociometry 24, 36–45. (Cited on p. 102.)
Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., and
Wynder, E. L. (2009). Smoking and lung cancer: recent evidence and a discussion
of some questions (reprinted from (1959). J. Nat. Cancer Inst. 22, 173–203). Int. J.
Epidemiol. 38, 1175–1191. (Cited on p. 167.)
Cox, D. R. (1952). Some recent work on systematic experimental designs. J. R. Statist.
Soc. B 14, 211–219. (Cited on p. 51.)
Cox, D. R. (1958). Planning of Experiments. John Wiley & Sons. (Cited on p. 51.)
Cox, D. R. (1969). Some sampling problems in technology in New Developments in
Survey Sampling, pages 506–527. John Wiley and Sons. (Cited on p. 51.)
Cox, D. R. (1984). Interaction. International Statist. Rev. 52, 1–31. (Cited on p. 183.)
Cox, D. R. (1990). Role of models in statistical analysis. Statist. Sci. 5, 169–174. (Cited
on p. 117.)
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press.
(Cited on p. 158.)
Cox, D. R., and Brandwood, L. (1959). On a discriminatory problem connected with
the works of Plato. J. R. Statist. Soc. B 21, 195–200. (Cited on p. 105.)
Cox, D. R., and Oakes, D. (1984). Analysis of Survival Data. Chapman and Hall. (Cited
on p. 139.)
Cox, D. R., and Reid, N. (2000). Theory of the Design of Experiments. Chapman and
Hall. (Cited on pp. 51 and 139.)
Cox, D. R., and Snell, E. J. (1979). On sampling and the estimation of rare errors.
Biometrika 66, 125–132. (Cited on p. 51.)
Cox, D. R., and Wermuth, N. (1996). Multivariate Dependencies. Chapman and Hall.
(Cited on pp. 47, 58, and 103.)
Cox, D. R., and Wong, M. Y. (2004). A simple procedure for the selection of significant
effects. J. R. Statist. Soc. B 66, 395–400. (Cited on p. 158.)
Cox, D. R., Fitzpatrick, R., Fletcher, A., Gore, S. M., Spiegelhalter, D. J., and Jones,
D. R. (1992). Quality-of-life assessment: can we keep it simple? (with discussion).
J. R. Statist. Soc. A 155, 353–393. (Cited on p. 74.)
Darby, S., Hill, D., Auvinen, A., et al. (2005). Radon in homes and risk of lung cancer:
collaborative analysis of individual data from 13 European case-control studies. Br.
Med. J. 330, 223–227. (Cited on p. 50.)
Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application.
Cambridge University Press. (Cited on p. 158.)
Dawid, A. P., and Dickey, J. M. (1977). Likelihood and Bayesian inference from selec-
tively reported data. J. Am. Statist. Assoc. 72, 845–850. (Cited on p. 181.)
de Almeida, M. V., de Paula, H. M. G., and Tavora, R. S. (2006). Observer effects on the
behavior of non-habituated wild living marmosets (Callithrix jacchus). Rev. Etologia
8(2), 81–87. (Cited on p. 54.)
De Silva, M., and Hazleman, B. L. (1981). Long-term azathioprine in rheumatoid arthri-
tis: a double-blind study. Ann. Rheum. Dis. 40, 560–563. (Cited on p. 85.)
Diggle, P., Liang, K.-Y., and Zeger, S. L. (2002). Analysis of Longitudinal Data. Second
edition. Oxford University Press. (Cited on p. 183.)
Donnelly, C. A., Woodroffe, R., Cox, D. R., et al. (2003). Impact of localized badger
culling on tuberculosis incidence in British cattle. Nature 426, 834–837. (Cited on
p. 17.)
Donnelly, C. A., Woodroffe, R., Cox, D. R., et al. (2006). Positive and negative effects
of widespread badger culling on tuberculosis in cattle. Nature 439, 843–846. (Cited
on p. 17.)
Edge, M. E., and Sampaio, P. R. F. (2009). A survey of signature based methods for
financial fraud detection. Comput. Secur. 28, 381–394. (Cited on p. 79.)
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1–
26. (Cited on pp. 149 and 158.)
Efron, B. (2010). Large-Scale Inference. IMS Monograph. Cambridge University Press.
(Cited on p. 158.)
Erikson, R., and Goldthorpe, J. H. (1992). The Constant Flux. Oxford University Press.
(Cited on p. 176.)
Feller, W. (1968). An Introduction to Probability Theory and its Applications. Vol. 1.
Third edition. Wiley. (Cited on p. 158.)
Ferguson, C. J. (2007). The good, the bad and the ugly: a meta-analytic review of
positive and negative effects of violent video games. Psychiat. Quart. 78, 309–316.
(Cited on p. 181.)
Firth, D., and de Menezes, R. X. (2004). Quasi-variances. Biometrika 91, 65–80. (Cited
on p. 139.)
Fisher, R. A. (1926). The arrangement of field experiments. J. Min. Agric. G. Br. 33,
503–513. (Cited on pp. 28 and 51.)
Fisher, R. A. (1935). Design of Experiments. Oliver and Boyd. (Cited on pp. 28 and 51.)
Fleming, T. R., and DeMets, D. L. (1996). Surrogate end points in clinical trials: are we
being misled? Ann. Intern. Med. 125, 605–613. (Cited on p. 60.)
Forrester, M. L., Pettitt, A. N., and Gibson, G. J. (2007). Bayesian inference of hospital-
acquired infectious diseases and control measures given imperfect surveillance data.
Biostatistics 8, 383–401. (Cited on pp. 100 and 101.)
Galesic, M., and Bosnjak, M. (2009). Effects of questionnaire length on participation
and indicators of response quality in a web survey. Public Opin. Quart., 73, 349–360.
(Cited on p. 54.)
Gerber, A. S., Malhotra, N., Dowling, C. M., and Doherty, D. (2010). Publication bias
in two political behavior literatures. Am. Polit. Research 38, 591–613. (Cited on
p. 181.)
Gile, K., and Handcock, M. S. (2010). Respondent-driven sampling: an assessment of
current methodology. Sociol. Methodol. 40, 285–327. (Cited on p. 51.)
Gøtzsche, P. C., Hansen, M., Stoltenberg, M., et al. (1996). Randomized, placebo con-
trolled trial of withdrawal of slow-acting antirheumatic drugs and of observer bias
in rheumatoid arthritis. Scand. J. Rheumatol. 25, 194–199. (Cited on p. 85.)
Grady, D., Rubin, S. H., Petitti, D. B., et al. (1992). Hormone therapy to prevent dis-
ease and prolong life in postmenopausal women. Ann. Intern. Med. 117, 1016–1037.
(Cited on p. 7.)
Greenland, S., and Robins, J. M. (1988). Conceptual problems in the definition and
interpretation of attributable fractions. Am. J. Epidemiol. 128, 1185–1197. (Cited on
p. 120.)
Gustavsson, P., Nyberg, F., Pershagen, G., Schéele, P., Jakobsson, R., and Plato, N.
(2002). Low-dose exposure to asbestos and lung cancer: dose-response relations and
interaction with smoking in a population-based case-referent study in Stockholm,
Sweden. Am. J. Epidemiol. 155, 1016–1022. (Cited on p. 172.)
Guy, W. A. (1879). On tabular analysis. J. Statist. Soc. Lond. 42, 644–662. (Cited on
p. 86.)
Hand, D. J. (2004). Measurement: Theory and Practice. Arnold. (Cited on p. 74.)
Hand, D. J., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. MIT Press.
(Cited on p. 13.)
Herman, C. P., Polivy, J., and Silver, R. (1979). Effects of an observer on eating behav-
ior: The induction of “sensible” eating. J. Pers. 47(1), 85–99. (Cited on p. 54.)
Hill, A. B. (1965). The environment and disease: association or causation? Proc. R. Soc.
Med. 58, 295–300. (Cited on p. 165.)
Hogan, J. W., and Blazar, A. S. (2000). Hierarchical logistic regression models for clus-
tered binary outcomes in studies of IVF-ET. Fertil. Steril. 73(03), 575–581. (Cited
on p. 117.)
Ingold, C. T., and Hadland, S. A. (1959). The ballistics of sordaria. New Phytol. 58,
46–57. (Cited on pp. 98 and 100.)
Jackson, M. (2008). Content analysis, in Research Methods for Health and Social Care,
pages 78–91. Palgrave Macmillan. (Cited on p. 74.)
Jeffreys, H. (1939). The Theory of Probability. Third edition published in 1998. Oxford
University Press. (Cited on p. 144.)
Johnson, B. D. (2006). The multilevel context of criminal sentencing: integrating judge-
and county-level influences. Criminology 44, 259–298. (Cited on p. 116.)
Jöreskog, K. G., and Goldberger, A. S. (1975). Estimation of a model with multiple
indicators and multiple causes of a single latent variable. J. Am. Statist. Assoc. 70,
631–639. (Cited on p. 74.)
Kalbfleisch, J. D., and Prentice, R. L. (2002). Statistical Analysis of Failure Time Data.
Second edition. Wiley. (Cited on p. 139.)
Knapper, C. M., Roderick, J., Smith, J., Temple, M., and Birley, H. D. L. (2008). Inves-
tigation of an HIV transmission cluster centred in South Wales. Sex. Transm. Infect.
84, 377–380. (Cited on p. 82.)
Knudson, A. G. (2001). Two genetic hits (more or less) to cancer. Nat. Rev. Cancer 1,
157–162. (Cited on p. 99.)
Koga, S., Maeda, T., and Kaneyasu, N. (2008). Source contributions to black carbon
mass fractions in aerosol particles over the northwestern Pacific. Atmos. Environ.
42, 800–814. (Cited on p. 80.)
Koukounari, A., Fenwick, A., Whawell, S., et al. (2006). Morbidity indicators of Schis-
tosoma mansoni: relationship between infection and anemia in Ugandan schoolchil-
dren before and after praziquantel and albendazole chemotherapy. Am. J. Trop. Med.
Hyg. 75, 278–286. (Cited on p. 115.)
Kremer, J. M., Rynes, R. I., and Bartholomew, L. E. (1987). Severe flare of rheumatoid
arthritis after discontinuation of long-term methotrexate therapy. Double-blind study.
Am. J. Med. 82, 781–786. (Cited on p. 85.)
Krewski, D., Burnett, R. T., Goldberg, M. S., et al. (2000). Reanalysis of the Harvard
Six Cities Study and the American Cancer Society Study of particulate air pollution
and mortality. Special report, Health Effects Institute. (Cited on p. 76.)
Krewski, D., Burnett, R. T., Goldberg, M. S., et al. (2003). Overview of the Reanalysis
of the Harvard Six Cities Study and the American Cancer Society Study of Particu-
late Air Pollution and Mortality. J. Toxicol. Environ. Health A 66, 1507–1551. (Cited
on p. 76.)
R Development Core Team (2007). R: a language and environment for statistical com-
puting. The R Foundation for Statistical Computing. (Cited on p. 110.)
Raghunathan, T. E., and Grizzle, J. E. (1995). A split questionnaire survey design. J.
Am. Statist. Assoc. 90, 54–63. (Cited on p. 80.)
Rathbun, S. L. (2006). Spatial prediction with left-censored observations. J. Agr. Biol.
Environ. Stat. 11, 317–336. (Cited on p. 62.)
Reeves, G. K., Cox, D. R., Darby, S. C., and Whitley, E. (1998). Some aspects of mea-
surement error in explanatory variables for continuous and binary regression models.
Stat. Med. 17, 2157–2177. (Cited on pp. 72 and 74.)
Riley, S., Fraser, C., Donnelly, C. A., et al. (2003). Transmission dynamics of the etio-
logical agent of SARS in Hong Kong: impact of public health interventions. Science
300, 1961–1966. (Cited on p. 92.)
Ripley, B. D. (1987). Stochastic Simulation. Wiley. (Cited on p. 158.)
Robert, C. P., and Casella, G. (2004). Monte Carlo Statistical Methods. Springer. (Cited
on p. 158.)
Roethlisberger, F. J., and Dickson, W. J. (1939). Management and the Worker. Harvard
University Press. (Cited on p. 55.)
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychol.
Bull. 86, 638–641. (Cited on p. 181.)
Ross, G. (1990). Nonlinear Estimation. Springer. (Cited on p. 139.)
Salcuni, S., Di Riso, D., Mazzesch, C., and Lis, A. (2009). Children’s fears: a survey
of Italian children ages 6 to 10 years. Psychol. Rep. 104(06), 971–988. (Cited on
p. 63.)
Samph, T. (1976). Observer effects on teacher verbal classroom behaviour. J. Educ.
Psychol. 68(12), 736–741. (Cited on p. 54.)
Scheffé, H. (1959). The Analysis of Variance. John Wiley & Sons. (Reprinted by Wiley-
Interscience in 1999). (Cited on pp. 52 and 158.)
Schweder, T., and Spjøtvoll, E. (1982). Plots of p-values to evaluate many tests simul-
taneously. Biometrika 693, 493–502. (Cited on p. 158.)
Shahian, D. M., Normand, S. L., Torchiana, D. F., et al. (2001). Cardiac surgery report
cards: comprehensive review and statistical critique. Ann. Thorac. Surg. 72, 2155–
2168. (Cited on p. 55.)
Shapter, T. (1849). The History of the Cholera in Exeter in 1832. John Churchill. (Cited
on p. 82.)
Shiboski, S., Rosenblum, M., and Jewell, N. P. (2010). The impact of secondary con-
dom interventions on the interpretation of results from HIV prevention trials. Statist.
Commun. Infect. Dis. 2, 2. (Cited on p. 40.)
Snedecor, G. W., and Cochran, W. G. (1956). Statistical Methods Applied to Experi-
ments in Agriculture and Biology. Fifth edition. Iowa State University Press. (Cited
on pp. 52 and 183.)
Snijders, T. A. B., and Bosker, R. J. (2011). Multilevel Analysis. Second edition. Sage.
(Cited on p. 183.)
Snow, J. (1855). On the Mode of Communication of Cholera. John Churchill. (Cited on
pp. 82 and 83.)
Solow, R. M. (1970). Growth Theory: An Exposition. Paperback edn. Oxford University
Press. (Cited on p. 109.)
Stamler, J. (1997). The INTERSALT study: background, methods, findings, and impli-
cations. Am. J. Clin. Nutr. 65, 626S–642S. (Cited on p. 72.)
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences
drawn from tests of significance – or vice versa. J. Am. Statist. Assoc. 54, 30–34.
(Cited on p. 181.)
Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Statist. Soc. B 64,
479–498. (Cited on p. 158.)
Stubbendick, A. L., and Ibrahim, J. G. (2003). Maximum likelihood methods for non-
ignorable missing responses and covariates in random effects models. Biometrics
59(12), 1140–1150. (Cited on p. 81.)
Sullivan, M. J., Adams, H., and Sullivan, M. E. (2004). Communicative dimensions of
pain catastrophizing: social cueing effects on pain behaviour and coping. Pain 107,
220–226. (Cited on p. 54.)
ten Wolde, S., Breedveld, F. C., Hermans, J., et al. (1996). Randomised placebo-
controlled study of stopping second-line drugs in rheumatoid arthritis. Lancet 347,
347–352. (Cited on p. 85.)
Thompson, M. E. (1997). Theory of Sample Surveys. Chapman and Hall. (Cited on
p. 51.)
Thompson, S. K. (2002). Sampling. Second edition. John Wiley & Sons (First edition
published in 1992). (Cited on p. 51.)
Tippett, L. H. C. (1927). Random Sampling Numbers. Cambridge University Press.
(Cited on p. 158.)
Todd, T. J. (1831). The Book of Analysis, or, a New Method of Experience: Whereby the
Induction of the Novum Organon is Made Easy of Application to Medicine, Phys-
iology, Meteorology, and Natural History: To Statistics, Political Economy, Meta-
physics, and the More Complex Departments of Knowledge. Murray. (Cited on
p. 86.)
Toscas, P. J. (2010). Spatial modelling of left censored water quality data. Environ-
metrics 21, 632–644. (Cited on p. 62.)
Van der Leeden, H., Dijkmans, B. A., Hermans, J., and Cats, A. (1986). A double-blind
study on the effect of discontinuation of gold therapy in patients with rheumatoid
arthritis. Clin. Rheumatol. 5, 56–61. (Cited on p. 85.)
Vandenbroucke, J. P., and Pardoel, V. P. A. M. (1989). An autopsy of epidemiological
methods: the case of ‘poppers’ in the early epidemic of the acquired immunodefi-
ciency syndrome (AIDS). Am. J. Epidemiol. 129, 455–457. (Cited on p. 71.)
Vogel, R., Crick, R. P., Newson, R. B., Shipley, M., Blackmore, H., and Bulpitt, C. J.
(1990). Association between intraocular pressure and loss of visual field in chronic
simple glaucoma. Br. J. Ophthalmol. 74, 3–6. (Cited on p. 61.)
Walshe, W. H. (1841a). Tabular analysis of the symptoms observed by M. Louis, in 134
cases of the continued fever of Paris (Affection Typhoı̈de). Provincial Med. Surg. J.,
2(5), 87–88. (Cited on p. 86.)
Walshe, W. H. (1841b). Tabular analysis of the symptoms observed by M. Louis, in 134
cases of the continued fever of Paris (Affection Typhoı̈de). Provincial Med. Surg. J.,
2(5), 107–108. (Cited on p. 86.)
Walshe, W. H. (1841c). Tabular analysis of the symptoms observed by M. Louis, in 134
cases of the continued fever of Paris (Affection Typhoı̈de). Provincial Med. Surg. J.,
2(5), 131–133. (Cited on p. 86.)
Webb, E., and Houlston, R. (2009). Association studies, in Statistics and Informatics in
Molecular Cancer Research, pp. 1–24. Oxford University Press. (Cited on p. 154.)
Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of
14 000 cases of seven common diseases and 3000 shared controls. Nature 447, 661–
678. (Cited on p. 152.)
Wilansky-Traynor, P., and Lobel, T. E. (2008). Differential effects of an adult observer’s
presence on sex-typed play behavior: a comparison between gender-schematic and
gender-aschematic preschool children. Arch. Sex. Behav. 37, 548–557. (Cited on
p. 54.)
Wilson, E. B. (1952). An Introduction to Scientific Research. McGraw-Hill (Reprinted
in 1990 by Dover.) (Cited on p. 13.)
Woodroffe, R., Donnelly, C. A., Cox, D. R., et al. (2006). Effects of culling on badger
Meles meles spatial organization: implications for the control of bovine tuberculosis.
J. Appl. Ecol. 43, 1–10. (Cited on p. 17.)
Writing group for Women’s Health Initiative Investigators (2002). Risks and benefits of
estrogen plus progestin in healthy postmenopausal women. J. Am. Med. Assoc. 288,
321–333. (Cited on p. 7.)
Wu, S. X., and Banzhaf, W. (2010). The use of computational intelligence in intrusion
detection systems: a review. Appl. Soft Comput. 10, 1–35. (Cited on p. 79.)
Yates, F. (1952). Principles governing the amount of experimentation required in
development work. Nature 170, 138–140. (Cited on p. 28.)
Index
agriculture, 19, 26, 29, 33, 60, 77, 107, 119, 125, 168; see also bovine
   tuberculosis, foot-and-mouth disease, field trial
AIDS, 3, 61, 71, 166
algorithmic methods, 10, 185
analysis
   change of, 17
   of variance, 42, 177
   plan of, 16
   preliminary, 75–89, 178
   principal components, 63
   subgroup, 169
   tabular, 86–87
   unit of, 5, 18–20, 114
anecdotal evidence, 36
antithetic variable, 113
Armitage–Doll model, 99
atomic null hypothesis, 145
attenuation, 70
autocorrelation, 178
autoregressive process, 132
baseline variable, 161
Bayes, Thomas, 143
Bayes factor, 131, 150
behaviour
   animal, 54
   anomalous, 79, 126
   electoral, 35
   human, 54
   individual, 167
   of a system, 55
   political, 181
   wildlife, 2, 17
Berkson, J., 167
Berkson error, 72
bias, 21
   ecological, 19
   publication, 180
blinding, 21, 40
blood pressure, 1, 6, 38, 70, 72, 94, 136
body mass index (BMI), 47, 62, 94, 172
Bonferroni correction, 154
bootstrap, 149
bovine tuberculosis, 2, 17, 39, 77, 108, 114, 135
cancer, 4, 48, 50, 73, 96, 99, 154, 160, 167, 172
capture–recapture sampling, 36
case–control study, 49, 73, 135
causality, 122, 160–167
censoring, 61–62
choice-based sampling, 49
climatology, 2
clinical trial, 18, 25, 106, 125, 154, 168, 181
cluster
   detection, 81
   randomization, 38, 114
Cochran, W. G., 51
component of variance, 25, 126
concealment, 7, 21, 40
confidence
   distribution, 141
   limits, 141
Consumer Prices Index, 64
contingency tables, 4, 175
contrasts, 41, 93, 124, 125, 140, 173
cost, 8, 11, 25, 28, 33, 53
cross-classification, 43, 123
precision, 3, 15, 23, 32, 53, 81, 112, 119, 126, 129, 151, 173
prediction, 11
preliminary analysis, 75–89, 178
pressure
   of reactants, 136
   of time, 75
   to produce one-dimensional summaries, 9
   to produce particular results, 181
   to stick to plan of analysis, 186
principal components analysis, 63
prior distribution, 143, 150, 157
probability, 104
   model, 10, 92–104
proportional adjustment, 33
publication bias, 180
quality control, 4
quality of life, 8, 65, 81
quasi-realistic model, 92
questionnaire, 54, 57, 65, 80, 135
random
   effects, 125
   error, 24
   variable, 94
randomization, 18
   cluster, 38, 114
randomized block design, 34, 40, 126
range of validity, 15, 27, 108
recall bias, 46, 49, 162
recovery of inter-block information, 128
regression, 33, 63, 65, 69, 95, 110, 121, 136, 138, 150, 155, 163
   chain of, 59, 102, 170
   dilution, 70
   logistic, 49, 80, 87, 116, 125, 135, 185
   polynomial, 131
regulatory agency, 16
   inhibitory role of, 5
relative risk, 58, 85, 172
reliability, measurements of, 8
replication, 15
research question, formulation of, 2
residual, 41
respondent-driven sampling, 36
response
   curve, 20, 56
   surface, 46, 136, 173
Retail Prices Index, 64
Reynolds number, 121
robust methods, 79
routine testing, 4
sampling
   capture–recapture, 36
   choice-based, 49
   frame, 30
   inspection, 9
   length-biased, 32
   monetary-unit, 30
   multi-stage, 34
   respondent-driven, 36
   snowball, 36
   systematic, 30
   temporal, 34
scale of effort, 15
semiparametric model, 95
sensitivity, 70
sequence, ideal, 2, 188
serial correlation, 178
Severe Acute Respiratory Syndrome (SARS), 92
significance test, 25, 145–156
simulation, 92, 104, 113, 131, 146, 148
smoking, 120, 137, 160, 167, 172
smoothing, 10, 85, 96
snowball sampling, 36
social mobility, 176
Sordaria, 98, 105
sparsity, 138
spatial outlier, 82
specificity, 70, 168
split-plot design, 19
standing, equal, 170
stochastic model, 92, 100
study
   case-control, 49, 73, 135
   cross-sectional, 46, 164
   observational, 7, 20
   pilot, 25, 76
stylized facts, 109
subgroup analysis, 169
sum of squares, 42, 150
surrogate
   endpoint, 60
   variable, 2
survival analysis, 61, 96
synthesis of information, 2
systematic
   error, 15, 21, 70
   sampling, 30
   variation, 109