Clinical Epidemiology
THE ESSENTIALS
All rights reserved. This book is protected by copyright. No part of this book may be reproduced in
any form or by any means, including photocopying, or utilized by any information storage and retrieval
system without written permission from the copyright owner.
Accurate indications, adverse reactions, and dosage schedules for drugs are provided in this book,
but it is possible that they may change. The reader is urged to review the package information data
of the manufacturers of the medications mentioned.
Fletcher, Robert H.
Clinical epidemiology: the essentials / Robert H. Fletcher,
Suzanne W. Fletcher, Edward H. Wagner. -- 3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0683-03?69-0
1. Clinical epidemiology. I. Wagner, Edward H. (Edward Harris),
1940- . II. Title.
[DNI M- t Epidemiologic Methods, WA %0 F6t~c 1996]
RAC52.2.C5c.Fb1 1996
614.4--dc20
DNLM/DLC
for Library of Congress 8b-8382
CIP
The publishers have made every effort to trace the copyright holders for borrowed material. If they
have inadvertently overlooked any, they will be pleased to make the necessary arrangements at the
first opportunity.
PREFACE
Since the Second Edition was written in 1988, the pace of change in
medicine has accelerated. Changes have brought greater recognition of the
perspectives and methods of clinical epidemiology.
Countries throughout the world have, in their efforts to provide high-
quality health care, experienced growing difficulties controlling the cost
of care. The tension between demands for care and resources to provide
it has increased the need for better information about clinical effectiveness
in setting priorities. It has become clearer that not all clinical care is effective
and that the outcomes of care are the best way of judging effectiveness.
Variations in care among clinicians and regions, not explained by patients'
needs and not accompanied by similar differences in outcomes, have raised
questions about which practices are best. All these forces in modern society
have increased the value of good clinical research and of those who can
perform and interpret this research properly.
Phenomenal advances in understanding the biology of disease, espe-
cially at the molecular level, have also occurred. Discoveries in the labora-
tory increase the need for good patient-based research. They must be tested
in patients before they can be accepted as clinically useful. Thus the two,
laboratory science and clinical epidemiology, complement each other and
are not alternatives or competitors.
Other aspects of medicine are timeless. Patients and physicians still face
the same kinds of questions about diagnosis, prognosis, and treatment and
still value the same outcomes: to relieve suffering, restore function, and
prevent untimely death. We rely on the same basic strategies (cohort and
case-control studies, randomized trials, and the like) to answer the ques-
tions. The inherent uncertainty of all clinical information, even that based
on the best studies, persists.
In preparing the third edition of this text, we have tried to take into
account the sweeping changes in medicine as well as what has not changed.
We have left the basic structure of the book intact. We updated examples
throughout in recognition that some diseases, such as AIDS, are new and
others, such as peptic ulcer disease, are better understood.
We have tried to remember that the book's niche is as an introduction
Preface
Acknowledgments
Chapter 1 INTRODUCTION
Chapter 2 ABNORMALITY
Chapter 3 DIAGNOSIS
Chapter 4 FREQUENCY
Chapter 5 RISK
Chapter 6 PROGNOSIS
Chapter 7 TREATMENT
Chapter 8 PREVENTION
Chapter 9 CHANCE
Chapter 10 STUDYING CASES
Chapter 11 CAUSE
Chapter 12 SUMMING UP
Index
1
INTRODUCTION
A 51-year-old man asks to see you because of chest pain. He was well
until 2 weeks ago, when he noticed tightness in the center of his chest
while walking uphill. The tightness stopped after 2 to 3 min of rest. A
similar discomfort occurred several times since then, sometimes during
exercise and sometimes at rest. He smokes one pack of cigarettes per day
and has been told that his blood pressure is "a little high." He is otherwise
well and takes no medications. However, he is worried about his health,
particularly about heart disease. A complete physical examination
and resting electrocardiogram are normal except for a blood pressure of
150/96.
This patient is likely to have many questions. Am I sick? How sure are
you? If I am sick, what is causing my illness? How will it affect me? What
can be done about it? How much will it cost?
As the clinician caring for this patient, you must respond to these ques-
tions and use them to guide your course of action. Is the probability of
serious, treatable disease high enough to proceed immediately beyond
simple explanation and reassurance to diagnostic tests? How well do various
tests distinguish among the possible causes of chest pain: angina pectoris,
esophageal spasm, muscle strain, anxiety, and the like? For example,
how helpful will an exercise electrocardiogram be in either confirming or
ruling out coronary artery disease? If coronary disease is found, how long
can the patient expect to have the pain? Will the condition shorten his
life? How likely is it that other complications (congestive heart failure,
myocardial infarction, or atherosclerotic disease of other organs) will
occur? Will reduction of his risk factors for coronary disease (cigarette
smoking and hypertension) reduce his risk? If medications control the pain,
should the patient have coronary artery bypass surgery anyway?
Clinicians use various sources of information to answer these questions:
their own experiences, the advice of their colleagues, and the medical
literature. In general, they depend on past observations on other similar
patients to predict what will happen to the patient at hand. The manner
Clinical Epidemiology
Clinical epidemiology is the science of making predictions about indi-
vidual patients by counting clinical events in similar patients, using strong
scientific methods for studies of groups of patients to ensure that the pre-
dictions are accurate. The purpose of clinical epidemiology is to develop
and apply methods of clinical observation that will lead to valid conclu-
sions by avoiding being misled by systematic error and chance. It is one
important approach to obtaining the kind of information clinicians need
to make good decisions in the care of patients.
CLINICAL MEDICINE AND EPIDEMIOLOGY
The term clinical epidemiology is derived from its two parent disciplines:
clinical medicine and epidemiology. It is "clinical" because it seeks to
answer clinical questions and to guide clinical decision making with the
best available evidence. It is "epidemiologic" because many of the methods
used to answer these questions have been developed by epidemiologists
and because the care of individual patients is seen in the context of the
larger population of which the patient is a member.
Clinical medicine and epidemiology began together (1). The founders
of epidemiology were, for the most part, clinicians. It is only during this
century that the two disciplines drifted apart, with separate schools, train-
ing, journals, and opportunities for employment. More recently, clinicians
and epidemiologists have become increasingly aware that their fields inter-
relate and that each is limited without the other (2).
TRADITIONAL CLINICAL PERSPECTIVE
Clinicians have a special set of experiences and needs that has condi-
tioned how they go about answering clinical questions. They are, by and
large, concerned with individual patients. They know all of their patients
personally; take their own histories; do their own physical examinations;
and they accept an intense, personal responsibility for each patient's wel-
fare. As a result, they tend to see what is distinctive about each one and
are reluctant to lump patients into crude categories of risk, diagnosis, or
treatment and to express patients' membership in these categories as a
probability.
Because their work involves the care of a succession of individual pa-
tients and is demanding in its own right, clinicians tend to be less interested
in patients who have not come to their attention because they are in some
other medical setting or are not under medical care at all, even though
these patients may be just as sick as the patients they see.
Basic Principles
The basic purpose of clinical epidemiology is to foster methods of clini-
cal observation and interpretation that lead to valid conclusions. The most
credible answers to clinical questions are based on the following principles.
CLINICAL QUESTIONS
Types of questions addressed by clinical epidemiology are listed in
Table 1.1. These are the same questions confronting the doctor and patient
in the example presented at the beginning of this chapter. They are at issue
in most doctor-patient encounters.
HEALTH OUTCOMES
The clinical events of primary interest in clinical epidemiology are the
health outcomes of particular concern to patients and those caring for them
(Table 1.2). They are the events doctors try to understand, predict, interpret,
and change when caring for patients. An important distinction between
Table 1.1
Clinical Questions
Table 1.2
Outcomes of Disease (the Five Ds)a
Perhaps a sixth D, destitution, belongs on this list because the financial cost of illness (for individual patients
or society) is an important consequence of disease.
a Or illness, the patient's experience of disease.
clinical epidemiology and other medical sciences is that the events of inter-
est in clinical epidemiology can be studied directly only in intact humans
and not in animals or parts of humans, such as humoral transmitters, tissue
cultures, cell membranes, and genetic sequences.
Biologic outcomes cannot properly be substituted for clinical ones with-
out direct evidence that the two are related. Table 1.3 summarizes some
biologic and clinical outcomes for the modern treatment of a patient with
human immunodeficiency virus (HIV) infection. It is plausible, from what
is known about the biology of HIV infection, that clinical outcomes such
as opportunistic infections, Kaposi's sarcoma, and death would be better
if an intervention reduced the decline in CD4+ cell counts and p24 antigen.
However, there is evidence that these are incomplete markers of disease
progression and response to treatment. It is too much to assume that patient
outcomes would improve as a result of the intervention just because bio-
logic markers do, because many other factors might determine the end
result. Clinical decisions should, therefore, be based on direct evidence
that clinical outcomes themselves are improved.
Table 1.3
Biologic and Clinical Outcomes: Treatment of Human Immunodeficiency
Virus Infection
Table 1.4
Bias in Clinical Observation
• Selection bias occurs when comparisons are made between groups of patients that differ
in determinants of outcome other than the one under study.
• Measurement bias occurs when the methods of measurement are dissimilar among
groups of patients.
• Confounding bias occurs when two factors are associated ("travel together") and the
effect of one is confused with or distorted by the effect of the other.
the outcome of the study. Groups of patients often differ in many ways:
age, sex, severity of disease, the presence of other diseases, and the care they
receive. If we compare the experience of two groups that differ on a specific
characteristic of interest (for example, a treatment or a suspected cause of
disease) but are dissimilar in these other ways and the differences are them-
selves related to outcome, the comparison is biased and little can be concluded
about the independent effects of the characteristic of interest. In the example
used earlier, selection bias would have occurred if patients given treatment
A were healthier than those given treatment B.
Measurement bias occurs when the methods of measurement are consis-
tently dissimilar in different groups of patients. An example of a potential
measurement bias would be the use of information taken from medical
records to determine if women on birth control pills were at greater risk
for thromboembolism than those not on the Pill. Suppose a study were
made comparing the frequency of oral contraceptive use among women
admitted to a hospital because of thrombophlebitis and a group of women
admitted for other reasons. It is entirely possible that women with throm-
bophlebitis, if aware of the reported association between estrogens and
thrombotic events, might report use of oral contraceptives more completely
than women without thrombophlebitis, because they had already heard
of the association. For the same reasons, clinicians might obtain and record
information about oral contraceptive use more completely for women with
phlebitis than for those without it. If so, an association between oral contra-
ceptives and thrombophlebitis might be observed because of the way in
which the history of exposure was reported and not because there really
is an association.
Confounding bias occurs when two factors are associated with each other,
or "travel together," and the effect of one is confused with or distorted by
the effect of the other. This could occur because of selection bias, by chance,
or because the two really are associated in nature.
Example: Is herpesvirus infection a cause of cervical cancer? It has been
consistently observed that the prevalence of herpesvirus infection is higher
in women with cervical cancer than in those without. However, both herpesvirus
and a number of other infectious agents, themselves possible causes of
cervical cancer, are transmitted by sexual contact. In particular, there is strong
evidence that human papillomavirus infection leads to cervical cancer. Per-
haps the higher prevalence of herpesvirus infection in women with cervical
cancer is only a consequence of greater sexual activity and so is indirectly
related to a true cause, which is also transmitted sexually (Fig. 1.1). To show
that herpesvirus infection is associated with cervical cancer independently
of other agents, it would be necessary to observe the effects of herpesvirus
free of the other factors related to increased sexual activity (5).
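The reasoning in this example can be sketched numerically. The following is only an illustration, not an analysis of real data: every count is invented, and the helper name `odds_ratio` is an assumption introduced here. It shows how a crude association between herpesvirus and cervical cancer could disappear once women are compared within levels of the confounding factor (HPV infection).

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from a 2x2 table of exposure (herpesvirus) by outcome (cancer)."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)

# HYPOTHETICAL counts, in the order:
# (HSV+ with cancer, HSV+ without cancer, HSV- with cancer, HSV- without cancer)
hpv_present = (40, 160, 10, 40)
hpv_absent = (2, 198, 8, 792)

# The crude table ignores HPV status and simply adds the two strata together.
crude = tuple(a + b for a, b in zip(hpv_present, hpv_absent))

or_hpv_present = odds_ratio(*hpv_present)
or_hpv_absent = odds_ratio(*hpv_absent)
or_crude = odds_ratio(*crude)

print(f"HPV present: odds ratio = {or_hpv_present:.2f}")
print(f"HPV absent:  odds ratio = {or_hpv_absent:.2f}")
print(f"Crude:       odds ratio = {or_crude:.2f}")
```

In these made-up data, herpesvirus is unrelated to cancer within either HPV stratum, yet the crude comparison suggests a strong association, because herpesvirus and HPV "travel together." Real data would rarely be this clean.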
Selection bias and confounding bias are not mutually exclusive. They are
described separately, however, because they present problems at different
Figure 1.1. Confounding bias: Is herpesvirus 2 (HSV-2) a possible cause of cervical
cancer? Only if its association with cervical cancer is independent of human papillo-
mavirus (HPV) infection, known to be a cause of cervical cancer. Both viruses are
related to increased sexual activity.
[Figure 1.2 labels: true blood pressure (intraarterial cannula); blood pressure measurement (sphygmomanometer); Chance; Bias. Horizontal axis: Diastolic Blood Pressure (mm Hg).]
Figure 1.2. Relationship between bias and chance: Blood pressure measurements
by intraarterial cannula and sphygmomanometer.
[Figure labels, caption not recovered: selection bias; measurement and confounding bias; chance; conclusion; external validity (generalizability).]
Figure 1.4. Sampling bias: Range of risk of rupture (shaded area) in the next 5
years of abdominal aortic aneurysm (<5.0 cm in diameter) according to whether the
patient is from the general population or a referral center (6).
without known coronary heart disease (7). The 11,037 physicians randomly
assigned to take aspirin had a 44% lower rate of myocardial infarction than
the 11,034 assigned to take placebo. The study was carefully conducted and
used a strong research design; its findings have stood up well to criticisms.
However, only healthy male physicians were in the study. When the results
of the study were first released, clinicians had to decide whether it was
justified to give aspirin to women, people with many risk factors, and patients
who are already known to have coronary disease. Subsequently, reviews of
evidence from all available studies have suggested that aspirin is also effective
in these other groups of people (8).
as the biology of disease and the physical and social environment, deter-
mine health outcomes, so that they can know what they can and cannot
change.
For these reasons, we believe the time invested in learning clinical epide-
miology is more than repaid.
Figure 1.5. Organization of the book.
[Figure 1.5 labels: population at risk; risk factors (cigarettes, asbestos, radon); onset of disease; diagnosis (symptoms and signs, chest x-ray, sputum cytology, biopsy); treatment; outcomes (death, disease, recovery). Topics shown alongside: Cause, Risk, Prevention, Abnormality, Frequency, Diagnosis, Treatment, Prognosis.]
ABNORMALITY
record. The abnormal findings are set out in a problem list or under the
heading "impressions" or "diagnoses" and are the basis for action.
Simply calling clinical findings normal or abnormal is undoubtedly
crude and results in some misclassification. The justification for taking this
approach is that it is often impractical or unnecessary to consider the raw
data in all their detail. As Bertrand Russell put it, "To be perfectly intelligi-
ble one must be inaccurate, and to be perfectly accurate, one must be
unintelligible." Physicians usually choose to err on the side of being intelligible,
to themselves and others, even at the expense of some accuracy.
Another reason for simplifying data is that each aspect of a clinician's
work ends in a decision: to pursue evaluation or to wait, to select a
treatment or reassure. Under these circumstances some sort of present/
absent classification is necessary.
Table 2.1 is an example of how relatively simple expressions of abnor-
mality are derived from more complex clinical data. On the left is a typical
problem list, a statement of the patient's important medical problems. On
the right are some of the data on which the decisions to call them problems
are based. Conclusions from the data, represented by the problem list, are
by no means noncontroversial. For example, the mean of the four diastolic
blood pressure measurements is 94 mm Hg. Some might argue that this
level of blood pressure does not justify the label "hypertension," because
it is not particularly high and there are some disadvantages to telling
patients they are sick and giving them pills. Others might consider the
label fair, considering that this level of blood pressure is associated with
an increased risk of cardiovascular disease and that the risk may be re-
duced by treatment. Although crude, the problem list serves as a basis
for decisions (about diagnosis, prognosis, and treatment) and clinical
Table 2.1
Summarization of Clinical Data: A Patient's Problem List and the Data
on Which It Is Based
Clinical Measurement
Measurements of clinical phenomena yield three kinds of data: nominal,
ordinal, and interval.
Nominal data occur in categories without any inherent order. Examples
of nominal data are characteristics that are determined by a small set of
genes (e.g., tissue antigens, sex, inborn errors of metabolism) or are dramatic,
discrete events (e.g., death, dialysis, or surgery). These data can be
placed in categories without much concern about misclassification. Nominal
data that can be divided into two categories (e.g., present/absent, yes/no,
alive/dead) are called dichotomous.
Ordinal data possess some inherent ordering or rank, such as small to
large or good to bad, but the size of the intervals between categories cannot
be specified. Some clinical examples include 1+ to 4+ leg edema, grades
I to VI murmurs (heard only with special effort to audible with the stethoscope
off the chest), and grades 1 to 5 muscle strength (no movement to
normal strength).
For interval data, there is inherent order and the interval between successive
values is equal, no matter where one is on the scale. There are two types of
interval data. Continuous data can take on any value in a continuum. Examples
include most serum chemistries, weight, blood pressure, and partial
pressure of oxygen in arterial blood. The measurement and descriptions of
continuous variables may in practice be confined to a limited number of points
on the continuum, often integers, because the precision of the measurement, or
its use, does not warrant greater detail. For example, a particular blood glucose
reading may in fact be 193.2846573 ... mg/100 mL but simply reported as
193 mg/100 mL. Discrete data can take on only specific values and are expressed
as counts. Examples of discrete data are the number of a woman's
pregnancies and live births and the number of seizures a patient has per
month.
It is for ordinal and numerical data that the following question arises:
Where does normal leave off and abnormal begin? When, for example,
does a large normal prostate become too large to be considered normal?
Clinicians are free to choose any cutoff point. Some of the reasons for the
choices will be considered later in this chapter.
Performance of Measurements
Whatever the type of measurement, its performance can be described
in several ways, discussed below.
VALIDITY
As pointed out in Chapter 1, validity is the degree to which the data
measure what they were intended to measure; that is, the results of a
measurement correspond to the true state of the phenomenon being measured.
Another word for validity is accuracy.
For clinical observations that can be measured by physical means, it is
relatively easy to establish validity. The observed measurement is com-
pared with some accepted standard. For example, serum sodium can be
measured on an instrument recently calibrated against solutions made up
with known concentrations of sodium. Clinical laboratory measurements
are commonly subjected to extensive and repeated validity checks. For
example, it is a national standard in the United States that blood glucose
measurements be monitored for accuracy by comparing readings against
high and low standards at the beginning of each day, before each technician
begins a day, and after any changes in the techniques such as a new bottle
of reagents or a new battery for the instrument. Similarly, the validity
of a physical finding can be established by the results of surgery or an
autopsy.
Other clinical measurements such as pain, nausea, dyspnea, depression,
and fear cannot be verified physically. In clinical medicine, information
about these phenomena is obtained by "taking a history." More formal
and standardized approaches, used in clinical research, are structured in-
terviews and questionnaires. Groups of individual questions (items) are
designed to measure specific phenomena (such as symptoms, feelings,
attitudes, knowledge, beliefs) called "constructs." Responses to questions
concerning a construct are converted to numbers and grouped together to
form "scales."
There are three general strategies for establishing the validity of mea-
surements that cannot be directly verified by the physical senses.
Content validity is the extent to which a particular method of measurement
includes all of the dimensions of the construct one intends to measure
and nothing more. For example, a scale for measuring pain would have
content validity if it included questions about aching, throbbing, burning,
and stinging but not about pressure, itching, nausea, tingling, and the like.
Construct validity is present to the extent that the measurement is consis-
tent with other measurements of the same phenomenon. For example, the
researcher might show that responses to a scale measuring pain are related
to other manifestations of the severity of pain such as sweating, moaning,
writhing, and asking for pain medications.
The term "hard" is usually applied to data that are reliable and preferably
dimensional (e.g., laboratory data, demographic data, and financial costs).
But clinical performance, convenience, anticipation, and familial data are
"soft." They depend on subjective statements, usually expressed in words
rather than numbers, by the people who are the observers and the observed.
To avoid such soft data, the results of treatment are commonly restricted
to laboratory information that can be objective, dimensional, and reliable,
but it is also dehumanized. If we are told that the serum cholesterol is 230
mg per 100 ml, that the chest X-ray shows cardiac enlargement, and that the
electrocardiogram has Q waves, we would not know whether the treated
object was a dog or a person. If we were told that capacity at work was
restored, that the medicine tasted good and was easy to take, and that the
family was happy about the results, we would recognize a human set of
responses.
Figure 2.1. Validity and reliability. A, High validity and high reliability. B, Low validity
and high reliability. C, High validity and low reliability. D, Low validity and low reliability.
The dotted lines represent the true values.
a large set of measurements can be valid (accurate) on the average but not
be reliable, because the measures obtained are widely scattered about the
true value. On the other hand, an instrument can be very reliable but be
systematically off the mark (inaccurate). A single measurement with poor
reliability has low validity because it is likely to be off the mark simply
because of chance alone.
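The distinction between validity and reliability can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not a description of any real instrument: the true value, the four hypothetical instruments (one for each panel A to D of Figure 2.1), and their offsets and scatters are all invented. Validity is summarized here as the average error against the true value (bias), and reliability as the standard deviation of repeated readings.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 90.0  # hypothetical true diastolic blood pressure, mm Hg

# Hypothetical instruments, one per panel of Figure 2.1:
# (systematic offset from the truth, random scatter), both in mm Hg
instruments = {
    "A": (0.0, 1.0),  # high validity, high reliability
    "B": (8.0, 1.0),  # low validity, high reliability
    "C": (0.0, 8.0),  # high validity (on average), low reliability
    "D": (8.0, 8.0),  # low validity, low reliability
}

def summarize(readings, true_value):
    """Return (bias, scatter): mean error against the truth and SD of the readings."""
    bias = statistics.fmean(readings) - true_value
    scatter = statistics.stdev(readings)
    return bias, scatter

results = {}
for panel, (offset, spread) in instruments.items():
    # 1000 repeated readings of the same true value with this instrument
    readings = [random.gauss(TRUE_VALUE + offset, spread) for _ in range(1000)]
    results[panel] = summarize(readings, TRUE_VALUE)
    bias, scatter = results[panel]
    print(f"{panel}: bias {bias:+.1f} mm Hg, scatter (SD) {scatter:.1f} mm Hg")
```

Instrument C illustrates the point made above: its readings are close to the truth on average, but any single reading may be far off because of chance alone.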
RANGE
An instrument may not register very low or high values of the thing
being measured, limiting the information it conveys. Thus the "first-generation"
method of measuring serum thyroid-stimulating hormone
(TSH) was not useful for diagnosing hyperthyroidism or for precise titration
of thyroxine administration because the method could not detect low
levels of TSH. Similarly, the Activities of Daily Living scale (which measures
people's ability at feeding, continence, transferring, going to the toilet,
dressing, and bathing) does not measure inability to read, write, or play
the piano, activities that might be very important to individual patients.
RESPONSIVENESS
An instrument is responsive to the extent that its results change as conditions
change. For example, the New York Heart Association scale, classes
I to IV (no symptoms, symptoms with slight and moderate exertion, and
symptoms at rest), is not sensitive to subtle changes in congestive heart
failure, ones patients would value, whereas laboratory measurements of
ejection fraction can detect changes too subtle for patients to notice.
INTERPRETABILITY
A disadvantage of scales based on questionnaires that is not generally
shared by physical measurements is that the results may not have meaning
to clinicians and patients. For example, just how bad is it to have a Zung
depression scale value of 72? To overcome this disadvantage, researchers
can "anchor" scale values to familiar phenomena, for example, by indicating
that people with scores below 50 are considered normal and those
with scores of 70 or over are severely or extremely depressed, requiring
immediate care.
Variation
Clinical measurements of the same phenomenon can take on a range of
values, depending on the circumstances in which they are made. To avoid
erroneous conclusions from data, clinicians should be aware of the reasons
for variation in a given situation and know which are likely to play a large
part, a small part, or no part at all in what has been observed.
Overall variation is the sum of variation related to the act of measure-
ment, biologic differences within individuals from time to time, and bio-
logic differences from person to person (Table 2.2).
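One way to read "the sum of variation" is in terms of variances. The sketch below is an illustration under assumptions that are not stated in the text: the three sources are taken to be independent and roughly Gaussian, and the standard deviations are invented. Under those assumptions the variance of the observed readings is approximately the sum of the variances of the three sources.

```python
import random
import statistics

random.seed(2)

# HYPOTHETICAL standard deviations for each source of variation (mm Hg)
SD_AMONG_INDIVIDUALS = 10.0   # biologic differences from person to person
SD_WITHIN_INDIVIDUALS = 6.0   # changes in one person with time and situation
SD_MEASUREMENT = 3.0          # instrument and observer

observed = []
for _ in range(20000):
    usual_level = random.gauss(90.0, SD_AMONG_INDIVIDUALS)
    level_today = random.gauss(usual_level, SD_WITHIN_INDIVIDUALS)
    reading = random.gauss(level_today, SD_MEASUREMENT)
    observed.append(reading)

total_variance = statistics.variance(observed)
sum_of_parts = SD_AMONG_INDIVIDUALS**2 + SD_WITHIN_INDIVIDUALS**2 + SD_MEASUREMENT**2

print(f"variance of observed readings: {total_variance:.1f}")
print(f"sum of the three variances:    {sum_of_parts:.1f}")
```

If the sources were correlated (for example, if sicker patients were also measured less carefully), the simple sum would no longer hold.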
MEASUREMENT VARIATION
All observations are subject to variation because of the performance of
the instruments and observers involved in making the measurement. The
conditions of measurement can lead to a biased result (lack of validity) or
Table 2.2
Sources of Variation
Source                  Definition
Measurement
  Instrument            The means of making the measurement
  Observer              The person making the measurement
Biologic
  Within individuals    Changes in people with time and situation
  Among individuals     Biologic differences from person to person
[Figure 2.2 panel labels recovered: monitored fetal heart rate 130-150; monitored fetal heart rate >150. Horizontal axis: error (beats/min), from underestimate to overestimate.]
Figure 2.2. Observer variability. Error in reporting fetal heart rate according to
whether the true rate, determined by electronic monitor, is within the normal range,
low, or high. (Redrawn from Day E, Maddern L, Wood C. Auscultation of foetal heart
rate: an assessment of its error and significance. Br Med J 1968;4:422-424.)
VPDs, similar to other patients studied (3). VPDs per hour varied from less
than 20 to 380 during a 3-day period, according to day and time of day. The
authors concluded: "To distinguish a reduction in VPD frequency attribut-
able to therapeutic intervention rather than biologic or spontaneous variation
alone required a greater than 83% reduction in VPD frequency if only two
24-hour monitoring periods were compared."
[Figure labels, caption not recovered: Day 1; time of day (Noon, Midnight, 6 A.M.).]
servers with various biases that tend to balance each other out, results
on average in no net misrepresentation of the true state of a phenomenon
if a set of measurements are made; individual measurements, however,
may be misleading. Inaccuracy resulting from random variation can be
reduced by taking a larger sample of what is being measured, for example,
by counting more cells on a blood smear, examining a larger area of a
urine sediment, or studying more patients. Also, the extent of random
variation can be estimated by statistical methods (see Chapter 9).
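The effect of taking a larger sample can be seen in a short simulation. This is a sketch with invented numbers (the true mean, the standard deviation, and the function name `spread_of_sample_means` are assumptions for illustration); it uses Gaussian random error and measures how widely the averages of repeated samples scatter around the truth.

```python
import random
import statistics

random.seed(1)
TRUE_MEAN = 90.0  # hypothetical true value of what is being measured
SD = 10.0         # hypothetical random variation of a single measurement

def spread_of_sample_means(n, repeats=2000):
    """SD of the means of `repeats` samples, each of n measurements."""
    means = [
        statistics.fmean([random.gauss(TRUE_MEAN, SD) for _ in range(n)])
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

for n in (1, 4, 16, 64):
    print(f"sample size {n:>2}: spread of the sample mean = {spread_of_sample_means(n):.2f}")
```

For independent measurements, the scatter of the average is expected to shrink roughly in proportion to the square root of the sample size, so each fourfold increase in sample size about halves it. A larger sample does nothing, however, to remove bias.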
On the other hand, biased results are systematically different from the true
value, no matter how many times they are repeated. For example, when investigating
a patient suspected of having an infiltrative liver disease (perhaps following
up an elevated serum alkaline phosphatase) a single liver biopsy may be
misleading, depending on how the lesions are distributed in the liver. If the
lesion is a metastasis in the left lobe of the liver, a biopsy in the usual place
(the right lobe) would miss it. On the other hand, a biopsy for miliary tuberculosis,
which is represented by millions of small granulomata throughout the liver,
would be inaccurate only through random variation. Similarly, all of the high
values for VPDs shown in Figure 2.3 were recorded on the first day, and most
of the low values on the third. The days were biased estimates of each other,
because of variation in VPD rate from day to day.
Distributions
Data that are measured on interval scales are often presented as a figure,
called a frequency distribution, showing the number (or proportion) of a
defined group of people possessing the different values of the measure-
ment (Fig. 2.5). Presenting interval data as a frequency distribution conveys
the information in relatively fine detail.
[Figure 2.5 plot: frequency distribution of PSA (ng/mL), x-axis 2 to 10; the mode is marked]
Figure 2.5. Measures of central tendency and dispersion. The distribution of
prostate-specific antigen (PSA) levels in presumably normal men. (Data from
Kane RA, Littrup PJ, Babaian R, Drago JR, Lee F, Chesley A, Murphy GP,
Mettlin C. Prostate-specific antigen levels in 1695 men without evidence of
prostate cancer. Cancer 1992;69:1201-1201.)
Table 2.3
Expressions of Central Tendency and Dispersion

Central Tendency

  Median
    Definition:    The point where the number of observations above
                   equals the number below
    Advantages:    Not easily influenced by extreme values
    Disadvantages: Not well suited for mathematical manipulation

Dispersion

  Standard deviation*
    Definition:    The absolute value of the average difference of
                   individual values from the mean
    Advantages:    Well suited for mathematical manipulation
    Disadvantages: For non-Gaussian distributions, does not describe a
                   known proportion of the observations

  Percentile, decile, quartile, etc.
    Definition:    The proportion of all observations falling between
                   specified values
    Advantages:    Describes the "unusualness" of a value without
                   assumptions about the shape of a distribution
    Disadvantages: Not well suited for statistical manipulation
DESCRIBING DISTRIBUTIONS
It is convenient to summarize distributions. Indeed, summarization is
imperative if a large number of distributions are to be presented and compared.
Two basic properties of distributions are used to summarize them: central
tendency, the middle of the distribution, and dispersion, how spread out
the values are. Several ways of expressing central tendency and dispersion,
along with their advantages and disadvantages, are summarized in Table
2.3 and illustrated in Figure 2.5.
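The summary measures named above can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the list `values` is a made-up, right-skewed set of PSA-like numbers chosen only for illustration and is not data from the text.

```python
import statistics

# Hypothetical PSA-like values (ng/mL), invented for illustration only.
values = [0.5, 0.8, 1.0, 1.0, 1.0, 1.2, 1.5, 2.0, 2.5, 4.0, 9.0]

# Central tendency
mean = statistics.mean(values)      # pulled upward by the extreme value 9.0
median = statistics.median(values)  # middle observation, resistant to extremes
mode = statistics.mode(values)      # most frequently occurring value

# Dispersion
value_range = (min(values), max(values))
sd = statistics.stdev(values)                  # sample standard deviation
quartiles = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentile cut points

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} sd={sd:.2f} quartiles={quartiles}")
```

With a skewed set like this one the mean sits above the median, which is the behavior the table's "advantages" and "disadvantages" entries describe.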
ACTUAL DISTRIBUTIONS
The frequency distributions of four common blood tests (potassium,
alkaline phosphatase, glucose, and hemoglobin) are shown in Figure 2.6.
In general, most of the values appear near the middle, and except for the
central part of the curves, there are no "humps" or irregularities. The high
and low ends of the distributions stretch out into tails, with the tail at one
end often being more elongated than the tail at the other (i.e., the curves
are "skewed" toward the long end). Whereas some of the distributions
are skewed toward higher values, others are skewed toward lower values.
In other words, all these distributions are unimodal, are roughly
bell-shaped, and are not necessarily symmetric; otherwise they do not resemble
each other.
[Figure 2.6 panels: serum potassium (mEq/L), alkaline phosphatase (units), plasma glucose, hemoglobin]
Figure 2.6. Actual clinical distributions. (Data from Martin HF, Gudzinowicz BJ,
Fanger H. Normal values in clinical chemistry. New York: Marcel Dekker, 1975.)
The distribution of values for many laboratory tests changes with
characteristics of the patients such as age, sex, race, and nutrition. Figure 2.7
shows how the distribution of one such test, blood urea nitrogen (BUN),
changes with age. A BUN of 25 mg/100 mL would be unusually high for
a young person, but not particularly remarkable for an older person.
THE NORMAL DISTRIBUTION
[Figure 2.7 plot: frequency (y-axis) by BUN in mg/100 mL (x-axis, 10 to 50)]
Figure 2.7. The distribution of clinical variables changes with age: BUN for people
aged 20-29 versus those 80 or older. (Data from Martin HF, Gudzinowicz BJ, Fanger
H. Normal values in clinical chemistry. New York: Marcel Dekker, 1975.)
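[Figure 2.8 plot: frequency (y-axis) by standard deviations from the mean (x-axis)]
Percent of area under the curve:
  -3 to -2 SD:  2.14     -2 to -1 SD: 13.59     -1 to 0 SD:  34.13
   0 to +1 SD: 34.13     +1 to +2 SD: 13.59     +2 to +3 SD:  2.14
  Within 1 SD of the mean: 68.26
  Within 2 SD of the mean: 95.44
  Within 3 SD of the mean: 99.72
Figure 2.8. The normal (Gaussian) distribution.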
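The areas shown for the normal curve can be checked numerically. This is a small sketch that uses only the standard library's error function; the helper names `normal_cdf` and `area_between` are introduced here for illustration. Computed values may differ from the figure in the last decimal place because the figure's totals appear to be sums of rounded bands.

```python
import math

def normal_cdf(z: float) -> float:
    """Cumulative probability of a standard normal variable up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_between(lo: float, hi: float) -> float:
    """Fraction of the area under the standard normal curve between lo and hi (in SD units)."""
    return normal_cdf(hi) - normal_cdf(lo)

# Area within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {100 * area_between(-k, k):.2f}%")

# Area in individual one-SD-wide bands on one side of the mean.
for lo, hi in ((0, 1), (1, 2), (2, 3)):
    print(f"{lo} to {hi} SD: {100 * area_between(lo, hi):.2f}%")
```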
This is the case for specific DNA and RNA sequences and antigens (Fig.
2.9A), which are either present or absent, although their clinical
manifestations may not be so clear-cut.
However, most distributions of clinical variables are not easily di-
vided into "normal" and "abnormal," because they are not inherently
dichotomous and they do not display sharp breaks or two peaks that
[Figure 2.9 panels: A, alleles for phenylalanine hydroxylase (normal, mutant); B, blood phenylalanine (mg/dL), x-axis 0 to 10]
Figure 2.9. Screening for phenylketonuria (PKU) in infants: dichotomous and
overlapping distributions of normal and abnormal. A, Alleles coding for
phenylalanine hydroxylase are either normal or mutant. B, The distributions of
blood phenylalanine levels in newborns with and without PKU overlap and are of
greatly different magnitude. (The prevalence of PKU, actually about 1/10,000, is
exaggerated so that its distribution can be seen in the figure.)
be used to decide? Three criteria have proven useful: being unusual, being
sick, and being treatable. For a given measurement, the results of these
approaches bear no necessary relation to each other, so that what might
be considered abnormal using one criterion might be normal by another.
ABNORMAL = UNUSUAL
Normal often refers to the most frequently occurring or usual condition.
Whatever occurs often is considered normal, and whatever occurs infre-
quently is abnormal. This is a statistical definition, based on the frequency
of a characteristic in a defined population. Commonly, the reference popu-
lation is made up of people without disease, but this need not be the case.
For example, we may say that it is normal to have pain after surgery or
for eczema to itch.
It is tempting to be more specific by defining what is unusual in mathe-
matical terms. One commonly used way of establishing a cutoff point
between normal and abnormal is to agree, somewhat arbitrarily, that all
values beyond 2 standard deviations from the mean are abnormal. On
the assumption that the distribution in question approximates a normal
(Gaussian) distribution, 2.5% of observations would then appear in each
tail of the distribution and be considered abnormal.
Of course, as already pointed out, most biologic measurements are not
normally distributed. So it is better to describe unusual values, whatever
the proportion chosen, as a fraction (or percentile) of the actual distribution.
In this way, it is possible to make a direct statement about how infrequent
a value is without making assumptions about the shape of the distribution
from which it came.
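The difference between the two ways of defining "unusual" can be seen on a skewed distribution. The sketch below is an illustration only: it simulates hypothetical right-skewed values (a lognormal sample with a fixed seed, an assumption and not data from the text) and compares the fraction of observations labeled abnormal by a cutoff of 2 standard deviations from the mean with the fraction beyond the 2.5th and 97.5th percentiles.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical right-skewed "laboratory values"; not data from the text.
values = [rng.lognormvariate(0.0, 0.75) for _ in range(10_000)]

# Cutoffs at 2 standard deviations from the mean (assumes a Gaussian shape).
mean = statistics.fmean(values)
sd = statistics.stdev(values)
low_sd, high_sd = mean - 2 * sd, mean + 2 * sd

# Cutoffs from the actual distribution: 2.5th and 97.5th percentiles.
cuts = statistics.quantiles(values, n=40)  # 39 cut points, 2.5% apart
low_pct, high_pct = cuts[0], cuts[-1]

def fraction(predicate) -> float:
    """Fraction of simulated values satisfying the predicate."""
    return sum(1 for v in values if predicate(v)) / len(values)

below_sd = fraction(lambda v: v < low_sd)
above_sd = fraction(lambda v: v > high_sd)
below_pct = fraction(lambda v: v < low_pct)
above_pct = fraction(lambda v: v > high_pct)

print(f"2 SD rule:   {100 * below_sd:.1f}% low, {100 * above_sd:.1f}% high")
print(f"percentiles: {100 * below_pct:.1f}% low, {100 * above_pct:.1f}% high")
```

In this simulated sample the 2 SD rule places no values in the low tail and more than 2.5% in the high tail, whereas the percentile cutoffs label 2.5% at each end regardless of shape.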
A statistical definition of normality is commonly used but there are
several ways in which it can be ambiguous or misleading.
First, if all values beyond an arbitrary statistical limit, say the 95th
percentile, were considered abnormal, then the prevalence of all diseases
would be the same, 5%. This is inconsistent with our usual way of thinking
about disease frequency.
Second, there is no general relationship between the degree of statistical
unusualness and clinical disease. The relationship is specific to the disease
in question. For some measurements, deviations from usual are associated
with disease to an important degree only at quite extreme values, well
beyond the 95th or even the 99th percentile.
Example The World Health Organization (WHO) considers anemia to
be present when hemoglobin (Hb) levels are below 12 g/100 mL in adult
nonpregnant females. In a British survey of women aged 20-64, Hb was
below 12 g/100 mL in 11% of 920 nonpregnant women, twice as many as
would be expected if the criterion for abnormality were exceeding 2 standard
deviations (5). But were the women with Hb levels below 12 g/100 mL
"diseased" in any way because of their relatively low Hb? Two possibilities
Third, many laboratory tests are related to risk of disease over their entire
range of values, from low to high. For serum cholesterol, there is an almost
threefold increase in risk from the "low normal" to the "high normal" range.
Fourth, some extreme values are distinctly unusual but preferable to
more usual ones. This is particularly true at the low end of some distribu-
tions. Who would not be pleased to have a serum creatinine of 0.4 mg/
100 mL or a systolic blood pressure of 105 mm Hg? Both are unusually
low but they represent better than average health or risk.
Finally, sometimes patients may have, for laboratory tests diagnostic of
their disease, values in the usual range for healthy people, yet clearly be
diseased. Examples include low pressure hydrocephalus, normal pressure
glaucoma, and normocalcemic hyperparathyroidism.
ABNORMAL = ASSOCIATED WITH DISEASE
A sounder approach to distinguishing normal from abnormal is to call
abnormal those observations that are regularly associated with disease,
disability, or death, i.e., clinically meaningful departures from good health.
Example What is a "normal" alcohol (ethanol) intake? Several studies
have shown a U-shaped relationship between alcohol intake and mortality:
high death rates in abstainers, lower rates in moderate drinkers, and high
rates in heavy drinkers (Fig. 2.10). It has been suggested that the lower death
rates with increasing alcohol consumption, at the lower end of the curve,
occur because alcohol raises high density lipoprotein levels, which protects
against cardiovascular disease. Alternatively, when people become ill they
reduce their alcohol consumption and this could explain the high rate of
mortality associated with low alcohol intake (6). High death rates at high
intake are less controversial: alcohol is a cause of several fatal diseases (heart
disease, cancer, and stroke). The interpretation of the causes for the U-shaped
curve determines whether it is as abnormal to abstain as it is to drink heavily.
ABNORMAL = TREATABLE
For some conditions, particularly those that are not troublesome in their
own right (i.e., are asymptomatic), it is better to consider a measurement
abnormal only if treatment of the condition represented by the measure-
[Figure 2.10 plot: scale 6 to 16 on the y-axis; axis labels not recoverable]
Figure 2.10. Abnormal as associated with disease. The relationship between
alcohol consumption and mortality. (From Shaper AG, Wannamethee G, Walker M.
Alcohol and mortality in British men: explaining the U-shaped curve. Lancet
1988;2:1267-1273.)
[Figure: frequency distributions for the first testing of the population and for patients retested, with the mean marked]
Summary
Clinical phenomena are measured on nominal, ordinal, and interval
scales. Although many clinical observations fall on a continuum of values,
for practical reasons they are often simplified into dichotomous
(normal/abnormal) categories. Observations of clinical phenomena vary because
of measurement error, differences in individuals from time to time, and
differences among individuals. The performance of a method of measurement
is characterized by validity (Does it measure what it intends to
measure?), reliability (Do repeated measures of the same thing give the same
result?), range, responsiveness, and interpretability.
Frequency distributions for clinical variables have different shapes,
which can be summarized by describing their central tendency and
dispersion.
Laboratory values from normal and abnormal people often overlap;
because of this and the relatively low prevalence of abnormals, it is usually
not possible to make a clean distinction between the two groups using the
test result alone. Choice of a point at which normal ends and abnormal
begins is arbitrary and is often related to one of three definitions of
abnormality: statistically unusual, associated with disease, or treatable. If patients
with extreme values of a test are selected and the test is repeated, the
second set of values is likely to fall closer to the central (statistically normal)
part of the frequency distribution, a phenomenon called regression to
the mean.
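Regression to the mean can be demonstrated with a short simulation. The sketch below rests on stated assumptions: each hypothetical person has a stable "true" value, each test adds independent random error, and all the numbers (`TRUE_MEAN`, `BETWEEN_SD`, `WITHIN_SD`, the seed) are invented for illustration rather than taken from the text.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical parameters, chosen only for illustration.
TRUE_MEAN = 100.0   # population mean of the underlying true values
BETWEEN_SD = 10.0   # spread of true values among individuals
WITHIN_SD = 10.0    # random variation of a single test around the true value

true_values = [rng.gauss(TRUE_MEAN, BETWEEN_SD) for _ in range(20_000)]
first_test = [t + rng.gauss(0.0, WITHIN_SD) for t in true_values]
second_test = [t + rng.gauss(0.0, WITHIN_SD) for t in true_values]

# Select people whose FIRST result is beyond the 95th percentile, then retest them.
cutoff = statistics.quantiles(first_test, n=20)[-1]
selected = [i for i, x in enumerate(first_test) if x > cutoff]

mean_first = statistics.fmean(first_test[i] for i in selected)
mean_second = statistics.fmean(second_test[i] for i in selected)
population_mean = statistics.fmean(first_test)

print(f"population mean on first test: {population_mean:.1f}")
print(f"selected group, first test:    {mean_first:.1f}")
print(f"selected group, retest:        {mean_second:.1f}")
```

The retest mean of the selected group falls between their first-test mean and the population mean, even though nothing about the individuals changed between tests.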
REFERENCES
1. Feinstein AR. The need for humanized science in evaluating medication. Lancet 1972;2:421-
423.
2. Day E, Maddern L, Wood C. Auscultation of foetal heart rate: an assessment of its error
and significance. Br Med J 1968;4:422-424.
3. Morganroth J, Michelson EL, Horowitz LN, Josephson ME, Pearlman AS, Dunkman WB.
Limitations of routine long-term electrocardiographic monitoring to assess ventricular
ectopic frequency. Circulation 1978;58:408-414.
4. Elveback LR, Guillier CL, Keating FR. Health, normality, and the ghost of Gauss. JAMA
1970;211:69-75.
5. Elwood PC, Waters WE, Greene WJW, Sweetnam P. Symptoms and circulating hemoglobin
level. J Chron Dis 1969;21:615-628.
6. Shaper AG, Wannamethee G, Walker M. Alcohol and mortality in British men: explaining
the U-shaped curve. Lancet 1988;2:1267-1273.
7. Epstein KA, Schneiderman LJ, Bush JW, Zettner A. The "abnormal" screening serum
thyroxine (T4): analysis of physician response, outcome, cost and health effectiveness. J
Chron Dis 1981;34:175-190.
SUGGESTED READINGS
Department of Clinical Epidemiology and Biostatistics, McMaster University. Clinical
disagreement I. How often it occurs and why. Can Med Assoc J 1980;123:499-504.
Department of Clinical Epidemiology and Biostatistics, McMaster University. Clinical
disagreement II. How to avoid it and how to learn from one's mistakes. Can Med Assoc J
1980;123:613-617.
Feinstein AR. Clinical judgment. Baltimore: Williams & Wilkins, 1967.
Feinstein AR. Problems in measurement, clinical biostatistics. St. Louis: CV Mosby, 1977.
Feinstein AR. Clinimetrics. New Haven, CT: Yale University Press, 1987.
Koran LM. The reliability of clinical methods, data and judgment. N Engl J Med 1975;293:642-
646, 695-701.
Mainland D. Remarks on clinical "norms." Clin Chem 1971;17:267-274.
Murphy EA. The logic of medicine. Baltimore: Johns Hopkins University Press, 1976.
Guyatt GH, Feeny DH, Patrick DL. Measuring health-related quality of life. Ann Intern Med
1993;118:622-629.
3
DIAGNOSIS
Appearances to the mind are of four kinds. Things either are what
they appear to be; or they neither are, nor appear to be; or they are,
and do not appear to be; or they are not, yet appear to be. Rightly
to aim in all these cases is the wise man's task.
Epictetus, 2nd century A.D.
Simplifying Data
In Chapter 2, it was pointed out that clinical measurements, including
data from diagnostic tests, are expressed on nominal, ordinal, or interval
scales. Regardless of the kind of data produced by diagnostic tests, clini-
cians generally reduce the data to a simpler form to make them useful in
practice. Most ordinal scales are examples of this simplification process.
Obviously, heart murmurs can vary from very loud to inaudible. But trying
to express subtle gradations in the intensity of murmurs is unnecessary
for clinical decision making. A simple ordinal scale, grades I to VI,
serves just as well. More often, complex data are reduced to a simple
dichotomy, e.g., present/absent, abnormal/normal, or diseased/well. This
is particularly done when test results are used to decide on treatment. At
any given point in time, therapeutic decisions are either/or decisions.
Either treatment is begun or it is withheld.
The use of blood pressure data to decide about therapy is an example
of how we simplify information for practical clinical purposes. Blood
pressure is ordinarily measured to the nearest 2 mm Hg, i.e., on an
interval scale. However, most hypertension treatment guidelines, such
as those of the Joint National Committee on the Detection, Evaluation,
and Treatment of Hypertension (1), and most physicians choose a
particular level (e.g., 95 mm Hg diastolic pressure) at which to initiate
drug treatment. In doing so, clinicians have transformed interval data
into nominal (in this case, dichotomous) data. To take the example
further, the Joint National Committee recommends that a physician
choose a treatment plan according to whether the patient's diastolic
blood pressure is "mildly elevated" (90-94 mm Hg), "moderately
elevated" (95-114 mm Hg), or "severely elevated" (≥115 mm Hg), an
ordinal scale.
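The simplification described here, from an interval measurement to an ordinal grade and then to a yes/no decision, can be written out as a small sketch. The category boundaries are those quoted in the paragraph; the function names, the label "not elevated" for values below 90 mm Hg, and the default treatment threshold are assumptions made for illustration, not recommendations.

```python
def diastolic_category(dbp_mm_hg: float) -> str:
    """Map an interval-scale diastolic pressure to the ordinal categories quoted in the text."""
    if dbp_mm_hg >= 115:
        return "severely elevated"
    if dbp_mm_hg >= 95:
        return "moderately elevated"
    if dbp_mm_hg >= 90:
        return "mildly elevated"
    return "not elevated"  # label assumed here; the text does not name this group

def start_drug_treatment(dbp_mm_hg: float, threshold_mm_hg: float = 95.0) -> bool:
    """Reduce the measurement to a dichotomy: treat or do not treat.

    The default threshold is illustrative; it is a parameter precisely because
    the choice of cutoff is a clinical judgment, not a property of the data.
    """
    return dbp_mm_hg >= threshold_mm_hg

for reading in (84, 92, 104, 118):
    print(reading, diastolic_category(reading), start_drug_treatment(reading))
```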
                          DISEASE
                   Present            Absent
TEST   Positive    True positive      False positive
       Negative    False negative     True negative

Figure 3.1. The relationship between a diagnostic test result and the occurrence
of disease. There are two possibilities for the test result to be correct (true positive
and true negative) and two possibilities for the result to be incorrect (false positive
and false negative).
x-rays and sputum smears are used to determine the nature of pneumonia,
rather than lung biopsy with examination of the diseased lung tissue.
Similarly, electrocardiograms and serum enzymes are often used to
establish the diagnosis of acute myocardial infarction, rather than catheterization
or imaging procedures. The simpler tests are used as proxies for more
elaborate but more accurate ways of establishing the presence of disease,
with the understanding that some risk of misclassification results. This risk
is justified by the safety and convenience of the simpler tests. But simpler
tests are only useful when the risks of misclassification are known and
found to be acceptably low. This requires sound data that compare their
accuracy to an appropriate standard.
LACK OF INFORMATION ON NEGATIVE TESTS
The goal of all clinical studies describing the value of diagnostic tests
should be to obtain data for all four of the cells shown in Figure 3.1.
Without all these data, it is not possible to assess the risks of misclassifica-
tion, the critical questions about the performance of the tests. Given that
the goal is to fill in all four cells, it must be stated that sometimes this is
difficult to do in the real world. It may be that an objective and valid
means of establishing the diagnosis exists, but it is not available for the
purposes of formally establishing the properties of a diagnostic test for
ethical or practical reasons. Consider the situation in which most informa-
tion about diagnostic tests is obtained. Published accounts come primarily
from clinical, and not research, settings. Under these circumstances, physi-
cians are using the test in the process of caring for patients. They feel
justified in proceeding with more exhaustive evaluation, in the patient's
best interest, only when preliminary diagnostic tests are positive. They are
naturally reluctant to initiate an aggressive workup, with its associated
risk and expense, when the test is negative. As a result, information on
negative tests, whether true negative or false negative, tends to be much
less complete in the medical literature.
This problem is illustrated by an influential study of the utility of the
blood test that detects prostate specific antigen (PSA) in looking for prostate
cancer (2). Patients with PSAs above a cutoff level were subjected to biopsy
while patients with PSAs below the cutoff were not biopsied. The authors
understandably were reluctant to subject men to an uncomfortable proce-
dure without supporting evidence. As a result, the study leaves us unable
to determine the false-negative rate for PSA screening.
LACK OF INFORMATION ON TEST RESULTS IN THE NONDISEASED
As discussed above, clinicians are understandably loath to perform elab-
orate testing on patients who do not have problems. An evaluation of a
test's performance can be grossly misleading if the test is only applied to
patients with the condition.
would be considered false positives in relation to the old test. Just such a
situation occurred in a comparison of real-time ultrasonography and oral
cholecystography for the detection of gallstones (4). In five patients,
ultrasound was positive for stones that were missed on an adequate
cholecystogram. Two of the patients later underwent surgery and gallstones were
found, so that for at least those two patients, the standard oral cholecysto-
gram was actually less accurate than the newer real-time ultrasound. Simi-
larly, if the new test is more often negative in patients who really do not
have the disease, results for those patients will be considered false nega-
tives compared with the old test. Thus, if an inaccurate standard of validity
is used, a new test can perform no better than that standard and will seem
inferior when it approximates the truth more closely.
Sensitivity and Specificity
Figure 3.2 summarizes some relationships between a diagnostic test and
the actual presence of disease. It is an expansion of Figure 3.1, with the
addition of some useful definitions. Most of the rest of this chapter deals
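                          DISEASE
                   Present     Absent
TEST   Positive       a           b        a + b
       Negative       c           d        c + d
                    a + c       b + d      a + b + c + d

\[ \mathrm{Se} = \frac{a}{a+c} \qquad \mathrm{Sp} = \frac{d}{b+d} \qquad P = \frac{a+c}{a+b+c+d} \]
\[ +\mathrm{PV} = \frac{a}{a+b} \qquad -\mathrm{PV} = \frac{d}{c+d} \]
\[ \mathrm{LR}+ = \frac{a/(a+c)}{b/(b+d)} \qquad \mathrm{LR}- = \frac{c/(a+c)}{d/(b+d)} \]

Figure 3.2. Diagnostic test characteristics and definitions. Se = sensitivity; Sp =
specificity; P = prevalence; PV = predictive value.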
[Figure 3.3: test result versus group A β-hemolytic streptococcus on throat culture; totals shown: 37, 112, 149]
Table 3.1
Trade-Off between Sensitivity and Specificity when Diagnosing Diabetes*

Blood Sugar Level     Sensitivity     Specificity
(mg/100 mL)           (%)             (%)
  70                  98.6              8.8
  80                  97.1             25.5
  90                  94.3             41.6
 100                  88.6             69.8
 110                  85.7             84.1
 120                  71.4             92.5
 130                  64.3             96.9
 140                  57.1             99.4
 150                  50.0             99.6
 160                  47.1             99.8
 170                  42.9            100.0
 180                  38.6            100.0
 190                  34.3            100.0
 200                  27.1            100.0

* Public Health Service. Diabetes program guide. Publication no. 506. Washington, DC: U.S. Government
Printing Office, 1960.
of the disease. The test would be very specific at the expense of sensitivity.
At the other extreme, if anyone with a blood sugar of greater than 70 mg%
were diagnosed as diabetic, very few people with the disease would be
missed, but most normal people would be falsely labeled as having
diabetes. The test would then be sensitive but nonspecific. There is no way,
using a single blood sugar determination under standard conditions,
that one can improve both the sensitivity and specificity of the test at the
same time.
Another way to express the relationship between sensitivity and
specificity for a given test is to construct a curve, called a receiver operator
characteristic (ROC) curve. An ROC curve for the use of a single blood sugar
determination to diagnose diabetes mellitus is illustrated in Figure 3.4. It
is constructed by plotting the true-positive rate (sensitivity) against the
false-positive rate (1 - specificity) over a range of cut-off values. The values
on the axes run from a probability of 0 to 1.0 (or, alternatively, from 0 to
100%). Figure 3.4 illustrates the dilemma created by the trade-off between
sensitivity and specificity. A blood sugar cutoff point of 100 will miss only
11% of diabetics, but 30% of normals will be alarmed by a false-positive
report. Raising the cutoff to 120 reduces false-positives to less than 10% of
normals, but at the expense of missing nearly 30% of cases.
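The construction of the curve can be sketched in a few lines. The rows below are a subset of the cutoffs in Table 3.1 as read from the printed table; the variable and function names are introduced here for illustration. The area calculation uses the trapezoidal rule with the conventional end points (0, 0) and (1, 1) added, which is one common convention and not something the text specifies.

```python
# (cutoff in mg/100 mL, sensitivity %, specificity %): selected rows of Table 3.1
TABLE_3_1 = [
    (70, 98.6, 8.8),
    (80, 97.1, 25.5),
    (100, 88.6, 69.8),
    (110, 85.7, 84.1),
    (120, 71.4, 92.5),
    (130, 64.3, 96.9),
    (140, 57.1, 99.4),
    (180, 38.6, 100.0),
    (200, 27.1, 100.0),
]

def roc_points(rows):
    """Return (false-positive rate, true-positive rate) pairs, sorted left to right."""
    points = [(1.0 - sp / 100.0, se / 100.0) for _, se, sp in rows]
    points += [(0.0, 0.0), (1.0, 1.0)]  # conventional end points of an ROC curve
    return sorted(points)

def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) pairs."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

points = roc_points(TABLE_3_1)
auc = area_under_curve(points)

# The trade-off at two cutoffs discussed in the text.
by_cutoff = {cutoff: (se, sp) for cutoff, se, sp in TABLE_3_1}
for cutoff in (100, 120):
    se, sp = by_cutoff[cutoff]
    print(f"cutoff {cutoff}: misses {100 - se:.1f}% of cases, "
          f"false-positive rate {100 - sp:.1f}%")
print(f"approximate area under the ROC curve: {auc:.2f}")
```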
Tests that discriminate well crowd toward the upper left corner of the
ROC curve; for them, as the sensitivity is progressively increased (the
[Figure 3.4 plot: 1 - specificity (%), the false-positive rate, on the bottom axis and specificity (%) on the top axis; points are labeled with cutoff values in mg/100 mL]
Figure 3.4. A ROC curve. The accuracy of 2-hr postprandial blood sugar as a
diagnostic test for diabetes mellitus. (Data from Public Health Service. Diabetes
program guide. Publication no. 506. Washington, DC: U.S. Government Printing
Office, 1960.)
Figure 3.5 compares the ROC curves for two questionnaire tests used to
screen for alcoholism in elderly patients: the CAGE and the MAST (Michigan
Alcoholism Screening Test) (5). The CAGE is both more sensitive and
more specific than the MAST and includes a much larger area under its
curve.
Obviously, tests that are both sensitive and specific are highly sought
after and can be of enormous value. However, practicing clinicians rarely
work with tests that are both highly sensitive and specific. So for the
present, we must use other means for circumventing the trade-off between
sensitivity and specificity. The most common way is to use the results of
several tests together (as discussed below).
[Figure 3.5 plot: 1 - specificity on the x-axis (0 to 100), y-axis scale 0 to 100, with curves labeled CAGE and MAST]
Figure 3.5. ROC curves for the CAGE and MAST questionnaires in elderly patients
with and without alcoholism. (Redrawn from Jones TV, Lindsey BA, Yount P, Soltys
R, Farani-Enayat B. Alcoholism screening questionnaires: are they valid in elderly
medical outpatients? J Gen Intern Med 1993;8:674-678.)
[Figure 3.6 plot: specificity on the top axis and 1 - specificity on the bottom axis; Stage D labeled; points marked at 5.0 ng/mL and 10.0 ng/mL]
Figure 3.6. ROC curve for CEA as a diagnostic test for colorectal cancer, according
to stage of disease. The sensitivity and specificity of a test vary with the stage of
disease. (Redrawn from Fletcher RH. Carcinoembryonic antigen. Ann Intern Med
1986;104:66-73.)
several ways. As already pointed out, if the test is evaluated using data
obtained during the course of a clinical evaluation of patients suspected
of having the disease in question, a positive test may prompt the clinician
to continue pursuing the diagnosis, increasing the likelihood that the
disease will be found. On the other hand, a negative test may cause the
clinician to abandon further testing, making it more likely that the disease,
if present, will be missed.
In other situations, the test result may be part of the information used
to establish the diagnosis, or conversely, the results of the test may be
interpreted taking other clinical information or the final diagnosis into
account. Radiologists are frequently subject to this kind of bias when they
read x-rays. Because x-ray interpretation is somewhat subjective, it is easy
to be influenced by the clinical information provided. All clinicians experi-
ence the situation of having x-rays overread because of a clinical impres-
sion, or conversely, of going back over old x-rays in which a finding was
missed because a clinical event was not known at the time, and therefore,
attention was not directed to the particular area in the x-ray. Because of
these biases, some radiologists prefer to read x-rays twice, first without
and then with the clinical information. All of these biases tend to increase
the agreement between the test and the standard of validity. That is, they
tend to make the test seem more useful than it actually is, as, for example,
when an MRI of the lumbar spine shows a bulging disc in a patient with
back pain (see earlier example in this chapter).
CHANCE
Values for sensitivity and specificity (or likelihood ratios, another
characteristic of diagnostic tests, discussed below) are usually estimated from
observations on relatively small samples of people with and without the
disease of interest. Because of chance (random variation) in any one sample,
particularly if it is small, the true sensitivity and specificity of the test can
be misrepresented, even if there is no bias in the study. The particular
values observed are compatible with a range of true values, typically
characterized by the "95% confidence intervals"* (see Chapter 9). The width
of this range of values defines the degree of precision of the estimates of
sensitivity and specificity. Therefore, reported values for sensitivity and
specificity should not be taken too literally if a small number of patients
is studied.
* The 95% confidence interval of a proportion is easily estimated by the following formula, based on the
binomial theorem:
\[ p \pm 2\sqrt{\frac{p(1-p)}{N}} \]
where p is the observed proportion and N is the number of people observed. To be more nearly exact,
multiply by 1.96 instead of 2.
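The effect of sample size on precision can be made concrete with a short calculation. This is a sketch of the usual normal-approximation interval for a proportion, p plus or minus 1.96 standard errors, applied to an observed sensitivity of 75% at several sample sizes. The function name `proportion_ci` is introduced here for illustration, and the approximation is rough when the number observed is small.

```python
import math

def proportion_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for an observed proportion p among n people.

    Uses the normal approximation to the binomial; limits are clipped to [0, 1].
    """
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

observed_sensitivity = 0.75
for n in (10, 20, 30, 40, 50):
    low, high = proportion_ci(observed_sensitivity, n)
    print(f"N={n:2d}: {100 * low:.0f}% to {100 * high:.0f}%")
```

The interval narrows as more people are observed, but even at 50 it still spans more than 20 percentage points.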
[Figure 3.7 plot: y-axis scale 0 to 100 (%), x-axis number of people observed (10 to 50)]
Figure 3.7. The precision of an estimate of sensitivity. The 95% confidence interval
for an observed sensitivity of 75%, according to the number of people observed.
dictive value is the probability of not having the disease when the test
result is negative (normal). Predictive value answers the question, "If my
patient's test result is positive (negative) what are the chances that my
patient does (does not) have the disease?" Predictive value is sometimes
called posterior (or posttest) probability, the probability of disease after the
test result is known. Figure 3.3 illustrates these concepts. Among the
patients treated with antibiotics for streptococcal pharyngitis, less than half
(44%) had the condition by culture (positive predictive value). The negative
predictive value of the housestaff's diagnostic impressions was better; of
the 87 patients thought not to have streptococcal pharyngitis, the
impression was correct for 77 (88%).
Terms summarizing the overall value of a test have been described.
One such term, accuracy, is the proportion of all test results, both positive
and negative, that are correct. (For the pharyngitis example in Figure
3.3, the accuracy of the housestaff's diagnostic impressions was 70%.)
The area under the ROC curve is another useful summary measure of
the information provided by a test result. However, these summary
measures are too crude to be useful clinically because specific informa-
tion about the component parts-sensitivity, specificity, and predictive
value at specific cutoff points-is lost when they are aggregated into a
single index.
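The pharyngitis figures quoted above can be reproduced from a 2 x 2 table. In the sketch below the four cell counts are not printed in this passage; they are reconstructed, as an assumption, from the numbers that are given (149 patients in all, 37 with a positive culture, and 87 thought not to have the disease, of whom 77 were culture negative). The class name `TwoByTwo` and its cell letters a, b, c, d are illustrative conventions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TwoByTwo:
    a: int  # test positive, disease present (true positive)
    b: int  # test positive, disease absent  (false positive)
    c: int  # test negative, disease present (false negative)
    d: int  # test negative, disease absent  (true negative)

    @property
    def total(self) -> int:
        return self.a + self.b + self.c + self.d

    @property
    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    @property
    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    @property
    def positive_predictive_value(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def negative_predictive_value(self) -> float:
        return self.d / (self.c + self.d)

    @property
    def accuracy(self) -> float:
        return (self.a + self.d) / self.total

# Counts reconstructed from the totals quoted in the text (an assumption):
# c = 87 - 77 = 10, a = 37 - 10 = 27, b = 149 - 87 - 27 = 35, d = 77.
pharyngitis = TwoByTwo(a=27, b=35, c=10, d=77)

print(f"positive predictive value: {pharyngitis.positive_predictive_value:.3f}")
print(f"negative predictive value: {pharyngitis.negative_predictive_value:.3f}")
print(f"accuracy:                  {pharyngitis.accuracy:.3f}")
print(f"sensitivity:               {pharyngitis.sensitivity:.3f}")
print(f"specificity:               {pharyngitis.specificity:.3f}")
```

Two tests can share the same accuracy while differing widely in sensitivity and specificity, which is why the single index hides information a clinician needs.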
DETERMINANTS OF PREDICTIVE VALUE
The predictive value of a test is not a property of the test alone. It is
determined by the sensitivity and specificity of the test and the prevalence
of disease in the population being tested, where prevalence has its custom-
ary meaning-the proportion of persons in a defined population at a given
point in time with the condition in question. Prevalence is also called prior
(or pretest) probability, the probability of disease before the test result is
known. (For a full discussion of prevalence, see Chapter 4.)
The mathematical formula relating sensitivity, specificity, and preva-
lence to positive predictive value is derived from Bayes's theorem of condi-
tional probabilities:
\[
\text{Positive predictive value} =
\frac{\text{Sensitivity} \times \text{Prevalence}}
     {(\text{Sensitivity} \times \text{Prevalence}) + (1 - \text{Specificity}) \times (1 - \text{Prevalence})}
\]
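The same reasoning gives a companion expression for negative predictive value. It is not printed in this passage; it is offered here as a sketch that follows from applying Bayes's theorem to a negative result, with true negatives in the numerator and all negative results in the denominator.

```latex
\[
\text{Negative predictive value} =
\frac{\text{Specificity} \times (1 - \text{Prevalence})}
     {\text{Specificity} \times (1 - \text{Prevalence}) + (1 - \text{Sensitivity}) \times \text{Prevalence}}
\]
```

Read this way, the denominator of each formula is simply the proportion of all people tested who receive that result, true and false together.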
The more sensitive a test is, the better will be its negative predictive
value (the more confident the clinician can be that a negative test result
rules out the disease being sought). Conversely, the more specific the test
is, the better will be its positive predictive value (the more confident the
clinician can be that a positive test confirms or rules in the diagnosis being
sought). Because predictive value is also influenced by prevalence, it is not
independent of the setting in which the test is used. Positive results even
for a very specific test, when applied to patients with a low likelihood of
having the disease, will be largely false positives. Similarly, negative re-
sults, even for a very sensitive test, when applied to patients with a high
chance of having the disease, are likely to be false negatives. In sum, the
interpretation of a positive or negative diagnostic test result varies from
setting to setting, according to the estimated prevalence of disease in the
particular setting.
It is not intuitively obvious what prevalence has to do with an individual
patient. For those who are skeptical, it might help to consider how a test
would perform at the extremes of prevalence. Remember that no matter
how sensitive and specific a test might be (short of perfection), there will
still be a small proportion of patients who are misclassified by it. Imagine
a population in which no one has the disease. In such a group all positive
results, even for a very specific test, will be false positives. Therefore, as
the prevalence of disease in a population approaches zero, the positive
predictive value of a test also approaches zero. Conversely, if everyone in
a population tested has the disease, all negative results will be false nega-
tives, even for a very sensitive test. As prevalence approaches 100%, nega-
tive predictive value approaches zero. Another way for the skeptic to con-
vince himself or herself of these relationships is to work with Figure 3.2,
holding sensitivity and specificity constant, changing prevalence, and cal-
culating the resulting predictive values.
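The exercise suggested above can be sketched in a few lines. This is an illustrative sketch only: the cell names a, b, c, d follow the usual two-by-two layout (assumed here to match Figure 3.2), and the 95%/95% test and the population of 10,000 are hypothetical.

```python
def two_by_two(n, prevalence, sensitivity, specificity):
    """Expected cell counts for n people tested: a=TP, b=FP, c=FN, d=TN."""
    diseased = n * prevalence
    nondiseased = n - diseased
    a = diseased * sensitivity      # true positives
    c = diseased - a                # false negatives
    d = nondiseased * specificity   # true negatives
    b = nondiseased - d             # false positives
    return a, b, c, d


def predictive_values(a, b, c, d):
    """Return (PPV, NPV); None when there are no results of that kind."""
    ppv = a / (a + b) if (a + b) > 0 else None
    npv = d / (c + d) if (c + d) > 0 else None
    return ppv, npv


# Hold sensitivity and specificity constant; change only prevalence.
for prevalence in (0.0, 0.001, 0.5, 0.999, 1.0):
    ppv, npv = predictive_values(*two_by_two(10_000, prevalence, 0.95, 0.95))
    print(f"prevalence {prevalence:6.1%}  PPV {ppv:6.1%}  NPV {npv:6.1%}")
```

At a prevalence of zero every positive result is a false positive, and at a prevalence of 100% every negative result is a false negative, whatever the sensitivity and specificity.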
The effect of prevalence on positive predictive value, for a test at differ-
ent but generally high levels of sensitivity and specificity, is illustrated in
Figure 3.8. When the prevalence of disease in the population tested is
relatively high-more than several percent-the test performs well. But
at lower prevalences, the positive predictive value drops to nearly zero,
and the test is virtually useless for diagnosing disease. As sensitivity and
specificity fall, the influence of changes in prevalence on predictive value
becomes more acute.
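The relationship plotted in Figure 3.8 can be tabulated directly. The sketch below assumes, as the figure labels suggest, that sensitivity equals specificity at each level (80/80, 90/90, 99/99) and uses the prevalences shown on the figure's horizontal axis; the function name is illustrative.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value; all inputs are proportions (0 to 1)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)


levels_pct = (80, 90, 99)                      # sensitivity = specificity, in percent
denominators = (5, 10, 50, 100, 1000, 10_000)  # prevalence of 1 in N

print("prevalence " + "".join(f"{p}/{p}".rjust(8) for p in levels_pct))
for n in denominators:
    row = "".join(f"{ppv(p / 100, p / 100, 1 / n):8.1%}" for p in levels_pct)
    print(f"1/{n}".ljust(11) + row)
```

Even the 99/99 test yields a positive predictive value of only 50% when 1 person in 100 has the disease, and less than 1% at 1 in 10,000.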
[Figure 3.8. Positive predictive value (%) plotted against prevalence (1/5, 1/10, 1/50, 1/100, 1/1000, 1/10,000) for tests with sensitivity/specificity of 80/80, 90/90, and 99/99.]
that are based on the estimate. In any case, the process is bound to be
more accurate than implicit judgment alone.
In general, prevalence is more important than sensitivity and specificity
in determining predictive value (see Fig. 3.8). One reason why this is so
is that prevalence commonly varies over a wider range. Prevalence of
disease can vary from a fraction of a percent to near certainty in clinical
settings, depending on the age, gender, risk factors, and clinical findings
of the patient. Contrast the prevalence of liver disease in a healthy, young
adult who uses no drugs, illicit or otherwise, and consumes only occasional
alcohol, with that of a jaundiced intravenous drug user. By current
standards, clinicians are not particularly interested in tests with sensitivities
and specificities much below 50%, but if both sensitivity and specificity
are 99%, the test is considered a great one. In other words, in practical
terms sensitivity and specificity rarely vary more than twofold.
INCREASING THE PREVALENCE OF DISEASE
Considering the relationship between the predictive value of a test and
prevalence, it is obviously to the physician's advantage to apply diagnostic
tests to patients with an increased likelihood of having the disease being
sought. In fact as Figure 3.8 shows, diagnostic tests are most helpful when
the presence of disease is neither very likely nor very unlikely.
There are a variety of ways in which the probability of a disease can
be increased before using a diagnostic test.
Referral Process
The referral process is one of the most common ways in which the
probability of disease is increased. Referral to teaching hospital wards,
clinics, and emergency departments increases the chance that significant
disease will underlie patients' complaints. Therefore, relatively more ag-
gressive use of diagnostic tests might be justified in these settings. In pri-
mary care practice, on the other hand, and particularly among patients
without complaints, the chance of finding disease is considerably smaller,
and tests should be used more sparingly.
Example While practicing in a military clinic, one of the authors saw
hundreds of people with headache, rarely ordered diagnostic tests, and never
encountered a patient with a severe underlying cause of headache. (It is
unlikely that important conditions were missed because the clinic was
virtually the only source of medical care for these patients and prolonged
follow-up was available.) However, during the first week back in a medical
residency, a patient visiting the hospital's emergency department because of a
headache similar to the ones managed in the military was found to have a
cerebellar abscess!
Because clinicians may work at different extremes of the prevalence spec-
trum at various times in their clinical practices, they should bear in mind
Because of this effect, physicians must interpret similar test results dif-
ferently in different clinical situations. A negative stress test in an asymp-
tomatic 35-year-old man merely confirms the already low probability of
coronary artery disease, but a positive test usually will be misleading if it
is used to search for unsuspected disease, as has been done among joggers,
airline pilots, and business executives. The opposite applies to the
65-year-old man with typical angina. In this case, the test may be helpful in
confirming disease but not in excluding disease. The test is most useful in
intermediate situations, in which prevalence is neither very high nor very
low. For example, a 60-year-old man with atypical chest pain has a 67%
chance of coronary artery disease before stress testing (see Fig. 3.9); but
afterward, with greater than 2.5 mm ST segment depression, he has a 99%
probability of coronary disease.
[Figure 3.9. Bar chart on a scale of 0 to 100 for ST segment depression of 0.5-1.0 mm and 2.5 mm in four groups: age 30-39, no symptoms; age 60-69, no symptoms; age 60-69, atypical angina; age 60-69, typical angina.]
Because prevalence of disease is such a powerful determinant of how
useful a diagnostic test will be, clinicians must consider the probability of
disease before ordering a test. Until recently, clinicians relied on clinical
observations and their experience to estimate the pretest probability of a
disease. Research using large clinical computer data banks now provide
express how many times more (or less) likely a test result is to be found
in diseased, compared with nondiseased, people. If a test is dichotomous
(positive/negative), two types of likelihood ratios describe its ability to
discriminate between diseased and nondiseased people: one is associated
with a positive test and the other with a negative test (see Fig. 3.2).
In the pharyngitis example (see Fig. 3.3), the data can be used to calculate
likelihood ratios for streptococcal pharyngitis in the presence of a positive
or negative test (clinical diagnosis). A positive test is about 2.5 times more
likely to be made in the presence of streptococcal pharyngitis than in the
absence of it. If the clinicians believed streptococcal pharyngitis was not
present, the likelihood ratio for this negative test was 0.39; the odds were
about 1:2.6 that a negative clinical diagnosis would be made in the presence
of streptococcal pharyngitis compared with the absence of the disease.
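For a dichotomous test, the two likelihood ratios follow from sensitivity and specificity. The sketch below is illustrative only: the 90%/80% test is hypothetical (the pharyngitis counts from Figure 3.3 are not reproduced here), and the last line simply restates the negative likelihood ratio of 0.39 quoted above as odds.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR for a positive result, LR for a negative result)."""
    lr_positive = sensitivity / (1 - specificity)  # P(T+ | disease) / P(T+ | no disease)
    lr_negative = (1 - sensitivity) / specificity  # P(T- | disease) / P(T- | no disease)
    return lr_positive, lr_negative


# Hypothetical test: 90% sensitive, 80% specific.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")

# A likelihood ratio of 0.39 read as odds of "1 to x".
print(f"0.39 corresponds to odds of about 1:{1 / 0.39:.1f}")
```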
USES OF LIKELIHOOD RATIOS
Pretest probability (prevalence) can be converted to pretest odds using
the formula presented earlier. Likelihood ratios can then be used to convert
pretest odds to posttest odds, by means of the following formula:
Pretest odds × Likelihood ratio = Posttest odds
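The conversion, pretest odds × likelihood ratio = posttest odds, can be sketched step by step. The function names and the pretest probability of 25% are hypothetical; the likelihood ratio of 2.5 is the approximate value quoted above for a positive clinical diagnosis of pharyngitis.

```python
def probability_to_odds(probability):
    """Odds = probability / (1 - probability)."""
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)


def posttest_probability(pretest_probability, likelihood_ratio):
    """Pretest probability -> pretest odds -> posttest odds -> posttest probability."""
    posttest_odds = probability_to_odds(pretest_probability) * likelihood_ratio
    return odds_to_probability(posttest_odds)


# Hypothetical patient: pretest probability 25%, positive result with LR = 2.5.
print(f"posttest probability: {posttest_probability(0.25, 2.5):.1%}")
```

A likelihood ratio of 1 leaves the probability unchanged, which is a quick check on any implementation.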
Example How accurate is serum thyroxine (T4) alone as a test for
hypothyroidism? This question was addressed in a study of 120 ambulatory
general medical patients suspected of having hypothyroidism (12). Patients were
diagnosed as being hypothyroid if serum thyrotropin (TSH) was elevated
and if subsequent evaluations, including other thyroid tests and response to
treatment, were consistent with hypothyroidism. The authors studied the
initial T4 level in 27 patients with hypothyroidism and 93 patients who were
found not to have it to determine how accurately the simple test alone might
have diagnosed hypothyroidism.
As expected, likelihood ratios for hypothyroidism were highest for low
levels of T4 and lowest for high levels (Table 3.2). The lowest values in the
distribution of T4 (<4.0 µg/dL) were only seen in patients with
hypothyroidism, i.e., these levels ruled in the diagnosis. The highest levels
(>8.0 µg/dL) were not seen in patients with hypothyroidism, i.e., the presence
of these levels ruled out the disease.
The authors concluded that "it may be possible to achieve cost savings
without loss of diagnostic accuracy by using a single total T4 measurement
for the initial evaluation of suspected hypothyroidism in selected
patients."
The likelihood ratio has several other advantages over sensitivity and
specificity as a description of test performance. The information
contributed by the test is summarized in one number instead of two. The
calculations necessary for obtaining posttest odds from pretest odds are easy.
Table 3.2
Distribution of Values for Serum Thyroxine in Hypothyroid and Normal Patients,
with Calculation of Likelihood Ratios(a)

Total serum          Hypothyroid     Normal          Likelihood
thyroxine (µg/dL)    number (%)      number (%)      ratio
<1.1                  2 (7.4)                        Ruled in
1.1-2.0               3 (11.1)                       Ruled in
2.1-3.0               1 (3.7)                        Ruled in
3.1-4.0               8 (29.6)                       Ruled in
4.1-5.0               4 (14.8)        1 (1.1)        13.8
5.1-6.0               4 (14.8)        6 (6.5)        2.3
6.1-7.0               3 (11.1)       11 (11.8)       0.9
7.1-8.0               2 (7.4)        19 (20.4)       0.4
8.1-9.0                              17 (18.3)       Ruled out
9.1-10                               20 (21.5)       Ruled out
10.1-11                              11 (11.8)       Ruled out
11.1-12                               4 (4.3)        Ruled out
>12                                   4 (4.3)        Ruled out
Total                27 (100)        93 (100)

(a) From Goldstein BJ, Mushlin AI. Use of a single thyroxine test to evaluate ambulatory medical patients for
suspected hypothyroidism. J Gen Intern Med 1987;2:20-24.
Also, likelihood ratios are particularly well suited for describing the overall
probability of disease when a series of diagnostic tests is used (see below).
Likelihood ratios (LR) also have disadvantages. One must use odds, not
probabilities, and most of us find thinking in terms of odds more difficult
than probabilities. Also, the conversion from probability to odds and back
requires math or the use of a nomogram, which partly offsets the simplicity
of calculating posttest odds using LRs. Finally, for tests with a range of
results, LRs use measures of sensitivity and specificity that are different
from those usually described.
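For a test with a range of results, each stratum has its own likelihood ratio: the proportion of diseased patients with results in that range divided by the proportion of nondiseased patients in the same range. The sketch below recomputes the four finite ratios in Table 3.2 from the patient counts as read from that table; the variable and function names are illustrative.

```python
# (T4 range in µg/dL, hypothyroid patients, normal patients), read from Table 3.2.
STRATA = [
    ("4.1-5.0", 4, 1),
    ("5.1-6.0", 4, 6),
    ("6.1-7.0", 3, 11),
    ("7.1-8.0", 2, 19),
]
HYPOTHYROID_TOTAL = 27
NORMAL_TOTAL = 93


def stratum_likelihood_ratio(hypothyroid, normal):
    """Share of diseased patients in the stratum over share of nondiseased patients."""
    return (hypothyroid / HYPOTHYROID_TOTAL) / (normal / NORMAL_TOTAL)


for t4_range, hypothyroid, normal in STRATA:
    lr = stratum_likelihood_ratio(hypothyroid, normal)
    print(f"T4 {t4_range} µg/dL: likelihood ratio {lr:.1f}")
```

Strata containing only hypothyroid patients (or only normal patients) have no finite ratio, which is why the table labels them "ruled in" or "ruled out."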
Multiple Tests
Because clinicians commonly use imperfect diagnostic tests, with less
than 100% sensitivity and specificity and intermediate likelihood ratios, a
single test frequently results in a probability of disease that is neither very
high nor very low, e.g., somewhere between 10% and 90%. Usually it is
not acceptable to stop the diagnostic process at such a point. Would a
physician or patient be satisfied with the conclusion that the patient has
even a 20% chance of having carcinoma of the colon? Or that an
asymptomatic 35-year-old man with 2.5 mm ST segment depression on a stress test
has a 42% chance of coronary artery disease (see Fig. 3.9)? Even for less
deadly diseases, such as hypothyroidism, tests with intermediate posttest
probabilities are of little help. The physician is ordinarily bound to raise
or lower the probability of disease substantially in such situations-unless,
of course, the diagnostic possibilities are all trivial, nothing could be done
about the result, or the risk of proceeding further is prohibitive. When
these exceptions do not apply, the doctor will want to proceed with
further tests.
When multiple tests are performed and all are positive or all are nega-
tive, the interpretation is straightforward. All too often, however, some
are positive and others are negative. Interpretation is then more compli-
cated. This section discusses the principles by which multiple tests are
applied and interpreted.
Multiple tests can be applied in two general ways (Fig. 3.10). They can
be used in parallel (i.e., all at once), and a positive result of any test is
considered evidence for disease. Or they can be done serially (i.e.,
consecutively), based on the results of the previous test. For serial testing, all
tests must give a positive result for the diagnosis to be made, because the
diagnostic process is stopped when a negative result is obtained.
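The two decision rules can be compared numerically. The sketch below is illustrative only: it assumes the two tests are fully independent of each other (an assumption the chapter itself questions), and tests A and B are hypothetical.

```python
def parallel(sens_a, spec_a, sens_b, spec_b):
    """Combined (sensitivity, specificity) when either positive test counts as positive."""
    sensitivity = 1 - (1 - sens_a) * (1 - sens_b)  # missed only if both tests miss
    specificity = spec_a * spec_b                  # negative only if both are negative
    return sensitivity, specificity


def serial(sens_a, spec_a, sens_b, spec_b):
    """Combined (sensitivity, specificity) when both tests must be positive."""
    sensitivity = sens_a * sens_b                  # detected only if both detect
    specificity = 1 - (1 - spec_a) * (1 - spec_b)  # false positive only if both err
    return sensitivity, specificity


# Hypothetical tests: A is 80% sensitive and 60% specific; B is 90% and 90%.
a_and_b = (0.80, 0.60, 0.90, 0.90)
print("parallel: sensitivity %.2f, specificity %.2f" % parallel(*a_and_b))
print("serial:   sensitivity %.2f, specificity %.2f" % serial(*a_and_b))
```

Under these assumptions, parallel testing raises sensitivity at the expense of specificity, and serial testing does the reverse.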
PARALLEL TESTS
Physicians usually order tests in parallel when rapid assessment is necessary,
as in hospitalized or emergency patients, or for ambulatory patients who cannot
return easily because they have come from a long distance for evaluation.
[Figure 3.10. Serial testing and parallel testing. Surviving labels: increased sensitivity, decreased specificity.]
Table 3.3
Test Characteristics of PSA and Digital Rectal Examination (DRE)(a)
Columns: Sensitivity, Specificity, Positive Predictive Value
(a) PSA and DRE, alone and in combination (parallel and serial testing) in the diagnosis of prostate cancer.
(Adapted from Kramer BS et al. Prostate cancer screening: what we know and what we need to know. Ann
Intern Med 1993;119:914-923.)
more efficient if the test with the highest specificity is used first. Table 3.4
shows the effect of sequence on serial testing. Test A is more specific than
test B, whereas B is more sensitive than A. By using A first, fewer patients
are subjected to both tests, even though equal numbers of diseased patients
are diagnosed, regardless of the sequence of tests. However, if one test is
much cheaper or less risky, it may be more prudent to use it first.
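The comparison in Table 3.4 can be reproduced with simple counts. In the sketch below, the sensitivities and specificities (A: 80% and 90%; B: 90% and 80%) and the population of 200 diseased and 800 nondiseased patients are inferred from the counts in Table 3.4; the function names are illustrative, and the two tests are treated as independent.

```python
def apply_test(diseased, nondiseased, sens_pct, spec_pct):
    """Return (true positives, false positives) among the people tested."""
    true_pos = diseased * sens_pct // 100
    false_pos = nondiseased * (100 - spec_pct) // 100
    return true_pos, false_pos


def serial_sequence(first, second, diseased=200, nondiseased=800):
    """Run two tests in series; only those positive on the first get the second."""
    tp1, fp1 = apply_test(diseased, nondiseased, *first)
    tp2, fp2 = apply_test(tp1, fp1, *second)
    return {"retested": tp1 + fp1, "final_positives": tp2 + fp2, "true_positives": tp2}


TEST_A = (80, 90)  # (sensitivity %, specificity %) implied by the Table 3.4 counts
TEST_B = (90, 80)

print("A then B:", serial_sequence(TEST_A, TEST_B))
print("B then A:", serial_sequence(TEST_B, TEST_A))
```

Both sequences end with the same 160 positives, 144 of them true positives, but starting with the more specific test A sends 240 rather than 340 patients on to a second test.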
Table 3.4
Effect of Sequence in Serial Testing: A Then B versus B Then A(a)

Prevalence of disease: 20% (200 of 1000)

Sequence of testing

Begin with test A                        Begin with test B
             Disease                                  Disease
             +      -      Total                      +      -      Total
A  +         160    80     240           B  +         180    160    340
   -         40     720    760              -         20     640    660
   Total     200    800    1000             Total     200    800    1000

240 patients retested with B             340 patients retested with A
             Disease                                  Disease
             +      -      Total                      +      -      Total
B  +         144    16     160           A  +         144    16     160
   -         16     64     80               -         36     144    180
   Total     160    80     240              Total     180    160    340

(a) Note that in both sequences the same number of patients are identified as diseased (160) and the same
number of true positives (144) are identified. But when test A (with the higher specificity) is used first, fewer
patients are retested. The lower sensitivity of test A does not adversely affect the final result.

ASSUMPTION OF INDEPENDENCE
When multiple tests are used, as discussed above, the accuracy of the
result depends on whether the additional information contributed by each
test is somewhat independent of that already available from the preceding
ones, i.e., the next test does not simply duplicate known information. In
fact, this premise underlies the entire approach to predictive value we have
discussed. However, it seems unlikely that the tests for most diseases
are fully independent of one another. If the assumption that the tests are
completely independent is wrong, calculation of the probability of disease
from several tests would tend to overestimate the tests' value.
SERIAL LIKELIHOOD RATIOS
When a series of tests is used, an overall probability can be calculated,
using the likelihood ratio for each test result, as shown in Figure 3.11. The
prevalence of disease before testing is first converted to pretest odds. As
each test is done, the posttest odds of one becomes the pretest odds for
the next. In the end, a new probability of disease is found that takes into
account the information contributed by all the tests in the series.
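The chaining of odds through a series of tests can be sketched as follows. The pretest probability and the two likelihood ratios are hypothetical, the function name is illustrative, and the calculation assumes the tests are independent of one another.

```python
from functools import reduce


def probability_after_series(pretest_probability, likelihood_ratios):
    """Multiply pretest odds by each test's likelihood ratio in turn."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    # The posttest odds of one test become the pretest odds of the next.
    final_odds = reduce(lambda odds, lr: odds * lr, likelihood_ratios, pretest_odds)
    return final_odds / (1 + final_odds)


# Hypothetical series: pretest probability 20%; test A positive (LR 6.0),
# then test B negative (LR 0.5).
print(f"{probability_after_series(0.20, [6.0, 0.5]):.1%}")
```

Because multiplication is commutative, the order of the tests does not change the final probability, only the number of patients who need the later tests.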
Responsiveness
The clinical status of patients changes continually either in response to
treatment or because of the effects of aging or illness. Clinicians regularly
face the question, "Has my patient improved or deteriorated?" The tests
used to monitor the clinical course (e.g., symptom severity, functional
status) are often somewhat different from those used to diagnose disease,
but the assessment of their performance is very similar.
The ability of changes in the value of a test to identify correctly changes
in clinical status is called its responsiveness. It is conceptually related to
the validity of a diagnostic test, except that the presence or absence of a
meaningful change in clinical status, not the presence or absence of disease,
is the gold standard. The magnitude of a test's responsiveness can be
expressed as sensitivity, specificity, and predictive value or as the area
under the ROC curve.
Example Several self-report measures of health and functional status are
commonly used to monitor the health of populations and evaluate the effects
of treatment. Two such measures are restricted activity days-number of
[Figure 3.11. Use of likelihood ratios in serial testing. Pretest probability is converted to pretest odds. Test A: pretest odds × LR(A) = posttest odds. Test B: pretest odds × LR(B) = posttest odds. The final posttest odds are converted to posttest probability.]
[Figure: curves labeled "Restricted days" and "Health" plotted on a scale of 0 to 100 against 1 - specificity (%).]
Summary
Diagnostic test performance is judged by comparing the results of the
test to the presence of disease in a two-by-two table. All four cells of the
table must be filled. When estimating the sensitivity and specificity of a
new diagnostic test from information in the medical literature, there must
be a gold standard to which the accuracy of the test is compared. The
diseased and nondiseased subjects should both resemble the kinds of
patients for whom the test might be useful in practice. In addition, knowledge
of the final diagnosis should not bias the interpretation of the test results
or vice versa. Changing the cutoff point between normal and abnormal
changes sensitivity and specificity. Likelihood ratios are another way of
describing the accuracy of a diagnostic test.
The predictive value of a test is the most relevant characteristic when
clinicians interpret test results. It is determined not only by sensitivity and
specificity of the test but also by the prevalence of the disease, which may
change from setting to setting. Usually it is necessary to use several tests,
either in parallel or in series, to achieve acceptable diagnostic certainty.
Responsiveness, a test's ability to detect change in clinical status, is also
judged by the same two-by-two table.
REFERENCES
1. Joint National Committee on Detection, Evaluation, and Treatment of High Blood Pressure.
The fifth report of the Joint National Committee on Detection, Evaluation, and
Treatment of High Blood Pressure (JNC V). Arch Intern Med 1993;153:154-183.
2. Catalona WJ, et al. Measurement of prostate-specific antigen in serum as a screening test
for prostate cancer. N Engl J Med 1991;324:1156-1161.
3. Jensen MC, Brant-Zawadzki MN, Obuchowski N, Modic MT, Malkasian D, Ross JS. Magnetic
resonance imaging of the lumbar spine in people without back pain. N Engl J Med
1994;331:69-73.
4. Bartrum RJ Jr, Crow HC, Foote SR. Ultrasonic and radiographic cholecystography. N Engl
J Med 1977;296:538-541.
5. Jones TV, Lindsey BA, Yount P, Soltys R, Farani-Enayat B. Alcoholism screening
questionnaires: are they valid in elderly medical outpatients? J Gen Intern Med 1993;8:674-678.
6. Fletcher RH. Carcinoembryonic antigen. Ann Intern Med 1986;104:66-73.
7. Voss JD. Prostate cancer, screening, and prostate-specific antigen: promise or peril? J Gen
Intern Med 1994;9:468-474.
8. Barry MJ, Mulley AG, Singer DE. Screening for HTLV-III antibodies: the relation between
prevalence and positive predictive value and its social consequences. JAMA 1985;253:3395.
9. Ward JW, Grindon AJ, Feorino PM, Schable C, Parvin M, Allen JR. Laboratory and
epidemiologic evaluation of an enzyme immunoassay for antibodies to HTLV-III. JAMA
1986;256:357-361.
10. Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of
coronary artery disease. N Engl J Med 1979;300:1350-1358.
11. Tierney WM, McDonald CJ. Practice databases and their uses in clinical research. Stat
Med 1991;10:541-557.
12. Goldstein BJ, Mushlin AI. Use of a single thyroxine test to evaluate ambulatory medical
patients for suspected hypothyroidism. J Gen Intern Med 1987;2:20-24.
13. Haddow JE, Palomaki GE, Knight GJ, Williams J, Pulkkinen A, Canick JA, Saller DN Jr,
Bowers GB. Prenatal screening for Down's syndrome with use of maternal serum markers.
N Engl J Med 1992;327:588-593.
14. Wagner EH, LaCroix AZ, Grothaus LC, Hecht JA. Responsiveness of health status
measures to change among older adults. J Am Geriatr Soc 1993;41:241-248.
SUGGESTED READINGS
Cebul RD, Beck LH. Teaching clinical decision making. New York: Praeger, 1985.
Department of Clinical Epidemiology and Biostatistics, McMaster University. Interpretation
of diagnostic data. V. How to do it with simple math. Can Med Assoc J 1983;129:22-29.
(Reprinted in Sackett DL, Haynes RB, Tugwell P, eds. Clinical epidemiology: a basic science
for clinical medicine. Boston: Little, Brown, 1985.)
Fagan TJ. Nomogram for Bayes' theorem [Letter]. N Engl J Med 1975;293:257.
Griner PF, Mayewski RJ, Mushlin AI, Greenland P. Selection and interpretation of diagnostic
tests and procedures. Principles and applications. Ann Intern Med 1981;94:557-600.
Griner PF, Panzer RJ, Greenland P. Clinical diagnosis and the laboratory: logical strategies
for common medical problems. Chicago: Year Book, 1986.
McNeil BJ, Abrams HL, eds. Brigham and Women's Hospital handbook of diagnostic imaging.
Boston: Little, Brown, 1986.
Pauker SG, Kassirer JP. Clinical application of decision analysis: a detailed illustration. Semin
Nucl Med 1978;8:324-335.
Sheps SB, Schechter MT. The assessment of diagnostic tests. A survey of current medical
research. JAMA 1984;252:2418-2422.
Sox HC Jr. Probability theory in the use of diagnostic tests. An introduction to critical study
of the literature. Ann Intern Med 1986;104:60-66.
Sox HC Jr, Blatt M, Higgins MC, Marton KI. Medical decision making. Kent, UK: Butterworth,
1987.
Wasson JH, Sox HC Jr, Neff RK, Goldman L. Clinical prediction rules: applications and
methodological standards. N Engl J Med 1985;313:793-799.
Weinstein MC, Fineberg HV, Elstein AS, Frazier HS, Neuhauser D, Neutra RR, McNeil BJ.
Clinical decision analysis. Philadelphia: WB Saunders, 1980.
4
FREQUENCY
deterioration, cure, or death forms the basis for answering most clinical
questions, this chapter examines measures of clinical frequency.
Figure 4.1. Occurrence of disease in 100 people at risk from 1992 to 1994. (Legend: onset, duration.)
Table 4.1
Characteristics of Incidence and Prevalence
[Figure: bar chart of prevalence (%) for hernia, heart disease, peptic ulcer, diabetes, hypertension, arthritis, asthma/hayfever, and chronic sinusitis.]
Table 4.2
Criteria for Reporting Lyme Disease(a)
(a) Centers for Disease Control and Prevention criteria. (Adapted from U.S. Department of Health and Human
Services. Case definitions for public health surveillance. MMWR 1990;39:19-21.)
over the entire age span, indicating that asthma tends to be chronic and is
especially chronic among older individuals. Also, because the pool of preva-
lent cases does not increase in size, about the same number of patients are
recovering from their asthma as new patients are acquiring it.
If we use the following formula, we can determine that asthma has an
average duration of 10 years:
Average duration = Prevalence ÷ Incidence
When the duration of asthma is determined for each age category by
dividing the prevalences by the incidences, it is apparent that the duration of
asthma increases with increasing age. This reflects the clinical observation
that childhood asthma often clears with time, whereas adult asthma tends
to be more chronic.
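The age-specific durations can be recomputed from the incidence and prevalence figures in Table 4.3. The sketch below uses the rates per 1000 as read from that table (the first two age labels are reconstructed) and relies on the steady-state condition described above, in which the pool of prevalent cases is not changing in size. The oldest age group is omitted because its incidence is recorded as zero.

```python
# (age group, annual incidence per 1000, prevalence per 1000), read from Table 4.3.
ASTHMA = [
    ("0-5", 6, 29),
    ("6-16", 3, 32),
    ("17-44", 2, 26),
    ("45-64", 1, 33),
    ("Total", 3, 30),
]


def average_duration_years(prevalence, annual_incidence):
    """Average duration = prevalence / incidence (steady state assumed)."""
    return prevalence / annual_incidence


for age, incidence, prevalence in ASTHMA:
    years = average_duration_years(prevalence, incidence)
    print(f"{age:>6}: {years:.1f} years")
```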
Table 4.3
The Relationships among Incidence, Prevalence, and Duration of Disease: Asthma
in the United States

Age        Annual        Prevalence    Duration = Prevalence ÷
           incidence                   Annual incidence
0-5        6/1000        29/1000        4.8 years
6-16       3/1000        32/1000       10.7 years
17-44      2/1000        26/1000       13.0 years
45-64      1/1000        33/1000       33.0 years
65+        0             30/1000       33.0 years
Total      3/1000        30/1000       10.0 years
Figure 4.4. Temporal relationship between possible causal factors and disease for
incidence and prevalence studies. (Incidence study: measurement is development of new cases of disease over time. Prevalence study: measurement is past or present exposure to possible causes.)
Figure 4.5. The difference in cases for incidence and prevalence studies. (Labels: enter population; early deaths; cures; leave population: severe disease, mild disease, prefer other care, etc.)
Example The therapeutic options facing the older man with urinary
symptoms from benign prostatic hyperplasia (described at the opening of
this chapter) have been evaluated using decision analysis (15). Before drugs
and laser prostatectomy made the decision more complicated, the options
were surgery (transurethral resection of the prostate, TURP) or careful follow-
up, called "watchful waiting." Figure 4.6 shows the decision tree that the
authors used to evaluate the options. The frequencies of the various outcomes
were derived in the incidence study of New England men described earlier
in the chapter (12) and other published sources (15). Note that the optimal
decision in this case is surgery (net utility 0.94). In this case, TURP is the
favored treatment because the risk of operative death is low and the utilities
assigned to incontinence or impotence are the same as that assigned to living
with stable moderate urinary symptoms. If stable moderate symptoms were
preferred over incontinence or impotence, the balance would shift.
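The arithmetic behind a decision tree is a "fold-back" of expected utility: each chance node is worth the probability-weighted average of its branches, and the option with the highest value is favored. The sketch below shows only the mechanics. The tree structure, probabilities, and utilities are hypothetical and are not the values from Figure 4.6 or the cited analysis.

```python
def expected_utility(node):
    """Fold back a tree: a leaf is a utility; a chance node is a list of (probability, subtree)."""
    if isinstance(node, (int, float)):
        return float(node)
    return sum(probability * expected_utility(child) for probability, child in node)


# Hypothetical options (utilities on a 0-to-1 scale; 0 = death, 1 = full health).
surgery = [
    (0.01, 0.0),                       # operative death
    (0.99, [(0.9, 1.0), (0.1, 0.8)]),  # survive: symptoms relieved, or a lasting complication
]
watchful_waiting = [
    (0.5, 0.9),                        # symptoms stay mild
    (0.5, 0.8),                        # symptoms persist at a moderate level
]

options = {"surgery": surgery, "watchful waiting": watchful_waiting}
for name, tree in options.items():
    print(f"{name}: net utility {expected_utility(tree):.3f}")
print("favored option:", max(options, key=lambda name: expected_utility(options[name])))
```

Changing a single utility, for example the value assigned to living with a complication, can reverse the ranking, which is the sensitivity the example describes.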
[Figure 4.6. Decision tree. Surviving labels for the TURP branch: survive (0.99); die (0.01); no complications (0.94); incontinent or impotent (0.06); symptoms improved (0.80); symptoms not improved (0.20).]
Summary
Most clinical questions are answered by reference to the frequency of
events under varying circumstances. The frequency of clinical events is
indicated by probabilities or fractions, the numerators of which include
the number of cases and the denominators of which include the number
of people from whom the cases arose.
There are two measures of frequency: prevalence and incidence. Preva-
lence is the proportion of a group with the disease at a single point in
time. Incidence is the proportion of a susceptible group that develops new
cases of the disease over an interval of time.
Prevalence is measured by a single survey of a group containing cases
and noncases, whereas measurement of incidence requires examinations
of a previously disease-free group over time. Thus prevalence studies iden-
tify only those cases who are alive and diagnosable at the time of the
survey, whereas cohort (incidence) studies ascertain all new cases. Preva-
lent cases, therefore, may be a biased subset of all cases because they do
not include those who have already succumbed or been cured. In addition,
prevalence studies frequently do not permit a clear understanding of the
temporal relationship between a causal factor and a disease.
To make sense of incidence and prevalence, the clinician must under-
stand the basis on which the disease is diagnosed and the characteristics
of the population represented in the denominator. The latter is of particular
importance in trying to decide if a given measure of incidence or prevalence
pertains to patients in one's own practice.
Postscript
Counting clinical events as described in this chapter may seem to be
the most mundane of tasks. It seems so obvious that examining counts of
clinical events under various circumstances is the foundation of clinical
science. It may be worth reminding the reader that Pierre Louis introduced
the "numerical method" of evaluating therapy less than 200 years ago.
Louis had the audacity to count deaths and recoveries from febrile illness
in the presence and absence of blood-letting. He was vilified for allowing
lifeless numbers to cast doubt on the healing powers of the leech, powers
that had been amply confirmed by decades of astute qualitative clinical
observation.
REFERENCES
1. Bryant GD, Norman GR. Expressions of probability: words and numbers. N Engl J Med
1980;302:411.
2. Toogood JH. What do we mean by "usually"? Lancet 1980;1:1094.
3. McConnell JD, Barry MJ, Bruskewitz RC. Benign prostatic hyperplasia: diagnosis and
treatment. Clin Pract Guidel Quick Ref Guide Clin 1994;8:1-17.
4. Friedman GD. Medical usage and abusage, "prevalence" and "incidence." Ann Intern
Med 1976;84:502-503.
5. O'Connor DW, Pollitt PA, Hyde JB, Fellows JL, Miller ND, Brook CPB, Reiss BB, Roth M.
The prevalence of dementia as measured by the Cambridge Mental Disorders of the
Elderly Examination. Acta Psychiatr Scand 1989;79:190-198.
6. Paykel ES, Brayne C, Huppert FA, Gill C, Barkley C, Gehlhaar E, Beardsall L, Girling
DM, Pollitt P, O'Connor D. Incidence of dementia in a population older than 75 years in
the United Kingdom. Arch Gen Psychiatry 1994;51:325-332.
7. Sanders BS. Have morbidity surveys been oversold? Am J Public Health 1962;52:1648-
1659.
8. Case definitions for public health surveillance. MMWR 1990;39(RR-13):19-21.
9. Matteson EL, Beckett VL, O'Fallon WM, Melton LJ III, Duffy J. Epidemiology of Lyme
disease in Olmsted County, MN, 1975-1990. J Rheumatol 1992;19:1743-1745.
10. Spitzer WO, Harth M, Goldsmith CH, Norman GR, Dickie GL, Bass MJ, Newell JP. The
arthritic complaint in primary care: prevalence, related disability, and costs. J Rheumatol
1976;3:88-99.
11. Sedlack RE, Whisnant J, Elveback LR, Kurland LT. Incidence of Crohn's disease in Olmsted
County, Minnesota, 1935-1975. Am J Epidemiol 1980;112:759-763.
12. Fowler FJ, Wennberg JE, Timothy RP, Barry MJ, Mulley AG, Hanley D. Symptom status
and quality of life following prostatectomy. JAMA 1988;259:3018-3022.
13. Tompkins RK, Burnes DC, Cable WE. An analysis of the cost-effectiveness of pharyngitis
management and acute rheumatic fever prevention. Ann Intern Med 1977;86:481-492.
14. Sox HC, Blatt MA, Higgins MC, Marton KI. Medical decision making. Stoneham, MA:
Butterworth, 1988.
15. Barry MJ, Mulley AG, Fowler FJ, Wennberg JE. Watchful waiting vs. immediate
transurethral resection for symptomatic prostatism. JAMA 1988;259:3010-3017.
SUGGESTED READINGS
Ellenberg JH, Nelson KB. Sample selection and the natural history of disease: studies of febrile
seizures. JAMA 1980;243:1337-1340.
Friedman GD. Medical usage and abusage, "prevalence" and "incidence." Ann Intern Med
1976;84:502-503.
Morgenstern H, Kleinbaum DG, Kupper LL. Measures of disease incidence used in
epidemiologic research. Int J Epidemiol 1980;9:97-104.
5
RISK
Risk Factors
Characteristics that are associated with an increased risk of becoming
diseased are called risk factors. Some risk factors are inherited. For example,
having the haplotype HLA-B27 greatly increases one's risk of acquiring the
spondylarthropathies. Work on the Human Genome Project has identified
several other diseases for which specific genes are risk factors, including
colon cancer, osteoporosis, and amyotrophic lateral sclerosis. Other risk
factors, such as infectious agents, drugs, and toxins, are found in the physi-
cal environment. Still others are part of the social environment. For exam-
ple, bereavement due to the loss of a spouse, change in daily routines,
and crowding all have been shown to increase rates of disease-not only
emotional illness but physical illness as well. Some of the most powerful
risk factors are behavioral; examples are smoking, drinking alcohol to ex-
cess, driving without seat belts, and engaging in unsafe sex.
Exposure to a risk factor means that a person has, before becoming ill,
come in contact with or has manifested the factor in question. Exposure
can take place at a single point in time, as when a community is exposed
Recognizing Risk
Large risks associated with effects that occur rapidly after exposure are
easy for anyone to recognize. Thus it is not difficult to appreciate the
relationship between exposure and disease for conditions such as chick-
enpox, sunburn, and aspirin overdose, because these conditions follow
exposure relatively rapidly and with obvious effects. But most morbidity
and mortality is caused by chronic diseases. For these, relationships be-
tween exposure and disease are far less obvious. It becomes virtually im-
possible for individual clinicians, however astute, to develop estimates of
risk based on their own experiences with patients. This is true for several
reasons, which are discussed below.
LONG LATENCY
Many diseases have long latency periods between exposure to risk
factors and the first manifestations of disease. This is particularly true
for certain cancers, such as thyroid cancer in adults after radiation treat-
ment for childhood tonsillitis. When patients experience the conse-
quence of exposure to a risk factor years later, the original exposure
may be all but forgotten. The link between exposure and disease is
thereby obscured.
FREQUENT EXPOSURE TO RISK FACTORS
Many risk factors, such as cigarette smoking or eating a diet high in
cholesterol and saturated fats, are so common in our society that for many
years they scarcely seemed dangerous. Only by comparing patterns of
disease among people with and without these risk factors or by investigating
special subgroups-e.g., Mormons (who do not smoke) and vegetarians
(who eat diets low in cholesterol)-did we recognize risks that are, in
fact, large.
LOW INCIDENCE OF DISEASE
Most diseases, even ones thought to be "common," are actually quite
rare. Thus, although lung cancer is the most common cause of cancer
deaths in Americans, the yearly incidence of lung cancer even in heavy
smokers is less than 2 in 1000. In the average physician's practice, years
may pass between patients with new cases of lung cancer. It is difficult to
draw conclusions about such infrequent events.
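The arithmetic behind this point can be sketched in a few lines. The incidence of 2 per 1000 per year is the upper bound quoted above; the practice size of 200 heavy smokers is a hypothetical figure chosen only for illustration.

```python
# Expected new lung cancer cases per year in a single practice.
# YEARLY_INCIDENCE is the upper bound quoted in the text for heavy smokers.
YEARLY_INCIDENCE = 2 / 1000


def expected_new_cases(heavy_smokers: int, incidence: float = YEARLY_INCIDENCE) -> float:
    """Expected number of new cases per year among the given number of patients."""
    return heavy_smokers * incidence


def years_between_cases(heavy_smokers: int, incidence: float = YEARLY_INCIDENCE) -> float:
    """Average number of years between new cases."""
    return 1 / expected_new_cases(heavy_smokers, incidence)


# Hypothetical practice caring for 200 heavy smokers (an assumed figure).
print(expected_new_cases(200))   # fewer than one new case per year
print(years_between_cases(200))  # a new case only every few years on average
```

Even with an incidence at the upper bound, a clinician in this assumed practice would see too few cases to judge the size of the risk from personal experience.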
SMALL RISK
If a factor confers only a small risk for a disease, a large number of
people are required to observe a difference in disease rates between
exposed and unexposed persons. This is so even if both the risk factor
and the disease occur relatively frequently. For example, it is still uncertain
whether birth control pills increase the risk of breast cancer, because
estimates of this risk are all small and, therefore, easily discounted as
resulting from bias or chance. In contrast, it is not controversial that
hepatitis B infection is a risk factor for hepatoma, because people with
hepatitis B infection are hundreds of times more likely to get liver cancer
than those without it.
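A rough calculation shows why small risks demand large studies. The sketch below uses a standard normal-approximation formula for comparing two proportions with equal group sizes; the baseline risk of 1%, the significance level, and the power are assumptions made for illustration and are not taken from the studies mentioned above.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_group(p_unexposed: float, relative_risk: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects needed in each group to detect the difference
    between two proportions (two-sided test, normal approximation).
    relative_risk must differ from 1, otherwise there is nothing to detect."""
    p_exposed = p_unexposed * relative_risk
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_unexposed * (1 - p_unexposed) + p_exposed * (1 - p_exposed)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_exposed - p_unexposed) ** 2)


# Hypothetical disease with a 1% risk in unexposed people.
weak_factor = subjects_per_group(0.01, relative_risk=1.2)
strong_factor = subjects_per_group(0.01, relative_risk=10.0)
print(weak_factor, strong_factor)
```

Under these assumptions the weak risk factor requires tens of thousands of people per group, whereas the strong one requires on the order of a hundred.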
COMMON DISEASE
If a disease is common (heart disease, cancer, or stroke) and some of
the risk factors for it are already known, it becomes difficult to distinguish
a new risk factor from the others. Also, there is less incentive to look for
new risk factors. For example, the syndrome of sudden, unexpected death
in adults is a common way to die. Many cases seem related to coronary
heart disease. However, it is entirely conceivable that there are other im-
portant causes, as yet unrecognized because an adequate explanation for
most cases is available.
On the other hand, rare diseases and unusual clinical presentations
invite efforts to find a cause. AIDS was such an unusual syndrome that
the appearance of just a few cases raised suspicion that some new agent
(as it turned out, a retrovirus) might be responsible. Similarly, physicians
were quick to notice when several cases of carcinoma of the vagina, a very
rare condition, began appearing. A careful search for an explanation was
undertaken, and maternal exposure to diethylstilbestrol was found.
MULTIPLE CAUSES AND EFFECTS
[Figure 5.1 diagram labels: Other; Coronary atherosclerosis; Stroke; Renal failure; Myocardial infarction.]
Figure 5.1. Relationship between risk factors and disease: hypertension (↑BP)
and congestive heart failure (CHF). Hypertension causes many diseases, including
congestive heart failure, and congestive heart failure has many causes, including
hypertension.
develop congestive heart failure and many do not. Also, many people who
do not have hypertension develop congestive heart failure, because there
are several different causes. The relationship is also obscured because hy-
pertension causes several diseases other than congestive heart failure.
Thus, although people with hypertension are about 3 times more likely
than those without hypertension to develop congestive heart failure and
hypertension is the leading cause of the condition, physicians were not
particularly attuned to this relationship until the 1970s, when adequate
data became available after careful study of large numbers of people over
many years.
For all these reasons, individual clinicians are rarely in a position to
confirm associations between exposure and disease, though they may suspect
them. For accurate information, they must turn to the medical literature,
particularly to studies that are carefully constructed and involve a
large number of patients.
Uses of Risk
PREDICTION
Risk factors are used, first and foremost, to predict the occurrence of
disease. In fact, risk factors, by definition, predict some future event. The
best available information for predicting disease in an individual person
is past experience with a large number of people with a similar risk factor.
The quality of such predictions depends on the similarity of the people
on whom the estimate is based and the person for whom the prediction
is made.
It is important to keep in mind that the presence of even a strong risk
factor does not mean that an individual is very likely to get the disease.
For example, studies have shown that a heavy smoker has a 20-fold greater
risk of lung cancer compared with nonsmokers, but he or she still has only
a 1 in 100 chance of getting lung cancer in the next 10 years.
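The two figures in this example can be put side by side in a short sketch. It assumes that the 20-fold relative risk and the 1 in 100 chance refer to the same 10-year period and the same comparison group, which the text does not state explicitly.

```python
def baseline_risk(risk_in_exposed: float, relative_risk: float) -> float:
    """Risk in unexposed people implied by the risk in exposed people
    and the relative risk."""
    return risk_in_exposed / relative_risk


smoker_10yr_risk = 1 / 100   # heavy smoker, from the text
relative_risk = 20           # heavy smoker versus nonsmoker, from the text

nonsmoker_10yr_risk = baseline_risk(smoker_10yr_risk, relative_risk)
chance_smoker_stays_free = 1 - smoker_10yr_risk

print(nonsmoker_10yr_risk)       # implied 10-year risk in a nonsmoker
print(chance_smoker_stays_free)  # chance the heavy smoker does not get lung cancer
```

On these assumptions a strong relative risk coexists with a small absolute risk: the heavy smoker is still far more likely than not to remain free of lung cancer over the decade.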
There is a basic incompatibility between the incidence of a disease in
groups of people and the chance that an individual will contract that
disease. Quite naturally, both patients and clinicians would like to answer
questions about the future occurrence of disease as precisely as possible.
They are uncomfortable about assigning a probability, such as the chances
that a person will get lung cancer or stroke in the next 5 years. Moreover,
any one person will, at the end of 5 years, either have the disease or not.
So in a sense, the average is always wrong because the two are expressed
in different terms, a probability versus the presence or absence of disease.
Nevertheless, probabilities can guide clinical decision making. Even if a
prediction does not come true in an individual patient, it will usually be
borne out in many such patients.
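A small simulation illustrates the distinction between a probability and an individual outcome. The probability of 0.01, the cohort size, and the random seed are all hypothetical choices; the function name is likewise introduced only for this sketch.

```python
import random


def simulate_cohort(n_patients: int, probability: float, seed: int = 0) -> list[int]:
    """Return one outcome per patient: 1 if disease develops, 0 if it does not."""
    rng = random.Random(seed)
    return [1 if rng.random() < probability else 0 for _ in range(n_patients)]


# Hypothetical cohort in which every patient has the same predicted risk.
outcomes = simulate_cohort(n_patients=100_000, probability=0.01)
observed_proportion = sum(outcomes) / len(outcomes)

print(set(outcomes))        # each individual outcome is all or nothing
print(observed_proportion)  # the group as a whole lands near the predicted risk
```

No single patient experiences "1% of a disease," yet the proportion affected in the whole group comes close to the prediction.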
CAUSE
Just because risk factors predict disease, it does not necessarily follow
that they cause disease. A risk factor may mark a disease outcome indirectly,
by virtue of an association with some other determinant(s) of disease,
i.e., it may be confounded with a causal factor. For example, lack of
maternal education is a risk factor for low birth weight infants. Yet, other
factors related to education, such as poor nutrition, less prenatal care,
cigarette smoking, etc., are more directly the causes of low birth weight.
A risk factor that is not a cause of disease is called a marker, because it
"marks" the increased probability of disease. Not being a cause does not
diminish the value of a risk factor as a way of predicting the probability
of disease, but it does imply that removing the risk factor might not remove
the excess risk associated with it. For example, as pointed out in Chapter
1, although there is growing evidence that the human papillomavirus
(HPV) is a risk factor for cervical cancer, the role of other sexually transmitted
diseases, such as herpes simplex virus and Chlamydia, is not as clear.
Antibodies to these agents are more common among patients with cervical
cancer than in women without cancer, but the agents may be markers for
risk of cervical cancer rather than causes. If so, curing them would not
necessarily prevent cervical cancer. On the other hand, decreasing promis-
cuity might prevent the acquisition of both the causative agent for cervical
cancer and other sexually transmitted diseases (2).
There are several ways of deciding whether a risk factor is a cause or
merely a marker for disease. These are covered in Chapter 11.
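The idea of a marker can be made concrete with invented numbers. In the sketch below, which loosely follows the maternal education example, all counts are hypothetical and are constructed so that smoking, not education, determines the risk of low birth weight; education is associated with the outcome only because less educated mothers smoke more often.

```python
# Hypothetical cohort: (education, smokes) -> (number of mothers, low-birth-weight infants).
# Risk is 20% in smokers and 5% in nonsmokers at both education levels.
cohort = {
    ("low", True): (600, 120),
    ("low", False): (400, 20),
    ("high", True): (200, 40),
    ("high", False): (800, 40),
}


def risk(groups):
    """Incidence of low birth weight pooled over (mothers, cases) pairs."""
    mothers = sum(n for n, _ in groups)
    cases = sum(c for _, c in groups)
    return cases / mothers


def relative_risk_by_education(smokes=None):
    """Risk in low-education mothers divided by risk in high-education mothers.
    With smokes=None all mothers are pooled; otherwise only that smoking
    stratum is used."""
    def select(education):
        return [v for (edu, s), v in cohort.items()
                if edu == education and (smokes is None or s == smokes)]
    return risk(select("low")) / risk(select("high"))


crude = relative_risk_by_education()
among_smokers = relative_risk_by_education(smokes=True)
among_nonsmokers = relative_risk_by_education(smokes=False)
print(crude, among_smokers, among_nonsmokers)
```

In this constructed cohort, low education predicts low birth weight when all mothers are pooled, but the association disappears within each smoking stratum. Education would still be useful for prediction, but changing it alone would not remove the excess risk.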
DIAGNOSIS
Knowledge of risk can be used in the diagnostic process, since the pres-
ence of a risk factor increases the prevalence (probability) of disease among
patients-one way of improving the positive predictive value of a diagnos-
tic test.
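The effect of prevalence on positive predictive value follows from Bayes' theorem and can be sketched as follows. The sensitivity, specificity, and the two prevalences are hypothetical values chosen for illustration.

```python
def positive_predictive_value(prevalence: float, sensitivity: float,
                              specificity: float) -> float:
    """Probability of disease given a positive test result (Bayes' theorem)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Same hypothetical test (90% sensitive, 90% specific) in two groups of patients.
without_risk_factor = positive_predictive_value(0.01, 0.90, 0.90)
with_risk_factor = positive_predictive_value(0.10, 0.90, 0.90)
print(without_risk_factor, with_risk_factor)
```

With these assumed values, the same positive result is far more convincing in the group whose risk factor has raised the prevalence of disease.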
However, in individual patients, risk factors usually are not as strong
predictors of disease as are clinical findings of early disease. As Rose (3)
put it:
Often the best predictor of future major diseases is the presence of existing
minor disease. A low ventilatory function today is the best predictor of its
future rate of decline. A high blood pressure today is the best predictor of
its future rate of rise. Early coronary heart disease is better than all of the
conventional risk factors as a predictor of future fatal disease.
Risk factors can provide the most help with diagnosis in situations
where the factor confers a substantial risk and the prevalence of the disease
is increased by clinical findings. For example, age and sex are relatively
strong risk factors for coronary artery disease, yet the prevalence of disease
in the most at risk age and sex group, old men, is only 12%. When specifics
of the clinical situation, such as presence and type of chest pain and results
of an electrocardiographic stress test, are considered as well, the
prevalence of coronary disease can be raised to 99% (4).
More often, it is helpful to use the absence of a risk factor to help rule
out disease, particularly when one factor is strong and predominant. Thus
it is reasonable to consider mesothelioma in the differential diagnosis of a
pleural mass in a patient who is an asbestos worker, but mesothelioma is
a much less likely diagnosis for the patient who has never worked with
asbestos.
Knowledge of risk factors is also used to improve the efficiency of
screening programs by selecting subgroups of patients at increased risk.
PREVENTION
If a risk factor is also a cause of disease, its removal can be used to
prevent disease whether or not the mechanism by which the disease takes
place is known. Some of the classic successes in the history of epidemiology
illustrate this point. For example, before bacteria were identified, Snow
found an increased rate of cholera among people drinking water supplied
by a particular company and controlled an epidemic by cutting off that
supply. More recently, even before HIV had been identified, studies
showed that a lifestyle of multiple sexual partners among homosexual men
was a risk factor for acquiring AIDS (5). The concept of cause and its
relationship to prevention is discussed in Chapter 11.
Studies of Risk
The most powerful way of determining whether exposure to a potential
risk factor results in an increased risk of disease is to conduct an experi-
ment. People currently without disease would be divided into groups of
equal susceptibility to the disease in question. One group would be ex-
posed to the purported risk factor and the other would not, but the groups
would otherwise be treated the same. Later, any difference in observed
rates of disease in the groups could be attributed to the risk factor.
Unfortunately, the effects of most risk factors for humans cannot be
studied with experimental studies, in which the researcher determines who
is exposed. Consider some of the questions of risk that concern us today.
How much are inactive people at increased risk for cardiovascular disease,
everything else being equal? Do cellular phones cause brain cancer? Does
alcohol increase the risk of breast cancer? For such questions as these, it
is usually not possible to conduct an experiment. First, the experiment
would have to go on for decades. Second, it would be unethical to impose
possible risk factors on a group of the people in the study. Finally, most
people would balk at having their diets and behaviors determined by
others for long periods of time. As a result, it is usually necessary to study
risk in less obtrusive ways.
Clinical studies in which the researcher gathers data by simply observ-
ing events as they happen, without playing an active part in what takes
place, are called observational studies. Most studies of risk are observational
studies, either cohort studies, described in the rest of this chapter, or case
control studies, described in Chapter 10.
COHORTS
The term cohort is used to describe a group of people who have some-
thing in common when they are first assembled and who are then observed
for a period of time to see what happens to them. Table 5.1 lists some of
the ways in which cohorts are used in clinical research. Whatever members
of a cohort have in common, observations of them should fulfill two criteria
if they are to provide sound information about risk.
First, cohorts should be observed over a meaningful period of time in
the natural history of the disease in question. This is so there will be
Table 5.1
Cohorts and Their Purposes

Characteristic in Common    To Assess Effect of    Example
Age                         Age                    Life expectancy for people age 70 (regardless of when born)
Date of birth               Calendar time          Tuberculosis rates for people born in 1910
Exposure                    Risk factor            Lung cancer in people who smoke
Disease                     Prognosis              Survival rate for patients with breast cancer
Preventive intervention     Prevention             Reduction in incidence of pneumonia after pneumococcal vaccination
Therapeutic intervention    Treatment              Improvement in survival for patients with Hodgkin's disease given combination chemotherapy
Example The Framingham Study (6) was begun in 1949 to identify factors
associated with an increased risk of coronary heart disease (CHD). A
representative sample of 5,209 men and women, aged 30-59, was selected
from approximately 10,000 persons of that age living in Framingham, a small
town near Boston. Of these, 5,127 were free of CHD when first examined and,
therefore, were at risk of developing CHD. These people were reexamined
biennially for evidence of coronary disease. The study ran for 30 years and
demonstrated that risk of developing CHD is associated with elevated blood
pressure, high serum cholesterol, cigarette smoking, glucose intolerance, and
left ventricular hypertrophy. There was a large difference in risk between
those with none and those with all of these risk factors.
Cohort studies can be conducted in two ways (Fig. 5.3). The cohort can
be assembled in the present and followed into the future (a concurrent
cohort study), or it can be identified from past records and followed forward
from that time up to the present (an historical cohort study).
Most of the advantages and disadvantages of cohort studies discussed
below apply whether the study is concurrent or historical. However, the
potential for difficulties with the quality of data is different for the two.
In concurrent studies, data can be collected specifically for the purposes
of the study and with full anticipation of what is needed. It is thereby
possible to avoid biases that might undermine the accuracy of the data.
On the other hand, data for historical cohorts are often gathered for other
purposes-usually as part of medical records for patient care. These data
may not be of sufficient quality for rigorous research.
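[Figure 5.3. Time line (past, present, future) for the two kinds of cohort study. Historical cohort: cohort assembled in the past, with follow-up to the present. Concurrent cohort: cohort assembled in the present, with follow-up into the future.]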
Table 5.2
Advantages and Disadvantages of Cohort Studies

Advantages
  The only way of establishing incidence (i.e., absolute risk) directly
  Follows the same logic as the clinical question: If persons exposed, then do they get the disease?
  Exposure can be elicited without the bias that might occur if outcome were already known
  Can assess the relationship between exposure and many diseases

Disadvantages
  Inefficient because many more subjects must be enrolled than experience the event of interest; therefore, cannot be used for rare diseases
  Expensive because of resources necessary to study many people over time
  Results not available for a long time
  Assesses the relationship between disease and exposure to only relatively few factors (i.e., those recorded at the outset of the study)
study of its kind when it began. Nevertheless, more than 5000 people had
to be followed for several years before the first, preliminary conclusions
could be published. Only 5% of the people had experienced a coronary
event during the first 8 years!
A related problem with cohort studies results from the fact that the
people being studied are usually "free living" and not under the control
of researchers. A great deal of effort and money must be expended to keep
track of them. Cohort studies, therefore, are expensive, sometimes costing
many millions of dollars.
Because of the time and money required for cohort studies, this ap-
proach cannot be used for all clinical questions about risk. For practical
reasons, the cohort approach has been reserved for only the most important
questions. This has led to efforts to find more efficient, yet dependable,
ways of assessing risk. (The most common of these, case control studies,
is discussed in Chapter 10.)
The most important scientific disadvantage of observational studies,
including cohort studies, is that they are subject to a great many more
potential biases than are experiments. People who are exposed to a certain
risk factor in the natural course of events are likely to differ in a great
many ways from a comparison group of people not exposed to the factor.
If these other differences are also related to the disease in question, they
could account for any association observed between the putative risk factor
and the disease.
This leads to the main challenge of observational studies: to deal with
extraneous differences between exposed and nonexposed groups to mimic
as closely as possible an experiment. The differences are considered "extraneous"
from the point of view of someone trying to determine cause-and-effect
relationships. The following example illustrates one approach
to handling such differences.
were followed from birth to 3-5 years old. No differences in growth and
development were found.
Comparing Risks
The basic expression of risk is incidence, defined in Chapter 4 as the
number of new cases of disease arising in a defined population during a
given period of time. But usually we want to compare the incidence of
disease in two or more groups in a cohort that differ in exposure to a
possible risk factor. To compare risks, several measures of the association
between exposure and disease, called measures of effect, are commonly used.
They represent different concepts of risk and are used for different purposes.
Four measures of effect are discussed below (Tables 5.3 and 5.4).
ATTRIBUTABLE RISK
First, one might ask, "What is the additional risk (incidence) of disease
following exposure, over and above that experienced by people who are
not exposed?" The answer is expressed as attributable risk, the incidence
of disease in exposed persons minus the incidence in nonexposed persons.
Attributable risk is the additional incidence of disease related to exposure,
taking into account the background incidence of disease, presumably from
other causes. Note that this way of comparing rates implies that the risk
factor is a cause and not just a marker. Because of the way it is calculated,
attributable risk is also called risk difference.
Table 5.3
Measures of Effect
Table 5.4
Calculating Measures of Effect: Cigarette Smoking and Death from Lung Cancer
Simple risks
Compared risks
RELATIVE RISK
On the other hand, one might ask, "How many times are exposed per-
sons more likely to get the disease relative to nonexposed persons?" To
answer this question, we speak of relative risk or risk ratio, the ratio of
incidence in exposed persons to incidence in nonexposed persons. Relative
risk tells us nothing about the magnitude of absolute risk (incidence). Even
for large relative risks, the absolute risk might be quite small if the disease
is uncommon. It does tell us the strength of the association between expo-
sure and disease and so is a useful measure of effect for studies of disease
etiology.
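The two definitions given so far, attributable risk as a difference and relative risk as a ratio, can be sketched as below. The incidence figures are hypothetical (cases per 100,000 per year) and are chosen so that a common and a rare disease share the same relative risk.

```python
def attributable_risk(incidence_exposed: float, incidence_nonexposed: float) -> float:
    """Risk difference: incidence in exposed minus incidence in nonexposed persons."""
    return incidence_exposed - incidence_nonexposed


def relative_risk(incidence_exposed: float, incidence_nonexposed: float) -> float:
    """Risk ratio: incidence in exposed divided by incidence in nonexposed persons."""
    return incidence_exposed / incidence_nonexposed


# Hypothetical yearly incidences per 100,000: (exposed, nonexposed).
common_disease = (300, 100)
rare_disease = (3, 1)

for name, (exposed, nonexposed) in [("common", common_disease), ("rare", rare_disease)]:
    print(name,
          relative_risk(exposed, nonexposed),
          attributable_risk(exposed, nonexposed))
```

Both assumed diseases have the same relative risk, yet the additional incidence related to exposure differs 100-fold, which is why the relative risk alone says nothing about the magnitude of absolute risk.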
INTERPRETING ESTIMATES OF INDIVIDUAL RISK
The clinical meaning attached to relative and attributable risk is often
quite different, because the two expressions of risk stand for entirely differ-
ent concepts. The appropriate expression of risk depends on which ques-
tion is being asked.
Example Risk factors for cardiovascular disease are generally thought
to be weaker among the elderly than the middle-aged. This assertion was
examined by comparing the relative risks and attributable risks of common
risk factors for cardiovascular disease among different age groups (8). An
example is the risk of stroke from smoking (Table 5.5). The relative risk
decreases with age, from 4.0 in persons aged 45-49 to 1.4 in persons aged
65-69. However, the attributable risk increases slightly with age, mainly
because stroke is more common in the elderly regardless of smoking status.
Thus, although the causal link between smoking and stroke decreases with
age, an elderly individual who smokes increases his or her actual risk of
stroke to a similar, indeed slightly greater, degree than a younger person.
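The pattern in this example, a falling relative risk alongside a rising attributable risk, can be reproduced with illustrative numbers. The stroke rates below are hypothetical values per 1000, chosen only so that the relative risks match the 4.0 and 1.4 quoted above; they are not the data of the cited study.

```python
# Hypothetical stroke incidence per 1000 by age group and smoking status.
rates = {
    "45-49": {"smokers": 29.6, "nonsmokers": 7.4},
    "65-69": {"smokers": 112.0, "nonsmokers": 80.0},
}

summary = {}
for age_group, rate in rates.items():
    summary[age_group] = {
        # ratio of incidences (relative risk)
        "relative_risk": rate["smokers"] / rate["nonsmokers"],
        # difference of incidences (attributable risk)
        "attributable_risk": rate["smokers"] - rate["nonsmokers"],
    }

for age_group, measures in summary.items():
    print(age_group, measures)
```

Because the assumed background rate of stroke is much higher in the older group, a smaller ratio still corresponds to a larger absolute excess.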
Table 5.5
Comparing Relative Risk and Attributable Risk in the Relationship of Smoking,
Stroke, and Age
[Figure 5.4 panels, each plotted against diastolic blood pressure (mm Hg): excess death rate attributable to BP >90 mm Hg; prevalence of elevated BP at various levels; percent excess deaths attributable to various levels of hypertension.]
Figure 5.4. Relationships among attributable risk, prevalence of risk factor, and
population risk for hypertension. (Adapted from The Hypertension Detection and
Follow-up Cooperative Group. Mild hypertensives in the hypertension detection and
follow-up program. Ann NY Acad Sci 1978;304:254-266.)
cal literature than are measures of individual risk, e.g., attributable and
relative risks. But a particular clinical practice is as much a population
for the doctor as is a community for health policymakers. Also, how the
prevalence of exposure affects community risk can be important in the care
of individual patients. For instance, when patients cannot give a history or
when exposure is difficult for them to recognize, we depend on the usual
prevalence of exposure to estimate the likelihood of various diseases. When
considering treatable causes of cirrhosis in a North American patient, for
example, it would be more profitable to consider alcohol than schistosomes,
inasmuch as few North Americans are exposed to Schistosoma mansoni. Of
course, one might take a very different stance in the Nile delta, where
schistosomes are prevalent and the people, who are mostly Muslims, rarely
drink alcohol.
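How prevalence of exposure shapes community risk can be sketched with the usual definitions of population attributable risk (attributable risk multiplied by prevalence of exposure) and population attributable fraction. These definitions are assumed here, and every number in the example is hypothetical.

```python
def population_attributable_risk(attributable_risk: float,
                                 prevalence_of_exposure: float) -> float:
    """Excess incidence in the whole community associated with the exposure."""
    return attributable_risk * prevalence_of_exposure


def population_attributable_fraction(incidence_nonexposed: float,
                                     attributable_risk: float,
                                     prevalence_of_exposure: float) -> float:
    """Share of the community's total incidence associated with the exposure."""
    excess = population_attributable_risk(attributable_risk, prevalence_of_exposure)
    total_incidence = incidence_nonexposed + excess
    return excess / total_incidence


# Hypothetical exposures; rates are per 1000 per year, background incidence is 10.
strong_but_rare = population_attributable_risk(50, 0.01)
weak_but_common = population_attributable_risk(5, 0.40)
print(strong_but_rare, weak_but_common)
print(population_attributable_fraction(10, 50, 0.01))
print(population_attributable_fraction(10, 5, 0.40))
```

In this sketch the weak but common exposure accounts for more disease in the community than the strong but rare one, even though it matters less to any one exposed person.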
Summary
Risk factors are characteristics that are associated with an increased risk
of becoming diseased. Whether or not a particular risk factor is a cause
of disease, its presence allows one to predict the probability that disease
will occur.
Most suspected risk factors cannot be manipulated for the purposes of
an experiment, so it is usually necessary to study risk by simply observing
people's experience with risk factors and disease. One way of doing so is
to select a cohort of people, some members of which are and some of which
are not exposed to a risk factor, and observe the subsequent incidence of
disease. Although it is scientifically preferable to study risk by means of
cohort studies, this approach is not always feasible because of the time,
effort, and expense it entails.
When disease rates are compared among groups with different exposures
to a risk factor, the results can be expressed in several ways. Attributable
risk is the excess incidence of disease related to exposure. Relative
risk is the number of times more likely exposed people are to become
diseased relative to nonexposed people. The impact of a risk factor on
groups of people takes into account not only the risk related to exposure
but the prevalence of exposure as well.
REFERENCES
1. Weiss NS, Liff JM. Accounting for the multicausal nature of disease in the design and
analysis of epidemiologic studies. Am J Epidemiol 1983;117:14-18.
2. Prabhat KSJ, Beral V, Peto J, Hack S, Hermon C, Deacon J. Antibodies to human papillomavirus
and to other genital infectious agents and invasive cervical cancer risk. Lancet
1993;341:1116-1118.
3. Rose G. Sick individuals and sick populations. Int J Epidemiol 1985;14:32-38.
4. Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of
coronary-artery disease. N Engl J Med 1979;300:1350-1358.
5. Jaffe HW et al. National case-control study of Kaposi's sarcoma and Pneumocystis carinii
pneumonia in homosexual men. Part 1, Epidemiologic results. Ann Intern Med
1983;99:145-151.
6. Dawber TR. The Framingham Study. The epidemiology of atherosclerotic disease. Cambridge,
MA: Harvard University Press, 1980.
7. Kramer MS, Rooks Y, Pearson HA. Growth and development in children with sickle-cell
trait. N Engl J Med 1978;299:686-689.
8. Psaty BM et al. Risk ratios and risk differences in estimating the effect of risk factors for
cardiovascular disease in the elderly. J Clin Epidemiol 1990;43:961-970.
9. Hofman A, Vandenbroucke JP. Geoffrey Rose's big idea. Br Med J 1992;305:1519-1520.
SUGGESTED READINGS
Detsky AS, O'Rourke K, Corey PN, Johnston N, Fenton S, Jeejeebhoy KN. The hazards of
using active clinic patients as a source of subjects for clinical studies. J Gen Intern Med
1988;3:260-266.
Feinstein AR. Scientific standards in epidemiologic studies of the menace of daily life. Science
1988;242:1257-1263. Response by Savitz DA, Greenland S, Stolley PD, Kelsey JL. Scientific
standards of criticism: a reaction to "Scientific standards in epidemiologic studies of the
menace of daily life" by Feinstein. Epidemiology 1990;1:78-83.
Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing effect of relative
and absolute risk. J Gen Intern Med 1993;8:543-548.
Morgenstern H, Kleinbaum DG, Kupper LL. Measures of disease incidence used in epidemiologic
research. Int J Epidemiol 1980;9:97-104.
Naylor CD, Chen E, Strauss B. Measured enthusiasm: does the method of reporting trial
results alter perceptions of therapeutic effectiveness? Ann Intern Med 1992;117:916-921.
6
PROGNOSIS
When people become sick, they have a great many questions about how
their illness will affect them. Is it dangerous? Could I die of it? Will there
be pain? How long will I be able to continue my present activities? Will
it ever go away altogether? Most patients and their families want to know
what to expect, even if little can be done about their illness.
Prognosis is a prediction of the future course of disease following its
onset. In this chapter, we review the ways in which the course of disease
can be described. We then consider the biases that can affect these descriptions
and how these biases can be controlled. Our intention is to give
readers a better understanding of a difficult but indispensable task-pre-
dicting patients' futures as closely as possible. The object is to avoid ex-
pressing prognosis with vagueness when it is unnecessary, and with cer-
tainty when it is misleading.
Doctors and patients think about prognosis in several different ways.
First, they want to know the general course of the illness the patient has.
A young patient suffering from postherpetic neuralgia associated with
herpes zoster can be assured that the pain usually resolves in less than a
month. Second, they usually want to know, as much as possible, the prog-
nosis in the particular case. Even though HIV infection is virtually univer-
sally fatal, individuals with the infection may live from a few months to
more than a decade; a patient wants to know where on this continuum
his or her particular case falls. Third, patients especially are interested to
know how an illness is likely to affect their lives, not only whether it will
or will not kill them, but how it will change their ability to work, to walk,
to talk, how it will alter their relationships with family and friends, how
much pain and discomfort they will have to endure.
Prognosis Studies
Studies of prognosis tackle these clinical questions in ways similar to
cohort studies of risk. A group of patients having something in common
(a particular medical disease or condition, in the case of prognostic studies)
are assembled and followed forward in time, and clinical outcomes are
measured. Often, conditions that are associated with a given outcome of
the disease, i.e., prognostic factors, are sought.
CLINICAL COURSE/NATURAL HISTORY OF DISEASE
Disease prognosis can be described for either the clinical course or
the natural history of illness. The term clinical course has been used to
describe the evolution (prognosis) of disease that has come under medi-
cal care and is then treated in a variety of ways that might affect the
subsequent course of events. Patients usually come under medical care
at some time in the course of their illness when they have diseases that
cause symptoms such as pain, failure to thrive, disfigurement, or un-
usual behavior. Examples include type 1 diabetes mellitus, carcinoma
of the lung, and rabies. Once disease is recognized, it is also likely to
be treated.
The prognosis of disease without medical intervention is termed the
natural history of disease. Natural history describes how patients will
fare if nothing is done for their disease. A great many medical condi-
tions, even in countries with advanced medical care systems, often do
not come under medical care. They remain unrecognized, perhaps be-
cause they are asymptomatic or are considered among the ordinary
discomforts of daily living. Examples include mild depression, anemia,
and cancers that are occult and slow growing (e.g., some cancers of the
thyroid and prostate).
ZERO TIME
Cohorts in prognostic studies are observed starting from a point in time,
called zero time. This point should be specified clearly and be the same
well-defined location along the course of disease (e.g., the onset of symp-
toms, time of diagnosis, or beginning of treatment) for each patient. The
term inception cohort is used to describe a group of people who are assembled
near the onset (inception) of disease.
If observation is begun at different points in the course of disease for
the various patients in the cohort, description of their subsequent course
will lack precision. The relative timing of such events as recovery, recur-
rence, and death would be difficult to interpret or misleading.
For example, suppose we wanted to describe the clinical course of pa-
tients with lung cancer. We would assemble a cohort of people with the
disease and follow them forward over time to such outcomes as complications
and death. But what do we mean by "with disease"? If zero time
was detection by screening for some patients, onset of symptoms for others,
and hospitalization or the beginning of treatment for still others, then
observed prognosis would depend on the particular mix of zero times in
the study. Worse, if we did not explicitly describe when in the course of
disease patients entered the cohort, we would not know how to interpret
or use the reported prognosis.
DESCRIBING OUTCOMES OF DISEASE
Prognostic Factors
Although most patients are interested in the course of their disease in
general, they are even more interested in a prediction for their given case.
Prognostic factors help identify groups of patients with the same disease
who have different prognoses.
DIFFERENCES BETWEEN PROGNOSTIC FACTORS AND RISK FACTORS
Studies of risk factors usually deal with healthy people, whereas
prognostic factors (conditions that are associated with an outcome of disease)
are, by definition, studied in sick people. There are other important
differences as well, outlined below.
Different Factors
Factors associated with an increased risk are not necessarily the same
as those marking a worse prognosis and are often considerably different
for a given disease. For example, low blood pressure decreases one's
chances of having an acute myocardial infarction, but it is a bad prognostic
sign when present during the acute event (Fig. 6.1). Similarly, intake of
exogenous estrogens during menopause increases women's risk of endo-
metrial cancer, but the associated cancers are found at an earlier stage and
seem to have a better-than-average prognosis.
Figure 6.1. Differences between risk and prognostic factors for acute myocardial
infarction
Some factors do have a similar effect on both risk and prognosis. For
example, both the risk of experiencing an acute myocardial infarction and
the risk of dying of it increase with age.
Different Outcomes
Risk and prognosis describe different phenomena. For risk, the event
being counted is the onset of disease. For prognosis, a variety of
consequences of disease are counted, including death, complications, disability,
and suffering.
Different Rates
Risk factors generally predict low probability events. Yearly rates for
the onset of various diseases are on the order of 1/100 to 1/10,000. As a
result, relationships between exposure and risk usually elude even astute
clinicians unless they rely on carefully executed studies, often involving a
large number of people over extended periods of time. Prognosis, on the
other hand, describes relatively frequent events. Clinicians often can form
good estimates of prognosis on their own, from their personal experience.
For example, they know that few patients with lung or pancreatic cancer
survive as long as 5 years, whereas most with chronic lymphocytic leuke-
mia survive much longer.
MULTIPLE PROGNOSTIC FACTORS AND PREDICTION RULES
A combination of factors may give a more precise prognosis than each
of the same factors taken one at a time. Clinical prediction rules estimate the
probability of outcomes according to a set of patient characteristics.
Example Once patients with HIV infection develop AIDS, the prognosis
is poor and survival time is short. Even so, and before antiviral and prophy-
lactic therapy for opportunistic infections became standard treatment, it was
clear that some patients with AIDS survived much longer than others. A
study was done to determine which patient characteristics predicted survival
(3). Each of several physiologic characteristics was found to be related to
survival. Using these factors in combination, the investigators developed a
prognostic staging system, with 1 point for the presence of each of 7 factors:
severe diarrhea or a serum albumin <2.0 gm/dL, any neurologic deficit, PO2
less than or equal to 50 mm Hg, hematocrit <30%, lymphocyte count <150/μL,
white count <2500/μL, and platelet count <140,000/μL. The total score
determined the prognostic stage (I, 0 points; II, 1 point; III, greater than or
equal to 2 points). Figure 6.2 shows the survival of AIDS patients in each
prognostic stage. Using multiple prognostic factors together, the authors
noted that prediction for median length of survival varied from 11.6 months
for patients in stage I to 2.1 months for patients in stage III.
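A prediction rule of this kind is simple enough to write down as a short function. The sketch below only illustrates the scoring logic described in the example; it is not the investigators' own implementation. The field names are hypothetical, the thresholds and units are those quoted in the example, and "severe diarrhea or a serum albumin <2.0 gm/dL" is read as a single factor, an interpretation that makes the list come to 7.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical field names; units follow the thresholds quoted in the example.
    severe_diarrhea: bool
    albumin_g_dl: float
    neurologic_deficit: bool
    po2_mm_hg: float
    hematocrit_pct: float
    lymphocyte_count: float
    white_count: float
    platelet_count: float


def prognostic_points(p: Patient) -> int:
    """Count 1 point for each of the 7 factors that is present."""
    factors = [
        p.severe_diarrhea or p.albumin_g_dl < 2.0,  # read here as one factor
        p.neurologic_deficit,
        p.po2_mm_hg <= 50,
        p.hematocrit_pct < 30,
        p.lymphocyte_count < 150,
        p.white_count < 2500,
        p.platelet_count < 140_000,
    ]
    return sum(factors)


def prognostic_stage(points: int) -> str:
    """Map the total score to a stage: I (0 points), II (1 point), III (2 or more)."""
    if points == 0:
        return "I"
    if points == 1:
        return "II"
    return "III"


# Made-up patient in whom only the hematocrit criterion is met.
patient = Patient(False, 3.5, False, 80, 28, 400, 3000, 200_000)
points = prognostic_points(patient)
print(points, prognostic_stage(points))  # prints: 1 II
```

Keeping the point count and the stage lookup in separate functions mirrors the two steps of the rule: first tally the factors, then translate the total into a stage.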
Describing Prognosis
PROGNOSIS AS A RATE
It is convenient to summarize the course of disease as a single number,
or rate: the proportion of people experiencing an event. Rates commonly
used for this purpose are shown in Table 6.1. These rates have in common
the same basic components of incidence, events arising in a cohort of
patients over time.
Figure 6.2. Survival of AIDS patients according to prognostic stage. Median
survival times (in months): stage I, 11.6; stage II, 5.1; stage III, 2.1.
(Adapted from Justice AC, Feinstein AR, Wells CK. A new prognostic staging
system for the acquired immunodeficiency syndrome. N Engl J Med
1989;320:1388-1393.)
All the components of the rate must be specified: zero time, the specific
clinical characteristics of the patients, definition of outcome events, and
length of follow-up. Follow-up must be long enough for all the events to
occur; otherwise, the observed rate will understate the true one.
A TRADE-OFF: SIMPLICITY VERSUS MORE INFORMATION
Expressing prognosis as a rate has the virtue of simplicity. Rates can be
committed to memory and communicated succinctly. Their drawback is
that relatively little information is conveyed, and large differences in prog-
nosis can be hidden within similar summary rates.
Figure 6.3 shows 5-year survival for patients with four conditions. For
each condition, about 10% of the patients are alive at 5 years. But similar
Table 6.1
Rates Commonly Used to Describe Prognosis
Rate                        Definition (a)
5-year survival             Percent of patients surviving 5 years from some point
                            in the course of their disease
Case fatality               Percent of patients with a disease who die of it
Disease-specific mortality  Number of people per 10,000 (or 100,000) population
                            dying of a specific disease
Response                    Percent of patients showing some evidence of
                            improvement following an intervention
Remission                   Percent of patients entering a phase in which disease
                            is no longer detectable
Recurrence                  Percent of patients who have return of disease after a
                            disease-free interval
(a) Time under observation is either stated or assumed to be sufficiently long so
that all events that will occur have been observed.
[Legible panel labels: C, chronic granulocytic leukemia; D, age 100 years. Horizontal axis: years.]
Figure 6.3. A limitation of 5-year survival rates: four conditions with the same
5-year survival rate of 10%. (Data from Anagnostopoulos CD et al. Aortic dissections
and dissecting aneurysms. Am J Cardiology 1972;30:263-273; Saah JA, Hoover
DR et al. Factors influencing survival after AIDS: report from the Multicenter AIDS
Cohort Study (MACS). J Acquir Immune Defic Syndr 1994;7:287-295; Kardinal
CG et al. Chronic granulocytic leukemia. Review of 536 cases. Arch Intern Med
1976;136:305-313; and American College of Life Insurance. 1979 life insurance
fact book. Washington, DC: ACLI, 1979.)
The plot of survival against time displays steps, corresponding to the death
of each of the 10 patients in the cohort. If the number of patients were
increased (Fig. 6.4B), the size of the steps would diminish. If a very large
number of patients were represented, the figure would approximate a
smooth curve. This information could then be used to predict the year-by-
year, or even week-by-week, prognosis of similar patients.
Unfortunately, obtaining the information in this way is impractical for
several reasons. Some of the patients would undoubtedly drop out of the
study before the end of the follow-up period, perhaps because of another
illness, a move to a place where follow-up was impractical, or dissatisfac-
tion with the study. These patients would have to be excluded from the
cohort, even though considerable effort may have been exerted to gather
data on them up to the point at which they dropped out. Also, it would
be necessary to wait until all of the cohort's members had reached each
point in time before the probability of surviving to that point could be
calculated. Because patients ordinarily become available for a study over
a period of time, at any point in calendar time there would be a relatively
long follow-up for patients who entered the study first, but only brief
experience with those who entered recently. The last patient who entered
the study would have to reach each year of follow-up before any informa-
tion on survival to that year would be available.
SURVIVAL CURVES
To make efficient use of all available data from each patient in the
cohort, a way of estimating the survival of a cohort over time, called
survival analysis, has been developed. (The usual method is called a
Kaplan-Meier analysis, after the originators.) The purpose of survival
analysis is not (as its name implies) only to describe whether patients
live or die. Any outcome that is dichotomous and occurs only once during
follow-up (e.g., time to coronary event or to recurrence of cancer) can be
described in this way. When an event other than survival is described, the
term time-to-event analysis is sometimes used.
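How such a method uses every patient's follow-up, including patients who drop out or enter late, can be seen in a small product-limit (Kaplan-Meier) calculation. The sketch below is a minimal illustration rather than a substitute for a statistical package; the function name, the input format, and the five-patient cohort are all made up for the example.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  -- follow-up time for each patient
    events -- True if the outcome occurred at that time, False if the
              patient was censored (lost, or still under observation)
    Returns a list of (time, estimated probability of surviving past time).
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Probability of surviving this time point among those still
            # at risk, multiplied by the survival accumulated so far.
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        # Patients with events and censored patients both leave the risk set.
        at_risk -= n_leaving
    return curve


# Hypothetical cohort: times in years; False marks a censored patient.
times = [1, 2, 2, 3, 4]
events = [True, False, True, False, True]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Censored patients contribute to the denominator for as long as they are observed and then simply leave the risk set, so none of the effort spent following them is discarded.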
Figure 6.5 shows a typical survival curve. On the vertical axis is the
[Panels: A, 10 patients; B, 100 patients. Horizontal axis: time (years).]
Figure 6.4. Survival of two cohorts, small and large, when all members are
observed for the full period of follow-up.
[Detail annotations: 4/5 = 80%; 1/2 = 50%. Horizontal axis: time (years).]
Figure 6.5. A typical survival curve, with detail for one part of the curve.
[Curves: Severe and Normal, by initial degree of stenosis. Horizontal axis: time
from detection of asymptomatic neck bruit (months). Number at risk (N), in the
order printed: Severe, 94, 80, 68, 52, 42, 28, 21, 7; Normal, 242, 236, 211, 169,
158, 69, 56, 1.]
Figure 6.6. Survival curve showing comparison of two cohorts, number of people
at risk, and 95% confidence intervals for observed rates. These curves show the
cumulative probability of a cerebral ischemic event from time of diagnosis, according
to the initial degree of carotid stenosis. (Data from Chambers BR, Norris JW.
Outcome in patients with asymptomatic neck bruits. N Engl J Med 1986;315:860-865.)
[Diagram labels: Prognostic Factor (Present, Absent); Outcomes; Potential Biases
(Sampling, Selection, Measurement).]
[Horizontal axis: months, 0 to 24.]
Figure 6.8. Disease-free survival according to CEA levels in colorectal cancer
patients with similar pathological staging (Dukes B). (Redrawn from Wolmark N et
al. The prognostic significance of preoperative carcinoembryonic antigen levels in
colorectal cancer. Results from NSABP clinical trials. Ann Surg 1984;199:375-382.)
CEA levels independently predicted relapse. Similar results were found for
patients with Dukes C tumors. Therefore, the association between CEA levels
and likelihood of relapse could not be explained by susceptibility bias for
patients with Dukes B and C colorectal cancers, and CEA is an important
independent prognostic factor.
SURVIVAL COHORTS
True cohort studies should be distinguished from studies of survival
cohorts in which patients are included in a study because they both have
a disease and are currently available-perhaps because they are being
seen in a specialized clinic. Another term for such groups of patients is
available patient cohorts. Reports of survival cohorts are misleading if they
are presented as true cohorts. In a survival cohort, people are assembled
at various times in the course of their disease, rather than at the beginning,
as in a true cohort study. Their clinical course is then described by going
back in time and seeing how they have fared up to the present (Fig. 6.9).
The experiences of survival cohorts are sometimes presented as if they
[Diagram labels: True Cohort and Survival Cohort, each with Observed Improvement
and True Improvement. In the survival cohort, follow-up begins with N = 50
patients (improved, 40; not improved, 10), giving an observed improvement of 80%
against a true improvement of 50%; N = 100 dropouts are not observed (improved,
35; not improved, 65).]
Figure 6.9. Comparison of a true and a "survival" cohort: in the survival cohort,
some of the patients present at the beginning are not included in the follow-up.
Example Concern has been raised about the possibility that silicone
breast implants may cause autoimmune symptoms of rheumatic disease. A
study was, therefore, done of 156 women with silicone breast implants and
rheumatic disease complaints (5). The patients were consecutive referrals to
three rheumatologists who were known for their interest in silicone implants
and rheumatic disease. Serologic tests in the women were compared to those
of women without implants but with fibromyalgia and to tests in women
with implants but no rheumatic symptoms. The clinical findings in the
women with implants and complaints were described; most did not fulfill
criteria for rheumatoid arthritis and most had normal immunologic tests.
However, 14 patients had scleroderma-like illness and abnormal serology
that was not found in the comparison groups. Because of the possible biases
that can occur in the assembly of patients for this case series, the authors
were cautious about their findings, concluding that "the hypotheses raised
in this study and others should be tested in large, population-based studies."
Publication of the first such study does not support the hypothesis (6).
MIGRATION BIAS
Migration bias, another form of selection bias, can occur when patients
in one group leave their original group, dropping out of the study alto-
gether or moving to one of the other groups under study. If these changes
take place on a sufficiently large scale, they can affect the validity of
conclusions.
In nearly all studies, some members of an original group drop out over
time. If these dropouts occur randomly, such that the characteristics of lost
subjects in one group are on the average similar to those lost from the
other, then no bias would be introduced. This is so whether or not the
number of dropouts is large or the number is similar in the groups. But
ordinarily the characteristics of lost subjects are not the same in various
groups. The reasons for dropping out (death, recovery, side effects of
treatment, etc.) are often related to prognosis and may also affect one
group more than another. As a result, groups in a cohort that were compa-
rable at the outset may become less so as time passes.
As the proportion of people in the cohort who are not followed up
increases, the potential for bias increases. It is not difficult to estimate how
large this bias could be. All one needs is the number of people in the
cohort, the number not accounted for, and the observed outcome rate.
Example Thompson et al. described the long-term outcomes of
gastrogastrostomy (7). A cohort of 123 morbidly obese patients was studied 19-47
months after surgery. Success was defined as having lost more than 30% of
excess weight.
Only 103 patients (84%) could be located. In these, the success rate of
surgery was 60/103 (58%). To determine the range within which the true
success rate must lie, the authors did a best case/worst case analysis. Success
rates were calculated, assuming that all of the patients lost to follow-up were,
on the one hand, successes (best case) and, on the other hand, failures (worst
case). Of the total cohort of 123 patients, 103 were followed up and 20 were
lost to follow-up. The observed success rate was 60/103, or 58%. In the best
case, all 20 patients lost to follow-up would be counted as successes, and the
success rate would be (60 + 20)/123, or 65%. In the worst case, all 20 patients
would be counted as failures, and the success rate would be 60/123, or 49%.
Thus the true rate must have been between 49 and 65%; probably, it was
closer to 58%, the observed rate, because patients not followed up are unlikely
to be all successes or all failures.
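The best case/worst case arithmetic in this example needs only the three numbers named earlier: the size of the cohort, the number followed up, and the number of observed successes. A minimal sketch, with a function name and argument names chosen purely for illustration:

```python
def best_worst_case(cohort_size, followed, successes):
    """Bounds on the true success rate when some patients are lost to follow-up."""
    lost = cohort_size - followed
    observed = successes / followed
    best = (successes + lost) / cohort_size   # every lost patient counted as a success
    worst = successes / cohort_size           # every lost patient counted as a failure
    return observed, best, worst


observed, best, worst = best_worst_case(cohort_size=123, followed=103, successes=60)
print(f"observed {observed:.0%}, best case {best:.0%}, worst case {worst:.0%}")
# prints: observed 58%, best case 65%, worst case 49%
```

The width of the interval between the worst and best cases grows directly with the number lost, which is why the potential for bias rises as follow-up becomes less complete.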
Patients may also cross over from one group to another in the cohort
during their follow-up. Whenever this occurs, the original reasons for
patients being in one group or the other no longer apply. If exchange of
patients between groups takes place on a large scale, it can diminish the
observed difference in risk compared to what might have been observed
if the original groups had remained intact. Migration bias due to crossover
is more often a problem in risk than in prognosis studies, because risk
studies often go on for many years. On the other hand, migration from
one group to another can be used in the analysis of a study.
Example The relationship between lifestyle and mortality was studied
by classifying 10,269 Harvard College alumni by physical activity, smoking
status, weight, and blood pressure in 1966 and again in 1977 (8). Mortality
rates were then observed over a 9-year period from 1977 to 1985. It was
recognized that original classifications might change, obscuring any
relationship that might exist between lifestyle and mortality. To deal with
this, the investigators defined four categories: men who maintained high-risk
lifestyles, those who changed from low- to high-risk lifestyles, those who
changed from high- to low-risk lifestyles, and those who maintained low-risk
lifestyles. After adjusting for other risk factors, men who increased their
physical activity from low to moderate amounts, quit smoking, lost weight
to normal levels, and/or became normotensive all had lower mortality than
men who maintained or adopted high-risk characteristics, but not as low as
the rates for alumni who never had any risk factors.
MEASUREMENT BIAS
Measurement bias is possible if patients in one group stand a better chance
of having their outcome detected than those in another group. Obviously,
some outcomes, such as death, cardiovascular catastrophes, and major
cancers, are so obtrusive that they are unlikely to be missed. But for less
clear-cut outcomes (the specific cause of death, subclinical disease, side
effects, or disability), measurement bias can occur because of differences
in the methods with which the outcome is sought or classified.
Measurement bias can be minimized in three general ways. One can
ensure that those who make the observations are unaware of the group to
which each patient belongs, can set careful rules for deciding whether or
not an outcome event has occurred (and follow the rules), and can apply
efforts to discover events equally in all groups in the study.
Example Chambers and Norris studied the outcome of patients with
asymptomatic neck bruits (9). A total of 500 asymptomatic patients with
cervical bruits were observed for up to 4 years. Patients were classified
according to the degree of initial carotid artery stenosis by Doppler
ultrasonography. Outcomes were change in degree of carotid stenosis and incidence of
cerebral ischemic events.
To avoid biased measurements, the authors estimated carotid stenosis
using established, explicit criteria for interpreting Doppler scans and made
the readings without knowledge of the auscultatory or previous Doppler
findings. Clinical and Doppler assessments were repeated every 6 months,
and all noncomplying patients were telephoned to determine whether out-
comes had occurred.
This study showed, among other things, that patients with >75% carotid
stenosis had a >20% incidence of cerebral ischemic events in 3 years, more
than 4 times the rate of patients with <30% stenosis (see Fig. 6.6).
1 Control has several meanings in research: (a) a general term for any process
(restriction, matching, stratification, adjustment) aimed at removing the effects
of extraneous variables while examining the independent effects of one variable,
(b) the nonexposed people in a cohort study (a confusing use of the term), (c) the
nontreated patients in a clinical trial, and (d) nondiseased people (noncases) in a
case control study (see Chapter 10).
Table 6.2
Methods for Controlling Selection Bias
Phase of Study
RANDOMIZATION
The only way to equalize all extraneous factors, or "everything else," is to
assign patients to groups randomly so that each patient has an equal chance
of falling into the exposed or unexposed group. A special feature of random-
ization is that it not only equalizes factors we think might affect prognosis,
it also equalizes factors we do not know about. Thus randomization goes a
long way in protecting us from incorrect conclusions about prognostic factors.
However, it is usually not possible to study prognosis in this way. The special
situations in which it is possible to allocate exposure randomly, usually to
study the effects of treatment on prognosis, will be discussed in Chapter 7.
RESTRICTION
Patients who are enrolled in a study can be restricted to only those
possessing a narrow range of characteristics, to equalize important
extraneous factors. For example, the effect of age on prognosis after acute
myocardial infarction could be studied in white males with uncomplicated anterior
MATCHING
Patients can be matched as they enter a study so that for each patient
in one group there are one or more patients in the comparison group with
the same characteristics except for the factor of interest. Often patients are
matched for age and sex, because these factors are strongly related to the
prognosis of many diseases. But matching for other factors may be called
for as well, such as stage or severity of disease, rate of progression,
and prior treatments. An example of matching in a cohort study of sickle-
cell trait was presented in the discussion of observational studies in
Chapter 5.
Although matching is commonly used and can be very useful, it controls
for bias only for those factors involved in the match. Also, it is usually not
possible to match for more than a few factors, because of practical difficul-
ties in finding patients who meet all the matching criteria. Moreover, if
categories for matching are relatively crude, there may be room for sub-
stantial differences between matched groups. For example, if a study of
risk for Down's syndrome were conducted in which there was matching
for maternal age within 10 years, there could be a nearly 10-fold difference
in frequency related to age if most of the women in one group were 30
and most in the other 39. Also, once one restricts or matches on a variable,
its effects on outcomes can no longer be evaluated in the study.
STRATIFICATION
After data are collected, they can be analyzed and results presented
according to subgroups of patients, or strata, of similar characteristics.
Example Let us suppose we want to compare the operative mortality
rates for coronary bypass surgery at hospitals A and B. Overall, hospital A
noted 48 deaths in 1200 bypass operations (4%), and hospital B experienced
64 deaths in 2400 operations (2.7%).
The crude rates suggest that hospital B is superior. Or do they? Perhaps
patients in the two hospitals were not otherwise of comparable prognosis.
On the basis of age, myocardial function, extent of occlusive disease, and
other characteristics, the patients can be divided into subgroups based on
preoperative risk (Table 6.3); then the operative mortality rates within each
category or stratum of risk can be compared.
Table 6.3
Example of Stratification: Hypothetical Death Rates after Coronary Bypass Surgery
in Two Hospitals, Stratified by Preoperative Risk
                     Hospital A                    Hospital B
Preoperative Risk    Patients  Deaths  Rate (%)    Patients  Deaths  Rate (%)
Table 6.3 shows that when patients are divided by preoperative risk, the
operative mortality rates in each risk stratum are identical in the two hospitals:
6% in high-risk patients, 4% in medium-risk patients, and 0.67% in low-risk
patients. The obvious source of the misleading impression created by
examining only the crude rates is the important differences in the risk
characteristics of the patients treated at the two hospitals: 42% of hospital A's
patients and only 17% of hospital B's patients were high risk.
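The contrast between crude and stratum-specific rates can be checked with a few lines of arithmetic. In the sketch below the counts within each stratum are an assumption: they were chosen so that they reproduce the totals (48 deaths in 1200 operations, 64 in 2400), the stratum-specific rates, and the proportions of high-risk patients quoted in the example. They should not be read as the published table.

```python
# Assumed (patients, deaths) per risk stratum, chosen to reproduce the
# totals and stratum-specific rates quoted in the example.
HOSPITALS = {
    "A": {"high": (500, 30), "medium": (400, 16), "low": (300, 2)},
    "B": {"high": (400, 24), "medium": (800, 32), "low": (1200, 8)},
}


def rate(deaths, patients):
    """Death rate as a percentage."""
    return 100 * deaths / patients


def crude_rate(strata):
    """Overall death rate for a hospital, ignoring the strata."""
    patients = sum(p for p, _ in strata.values())
    deaths = sum(d for _, d in strata.values())
    return rate(deaths, patients)


for name, strata in HOSPITALS.items():
    total = sum(p for p, _ in strata.values())
    print(f"Hospital {name}: crude rate {crude_rate(strata):.1f}%")
    for label, (patients, deaths) in strata.items():
        share = 100 * patients / total
        print(f"  {label:<6} risk: {rate(deaths, patients):.2f}% "
              f"({share:.0f}% of patients)")
```

With these counts the crude rates are 4.0% and about 2.7%, yet the rate within each stratum is the same in both hospitals; the whole of the crude difference comes from the different mix of high-, medium-, and low-risk patients.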
patients because volunteers for studies tend to do better than patients who
do not volunteer. For example, in a large Canadian study of breast cancer
screening among women in their 40s, 90% of women who were in the
control group and had invasive breast cancer were alive 7 years later, and
the number of deaths from breast cancer was lower than for Canadian
women generally (11).
Summary
Prognosis is a description of the course of disease from its onset.
Compared to risk, prognostic events are relatively frequent and often can be
estimated by personal clinical experience. However, cases of disease ordi-
narily seen in medical centers and reported in the medical literature are
often biased samples of all cases and tend to overestimate severity.
Prognosis is best described by the probability of having experienced an
outcome event at any time in the course of disease. In principle, this can
be done by observing a cohort of patients until all who will experience
the outcome of interest have done so. However, because this approach is
inefficient, another method (called survival, or time-to-event, analysis)
is often used. The onset of events over time is estimated by accumulating
the rates for all patients at risk during the preceding time intervals.
As for any observations on cohorts, studies comparing prognosis in
different groups of patients can be biased if differences arise because of
the way cohorts are assembled, if patients do not remain in their initial
groups, and if outcome events are not assessed equally. A variety of
strategies are available to deal with such differences as might arise, so as to
allow fair (unbiased) comparisons. These include restriction, matching,
stratification, standardization, multivariable analysis, and sensitivity analy-
sis. One or more of these strategies should be found whenever comparisons
are made.
REFERENCES
1. Laupacis A. Changes in quality of life and functional capacity in hemodialysis patients
treated with recombinant human erythropoietin. The Canadian Erythropoietin Study
Group. Semin Nephrol 1990;2(Suppl 1):11-19.
2. Gelber RD et al. Quality-of-life evaluation in a clinical trial of zidovudine therapy in
patients with mildly symptomatic HIV infection. Ann Intern Med 1992;116:961-966.
3. Justice AC, Feinstein AR, Wells CK. A new prognostic staging system for the acquired
immunodeficiency syndrome. N Engl J Med 1989;320:1388-1393.
4. Wolmark N et al. The prognostic significance of preoperative carcinoembryonic antigen
levels in colorectal cancer. Results from the NSABP clinical trials. Ann Surg 1984;199:375-382.
5. Bridges AJ, Conley C, Wang G, Burns DE, Vasey FB. A clinical and immunologic
evaluation of women with silicone breast implants and symptoms of rheumatic disease. Ann
Intern Med 1993;118:929-936.
6. Gabriel SE, O'Fallon WM, Kurland LT, Beard CM, Woods JE, Melton LJ III. Risk of
connective-tissue diseases and other disorders after breast implantation. N Engl J Med
1994;330:1697-1702.
7. Thompson KS, Fletcher SW, O'Malley MS, Buckwalter JA. Long-term outcomes of
morbidly obese patients treated with gastrogastrostomy. J Gen Intern Med 1986;1:85-99.
8. Paffenbarger RS, Hyde RT, Wing AL, Lee IM, Jung DL, Kampert JB. The association of
changes in physical-activity level and other lifestyle characteristics with mortality among
men. N Engl J Med 1993;328:538-545.
9. Chambers BR, Norris JW. Outcome in patients with asymptomatic neck bruits. N Engl J
Med 1986;315:860-865.
10. Cornfield J. The University Group Diabetes Program. A further statistical analysis of the
mortality findings. JAMA 1971;217:1676-1687.
11. Miller AB, Baines CJ, To T, Wall C. Canadian National Breast Screening Study. 1.
Breast cancer detection and death rates among women aged 40 to 49 years. Can Med
Assoc J 1992;147:1459-1476.
SUGGESTED READINGS
Colton T. Statistics in medicine. Boston: Little, Brown, 1975, pp 237-250.
Concato J, Feinstein AR, Holford TR. The risk of determining risk with multivariable models.
Ann Intern Med 1993;118:201-210.
Feinstein AR, Sosin DM, Wells CK. The Will Rogers phenomenon: stage migration and new
diagnostic techniques as a source of misleading statistics for survival in cancer. N Engl J
Med 1985;312:1604-1608.
Guyatt GH, Feeny DH, Patrick DL. Measuring health-related quality of life. Ann Intern Med
1993;118:622-629.
Horwitz RI. The experimental paradigm and observational studies of cause-effect relationships
in clinical medicine. J Chron Dis 1987;40:91-99.
Laupacis A, Wells G, Richardson S, Tugwell P. Users' guides to the medical literature. V.
How to use an article about prognosis. JAMA 1994;272:234-237.
Peto R et al. Design and analysis of randomized clinical trials requiring prolonged observation
of each patient. II. Analysis and examples. Br J Cancer 1977;35:1-39.
Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules: applications and
methodological standards. N Engl J Med 1985;313:793-799.
Weiss NS. Clinical epidemiology: the study of the outcomes of illness. New York: Oxford
University Press, 1986.
7
TREATMENT
Once the nature of a patient's illness has been established and its
expected course predicted, the next question is, What can be done about it?
Is there a treatment that improves the outcome of disease? This chapter
describes the evidence used to decide whether a well-intentioned treatment
is effective.
Ideas about treatment, but more often prevention, also come from epide-
miologic studies of populations. Burkitt observed that colonic diseases are
less frequent in African countries, where diet is high in fiber, than in
developed countries, where intake of dietary fiber is low. This observation
has led to efforts to prevent bowel diseases (irritable bowel syndrome,
diverticulitis, appendicitis, and colorectal cancer) with high-fiber diets.
Comparisons across countries have also suggested the value of red wine
to prevent heart disease and fluoride to prevent dental caries.
TESTING IDEAS
Some treatment effects are so prompt and powerful that their value is
self-evident even without formal testing. Clinicians do not have reserva-
tions about the value of penicillin for pneumonia, surgery for appendicitis,
or colchicine for gout. Clinical experience has been sufficient.
Usually, however, the effects of treatment are considerably less dramatic.
It is then necessary to put ideas about treatments to a formal test,
through clinical research, because a variety of conditions (coincidence,
faulty comparisons, spontaneous changes in the course of disease, wishful
thinking) can obscure the true relationship between treatment and effect.
Sometimes knowledge of mechanisms of disease, based on work with
laboratory models or physiologic studies in humans, has become so
extensive that it is tempting to predict effects in humans without formal testing.
However, relying solely on our current understanding of mechanisms,
without testing ideas on intact humans, can lead to unpleasant surprises
because the mechanisms are only partly understood.
Example Many strokes are caused by cerebral infarction in the area distal
to an obstructed segment of the internal carotid artery. It should be possible
to prevent the manifestations of disease in people with these lesions by
bypassing the diseased segment so that blood can flow to the threatened area
normally. It is technically feasible to anastomose the superficial temporal
artery to the internal carotid distal to an obstruction. Because its value seemed
self-evident on physiologic grounds and because of the documented success
of an analogous procedure, coronary artery bypass, the surgery became
widely used.
The EC/IC Bypass Study Group (1) conducted a randomized controlled
trial of temporal artery bypass surgery. Patients with cerebral ischemia and
an obstructed internal carotid artery were randomly allocated to surgical
versus medical treatment. The operation was a technical success; 96% of
anastomoses were patent just after surgery. Yet, the surgery did not help the
patients. Mortality and stroke rates after 5 years were nearly identical in
the surgically and medically treated patients, but deaths occurred earlier in
the surgically treated patients.
This study illustrates how treatments that make good sense, based on
what we know about the mechanisms of disease, may be found ineffective
in human terms when put to a rigorous test. Of course, it is not always the
case that ideas are debunked; the value of carotid endarterectomy, suggested
on similar grounds, has been confirmed (2).
[Figure: diagram with the labels Allocation, Experimental intervention, Comparison intervention, Outcomes, Statistical analysis, and Generalizing.]
the results apply. But exclusions come at the price of diminished scope of
generalizability, because characteristics that exclude patients occur commonly
among those ordinarily seen in clinical practice, limiting generalizability
to these patients, the very ones for whom the information is needed.
Second, patients can refuse to participate in a trial. They may not want
a particular type of treatment or to have their medical care decided by a
flip of a coin or by someone other than their own physician. Patients who
refuse to participate are usually systematically different-in socioeconomic
class, severity of disease, other health-related problems, and other ways-
from those who agree to enter the trial.
Third, patients who are found to be unreliable during the early stages
of the trial are excluded. This avoids wasted effort and the reduction in
internal validity that would occur if patients moved in and out of treatment
groups or out of the trial altogether.
For these reasons, patients in clinical trials are usually a highly selected,
biased sample of all patients with the condition of interest (Fig. 7.3). Be-
cause of the high degree of selection in trials, it often requires considerable
faith to generalize the results of clinical trials to ordinary practice settings.
INTERVENTION
The intervention itself can be described in relation to three general
characteristics: generalizability, complexity, and strength.
First, is the intervention in question one that is likely to be implemented
Percent of patients remaining at each stage:
Inclusion criteria: age >40 years; diabetes diagnosed after 30 years old;
require medication for hyperglycemia; plan to remain in practice >2 years
Other illness, disability, etc.
    24% eligible
Uncooperative: refused to participate; did not keep appointments
    13% randomized
Dropped out: death; change of residence; illness, etc.
    12% completed study
Figure 7.3. Sampling for a clinical trial. A study of the effectiveness of a program
to reduce lower extremity problems in patients with diabetes. (Data from Litzelman
DK, Slemenda CW, Langfeld CD, Hays LM, Welch MA, Bild DE, Ford ES, Vinicor F.
Reduction in lower extremity clinical abnormalities in patients with non-insulin depen-
dent diabetes mellitus. A randomized controlled trial. Ann Intern Med 1993;119:
36-41.)
change their behavior because they are the target of special interest and
attention in a study, regardless of the specific nature of the intervention
they might be receiving, a phenomenon called the Hawthorne effect. The
reasons for this changed behavior are not clear. Patients are anxious to
please their doctors and make them feel successful. Also, patients who
volunteer for trials want to do their part to see that "good" results are
obtained. Thus comparison of treatment with simple observation measures
treatment effect over and above the Hawthorne effect.
Placebo Treatment
Do treated patients do better than similar patients given a placebo, an
intervention that is intended to be indistinguishable from the active treatment
(in physical appearance, color, taste, and smell) but does not have
a specific, known mechanism of action? Sugar pills and saline injections
are examples of placebos. It has been shown that placebos, given with
conviction, relieve severe, unpleasant symptoms, such as postoperative
pain, nausea, or itching, of about one-third of patients, a phenomenon
called the placebo effect.
[Figure 7.4: drug action ranges from mostly specific, through mixed, to mostly nonspecific. Examples shown: antimicrobials, antimetabolites, narcotics, placebo, antidepressants, antipruritics, and injection of trigger points.]
Figure 7.4. The effects of most drugs are partly attributable to the placebo effect.
(Redrawn from Fletcher RH. The clinical importance of placebo effects. Fam Med
Rev 1983; 1:40-48.)
[Figure 7.5: stacked contributions to improvement, from bottom to top: natural history, Hawthorne effect, placebo effect, and specific treatment.]
Figure 7.5. Total effects of treatment are the sum of spontaneous improvement,
nonspecific responses, and the effects of specific treatments.
Usual Treatment
Do patients given the experimental treatment do better than those re-
ceiving usual treatment? This is the only meaningful (and ethical) question
if the usual treatment is already known to be efficacious.
The cumulative effects of these various reasons for improvement in
treated patients are diagrammed in Figure 7.5.
ALLOCATING TREATMENT
To study the effects of a clinical intervention free of other effects, the best
way to allocate patients to treatment groups is by means of randomization.
Patients are given either the experimental or the control treatment by one
of a variety of disciplined procedures-analogous to flipping a coin-
whereby each patient has an equal (or at least known) chance of being
assigned to any one of the treatment groups.
Random allocation of patients is preferable to other methods of alloca-
tion, because randomization assigns patients to groups without bias. That
is, patients in one group are, on the average, as likely to possess a given
characteristic as patients in another. Only with randomization is this so
for all factors related to prognosis, whether or not they are known before
the study takes place.
In the long run, with a large number of patients in the trial, randomization
usually works as described above. However, random allocation does not
guarantee that the groups will be similar. Although the process of random
allocation is unbiased, the results may not be. Dissimilarities between groups
can arise by chance alone, particularly when the number of patients randomized
is small. To assess whether this kind of "bad luck" has occurred, authors
of randomized controlled trials often present a table comparing the frequency
of a variety of characteristics in the treated and control groups, especially
those known to be related to outcome. It is reassuring to see that important
characteristics have, in fact, fallen out nearly equally in the groups being
compared. If they have not, it is possible to see what the differences are and
attempt to control them in the analysis (see Chapter 6).
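
The role of chance in small trials can be illustrated with a short simulation. The following is a minimal sketch, not a description of any particular trial: the prevalence of the prognostic factor (30%), the size of difference called an "imbalance" (15 percentage points), and the function names are all assumptions chosen for illustration.

```python
import random


def randomize(n_patients, rng):
    """Assign each patient to 'treated' or 'control' by a fair coin flip."""
    return [rng.choice(["treated", "control"]) for _ in range(n_patients)]


def imbalance_frequency(n_patients, prevalence=0.3, threshold=0.15,
                        n_trials=500, seed=0):
    """Fraction of simulated trials in which the proportion of patients with a
    prognostic factor differs between groups by more than `threshold`.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    imbalanced = 0
    for _ in range(n_trials):
        has_factor = [rng.random() < prevalence for _ in range(n_patients)]
        groups = randomize(n_patients, rng)
        treated = [f for f, g in zip(has_factor, groups) if g == "treated"]
        control = [f for f, g in zip(has_factor, groups) if g == "control"]
        if not treated or not control:
            imbalanced += 1  # everyone landed in one group: count as imbalanced
            continue
        diff = abs(sum(treated) / len(treated) - sum(control) / len(control))
        if diff > threshold:
            imbalanced += 1
    return imbalanced / n_trials


if __name__ == "__main__":
    for n in (20, 100, 1000):
        print(n, "patients:", imbalance_frequency(n))
```

Under these assumptions, sizable differences between groups are common when 20 patients are randomized and rare when 1000 are, which is why a table of baseline characteristics matters most in small trials.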
Some investigators believe it is best to make sure, before randomization,
that at least some of the characteristics known to be strongly associated
with outcome appear equally in treated and control groups, to reduce the
risk of bad luck. They suggest that patients first be gathered into groups
(strata) of similar prognosis and then randomized separately within each
stratum, a process called stratified randomization. The groups are then
bound to be comparable, at least for the characteristics that were used to
create the strata. Others argue that whatever differences arise by bad luck
are unlikely to be large and can be dealt with mathematically after the
data are collected.
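
Stratified randomization can be sketched in a few lines. This is one possible scheme, not the only one: it assumes that patients are allocated within each stratum in shuffled blocks of two, so that the arms differ by at most one patient per stratum. The patient records, the "severe" flag, and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_randomization(patients, stratum_of, seed=0):
    """Randomize separately within each stratum.

    Assumption for this sketch: allocation uses shuffled blocks of two
    ('treated', 'control'), which keeps the arms balanced within every
    stratum (off by at most one when a stratum has an odd number of patients).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for patient in patients:
        strata[stratum_of(patient)].append(patient)

    assignment = {}
    for members in strata.values():
        for start in range(0, len(members), 2):
            block = ["treated", "control"]
            rng.shuffle(block)  # random order within the block
            for patient, arm in zip(members[start:start + 2], block):
                assignment[patient["id"]] = arm
    return assignment


if __name__ == "__main__":
    # Hypothetical patients; 'severe' stands in for any strong prognostic factor.
    patients = [{"id": i, "severe": i % 3 == 0} for i in range(30)]
    arms = stratified_randomization(patients, lambda p: p["severe"])
    for severe in (True, False):
        n_treated = sum(1 for p in patients
                        if p["severe"] == severe and arms[p["id"]] == "treated")
        print("severe" if severe else "not severe", "treated:", n_treated)
```

The design choice here is the block: a simple coin flip within each stratum would still leave room for chance imbalance, whereas blocks guarantee equal numbers for the stratifying characteristic.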
DIFFERENCES ARISING AFTER RANDOMIZATION
Not all patients in a clinical trial participate as originally planned. Some
are found not to have the disease they were thought to have when they
entered the trial. Others drop out, do not take their medications, are taken
out of the study because of side effects or other illnesses, or somehow
obtain the other study treatment or treatments that are not part of the
study at all. The result is comparison of treatment groups that might have
been comparable just after randomization but have become less so by the
time outcomes are counted.
Patients Do Not Have the Disease
It may be necessary to decide which treatment to give (in a clinical trial
or in practice) before it is certain the patient actually has the disease for
which the treatment is designed.
Example To study whether a monoclonal antibody against endotoxin improves
survival from sepsis, 543 patients with sepsis and suspected Gram-negative
infection were randomized to receive antiendotoxin or placebo (6). In
the subgroup of patients who actually had Gram-negative bacteremia, death
rate was reduced from 49 to 30%, a large difference that was well beyond what
could be accounted for by chance. However, only 200 patients (37%) had
Gram-negative bacteremia, confirmed by blood culture. There is no known reason
why the other 63% would be helped by the drug. For all patients with sepsis
(some of whom had bacteremia and others did not) mortality rate was 43% in
the placebo group and 39% in the group receiving antiendotoxin, a small
difference that was not beyond that expected by chance alone.
Thus, from this trial, there was evidence that the drug was effective against
Gram-negative bacteremia, but not for sepsis. Both are important questions:
the former for researchers, who are interested in the biologic effect of
antiendotoxin in bacteremia, and the latter for clinicians, who needed to know the
clinical effects of their decision to give the drug to patients with sepsis, a
decision that must be made before it is known whether or not bacteremia is
actually present.
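
The dilution of effect in this example can be worked through as simple arithmetic. The sketch below takes the subgroup death rates as 49% and 30% and assumes, for illustration only, that the drug has no effect at all in patients without bacteremia; the function name is hypothetical.

```python
def diluted_risk_difference(subgroup_fraction, control_risk, treated_risk):
    """Expected absolute difference in death rate for ALL randomized patients
    when only a subgroup benefits.

    Assumption: patients outside the subgroup get no benefit (and no harm),
    so they contribute nothing to the overall difference.
    """
    return subgroup_fraction * (control_risk - treated_risk)


# Figures from the antiendotoxin example: 200 of 543 patients had
# Gram-negative bacteremia; death rates in that subgroup were 49% and 30%.
expected_overall = diluted_risk_difference(200 / 543, 0.49, 0.30)

# Difference actually reported for all patients with sepsis (43% versus 39%).
observed_overall = 0.43 - 0.39

if __name__ == "__main__":
    print(f"expected overall difference: {expected_overall:.3f}")
    print(f"observed overall difference: {observed_overall:.3f}")
```

Under that assumption, a 19 percentage point difference in the bacteremic subgroup shrinks to about 7 points when spread over everyone treated, the same order of magnitude as the small overall difference the trial reported.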
the drugs being studied in the trial (researchers call the exchange of treatment
regimens among study participants "contamination") or obtain drugs that
are not part of the trial through "drug clubs." Information about this behavior
is usually not shared with the researchers and so cannot be accounted for in
the study. The result is to bias the study toward observing no effect, since
the contrast between the treatment of the "treated" group and the comparison
group is diminished.
Comparing Responders with Nonresponders
In some clinical trials, particularly those about cancer, the outcomes of
patients who initially improve after treatment (responders) are compared
with outcomes in those who do not (nonresponders). The implication is
that one can learn something about the efficacy of treatment in this way.
This approach is scientifically unsound and often misleading, because
response and nonresponse might be associated with many characteristics
related to the ultimate outcome: stage of disease, rate of progression,
compliance, dose and side effects of drugs, and the presence of other diseases.
If no patient actually improved because of the treatment, and patients were
destined to follow various clinical courses for other reasons, then some
(the ones who happened to be doing well) would be called "responders"
and others (the ones having a bad course) would be considered
"nonresponders." Responders would, of course, have a better outcome whether
or not they received the experimental treatment.
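
A small simulation makes the point concrete. In this sketch the treatment does nothing at all: each hypothetical patient has an underlying prognosis that drives both the chance of an early "response" and the chance of a good final outcome. The model and its numbers are assumptions made for illustration, not data from any trial.

```python
import random


def simulate_responder_comparison(n_patients=10000, seed=0):
    """Compare outcomes of 'responders' and 'nonresponders' when treatment
    has NO effect and prognosis alone determines both response and outcome.

    Returns (good-outcome rate in responders, rate in nonresponders).
    """
    rng = random.Random(seed)
    responders, nonresponders = [], []
    for _ in range(n_patients):
        prognosis = rng.random()             # 0 = very poor, 1 = very favorable
        responded = rng.random() < prognosis  # early improvement, by prognosis only
        good_outcome = rng.random() < prognosis  # final outcome, by prognosis only
        (responders if responded else nonresponders).append(good_outcome)
    return (sum(responders) / len(responders),
            sum(nonresponders) / len(nonresponders))


if __name__ == "__main__":
    resp_rate, nonresp_rate = simulate_responder_comparison()
    print(f"good outcome in responders:    {resp_rate:.2f}")
    print(f"good outcome in nonresponders: {nonresp_rate:.2f}")
```

Responders fare much better than nonresponders in this simulation even though no patient was helped by treatment, because "response" simply marks the patients with a favorable prognosis.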
BLINDING
Participants in a trial may change their behavior in a systematic way
(i.e., be biased) if they are aware of which patients receive which treatment.
One way to minimize this effect is by blinding, an attempt to make the
various participants in a study unaware of which treatment patients have
been offered, so that the knowledge does not cause them to act differently,
thereby damaging the internal validity of the study. "Masking" is a more
appropriate metaphor, but blinding is the time-honored term.
Blinding can take place at four levels in a clinical trial. First, those
responsible for allocating patients to treatment groups should not know
which treatment will be assigned next so that the knowledge does not
affect their willingness to enter patients in the trial or take them in the
order they arrived. Second, patients should be unaware of which treatment
they are taking; they are thereby less likely to change their compliance or
their reporting of symptoms because of this information. Third, physicians
who take care of patients in the study should not know which treatment
each patient has been given; then they will not, perhaps unconsciously,
manage them differently. Finally, if the researchers who assess outcomes
cannot distinguish treatment groups, that knowledge cannot affect their
measurements.
The terms single-blind (patients) and double-blind (patients and researchers)
are sometimes used, but their meaning is ambiguous. It is better simply
to describe what was done. A trial in which there is no attempt at blinding
is called open or open label.
When blinding is possible, mainly for studies of drug effects, it is usually
accomplished by means of a placebo. However, for many important clinical
questions-the effects of surgery, radiotherapy, diet, or the organization
of medical care-blinding of patients and managing physicians is not
possible.
Even when blinding appears to be possible, it is more often claimed
than successful. Physiologic effects, such as lowered pulse rate with beta-
blocking drugs, or bone marrow depression with cancer chemotherapy,
are regular features of some medications. Symptoms may also be a clue.
Example In the Lipid Research Clinics (8) trial of the primary prevention
of cardiovascular disease, a nearly perfect placebo was used. Some people
received cholestyramine and others a powder of the same appearance,
odor, and taste. However, side effects were substantially more common in
the cholestyramine group. At the end of the 1st year of the trial, there were
much higher rates in the experimental (cholestyramine) group than the control
group for constipation (39 versus 10%), heartburn (27 versus 10%),
belching and bloating (27 versus 16%), and nausea (16 versus 8%). Patients
might have been prompted by new symptoms to guess which treatment they
were getting.
Table 7.1
Summarizing Treatment Effects^a
^a Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of
treatment. New Engl J Med 1988;318:1728-1733.
^b For continuous data, when there are measurements at baseline and after treatment, analogous measures
are based on the mean values for treated and control groups either after treatment or for the difference
between baseline and posttreatment values.
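
Three measures commonly used to summarize treatment effects on dichotomous outcomes are the relative risk reduction, the absolute risk reduction, and the number needed to treat. The following is a minimal sketch of how they relate to one another; the function name is hypothetical, and the inputs are the death rates from the antiendotoxin subgroup discussed earlier in this chapter (49% with placebo, 30% with treatment), used only as illustrative numbers.

```python
def summarize_treatment_effect(control_rate, treated_rate):
    """Summarize a treatment effect from event rates in control and treated groups.

    Rates are proportions between 0 and 1. Returns a dict with the absolute
    risk reduction, relative risk reduction, and number needed to treat.
    """
    if not 0 < control_rate <= 1 or not 0 <= treated_rate <= 1:
        raise ValueError("rates must be proportions, with control_rate > 0")
    arr = control_rate - treated_rate            # absolute risk reduction
    rrr = arr / control_rate                     # relative risk reduction
    nnt = float("inf") if arr == 0 else 1 / arr  # number needed to treat
    return {"absolute_risk_reduction": arr,
            "relative_risk_reduction": rrr,
            "number_needed_to_treat": nnt}


if __name__ == "__main__":
    for name, value in summarize_treatment_effect(0.49, 0.30).items():
        print(f"{name}: {value:.2f}")
```

The same relative risk reduction corresponds to very different absolute reductions, and therefore numbers needed to treat, depending on how common the outcome is in untreated patients.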
[Figure: two diagrams showing patients who drop out or cross over. One is labeled analysis according to treatment assigned; the other, explanatory analysis, is labeled analysis according to treatment received.]
questions of effectiveness. This is in part because of the risk that the result
will be inconclusive. If a treatment is found to be ineffective, it could be
because of a lack of efficacy, lack of patient acceptance, or both.
Tailoring the Results of Trials to Individual Patients
Clinical trials involve pooling the experience of many patients who are
admittedly dissimilar and describing what happens to them on the average.
How can we obtain more precise estimates for individual patients? Two
ways are to examine subgroups and to study individual patients using
rigorous methods similar to those in randomized trials.
SUBGROUPS
The principal result of a clinical trial is a description of the most im-
portant outcome in each of the major treatment groups. But it is tempting
to examine the results in more detail than the overall conclusions afford.
expected death rate. The trial was also stopped because there were fewer
myocardial infarctions in the treated than the control group. The authors
thought that the effect on myocardial infarction, although not the answer to
a main study question at the outset, was real because it was biologically
plausible, because it was found in other studies, and because the chance of
a false-positive conclusion was estimated to be very small (1/10,000). On the
other hand, although the authors observed a small increase in risk of stroke
in the treated group, they could not be certain whether this effect was real
or not, as there were too few physicians with this end point. Thus, in a study
that could not address the main research question, the authors interpreted
the validity of findings in subgroups (both positive and negative) in relation
to the totality of information that might bear on the validity of these findings.
EFFECTIVENESS IN INDIVIDUAL PATIENTS
A treatment that is effective on the average may not work on an individ-
ual patient. The results of valid clinical research provide a good reason to
begin treating a patient, but experience with that patient is a better reason
to continue therapy. Therefore, when conducting a treatment program it
is useful to ask the following series of questions:
• Is the treatment known to be efficacious for any patients?
• Is the treatment known to be effective, on the average, in patients like
mine?
• Are the benefits worth the discomforts and risks?
• Is the treatment working in my patient?
By asking these questions, and not simply following the results of trials
alone, one can guard against ill-founded choice of treatment or stubborn
persistence in the face of poor results.
Table 7.2
Guidelines for Deciding Whether Apparent Differences in Effects
within Subgroups Are Real^a
^a Adapted from Oxman AD, Guyatt GH. A consumer's guide to subgroup analysis. Ann Intern Med
1992;116:78-84.
TRIALS OF N = 1
Rigorous clinical trials, with proper attention to bias and chance, can
be done with individual patients, one at a time (16). The method, called
trials of N = 1, is an improvement on the more informal process of trial
and error that is so common in clinical practice. A patient is given one or
another treatment (e.g., active treatment or placebo) in random order, each
for a brief period of time, such as a week or two. Patients and physicians
are blind to which treatment is given. Outcomes (e.g., a simple preference
for a treatment or a symptom score) are assessed after each period and
subjected to statistical analysis.
This method is useful when activity of disease is unpredictable, response
to treatment is prompt, and there is no carryover effect from period to
period. Examples of diseases for which the method can be used include
migraine, bronchospasm, fibrositis, and functional bowel disease.
N of 1 trials can be useful for guiding clinical decision making, although
for a relatively small proportion of patients. They can also be used to screen
interesting clinical hypotheses to select some that are promising enough
to be evaluated using a full randomized controlled trial involving many
patients.
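
The mechanics of a trial of N = 1 can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed protocol: treatment periods are grouped in pairs, the order within each pair is randomized, and the paired symptom scores are compared with an exact two-sided sign test. The scores in the usage example are invented.

```python
import random
from math import comb


def n_of_1_schedule(n_pairs, seed=0):
    """Random treatment order for one patient: each pair of periods contains
    one 'active' and one 'placebo' period, in random order."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_pairs):
        pair = ["active", "placebo"]
        rng.shuffle(pair)
        schedule.extend(pair)
    return schedule


def sign_test_p_value(active_scores, placebo_scores):
    """Exact two-sided sign test on paired symptom scores (lower = better).

    Pairs with identical scores are dropped, as is conventional for this test.
    """
    diffs = [a - p for a, p in zip(active_scores, placebo_scores) if a != p]
    n = len(diffs)
    if n == 0:
        return 1.0
    favors_active = sum(d < 0 for d in diffs)
    tail = min(favors_active, n - favors_active)
    p_value = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p_value, 1.0)


if __name__ == "__main__":
    print(n_of_1_schedule(n_pairs=6))
    # Hypothetical symptom scores for six pairs of periods.
    active = [2, 3, 1, 2, 2, 1]
    placebo = [5, 4, 4, 6, 3, 5]
    print(sign_test_p_value(active, placebo))
```

With only a handful of periods, even a perfectly consistent preference yields modest statistical evidence, which is one reason the method suits conditions in which response is prompt and many periods can be completed.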
after the best way to deliver the treatment has been worked out, so that a
good example of the intervention is tested. In any case, it is generally
agreed that if a controlled trial is postponed too long, the opportunity to
do it at all may be lost.
For these reasons, guidance from clinical trials is not available for many
treatment decisions. But the decisions must be made nonetheless. What
are the alternatives and how credible are they?
ADVANTAGES AND DISADVANTAGES
Control patients can be chosen from a time and place different from the
experimental patients. For example, we may compare the prognosis of
recent patients treated with current medications to experience with past
patients who were treated when current medications were not available.
Similarly, we may compare the results of surgery in one hospital to results
in another, where a different procedure is used. This approach is conve-
nient. The problem is that time and place are almost always strongly related
to prognosis. Clinical trials that attempt to make fair comparisons between
groups of patients arising in different eras, or in different settings, have a
particularly difficult task.
The results of current treatment are sometimes compared with experi-
ence with similar patients in the past, called historical or nonconcurrent
controls. Although it may be done well, this design has many pitfalls.
Methods of diagnosis change with time, and with them the average prog-
nosis. It has been shown that new diagnostic technologies have created
the impression that the prognosis of treated lung cancer has improved
over time when it has not (17). With better ability to detect occult metasta-
ses, patients are classified in a worse stage than they would have been
earlier, and this "stage migration" has resulted in a better prognosis in
each stage than was reported in the past. Supporting treatments (e.g.,
antibiotics, nutritional supplementation and peptic ulcer prevention) also
improve with time, creating a general improvement in prognosis that might
not be attributable to the specific treatment given in a later time period.
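
Stage migration is easy to verify with a handful of numbers. In the sketch below the survival times are invented for illustration: seven hypothetical patients are staged twice, and the only change is that better diagnostic technology moves one patient with occult metastases from the early stage to the late stage.

```python
from statistics import mean


def stage_means(early_stage, late_stage):
    """Mean survival (months) in the early stage, the late stage, and overall."""
    return mean(early_stage), mean(late_stage), mean(early_stage + late_stage)


# Hypothetical survival times in months. No patient's survival changes;
# only the staging of the patient who survives 20 months does.
old_staging = stage_means(early_stage=[40, 40, 30, 20], late_stage=[10, 10, 5])
new_staging = stage_means(early_stage=[40, 40, 30], late_stage=[20, 10, 10, 5])

if __name__ == "__main__":
    print("old staging (early, late, overall):", old_staging)
    print("new staging (early, late, overall):", new_staging)
```

The migrating patient did worse than the average early-stage patient but better than the average late-stage patient, so mean survival rises in both stages while overall survival is unchanged.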
Example Sacks et al. (18) reviewed clinical trials of six therapies to see
if trials with concurrent controls produced different results than studies of
the same treatments with historical controls. They studied 50 randomized
trials and 56 studies with historical controls. A total of 79% of trials with
historical controls but only 20% of trials with a concurrent, randomized control
group found the experimental treatment to be better. Differences between
the two kinds of trials occurred mainly because the control patients in the
historical trials did worse. Adjustment for prognostic factors, when possible,
did not change the results, i.e., the differences were probably because of
general improvements in therapy or selection of less ill patients.
Example The mortality rate for hospitals where coronary bypass surgery
was done varied almost threefold across hospitals in central Pennsylvania
(Fig. 7.8) (19). The severity of illness, and therefore prognosis, of patients in
these hospitals varied too. After taking into account the number of deaths
expected, by considering patients' prognostic factors, one hospital had fewer
than expected deaths, another the expected number, and a third more than
expected. Any fair comparison of treatment effects across these hospitals
would have to take into account not only the differences in severity of the
patients in these hospitals but also the skills of the surgeons.
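
Comparing observed with expected deaths can be sketched as below. The risk strata, death risks, patient counts, and function names are all hypothetical; they are not the Pennsylvania data, and real case-mix adjustment uses many more prognostic factors.

```python
def expected_deaths(patients_by_risk, risk_of_death):
    """Deaths expected from case mix alone: patients in each risk stratum
    multiplied by that stratum's assumed risk of death, summed over strata."""
    return sum(n * risk_of_death[stratum] for stratum, n in patients_by_risk.items())


def observed_to_expected(observed, patients_by_risk, risk_of_death):
    """Ratio of observed to expected deaths (above 1 = more deaths than expected)."""
    return observed / expected_deaths(patients_by_risk, risk_of_death)


# Hypothetical operative death risks by preoperative risk stratum.
RISK_OF_DEATH = {"low": 0.02, "high": 0.10}

# Hypothetical hospitals: A treats mostly low-risk patients, B many high-risk ones.
hospital_a = {"low": 900, "high": 100}   # 35 deaths observed, crude rate 3.5%
hospital_b = {"low": 500, "high": 500}   # 54 deaths observed, crude rate 5.4%

ratio_a = observed_to_expected(35, hospital_a, RISK_OF_DEATH)
ratio_b = observed_to_expected(54, hospital_b, RISK_OF_DEATH)

if __name__ == "__main__":
    print(f"hospital A observed/expected: {ratio_a:.2f}")
    print(f"hospital B observed/expected: {ratio_b:.2f}")
```

In this illustration the hospital with the higher crude death rate has fewer deaths than its case mix predicts, while the hospital with the lower crude rate has more, so the ranking reverses once severity of illness is taken into account.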
UNCONTROLLED TRIALS
[Figure 7.8: actual number and expected range plotted for hospitals 1, 2, and 3.]
Figure 7.8. Severity of illness and skill of surgeons vary by location. Observed and
expected (taking into account case mix) death rate from coronary bypass surgery
in three hospitals. (Data from Topol EJ, Califf RM. Scorecard cardiovascular medicine.
Its impact and future direction. Ann Intern Med 1994;120:65-70.)
[Figure 7.9: vertical scale from 0 to 4+; horizontal axis, time, 1955 to 1964.]
Figure 7.9. The unpredictable course of disease. The natural history of systemic
lupus erythematosus in a patient observed before the advent of immunosuppressive
drugs. (Redrawn from Ropes M. Systemic lupus erythematosus. Cambridge, MA:
Harvard University Press, 1976.)
Summary
Promising ideas about what might be good treatment should be put to
a rigorous test before being used as a basis for clinical decisions. The best
test is a randomized controlled trial, a special case of a cohort study in
which the intervention is allocated randomly and, therefore, without bias.
Patients in clinical trials are usually highly selected, reducing generalizability.
They are randomly allocated to receive either an experimental intervention
or some comparison management: usual treatment, a placebo, or simple
observation. On the average, the compared groups have a similar
prognosis just after randomization (and before the interventions). How-
ever, differences not attributable to treatment can arise later, including not
taking the assigned treatment, dropping out of the study, receiving the
other treatment, being managed differently in other ways, or getting treat-
ments that are not part of the study. Blinding all participants in the trial
can help minimize bias in how patients are randomized, managed, and
assessed for outcomes but is not always possible or successful.
The results of randomized trials can be summarized according to the
treatment assigned (an intention-to-treat analysis), which is a test of the
clinical decision and maintains a randomized trial design, or according to
the treatment actually received, which bears on the biology of disease, but
not directly on the clinical decision, and has the disadvantage that patients
may not remain with the treatment they were originally assigned. To obtain
information more closely tailored to individual patients than the main
results of randomized trials afford, clinicians can use results in subgroups
of patients, which carry the additional risk of being misleading, or do trials
on their own patients, one at a time.
For many clinical questions it is not possible, or not practical, to rely
on a randomized controlled trial. Compromises with the ideal include
making comparisons to experience with past patients, to past experience
with the same patients, or to a concurrent group of patients who are not
randomly allocated. When these compromises are made, the internal validity
of the study is weakened.
REFERENCES
1. EC/IC Bypass Study Group. Failure of extracranial-intracranial arterial bypass to reduce
the risk of ischemic stroke. N Engl J Med 1985;313:1191-1200.
2. Mayberg MR, Wilson E, Yatsu F, Weiss DG, Messina L, Hershey LA, Colling C, Eskridge
J, Deykin D, Winn HR. Carotid endarterectomy and prevention of cerebral ischemia in
symptomatic carotid stenosis. JAMA 1991;266:3289-3294.
3. Opie. On the heart [editorial]. Lancet 1980;1:692.
4. Rubenstein LZ, Robbins AS, Josephson KR, Schulman BL, Osterweil D. The value of
assessing falls in an elderly population. A randomized controlled trial. Ann Intern Med
1990;113:308-316.
5. Fischer RW. Comparison of antipruritic drugs administered orally. JAMA 1968;203:418-
419.
6. Ziegler EJ, et al. Treatment of gram-negative bacteremia and septic shock with HA-1A
human monoclonal antibody against endotoxin. New Engl J Med 1991;324:429-436.
7. Coronary Drug Project Research Group. Influence of adherence to treatment and response
of cholesterol on mortality in the coronary drug project. N Engl J Med 1980;303:1038-
1041.
8. Lipid Research Clinics Program. The Lipid Research Clinics coronary primary prevention
trial results. I. Reduction in incidence of coronary heart disease. JAMA 1984;251:351-364.
9. Byington RP, et al. Assessment of double-blindness at the conclusion of the beta-blocker
heart attack trial. JAMA 1985;253:1733-1736.
10. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the
consequences of treatment. New Engl J Med 1988;318:1728-1733.
11. Naylor CD, Chen E, Strauss B. Measured enthusiasm: does the method of reporting trial
results alter perceptions of therapeutic effectiveness? Ann Intern Med 1992;117:916-921.
12. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing effect of
relative and absolute risk. J Gen Intern Med 1993;8:543-548.
13. McNeil BJ, Pauker SG, Sox HC Jr, Tversky A. On the elicitation of preferences for
alternative therapies. New Engl J Med 1982;306:1259.
14. Sackett DL, Gent M. Controversy in counting and attributing events in clinical trials. N
Engl J Med 1979;301:1410-1412.
15. Steering Committee of the Physicians' Health Study Research Group. Final report of the
aspirin component of the ongoing Physicians' Health Study. New Engl J Med
1989;321:129-135.
16. Guyatt G, Sackett D, Taylor DW, Chong J, Roberts R, Pugsley S. Determining optimal
therapy-randomized trials in individual patients. N Engl J Med 1986;314:889-892.
17. Feinstein AR, Sosin DM, Wells CK. The Will Rogers phenomenon. Stage migration and
new diagnostic techniques as a source of misleading statistics for survival in cancer. New
Engl J Med 1985;312:1604-1608.
18. Sacks H, Chalmers TC, Smith H Jr. Randomized versus historical controls for clinical
trials. Am J Med 1982;72:233-240.
19. Topol EJ, Califf RM. Scorecard cardiovascular medicine. Its impact and future direction.
Ann Intern Med 1994;120:65-70.
20. Ropes M. Systemic lupus erythematosus. Cambridge, MA: Harvard University Press, 1976.
21. Spilker B. Guide to clinical interpretation of data. New York: Raven Press, 1986.
8
PREVENTION
Live sensibly: among a thousand people, only one dies a natural
death, the rest succumb to irrational modes of living.
Maimonides 1135-1204 A.D.
... patients stop smoking and public education, regulations, and taxes prevent
teenagers from starting to smoke.
Much of the scientific approach to prevention in clinical medicine, particularly
the principles underlying the use of diagnostic tests, disease
prognosis, and effectiveness of interventions, has already been covered in
this book. This chapter expands on those principles and strategies as they
specifically relate to prevention.
Levels of Prevention
Webster's (2) dictionary defines prevention as "the act of keeping from
happening." With this definition in mind, almost all activities in medicine
could be defined as prevention. After all, clinicians' efforts are aimed at
preventing the untimely occurrences of death, disease, disability, discom-
fort, dissatisfaction, and destitution (Chapter 1). However, in clinical medi-
cine, the definition of prevention is usually restricted, as outlined below.
Although more prevention is practiced than ever before, clinicians still
spend most of their time in diagnosing and treating rather than in preventing
disease.
Depending on when in the course of disease interventions are made,
three types of prevention are possible (Fig. 8.1).
[Figure 8.1. Labels recovered from the figure: primary prevention; onset; asymptomatic disease; clinical diagnosis; clinical course.]
maximize the amount of high-quality time a patient has left. For example,
presently there is no specific therapy for patients with amyotrophic lateral
sclerosis, a neurologic condition ending in paralysis of respiratory and
swallowing muscles. But careful medical management can lead to early
intervention with a gastrostomy for administering food and liquids to
prevent dehydration and weakness from starvation, a tracheostomy for
better suctioning to prevent pneumonia for as long as possible, and, if the
patient wishes, a portable respirator to rest respiratory muscles. Without
such a proactive approach, the patient may present with acute respiratory
failure due to the combined effects of the underlying disease, dehydration,
and pneumonia. Patient, family, and physician are then faced with endotra-
cheal intubation and admission to the intensive care unit, with the hope
of reversing enough of the processes to reestablish decent quality of life
for a little longer. Tertiary prevention can help avoid this scenario.
There are few, if any, tertiary prevention programs outside the health
care system, but many health care professionals in addition to physicians
are active in these programs.
Burden of Suffering
Is screening justified by the severity of the medical condition in terms
of mortality, morbidity, and suffering caused by the condition? Only condi-
tions posing threats to life or health (the six Ds) should be sought. The
severity of the medical condition is determined primarily by the risk it
poses or its prognosis (discussed in Chapters 5 and 6). For example, except
during pregnancy and before urologic surgery, the health consequences
of asymptomatic bacteriuria are not clear. We do not know if it causes
renal failure and/or hypertension. Even so, bacteriuria is frequently sought
in periodic health examinations.

Table 8.1
Criteria for Deciding Whether a Medical Condition Should Be Included in Periodic
Health Examinations
1. How great is the burden of suffering caused by the condition in terms of:
     Death          Discomfort
     Disease        Dissatisfaction
     Disability     Destitution
2. How good is the screening test, if one is to be performed, in terms of:
     Sensitivity    Cost             Labeling effects
     Specificity    Safety
     Simplicity     Acceptability
3. a. For primary prevention, how effective is the intervention?
   b. For secondary prevention, if the condition is found, how effective is the
      ensuing treatment in terms of:
     Efficacy
     Patient compliance
     Early treatment being more effective than later treatment
Burden of suffering takes into account the frequency of a condition.
Often a particular condition causes great suffering for individuals unfortunate
enough to get it, but occurs too rarely (perhaps in the individual's
particular age group) for screening to be considered. Breast cancer and
colorectal cancer are two such examples. Although both can occur in much
younger people, they primarily occur in persons older than 50 years. For
women in their early 20s, breast cancer incidence is 1 in 100,000 (one-fifth
the rate for men in their early 70s) (3). Although breast cancer should be
sought in periodic health examinations in women over 50, it is too uncommon
in 20-year-old women (or 70-year-old men) for screening. Screening
for very rare diseases means not only that at most very few people will
benefit but, because of false-positive tests, that many people may suffer
harm from labeling and further workup (see below).
A particularly difficult dilemma faced by clinicians and patients is the
situation in which a person is known to be at high risk for a condition,
but there is no evidence that early treatment is effective. What should the
physician and patient do? For example, there is evidence that people with
Barrett's esophagus (a condition in which the squamous mucosa in the
distal esophagus is replaced by columnar epithelium) run a 30- to 40-fold
greater risk of developing esophageal cancer than persons without Barrett's
Which Tests?
The following criteria for a good screening test apply to all types of
screening tests, whether they are history, physical examination, laboratory
tests or procedures.
SENSITIVITY AND SPECIFICITY
The very nature of searching for a disease in people without symptoms
for the disease means prevalence is usually very low, even among high-
risk groups selected because of age, sex, and other characteristics. A good
screening test must, therefore, have a high sensitivity, so it does not miss
the few cases of disease that are present, and a high specificity, to reduce
the number of people with false-positive results who require further
workup.
Sensitivity and specificity are determined for screening tests much as
they are for diagnostic tests, except that the gold standard for the presence
of disease usually is not another test but rather a period of follow-up. For
example, in a study of fecal occult blood tests for colorectal cancer, the
sensitivity of the test was determined by the ratio of the number of colorectal
cancers found during screening to that number plus the number of
interval cancers, colorectal cancers subsequently discovered over the following
year in the people with negative test results (the assumption being that
interval cancers were present at screening but were missed, i.e., the test
results were false negative) (5). Determination of sensitivity and specificity
for screening tests in this way is sometimes referred to as the detection
method.
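The arithmetic of the detection method can be written out in a few lines. The following is a minimal sketch, not taken from the cited study: the function name and the counts are hypothetical and serve only to show how screen-detected and interval cancers enter the ratio.

```python
def detection_method_sensitivity(screen_detected: int, interval_cancers: int) -> float:
    """Sensitivity by the detection method.

    screen_detected:  cancers found at screening (treated as true positives)
    interval_cancers: cancers diagnosed during follow-up after a negative
                      screen (treated as false negatives)
    """
    total = screen_detected + interval_cancers
    if total == 0:
        raise ValueError("no cancers observed; sensitivity is undefined")
    return screen_detected / total


# Hypothetical counts, for illustration only.
sensitivity = detection_method_sensitivity(screen_detected=90, interval_cancers=10)
print(f"Detection-method sensitivity: {sensitivity:.2f}")
```

Note that the result depends on the length of follow-up chosen: a longer follow-up period counts more interval cancers and lowers the estimate.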
The detection method for calculating sensitivity works well for many
screening tests, but there are two difficulties with the method for some
cancer screening tests. First, it requires that the appropriate amount of
follow-up time is known; often it is not known and must be guessed. The
method also requires that the abnormalities detected by the screening test
would go on to cause trouble if left alone. This second issue is a problem
in screening for prostate cancer. Because histologic prostate cancer is so
common in men (it is estimated that 25% of 50-year-old men have histologic
foci of prostate cancer, and by the age of 90, virtually all men do), screening
tests can find such cancers in many men, but for most, the cancer will
never become malignant. Thus, when the sensitivity of prostate cancer
tests such as prostate-specific antigen (PSA) is determined by the detection
method, the test may look quite good, since the numerator includes all
cancers found, not just those with malignant potential.
To get around these problems, the incidence method, a new method,
calculates sensitivity by using the incidence in persons not undergoing
screening and the interval cancer rate in persons who are screened. The
rationale for this approach is that the sensitivity of a test should affect
interval cancer rates but not disease incidence. For prostate cancer, the
incidence method defines sensitivity of the test as 1 minus the ratio of the
interval prostate cancer rate in a group of men undergoing periodic screen-
ing to the incidence of prostate cancer in a group of men not undergoing
screening (control group). The incidence method of calculating sensitivity
gets around the problem of counting "benign" prostate cancers, but it may
underestimate sensitivity because it excludes cancers with long lead times.
True sensitivity of a test is, therefore, probably between the estimates of
the two methods.
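The incidence method described above reduces to one subtraction and one division. The sketch below assumes both rates are expressed in the same units (for example, per 1000 person-years); the function name and the example rates are hypothetical.

```python
def incidence_method_sensitivity(interval_rate_screened: float,
                                 incidence_unscreened: float) -> float:
    """Sensitivity by the incidence method.

    sensitivity = 1 - (interval cancer rate in screened people /
                       incidence in an unscreened control group)
    Both rates must use the same denominator and time unit.
    """
    if incidence_unscreened <= 0:
        raise ValueError("incidence in the unscreened group must be positive")
    return 1.0 - interval_rate_screened / incidence_unscreened


# Hypothetical rates per 1000 person-years, for illustration only.
sensitivity = incidence_method_sensitivity(interval_rate_screened=0.5,
                                           incidence_unscreened=2.0)
print(f"Incidence-method sensitivity: {sensitivity:.2f}")
```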
Because of the low prevalence of most diseases, the positive predictive
value of most screening tests is low, even for tests with high specificity.
Clinicians who practice preventive health care by performing screening
tests on their patients must accept the fact that they will have to work up
many patients who will not have disease. However, they can minimize
the problem by concentrating their screening efforts on people with a
higher prevalence for disease.
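The dependence of positive predictive value on prevalence follows from the definitions in Chapter 3. As a rough sketch, the function below applies those definitions; the sensitivity, specificity, and prevalence values are hypothetical and chosen only to show the direction and size of the effect.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Proportion of positive test results that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("no positive results expected; PPV is undefined")
    return true_pos / (true_pos + false_pos)


# Hypothetical test: 90% sensitive, 95% specific.
for prevalence in (0.001, 0.01, 0.1):
    ppv = positive_predictive_value(0.90, 0.95, prevalence)
    print(f"prevalence {prevalence:.3f} -> PPV {ppv:.3f}")
```

Even with a specificity of 95%, most positive results are false positives when only 1 person in 1000 has the disease, which is why restricting screening to higher-prevalence groups reduces unnecessary workups.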
Example The incidence of breast cancer increases with age, from approximately
1 in 100,000/year at age 20 to 1 in 200/year over age 70. Therefore,
a lump found during screening in a young woman's breast is more likely to
be nonmalignant than a lump in an older woman. In a large demonstration
project on breast cancer screening, biopsy results of breast masses varied
markedly according to the age of women (6); in women under age 40, more
than 16 benign lesions were found for every malignancy, but in women over
age 70 fewer than 3 benign lesions were found for every malignancy (Fig.
8.2). Sensitivity and specificity of the clinical breast examination and mammography
are better in older women as well, because of changes in breast
tissue as women grow older.
The yield of screening decreases as screening is repeated over time in
a group of people. Figure 8.3 demonstrates why this is true. The first time
that screening is carried out (the prevalence screen), cases of the medical
condition will have been present for varying lengths of time. During the
second round of screening, most cases found will have had their onset
between the first and second screening. (A few will have been missed by
the first screen.) Therefore, second and subsequent screenings are called
incidence screens. Figure 8.3 illustrates how, when a group of people are
Figure 8.2. Yield of a screening test according to patients' age. Ratio of
nonmalignant:malignant biopsy results among women screened for breast cancer,
plotted against age (years). (Data from Baker LH. Breast Cancer Detection
Demonstration Project: five-year summary report. CA 1982;32:195-231.)
[Figure 8.3. Number of cases newly detected at successive rounds of screening.]
The physician cannot postpone action and does his or her best. It is quite
another matter to subject presumably well people to risks when there is
no known problem. In such circumstances, the procedure should be espe-
cially safe. This is partly because the chances of finding disease in healthy
people are so low. Thus, although colonoscopy is hardly thought of as a
"dangerous" procedure when used on patients with gastrointestinal complaints,
it may be too dangerous to use as a screening procedure because
of the possibility of bowel perforation. In fact, if colonoscopy, with a perforation
rate of 0.2%, were used to screen for colorectal cancer in women in
their 50s, almost two perforations would occur for every cancer found. For
women in their 70s, the ratio would reverse, because colorectal cancer is
so much more common (7).
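The comparison of perforations with cancers found is a simple ratio of two per-person probabilities. In the sketch below, the perforation rate of 0.2% comes from the text; the yields of cancer per person screened are hypothetical placeholders, chosen only to show how the ratio can reverse as cancer becomes more common.

```python
def perforations_per_cancer_found(perforation_rate: float,
                                  cancers_found_per_person: float) -> float:
    """Expected perforations for each cancer detected by screening."""
    if cancers_found_per_person <= 0:
        raise ValueError("yield of cancers must be positive")
    return perforation_rate / cancers_found_per_person


PERFORATION_RATE = 0.002  # 0.2%, as quoted in the text

# Hypothetical yields (cancers found per person screened), for illustration only.
for label, cancer_yield in (("lower-risk group", 0.001), ("higher-risk group", 0.004)):
    ratio = perforations_per_cancer_found(PERFORATION_RATE, cancer_yield)
    print(f"{label}: {ratio:.2f} perforations per cancer found")
```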
ACCEPTABLE TO BOTH PATIENTS AND CLINICIANS
The importance of acceptability is illustrated by experience with tests
for early cervical cancer and early colon cancer. Women at greatest risk
for cervical cancer are least likely to get routine Pap smears. The same
problem holds true for colorectal cancer. Studies indicate there is a strong
reluctance among asymptomatic North Americans to submit to periodic
examinations of their lower gastrointestinal tracts-a finding that should
be no surprise to any of us!
Table 8.2 shows acceptance of screening for colorectal cancer by various
kinds of people. People who voluntarily attended a colorectal cancer
screening clinic were very cooperative; they were willing to collect stool
samples, smear the samples on guaiac-impregnated paper slides, and mail
the slides to their doctors for clinical testing. Patients who did not volunteer
were less willing to participate. Older persons, who are at greatest risk for
colorectal cancer because of their age, were least willing to be screened.
Substantial extra effort can result in getting more people (but still fewer
than half) to participate.

Table 8.2
Patients' Acceptance of Screening Tests: Reported Response Rates for Returning
Guaiac-impregnated Slides in Different Settings(a)

Setting                                                     Participants Returning
                                                            Slides (percent)
Colorectal cancer screening program                         85
Breast cancer screening program                             70
HMO members aged 50-74 years                                27
HMO members aged 50-74 years sent kit, reminder letter,     48
  and self-help booklet and who were called with
  instructions and reminders

(a) Data from Myers RE, Ross EA, Wolf TA, Balshem A, Jepson C, Millner L. Behavioral
interventions to increase adherence in colorectal screening. Med Care 1991;29:1039-1050;
and Fletcher SW, Dauphinee WD. Should colorectal carcinoma be sought in periodic health
examinations? An approach to the evidence. Clin Invest Med 1981;4:23-31.
The acceptability of the test to clinicians is a criterion usually overlooked
by all but the ones performing it. After one large, well-conducted study
on the usefulness of screening, sigmoidoscopy was abandoned because the
physicians performing the procedure (gastroenterologists, at that) found
it too cumbersome and time-consuming to be justified by the yield (8).
(Patient acceptance, 18%, was not good either.)
LABELING
Effectiveness of Treatment
"Treatments" in primary prevention are immunizations, such as tetanus
toxoid to prevent tetanus; drugs, such as aspirin to prevent myocardial
infarction; and behavioral counseling, such as helping patients stop smok-
ing or adopt low-cholesterol diets. Whatever the intervention, it should be
efficacious (produce a beneficial result in ideal situations) and effective
(produce a beneficial result under usual conditions). Efficacy and effective-
ness of pharmaceuticals are usually better documented than they are for
behavioral counseling. Federal laws require rigorous evidence of efficacy
before pharmaceuticals are approved for use. The same is not true for
behavioral counseling methods, but clinicians should require scientific evidence
before incorporating routine counseling into health maintenance.
Health behaviors are among the most important determinants of health in
modern society; effective counseling methods could promote health more
than most anything else a clinician can do, but counseling that does not
work wastes time, costs money, and may harm patients.

Table 8.3
Relation between Number of Tests Ordered and Percentage of Normal People with
at Least One Abnormal Test Result(a)

Number of Tests     People with at Least One Abnormality (percent)
  1                  5
  5                 23
 20                 64
100                 99.4

(a) From Sackett DL. Clinical diagnosis and the clinical laboratory. Clin Invest Med 1978;1:37-43.
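The percentages in Table 8.3 are consistent with a simple probability calculation. The sketch below rests on two assumptions that are ours, not stated in the table: each test labels 5% of normal people as abnormal (as happens when "normal" is defined as the central 95% of values), and test results are independent of one another.

```python
def prob_at_least_one_abnormal(n_tests: int, p_abnormal: float = 0.05) -> float:
    """Probability that a normal person has one or more abnormal results.

    Assumes each test is independently abnormal with probability p_abnormal.
    """
    if n_tests < 0:
        raise ValueError("number of tests cannot be negative")
    return 1.0 - (1.0 - p_abnormal) ** n_tests


for n in (1, 5, 20, 100):
    percent = 100 * prob_at_least_one_abnormal(n)
    print(f"{n:>3} tests: {percent:.1f}% of normal people with an abnormal result")
```

In practice test results are often correlated, so the independence assumption gives only an approximation.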
Example Two different smoking cessation counseling strategies (weekly
hour-long group counseling sessions for 8 weeks and weekly 10- to
20-min individual counseling sessions for 8 weeks) were combined with
nicotine patch therapy and evaluated for their effectiveness in promoting
smoking cessation (11,12). Compared with patients randomized to control
groups, the patients receiving the interventions did somewhat better, with a
third of patients in the group counseling sessions having stopped smoking
at 6 months follow-up. However, fewer than 20% of patients receiving individual
counseling had stopped smoking. Furthermore, the authors found
that most failures at 6 months could be predicted by patients smoking at
some time during the first 2 weeks after trying to stop. These findings suggest
that counseling should be "front loaded." By carefully evaluating behavioral
counseling, studies such as this are determining what approaches work.
Treatments for secondary prevention are generally the same as treat-
ments for curative medicine. Like interventions for primary prevention,
they should be both efficacious and effective. If early treatment is not
effective, it is not worth screening for a medical problem regardless of how
easily it can be found, because early detection alone merely extends the
length of time the disease is known to exist, without helping the patient.
Another criterion important for treatments in secondary prevention is
that patient outcome must be better if the disease is found by screening,
when it is asymptomatic, than when it is discovered later, after the condi-
tion becomes symptomatic and the person seeks medical care. If outcome
in the two situations is the same, screening is not necessary.
Example In a study of the use of chest x-rays and sputum cytology to
screen for lung cancer, male cigarette smokers who were screened every 4
months and treated promptly if cancer was found did no better than those
not offered screening (13); at the end of the study, death rates from lung
cancer were the same in the two groups: 3.2 per 1000 person-years in the
screened men versus 3.0 per 1000 person-years in men not offered screening.
Early detection and treatment did not help patients with lung cancer more
than treatment of people at the time they presented with symptoms.
BIASES
As discussed in Chapter 7, the best way to establish the efficacy of
treatment is with a randomized controlled trial. This is true for all interven-
tions but especially for early treatment after screening. To establish that a
preventive intervention is effective typically takes years and requires large
numbers of people to be studied. For example, early treatment after colorectal
cancer screening can decrease colorectal cancer deaths by approximately
one-third. But to show this effect, a study with 13 years of follow-
up was required (5). A "clinical impression" of the effect of screening
simply does not suffice in this situation.
Figure 8.4. How lead time affects survival time after screening; shaded areas
indicate length of survival after diagnosis (Dx). Three timelines (unscreened;
screened, early treatment not effective; screened, early treatment is effective,
with improved survival) run from onset through early diagnosis and usual
diagnosis to death.
Careful studies are also necessary because of biases that are specific to
studies of the effectiveness of screening programs. Three such biases are
described below.
Lead Time Bias
Lead time is the period of time between the detection of a medical condi-
tion by screening and when it ordinarily would be diagnosed because a
patient experiences symptoms and seeks medical care (Fig. 8.4). The
amount of lead time for a given disease depends on both the biologic rate
of progression of the disease and on the ability of the screening test to
detect early disease. When lead time is very short, as is presently the case
with lung cancer, treatment of medical conditions picked up on screening
is likely to be no more effective than treatment after symptoms appear.
On the other hand, when lead time is long, as is true for cervical cancer
(on average, it takes approximately 30 years to progress from carcinoma
in situ to clinically invasive disease), treatment of the medical condition
found on screening can be very effective.
How can lead time cause biased results in a study of the efficacy of
early treatment? As Figure 8.4 shows, because of screening, a disease is
found earlier than it would have been after the patient developed symp-
toms. As a result, people who are diagnosed by screening for a deadly
disease will, on average, survive longer from the time of diagnosis than
people who are diagnosed after they get symptoms, even if early treatment
is no more effective than treatment at clinical presentation. In such a situation
screening would appear to help people live longer, when in reality
they would be given not more "survival time" but more "disease time."
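Lead time bias can be made concrete with a small calculation. The sketch below assumes a hypothetical patient whose age at death is the same whether or not the disease is found by screening (that is, early treatment is not effective); all ages are invented for illustration.

```python
def survival_after_diagnosis(age_at_diagnosis: float, age_at_death: float) -> float:
    """Years lived after diagnosis."""
    return age_at_death - age_at_diagnosis


# Hypothetical patient; early treatment is assumed to have no effect,
# so the age at death is identical in both scenarios.
AGE_AT_DEATH = 70.0
AGE_AT_USUAL_DIAGNOSIS = 67.0   # diagnosed after symptoms appear
LEAD_TIME = 3.0                 # years by which screening advances diagnosis

unscreened = survival_after_diagnosis(AGE_AT_USUAL_DIAGNOSIS, AGE_AT_DEATH)
screened = survival_after_diagnosis(AGE_AT_USUAL_DIAGNOSIS - LEAD_TIME, AGE_AT_DEATH)

print(f"Survival after diagnosis, unscreened: {unscreened:.1f} years")
print(f"Survival after diagnosis, screened:   {screened:.1f} years")
print(f"Apparent gain (all of it lead time):  {screened - unscreened:.1f} years")
```

The apparent gain equals the lead time exactly, although the patient dies at the same age in both scenarios. This is why studies of screening compare mortality rates rather than survival from diagnosis.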
Figure 8.5. Length time bias. Cases that progress rapidly from onset (O) to symptoms
and diagnosis (Dx) are less likely to be detected during a screening examination.
Compliance Bias
Compliance bias, the third major type of bias that can occur in effectiveness
studies of presymptomatic treatment, is the result of the extent to
which patients follow medical advice. Compliant patients tend to have
better prognoses regardless of screening. If a study compares disease out-
comes among volunteers for a screening program with outcomes in a group
of people who did not volunteer, better results for the volunteers might not
be due to treatment but be the result of other factors related to compliance.
Example In a study of the effect of a health maintenance program, one
group of patients was invited for an annual periodic health examination and
a comparable group was not invited (14). Over the years, however, some of
the control group asked for periodic health examinations. As seen in Figure
8.7, those patients in the control group who actively sought out the examinations
had better mortality rates than the patients who were invited for screening.
The latter group contained not only compliant patients but also ones
who had to be persuaded to participate.
Figure 8.6. Length time bias. Rapidly growing tumors come to medical attention
before screening is performed, whereas more slowly growing tumors allow time for
detection. D, diagnosis after symptoms; S, diagnosis after screening.
Figure 8.7. Effect of patient compliance on a screening program. The control group
of patients (filled circles) had a lower standardized mortality ratio (observed:expected
deaths, standardized for age) than the study group offered screening (open circles);
but the control group included only patients who requested screening, whereas the
study group included all patients offered screening. Horizontal axis: number of MHCs.
MHCs, multiphasic health checkups. (Redrawn from Friedman GD, Collen MF, Fireman BH.
Multiphasic health checkup evaluation: a 16-year follow-up. J Chron Dis 1986;39:453-463.)
Patients dying of colorectal cancer in the rectum or distal sigmoid were less
likely to have undergone a screening sigmoidoscopy in the previous 10 years
(8.8%) than those in the control group (24.2%), and sigmoidoscopy followed
by early therapy prevented almost 60% of deaths from distal colorectal cancer.
Also, by showing that there was no protection for colorectal cancers above
the level reached by sigmoidoscopy, the authors suggested that "it is difficult
to conceive of how such anatomical specificity of effect could be explained
by confounding."
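In a case-control study such as this one, the two percentages of prior screening are usually summarized as an odds ratio. The sketch below computes only the crude (unadjusted) odds ratio from the two percentages quoted above; the function names are ours, and a crude figure need not reproduce the "almost 60%" reduction, which comes from the study's own analysis.

```python
def odds(proportion: float) -> float:
    """Convert a proportion to odds."""
    if not 0.0 <= proportion < 1.0:
        raise ValueError("proportion must be in [0, 1)")
    return proportion / (1.0 - proportion)


def crude_odds_ratio(exposed_among_cases: float, exposed_among_controls: float) -> float:
    """Odds of exposure in cases divided by odds of exposure in controls."""
    return odds(exposed_among_cases) / odds(exposed_among_controls)


# Proportions with a screening sigmoidoscopy in the previous 10 years (from the text).
SCREENED_AMONG_CASES = 0.088      # patients who died of distal colorectal cancer
SCREENED_AMONG_CONTROLS = 0.242   # control group

ratio = crude_odds_ratio(SCREENED_AMONG_CASES, SCREENED_AMONG_CONTROLS)
print(f"Crude odds ratio: {ratio:.2f}")
```

An odds ratio well below 1 means that prior screening was less common among those who died of the disease, which is the pattern expected if screening is protective.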
Figure 8.8. Weighing benefit and harm from screening. What happens during a
decade of annual mammography in 1000 women starting at age 40. Panels:
false-positive mammograms, procedures because of mammograms, new breast cancers,
and deaths from breast cancer.
Current Recommendations
With progress in the science of prevention, current recommendations
on health maintenance are quite different from those of the past. Several
groups have recommended abandoning routine annual checkups in favor
of a selective approach in which the tests to be done depend on a person's
age, sex, and clinical characteristics (thereby increasing prevalence and
Summary
Disease can be prevented by keeping it from occurring in the first place
(primary prevention), with interventions such as immunization and behav-
ioral counseling. Such interventions should be evaluated for effectiveness
as rigorously as other kinds of clinical interventions.
Ill effects from disease can also be prevented by conducting screening
tests at a time when presymptomatic treatment is more effective than treat-
ment when symptoms occur (secondary prevention). A disease is sought
if the disease causes a substantial burden of suffering, if a good screening
test is available, and if presymptomatic treatment is more effective than
treatment at the usual time. Screening tests should be sensitive enough to
pick up most cases of the condition sought, specific enough that there are
not too many false-positive results, inexpensive, safe, and well accepted
by both patients and clinicians.
In secondary prevention, three potential biases threaten studies of the
effectiveness of presymptomatic treatment: failure to account for the lead
time gained by early detection, the tendency to detect a disproportionate
number of slowly advancing cases when screening prevalent cases, and
confounding the good prognosis associated with compliance with the ef-
fects of the preventive intervention itself.
Based on these criteria, a limited number of primary prevention interventions
and screening tests for secondary prevention are recommended
for health maintenance, according to the age, sex, and clinical status of the
patient.
REFERENCES
1. Schappert SM. National ambulatory medical care survey: 1991 summary. Advance data
from vital and health statistics, No. 230. Hyattsville, MD: National Center for Health
Statistics, 1993.
2. Webster's ninth new collegiate dictionary. Springfield, MA: Merriam-Webster, 1991.
3. Ries LAG, Miller BA, Hankey BF, Kosary CL, Harras A, Edwards BK, eds. SEER cancer
statistics review, 1973-1991: tables and graphs. NIH Publication No. 94-2789. Bethesda,
MD: National Cancer Institute, 1994.
4. Cameron AJ, Ott BJ, Payne WS. The incidence of adenocarcinoma in columnar-lined
(Barrett's) esophagus. N Engl J Med 1985;313:857-859.
5. Mandel JS, et al. Reducing mortality from colorectal cancer by screening for fecal occult
blood. N Engl J Med 1993;328:1365-1371.
6. Baker LH. Breast Cancer Detection Demonstration Project: five-year summary report. CA
1982;32:195-231.
7. Eddy DM. Screening for colorectal cancer. Ann Intern Med 1990;113:373-384.
8. Dales LG, Friedman GD, Collen MF. Evaluating periodic multiphasic health checkups:
a controlled trial. J Chron Dis 1979;32:385-404.
9. Lerman C, Trock B, Rimer BK, Boyce A, Jepson C, Engstrom PF. Psychological and behavioral
implications of abnormal mammograms. Ann Intern Med 1991;114:657-661.
10. Romm FJ, Fletcher SW, Hulka BS. The periodic health examination: comparison of recommendations
and internists' performance. South Med J 1981;74:263-271.
11. Fiore MC, Kenford SL, Jorenby DE, Wetter DW, Smith SS, Baker TB. The nicotine patch:
clinical effectiveness with different counseling treatments. Chest 1994;105:524-533.
12. Kenford SL, Fiore MC, Jorenby DE, Smith SS, Wetter D, Baker TB. Predicting smoking
cessation: who will quit with and without the nicotine patch. JAMA 1994;271:589-594.
13. Fontana RS, Sanderson DR, Woolner LB, Taylor WF, Miller WE, Muhm JR. Lung cancer
screening: the Mayo program. J Occup Med 1986;28:746-750.
14. Friedman GD, Collen MF, Fireman BH. Multiphasic health checkup evaluation: a 16-year
follow-up. J Chron Dis 1986;39:453-463.
15. Selby JV, Friedman GD, Quesenberry CP, Weiss NS. A case-control study of screening
sigmoidoscopy and mortality from colorectal cancer. N Engl J Med 1992;326:653-657.
16. Eddy DM, Hasselblad V, McGivney W, Hendee W. The value of mammography screening
in women under age 50 years. JAMA 1988;259:1512-1519.
9
CHANCE
Random Error
The observed differences between treated and control patients in a clini-
cal trial cannot be expected to represent the true differences exactly because
of random variation in both of the groups being compared. Statistical tests
help estimate how well the observed difference approximates the true
one. Why not measure the phenomenon directly and do away with this
uncertainty? Because research must ordinarily be conducted on a sample
of patients and not all patients with the condition under study. As a result,
there is always a possibility that the particular sample of patients in a
study, even though selected in an unbiased way, might not be similar to
the population of patients as a whole.
Two general approaches are used to assess the role of chance in clinical
observations. The first, called hypothesis testing, asks whether an effect (difference)
is present or not by using statistical tests to examine the hypothesis
that there is no difference (the "null hypothesis"). This is the traditional
way of assessing the role of chance, popular since statistical testing was
introduced at the beginning of this century and associated with the familiar
"p values." The other approach, called estimation, uses statistical methods
to estimate the range of values that is likely to include the true value.
This approach has gained popularity recently and is now favored by most
journals for reasons that we describe below.
We begin with a description of the traditional approach.
Hypothesis Testing
In the usual situation, where the principal conclusions of a trial are
expressed in dichotomous terms (e.g., the treatment is considered to be
either successful or not) and the result of the statistical test is also dichotomous
(the result is either "statistically significant," i.e., unlikely to be
purely by chance, or not), there are four ways in which the conclusions
of the test might relate to reality (Fig. 9.1).
Two of the four possibilities lead to correct conclusions: (a) when the
treatments really do have different effects and that is the conclusion of the
study and (b) when the treatments really have similar effects and the study
makes that conclusion.
There are also two ways of being wrong. The treatments under study
may actually have similar effects, but it is concluded that the study treatment
is better. Error of this kind, resulting in the "false-positive" conclusion
that the treatment is effective, is referred to as an α or Type I error.
Alpha is the probability of saying that there is a difference in treatment
effects when there is not. On the other hand, treatment might be effective,
but the study concludes that it is not. This "false-negative" conclusion is
called a β or Type II error. Beta is the probability of saying that there is
no difference in treatment effects when there is one. "No difference" is a
simplified way of saying that the true difference is unlikely to be larger
than a certain size. It is not possible to establish that there is no difference
at all between two treatments.
Figure 9.1 is similar to the two-by-two table comparing the results of a
diagnostic test to the true diagnosis (Chapter 3). Here the "test" is the
conclusion of a clinical trial, based on a statistical test of results from a
sample of patients. Reality is the true relative merits of the treatments
being compared, if it could be established, for example, by making observations
on all patients with the illness under study or a large number of
samples of these patients. Alpha error is analogous to a false-positive and
β error to a false-negative test result. In the absence of bias, random variation
is responsible for the uncertainty of the statistical conclusion.

                                  TRUE DIFFERENCE
                                  Present               Absent*
CONCLUSION     Significant        Correct               Type I (α) error
OF
STATISTICAL    Not
TEST           significant        Type II (β) error     Correct

Figure 9.1. The relationship between the results of a statistical test and the true
difference between two treatment groups. (*Absent is a simplification. It really means
that the true difference is not greater than a specified amount.)
Because random variation plays a part in all observations, it is an over-
simplification to ask whether or not chance is responsible for the results.
Rather, it is a question of how likely random variation is to account for
the findings under the particular conditions of the study. The probability
of error due to random variation is estimated by means of inferential statistics,
a quantitative science that, based on assumptions about the mathematical
properties of the data, allows calculations of the probability that the
results could have occurred by chance alone.
Statistics is a specialized field with its own jargon (null hypothesis,
variance, regression, power, and modeling) that is unfamiliar to many
clinicians. However, leaving aside the genuine complexity of statistical
methods, inferential statistics should be regarded by the nonexpert as a
useful means to an end. Statistical tests are the means by which the effects
of random variation are estimated.
The next two sections discuss α and β error, respectively. We will attempt
to place hypothesis testing, as it is used to estimate the probabilities
of these errors, in context. However, we will make no attempt to deal with
below very low p values (such as p < 0.001) chance is a very unlikely
explanation for the observed difference, and little further information is
imparted by describing this chance more precisely.
STATISTICAL SIGNIFICANCE AND CLINICAL IMPORTANCE
A statistically significant difference, no matter how small the p, does
not mean that the difference is clinically important. A p < 0.0001, if it
emerges from a well-designed study, conveys a high degree of confidence
that a difference really exists. But this p value tells us nothing about the
magnitude of that difference or its clinical importance. In fact, entirely
trivial differences may be highly statistically significant if a large enough
number of patients was studied.
Example In the early 1990s there was a heated debate about which
thrombolytic agent, streptokinase or tissue plasminogen activator (tPA), is
most effective during acute myocardial infarction. Large trials had shown a
difference in reperfusion rates but not mortality. The two were compared
(along with subcutaneous or intravenous heparin) in a large randomized
controlled trial, called GUSTO, involving 41,021 patients in 15 countries (2).
tPA was given by a more aggressive regimen than in earlier studies. The
death rate at 30 days was lower among patients receiving tPA (6.3%) than
among those receiving streptokinase (7.2 or 7.4%, depending on how heparin
was given) and this difference was highly unlikely to be by chance
(p < 0.001). However, the difference is not large; one would have to treat
about 100 patients with tPA instead of with streptokinase to prevent one
short-term death. Because tPA is much more expensive than streptokinase,
costing nearly $250 thousand to prevent that death (3), and because
tPA is more likely to cause hemorrhagic strokes, some have questioned
whether the marginal benefit of tPA is worthwhile, i.e., whether the difference
in mortality between tPA and streptokinase treatment, all things considered,
is "clinically significant."
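The "about 100 patients" figure in this example is the reciprocal of the absolute difference in death rates. The following is a minimal sketch of that arithmetic, not the calculation reported by the trial investigators; the function name is hypothetical and the rates are the ones quoted in the example.

```python
def number_needed_to_treat(control_rate: float, treated_rate: float) -> float:
    """Return 1 / absolute risk reduction for two event rates given as proportions."""
    absolute_risk_reduction = control_rate - treated_rate
    if absolute_risk_reduction <= 0:
        raise ValueError("treated_rate must be lower than control_rate")
    return 1.0 / absolute_risk_reduction


# 30-day death rates quoted in the example.
TPA_RATE = 0.063
STREPTOKINASE_RATES = (0.072, 0.074)  # depending on how heparin was given

for sk_rate in STREPTOKINASE_RATES:
    nnt = number_needed_to_treat(sk_rate, TPA_RATE)
    print(f"streptokinase {sk_rate:.1%} vs tPA {TPA_RATE:.1%}: treat about {nnt:.0f} patients")
```

Both comparisons give a number in the neighborhood of 100, which is why a very small p value can coexist with a modest clinical effect.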
On the other hand, very unimpressive p values can result from studies
showing strong treatment effects if there are few patients in the study (see
the following section).
STATISTICAL TESTS
Commonly used statistical tests, familiar to many readers, are used to
estimate the probability of an α error. The tests are applied to the data to
give a test statistic, which in turn can be used to come up with a probability
of error (Fig. 9.2). The tests are of the null hypothesis, the proposition that
there is no true difference in outcome between the two treatment groups.
This device is for mathematical reasons, not because "no difference" is the
working scientific hypothesis of the study. One ends up rejecting the null
hypothesis (concluding there is a difference) or failing to reject it (concluding
there is no difference).
Some commonly used statistical tests are listed in Table 9.1. The validity
of each test depends on certain assumptions about the data. If the data at
hand do not satisfy these assumptions, the resulting pα may be misleading.
A discussion of how these statistical tests are derived and calculated and
of the assumptions on which they rest can be found in any biostatistics
textbook.

[Figure 9.2: flow diagram. Data -> Statistical test -> Test statistic -> Compare to standard distribution (using tables, etc.) -> Estimate of probability that observed value could be by chance alone]

Table 9.1
Some Statistical Techniques Commonly Used in Clinical Research
Test               Difference tested
Chi square (χ²)    Between two or more proportions (when there is a large number
                   of observations)
Fisher's exact     Between two proportions (when there is a small number of
                   observations)
Mann-Whitney U     Between two medians
Student t          Between two means
F test             Between two or more means
Example The chi square (χ²) test for nominal data (counts) is more easily
understood than most and so can be used to illustrate how statistical testing
works. Consider the following data from a randomized trial of two ways of
initiating anticoagulation with heparin: a weight-based dosing nomogram
and standard care (4). The outcome was a partial thromboplastin time (PTT)
exceeding the therapeutic threshold within 24 hr of beginning anticoagulation.
In the nomogram group 60 of 62 (97%) did so; in the standard care
group, 37 of 48 (77%).

Observed Rates
                 Yes    No    Total
Nomogram          60     2       62
Standard care     37    11       48
Total             97    13      110

Expected Rates
                 Yes    No    Total
Nomogram          55     7       62
Standard care     42     6       48
Total             97    13      110
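The two tables of counts can be tied together with a short calculation. What follows is a sketch under stated assumptions: it uses the standard Pearson chi square formula (sum of (observed - expected)² / expected) without a continuity correction, derives the expected counts from the row and column totals, and converts the statistic to a p value for 1 degree of freedom. The function names are hypothetical, and the result is not claimed to match any figure printed in the original trial report.

```python
import math


def chi_square_2x2(table):
    """Return (chi square statistic, expected counts) for a 2 x 2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count in each cell if the outcome were unrelated to treatment.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    return statistic, expected


def p_value_one_df(statistic):
    """Upper-tail probability of a chi square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Rows: nomogram, standard care. Columns: PTT above threshold yes, no.
observed = [[60, 2], [37, 11]]
statistic, expected = chi_square_2x2(observed)
p_value = p_value_one_df(statistic)

print("expected counts (rounded):", [[round(e) for e in row] for row in expected])
print(f"chi square = {statistic:.2f}, p = {p_value:.4f}")
```

The rounded expected counts reproduce the second table, and the large gap between observed and expected counts is what drives the statistic up and the p value down.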
Power and pβ are complementary ways of expressing the same concept. Power
is analogous to the sensitivity of a diagnostic test. In fact, one speaks of a study
being powerful if it has a high probability of detecting as different treatments that
really are different.
HOW MANY PATIENTS ARE ENOUGH?
Suppose you are reading about a clinical trial comparing a promising
new therapy to the current form of treatment. You are aware that random
variation can be the source of whatever differences are observed, and you
wonder if the number of patients (sample size) in this study is large enough
to make chance an unlikely explanation for what was found. How many
patients would be necessary to make an adequate comparison of the effects
of the two treatments? The answer depends on four characteristics of the
study: the magnitude of the difference in outcome between treatment
groups, pα, pβ, and the nature of the study's data. These are taken into
account when the researcher plans the study and when the reader decides
whether the study has a reasonable chance of giving a useful answer.
Effect Size
Sample size depends on the magnitude of the difference to be detected.
We are free to look for differences of any magnitude, and of course, we
hope to be able to detect even very small differences. But more patients
are needed to detect small differences, everything else being equal. So it
is best to ask only that there is a sufficient number of patients to detect
the smallest degree of improvement that would be clinically meaningful.
On the other hand, if we are interested in detecting only very large differences
between treated and control groups (i.e., strong treatment effects),
then fewer patients need be studied.
Alpha Error
Sample size is also related to the risk of an α error (concluding that
treatment is effective when it is not). The acceptable size for a risk of this
kind is a value judgment; the risk could be as large as 1 or as small as 0.
If one is prepared to accept the consequences of a large chance of falsely
concluding that the therapy is valuable, one can reach conclusions with
relatively few patients. On the other hand, if one wants to take only a
small risk of being wrong in this way, a larger number of patients will be
required. As we discussed earlier, it is customary to set pα at 0.05 (1 in 20)
or sometimes 0.01 (1 in 100).
Beta Error
The chosen risk of a β error is another determinant of sample size. An
acceptable probability of this error is also a judgment that can be freely
made and changed, to suit individual tastes. Probability of β error is often set
at 0.20, a 20% chance of missing true differences in a particular study.
Conventional β errors are much larger than α errors, reflecting the higher
value usually placed on being sure an effect is really present when we say
it is.
Characteristics of the Data
The statistical power of a study is also determined by the nature of the
data. When the outcome is expressed on a nominal scale and so is described
by counts or proportions of events, its statistical power depends on the
rate of events: the larger the number of events, the greater the statistical
power for a given number of people at risk. As Peto et al. (5) put it,
In clinical trials of time to death (or of the time to some other particular
"event": relapse, metastasis, first thrombosis, stroke, recurrence, or time to
death from a particular cause), the ability of the trial to distinguish between
the merits of two treatments depends on how many patients die (or suffer a
relevant event), rather than on the number of patients entered. A study of
100 patients, 50 of whom die, is about as sensitive as a study with 1000
patients, 50 of whom die.
Table 9.2
Determinants of Sample Size
                         Determined by
N varies according to    1/(Δ, pα, pβ)  and  V  or  1/P
where N = number of patients studied; Δ = size of difference in outcome between groups;
pα = probability of an α (Type I) error, i.e., false-positive result; pβ = probability of a β
(Type II) error, i.e., false-negative result; V = variability of observations (for interval data);
and P = proportion of patients experiencing outcome of interest (for nominal data).
one is willing to accept one kind of error, the less it will be necessary to
risk the other. Neither kind of error is inherently worse than the other. The
consequences of accepting erroneous information depend on the clinical
situation. When a better treatment is badly needed (e.g., when the disease
is very dangerous and no satisfactory alternative treatment is available)
and the proposed treatment is not dangerous, it would be reasonable to
accept a relatively high risk of concluding a new treatment is effective
when it really is not (large α error) to minimize the possibility of missing
a valuable treatment (small β error). On the other hand, if the disease is
less serious, alternative treatments are available, or the new treatment is
expensive or dangerous, one might want to minimize the risk of accepting
the new treatment when it is not really effective (low α error), even at the
expense of a relatively large chance of missing an effective treatment (large
β error). It is of course possible to reduce both α and β errors if the number
of patients is increased, outcome events are more frequent, variability is
decreased, or a larger treatment effect is sought.
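The trade-off between the two kinds of error can be made concrete with a rough power calculation. The sketch below assumes a two-sided comparison of two proportions with a normal approximation; the event rates (30% vs 20%) and the group size (100 per group) are hypothetical values chosen only for illustration, and the function name is not from the text.

```python
from statistics import NormalDist


def approximate_power(rate_a, rate_b, n_per_group, alpha):
    """Normal-approximation power of a two-sided test comparing two proportions."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    standard_error = (
        rate_a * (1.0 - rate_a) / n_per_group
        + rate_b * (1.0 - rate_b) / n_per_group
    ) ** 0.5
    # Probability that the observed difference clears the significance threshold
    # in the expected direction (the tiny opposite-tail contribution is ignored).
    return standard_normal.cdf(abs(rate_a - rate_b) / standard_error - z_alpha)


# Hypothetical trial: 30% vs 20% event rates, 100 patients per group.
strict = approximate_power(0.30, 0.20, 100, alpha=0.05)
lenient = approximate_power(0.30, 0.20, 100, alpha=0.20)
larger = approximate_power(0.30, 0.20, 400, alpha=0.05)

print(f"alpha 0.05, n 100: power {strict:.2f}, beta {1 - strict:.2f}")
print(f"alpha 0.20, n 100: power {lenient:.2f}, beta {1 - lenient:.2f}")
print(f"alpha 0.05, n 400: power {larger:.2f}, beta {1 - larger:.2f}")
```

With the number of patients fixed, accepting a larger α lowers β; enlarging the groups lowers β without giving up the stricter α.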
For conventional levels of pα and pβ, the relationship between the size
of the treatment effect and the number of patients needed for a trial is
illustrated by the following examples, one representing a situation in which
a relatively small number of patients was sufficient and the other in which
a very large number of patients was required.
Example Small sample size: Case series suggest that the nonsteroidal
antiinflammatory drug sulindac is active against colonic polyps. This possibility
was tested in a randomized trial (6). A total of 22 patients with familial
adenomatous polyposis were randomized; 11 received sulindac and 11 placebo.
After 9 months, patients receiving sulindac had an average of 44%
fewer polyps than those receiving placebo. This difference was statistically
significant (p = 0.014). Because of the large effect size and the large number
of polyps per patient (some had more than 100), few patients were needed
to establish that the effect was beyond chance. (In this analysis it was necessary
to assume that treatment affected polyps independently of which patient
they occurred in, an unlikely but probably not damaging assumption.)
Example Large sample size: The GUSTO trial, described above, was designed
to include 41,000 patients to have a 90% chance of detecting a 15%
reduction in mortality or a 1% decrease in mortality rate, whichever was
larger, between the experimental and control treatments with a pα of 0.05,
assuming the mortality rate in the control patients was at least 8% (2). The
sample size had to be so large because a relatively small proportion of patients
experienced the outcome event (death), the effect size was small (15%),
and the investigators wanted a relatively high chance of detecting the effect
if it were present (90%).
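To see why the design figures in this example lead to such a large trial, the sketch below applies a common normal-approximation formula for comparing two proportions. This is an assumption for illustration, not necessarily the method the GUSTO investigators used; the function name is hypothetical, and the inputs are the design values quoted in the example (8% control mortality, a 15% proportional reduction, pα of 0.05, 90% power).

```python
import math
from statistics import NormalDist


def patients_per_group(control_rate, treated_rate, alpha=0.05, power=0.90):
    """Approximate number of patients needed in each of two equal groups."""
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha / 2.0)  # two-sided significance threshold
    z_beta = z(power)               # power = 1 - beta
    variance_sum = (
        control_rate * (1.0 - control_rate)
        + treated_rate * (1.0 - treated_rate)
    )
    difference = control_rate - treated_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / difference ** 2)


control = 0.08
treated = control * (1.0 - 0.15)  # 15% proportional reduction in mortality

n_large_trial = patients_per_group(control, treated)
n_big_effect = patients_per_group(control, control * 0.5)  # hypothetical 50% reduction

print("per group, 15% reduction:", n_large_trial)
print("per group, 50% reduction:", n_big_effect)
```

Under these assumptions each group needs on the order of 10,000 patients, which is of the same magnitude as the planned total once several treatment strategies are compared (2). Halving the event rate instead would need only a few hundred patients per group.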
For most of the therapeutic questions encountered today, a surprisingly
large number of patients is required. The value of dramatic, powerful
treatments, such as insulin for diabetic ketoacidosis or surgery for append i-
[Figure 9.3: graph. Horizontal axis: Proportional Reduction in Event Rate (% of P), 20 to 100. Vertical axis: number of people in each group (ticks at 500 and 1500). Curve label: Outcome event.]
Figure 9.3. The number of people required in each of two treatment groups (of
equal size) to have an 80% chance of detecting a difference (p = 0.05) in a given
outcome rate (P) between treated and untreated patients, for various rates of outcome
events in the untreated group. (Calculated from formula in Weiss NS. Clinical
epidemiology. The study of the outcome of illness. New York: Oxford University
Press, 1986.)
[Figure 9.4: graph. Vertical axis ticks at 10, 5, 0.5, and 0.1, with the upper part labeled Risk and the lower part labeled Protection. Plotted items: endometrial cancer (> 8 years), breast cancer, hip fracture.]
Figure 9.4. Point estimates (o) and confidence intervals (I): the risks and benefits
of exogenous estrogens for postmenopausal women. (Data from Grady D, Rubin
SM, Petitti DB, Fox CS, Black D, Ettinger B, Ernster VL, Cummings SR. Hormonal
therapy to prevent disease and prolong life in postmenopausal women. Ann Intern
Med 1992;117:1016-1037; Colditz GA, Stampfer MJ, Willett WC, Hennekens CH,
Rosner B, Speizer FE. Prospective study of estrogen replacement therapy and risk
of breast cancer in postmenopausal women. JAMA 1990;264:2648-2653; and
Paganini-Hill A, Ross RK, Gerkins VR, Henderson BE, Arthur M, Mack TM. Menopausal
estrogen therapy and hip fractures. Ann Intern Med 1981;95:28-31.)
the higher risks are narrower than they really are.] The estimate of risk for
endometrial cancer (after 8 or more years of estrogens) is 8.22, but the true
value is not precisely estimated and could easily be as high as 10.61 or as
low as 6.25. In any case, it is unlikely to be as low as 1.0 (no risk). In contrast,
this one study suggests that estrogens are unlikely to be a risk factor for
breast cancer; the best estimate of relative risk is nearly 1.0, although the
data are consistent with either a small harmful or a small protective effect.
Finally, estrogens are likely to protect against hip fracture. That the upper
boundary of the confidence interval falls below 1.0 is another way of indicating
that the protective effect is statistically significant at the 0.05 level.
example, individual studies have shown that 34% of U.S. adults have used
unconventional therapy (95% confidence interval 31-37%) (8), that intensive
treatment of insulin-dependent diabetes lowers the risk of development
of retinopathy by 76% (95% confidence interval 62-85%) relative to
conventional therapy (9), and that the sensitivity of clinical examination
for splenomegaly is 27% (95% confidence interval 19-36%) (10).
Confidence intervals have become the usual way of reporting the main
results of clinical research because of their many advantages over the hy-
pothesis testing (p value) approach. The P values are still used because of
tradition and as a convenience when many results are reported and it
would not be feasible to include confidence intervals for all.
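As an illustration of the estimation approach, the sketch below computes a point estimate and a 95% confidence interval for the difference between two proportions, using the heparin nomogram counts given earlier in this chapter (60 of 62 vs 37 of 48). It assumes the simple normal-approximation (Wald) interval, which is crude when a proportion is close to 0 or 1; the function name is hypothetical and the result is not taken from the original report.

```python
from statistics import NormalDist


def difference_with_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Return (difference, lower, upper) for the difference between two proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    standard_error = (p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b) ** 0.5
    difference = p_a - p_b
    return difference, difference - z * standard_error, difference + z * standard_error


# Nomogram group vs standard care group: PTT above threshold within 24 hr.
difference, lower, upper = difference_with_ci(60, 62, 37, 48)
print(f"difference {difference:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

The interval conveys both the size of the effect and its precision, and because it excludes zero it carries the same message as a statistically significant test at the 0.05 level.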
Statistical Power before and after a Study Is Done
Calculation of statistical power based on the hypothesis testing approach
is done by the researchers before a study is undertaken to ensure that
enough patients will be entered to have a good chance of detecting a
clinically meaningful effect if it is present. However, after the study is
completed this approach is no longer as relevant (11). There is no need to
estimate effect size, outcome event rates, and variability among patients;
they are now known.
Therefore, for researchers who report the results of clinical research
and readers who try to understand their meaning, the confidence interval
approach is more relevant. One's attention should shift from statistical
power for a somewhat arbitrarily chosen effect size, which may be relevant
in the planning stage, to the actual effect size observed in the study and
the statistical precision of that estimate of the true value.
Detecting Rare Events
It is sometimes important to detect a relatively uncommon event (e.g.,
1/1000), particularly if that event is severe, such as aplastic anemia or life-
threatening arrhythmia following a drug. In such circumstances, a great
many people must be observed in order to have a good chance of detecting
even one such event, much less to develop a relatively stable estimate of
its frequency.
Figure 9.5 shows the probability of detecting an event as a function of
the number of people under observation. A rule of thumb is as follows:
To have a good chance of detecting a 1/x event one must observe 3x people
(12). For example, to detect a 1/1000 event, one would need to observe
3000 people.
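The rule of thumb can be checked with a one-line probability calculation. This sketch assumes that events occur independently and with the same probability in every person observed; the function name is hypothetical.

```python
def probability_of_at_least_one(event_rate, people_observed):
    """Chance of seeing one or more events among independently observed people."""
    return 1.0 - (1.0 - event_rate) ** people_observed


rate = 1 / 1000
for people in (1000, 3000):
    chance = probability_of_at_least_one(rate, people)
    print(f"{people} people observed: chance of at least one event = {chance:.2f}")
```

Observing x people gives only about a 63% chance of seeing a 1/x event, whereas observing 3x people raises the chance to about 95%, which is the sense of "a good chance" in the rule.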
Multiple Comparisons
The statistical conclusions of research have an aura of authority that
defies challenge, particularly by nonexperts. But as many skeptics have
suspected, it is possible to "lie with statistics," even if unintentionally.
What is more, this is possible even if the research is well designed, the
mathematics flawless, and the investigators' intentions beyond reproach.

[Figure 9.5: graph. Horizontal axis: Size of Treatment Group, 1,000 to 1,000,000. Vertical axis: probability, 0.2 to 1.0. Curves for risks of 1/100, 1/1,000, 1/10,000, and 1/100,000.]
Figure 9.5. The probability of detecting one event according to the rate of the
event and the number of people observed. (From Guess HA, Rudnick SA. Use of
cost effectiveness analysis in planning cancer chemoprophylaxis trials. Control Clin
Trials 1983;4:89-100.)

Statistical conclusions can be misleading because the strength of statisti-
cal tests depends on the number of research questions considered in the
study and when those questions were asked. If many comparisons are
made among the variables in a large set of data, the p value associated
with each individual comparison is an underestimate of how often the
result of that comparison, among the others, is likely to arise by chance.
As implausible as it might seem, the interpretation of the p value from a
single statistical test depends on the context in which it is done.
To understand how this might happen, consider the following example.
Suppose a large study has been done in which there are multiple subgroups
of patients and many different outcomes. For instance, it might be a clinical
trial of the value of a treatment for coronary artery disease for which
patients are in several clinically meaningful groups (e.g., 1-, 2-, and 3-vessel
disease; good and bad ventricular function; the presence or absence
of arrhythmias; and various combinations of these) and several outcomes
are considered (e.g., death, myocardial infarction, and angina). Suppose
also that there are no true associations between treatment and outcome in
any of the subgroups and for any of the outcomes. Finally, suppose that
the effects of treatment are assessed separately for each subgroup and for
each outcome, a process that involves a great many comparisons. As
pointed out earlier in this chapter, 1 in 20 of these comparisons is likely
to be statistically significant at the 0.05 level. In the general case, if 20
comparisons are made, on the average, 1 would be found to be statistically
significant; if 100 comparisons are made, about 5 would be likely to emerge
as significant, and so on. Thus, when a great many comparisons have been
made, a few will be found that are unusual enough, because of random
variation, to exceed the level of statistical significance even if no true associations
between variables exist in nature. The more comparisons that are
made, the more likely that one of them will be found statistically
significant.
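The arithmetic behind this paragraph can be sketched in a few lines. The calculation assumes that the comparisons are independent of one another and that no true associations exist, which is a simplification (subgroups and outcomes in a real trial are usually correlated); the function names are hypothetical.

```python
def expected_false_positives(comparisons, alpha=0.05):
    """Average number of comparisons significant by chance alone."""
    return comparisons * alpha


def chance_of_any_false_positive(comparisons, alpha=0.05):
    """Probability that at least one independent comparison is significant by chance."""
    return 1.0 - (1.0 - alpha) ** comparisons


for k in (1, 20, 100):
    print(
        f"{k} comparisons: expect {expected_false_positives(k):.2f} significant by chance; "
        f"chance of at least one = {chance_of_any_false_positive(k):.2f}"
    )
```

With 20 independent comparisons, one spurious "significant" result is expected on average, and the chance of finding at least one is roughly two in three.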
This phenomenon is referred to as the multiple comparisons problem.
Because of this problem, the strength of evidence from clinical research
depends on how focused its questions were at the outset.
Unfortunately, when the results of research are presented, it is not always
possible to know how many comparisons really were made. Often,
interesting findings are selected from a larger number of uninteresting
ones. This process of deciding what is and is not important about a mass
of data can introduce considerable distortion of reality.
How can the statistical effects of multiple comparisons be taken into
account when interpreting research? Although ways of adjusting pα have
been proposed, probably the best advice is to be aware of the problem
and to be cautious about accepting positive conclusions of studies where
multiple comparisons were made. As one statistician (13) put it:
If you dredge the data sufficiently deeply and sufficiently often, you will
find something odd. Many of these bizarre findings will be due to chance. I
do not imply that data dredging is not an occupation for honorable persons,
but rather that discoveries that were not initially postulated as among the
major objectives of the trial should be treated with extreme caution. Statistical
theory may in due course show us how to allow for such incidental findings.
At present, I think the best attitude to adopt is caution, coupled with an
attempt to confirm or refute the findings by further studies.
An approach to assessing the validity of statistically significant findings
in subgroups was presented in Chapter 7.
Describing Associations
Statistics are also used to describe the degree of association between
variables, e.g., the relationship between body mass and blood pressure.
Familiar expressions of association are Pearson's product moment correla-
tion (r) for interval data and Spearman's rank correlation for ordinal data.
Each of these statistics expresses in quantitative terms the extent to which
the value of one variable is associated with the value of another. Each has
Multivariable Methods
Most clinical outcomes are the result of many variables acting together
in complex ways. For example, coronary heart disease is the joint result
of lipid abnormalities, hypertension, cigarette smoking, family history, diabetes,
exercise, and perhaps personality. It is appropriate first to try to
understand these relationships by examining relatively simple arrangements
of the data, such as 2-by-2 tables (for one variable at a time) or
contingency tables (stratified analyses, examining whether the effect of
one variable is changed by the presence or absence of one or more other
variables), because it is easy to understand the data when they are displayed
in this way. However, it is usually not possible to account for more
than a few variables using this method, because there are not enough
patients with each combination of characteristics to allow stable estimates
of rates. For example, if 120 patients were studied, 60 in each treatment
group, and just two additional dichotomous variables were taken into
account, there would only be at most about 15 patients in each subgroup;
if patients were unevenly divided, there would be fewer in some.
What is needed then, in addition to contingency tables, is a way of
examining the effects of several variables at a time. This is accomplished
by multivariable modeling, developing a mathematical expression of the effects
of many variables taken together. It is "multivariable" because it
examines the effects of multiple variables simultaneously. It is "modeling"
because it is a mathematical construct, calculated from the data but also
based on simplifying assumptions about characteristics of the data (e.g.,
that the variables are all normally distributed and have the same variance).
Mathematical models can be used in studies of cause, when one wants
to define the independent effect of one variable by adjusting for the effects
of several other, extraneous variables. They can also be used to give more
precise predictions than individual variables allow by including several
variables together in a predictive model.
The basic structure of a multivariable model is
Outcome variable = constant + (β1 × variable1) + (β2 × variable2) + ...
where β1, β2, ... are coefficients that are determined by the data; and
variable1, variable2, ... are the predictor variables that might be related
to outcome. The best estimates of the coefficients are determined mathematically,
depending on the powerful calculating ability of modern
computers.
Summary
Clinical information is based on observations made on samples of pa-
tients. Even samples that are selected without bias may misrepresent events
in a larger population of such patients because of random variation in its
members.
Two general approaches to assessing the role of chance in clinical observations
are hypothesis testing and estimation. With the hypothesis testing
approach, statistical tests are used to estimate the probability that the observed
result was by chance. When two treatments are compared, there
are two ways in which the conclusions of the trial can be wrong: The
treatments may be no different, and it is concluded one is better; or one
treatment may be better, and it is concluded there is no difference. The
probabilities that these errors will occur in a given situation are called pα
and pβ, respectively.
The power of a statistical test (1 - pβ) is the probability of finding a
statistically significant difference when a difference of a given size really
exists. Statistical power is related to the number of patients in the trial,
size of the treatment effect, pα, and the rate of outcome events or variability
of responses among patients. Everything else being equal, power can be
increased by increasing the number of patients in a trial, but that is not
always feasible.
Estimation involves using the data to define the range of values that is
likely to include the true effect size. Point estimates (the observed effects)
and confidence intervals are used. This approach has many advantages
over hypothesis testing: It emphasizes effect size, not p value; indicates the
range of plausible values for the effect, which the user can relate to clini-
cally meaningful effects; and provides information about power.
Individual studies run an increased risk of reporting a false-positive
result if many subsets of the data are compared; they are at increased risk
REFERENCES
1. Johnson AF. Beneath the technological fix. Outliers and probability statements. J Chron
Dis 1985;38:957-961.
2. The GUSTO Investigators. An international randomized trial comparing four thrombolytic
strategies for acute myocardial infarction. New Engl J Med 1993;329:673-682.
3. Farkouh ME, Lang JD, Sackett DL. Thrombolytic agents: the science of the art of choosing
the better treatment. Ann Intern Med 1994;120:886-888.
4. Raschke RA, Reilly BM, Guidry JR, Fontana JR, Srinivas S. The weight-based heparin
dosing nomogram compared with a "standard care" nomogram. A randomized controlled
trial. Ann Intern Med 1993;119:874-881.
5. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson
K, Peto J, Smith PG. Design and analysis of randomized clinical trials requiring prolonged
observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585-612.
6. Giardiello FM, Hamilton SR, Krush AJ, Piantadosi S, Hylind LM, Celano P, Booker SV,
Robinson CR, Offerhaus GJA. Treatment of colonic and rectal adenomas with sulindac in
familial adenomatous polyposis. New Engl J Med 1993;328:1313-1316.
7. Braitman LE. Confidence intervals assess both clinical significance and statistical significance.
Ann Intern Med 1991;114:515-517.
8. Eisenberg DM, Kessler RC, Foster C, Norlock FE, Calkins DR, Delbanco TL. Unconventional
medicine in the United States. Prevalence, costs and patterns of use. New Engl J
Med 1993;328:246-252.
9. Diabetes Control and Complications Trial Research Group. The effect of intensive treatment
of diabetes on the development and progression of long-term complications in
insulin-dependent diabetes mellitus. New Engl J Med 1993;329:977-986.
10. Grover SA, Barkun AN, Sackett DL. Does the patient have splenomegaly? JAMA
1993;270:2218-2221.
11. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments
and the misuse of power when interpreting results. Ann Intern Med 1994;121:200-206.
12. Sackett DL, Haynes RB, Gent M, Taylor DW. Compliance. In: Inman WHW, ed. Monitoring
for drug safety. Lancaster, UK: MTP Press, 1980.
13. Armitage P. Importance of prognostic factors in the analysis of data from clinical trials.
Control Clin Trials 1981;1:347-353.
14. Concato J, Feinstein AR, Holford TR. The risk of determining risk with multivariable
models. Ann Intern Med 1993;118:201-210.
15. Diamond GA. Future imperfect: the limitations of clinical prediction models and the limits
of clinical prediction. J Am Coll Cardiol 1989;14:12A-22A.
10
STUDYING
CASES
Each case has its lesson-a lesson which may be but is not always
learned.
-Sir William Osler
Most medical knowledge has emanated from the intensive study of sick
patients. The exhausted but engrossed physician at the bedside of the
febrile child, chin in hand, is a favorite medical image. The presentation
and discussion of a "case" is the foundation of modern medical education.
Most clinicopathologic conferences and grand rounds begin with the pre-
sentation of an interesting case and then use the case to illustrate general
principles and relationships. So, too, much of the medical literature is
devoted to studying cases, whether narrative descriptions of a handful of
cases (case reports), quantitative analyses of larger groups of patients (case
series), or comparisons of groups of cases with noncases (case-control
studies).
Case Reports
Case reports are detailed presentations of a single case or a handful of
cases. They represent an important way in which new or unfamiliar diseases,
or manifestations or associations of disease, are brought to the attention
of the medical community. Approximately 20-30% of the original
articles published in major general medical journals are studies of 10 or
fewer patients.
USES OF CASE REPORTS
Case reports serve several different purposes. First, they are virtually
our only means of describing rare clinical events. Therefore, they are a rich
source of ideas (hypotheses) about disease presentation, risk, prognosis,
and treatment. Case reports rarely can be used to test these hypotheses,
but they do place issues before the medical community and often trigger
more decisive studies. Some conditions that were first recognized through
case reports include birth defects from thalidomide, fetal alcohol syndrome,
toxic shock syndrome, Lyme disease, and hantavirus infection.
Case reports also serve to elucidate the mechanisms of disease and
treatment by reporting highly detailed and methodologically sophisticated
clinical and laboratory studies of a patient or small group of patients. In
this instance, the complexity, cost, and often experimental nature of the
investigations limit their application to small numbers of patients. Such
studies have contributed a great deal to our understanding of the genetic,
metabolic, and physiologic basis of human diseases. These studies repre-
sent the bridge between laboratory research and clinical research and have
a well-established place in the annals of medical progress.
The following is an example of how a report of a single case can reveal
a great deal about the mechanism of a disease.
Because of this unusual case, it was clear that halothane can cause hepatitis.
But the case report provided no information as to whether this reaction
was rare or common. Subsequent studies showed that it was not a rare
reaction, which contributed to the replacement of halothane with less hepa-
totoxic agents.
Another use of the case report is to describe unusual manifestations of
disease. Sometimes this can become the medical version of Ripley's Believe
It or Not, an informal compendium of medical oddities, with the interest
lying in the sheer unbelievability of the case. The larger the lesion and the
more outrageous the foreign body, the more likely a case report is to find
its way into the literature. Oddities that are simply bizarre aberrations
from the usual course of events may titillate, but usually are less clinically
important than other types of studies.
Some so-called oddities, however, are the result of a fresher, more
insightful look at a problem and prove to be the first evidence of a subse-
quently useful finding. The problem for the reader is how to distinguish
between the freak and the fresh insight. There are no rules. When all else
fails, one can only rely on common sense and a well-developed sense of
skepticism.
BIASED REPORTING
Because case reports involve a small and highly selected group of pa-
tients, they are particularly susceptible to bias. For example, case reports
of successful therapy may be misleading because journals are unlikely to
receive or publish case reports of unsuccessful therapy. Perhaps the wisest
stance to take when reviewing a case report is to use it as a signal to look
for further evidence of the described phenomenon in the literature or
among your patients.
Example A case report (2) described a 23-year-old woman who developed
severe abdominal pain while on treatment with enalapril for essential
hypertension. An elevated serum lipase led to a diagnosis of pancreatitis.
Symptoms resolved, and the lipase returned to normal shortly after discontin-
uing the drug. The investigators found only one published case and began
an exhaustive search of the published and unpublished literature. The search
revealed an additional flO cases, the majority of which were unpublished
cases reported to the drug manufacturer. The additional cases lent strength
to the possibility of a causal association between enalapril treatment and
pancreatitis.
With very few exceptions, case reports on their own should not serve
as the basis for altering clinical practice because of their inability to estimate
the frequency of the described occurrence or the role of bias or chance.
THE JOINT OCCURRENCE OF RARE EVENTS
Table 10.1
The Joint Occurrence of Two Rare Conditions: An Estimate of the Frequency and
Number of Cases of Exposure to a Nonsteroidal Antiinflammatory Drug and End-
Stage Renal Failure Occurring Together if the Two Were Not Biologically Related (a)

Frequency separately
  Prevalence of use of the drug (hypothetical)    1/100 persons
  Incidence of end-stage renal disease            40/1,000,000/year
Frequency of joint occurrence                     1/100 x 40/1,000,000/year
                                                  = 4/10,000,000/year

(a) Data from Hiatt RA, Friedman GD. Characteristics of patients referred for treatment of end-stage renal disease
in a defined population. Am J Public Health 1982;72:829-833.
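The arithmetic in Table 10.1 can be checked with a short sketch. The two frequencies come from the table; the population size used to turn the joint frequency into a number of cases is an illustrative assumption, not a figure from the text, and the function name is hypothetical.

```python
from fractions import Fraction

# Frequencies taken from Table 10.1.
drug_use_prevalence = Fraction(1, 100)             # hypothetical prevalence of drug use
esrd_incidence_per_year = Fraction(40, 1_000_000)  # end-stage renal disease, per year


def joint_frequency(p_exposure, p_disease):
    """Frequency of both occurring together if the two are biologically unrelated."""
    return p_exposure * p_disease


joint = joint_frequency(drug_use_prevalence, esrd_incidence_per_year)

# ASSUMPTION: a population of 250 million, chosen only for illustration.
population = 250_000_000
expected_cases_per_year = joint * population

print(f"Joint frequency per year: {joint} (= 4/10,000,000)")
print(f"Expected coincidental cases per year: {expected_cases_per_year}")
```

Under these assumptions, a few dozen to a hundred purely coincidental cases a year would be expected, which is why a handful of case reports cannot by itself establish an association.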
Case Series
A case series is a study of a larger group of patients (e.g., 10 or more) with
a particular disease. The larger number of cases allows the investigator to
assess the play of chance, and p values and other statistics often appear in
case series, unlike in case reports. A case series is a particularly common
way of delineating the clinical picture of a disease and serves this purpose
well-but with important limitations.
Case series suffer from the absence of a comparison group. Occasionally,
this is not a major problem.
Example Between June 1981 and February 1983, a few years after AIDS
was first recognized and while its manifestations were being defined, researchers
from the Centers for Disease Control gathered information on 1000
patients living in the United States who met a surveillance definition for
the disease. They described demographic and behavioral characteristics of
patients and complications of the disease.
Pneumocystis carinii pneumonia (PCP) was found in 50%, Kaposi's sarcoma
in 28%, and both in 8% of patients; 14% had opportunistic infections other
than PCP. All but 6% of the patients could be classified into one or more of the
following groups: homosexual or bisexual men, intravenous drug abusers,
Haitian natives, and patients with hemophilia (5).
Another limitation of case series is that they generally describe the clini-
cal manifestations of disease and its treatments in a group of patients
assembled at one point in time, a survival cohort (see Chapter 6). They
must be distinguished, therefore, from cohort studies or trials of treatment
for which an inception cohort of patients with a disease is followed over
time with the purpose of looking for the outcomes of the disease. Case
series often look backward in time, and that restricts their value as a means
of studying prognosis or cause-and-effect relationships.
Case-Control Studies
To find out whether a finding or possible cause really is more common
in patients with a given disease, one needs a study with several features.
First and foremost, in addition to a series of cases there must be a comparison
group that does not have the disease. Second, there must be enough
people in the study so that chance is less likely to play a large part in the
observed results. Third, the groups must be similar enough, even though
one is nondiseased, to produce a credible comparison. Finally, if one wants
to show that a risk factor is independent of others (and, therefore, a
possible cause), it is necessary to control for all other important differences
in the analysis of the findings.
Case reports and case series cannot take us this far. Neither can cohort
studies in many situations, because it is not feasible to accrue enough cases
to rule out the play of chance. Case-control studies, studies that compare
the frequency of a purported risk factor (generally called the "exposure")
in a group of cases and a group of controls, have these features.
[Figure 10.1. Design of a case-control study. Columns: Exposure to Risk Factor (Yes/No); Disease (Yes/No).]
DESIGN
The basic design of a case-control study is diagrammed in Figure 10.1.
Patients who have the disease and a group of otherwise similar people who
do not have the disease are selected. The researchers then look backward in
time to determine the frequency of exposure in the two groups. These data
can be used to estimate the relative risk of disease related to the characteris-
tic of interest.
Example Does the use of nonsteroidal antiinflammatory drugs (NSAIDs)
increase the risk of renal disease? Researchers have addressed this question
using a case-control study (7). How did they go about it?
First, they had to define renal disease and find a sizable group of cases
available to be interviewed. For obvious reasons, they looked in tertiary care
hospitals, where many such cases are gathered. The cases, of course, included
only patients in whom the diagnosis had been made in the course of usual
medical care. For example, asymptomatic patients with mild renal failure
were much less likely to be included among the cases.
Once the cases were assembled and the diagnosis confirmed, a comparison,
or control, group was selected. Before deciding which people to choose
as controls, the investigators considered the purpose of the study. They
wanted to ascertain whether patients with renal failure were more likely to
have received NSAID therapy in the past than a similar group of people with
no evidence of renal disease.
The investigators found that the estimated relative risk of NSAID exposure
for renal failure was 2.1, using data on the rates of exposure in cases and
controls, and that the excess risk was largely confined to older men.
[Figure: comparison of a case-control study and a cohort study of exposure to NSAIDs and renal failure.]
Table 10.2
Summary of Characteristics of Cohort, Case-Control, and Prevalence Designs
Cohort          Case-Control          Prevalence
Begins with a defined population at risk
                 Cases    Noncases
Exposed            A         B
Not exposed        C         D
                  A+C       B+D
Figure 10.3 Calculation of relative risk for a cohort study and odds ratio (estimated
relative risk) for a case-control study.
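The two calculations named in the caption can be sketched as follows. The sketch assumes the usual 2 x 2 layout (A = exposed cases, B = exposed noncases, C = unexposed cases, D = unexposed noncases); the counts are invented for illustration, and the confidence interval uses Woolf's log-based approximation, which is an addition here rather than something described in the text.

```python
import math


def relative_risk(a, b, c, d):
    """Cohort study: incidence in the exposed divided by incidence in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Case-control study: cross-product ratio AD/BC."""
    return (a * d) / (b * c)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval (Woolf's method, on the log scale)."""
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio(a, b, c, d))
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical cohort in which the disease is rare (counts are illustrative only).
a, b, c, d = 20, 9980, 10, 9990
rr = relative_risk(a, b, c, d)
odds = odds_ratio(a, b, c, d)
low, high = odds_ratio_ci(a, b, c, d)

print(f"Relative risk: {rr:.3f}")
print(f"Odds ratio:    {odds:.3f} (95% CI {low:.2f} to {high:.2f})")
```

With a rare disease the two measures are nearly equal, which is why the odds ratio from a case-control study can stand in for the relative risk.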
tals and other treatment facilities, find similar groups without the disease,
and compare frequencies of past NSAID use. In this way, several hundred
study subjects can be interviewed in a matter of weeks or months, and an
answer can be obtained at a fraction of the cost of a cohort study.
A real advantage of the case-control study in exploring the effect of
some causal or prognostic factors is that one need not wait for a long time
for the answer. Many diseases have a long latency-the period of time
between exposure to a risk factor and the expression of its pathologic
effects. For example, it has been estimated that at least 10-20 years must
pass before the carcinogenicity of various chemicals becomes manifest. It
would require an extremely patient investigator and scientific community
to wait so many years to see if a suspected risk to health can be confirmed.
Because of their ability to address important questions rapidly and effi-
ciently, case-control studies play an increasingly prominent role in the
medical literature. If one wants to study cause and effect using a relatively
strong method, the case-control approach is the only practical way to study
some diseases. Case-control studies comprise a growing percentage of all
original articles and the majority of epidemiologic articles. Their quickness
and cheapness justify their popularity as long as their results are valid;
and here is the problem, because case-control studies are particularly prone
to biased results. These biases are discussed in the next section.
ing, will bias the odds ratio toward 1 and diminish the ability of a study
to detect a significantly increased or decreased odds ratio.
A third strategy is to choose more than one control group. Because of
the difficulties attending the selection of truly comparable control groups,
a systematic error in the odds ratio may arise for any one of them. One
way to guard against this possibility is to choose more than one control
group from different sources.¹ One approach used when cases are drawn
from a hospital is to choose one control group from other patients in the
same hospital and a second control group from the neighborhoods in which
the cases live. If similar odds ratios are obtained using different control
groups, this is evidence against bias, because it is unlikely that bias would
affect otherwise dissimilar groups in the same direction and to the same
extent. If the estimates of relative risk are different, that is a signal that
one or both are biased, and an opportunity exists to investigate where the
bias lies.
Example In a case-control study of estrogen and endometrial cancer,
cases were identified from a single teaching hospital. Two control groups
were selected: one from among gynecologic admissions to the same hospital
and the second from a random sample of women living in the area served
by the hospital.
The presence of other diseases, such as hypertension, diabetes, and gallbladder
disease, was much more common among the cases and the hospital
control group, presumably reflecting the various forces that lead to hospital-
ization. Despite these differences, the two control groups reported much less
long-term estrogen use than did the cases and yielded very similar odds
ratios (4.1 and 3.6).
The authors (12) concluded that "this consistency of results with two very
different comparison groups suggests that neither is significantly biased and
that the results ... are reasonably accurate."
Options for selecting cases and controls are summarized in Figure 10.4.
If cases are all occurring in a defined population (or a representative sample
of all cases), then controls should be too. This is the optimal situation. If
cases are a biased sample of all cases, as they are in most hospital samples,
then controls should be selected with similar biases.
MEASURING EXPOSURE
Even if selection bias can be avoided in choosing cases and controls,
the investigator faces problems associated with validly measuring exposure
after the disease or outcome has occurred, i.e., avoiding measurement
bias. Case-control studies are prone to three forms of measurement bias
1 Choosing two or more control groups per case group is different from choosing two or more controls
per case. The latter is done to increase statistical power (or precision of the estimate of relative risk). In
general, using more than one control subject per case results in small but useful gains in power, but there
is little useful advantage to adding more controls per case beyond three or four.
Figure 10.4 Two strategies for selecting cases and controls from the general population:
unbiased samples and samples with matching biases.
because the exposure is measured after the onset of the disease or outcome
under study:
1. The presence of the outcome directly affects the exposure;
2. The presence of the outcome affects the subject's recollection of the
exposure; and
3. The presence of the outcome affects the measurement or recording of
the exposure.
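The second form of bias in the list above, differential recall, can be illustrated with a small numerical sketch. Everything in it is hypothetical: the function name, the exposure proportions, and the recall probabilities are assumptions chosen to show the direction of the distortion, not estimates from any study.

```python
def observed_odds_ratio(p_cases, p_controls, recall_cases, recall_controls):
    """Odds ratio computed from reported (not true) exposure.

    p_cases, p_controls : true proportion exposed in each group
    recall_*            : probability that a truly exposed person reports it
    ASSUMPTION: unexposed people never report an exposure.
    """
    reported_cases = p_cases * recall_cases
    reported_controls = p_controls * recall_controls
    odds_cases = reported_cases / (1 - reported_cases)
    odds_controls = reported_controls / (1 - reported_controls)
    return odds_cases / odds_controls


# No true association: 30% of both cases and controls were really exposed.
equal_recall = observed_odds_ratio(0.30, 0.30, 0.9, 0.9)
# Cases remember their exposure better than controls do.
better_recall_in_cases = observed_odds_ratio(0.30, 0.30, 0.9, 0.6)

print(f"Equal recall in both groups: {equal_recall:.2f}")
print(f"Better recall among cases:   {better_recall_in_cases:.2f}")
```

When recall is equally imperfect in both groups, the odds ratio here is undistorted; when cases recall their exposure more completely, an association appears where none exists.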
The first bias is particularly problematic if the exposure under study is
a medical treatment, since the early manifestations of the illness may lead
to treatment. This is sometimes referred to as confounding by indication.
For example, a case-control design was used to determine whether beta-blocker
drugs prevented first myocardial infarctions in patients being
treated for hypertension (13). Because angina is a major indication for use
of beta-blockers, the investigators carefully excluded any subjects with a
history suggesting angina or other manifestation of coronary heart disease.
They found that hypertensive patients treated with beta-blockers still had
a significantly reduced risk of nonfatal myocardial infarctions, even after
those with angina or other evidence of coronary disease were carefully
excluded.
Second, people with a disease may recall exposure differently from those
without the disease. With all the publicity surrounding the possible risks
of various environmental exposures or drugs, it is entirely possible that
victims of disease would remember their previous exposures more acutely
than nonvictims or even overestimate their exposure. The influence of
It has been suggested that one should judge the validity of a case-
control study by first considering how a randomized controlled trial of
the same question would have been conducted (14). Of course, one could
not actually do the study that way. But a randomized controlled trial
would be the scientific standard against which to consider the effects
of the various compromises that are inherent in a case-control study.
One would enter into a trial only those patients who could take the
experimental intervention if it were offered, so in a case-control study
one would select cases and controls who could have been exposed. For
example, a study of whether NSAIDs are a cause of renal failure would
include men and women who had no contraindications to taking
NSAIDs, such as peptic ulcer. Similarly, both cases and controls should
have been subjected to equal efforts to discover renal disease if it were
present. These and other parallels between clinical trials and case-con-
trol studies can be exploited when trying to think through just what
could go wrong, how serious a problem is it, and what can be done
about it.
There have also been efforts to set out criteria for sound case-control
studies (15). To apply these guidelines requires an in-depth understanding
of the many possible determinants of exposure and disease, as well
as the detection of both, in actual clinical situations.
USING CASE-CONTROL STUDIES TO EXAMINE HEALTH CARE
The major use of case-control studies has been to test hypotheses about
the etiology of disease. More recently, investigators have exploited the
advantages of the case-control design to study questions related to the
provision and quality of health care.
Example Are cerebral palsy and fetal death preventable? British investigators
(16) used a case-control design to compare the antepartum care received
by 141 babies developing cerebral palsy and 62 dying intrapartum or neona-
tally. Each case was matched with two healthy babies born at the same time
and place. A failure to respond to signs of severe fetal distress was more
common among cases than controls but only accounted for a very small
percentage of babies with cerebral palsy.
Summary
Much of medical progress is derived from the careful study of sick
individuals. Case reports are studies of just a few patients, e.g., <10. They
are a useful means of describing unusual presentations of disease, examin-
ing the mechanisms of disease, and raising hypotheses about causes and
treatments. However, case reports are particularly prone to bias and
chance. Case series-studies of larger collections of patients-still suffer
from the absence of a reference group with which to compare the experi-
ence of the cases and from sampling cases at various times in the course
of their disease.
In case-control studies, a group of cases is compared with a similar
group of noncases (controls). A major advantage resides in the ability to
assemble cases from treatment centers or disease registries as opposed to
finding them or waiting for them to develop in a defined population at
risk. Thus case-control studies are much less expensive and much quicker
to perform than cohort studies and the only feasible strategy for studying
risk factors for rare diseases. Relative risk can be estimated by the odds
ratio, although it is not possible to compute incidences or relative risk
directly. The disadvantages of the case-control design all relate to its con-
11
CAUSE
causal relationships. In fact, most of this book has been about methods
used to establish cause, although we have not called special attention to
the term.
In this chapter, we review concepts of cause in clinical medicine. We
then outline the kinds of evidence that, when present, strengthen the likeli-
hood that an association represents a causal relationship. We also deal
briefly with a kind of research design not yet considered in this book:
studies in which exposure to a possible cause is known only for groups
and not specifically for individuals in the groups.
Concepts of Cause
Webster's (2) defines cause as "something that brings about an effect or
a result." In medical textbooks, cause is usually discussed under such
headings as "etiology," "pathogenesis," "mechanisms," or "risk factors."
Cause is important to practicing physicians primarily in guiding their
approach to three clinical tasks: prevention, diagnosis, and treatment. The
clinical example at the beginning of this chapter illustrates how knowledge
of cause-and-effect relationships can lead to successful preventive strate-
gies. Likewise, when we periodically check patients' blood pressures, we
are reacting to evidence that hypertension causes morbidity and mortality
and that treatment of hypertension prevents strokes and myocardial in-
farction. The diagnostic process, especially in infectious disease, frequently
involves a search for the causative agent. Less directly, the diagnostic
process often depends on information about cause when the presence of
risk factors is used to identify groups of patients in whom disease prevalence
is high (see Chapter 3). Finally, belief in a causal relationship underlies
every therapeutic maneuver in clinical medicine. Why give penicillin
for pneumococcal pneumonia unless we think it will cause a cure? Or
advise a patient with metastatic cancer to undergo chemotherapy unless
we believe the antimetabolite will cause a regression of metastases and a
prolongation of survival, comfort, and/or ability to carry on daily
activities?
By and large, clinicians are more interested in treatable or reversible
than immutable causes. Researchers, on the other hand, might also be
interested in studying causal factors for which no efficacious treatment
or prevention exists, in hopes of developing preventive or therapeutic
interventions in the future.
SINGLE AND MULTIPLE CAUSES
In 1882, 40 years after the Holmes-Meigs confrontation, Koch set forth
postulates for determining that an infectious agent is the cause of a disease.
Basic to his approach was the assumption that a particular disease has one
cause and a particular cause results in one disease. He stipulated that:
[Figure: causes of tuberculosis. Recoverable labels: Crowding, Malnutrition, Vaccination, Genetic, Exposure to Mycobacterium, Infection, Tissue Invasion and Reaction, Tuberculosis.]
Example During the past two decades, the death rate from coronary
artery disease has dropped more than a third. This decline accompanied
decreased exposure, in the population as a whole, to several risk factors for
cardiovascular disease: A larger proportion of people with hypertension are
being treated effectively, middle-aged men are smoking less, and fat and
cholesterol consumption has declined. These developments were, at least in
part, the result of both epidemiologic and biomedical studies and have spared
tens of thousands of lives per year. It is doubtful that they would have
occurred without understanding of both the proximal mechanisms and the
more remote origins of cardiovascular disease (6).
INTERPLAY OF MULTIPLE CAUSES
When more than one cause acts together, the resulting risk may be greater
or less than would be expected by simply combining the effects of the
separate causes. Clinicians call this phenomenon synergism if the joint
Figure 11.2. A, Declining death rate from respiratory tuberculosis in England and
Wales over the past 150 years. (From McKeown T. The role of medicine: dream,
mirage or nemesis. London: Nuffield Provincial Hospital Trust, 1976.) B, Excess
tuberculosis cases in the United States, 1985-1992. Difference between expected
and observed number of cases. Dotted line, observed cases; solid line, expected
cases. (From Cantwell MF, et al. Epidemiology of tuberculosis in the United States,
1985 through 1992. JAMA 1994;272:535-539.)
[Figure annotations: A, tubercle bacillus identified; B, 52,100 excess cases.]
Figure 11.3. Interaction of multiple causes of disease. The risk of developing cardiovascular
disease in men according to the level of several risk factors alone and
in combination. Abnormal values are in shaded boxes. (Redrawn from Kannel WB.
Preventive cardiology. Postgrad Med 1977;61:74-85.)
effect is greater than the sum of the effects of the individual causes, and
antagonism if it is less.¹
Example Figure 11.3 shows the probability of developing cardiovascular
disease over an 8-year period among men aged 40. Men who did not smoke
cigarettes, had low serum cholesterol values, and had low systolic blood
pressure readings were at low risk of developing disease (12/1000). Risk
increased, in the range of 20 to 61/1000, when the various factors were
present individually. But when all three factors were present, the absolute
risk of cardiovascular disease (317/1000) was almost three times greater than
the sum of the individual risks (7).
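The comparison made in this example, joint risk against the sum of the individual risks, can be sketched in a few lines. The joint risk (317/1000) and the lowest and highest single-factor risks (20 and 61 per 1000) come from the example; the middle single-factor risk is not stated in the text, so the value of 40 per 1000 used here is a labeled assumption, as is the function name.

```python
def compare_joint_to_sum(joint_risk, individual_risks):
    """Compare an observed joint risk with the sum of the individual risks."""
    expected_if_additive = sum(individual_risks)
    ratio = joint_risk / expected_if_additive
    if ratio > 1:
        label = "synergism"
    elif ratio < 1:
        label = "antagonism"
    else:
        label = "no interaction on the additive scale"
    return expected_if_additive, ratio, label


# Risks per 1000 men over 8 years.
# ASSUMPTION: 40 is a placeholder for the single-factor risk not given in the text.
individual = [20, 40, 61]
joint = 317

expected, ratio, label = compare_joint_to_sum(joint, individual)
print(f"Sum of individual risks: {expected}/1000")
print(f"Observed joint risk is {ratio:.1f} times the sum: {label}")
```

Any value between 20 and 61 for the assumed risk leaves the conclusion unchanged: the joint risk is well over twice the sum of the individual risks.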
1 Statistical interaction is present when combinations of variables in a mathematical model add to the
model's explanatory power after the net effects of the individual predictor variables have been taken into
account. It is conceptually related to biologic synergy and antagonism but is a mathematical construct,
not an observable phenomenon in nature.
Establishing Cause
In clinical medicine, it is not possible to prove causal relationships beyond
any doubt. It is only possible to increase one's conviction of a cause-
and-effect relationship, by means of empiric evidence, to the point at which,
for all intents and purposes, cause is established. Conversely, evidence
against a cause can be mounted until a cause-and-effect relationship be-
comes implausible. The possibility of a postulated cause-and-effect rela-
tionship should be examined in as many different ways as possible. This
usually means that several studies must be done to build evidence for or
against cause.
ASSOCIATION AND CAUSE
Two factors-the suspected cause and the effect-obviously must
appear to be associated if they are to be considered as cause and effect.
However, not all associations are causal. Figure 11.5 outlines other
kinds of associations that must be excluded. First, a decision must be
made as to whether an apparent association between a purported cause
and an effect is real or merely an artifact because of bias or random
variation. Selection and measurement biases and chance are most likely
to give rise to apparent associations that do not exist in nature. If these
problems can be considered unlikely, a true association exists. But
before deciding that the association is causal, it is necessary to know
if the association occurs indirectly, through another (confounding) fac-
236 CLINICAL [PIDEMIOlOGY
[Figure: odds ratios plotted on a horizontal logarithmic axis running from 0.1 to 10.0.]
Figure 11.4. Example of effect modification: how the risk of cardiac arrest in
patients using thiazide diuretics (compared with the risk of cardiac arrest in patients
using beta-blockers) changes according to use of potassium-sparing diuretics. Odds
ratios, with 95% confidence intervals (CI), increase with increasing dose of diuretic,
suggesting that it is safer to use beta-blockers than thiazide diuretics. However, with
the addition of potassium-sparing diuretics, thiazide diuretics cause a lower risk of
cardiac arrest than beta-blocker therapy. (Redrawn from Siscovick DS, et al. Diuretic
therapy for hypertension and the risk of primary cardiac arrest. N Engl J Med
1994;330:1852-1857.)
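Effect modification of this kind is usually examined by computing the measure of association separately within each stratum of the third variable. The following is a minimal sketch of that calculation; the function name and every count in it are hypothetical illustrations, not data from the Siscovick study.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio from a 2 x 2 case-control table (cross-product ratio)."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


# Hypothetical counts only. "Exposed" stands for thiazide users and
# "unexposed" for beta-blocker users, within each stratum.
STRATA = {
    "without potassium-sparing diuretic": (30, 100, 20, 150),
    "with potassium-sparing diuretic": (10, 100, 20, 150),
}

if __name__ == "__main__":
    for label, table in STRATA.items():
        print(label, round(odds_ratio(*table), 2))
```

If the stratum-specific odds ratios differ materially, as they do in these invented tables, the third variable is acting as an effect modifier and a single summary odds ratio would hide that.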
[Figure 11.5 diagram. Labels recovered: Explanation; Finding; Association; Bias in selection
or measurement; Confounding; Cause.]
RESEARCH DESIGN
When considering a possible causal relationship, the strength of the
research design used to establish the relationship is an important piece of
evidence.
Of the research designs so far discussed in this book, well-conducted
randomized controlled trials, with adequate numbers of patients; blinding
of therapists, patients, and researchers; and carefully standardized meth-
ods of measurement and analysis are the best evidence for a cause-and-
effect relationship. Randomized controlled trials guard against differences
in the groups being compared, both for factors already known to be im-
portant, which can be overcome by other methods, and for unknown con-
founding factors.
We ordinarily use randomized controlled trials to provide evidence
about causal relationships for treatments and prevention. However, as
pointed out in Chapter 6, randomized controlled trials are rarely feasible
when studying causes of disease. Observational studies must be used
instead.
In general, the further one must depart from randomized trials, the less
the research design protects against possible biases and the weaker the
evidence is for a cause-and-effect relationship. Well-conducted cohort stud-
ies are the next best design, because they can be performed in a way that
minimizes known confounding, selection and measurement biases. Cross-
sectional studies are vulnerable because they provide no direct evidence
of the sequence of events. True prevalence surveys (cross-sectional studies
of a defined population) guard against selection bias but are subject
[Figure: scatterplot with one point per country. Countries labeled: Finland, U.S.A.,
Scotland, N.Z., Australia, Canada, E.&W., Ireland, Norway, Netherlands, Denmark, Belgium,
Sweden, Austria, West Germany, Switzerland, Italy, France.]
Figure 11.6. Example of an aggregate risk study: relationship between wine
consumption and cardiac mortality in developed countries. (Drawn from St. Leger AS,
Cochrane AL, Moore F. Factors associated with cardiac mortality in developed
countries with particular reference to the consumption of wine. Lancet 1979;1:1017-
1020.)
made among the groups to determine if the effect occurs in the same
sequential manner in which the suspected cause was introduced. If the
effect regularly follows introduction of the suspected cause at various times
and places, there is stronger evidence for cause than if this phenomenon
is observed only once, because it is even more improbable that the same
extraneous factor(s) occurred at the same time in relation to the cause in
many different places and eras.
Example Because there were no randomized controlled trials of cervical
cancer screening programs before they became widely accepted, their effec-
tiveness must be evaluated by means of observational studies. A multiple
time series study has provided some of the most convincing evidence of their
effectiveness (12). Data were gathered on screening programs begun in the
various Canadian provinces at various times during a 10-year period in the
1960s and 1970s. Reductions in mortality regularly followed the introduction
of screening programs regardless of time and location. With these data, it
[Figure: time series plotted by years of surveillance, 1988 to 1992.]
Figure 11.7. A time-series study of the relationship of clindamycin use and Clostridium
difficile-associated diarrhea. (Redrawn from Pear SM, Williamson TH, Bettin KM,
Gerding DN, Galgiani JN. Decrease in nosocomial Clostridium difficile-associated
diarrhea by restricting clindamycin use. Ann Intern Med 1994;120:272-277.)
Table 11.1
Evidence That an Association Is Cause and Effect*
Criteria    Comments
*Modified from Bradford-Hill AB. The environment and disease: association or causation? Proc R Soc Med
1965;58:295-300.
[Figure: bar chart. Horizontal-axis categories: 0, 1-14, 15-24, 25+. Bar values recovered: 127, 251.]
Biologic Plausibility
Whether the assertion of cause and effect is consistent with our knowl-
edge of the mechanisms of disease as they are currently understood is
often given considerable weight when assessing causation. When we have
absolutely no idea how an association might have arisen, we tend to be
skeptical that the association is real. Such skepticism often serves us well.
For example, the substance Laetrile was touted as a cure for cancer in the
early 1980s. However, the scientific community was not convinced, because
they could think of no biologic reason why an extract of apricot pits not
chemically related to compounds with known anticancer activity should
be effective against cancer cells. To nail down the issue, Laetrile was finally
submitted to a randomized controlled trial in which it was shown that the
substance was, in fact, without activity against the cancers studied (19).
It is important to remember, however, that what is considered biologi-
cally plausible depends on the state of medical knowledge at the time. In
Meigs's day, contagious diseases were biologically implausible. Today, a
biologically plausible mechanism for puerperal sepsis, the effects of streptococcal
infection, has made it easier for us to accept Holmes's observations.
[Figure: bar chart. Horizontal-axis categories: under 5, 5-9, 10-14, 15+. Bar values recovered: 4.7, 2.0.]
Figure 11.9. Reversible association: declining mortality from lung cancer in ex-
cigarette smokers. The data exclude people who stopped smoking after getting
cancer. (Drawn from Doll R, Peto R. Mortality in relation to smoking: 20 years'
observations on male British doctors. Br Med J 1976;2:1525-1536.)
Figure 11.10. Relative strength of evidence for and against a causal effect. Note
that with study designs, except for case reports and time series, the strength of
evidence for a causal relationship is a mirror image of that against. With findings,
evidence for a causal effect does not mirror evidence against an effect.
weaken the evidence for cause. The figure roughly indicates relative
strengths in helping to establish or discard a causal hypothesis. Thus a
carefully done cohort study showing a strong association and a dose-
response relationship is strong evidence for cause, while a cross-sectional
study finding no effect is weak evidence against cause.
Summary
Cause-and-effect relationships underlie diagnostic, preventive, and ther-
apeutic activities in clinical medicine.
Diseases usually have many causes, although occasionally one might
predominate. Often, several causes interact with one another in such a
way that the risk of disease is more than would be expected by simply
combining the effects of the individual causes taken separately. In other
cases, the presence of a third variable, an effect modifier, modifies the
strength of a cause-and-effect relationship between two variables.
Causes of disease can be proximal pathogenetic mechanisms or more
remote genetic, environmental, or behavioral factors. Medical interventions
to prevent or reverse disease can occur at any place in the development
of disease, from remote origins to proximal mechanisms.
The case for causation is usually built over time with several different
studies. It rests primarily on the strength of the research designs used to
establish it. Because we rarely have the opportunity to establish cause
using randomized controlled trials, observational studies are necessary.
Some studies of populations (time series and multiple time series studies)
may suggest causal relationships when a given exposure of groups of
people is followed by a given effect.
Features that strengthen the argument for a cause-and-effect relationship
include an appropriate temporal relationship, a strong association
between purported cause and effect, the existence of a dose-response rela-
tionship, a fall in risk when the purported cause is removed, and consistent
results among several studies. Biologic plausibility and coherence with
known facts are other features that argue for a causal relationship.
REFERENCES
1. Holmes OW. On the contagiousness of puerperal fever. Med Classics 1936;1:207-268.
[Originally published 1843.]
2. Webster's ninth new collegiate dictionary. Springfield, MA: Merriam-Webster, 1991.
3. MacMahon B, Pugh TF. Epidemiology: principles and methods. Boston: Little, Brown &
Co., 1970.
4. Cantwell MF, Snider DE, Cauthen GM, Onorato IM. Epidemiology of tuberculosis in the
United States, 1985 through 1992. JAMA 1994;272:535-539.
5. Weis SE, et al. The effect of directly observed therapy on the rates of drug resistance and
relapse in tuberculosis. N Engl J Med 1994;330:1179-1184.
6. Goldman L. Analyzing the decline in the CAD death rate. Hosp Pract 1988;23(2):109-
117.
7. Kannel WB. Preventive cardiology. Postgrad Med 1977;61:74-85.
8. Siscovick DS, et al. Diuretic therapy for hypertension and the risk of primary cardiac
arrest. N Engl J Med 1994;330:1852-1857.
9. O'Connor GT, Morton JR, Diehl MJ, Olmstead EM, Coffin LH, Levy DG. Differences
between men and women in hospital mortality associated with coronary artery bypass
graft surgery. Circulation 1993;88(1):2104-2110.
10. St. Leger AS, Cochrane AL, Moore F. Factors associated with cardiac mortality in
developed countries with particular reference to the consumption of wine. Lancet 1979;1:1017-
1020.
11. Pear SM, Williamson TH, Bettin KM, Gerding DN, Galgiani JN. Decrease in nosocomial
Clostridium difficile-associated diarrhea by restricting clindamycin use. Ann Intern Med
1994;120:272-277.
12. Cervical Cancer Screening Programs. I. Epidemiology and natural history of carcinoma
of the cervix. Can Med Assoc J 1976;114:1003-1033.
13. Bradford-Hill AB. The environment and disease: association or causation? Proc R Soc
Med 1965;58:295-300.
14. Kuller L, Wing R. Weight loss and mortality. Ann Intern Med 1993;119:630-632.
15. Centers for Disease Control. Highlights of the surgeon general's report on smoking and
health. MMWR 1979;28:1-11.
16. Beasley RP, Lin CC, Hwang LY, Chien CS. Hepatocellular carcinoma and hepatitis B
virus. Lancet 1981;2:1129-1133.
17. Mandel JS, et al. Reducing mortality from colorectal cancer by screening for fecal occult
blood. N Engl J Med 1993;328:1365-1371.
18. Selby JV, Friedman GD, Quesenberry CP, Weiss NS. A case-control study of screening
sigmoidoscopy and mortality from colorectal cancer. N Engl J Med 1992;326:653-657.
19. Moertel CG, et al. A clinical trial of amygdalin (Laetrile) in the treatment of human cancer.
N Engl J Med 1982;306:201-206.
SUGGESTED READINGS
Buck C. Popper's philosophy for epidemiologists. Int J Epidemiol 1975;4:159-168.
Chalmers AF. What Is This Thing Called Science? 2nd ed. New York: University of Queensland
Press, 1982.
Department of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences
Centre. How to read clinical journals. IV: To determine etiology or causation. Can Med
Assoc J 1981;124:985-990.
Rothman KJ (ed). Causal inference. Chestnut Hill, MA: Epidemiology Resources, 1988.
12
SUMMING UP
Where is the knowledge we have lost in information?
T. S. Eliot, "The Rock"
fallible and subject to revision. One scholar (1) has distinguished between
"decisions" and "conclusions." We decide something is true if we will act
as if it is so, for the present, until better information comes along. Conclu-
sions, on the other hand, are settled issues and are expected to be more
durable. Clinicians are mainly concerned with decisions. The integrity of
the scientific enterprise rests on the willingness of its participants to engage
in open-minded, well-informed arguments for and against a current view
of the truth, to accept new evidence, and to change their minds.
[Figure 12.1 diagram. Labels recovered: Contribution to Answering the Clinical Question;
Secondary; Laboratory ("bench," "basic") research; Analogy; Weak; Primary; Direct; Strong.]
Figure 12.1. The literature on a research question: the relative value of various
kinds of articles for answering a clinical question.
evidence by several years and not infrequently disagreed with it. For example,
by 1980 there were 12 RCTs in the literature that had examined the
efficacy of prophylactic lidocaine in the treatment of acute MI. Essentially,
all showed that treatment with lidocaine was no better and often worse than
placebo, yet the majority of review articles and chapters published during
the 1980s continued to recommend routine or selective use of lidocaine.
Other articles describe original research done in laboratories for the
purpose of understanding the biology of disease. These studies provide
the richest source of hypotheses about health and disease. Yet, "bench"
research cannot, in itself, establish with certainty what will happen in
humans, because phenomena in actual patients, who are complex organ-
isms in a similarly complex physical and social environment, involve vari-
ables that have been deliberately excluded from laboratory experiments.
Research involving intact humans and intended to guide clinical deci-
sion making ("clinical research") is, of course, conducted with varying
degrees of scientific rigor. Even by crude standards, most studies are rela-
tively weak. For example, a recent review of the methods of clinical studies
in three surgical journals revealed that more than 80% had no comparison
group, much less a randomized control group (3).
Throughout this book we have argued that the validity of clinical re-
search depends on the strength of its methods (internal validity) and the
extent to which it applies to a particular clinical setting (generalizability).
If this is so, a few good articles are more valuable than many weak or
inappropriate ones. Thus the overall conclusion from the medical literature
often depends on how a relatively few articles are interpreted. A review
of the literature should involve selecting these articles carefully, identifying
their scientific strengths and weaknesses, and synthesizing the evidence
when their conclusions differ.
FINDING USEFUL ARTICLES
• Does the article address the specific clinical question that was the reason
for the search in the first place?
[Figure 12.2 flow diagram. Articles remaining after each round of exclusions:
1631 articles identified
  Excluded: 369 no original data; 362 non-English language; 70 with 10 or fewer patients;
  27 no outcomes reported
803 remaining
  Excluded: 115 not a cohort undergoing the replacement; 96 no pertinent outcomes reported;
  37 different surgical procedures
553 remaining
  Excluded: 336 inadequate outcome assessment
217 remaining
  Excluded: 87 other exclusions
130 remaining]
Figure 12.2. Literature search: identifying the few most important articles from the
medical literature as a whole. (Callahan CM, Drake BG, Heck DA, Dittus RS. Patient
outcome following tricompartmental total knee replacement. JAMA 1994;271:1349-
1357.)
to date. The results, at least for some clinical questions, should be available
in the mid-1990s.
Table 12.1
Matching the Strongest Research Designs to Clinical Questions
Question      Design
Diagnosis     Prevalence
Prevalence    Prevalence
Incidence     Cohort
Risk          Cohort
              Case control
Prognosis     Cohort
Treatment     Clinical trial
Prevention    Clinical trial
Cause         Cohort
              Case control
Table 12.2
Characteristics of a Study That Determine Whether It Can Test
or Only Raise Hypotheses
Characteristic
The first, a strong research design, is not a strictly separate factor from
the others. Making hypotheses in advance and limiting the number of
comparisons examined reduce the number of apparently "significant"
comparisons that emerge from a study. The exploration for effects in vari-
ous subgroups of a larger study population is a common analytic strategy
that may result in chance or spurious associations. When hypotheses made
in advance, a priori hypotheses, are confirmed, one can place more confi-
dence in the findings. Alternatively, investigators can simply limit the
number of comparisons made after the fact, so that there is less chance of
false-positive findings for the study as a whole. Or they can insist on a
particularly small p value before ruling out the role of chance in explaining
particular findings.
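The arithmetic behind this concern can be sketched briefly. The example below assumes that the comparisons are independent, that no true effects exist, and that each comparison is tested at the same significance level; the function names and the 0.05 level are illustrative assumptions rather than anything prescribed in the text. The second function shows one common way of "insisting on a particularly small p value," the Bonferroni adjustment.

```python
def chance_of_any_false_positive(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent comparisons is
    'significant' at level alpha when no true effect exists."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


def bonferroni_threshold(n_comparisons, alpha=0.05):
    """Per-comparison p value required to keep the study-wide
    false-positive rate at or below alpha (Bonferroni adjustment)."""
    return alpha / n_comparisons


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(chance_of_any_false_positive(k), 3), bonferroni_threshold(k))
```

Under these assumptions, the chance of at least one spurious "significant" result rises quickly as the number of comparisons grows, which is why subgroup findings that were not specified in advance are better treated as hypotheses.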
Another strategy to protect against the acceptance of spurious or chance
associations is to raise hypotheses on one set of data and test them on a
separate one (fig. 12.3). The availability of large data sets and statistical
computer software makes it relatively easy for the analysis to include
multiple variables, considered either separately or together in models. The
analysis of multiple variables should be viewed as raising hypotheses, as
the investigators rarely specify in advance what the model will find, much
less the weight given to each finding. If the data set is large enough, it can
be divided randomly in half, with one half being used to develop the
model and the second half used to confirm it. Or it can be tested in a
different setting. This latter process is illustrated in the following example.
Figure 12.3. Developing a hypothesis on one data set and testing it on another.
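The random division of a data set into a development half and a confirmation half can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the fixed seed, and the use of integers as stand-ins for patient records are all hypothetical.

```python
import random


def split_in_half(records, seed=0):
    """Randomly divide records into a development half and a confirmation half.

    A fixed seed makes the split reproducible, so the same patients stay in
    the same half every time the analysis is rerun.
    """
    rng = random.Random(seed)
    shuffled = list(records)  # copy, so the caller's data are left untouched
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


if __name__ == "__main__":
    # Hypothetical patient identifiers standing in for full records.
    development, confirmation = split_in_half(range(1, 21))
    print(sorted(development))
    print(sorted(confirmation))
```

The essential design point is that the model is fitted only on the first half; the second half is not examined until the hypothesis is fixed, so that it provides an independent test.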
[Figure: relative risk or odds ratio by type of study, plotted on an axis running from
"Treatment Better" to "Treatment Worse." Values recovered:]
Type of Study            Relative Risk or Odds Ratio
Randomized trials
  RCT 1                  0.41
  RCT 2                  0.20
  RCT 3                  0.26
  RCT 4                  0.24
  RCT 5                  0.20
  RCT 6                  1.01
  RCT 7                  0.63
Nonrandomized trials
  Study 1                0.80
  Study 2                0.46
  Study 3                0.25
  Study 4                1.56
  Study 5                0.71
  Study 6                0.98
Pooling refers to the process of aggregating the data from several rela-
tively small studies of the same question to form, in effect, one large one.
It is permissible when it can be shown that the studies are sufficiently
similar to each other (in patients, intervention, and outcome measures) to
treat them as if they are part of a single study. Pooling attempts to assemble
enough observations to generate a precise overall estimate of effect, not to
account for differences in conclusions among studies. The advantage of
pooling is that it can result in adequate statistical power to detect meaningful
differences, if they exist. Pooling is particularly useful when the disease
and/or the outcome events of interest occur infrequently. Under these
circumstances there are no other feasible ways to achieve statistical power.
Example There are many reports of peptic ulcer disease during corticosteroid
therapy. Yet, it has been difficult to establish by means of observational
studies whether corticosteroids cause ulcers, because many of the situations
in which they are given (e.g., during stress and in conjunction with
gastric-irritating drugs) may themselves predispose to peptic ulcer disease.
Also, ulcers may be sought more diligently in patients receiving corticosteroids
and go undetected in other patients.
Randomized controlled trials are the best way to determine cause and
effect. There have been many randomized trials in which corticosteroids were
used to treat various conditions and peptic ulcer disease was a side effect.
None of these studies was large enough in itself to test the corticosteroid/
ulcer hypothesis. But together they provide an opportunity to examine the
rate of this rare event.
In one review of 71 controlled trials of corticosteroids in which patients
were randomized (or its equivalent) and peptic ulcer disease was considered,
there were about 86 patients and 1 case of peptic ulcer disease per study;
only 31 of the trials reported any patients with ulcers (14). The investigators
pooled the results of these 71 trials to increase statistical power. In the pooled
study, there were 6111 patients and about 80 ulcers. The rate of peptic ulcer
disease was 1.8% in the corticosteroid group and 0.8% in the control group
(relative risk, 2.3; 95% confidence interval, 1.4-3.7). The results were similar
when examined separately according to the presence and absence of other
risk factors; various doses, routes of administration, and duration of therapy;
and whether the disease was suspected, defined as bleeding, or specifically
diagnosed.
Thus the combined results of many studies, each with relatively sound
design but too small to answer the question, gave sufficient statistical power
to detect risk.
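A minimal sketch of pooling in this simple sense, treating several small trials as one large one, is given below. The trial counts are hypothetical and are not the data from the review cited above; the function name is likewise illustrative. The confidence interval uses the standard large-sample approximation on the logarithm of the relative risk. Formal meta-analyses usually weight each study rather than simply adding counts, so this is an illustration of the idea, not of the reviewers' method.

```python
import math


def pooled_relative_risk(trials, z=1.96):
    """Pool counts across trials and return (relative risk, lower, upper).

    Each trial is (events_treated, n_treated, events_control, n_control).
    Counts are simply summed, as if the trials were one large study.
    """
    events_t = sum(t[0] for t in trials)
    n_t = sum(t[1] for t in trials)
    events_c = sum(t[2] for t in trials)
    n_c = sum(t[3] for t in trials)

    rr = (events_t / n_t) / (events_c / n_c)
    # Standard error of log(RR) for cumulative-incidence data.
    se_log_rr = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical small trials, each far too small to estimate risk on its own.
HYPOTHETICAL_TRIALS = [
    (2, 100, 1, 100),
    (1, 80, 0, 80),
    (3, 120, 1, 120),
]

if __name__ == "__main__":
    rr, lower, upper = pooled_relative_risk(HYPOTHETICAL_TRIALS)
    print(f"RR {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With so few events the interval in this invented example remains wide; the precision reported in the corticosteroid review came from pooling many more trials and events.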
One way to avoid this bias is to give more credibility to large studies
than to small ones. Most large studies, having required great effort and
expense in their execution, will be published regardless of whether they
have a positive or negative finding. Smaller studies, on the other hand,
require less investment and so are more easily discarded in the selection
process.
tion, but their opinions are only as good as the consultant, who may be
biased by the beliefs and financial interests of his or her field. For example,
it is natural for gastroenterologists to believe in endoscopy more than
radiologic contrast studies and surgeons to believe in surgery over medical
therapy.
A growing number of databases are complete, up to date, and widely
available by telephone, fax, floppy disk, CD-ROM, e-mail, the Internet,
and computer bulletin board. Examples include a 24-hr telephone
connection to the Centers for Disease Control and Prevention for information
about disease prevention offered to those traveling to any part of the
world; Toxline for information on poisonings; PDQ for current recommendations
for cancer chemotherapy; and an array of databases on drugs, their
toxicities, and adjustment of dose in renal failure. These databases contain
information that is essential to the practice of medicine but is too infrequently
needed and too extensive for clinicians to carry around in their
heads. Clinicians should find ways to access them in their location. They
should also use these databases with the lessons of this book in mind: The
data are only as good as the methods used to select them. Many of the
databases, such as guidelines of the Agency for Health Care Policy and
Research, the U.S. Preventive Services Task Force, and the American Col-
lege of Physicians, are created by excellent methods and make the process
clear. Some are the results of individuals or industries with conflict of
interest, and they should be used with skepticism.
Clinical Guidelines
Throughout the book we have argued that clinical research provides
the soundest grounds for establishing one's approach to clinical practice
and making decisions about patients. The shift away from anecdote and
personal experience has been called "evidence-based medicine" (19). An
important element in evidence-based medicine is the translation of research
findings into clear, unambiguous recommendations for clinicians. Practice
guidelines are systematically developed statements to assist clinicians in
deciding about appropriate health care for specific clinical problems (19).
Their development and use are now commonplace in many organized
medical settings. At their best, the validity of guidelines is established by
including in the panel that prepares them people who represent all relevant
aspects of the question (ranging from highly specialized researchers to
clinicians, economists, and patients) to cover all important aspects of the
question and to balance, if not eliminate, the vested interests of any one
or another participant. The best guidelines are based on research evidence,
not just expert opinion, and so often use formal processes of literature
review and synthesis, as described in this chapter (20).
18. Hebert PR, Fiebach NH, Eberlein KA, Taylor JO, Hennekens CH. The community-based
randomized trials of pharmacological treatment of mild-to-moderate hypertension. Am J
Epidemiol 1988;127:581-589.
19. Evidence-Based Medicine Working Group. Evidence-based medicine: a new approach to
teaching the practice of medicine. JAMA 1992;268:2420-2425.
20. Hayward RSA, Wilson MC, Tunis SR, Bass EB, Rubin HR, Haynes RB. More informative
abstracts of articles describing clinical practice guidelines. Ann Intern Med 1993;118:731-
737.
21. Grimshaw JM, Russell IT. Effect of clinical guidelines on medical practice. A systematic
review of rigorous evaluations. Lancet 1993;342:1317-1321.
SUGGESTED READINGS
Epidemiology Work Group of the Interagency Regulatory Liaison Group. Guidelines for
documentation of epidemiologic studies. Am J Epidemiol 1981;114:609-613.
Goldschmidt PG. Information synthesis: a practical guide. Health Serv Res 1986;21:215-237.
Haynes RB, McKibbon KA, Fitzgerald D, Guyatt GH, Walker CJ, Sackett DL. How to keep up
with the medical literature. I: Why try to keep up and how to get started. Ann Intern Med
1986;105:149-153.
Haynes RB, McKibbon KA, Fitzgerald D, Guyatt GH, Walker CJ, Sackett DL. How to keep
up with the medical literature. II: Deciding which journals to read regularly. Ann Intern
Med 1986;105:309-312.
Haynes RB, McKibbon KA, Fitzgerald D, Guyatt GH, Walker CJ, Sackett DL. How to keep
up with the medical literature. III: Expanding the number of journals you read regularly.
Ann Intern Med 1986;105:474-478.
Haynes RB, McKibbon KA, Fitzgerald D, Guyatt GH, Walker CJ, Sackett DL. How to keep
up with the medical literature. IV: Using the literature to solve clinical problems. Ann
Intern Med 1986;105:636-640.
Haynes RB, McKibbon KA, Walker CJ, Mousseau J, Baker LM, Fitzgerald D, Guyatt G, Norman
GR. Computer searching of the medical literature: an evaluation of MEDLINE searching
systems. Ann Intern Med 1985;103:812-816.
Irwig L, Tosteson AN, Gatsonis C, Lau J, Colditz G, Chalmers TC, Mosteller F. Guidelines
for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994;120:667-676.
L'Abbe KA, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med
1987;107:224-233.
Light RJ, Pillemer DB. Summing up: the science of reviewing research. Cambridge, MA:
Harvard University Press, 1984.
Mulrow CD. State of the science [The Medical Review Article]. Ann Intern Med 1987;106:485-
488.
Thompson SG, Pocock SJ. Can meta-analyses be trusted? Lancet 1991;338:1127-1130.
ALL STUDIES
1. What kind of clinical question is the research intended to answer?
The research design should match the clinical question (see Table 12.1).
2. What patients, variables, and outcomes were studied?
These determine the generalizability of the results.
PREVALENCE STUDIES
1. What are the criteria for being a case?
Prevalence depends on what one calls a case.
2. In what population are the cases found?
Prevalence depends on the group of people in which it is described.
3. Is prevalence described for an unbiased sample of the population?
Prevalence for the sample estimates prevalence for the population to
the extent that the sample is unbiased.
COHORT STUDIES
1. Are all members of the cohort:
a. Entered at the beginning of follow-up (inception cohort)?
Otherwise people who do unusually well or badly will not be
counted in the result.
b. At risk for developing the outcome?
It makes no sense to describe how outcomes develop over time in
people who already have the disease or cannot develop it.
c. At a similar point (zero time) in the course of disease?
Prognosis varies according to the point in the course of disease at
which one begins counting outcome events.
2. Is there complete follow-up on all members?
Drop-outs can bias the results if they on average have a better or worse
course than those who remain in the study.
3. Are all members of the cohort assessed for outcomes with the same intensity?
Otherwise differences in outcome rates might be from measurement
bias, not true differences.
4. Are comparisons unbiased? (Would members of the cohorts have the same
outcome rate except for the variable of interest?)
To attribute outcome to the factor of interest other determinants of
outcome must occur equally in the groups compared.
RANDOMIZED TRIALS
1. Are the basic guidelines for cohort studies satisfied?
Clinical trials are cohort studies
2. Were patients randomly allocated to treated and control groups?
This is the only effective way to make a completely unbiased comparison
of treatments.
3. Were patients, caregivers, and researchers unaware of the treatment
group (masked) to which each patient belonged?
Masking participants in a trial helps assure that they are unbiased.
4. Were cointerventions the same in both groups?
Treating patients differently can destroy the comparability that was
achieved by randomization.
5. Were results described according to the treatment allocated or the treatment
actually received?
If not all patients receive the treatment assigned to them there are two
kinds of analyses with different objectives and scientific strengths. "Intention
to treat" analyses are for management decisions and are of the
randomly constituted groups. "Efficacy" analyses are to explain the
effect of the intervention itself, are of the treatment actually received,
and are a cohort study.
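The difference between the two kinds of analysis can be seen by grouping the same patients in two ways. The sketch below uses an entirely hypothetical eight-patient trial in which one patient assigned to treatment did not receive it; the field names and the function name are illustrative assumptions.

```python
from collections import defaultdict


def event_rates(patients, group_key):
    """Proportion of patients with the outcome event, by the chosen grouping."""
    tallies = defaultdict(lambda: [0, 0])  # group -> [events, patients]
    for patient in patients:
        tally = tallies[patient[group_key]]
        tally[0] += patient["event"]
        tally[1] += 1
    return {group: events / total for group, (events, total) in tallies.items()}


# Hypothetical data: the fourth patient was assigned to treatment
# but actually received control care.
PATIENTS = [
    {"assigned": "treatment", "received": "treatment", "event": 0},
    {"assigned": "treatment", "received": "treatment", "event": 0},
    {"assigned": "treatment", "received": "treatment", "event": 1},
    {"assigned": "treatment", "received": "control", "event": 1},
    {"assigned": "control", "received": "control", "event": 1},
    {"assigned": "control", "received": "control", "event": 1},
    {"assigned": "control", "received": "control", "event": 0},
    {"assigned": "control", "received": "control", "event": 0},
]

# Intention to treat: compare the randomly constituted groups.
intention_to_treat = event_rates(PATIENTS, "assigned")
# Efficacy: compare by treatment actually received (no longer randomized).
as_received = event_rates(PATIENTS, "received")

if __name__ == "__main__":
    print("Intention to treat:", intention_to_treat)
    print("As received:", as_received)
```

In these invented data the groups as randomized have identical event rates, while the groups as treated appear to differ, because the patient who crossed over also had the event. The second comparison has lost the protection of randomization.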
CASE-CONTROL STUDIES