Clinical Informatics: Hypothesis Tests and P-Values
$$t = \frac{\bar{x}_f - \bar{x}_p}{\sqrt{\dfrac{s_f^2}{n_f} + \dfrac{s_p^2}{n_p}}} = \frac{5.0\ \text{hours} - 4.0\ \text{hours}}{\sqrt{\dfrac{(1.7\ \text{hours})^2}{6} + \dfrac{(1.6\ \text{hours})^2}{7}}} = 1.09$$
Using a cutoff of 0.8, we would decide that Fraudulin
had an effect on insomnia. If the standard deviation
were twice as large for each group, it would halve the
resulting value of t to give t = 0.54. More widely dis-
persed data would lead us to decide the null hypoth-
esis. If n = 3 for both groups, t would be

$$t = \frac{5.0\ \text{hours} - 4.0\ \text{hours}}{\sqrt{\dfrac{(1.7\ \text{hours})^2}{3} + \dfrac{(1.6\ \text{hours})^2}{3}}} = 0.74$$
Fewer patients constitute weaker evidence, and the
value of t goes down. In this case it drops far enough
that we decide the null hypothesis.
If the mean number of hours slept with Fraudulin
was 3 instead of 5, t would have the value

$$t = \frac{3.0\ \text{hours} - 4.0\ \text{hours}}{\sqrt{\dfrac{(1.7\ \text{hours})^2}{6} + \dfrac{(1.6\ \text{hours})^2}{7}}} = -1.09$$
If we proceed as for a diagnostic questionnaire, we
would say Fraudulin has no effect. Obviously, some-
thing is wrong, since Fraudulin appears to be causing
patients to lose sleep.
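If you want to check this arithmetic yourself, it is easy to reproduce. The sketch below (in Python; the helper name t_statistic is ours, not part of the original article) computes t from the group means, standard deviations, and sample sizes used above:

```python
from math import sqrt

def t_statistic(mean_drug, sd_drug, n_drug, mean_placebo, sd_placebo, n_placebo):
    """t as used in the text: difference in means over the combined standard error."""
    standard_error = sqrt(sd_drug**2 / n_drug + sd_placebo**2 / n_placebo)
    return (mean_drug - mean_placebo) / standard_error

# The three scenarios discussed above (hours of sleep):
print(t_statistic(5.0, 1.7, 6, 4.0, 1.6, 7))  # ~1.09, the original trial
print(t_statistic(5.0, 1.7, 3, 4.0, 1.6, 3))  # ~0.74, only three patients per group
print(t_statistic(3.0, 1.7, 6, 4.0, 1.6, 7))  # ~-1.09, the drug group sleeps less
```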
There is a problem with how we set our cutoff. The
null hypothesis is that Fraudulin has no effect, but
using the cutoff as we described, we decide the null
hypothesis even if Fraudulin keeps patients awake.
"No effect" requires that we set our cutoff in both
directions; that is, if the means are no farther apart
than some cutoff, we decide the null hypothesis. Such
a two-sided cutoff is called two-tailed. If you wish to
be ignorant of harmful effects to your patients, use a
one-tailed test. Otherwise, prefer the two-tailed test.
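A small illustration of the difference, using the t values from our fictional trial (the cutoff of 0.8 is the one used earlier; the two decision rules are a sketch of the idea, not a prescription):

```python
# One-tailed vs. two-tailed decisions at a cutoff of 0.8.
for t in (1.09, -1.09):
    one_tailed = t > 0.8          # only looks for an improvement in sleep
    two_tailed = abs(t) > 0.8     # flags an effect in either direction
    print(f"t = {t:+.2f}: one-tailed sees an effect: {one_tailed}, "
          f"two-tailed sees an effect: {two_tailed}")
```

With t = -1.09 the one-tailed rule sees nothing, even though the drug group is sleeping a full hour less.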
Now that we have something to apply a cutoff to, we
can recapitulate the program we established for diag-
nostic questionnaires: consider the population we are
dealing with, the various probabilities of error, and
losses associated with each kind of error.
For diagnostic questionnaires, the population con-
sisted of patients. In our fictional trial, the popula-
tion consists of groups of patients. The rule to
remember is that an individual member of the popu-
lation is the grouping on which you calculate a sum-
mary statistic. One group of patients produces one
value of t. The population as a whole could be all clin-
ical trials of Fraudulin, all clinical trials of insomnia
medications, all clinical trials of drugs in the state of
Ohio, or as many other variations as we saw with
diagnostic questionnaires.
Once we have established our population, we
repeat the mental gymnastics we performed for diag-
nostic questionnaires: divide the population into two
groups, for one of which the null hypothesis is true
(Fraudulin has no effect) and for one of which the
other hypothesis is true (Fraudulin affects how much
patients sleep). We divide each of those groups into
those for which we decide correctly and those for
which we make a mistake:

                   Below cutoff     Above cutoff
Null hypothesis    correct          Type I error
Other hypothesis   Type II error    correct
For diagnostic questionnaires, we estimated the
relative numbers of individuals in each of the blocks
by referring to a more reliable test (e.g., a structured
interview). Yet that interview must have been estab-
lished based on a yet more reliable test, which must
have been established by reference to some even
more reliable test, and so on. This way lies madness.
Gosset's contribution was breaking this cycle.
Instead of establishing a population's properties
based on some more reliable test, a mathematical
model is used to approximate the population. We try
to find mathematical models whose form depends
only on plausible assumptions about the real world.
Many such models are known and bear the names of
the mathematicians who discovered them. If your
measurements arise from the sum of lots of small,
independent effects, they will be approximated by
the famous bell curve, more properly known as the
Gaussian or normal distribution. Sums of a few
large, independent effects produce a Cauchy distri-
bution. Poisson gave his name to the distribution of
the number of random events of some kind you would
expect to happen in a fixed period of time if there were
no connections among them. The classic example is the
number of soldiers who died from being kicked by a
horse each year in the Prussian army. If no horse, no
matter how ornery, kills more than one soldier (not
least because it is likely to be put down after the first),
there will be no clumps of deaths, and this model is
extremely accurate. Gumbel, Fréchet, and Weibull figured out
the generic distributions of maximum values (e.g.,
maximum height of a river or the maximum magni-
tude of an earthquake that will hit an area in some
period). Such mathematical models are ideals based
on an explicit set of assumptions. It takes common
sense, experience, and experimental design to
approach these assumptions in real life, and a failure
of these assumptions constitutes a new kind of error,
one for which we cannot give a probability.
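As a quick illustration of the first of these models (our own sketch, not part of the original article), measurements built up from many small, independent effects do come out looking Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each simulated "measurement" is the sum of 100 small, independent effects.
measurements = rng.uniform(-0.5, 0.5, size=(10_000, 100)).sum(axis=1)

# Rough check of the bell-curve shape: about 68% of values fall within one
# standard deviation of the mean and about 95% within two.
sd = measurements.std()
print(np.mean(np.abs(measurements - measurements.mean()) < sd))      # ~0.68
print(np.mean(np.abs(measurements - measurements.mean()) < 2 * sd))  # ~0.95
```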
Student's t-test makes three assumptions: each of
the measurements we make is independent, each is
drawn from the same population, and the population
is well approximated by a Gaussian distribution.
Independence is a subtle concept. When you make
a measurement on some member of a population, you
hope it will tell you something about the population as
a whole. If all the members of the population are inde-
pendent, it will only tell you about the population as a
whole. As soon as it tells you about parts of the popu-
lation beyond that one individual, you have lost inde-
pendence. For example, if your population is a rural
town where three quarters of the citizens have the
last name Cox, and you are studying congenital dis-
eases, finding your disease in a Cox tells you some-
thing about Coxes as opposed to the whole
population. Finding the same disease in a random
patient in Chicago tells you very little except about
whatever population the patient came from. Of
course, finding the disease in a person in Chicago
also tells you about his or her family. This is where
idealization comes in, and why independence is sub-
tle. If you arrange your experiment to exclude multi-
ple family members, then the idealization holds. Of
course, some subjects may have arrived in Chicago in
the same mass population movements. We can
always make an error and have dependence among
samples, but without plausible independence, you
may as well throw your data away. There are no
mathematical tricks that will save you.
Once we have a mathematical model of the popu-
lation, we use it to calculate error probabilities.
Recall the standard names α and β for the probability
of type I and type II error:

                   Below cutoff     Above cutoff
Null hypothesis    correct          α
Other hypothesis   β                correct
In order to calculate the error probabilities,
Student assumed that the measurements for each
treatment come from a Gaussian distribution. When
the null hypothesis is true, the Gaussian distribu-
tions for each group are centered at the same value,
although they may have different widths, as in the
figure. From these two distributions, Student calculated
the distribution of his t, also shown in the figure.

[Figure: Gaussian distributions of time slept (hours) for patients
receiving the drug and for patients receiving placebo, centered at the
same value, and the resulting distribution of t, with cutoffs marked at
−cutoff and +cutoff and the tail areas beyond the cutoffs shaded.]
The probability of type I error (of wrongly con-
cluding there is an effect) is the probability of meas-
uring a value of t beyond the cutoff (the shaded
fraction of the area in the figure) when the two
Gaussian distributions are centered at the same
value.
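Concretely, that shaded area can be read off the tail of the t distribution. A minimal sketch, assuming the conventional n_f + n_p − 2 = 11 degrees of freedom for our groups of 6 and 7 patients (the article does not state the degrees of freedom explicitly):

```python
from scipy import stats

df = 6 + 7 - 2   # assumed degrees of freedom for groups of 6 and 7 patients
cutoff = 0.8     # the cutoff used earlier in the article

# Probability of a type I error: the area under the t distribution beyond
# +cutoff and beyond -cutoff when the null hypothesis is true.
alpha = 2 * stats.t.sf(cutoff, df)
print(alpha)     # roughly 0.44
```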
Now we can define the p-value. To do so, you set
the value of t you calculated as your cutoff and cal-
culate α for that cutoff (i.e., probability of a type I
error). When the cutoff is set at t, the value of α is the
p-value. To calculate the p-value for our Fraudulin
data, we take 1.09 (the t value) as our cutoff and
have a computer calculate the fraction of the area
under the t distribution beyond +1.09 and −1.09 (shad-
ed areas in the graphic) to produce α, which is our p-
value. In this case p = 0.30. If you set the level of
significance at the traditional 0.05, you decide the
null hypothesis (Fraudulin has no effect). Good luck
getting these data published! If the p-value is lower
(e.g., p = 0.04), researchers would decide the other
hypothesis (results are statistically significant).
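For readers who want to reproduce the number, here is a sketch using scipy (the summary statistics are the ones from our fictional trial; equal_var=False matches the form of t written out above):

```python
from scipy import stats

# Two-sample t-test from summary statistics: mean hours slept, standard
# deviation, and number of patients per group.
result = stats.ttest_ind_from_stats(
    mean1=5.0, std1=1.7, nobs1=6,   # Fraudulin group
    mean2=4.0, std2=1.6, nobs2=7,   # placebo group
    equal_var=False,                # matches the t written out earlier
)
print(result.statistic)  # ~1.09
print(result.pvalue)     # ~0.30, two tailed
```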
The p-value is the smallest relevant value of α
given your data (i.e., the smallest probability of making
a type I error and deciding there is an effect
when there isn't one). Let's look at a few other cutoffs,
both smaller and larger than 1.09.
Cutoff    α
0.9       0.39
1.0       0.34
1.09      0.30
1.5       0.16
2.0       0.07
Given t = 1.09, you would decide the null hypothesis
for the cutoffs of 1.5 or 2.0 (α, the probability of
deciding the other hypothesis when the null hypothesis
is true, is not relevant). For the cutoffs 0.9, 1.0,
and 1.09, you decide the other hypothesis. As the cut-
offs get smaller, the values of alpha (shaded areas
under the tails in the figure) increase. With a cutoff
of 0.9, you would incorrectly decide the other hypoth-
esis in 39 of 100 cases compared with 30 of 100 cases
with a cutoff of 1.09. Thus, 1.09 is the largest cutoff
for which we would decide the other hypothesis from
our data, and it corresponds to the smallest relevant
value of alpha. The p-value represents the smallest
value of alpha you can get from your data while
deciding the other hypothesis. It is the lower limit of
your probability of making a type I error.
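The table above can be reproduced with the same tail-area calculation as before (again assuming 11 degrees of freedom for our two groups, which the article leaves implicit):

```python
from scipy import stats

df = 6 + 7 - 2  # assumed degrees of freedom for groups of 6 and 7 patients

for cutoff in (0.9, 1.0, 1.09, 1.5, 2.0):
    alpha = 2 * stats.t.sf(cutoff, df)   # area beyond +cutoff and -cutoff
    print(f"cutoff {cutoff:>4}: alpha = {alpha:.2f}")
# Prints approximately 0.39, 0.34, 0.30, 0.16, and 0.07, matching the table.
```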
The most common misconception about the p-
value is that it is the probability of deciding the
wrong hypothesis. It is not. It is the lower limit of the
probability of deciding the wrong hypothesis when
the null hypothesis is true (of deciding there is an
effect when there is not). It says nothing about what
happens when the other hypothesis is true, nor does
it account for the relative probability of each hypothesis.
The p-value seems like a strange thing to report,
does it not? Wouldn't it be better to give the cutoff, or
perhaps the α and β you used to make your decision?
All of the conventions around hypothesis testing in
practice are sufficiently odd that a vocal minority of
statisticians call for them to be abandoned on a reg-
ular basis, and the silent majority shifts uncomfort-
ably in their seats, since they don't really disagree.
The peculiarity of hypothesis testing comes from its
history. Its conventions were cobbled together from
the wreckage of a decades-long dispute over how the
theory should work between Ronald Fisher on one
side, and Jerzy Neyman and Egon Pearson on the
other. Fisher took Gosset's work on the t-test, extend-
ed his ideas in many directions, and used them as
the foundation for his book Statistical Methods for
Research Workers, first published in 1925.3 He pro-
posed the p-value as a useful measure: if a
researcher looked at his data and claimed he saw an
effect, the p-value was the probability of any effect he
saw arising purely by chance. Fisher wasn't particularly
concerned about the extremely rare problem of
researchers looking at their data and saying, "No, I
really don't think there's anything there."
It wasn't until 1933 that Neyman and Pearson4
took the first (rather opaque) steps towards the for-
mal idea of decisions and probabilities of error that
we use here, and the full structure of decision theory
didnt appear until 1939 in the work of Abraham
Wald.5 By that time, Fisher's book had gone through
six more editions, and would go through another
seven before his death in 1962.
To make matters worse, the arbitrary p-value cutoff of
0.05 has become enshrined in the scientific communi-
ty. If your results yield a p-value of 0.049, many sins
of method and technique will be forgiven for publica-
tion, but let it crawl above that magic 0.05, and sud-
denly your paper is universally rejected.6 This may
seem absurd, but it took so much work to achieve con-
sistent reporting of p-values that statisticians hesi-
tate to undertake yet more changes.
Given this, you must exercise good judgment with
hypothesis tests. They are ubiquitous and useful for
many things. Hammers are also useful for many
things, including driving nails and smashing your
thumb. Both hammers and hypothesis tests require
common sense. Fisher encouraged researchers not to
consider p-values in isolation but also to take into
account other relevant evidence (e.g., results of pre-
vious studies).7 He used hypothesis tests in support
of his intellect, not in place of it, and so should you.
References
1. Ross FJ. Statistics for the clinician: Diagnostic questionnaires. J Psychiatr Pract 2011;17:57–60.
2. Student. Probable error of a correlation coefficient. Biometrika 1908;6:302–10.
3. Fisher RA. Statistical methods for research workers. Oliver and Boyd; 1925.
4. Neyman J, Pearson ES. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A 1933;231:289–337.
5. Wald A. Contributions to the theory of statistical estimation and testing hypotheses. The Annals of Mathematical Statistics 1939;10:299–326.
6. Ioannidis JPA. Why most published research findings are false. PLoS Medicine 2005;2:e124.
7. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med 1999;130:995–1004.