Editorial
Why the P-value culture is bad and confidence intervals a better alternative
Summary
In spite of frequent discussions of the misuse and misunderstanding of probability values (P-values), they still appear in most scientific publications, and the disadvantages of erroneous and simplistic P-value interpretations grow with the number of scientific publications. Osteoarthritis and Cartilage prefers confidence intervals. This is a brief discussion of the problems surrounding P-values and confidence intervals.
© 2012 Osteoarthritis Research Society International. Published by Elsevier Ltd. All rights reserved.
Statistical precision
Statistical precision has two determinants, the number of observations in the sample and the variability of the observations. These determinants specify the standard error (SE) of an estimate such as the mean:

$$SE = \frac{SD}{\sqrt{n}}$$

or of a difference between two independent means:

$$SE = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}$$
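As a brief illustration of these formulas, the following minimal sketch (in Python, using numpy; the function names and data are only illustrative and not part of this editorial) computes both versions of the SE:

```python
import numpy as np

def se_mean(x):
    """Standard error of a sample mean: SD / sqrt(n)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / np.sqrt(len(x))

def se_mean_difference(x1, x2):
    """SE of the difference between two independent means:
    sqrt(SD1^2/n1 + SD2^2/n2)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))

# Purely illustrative data
print(se_mean([5.1, 4.8, 6.0, 5.5]))
print(se_mean_difference([5.1, 4.8, 6.0, 5.5], [4.2, 4.9, 5.0, 4.4, 4.6]))
```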
Both the P-value and the confidence interval are based on the SE. When the studied difference, d, has a Gaussian distribution, it is statistically significant at the 5% level when

$$|d/SE| > t_{0.05}$$

Here $t_{0.05}$ is the value in Student's t-distribution (introduced in 1908 by William Gosset under the pseudonym Student) that discriminates between the 95% of |d/SE| values that are lower and the 5% that are higher. Conversely, the confidence interval

$$d \pm t_{0.05} \times SE$$

describes a range of plausible values in which the real effect is 95% likely to be included.
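A minimal sketch of this duality, assuming a difference d with standard error SE and a given number of degrees of freedom (the numbers below are hypothetical):

```python
from scipy import stats

d, se, df = 4.2, 1.8, 28          # hypothetical estimate, its SE and degrees of freedom

t_crit = stats.t.ppf(0.975, df)   # t_0.05 for a two-sided 5% significance level
significant = abs(d / se) > t_crit            # |d/SE| > t_0.05
ci = (d - t_crit * se, d + t_crit * se)       # d +/- t_0.05 * SE

print(f"significant at 5%: {significant}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The two conclusions agree by construction: the interval excludes zero exactly when |d/SE| exceeds $t_{0.05}$.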
P-values
A P-value is the outcome of a hypothesis test of the null hypothesis, H0: d = 0. A low P-value indicates that the observed data do not match the null hypothesis, and when the P-value is lower than the specified significance level (usually 5%) the null hypothesis is rejected and the finding is considered statistically significant. The P-value has many weaknesses that need to be recognized in a successful analysis strategy.
First, the tested hypothesis should be defined before inspecting the data. The P-value is not easily interpretable when the tested hypothesis is defined after data dredging, i.e., when a statistically significant outcome has already been observed. If undisclosed to the reader of a scientific report, such post-hoc testing is considered scientific misconduct5.
Second, when multiple independent hypotheses are tested, which usually is the case in a study or experiment, the risk that at least one of these tests will yield a false positive result increases, above the nominal significance level, with the number of hypotheses tested. This multiplicity effect reduces the value of a statistically significant finding. Methods to adjust the overall significance level (like the Bonferroni adjustment) exist, but the cost of such adjustments is high. Either the number of observations has to be increased to compensate for the adjustment, or the significance level is maintained at the expense of the statistical power to detect an existing effect or difference.
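The growth of the overall false positive risk is easy to quantify under the simplifying assumption of independent tests, as in this sketch:

```python
# Family-wise error rate for k independent tests, each at nominal level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^k
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: risk of at least one false positive = {fwer:.2f}, "
          f"Bonferroni-adjusted level per test = {alpha / k:.4f}")
```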
Third, a statistically insignificant difference between two observed groups (the sample) does not indicate that this effect does not exist in the population from which the sample is taken, because the P-value is confounded by the number of observations; it is based on the SE, which has $\sqrt{n}$ in the denominator. A statistically insignificant outcome indicates nothing more than that the observed sample is too small to detect a population effect. A statistically insignificant outcome should be interpreted as absence of evidence, not evidence of absence6.
Fourth, for the same reason, a statistically significant effect in a large sample can represent a real, but minute and clinically insignificant, effect. For example, with a sufficiently large sample size, even a painkiller reducing pain by as little as an average of 1 mm VAS on a 100 mm scale will eventually demonstrate a highly statistically significant pain reduction. Any consideration of what constitutes the lowest clinically significant effect on pain would be independent of sample size; it would perhaps depend on cost, and possibly be related to the risk of side effects and the availability of alternative therapies.
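A simulated illustration of this phenomenon, under assumed values invented for this sketch (a true mean pain reduction of 1 mm VAS against control and a standard deviation of 20 mm):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (100, 10_000, 100_000):                          # patients per group
    treated = rng.normal(loc=1.0, scale=20.0, size=n)     # assumed 1 mm mean reduction
    control = rng.normal(loc=0.0, scale=20.0, size=n)
    p = stats.ttest_ind(treated, control).pvalue
    print(f"n = {n:>7} per group: P = {p:.2g}")
```

The trivially small effect becomes "highly significant" simply because the sample grows, which is why statistical significance alone cannot establish clinical relevance.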
Fifth, a P-value provides only uncertainty information vis-à-vis a specific null hypothesis, and no information on the statistical precision of an estimate. This means that comparisons with a lowest clinically significant effect (which may not be definable in laboratory experiments) cannot be based on P-values from conventional hypothesis tests. For example, a statistically significant relative risk of 2.1 observed in a sample can correspond to a relative risk of 1.1, as well as to one of 10.0, in the population. The statistical significance comes from the comparison with the null hypothesis relative risk of 1.0. That one risk factor in the sample has a lower P-value than another says nothing about their relative effect.
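To see how a point estimate relates to the range of population values compatible with the data, a confidence interval for a relative risk can be computed on the log scale (the standard large-sample approach; the counts below are hypothetical, not taken from any study):

```python
import numpy as np
from scipy import stats

def rr_confint(events1, n1, events2, n2, level=0.95):
    """Large-sample confidence interval for a relative risk, computed on the log scale."""
    rr = (events1 / n1) / (events2 / n2)
    se_log = np.sqrt(1 / events1 - 1 / n1 + 1 / events2 - 1 / n2)
    z = stats.norm.ppf(0.5 + level / 2)
    lo, hi = np.exp(np.log(rr) - z * se_log), np.exp(np.log(rr) + z * se_log)
    return rr, lo, hi

# Hypothetical counts from a small study: RR = 2.1, statistically significant,
# yet compatible with population values from just above 1 to more than 4.
print(rr_confint(events1=14, n1=40, events2=10, n2=60))
```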
Sixth, when the tested null hypothesis is meaningless, the P-value will not be meaningful. For example, inter-observer reliability is often presented with a P-value, but the null hypothesis in this hypothesis test is that no inter-observer reliability exists. However, why should two observers observing the same object
Fig. 1. Statistically and clinically significant effects, measured in arbitrary units on an absolute scale, as evaluated by P-values and confidence intervals.
confidence intervals are often misunderstood as representing the variability of observations instead of the uncertainty of the sample estimate. Some further common misunderstandings should be mentioned.
A consequence of the dominant P-value culture is that confidence intervals are often not appreciated in themselves; instead, the information they convey is transformed into simplistic terms of statistical significance. For example, it is common to check whether the confidence intervals of two mean values overlap. When this happens, the difference between the mean values is often considered statistically insignificant. However, Student's t-test uses a different definition of the mean difference's standard error (SE) than what is used in the calculation of the overlapping confidence intervals. Two means may well be statistically significantly different and still have somewhat overlapping confidence intervals. Overlapping confidence intervals can therefore not be directly interpreted in terms of statistical significance7.
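A sketch with hypothetical summary data (two groups of equal size and common SD) shows the difference between the two SE definitions: the individual 95% intervals overlap, yet the two-sample t-test is significant at the 5% level:

```python
import numpy as np
from scipy import stats

n, sd = 40, 4.0                       # hypothetical group size and common SD
mean_a, mean_b = 10.0, 12.2           # hypothetical group means

se = sd / np.sqrt(n)                  # SE of each separate mean
t_crit = stats.t.ppf(0.975, n - 1)
print("CI A:", (mean_a - t_crit * se, mean_a + t_crit * se))
print("CI B:", (mean_b - t_crit * se, mean_b + t_crit * se))

# The t-test instead uses the SE of the difference, sqrt(2) * SE, not 2 * SE.
se_diff = sd * np.sqrt(2 / n)
t_stat = (mean_b - mean_a) / se_diff
p = 2 * stats.t.sf(abs(t_stat), 2 * n - 2)
print("P =", round(p, 4))             # about 0.016 despite the overlapping intervals
```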
SEs are also often used to indicate uncertainty, as error bars in graphical presentations. Using confidence intervals is, however, a better alternative, because the uncertainty represented by an SE is confounded by the number of observations8. For example, one SE corresponds to a 58% confidence interval when n = 3 and to a 65% confidence interval when n = 9.
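These coverage figures can be reproduced from the t-distribution, since an interval of mean ± 1 SE covers the population mean with probability P(|t| ≤ 1) on n − 1 degrees of freedom, as in this sketch:

```python
from scipy import stats

# Coverage probability of "mean +/- 1 SE" error bars: P(|t| <= 1) with n - 1 df.
for n in (3, 9):
    coverage = 1 - 2 * stats.t.sf(1.0, df=n - 1)
    print(f"n = {n}: one SE corresponds to about a {coverage:.0%} confidence interval")
```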
When multiple groups are compared pairwise with one and the same reference or control group in terms of relative risks or odds ratios, comparisons of confidence intervals are only valid vis-à-vis the reference group. However, confidence intervals encourage comparing effect sizes, and invalid comparisons are often made between the other groups. Assume, for example, that the knee replacement revision risks of a low-exposure (A) and a high-exposure (B) group of smokers are compared with that of a group of non-smokers (C). The three-group comparison leads to two relative risks, A/C and B/C, both having confidence intervals. These cannot be directly compared with each other, because they both depend on C. An alternative analysis method, floating absolute risks (FAR), has been developed as a solution to this problem9.
Fig. 2. The use of confidence intervals in superiority, non-inferiority and equivalence trials, measured in arbitrary units on an absolute scale.
References
1. Rigby AS. Getting past the statistical referee: moving away from P-values and towards interval estimation. Health Educ Res 1999;14:713-5.
2. Nester MR. An applied statistician's creed. Appl Statist 1996;45:401-10.
3. Fidler F, Thomason N, Cumming G, Finch S, Leeman J. Editors can lead researchers to confidence intervals, but can't make them think. Psychol Sci 2004;15:119-26.
4. Ranstam J. Sampling uncertainty in medical research. Osteoarthritis Cartilage 2009;17:1416-9.