Chapter 2

Misuses of Statistical Analysis in Climate Research
by Hans von Storch

Acknowledgments: I thank Bob Livezey for his most helpful critical comments, and Ashwini Kulkarni for responding so positively to my requests to discuss the problem of correlation and trend-tests.

2.1 Prologue
The history of misuses of statistics is as long as the history of statistics itself. The following is a personal assessment of such misuses in our field, climate research. Some people might find my subjective essay on the matter unfair and unbalanced. This might be so, but an effective drug sometimes tastes bitter.

The application of statistical analysis in climate research is methodologically more complicated than in many other sciences, among other things for the following reasons:
• In climate research it is only very rarely possible to perform real independent experiments (see Navarra’s discussion in Chapter 1). There is more or less only one observational record, which is analysed again and again, so that the processes of building hypotheses and testing hypotheses are hardly separable. Only with dynamical models can independent data be created - with the problem that these data describe the real climate system only to some unknown extent.
• Almost all data in climate research are interrelated both in space and time - this spatial and temporal correlation is most useful since it allows the reconstruction of the space-time state of the atmosphere and the ocean from a limited number of observations. However, for statistical inference, i.e., the process of inferring from a limited sample robust statements about a hypothetical underlying “true” structure, this correlation causes difficulties since most standard statistical techniques rest on the basic premise that the data are derived from independent experiments.
Because of these two problems the fundamental question of how much
information about the examined process is really available can often hardly
be answered. Confusion about the amount of information is an excellent
hotbed for methodological insufficiencies and even outright errors. Many
such insufficiencies and errors arise from
• The obsession with statistical recipes, in particular hypothesis testing. Some people, and sometimes even peer reviewers, react like Pavlov’s dogs when they see a hypothesis derived from data: they demand a statistical test of the hypothesis. (See Section 2.2.)
• The use of statistical techniques as cook-book recipes without a real understanding of the concepts and of the limitations arising from unavoidable basic assumptions. Often these basic assumptions are disregarded, with the effect that the conclusion of the statistical analysis is void. A standard example is the disregard of serial correlation. (See Sections 2.3 and 9.4.)
• The misunderstanding of given names. Sometimes physically meaningful
names are attributed to mathematically defined objects. These objects,
for instance the Decorrelation Time, make perfect sense when used as
prescribed. However, often the statistical definition is forgotten and the
physical meaning of the name is taken as a definition of the object - which
is then interpreted in a different and sometimes inadequate manner. (See
Section 2.4.)
• The use of sophisticated techniques. It happens again and again that some people expect miracle-like results from advanced techniques. The results of such advanced techniques, supposedly not understandable to the “layman”, are then believed without further doubt. (See Section 2.5.)
2.2 Mandatory Testing and the Mexican Hat
In the desert at the border of Utah and Arizona there is a famous combination of vertically aligned stones named the “Mexican Hat” which looks like a human with a Mexican hat. It is a random product of nature and not man-made ... really? Can we test the null hypothesis “The Mexican Hat is of natural origin”? To do so we need a test statistic for a pile of stones and a probability distribution for this test statistic under the null hypothesis. Let’s take

$$t(p) = \begin{cases} 1 & \text{if } p \text{ forms a Mexican Hat} \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$
for any pile of stones p. How do we get a probability distribution of t(p) for all piles of stones p not affected by man? We walk through the desert, examine a large number, say n = 10⁶, of piles of stones, and count the frequency of t(p) = 0 and of t(p) = 1. Now, the Mexican Hat is famous for good reasons - there is only one p with t(p) = 1, namely the Mexican Hat itself. The other n − 1 = 10⁶ − 1 samples go with t(p) = 0. Therefore the probability distribution for p not affected by man is

$$\mathrm{prob}\left(t(p) = k\right) = \begin{cases} 10^{-6} & \text{for } k = 1 \\ 1 - 10^{-6} & \text{for } k = 0 \end{cases} \qquad (2.2)$$
After these preparations everything is ready for the final test. We reject the null hypothesis with a risk of 10⁻⁶ if t(Mexican Hat) = 1. This condition is fulfilled and we may conclude: the Mexican Hat is not of natural origin but man-made.

Obviously, this argument is pretty absurd - but where is the logical error? The fundamental error is that the null hypothesis is not independent of the data which are used to conduct the test. We know a priori that the Mexican Hat is a rare event, therefore the rarity of such a combination of stones cannot be used as evidence against its natural origin. The same trick can of course be used to “prove” that any rare event is “non-natural”, be it a heat wave or a particularly violent storm - the probability of observing a rare event is small.
One might argue that no serious scientist would fall into this trap. However, they do. The hypothesis of a connection between solar activity and the statistics of climate on Earth is old and has been debated heatedly over many decades. The debate had faded away in the last few decades - and has been refueled by a remarkable observation by K. Labitzke. She studied the relationship between the solar activity and the stratospheric temperature at the North Pole. There was no obvious relationship - but she saw that during years in which the Quasi-Biennial Oscillation (QBO) was in its West Phase, there was an excellent positive correlation between solar activity and North Pole temperature, whereas during years with the QBO in its East Phase
there was a good negative correlation (Labitzke, 1987; Labitzke and van Loon, 1988).

Figure 2.1: Labitzke and van Loon’s relationship between solar flux and the temperature at 30 hPa at the North Pole for all winters, for winters during which the QBO is in its West Phase, and for winters with the QBO in its East Phase. The correlations are 0.1, 0.8 and -0.5. (From Labitzke and van Loon, 1988).
[Figure panels omitted: 10.7 cm solar flux and 30 hPa temperature (°C) plotted against time, 1956-1990; panel labels “Independent data”, “WEST” and “EAST”.]
Labitzke’s finding was and is spectacular - and obviously right for the data
from the time interval at her disposal (see Figure 2.1). Of course it could be
that the result was a coincidence as unlikely as the formation of a Mexican
Hat. Or it could represent a real on-going signal. Unfortunately, the data
which were used by Labitzke to formulate her hypothesis can no longer be
used for the assessment of whether we deal with a signal or a coincidence.
Therefore an answer to this question requires information unrelated to the data, as for instance dynamical arguments or GCM experiments. However,
physical hypotheses on the nature of the solar-weather link were not available
and are possibly developing right now - so that nothing was left but to wait for
more data and better understanding. (The data which have become available
since Labitzke’s discovery in 1987 support the hypothesis.)
In spite of this fundamental problem an intense debate about the “statistical significance” broke out. The reviewers of the first comprehensive paper on the matter by Labitzke and van Loon (1988) demanded a test. Reluctantly the authors did what they were asked for and found, of course, an extremely small risk for the rejection of the null hypothesis “The solar-weather link is zero”. After the publication various other papers appeared dealing with technical aspects of the test - while the basic problem, namely that the data used to conduct the test had also been used to formulate the null hypothesis, remained.
When hypotheses are to be derived from limited data, I suggest two alternative routes. If the time scale of the considered process is short compared to the available data, then split the full data set into two parts. Derive the hypothesis (for instance a statistical model) from the first half of the data and examine the hypothesis with the remaining part of the data.¹ If the time scale of the considered process is long compared to the time series, so that a split into two parts is impossible, then I recommend using all data to build a model that fits the data optimally. Check whether the fitted model is consistent with all known physical features and state explicitly that it is impossible to make statements about the reliability of the model because of the limited evidence.

¹An example of this approach is offered by Wallace and Gutzler (1981).
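The first route can be sketched in a few lines of code. The following is only an illustration of mine, not part of the original text: it assumes the data are two NumPy arrays and that the “hypothesis” is a simple linear regression between them; the variable names and the regression model are arbitrary choices.

```python
import numpy as np

def split_sample_check(x, y):
    """Derive a regression model from the first half of the record and
    examine how well it predicts the withheld second half."""
    half = len(x) // 2
    # Formulate the hypothesis on the first half only.
    slope, intercept = np.polyfit(x[:half], y[:half], deg=1)
    # Examine the hypothesis on the independent second half.
    residual = y[half:] - (slope * x[half:] + intercept)
    skill = 1.0 - residual.var() / y[half:].var()   # explained-variance skill
    return slope, intercept, skill

# Illustrative synthetic data: a weak linear link plus noise.
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = 0.3 * x + rng.standard_normal(200)
print(split_sample_check(x, y))
```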
2.3 Neglecting Serial Correlation
Most standard statistical techniques are derived under the explicit assumption of statistically independent data. However, almost all climatic data are somehow correlated in time. The resulting problems for testing null hypotheses are discussed in some detail in Section 9.4. In the case of the t-test the problem is nowadays often acknowledged - and as a cure people try to determine the “equivalent sample size” (see Section 2.4). When done properly, the t-test becomes conservative - and when the “equivalent sample size” is “optimized” the test becomes liberal². We discuss this case in detail in Section 2.4.

²A test is named “liberal” if it rejects the null hypothesis more often than specified by the significance level. A “conservative” test rejects less often than specified by the significance level.

Figure 2.2: Rejection rates of the Mann-Kendall test of the null hypothesis “no trend” when applied to 1000 time series of length n generated by an AR(1)-process (2.3) with prescribed α. The adopted nominal risk of the test is 5%. Top: results for unprocessed serially correlated data. Bottom: results after pre-whitening the data with (2.4). (From Kulkarni and von Storch, 1995)
There are, however, again and again cases in which people simply ignore this condition, in particular when dealing with more exotic tests such as the Mann-Kendall test, which is used to reject the null hypothesis of “no trend”. To demonstrate that the result of such a test really depends strongly on the autocorrelation, Kulkarni and von Storch (1995) made a series of Monte Carlo experiments with AR(1)-processes with different values of the parameter α,

$$X_t = \alpha X_{t-1} + N_t \qquad (2.3)$$

with Gaussian “white noise” N_t, which is neither auto-correlated nor correlated with X_{t−k} for k ≥ 1; α is the lag-1 autocorrelation of X_t. 1000 iid³ time series of different lengths, varying from n = 100 to n = 1000, were generated and a Mann-Kendall test was performed. Since the time series have no trend, we expect a (false) rejection rate of 5% if we adopt a risk of 5%, i.e., 50 out of the 1000 tests should return the result “reject null hypothesis”. The actual rejection rate is much higher (see Figure 2.2). For autocorrelations α ≤ 0.10 the actual rejection rate is about the nominal rate of 5%, but for α = 0.3 the rate is already 0.15, and for α = 0.6 the rate exceeds 0.30. If we test a data field with a lag-1 autocorrelation of 0.3, we must expect that on average at 15% of all points a “statistically significant trend” is found even though there is no trend but only “red noise”. This finding is mostly independent of the time series length.

³“iid” stands for “independent identically distributed”.
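A small self-contained sketch of such a Monte Carlo experiment is given below. It is my own illustration, not the code of Kulkarni and von Storch (1995): the Mann-Kendall test is implemented through its standard normal approximation, and the sample length and number of trials are arbitrary choices.

```python
import numpy as np
from math import erf, sqrt

def ar1(n, alpha, rng, burn=100):
    """AR(1) series X_t = alpha*X_{t-1} + N_t with Gaussian white noise, cf. (2.3)."""
    noise = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = alpha * x[t - 1] + noise[t]
    return x[burn:]                      # discard the burn-in so the series is stationary

def mann_kendall_p(x):
    """Two-sided p-value of the Mann-Kendall trend test (normal approximation, no ties)."""
    n = len(x)
    s = sum(np.sign(x[i + 1:] - x[i]).sum() for i in range(n - 1))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / sqrt(var_s) if s != 0 else 0.0   # continuity-corrected z statistic
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

rng = np.random.default_rng(0)
n, ntrials, level = 200, 1000, 0.05
for alpha in (0.0, 0.1, 0.3, 0.6):
    rejections = sum(mann_kendall_p(ar1(n, alpha, rng)) < level for _ in range(ntrials))
    print(f"alpha = {alpha:.1f}: rejection rate = {rejections / ntrials:.2f}")
```

For α = 0 the printed rejection rate should be close to the nominal 5%, and it should grow with α in the manner of the top panel of Figure 2.2.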
When we have physical reasons to assume that the considered time series is a sum of a trend and stochastic fluctuations generated by an AR(1) process - and this assumption is sometimes reasonable - then there is a simple cure, the success of which is demonstrated in the lower panel of Figure 2.2. Before conducting the Mann-Kendall test, the time series is “pre-whitened” by first estimating the lag-1 autocorrelation α̂, and by replacing the original time series X_t by the series

$$Y_t = X_t - \hat{\alpha} X_{t-1} \qquad (2.4)$$

The “pre-whitened” time series is considerably less plagued by serial correlation, and the same Monte Carlo test as above returns actual rejection rates close to the nominal one, at least for moderate autocorrelations and not too short time series. The filter operation (2.4) also affects any trend; however, other Monte Carlo experiments have revealed that the power of the test is only weakly reduced as long as α is not too large.
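A minimal sketch of the pre-whitening step (2.4), again only an illustration: the lag-1 autocorrelation is estimated with the usual sample estimator, and the resulting series can be fed into the Monte Carlo experiment sketched above.

```python
import numpy as np

def prewhiten(x):
    """Return Y_t = X_t - alpha_hat * X_{t-1}, cf. (2.4), together with alpha_hat."""
    x = np.asarray(x, dtype=float)
    anom = x - x.mean()
    alpha_hat = np.sum(anom[1:] * anom[:-1]) / np.sum(anom * anom)   # sample lag-1 autocorrelation
    return x[1:] - alpha_hat * x[:-1], alpha_hat

# Applying the Mann-Kendall sketch above to prewhiten(series)[0] instead of the raw
# series should bring the rejection rate back towards the nominal level (Figure 2.2, bottom).
```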
A word of caution is, however, required: if the process is not AR(1) but of higher order or of a different model type, then the pre-whitening (2.4) is insufficient and the Mann-Kendall test still rejects more null hypotheses than specified by the significance level.
Another possible cure is to “prune” the data, i.e., to form a subset of
observations which are temporally well separated so that any two consecutive
samples in the reduced data set are no longer autocorrelated (see Section
9.4.3).
When you use a technique which assumes independent data and you believe that serial correlation might be prevalent in your data, I suggest the following “Monte Carlo” diagnostic: generate synthetic time series with a prescribed serial correlation, for instance by means of an AR(1)-process (2.3). Create time series without correlation (α = 0) and with correlation (0 < α < 1) and check whether the analysis, which is made with the real data, returns different results for the cases with and without serial correlation. If it does, you cannot use the chosen technique.
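This diagnostic can be wrapped into a small helper function; the sketch below is a hypothetical illustration in which `analysis` stands for whatever procedure is applied to the real data (for instance the Mann-Kendall p-value from the earlier sketch). The function names and default settings are mine.

```python
import numpy as np
from scipy.signal import lfilter

def synthetic_ar1(n, alpha, rng):
    """AR(1) series (2.3), generated by filtering Gaussian white noise; burn-in discarded."""
    return lfilter([1.0], [1.0, -alpha], rng.standard_normal(n + 100))[100:]

def serial_correlation_sensitivity(analysis, n=200, alphas=(0.0, 0.3, 0.6), ntrials=500):
    """Apply `analysis` to synthetic series with and without serial correlation
    and summarise the mean outcome for each prescribed lag-1 autocorrelation."""
    rng = np.random.default_rng(0)
    return {a: float(np.mean([analysis(synthetic_ar1(n, a, rng)) for _ in range(ntrials)]))
            for a in alphas}

# If the summaries for alpha = 0 and alpha > 0 differ markedly, the technique behind
# `analysis` must not be applied to the serially correlated data as it stands.
```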
2.4 Misleading Names: The Case of the Decorrelation Time
The concept of “the” Decorrelation Time is based on the following reasoning⁴: The variance of the mean X̄_n = (1/n)·Σ_{k=1}^{n} X_k of n identically distributed and independent random variables X_k = X is

$$\mathrm{Var}\left(\bar{X}_n\right) = \frac{1}{n}\,\mathrm{Var}(X) \qquad (2.5)$$

If the X_k are autocorrelated then (2.5) is no longer valid, but we may define a number, named the equivalent sample size n′, such that

$$\mathrm{Var}\left(\bar{X}_n\right) = \frac{1}{n'}\,\mathrm{Var}(X) \qquad (2.6)$$

The decorrelation time is then defined as

$$\tau_D = \lim_{n\to\infty} \frac{n}{n'}\cdot\Delta t = \left[1 + 2\sum_{\Delta=1}^{\infty} \rho(\Delta)\right] \Delta t \qquad (2.7)$$

with the autocorrelation function ρ of X_t. The decorrelation time for an AR(1) process (2.3) is

$$\tau_D = \frac{1+\alpha}{1-\alpha}\,\Delta t \qquad (2.8)$$

⁴This section is entirely based on the paper by Zwiers and von Storch (1995). See also Section 9.4.3.
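As a quick numerical illustration of (2.6)-(2.8) - a sketch added here, not part of the original argument - one can estimate Var(X̄_n) for simulated AR(1) series and compare the implied equivalent sample size with n(1−α)/(1+α) = n·Δt/τ_D:

```python
import numpy as np
from scipy.signal import lfilter

def ar1(n, alpha, rng):
    """AR(1) series (2.3): Gaussian white noise filtered by X_t = alpha*X_{t-1} + N_t."""
    return lfilter([1.0], [1.0, -alpha], rng.standard_normal(n + 100))[100:]

rng = np.random.default_rng(0)
n, alpha, ntrials = 200, 0.6, 5000

var_x = 1.0 / (1.0 - alpha**2)                                 # stationary variance of the AR(1) process
var_mean = np.var([ar1(n, alpha, rng).mean() for _ in range(ntrials)])

n_equiv = var_x / var_mean                                     # n' as defined by (2.6)
tau_d = (1.0 + alpha) / (1.0 - alpha)                          # decorrelation time (2.8), in time steps
print(f"n = {n}: estimated n' = {n_equiv:.0f}, n/tau_D = {n / tau_d:.0f}")   # both should be near 50
```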
There are several conceptual problems with “the” Decorrelation Time:
• The definition (2.7) of a decorrelation time makes sense when dealing with the problem of the mean of n consecutive serially correlated observations. However, its arbitrariness in defining a characteristic time scale becomes obvious when we reformulate our problem by replacing the mean in (2.6) by, for instance, the variance. Then the characteristic time scale is (Trenberth, 1984):

$$\tau = \left[1 + 2\sum_{k=1}^{\infty} \rho^2(k)\right] \Delta t$$

Thus the characteristic time scale τ depends markedly on the statistical problem under consideration. These numbers are, in general, not physically defined numbers.
• For an AR(1)-process we have to distinguish between the physically meaningful processes with positive memory (α > 0) and the physically meaningless processes with negative memory (α < 0). If α > 0 then formula (2.8) gives a time τ_D > Δt representative of the decay of the auto-correlation function. Thus, in this case, τ_D may be seen as a physically useful time scale, namely a “persistence time scale” (but see the dependency on the time step discussed below). If α < 0 then (2.8) returns times τ_D < Δt, even though probability statements for any two states with an even time lag are identical to those of an AR(1) process with AR-coefficient |α|.

Thus the number τ_D makes sense as a characteristic time scale when dealing with red noise processes. But for many higher-order AR(p)-processes the number τ_D does not reflect useful physical information.
• The Decorrelation Time depends on the time increment Δt: To demonstrate this dependency we consider again the AR(1)-process (2.3) with a time increment of Δt = 1 and α ≥ 0. Then we may construct other AR(1) processes with time increments k by noting that

$$X_t = \alpha^k X_{t-k} + N'_t \qquad (2.9)$$

with some noise term N′_t which is a function of N_t ... N_{t−k+1}. Because of α < 1 the decorrelation times τ_D of the two processes (2.3, 2.9) are

$$\tau_{D,1} = \frac{1+\alpha}{1-\alpha}\cdot 1 \;\geq\; 1 \qquad\text{and}\qquad \tau_{D,k} = \frac{1+\alpha^k}{1-\alpha^k}\cdot k \;\geq\; k \qquad (2.10)$$

so that

$$\lim_{k\to\infty} \frac{\tau_{D,k}}{k} = 1 \qquad (2.11)$$
Figure 2.3: The dependency of the decorrelation time τD,k (2.10) on the time
increment k (horizontal axis) and on the coefficient α (0.95, 0.90, 0.80, 0.70
and 0.50; see labels). (From von Storch and Zwiers, 1999).
That means that the decorrelation time is at least as long as the time increment; in case of “white noise”, with α = 0, the decorrelation time is always equal to the time increment. In Figure 2.3 the dimensional decorrelation times are plotted for different α-values and different time increments k. The longer the time increment, the larger the decorrelation time. For sufficiently large time increments we have τ_{D,k} = k. For small α-values, such as α = 0.5, we have virtually τ_{D,k} = k already after k = 5. If α = 0.8 then τ_{D,1} = 9, τ_{D,11} = 13.1 and τ_{D,21} = 21.4 (checked numerically in the short sketch below). If the time increment is 1 day, then the decorrelation time of an α = 0.8-process is 9 days or 21 days - depending on whether we sample the process once a day or once every 21 days.

We conclude that the absolute value of the decorrelation time is of questionable informational value. However, the relative values obtained from several time series sampled with the same time increment are useful to infer whether the system has a longer memory in some components than in others. If the decorrelation time is well above the time increment, as in the case of the α = 0.95 curve in Figure 2.3, then the number has some informational value, whereas decorrelation times close to the time increment, as in the case of the α = 0.5 curve, are mostly useless.
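The short numerical check of (2.10) mentioned above (an added sketch, using the α and the increments quoted in the text):

```python
# Decorrelation time tau_{D,k} = (1 + alpha**k) / (1 - alpha**k) * k, cf. (2.10)
alpha = 0.8
for k in (1, 11, 21):
    tau = (1.0 + alpha**k) / (1.0 - alpha**k) * k
    print(f"k = {k:2d}: tau_D = {tau:.1f} time steps")   # 9.0, 13.1 and 21.4
```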
We have seen that the name “Decorrelation Time” is not based on physical reasoning but on strictly mathematical grounds. Nevertheless the number is often incorrectly interpreted as the minimum time lag such that two consecutive observations X_t and X_{t+τ_D} are independent. If used as a vague estimate with the reservations mentioned above, such a use is in order. However, the number is often introduced as a crucial parameter in test routines. Probably the most frequent victim of this misuse is the conventional t-test.
We illustrate this case by a simple example from Zwiers and von Storch (1995): we want to answer the question whether the long-term mean winter temperatures in Hamburg and Victoria are equal. To answer this question, we have at our disposal daily observations for one winter from both locations. We treat the winter temperatures at both locations as random variables, say T_H and T_V. The “long-term mean” winter temperatures at the two locations, denoted as μ_H and μ_V respectively, are parameters of the probability distributions of these random variables. In the statistical nomenclature the question we pose is: do the samples of temperature observations contain sufficient evidence to reject the null hypothesis H₀: μ_H − μ_V = 0?

The standard approach to this problem is to use the Student’s t-test. The test is conducted by imposing a statistical model upon the processes which resulted in the temperature samples and then, within the confines of this model, measuring the degree to which the data agree with H₀. An essential part of the model which is implicit in the t-test is the assumption that the data which enter the test represent a set of statistically independent observations. In our case, and in many other applications in climate research, this assumption is not satisfied. The Student’s t-test usually becomes “liberal” in these circumstances: it tends to reject the null hypothesis on weaker evidence than is implied by the significance level⁵ which is specified for the test. One manifestation of this problem is that the Student’s t-test will reject the null hypothesis more frequently than expected when the null hypothesis is true.

⁵The significance level indicates the probability with which the null hypothesis will be rejected when it is true.
A relatively clean and simple solution to this problem is to form subsamples of approximately independent observations from the observations. In the case of daily temperature data, one might use physical insight to argue that observations which are, say, 5 days apart are effectively independent of each other. If the number of samples, the sample means and the standard deviations from these reduced data sets are denoted by n*, T̃*_H, T̃*_V, σ̃*_H and σ̃*_V respectively, then the test statistic

$$t = \frac{\tilde{T}^*_H - \tilde{T}^*_V}{\sqrt{\left(\tilde{\sigma}^{*2}_H + \tilde{\sigma}^{*2}_V\right)/n^*}} \qquad (2.12)$$

has a Student’s t-distribution with n* degrees of freedom provided that the null hypothesis is true⁶, and a test can be conducted at the chosen significance level by comparing the value of (2.12) with the percentiles of the t(n*)-distribution.

⁶Strictly speaking, this is true only if the standard deviations of T_H and T_V are equal.
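A sketch of this subsampling test, purely for illustration: it assumes the two winters of daily temperatures are NumPy arrays, takes every 5th day as approximately independent (the spacing, the synthetic data and the variable names are mine), evaluates (2.12) and compares it with the percentiles of the t(n*)-distribution via scipy.

```python
import numpy as np
from scipy import stats
from scipy.signal import lfilter

def subsample_t_test(temp_h, temp_v, spacing=5, level=0.05):
    """Test H0: mu_H - mu_V = 0 on subsampled, approximately independent data, cf. (2.12)."""
    th, tv = np.asarray(temp_h)[::spacing], np.asarray(temp_v)[::spacing]
    n_star = min(len(th), len(tv))
    th, tv = th[:n_star], tv[:n_star]
    t_value = (th.mean() - tv.mean()) / np.sqrt((th.var(ddof=1) + tv.var(ddof=1)) / n_star)
    t_crit = stats.t.ppf(1.0 - level / 2.0, df=n_star)    # two-sided test, t(n*) percentiles as in the text
    return t_value, t_crit, abs(t_value) > t_crit

# Illustrative synthetic "winters": 90 serially correlated daily values around different means.
rng = np.random.default_rng(2)
hamburg = 2.0 + lfilter([1.0], [1.0, -0.7], rng.standard_normal(90))
victoria = 4.0 + lfilter([1.0], [1.0, -0.7], rng.standard_normal(90))
print(subsample_t_test(hamburg, victoria))
```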
The advantage of (2.12) is that this test operates as specified by the user provided that the interval between successive observations is long enough. The disadvantage is that a reduced amount of data is utilized in the analysis. Therefore, the following concept was developed in the 1970s to overcome this disadvantage: The numerator in (2.12) is a random variable because it differs from one pair of temperature samples to the next. When the observations which comprise the samples are serially uncorrelated, the denominator in (2.12) is an estimate of the standard deviation of the numerator, and the ratio can be thought of as an expression of the difference of means in units of estimated standard deviations. For serially correlated data, with sample means T̃ and sample standard deviations σ̃ derived from all available observations, the standard deviation of T̃_H − T̃_V is √((σ̃²_H + σ̃²_V)/n′) with the equivalent sample size n′ as defined in (2.6). For sufficiently large sample sizes the ratio

$$t = \frac{\tilde{T}_H - \tilde{T}_V}{\sqrt{\left(\tilde{\sigma}^2_H + \tilde{\sigma}^2_V\right)/n'}} \qquad (2.13)$$

has a standard Gaussian distribution with zero mean and standard deviation one. Thus one can conduct a test by comparing (2.13) to the percentiles of the standard Gaussian distribution.
So far everything is fine.
Since t(n′) is approximately equal to the Gaussian distribution for n′ ≥ 30, one may compare the test statistic (2.13) also with the percentiles of the t(n′)-distribution. The incorrect step is the heuristic assumption that this prescription - “compare with the percentiles of the t(n′), or t(n′ − 1), distribution” - would also be right for small (n′ < 30) equivalent sample sizes. The rationale for doing so is the tacit assumption that the statistic (2.13) would be t(n′)- or t(n′ − 1)-distributed under the null hypothesis. However, this assumption is simply wrong: the statistic (2.13) is not t(k)-distributed for any k, be it the equivalent sample size n′ or any other number. This result has been published by several authors (Katz, 1982; Thiébaux and Zwiers, 1984; Zwiers and von Storch, 1995) but has stubbornly been ignored by most of the atmospheric sciences community.
A justification for the small sample size test would be that its behaviour under the null hypothesis is well approximated by the t-test with the equivalent sample size representing the degrees of freedom. But this is not so, as is demonstrated by the following example with an AR(1)-process (2.3) with α = 0.60. The exact equivalent sample size n′ = n/4 is known for this process since its parameters are completely known. One hundred independent samples of variable length n were randomly generated. Each sample was used to test the null hypothesis H₀: E(X_t) = 0 with the t-statistic (2.13) at the 5% significance level. If the test operates correctly the null hypothesis should be (incorrectly) rejected 5% of the time. The actual rejection rate (Figure 2.4) is notably smaller than the expected rate of 5% for sample sizes n = 4n′ ≤ 30. Thus, the t-test operating with the true equivalent sample size is conservative and thus wrong.

Figure 2.4: The rate of erroneous rejections of the null hypothesis of equal means for the case of autocorrelated data in a Monte Carlo experiment. The “equivalent sample size” n′ (in the diagram labeled n_e) is either the correct number, derived from the true parameters of the considered AR(1)-process, or estimated with the best technique identified by Zwiers and von Storch (1995). (From von Storch and Zwiers, 1995). [Panels omitted; horizontal axis: sample size n.]
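A rough reimplementation of this Monte Carlo check is sketched below (mine, not the original code, and with more than the original 100 samples so that the rates are smoother): AR(1) series with α = 0.6 are tested with a one-sample version of (2.13), using the true equivalent sample size n′ = n/4 and the percentiles of the t(n′)-distribution. According to Figure 2.4 the rejection rate for small n should fall below the nominal 5%.

```python
import numpy as np
from scipy import stats
from scipy.signal import lfilter

def ar1(n, alpha, rng):
    """AR(1) series (2.3), burn-in discarded."""
    return lfilter([1.0], [1.0, -alpha], rng.standard_normal(n + 100))[100:]

rng = np.random.default_rng(3)
alpha, ntrials, level = 0.6, 2000, 0.05

for n in (20, 40, 80, 160):
    n_eq = n * (1.0 - alpha) / (1.0 + alpha)               # true equivalent sample size n' = n/4
    t_crit = stats.t.ppf(1.0 - level / 2.0, df=n_eq)       # percentiles of t(n'), as in the criticized recipe
    rejections = 0
    for _ in range(ntrials):
        x = ar1(n, alpha, rng)
        t = x.mean() / np.sqrt(x.var(ddof=1) / n_eq)       # one-sample analogue of (2.13)
        rejections += abs(t) > t_crit
    print(f"n = {n:3d} (n' = {n_eq:.0f}): rejection rate = {rejections / ntrials:.3f}")
```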
More problems show up when the equivalent sample size is unknown. In this case it may be possible to specify n′ on the basis of physical reasoning. Assuming that conservative practices are used, this should result in underestimated values of n′ and consequently even more conservative tests. In most applications, however, an attempt is made to estimate n′ from the same data that are used to compute the sample mean and variance. Monte Carlo experiments show that the actual rejection rate of the t-test tends to be greater than the nominal rate when n′ is estimated. This case has also been simulated in a series of Monte Carlo experiments with the same AR(1)-process. The resulting rate of erroneous rejections is shown in Figure 2.4 - for small sample sizes the actual significance level can be several times greater than the nominal significance level. Thus, the t-test operating with the estimated equivalent sample size is liberal and thus wrong.
Zwiers and von Storch (1995) offer a “table look-up” test as a useful alternative to the inadequate “t-test with equivalent sample size” for situations
with serial correlations similar to red noise processes.
2.5 Use of Advanced Techniques
The following case is an educational example which demonstrates how easily an otherwise careful analysis can be damaged by an inconsistency hidden in a seemingly unimportant detail of an advanced technique. When people have had experience with an advanced technique for a while, such errors are often found mainly by instinct (“This result cannot be true - I must have made an error.”) - but when the technique is new the researcher is somewhat defenseless against such errors.
The background of the present case was the search for evidence of bifurcations and other fingerprints of truly nonlinear behaviour of the dynamical system “atmosphere”. Even though the nonlinearity of the dynamics of the planetary-scale atmospheric circulation was accepted as an obvious fact by the meteorological community, atmospheric scientists only began to discuss the possibility of two or more stable states in the late 1970s. If such multiple stable states exist, it should be possible to find bi- or multi-modal distributions in the observed data (if these states are well separated).
Hansen and Sutera (1986) identified a bi-modal distribution in a variable characterizing the energy of the planetary-scale waves in the Northern Hemisphere winter. Daily amplitudes for the zonal wavenumbers k = 2 to 4 for 500 hPa height were averaged for midlatitudes. A “wave-amplitude indicator” Z was finally obtained by subtracting the annual cycle and by filtering out all variability on time scales shorter than 5 days. The probability density function f_Z was estimated by applying a technique called the maximum penalty technique to 16 winters of daily data. The resulting f_Z had two maxima separated by a minor minimum. This bimodality was taken as proof of the existence of two stable states of the atmospheric general circulation: a “zonal regime”, with Z < 0, exhibiting small amplitudes of the planetary waves, and a “wavy regime”, with Z > 0, with amplified planetary-scale zonal disturbances.
Hansen and Sutera performed a “Monte Carlo” experiment to evaluate the likelihood of fitting a bimodal distribution to the data with the maximum penalty technique even if the generating distribution is unimodal. The authors concluded that this likelihood is small. On the basis of this statistical check, the bimodality was taken for granted by many scientists for almost a decade.
When I read the paper, I had never heard about the “maximum penalty
method” but had no doubts that everything would have been done properly
in the analysis. The importance of the question prompted other scientists to
perform the same analysis to further refine and verify the results. Nitsche et
al. (1994) reanalysed step-by-step the same data set which had been used in
the original analysis and came to the conclusion that the purportedly small
probability for a misfit was large. The error in the original analysis was not
at all obvious. Only by carefully scrutinizing the pitfalls of the maximum
penalty technique did Nitsche and coworkers find the inconsistency between
the Monte Carlo experiments and the analysis of the observational data.
Nitsche et al. reproduced the original estimation, but showed that something like 150 years of daily data would be required to exclude, with sufficient certainty, the possibility that the underlying distribution is unimodal. What this boils down to is that the null hypothesis, according to which the distribution is unimodal, is not rejected by the available data - and the published test was wrong. However, since the failure to reject the null hypothesis does not imply the acceptance of the null hypothesis (but merely the lack of enough evidence to reject it), the present situation is that the (alternative) hypothesis “The sample distribution does not originate from a unimodal distribution” is not falsified but still open for discussion.
I have learned the following rule to be useful when dealing with advanced methods: such methods are often needed to find a signal in a vast noisy phase space, i.e., the needle in the haystack - but once we have the needle in our hand, we should be able to identify it as a needle by simply looking at it.⁷ Whenever you are unable to do so there is a good chance that something is rotten in the analysis.

⁷See again Wallace and Gutzler’s study; they identified their teleconnection patterns first by examining correlation maps, and then by simple weighted means of a few grid point values - see Section 12.1.
2.6 Epilogue
I have chosen the examples of this chapter to advise users of statistical concepts to be aware of the sometimes hidden assumptions of these concepts. Statistical Analysis is not a Wunderwaffe⁸ to extract a wealth of information from a limited sample of observations. More results require more assumptions, i.e., information given by theories and other insights unrelated to the data under consideration.

But even if it is not a Wunderwaffe, Statistical Analysis is an indispensable tool in the evaluation of limited empirical evidence. The results of Statistical Analysis are not miracle-like enlightenment but sound and understandable assessments of the consistency of concepts and data.

⁸Magic bullet.