Background and Threshold: Critical Comparison of Methods of Determination
www.elsevier.com/locate/scitotenv
a Geological Survey of Norway, N-7491 Trondheim, Norway
b Institute of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria
c Geological Survey of Canada, Natural Resources Canada, 601 Booth Street, Ottawa, Ontario, Canada K1A 0E8
Received 7 June 2004; received in revised form 1 November 2004; accepted 12 November 2004
Available online 4 February 2005
Abstract
Different procedures to identify data outliers in geochemical data are reviewed and tested. The calculation of
[mean ± 2 standard deviation (sdev)] to estimate threshold values dividing background data from anomalies, still used
almost 50 years after its introduction, delivers arbitrary estimates. The boxplot, [median ± 2 median absolute deviation
(MAD)] and empirical cumulative distribution functions are better suited for assisting in the estimation of threshold
values and the range of background data. However, all of these can lead to different estimates of threshold.
Graphical inspection of the empirical data distribution using a variety of different tools from exploratory data
analysis is thus essential prior to estimating threshold values or defining background. There is no good reason to
continue to use the [mean ± 2 sdev] rule, originally proposed as a 'filter' to identify approximately 2½% of the data
at each extreme for further inspection at a time when computers to do the drudgery of numerical operations were
not widely available and no other practical methods existed. Graphical inspection using statistical and geographical
displays to isolate sets of background data is far better suited for estimating the range of background variation and
thresholds, action levels (e.g., maximum admissible concentrations, MAC values) or clean-up goals in environmental
legislation.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Background; Threshold; Mean; Median; Boxplot; Normal distribution; Cumulative probability plot; Outliers
1. Introduction
* Corresponding author. Tel.: +47 73 904 307; fax: +47 73 921 620.
E-mail addresses: [email protected] (C. Reimann), [email protected] (P. Filzmoser), [email protected] (R.G. Garrett).
0048-9697/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.scitotenv.2004.11.023
Fig. 1. Tukey boxplot for potassium (K, mg/kg) concentrations in the O-horizon of podzols from the Kola area (Reimann et al., 1998). For
definitions of the different boundaries displayed see text.
[Figure: boxplot; y-axis: mg/kg K, 1000 to 6000; labelled boundaries: maximum, outliers, far outliers.]
Table 1
Mean, standard deviation (sdev), median and median absolute deviation (MAD) for natural and log10-transformed data, and results of the definition of an upper threshold via [mean ± 2 sdev],
[median ± 2 MAD], the boxplot (Tukey, 1977) and cumulative probability plots (see Fig. 7), for selected variables from three example data sets: BSS data, agricultural soils from
Northern Europe, ploughed (TOP) 0–20 cm layer, <2 mm fraction, N=750, 1,800,000 km² (Reimann et al., 2003); Kola, O-horizon of podzol profiles, <2 mm fraction, N=617,
180,000 km² (Reimann et al., 1998); and Walchen, B-horizon of forest soils, <0.18 mm fraction, 100 km² (Reimann, 1989)

(a) Location and spread

Data set            Element   Mean               sdev               Median             MAD
                              Natural   Log10    Natural   Log10    Natural   Log10    Natural   Log10
BSS, TOP            As        2.6       0.293    2.4       0.325    1.9       0.279    1.2       0.294
                    Cu        13        1.01     11        0.307    9.8       0.992    6.6       0.306
                    Ni        11        0.907    8.1       0.301    8         0.904    5.3       0.316
                    Pb        18        1.22     8.1       0.172    17        1.22     5.7       0.152
                    Zn        42        1.53     30        0.293    33        1.52     20        0.291
Kola, O-horizon     As        1.6       0.094    2.5       0.245    1.2       0.065    0.46      0.174
                    Cu        44        1.12     245       0.432    9.7       0.986    5.1       0.267
                    Ni        51        1.12     119       0.565    9.2       0.963    7.7       0.455
                    Pb        24        1.29     49        0.208    19        1.27     7.4       0.185
                    Zn        48        1.66     18        0.157    46        1.66     15        0.143
Walchen, B-horizon  As        46        1.47     83        0.368    30        1.48     19        0.294
                    Cu        32        1.42     23        0.291    28        1.45     15        0.23
                    Ni        34        1.43     23        0.291    30        1.48     19        0.294
                    Pb        39        1.41     53        0.368    23        1.36     16        0.32
                    Zn        74        1.81     34        0.253    73        1.86     30        0.171

(b) Upper threshold estimates

Data set            Element   [Mean+2 sdev]        [Median+2 MAD]       Upper whisker        98th         Outer limit,
                              Natural   Anti-log   Natural   Anti-log   Natural   Anti-log   percentile   CDF
BSS, TOP            As        7.4       8.8        4.3       7.4        5.8       11         9            8 (11)
                    Cu        36        42         23        40         31        64         43           38 (51)
                    Ni        27        35         19        34         36        53         34           45
                    Pb        34        37         28        34         32        42         38           37
                    Zn        101       129        74        128        100       200        121          155
Kola, O-horizon     As        6.6       3.8        2.1       2.6        2.5       3.4        6.4          2.1 (9)
                    Cu        535       96         20        33         35        76         241          13 (1000)
                    Ni        450       177        25        75         54        258        395          10 (500)
                    Pb        122       52         34        44         43        55         48           44 (70)
                    Zn        84        93         76        89         88        107        93           85
Walchen, B-horizon  As        211       163        69        116        87        179        207          90 (160)
                    Cu        79        101        58        81         69        110        84           (40) 80
                    Ni        81        120        69        116        80        153        104          (42) 90
                    Pb        145       139        56        100        81        202        200          90 (350)
                    Zn        141       209        132       160        152       200        139          145 (175)
Fig. 2. Average percentage of outliers detected by the rules [mean ± 2 sdev], [median ± 2 MAD], and the boxplot method. For several different
sample sizes (N=10 to 10 000), the percentages were computed based on 1000 replications of simulated normally distributed data (A) and of
simulated lognormally distributed data (B).
[Figure: two panels (A, B); x-axis: sample size (logarithmic scale); y-axis: % detected outliers; curves: mean ± 2 s, median ± 2 MAD, boxplot.]
scale (MAD or hinge width) are relatively uninfluenced by the extreme values of the lognormal data
distribution. Both location and scale are low relative
to the mean and standard deviation, resulting in lower
fence values and higher percentages of identified
extreme values.
Results from these two simulation exercises
explain the empirical observations in Table 1. The
[median ± 2 MAD] procedure always results in the
lowest threshold value, the boxplot in the second
lowest, and the classical rule in the highest threshold.
Because geochemical data are in general right-skewed
and often closely resemble a lognormal distribution,
the second simulation provides the explanation for the
observed behaviour.
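The ordering of the three rules on skewed data can be reproduced with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the MAD is scaled by 1.4826 (the usual consistency factor under normality), approximates the boxplot hinges by quartiles, and uses fewer replications than the 1000 reported for the figures. The function names are hypothetical.

```python
import random
import statistics


def scaled_mad(values):
    """Median absolute deviation, scaled by 1.4826 (assumed consistency factor)."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


def fraction_flagged(values):
    """Fraction of values falling outside the limits of each of the three rules."""
    n = len(values)
    mean, sdev = statistics.fmean(values), statistics.stdev(values)
    med, mad = statistics.median(values), scaled_mad(values)
    # Quartiles stand in for Tukey's hinges (an approximation).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    width = q3 - q1
    limits = {
        "mean+/-2sdev": (mean - 2 * sdev, mean + 2 * sdev),
        "median+/-2MAD": (med - 2 * mad, med + 2 * mad),
        "boxplot": (q1 - 1.5 * width, q3 + 1.5 * width),
    }
    return {
        rule: sum(v < lo or v > hi for v in values) / n
        for rule, (lo, hi) in limits.items()
    }


def simulate(draw, n=500, replications=200, seed=1):
    """Average percentage of flagged values over repeated simulated samples."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(replications):
        sample = [draw(rng) for _ in range(n)]
        for rule, frac in fraction_flagged(sample).items():
            totals[rule] = totals.get(rule, 0.0) + frac
    return {rule: 100.0 * total / replications for rule, total in totals.items()}


normal_pct = simulate(lambda rng: rng.gauss(0, 1))
lognormal_pct = simulate(lambda rng: rng.lognormvariate(0, 1))
print("normal:   ", normal_pct)
print("lognormal:", lognormal_pct)
```

For the lognormal case, the rule that flags the largest share of the data is the one with the lowest upper limit, which is how the percentages relate to the threshold ordering described above.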
Based on the results of the two simulation
exercises, one can conclude that the data should
approach a symmetrical distribution before any
threshold estimation methods are applied. A graphical
inspection of geochemical data is thus necessary as an
initial step in data analysis. In the case of lognormally
distributed data, log-transformation results in a symmetric
normal distribution (see discussion above). The
percentages of detected extreme values using
[mean ± 2 sdev] or [median ± 2 MAD] are unrealistically
high without a symmetrical distribution. Only
percentiles (recommendation (2) of Hawkes and
Webb, 1962) will always deliver the same number
Fig. 3. Average percentage of outliers detected by the rules [mean ± 2 sdev], [median ± 2 MAD], and the boxplot method. Simulated standard
normally distributed data (A) and simulated standard lognormally distributed data (B) were both contaminated with (log)normally distributed
outliers with mean 10 and variance 1, where the percentage of outliers was varied from 0 to 40% for a constant sample size of 500. The
computed percentages are based on 1000 replications of the simulation.
[Figure: two panels (A, B); x-axis: % simulated outliers; y-axis: % detected outliers; curves: mean ± 2 s, median ± 2 MAD, boxplot.]
Fig. 4. Average percentage of outliers detected by the rules
[mean ± 2 sdev], [median ± 2 MAD], and the boxplot method.
Simulated standard normally distributed data with mean zero and
variance 1 were contaminated with normally distributed outliers
with mean 5 and variance 1, where the percentage of outliers was
varied from 0 to 40% for a constant sample size of 500. The
computed percentages are based on 1000 replications of the
simulation.
[Figure: x-axis: % simulated outliers; y-axis: % detected outliers; curves: mean ± 2 s, median ± 2 MAD, boxplot.]
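The contamination experiment described in the caption of Fig. 4 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: a fixed number of outliers drawn from N(5, 1) replaces part of an N(0, 1) background sample of size 500, the MAD is scaled by 1.4826, quartiles approximate the hinges, and only 100 replications are run. All function names are hypothetical.

```python
import random
import statistics


def rule_limits(sample):
    """Lower and upper limits of the three rules for one sample."""
    mean, sdev = statistics.fmean(sample), statistics.stdev(sample)
    med = statistics.median(sample)
    # MAD scaled by 1.4826 (assumed consistency factor under normality).
    mad = 1.4826 * statistics.median([abs(x - med) for x in sample])
    # Quartiles approximate Tukey's hinges.
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    width = q3 - q1
    return {
        "mean+/-2sdev": (mean - 2 * sdev, mean + 2 * sdev),
        "median+/-2MAD": (med - 2 * mad, med + 2 * mad),
        "boxplot": (q1 - 1.5 * width, q3 + 1.5 * width),
    }


def detected_percent(n_total=500, outlier_fraction=0.2, replications=100, seed=7):
    """Average percentage of the sample flagged by each rule."""
    rng = random.Random(seed)
    n_out = round(n_total * outlier_fraction)
    sums = {}
    for _ in range(replications):
        background = [rng.gauss(0, 1) for _ in range(n_total - n_out)]
        outliers = [rng.gauss(5, 1) for _ in range(n_out)]
        sample = background + outliers
        for rule, (lo, hi) in rule_limits(sample).items():
            flagged = sum(x < lo or x > hi for x in sample)
            sums[rule] = sums.get(rule, 0.0) + 100.0 * flagged / n_total
    return {rule: s / replications for rule, s in sums.items()}


# Percentage of simulated outliers -> average percentage detected per rule.
results = {pct: detected_percent(outlier_fraction=pct / 100) for pct in (0, 10, 20, 40)}
for pct, detected in results.items():
    print(pct, {rule: round(value, 1) for rule, value in detected.items()})
```

Because mean and sdev are themselves inflated by the contaminating values, the classical rule flags fewer and fewer observations as contamination grows, whereas the median-based limits stay close to the background population for longer.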
3. Data distribution
There has been a long discussion in geochemistry
whether or not data from exploration and environmental
geochemistry follow a normal or lognormal
distribution (see Reimann and Filzmoser, 2000). The
discussion was fuelled by the fact that the [mean ± 2
sdev] rule was extensively used to define the range of
background concentrations and differentiate background
from anomalies. Recently, it was again
demonstrated that the majority of such data follow
neither a normal nor a lognormal distribution (Reimann
and Filzmoser, 2000).
In the majority of cases, geochemical distributions
for minor and trace elements are closer to lognormal
(strong right-skewness) than to normal. When plotting
histograms of log-transformed data, they often
approach the bell shape of a Gaussian distribution,
which is then taken as a sufficient proof of lognormality. Statistical tests, however, indicate that in most
cases the data do not pass as drawn from a lognormal
distribution (Reimann and Filzmoser, 2000).
As demonstrated, the boxplot (Tukey, 1977) is
another possibility for graphically displaying the data
distribution (Fig. 1). It provides a graphical data
summary relying solely on the inherent data structure
and not on any assumptions about the distribution of
the data. Besides outliers it shows the centre, scale,
skewness and kurtosis of a given data set. It is thus
ideally suited to graphically compare different data
(sub)sets.
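The quantities displayed in a Tukey boxplot can be computed directly from the sorted data. The sketch below assumes the conventional definitions: hinges as medians of the lower and upper halves of the data, inner fences at 1.5 hinge widths beyond the hinges, and outer fences at 3 hinge widths. The function name and the small data set are hypothetical and serve only as illustration.

```python
import statistics


def tukey_boxplot_stats(values):
    """Hinges, whiskers, outliers and far outliers of a Tukey boxplot."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # the median belongs to both halves when n is odd
    lower_hinge = statistics.median(xs[:half])
    upper_hinge = statistics.median(xs[n - half:])
    hinge_width = upper_hinge - lower_hinge

    inner = (lower_hinge - 1.5 * hinge_width, upper_hinge + 1.5 * hinge_width)
    outer = (lower_hinge - 3.0 * hinge_width, upper_hinge + 3.0 * hinge_width)

    # Whiskers end at the most extreme observations inside the inner fences.
    lower_whisker = min(x for x in xs if x >= inner[0])
    upper_whisker = max(x for x in xs if x <= inner[1])

    outliers = [x for x in xs
                if (x < inner[0] or x > inner[1]) and outer[0] <= x <= outer[1]]
    far_outliers = [x for x in xs if x < outer[0] or x > outer[1]]

    return {
        "median": statistics.median(xs),
        "hinges": (lower_hinge, upper_hinge),
        "whiskers": (lower_whisker, upper_whisker),
        "outliers": outliers,
        "far_outliers": far_outliers,
    }


# Hypothetical concentrations, for illustration only.
stats = tukey_boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 60])
print(stats)
```

In this example the upper whisker ends at 9, the value 20 lies between the inner and outer fences and is reported as an outlier, and 60 lies beyond the outer fence and is reported as a far outlier.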
Fig. 5 shows that a combination of histogram,
density trace, one-dimensional scattergram and boxplot
gives much better insight into the data
structure than the histogram alone. The one-dimensional
scattergram is a very simple tool in which each
measured value is displayed as a small horizontal line
at an arbitrarily chosen y-scale position at the
appropriate position along the x-scale. In contrast to
histograms, a combination of density traces, scattergrams
and boxplots will at once show any peculiarities
in the data, e.g., breaks in the data structure (Fig.
5, Pb) or data discretisation due to severe rounding of
analytical results in the laboratory (Fig. 5, Sc).
Fig. 5. Combination of histogram, density trace, one-dimensional scattergram and boxplot for the study of the empirical data distribution (data
from Reimann et al., 1998).
One of the best graphical displays of geochemical
distributions is a cumulative probability plot (CDF
diagram), originally introduced to geochemists by
Tennant and White (1959), Sinclair (1974, 1976) and
others. Fig. 6 shows four forms of such displays. The
best choice for the y-axis is often the normal
probability scale because it spreads the data out at
the extremes, which is where interest usually lies.
Also, it permits the direct detection of deviations from
normality or lognormality, as normal or lognormally
(logarithmic x-scale) distributed data plot as straight
Fig. 6. Four different variations for plotting CDF diagrams. Upper row, empirical cumulative distribution plots; lower row, cumulative
probability plots. Left half, data without transformation; right half, data plotted on a logarithmic scale equivalent to a logarithmic
transformation. Example data: Cu (mg/kg) in podzol O-horizons from the Kola area (Reimann et al., 1998).
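The coordinates of a cumulative probability plot can be obtained by sorting the data and converting the empirical cumulative probabilities to normal quantiles. The sketch below is one possible construction: the plotting position (i - 0.5)/n is an assumption (other conventions exist), the function name is hypothetical, and the three input values are illustrative only.

```python
import math
from statistics import NormalDist


def probability_plot_points(values, log_x=False):
    """Return (x, cumulative probability, normal quantile) for each sorted value.

    Plotting x against the normal quantile gives a cumulative probability
    plot; with log_x=True the x-values are log10-transformed, which is
    equivalent to using a logarithmic x-scale.
    """
    xs = sorted(values)
    n = len(xs)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n  # plotting position (assumed convention)
        z = standard_normal.inv_cdf(p)  # position on the normal probability scale
        points.append((math.log10(x) if log_x else x, p, z))
    return points


# Illustrative values only, not data from the paper.
points = probability_plot_points([10, 1, 100], log_x=True)
for x, p, z in points:
    print(f"log10(x)={x:.1f}  p={p:.3f}  z={z:.3f}")
```

Data drawn from a normal distribution (or, with `log_x=True`, from a lognormal distribution) fall close to a straight line when x is plotted against z, so breaks and bends in the plotted points indicate departures from these models.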
Fig. 7. Four selected cumulative probability plots for the example data from Table 1. Arrows mark some different possible thresholds (compare
with Table 1). Example data are taken from (Reimann et al., 1998, 2003; Reimann, 1989). The vertical lines in the lower left corner of the plots
for the Baltic Soil Survey (Reimann et al., 2003) data are caused by data below the detection limit (set to half the detection limit for
representation).
Fig. 8. Regional distribution of Cu (mg/kg) in podzol O-horizons from the Kola area (Reimann et al., 1998). The high values in Russia (large
and small crosses) mark the location of the Cu-refinery in Monchegorsk, the Cu-smelter in Nikel and the Cu-Ni ore roasting plant in Zapoljarnij
(close to Nikel). The map suggests that practically all sample sites in Russia, and some in Norway and Finland, are contaminated.
[Figure: map of the survey area (Norway, Finland, Russia, Barents Sea) with a 0 to 50 km scale bar; legend values for Cu, mg/kg: 2.7, 6.9, 9.7, 18, 35, 4080.]
4. Conclusions
Of the three investigated procedures, the boxplot
function is most informative if the true number of
outliers is below 10%. In practice, the use of the
boxplot for preliminary class selection to display
spatial data structure in a map has proven to be a
powerful tool for identifying the key geochemical
processes behind a data distribution. If the proportion
of outliers is above 15%, only the [median ± 2 MAD]
procedure will perform adequately, and then up to the
point where the outlier population starts to dominate
the data set (50%). The continued use of the [mean ± 2
sdev] rule is based on a misunderstanding. Geochemists
want to identify data outliers and not the
extreme values of normal (or lognormal) distributions
that statisticians are often interested in. Geochemical
outliers are not these extreme values for background
populations but values that originate from different,
often superimposed, distributions associated with
processes that are rare in the environment. They can,
and often will, be the 'extreme values' for the whole
data set. This is the reason that the [mean ± 2 sdev] rule
appears to function adequately in some real instances,
but breaks down when the proportion of outliers in the
data set is large relative to the background population
size. The derived values, however, have no statistical
Appendix A
The following is one heuristic for data inspection
and selection of the limits of background variation that
has proved to be informative in past studies. The
following assumes that the data are in a computer
processable form and have been checked for any
obvious errors, e.g., in the analytical and locational
data; and that the investigator has access to data
analysis software. Maps suitable for data inspection
can be displayed with data analysis software; however,
References
Allen HE, editor. Bioavailability of metals in terrestrial ecosystems:
importance of partitioning for bioavailability to invertebrates,
microbes, and plants. Pensacola, FL: Society of Environmental
Toxicology and Chemistry (SETAC); 2002.
AMC. Analyst 2001;126:256–9.
Barnett V, Lewis T. Outliers in statistical data. 3rd edition. New
York: Wiley & Sons; 1994.
Björklund A, Gustavsson N. Visualization of geochemical data on
maps: new options. J Geochem Explor 1987;29:89–103.