REJOINDER
We heartily thank Michael Stein and Brad Efron for selecting our paper for
discussion and for the tremendous task of recruiting and editing 13 discussion ar-
ticles on this controversial and timely topic. We are grateful for the opportunity
to receive feedback on our work from such a large number of knowledgeable dis-
cussants whose work and fields of expertise are so broad. The fact that our paper
was of interest not only to academic statisticians and climate scientists but also
to economists and popular bloggers1 bespeaks the importance of the topic.
We thank all 13 discussants for the time they put into considering and respond-
ing to our paper. Each one deserves a detailed response, but time and space con-
straints make that impossible. We therefore acknowledge the favorable tenor of the
discussions generally if not specifically.
The discussion has great value, particularly for raising points of contrast, some-
times about fundamental issues. For instance, Wahl and Amman (WA) note “there
is an extensive literature contradicting McShane and Wyner’s (2011a) assertions
about low or poor relationships between proxies and climate.” On the other hand,
Tingley asserts “each proxy is weakly correlated to the northern hemisphere mean
(for two reasons: proxies generally have a weak correlation with local climate,
which in turn is weakly correlated with a hemispheric average)” and Davis and Liu
(DL) state “there is just not much signal present.” This contrast can be explained
at least in part by context. Our paper addresses the specific task of reconstruct-
ing annual temperatures over relatively short epochs during which temperatures
vary comparatively little. Nevertheless, such contrasts are suggestive of the impor-
tant frontiers for research and we hope our paper and this discussion will lead to
advances on these fronts.
In this rejoinder, we aim to do three things. First, we respond to the detailed
and highly critical discussion of Schmidt, Mann, and Rutherford (SMR). Next,
we reiterate our key findings while targeting themes that emerge from multiple
discussants. Finally, we conclude with a more in-depth response to Tingley and
Smerdon who address the same broad issue. The discussions of SMR and Tingley
are noteworthy because they take a “scientific” approach as opposed to the “sta-
tistical” approach taken by many of the other discussants (e.g., DL and Kaplan),
FIG. 1. Reproduction of SMR Figure 1a. The left panel gives smoothed fits thereby reproducing
SMR Figure 1a whereas the right panel gives unsmoothed fits. Results using the reduced set of 55
Mann et al. (2008) proxies (excluding Tiljander) are plotted with solid lines whereas results using
the full set of 93 proxies are plotted with dashed lines. Two features stand out from these plots. First,
the differences between the fit of a given method to the full or reduced set of proxies are quite small
compared to the annual variation of a given fit or compared to the variations between fits. Second,
the RegEM EIV methods produce reconstructions which are nearly identical to those produced by
OLS PC4 and OLS G5 PC5. Compare also with SMR Figure S2 which is similar to the bottom panel
but excludes RegEM EIV.
FIG. 2. Difference between our Bayesian AR2 + PC10 model of Section 5 and various other mod-
els. The left panel gives the difference between our Bayesian AR2 + PC10 model fit to the network of
93 proxies dating back to 1000 AD and the original Mann et al. (2008) RegEM EIV fit to the network
of 59 proxies dating back to 1000 AD. The right panel gives the difference between our Bayesian
AR2 + PC10 model fit to the network of 93 proxies dating back to 1000 AD and the model of SMR
Figure 1b (i.e., the same Bayesian AR2 + PC10 model but fit to the network of 55 proxies dating back
to 1000 AD instead of the network of 93 proxies). As can be seen, there are no statistically significant
differences between these two models and our Bayesian AR2 + PC10 model fit to the 93 proxies.
The annual difference is given in red, the smoothed difference in thick red, and annual uncertainty
bands are given in gray. The right plot has wider intervals because the uncertainty in both models is
accounted for. Since we lack uncertainty estimates for RegEM, the left panel uses only the uncertainty
estimates of our Bayes model. Compare to SMR Figures 1b and 1c as well as figures in the SI.
the archives consist of a large number of files layered on top of one another and,
despite a major effort on our part, we were unable to replicate published results
within the publication time constraints of this rejoinder. That said, independent
researchers have, after important modifications,2 successfully run code from the
first URL (Jason Smerdon, personal communication). Consequently, throughout
this section, we work with RegEM ŷ’s (pre-fit by SMR to both real and simulated
data) as well as one particular draw of the data from their simulation which were
provided.
1.1. Proxy data: Full versus reduced network of 1000 year-old proxies. SMR
allege that we have applied the various methods in Sections 4 and 5 of our paper to
an inappropriately large group of 95 proxies which date back to 1000 AD (93 when
the Tiljander lightsum and thicknessmm series are removed due to high correlation
as in our paper; see footnote 11). In contrast, the reconstruction of Mann et al.
(2008) is applied to a smaller set of 59 proxies (57 if the two Tiljander series
mentioned previously are removed; 55 if all four Tiljander series are excluded
because they are “potentially contaminated”).
The process by which the complete set of 95/93 proxies is reduced to 59/57/55
is only suggestively described in an online supplement to Mann et al. (2008).3 As
statisticians we can only be skeptical of such improvisation, especially since the
instrumental calibration period contains very few independent degrees of freedom.
Consequently, the application of ad hoc methods to screen and exclude data in-
creases model uncertainty in ways that are unmeasurable and uncorrectable.
Moreover, our interpretation of SMR Figure 1 is quite different. We see the vari-
ation between the larger and smaller datasets as relatively small with respect to the
variation among the models. The appearance of a difference in SMR Figure 1a
is especially magnified because those reconstructions are smoothed. Smoothing
exaggerates the difference and requires careful adjustment of fit statistics such
as standard errors, adjustments which are lacking in SMR and which are in gen-
eral known only under certain restrictive conditions. In contrast, consider the right
panel of Figure 1 which is a reproduction of SMR Figure 1a without smoothing.
The difference between a given model fit to the full dataset or the reduced data set
is clearly dwarfed by the annual variation of the fit; the full and reduced set of prox-
ies yield inconsequentially different reconstructions. It thus seems to us the main
point of Figure 14 of the paper (which SMR Figures 1a and S2 roughly mimic)
stands: various methods which have similar holdout RMSEs in the instrumental
period produce strikingly different reconstructions including “hockey sticks” (such
2 A bypass of the function used to generate new pseudoproxies during each run (pseudoproxytwo.m)
is required since this module appears to be inoperative.
3 The Mann et al. (2008) Supplementary Information contains the following note: “Tree-ring data
included 926 tree-ring series extracted from the International Tree Ring Data Bank (ITRDB, version
5.03: www.ncdc.noaa.gov/paleo/treering.html). All ITRDB tree-ring proxy series were required to
pass a series of minimum standards to be included in the network: (i) series must cover at least the
interval 1750 to 1970, (ii) correlation between individual cores for a given site must be 0.50 for this
period, (iii) there must be at least eight samples during the screened period 1800–1960 and for every
year used.”
as the red one in Figure 1), “inverted check marks” (such as the green), and things
in between (such as the blue and purple). In short, while SMR allege that we use
the “wrong” data, the result remains the same (also see SI).
We have two additional findings. First, as shown in Figure 1, the RegEM recon-
struction is nearly identical to OLS PC4 and OLS G5 PC5. This is particularly
interesting in light of the performance comparisons of SMR Figure 2. Second,
SMR Figure 1a and our Figure 1 given here do not account for uncertainty in the
model fits. When such uncertainty is accounted for, as can easily be done for the
models in SMR Figures 1b and 1c, we see that the difference between the recon-
structions produced from the larger data set of 95/93 proxies and the 59/57/55
are negligible with respect to overall uncertainty (see Figure 2; see SI for more
details).
1.2. The selection of principal components. SMR Figure 1c replots our Bayes
model (Figure 16 of the paper) with two differences: it uses the reduced dataset
of 55 proxies and only four principal components. There are no statistically sig-
nificant differences between the resulting model and our original one (see SI), yet
SMR allege that “K = 10 principal components is almost certainly too large, and
the resulting reconstruction likely suffers from statistical over-fitting. Objective
selection criteria applied to the Mann et al. (2008) AD 1000 proxy network, as
well as independent “pseudoproxy” analyses discussed below, favor retaining only
K = 4.”
SMR are wrong on two counts. First, the two “objective” criteria they suggest
select differing numbers of principal components. Second, each criterion has mul-
tiple implementations each producing different results. As is well known to statisti-
cians, there is no single objective way to resolve these discrepancies. Furthermore,
the PC selection procedures that SMR prefer select “significant” PCs based en-
tirely on the matrix of predictors without considering the response variable. To
protect against overfitting, the selection process should in some way take into ac-
count the relationship between the predictor and the response [see also Izenman
(2008), Hastie, Tibshirani and Friedman (2009)]. Compounding matters, SMR im-
plement their allegedly objective criteria in nonstandard and arbitrary ways and
several times in error.4 When correctly implemented, the number of principal com-
ponents retained varies across each “objective” criterion from two to fifty-seven.
Using ten principal components, therefore, can hardly be said to induce the “sta-
tistical over-fitting” claimed by SMR.
4 They appear to mistake the squared eigenvalues for the variances of the principal components
which leads to a thresholding of total variance squared instead of variance. We provide complete
details in the SI.
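To make the contrast concrete, the following is a minimal sketch (in Python, on synthetic data; the sizes, the 50% threshold, and the candidate range of K are purely illustrative and are not the criteria SMR used or that we examine in the SI) of (a) a variance-explained rule based on the eigenvalues, (b) the erroneous version that thresholds squared eigenvalues, and (c) a response-aware alternative that selects K by cross-validated principal component regression.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n, p = 149, 93                                  # calibration years x proxies (illustrative)
    X = rng.standard_normal((n, p))                 # stand-in for the standardized proxy matrix
    y = X[:, :4] @ rng.standard_normal(4) + rng.standard_normal(n)   # toy response

    pca = PCA().fit(X)
    eigvals = pca.explained_variance_               # variances of the principal components
    scores = pca.transform(X)

    # (a) Variance-explained rule: retain enough PCs to explain, say, 50% of total variance.
    frac = np.cumsum(eigvals) / eigvals.sum()
    K_var = int(np.searchsorted(frac, 0.50) + 1)

    # (b) Erroneous rule: thresholding squared eigenvalues ("total variance squared")
    #     concentrates the cumulative share on the leading PCs and always retains fewer.
    frac_sq = np.cumsum(eigvals ** 2) / (eigvals ** 2).sum()
    K_sq = int(np.searchsorted(frac_sq, 0.50) + 1)

    # (c) Response-aware rule: choose K by cross-validated principal component regression,
    #     so that the selection reflects the relationship between predictors and response.
    cv_rmse = []
    for K in range(1, 31):
        mse = -cross_val_score(LinearRegression(), scores[:, :K], y,
                               scoring="neg_mean_squared_error", cv=5).mean()
        cv_rmse.append(np.sqrt(mse))
    K_cv = int(np.argmin(cv_rmse) + 1)

    print(K_var, K_sq, K_cv)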
1.3. Simulated data. SMR Figure 2 (along with SMR Table S1) purports to
show that the Lasso (applied in Section 3 of our paper) and the variety of prin-
cipal component methods (applied in Section 4) are fundamentally inferior to the
RegEM EIV method of Mann et al. (2008) and to thereby challenge our assertion
that various methods perform similarly (based on Figures 11–13 of the paper).
RegEM is considered to be a state of the art model for temperature reconstructions
in the climate science literature [Mann et al. (2007, 2008), Lee, Zwiers and Tsao
(2008)].
SMR Figure 2 is based on data simulated from the National Center for Atmospheric
Research (NCAR) Climate System Model (CSM) as well as the Helmholtz-Zentrum
Geesthacht Research Centre (GKSS) European Centre Hamburg Ocean Primitive
Equation-G (ECHO-G) “Erik” model. We see several problems with this simula-
tion:
(1) While we can vaguely outline the process which generated the simulated
temperatures and pseudoproxies, the details are buried in layers of code at various
supplementary websites and therefore are not reproducible.
(2) In contrast to the methods of Sections 4 and 5 of our paper which are
transparent, RegEM appears to be a classic, improvised methodology with no
known statistical properties, particularly in finite samples or when assumptions
are violated. For instance, the “missing at random” assumption [Little and Ru-
bin (2002)] likely fails to hold here [Smerdon, Kaplan and Chang (2008)]. Fur-
ther, there is an enormous number of variations on RegEM (e.g., RegEM-Ridge,
RegEM-Truncated Total Least Squares, etc.), each with its own associated tun-
ing parameters and no firmly agreed upon methods for tuning them [Smerdon, Ka-
plan and Chang (2008), Christiansen, Schmith and Thejll (2009, 2010), Rutherford
et al. (2010)]. Consequently, we cannot rule out the possibility that RegEM was
tailor-made to the specific features of this simulation,5 particularly since the same
simulations have been used in repeated studies.6 This is an especially important
5 This is suggested by the fact that RegEM performs nearly identically to OLS PC4 and OLS G5
PC5 on the real proxy data (see SMR Figure 1a and our Figure 1; see also SMR Figure 2 and our
corrected versions in Figure 3 and the SI) but substantially better on the simulated data (see SMR
Table S1 and our corrected version in the SI).
6 For a review of papers using these simulations, see Smerdon, Kaplan and Chang (2008) who state
in their opening two paragraphs: “Rutherford et al. (2005) used RegEM to derive a reconstruction of
the NH temperature field back to A.D. 1400. This reconstruction was shown to compare well with
the Mann, Bradley and Hughes (1998) CFR. . . Mann et al. (2005) attempted to test the Rutherford
et al. (2005) RegEM method using pseudoproxies derived from the National Center for Atmospheric
Research (NCAR) Climate System Model (CSM) 1.4 millennial integration. Subsequently, Mann et
al. (2007) have tested a different implementation of RegEM and shown it to perform favorably in
pseudoproxy experiments. This latter study was performed in part because Mann et al. (2005) did
not actually test the Rutherford et al. (2005) technique, which was later shown to fail appropriate
pseudoproxy tests [Smerdon and Kaplan (2007)]. . . Mann et al. (2005) used information during the
reconstruction interval, a luxury that is only possible in the pseudoclimate of a numerical model
simulation and not in actual reconstructions of the earth’s climate.”
FIG. 3. Corrected version of SMR Figure 2a. The left panel gives smoothed fits as in SMR Figure 2a
whereas the right panel gives unsmoothed fits. The annual variation of each method’s fit dwarfs the
difference between methods. See SI for corrected versions of SMR Figures 2b, 2c, and 2d.
point since it is common to find that some methods work well in some settings and
quite poorly in others.
(3) SMR make absolutely no attempt to deal with uncertainties, either for a
given draw of data from the simulation or across repeated draws of the simulation
even though there is considerable variation in both [see Burger and Cubasch (2005)
for variation of fit conditional on data and Christiansen, Schmith and Thejll (2009)
for variation of fit across draws of a simulation].
(4) How relevant are the results of the simulation to the real data application
(i.e., Berliner’s point about the “need to better assess” these large-scale climate
system models, something we return to in Section 1.4 below)?
Fortunately, we are able to use the data and code provided to us to rebut SMR’s
findings. Before proceeding, however, we must note a troubling problem with
SMR Figure 2. Visual inspection of the plots reveals an errant feature: OLS meth-
ods appear to have nonzero average residual in-sample! Upon examining the code
SMR did provide, we confirmed that this is indeed the case. The culprit is an un-
reported and improper centering of the data subsequent to the model fits, resulting
in biased estimates and uncalibrated confidence intervals.
SMR Figure 2 does not plot raw Northern Hemisphere annual temperature but
rather NH temperature anomaly, that is, NH temperature minus the average NH
temperature over a subset of the in-sample period (defined to be 1900–1980 AD
for the CSM simulation and 1901–1980 for the GKSS simulation). This center-
ing technique is common in climate science and simply represents a change of
location. However, SMR fit the various OLS and Lasso methods to the raw (un-
centered) temperature over the full calibration period 1856–1980 AD. In order
to center the predictions, rather than subtracting off the mean 1900–1980 (1901–
1980) AD NH temperature, they subtracted off the mean of each model’s fitted val-
ues over 1900–1980 (1901–1980) AD. This erroneous and nonstandard centering
results in a substantially biased predictor with an overestimated RMSE. We refit
the models to centered rather than raw temperature and the RMSEs were about 15–
20% lower than in SMR Table S1 (see SI). Furthermore, the differences between
the various methods were dramatically reduced.
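The mechanics of the problem can be seen in a few lines. The following sketch (in Python, with entirely made-up numbers; the variable names are ours and the size of the effect depends on the data at hand) fits OLS to raw temperature and centers the predictions both ways. Centering on the mean fitted value over the reference window shifts the predictions by the mean calibration residual over that window, so the in-sample residuals no longer average to zero and the in-sample RMSE is inflated by exactly that offset in quadrature.

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative stand-ins: calibration years 1856-1980, anomaly reference window 1900-1980.
    years = np.arange(1856, 1981)
    ref = years >= 1900
    x = np.cumsum(rng.standard_normal(years.size)) * 0.05      # stand-in proxy composite
    temp = 0.3 * x + rng.normal(0.0, 0.1, years.size)           # stand-in NH temperature
    anomaly = temp - temp[ref].mean()                           # target anomalies

    # OLS fit to raw (uncentered) temperature over the full calibration period.
    X = np.column_stack([np.ones(years.size), x])
    beta = np.linalg.lstsq(X, temp, rcond=None)[0]
    yhat = X @ beta                                             # in-sample fitted values

    pred_fitmean = yhat - yhat[ref].mean()    # centering on the mean fitted value (as above)
    pred_obsmean = yhat - temp[ref].mean()    # standard anomaly centering (observed mean)

    # The two predictions differ by a constant: the mean calibration residual over the window.
    offset = yhat[ref].mean() - temp[ref].mean()

    res_fitmean = pred_fitmean - anomaly      # mean is -offset, i.e., nonzero in-sample
    res_obsmean = pred_obsmean - anomaly      # ordinary OLS residuals, whose mean is zero

    rmse = lambda r: np.sqrt(np.mean(r ** 2))
    # In-sample, rmse(res_fitmean)^2 = rmse(res_obsmean)^2 + offset^2.
    print(res_fitmean.mean(), res_obsmean.mean(), rmse(res_fitmean), rmse(res_obsmean))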
FIG. 4. Bayesian AR2 + PC10 model of Section 5 applied to simulated data and smoothed. As can
be seen, the Bayes model appears to perform similarly to both RegEM EIV methods. Furthermore,
the confidence intervals of the Bayes model (gray) appear to be calibrated. We give the unsmoothed
version of this figure in the SI.
1.4. Real data versus simulated data. Berliner calls for an assessment of
whether large-scale climate models like those studied in Section 1.3 can serve as
a surrogate for controlled experiments. In this section, we make a modest advance
on this front (see SI for all plots).
Climate scientists, when evaluating these simulations, have focused on several
technical issues. Smerdon, Gonzalez-Rouco and Zorita (2008) show that Mann et
al. (2007) employed an inappropriate interpolation of GKSS temperatures and that
verification statistics “are weakened when an appropriate interpolation scheme is
adopted.” More recently, Smerdon, Kaplan and Amrhein (2010) “identified prob-
lems with publicly available versions of model fields used in Mann et al. (2005)
and Mann et al. (2007)” thereby showing that “the quantitative results of all
pseudoproxy experiments based on these fields are either invalidated or require
reinterpretation.” Hence, climate scientists have questioned the value of results de-
rived from the CSM and GKSS simulations due to technical issues internal to the
simulation procedure.
As statisticians, we approach the evaluation of simulated data from a somewhat
different perspective. When statisticians design simulations, we tend to follow one
very important general rubric: if one wants insights gleaned from the simulation
to carry over to real data, then key features of the simulated data should match key
features of the real data. Thus, we augment climate scientists’ “internal” evaluation
of the simulated data with an “external” evaluation comparing it to real data.
We have already seen one way in which the real data and simulated data ap-
pear to differ: RegEM gives fits and predictions that are nearly identical to those
of OLS PC4 and OLS G5 PC5 on the real proxy data (see SMR Figure 1a and our
Figure 1) but the fits and predictions on the simulated data are quite different (see
SMR Figure 2 and our corrected versions in Figure 3 and the SI; see also SMR
Table S1 and our corrected version given in the SI). More broadly, we observe that
the simulations appear to have smoother NH temperatures than the real data. This
2. Section 3 revisited.
FIG. 5. Cross-validated RMSE on 30-year holdout blocks for various models fit to proxies and
pseudo-proxies. The procedures used to generate Intercept and ARMA boxplots are discussed in
Section 3.2 of the paper. The procedures used to generate the White Noise, AR1, and Brownian
motion pseudo-proxies are discussed in Section 3.3 of the paper. The CPS fitting procedure used for
the Proxy, White Noise, AR1, and Brownian motion boxplots is described in Section 2 of the long
form rejoinder.
7 SMR and others have chided us for calling our various noise series “pseudoproxies.” We note,
with Tingley, that such series represent the limiting case of pseudoproxies with zero signal-to-noise
ratio and thus can lay some claim to the name. It is, regardless, an unimportant distinction and, for this
rejoinder, we stick with the nomenclature of the paper.
only model. Finally, the various pseudoproxy RMSEs are strikingly similar to one
another.
In juxtaposition to Figure 9 of the paper which gave the holdout RMSEs from
the Lasso, the CPS holdout RMSEs of Figure 5 are quite provocative and deserve
more attention. Using this implementation of the CPS method, one might indeed
conclude that the proxies are statistically significant against all pseudoproxies.
However, CPS appears to be a “weak” method in that its holdout RMSEs are larger
than the corresponding ones from the Lasso as can be seen by comparing Figure 5
here to Figure 9 of the paper. If, in turn, one uses a more powerful method like the
Lasso—one that performs better at the ultimate goal of predicting temperature us-
ing proxies—the results, as indicated in Figure 9 of the paper, are mixed: the Lasso
appears somewhat significant against weak null benchmarks like AR1(0.25) and
AR1(0.4) but not against strong ones like AR1(Empirical) and Brownian motion.
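To make the structure of these comparisons explicit, the following is a minimal sketch (in Python, with a synthetic temperature series and illustrative sizes; it is not the code behind Figure 5 here or Figure 9 of the paper) of how white noise, AR1, and Brownian motion pseudoproxies can be generated and scored by holdout RMSE with a cross-validated Lasso.

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(2)
    n, p = 149, 93                        # instrumental years x series (illustrative sizes)
    t = np.arange(n)
    temp = 0.005 * t + np.cumsum(rng.normal(0, 0.05, n))     # toy NH temperature series

    def pseudoproxies(kind, rho=0.4):
        """Generate an n x p matrix of noise pseudoproxies of the stated kind."""
        if kind == "white":
            return rng.standard_normal((n, p))
        if kind == "ar1":
            Z = np.empty((n, p))
            Z[0] = rng.standard_normal(p)
            for i in range(1, n):
                Z[i] = rho * Z[i - 1] + np.sqrt(1 - rho ** 2) * rng.standard_normal(p)
            return Z
        if kind == "brownian":
            return np.cumsum(rng.standard_normal((n, p)), axis=0)
        raise ValueError(kind)

    def holdout_rmse(X, block=slice(0, 30)):
        """Fit the Lasso (penalty chosen by cross-validation) off-block, score on the block."""
        train = np.ones(n, bool)
        train[block] = False
        model = LassoCV(cv=5, max_iter=5000).fit(X[train], temp[train])
        pred = model.predict(X[~train])
        return np.sqrt(np.mean((pred - temp[~train]) ** 2))

    for kind in ["white", "ar1", "brownian"]:
        print(kind, holdout_rmse(pseudoproxies(kind)))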
Actually, the conclusions are decidedly less clear-cut than the boxplots of Fig-
ure 5 suggest. In the SI, we plot the RMSE by year for the CPS method and pro-
vide null bands based on the sampling distribution of the pseudo-proxies (as we
did for the Lasso in Figure 10 of the paper). We see that the proxies have consis-
tently lower RMSEs than those of the pseudo-proxies for a majority of the holdout
blocks. Against weak pseudo-proxies, the real proxy predictions are highly sta-
tistically significant on the first few and last few blocks. However, against more
sophisticated pseudo-proxies, CPS forecasts do not appear to be statistically sig-
nificant. Furthermore, though much worse than ARMA models on the interpolation
blocks, CPS predictions are not necessarily worse on the first and last blocks.
Like Aeneas’ description of Dido, statistical significance is varium et mutabile
semper (fickle and always changing) [Maro (29BC)]. Conditional on a choice of
method, pseudoproxy, and holdout block (or aggregation of blocks), the proxies
may appear statistically significant. Yet, ever so slight variations in those choices
might lead to statistical insignificance. Consequently, it is easy to misinterpret
statistical significance tests and their results (as also discussed by DL, Kaplan,
Smerdon, Tingley, and WA). We fault many of our predecessors for assiduously
collecting and presenting all the facts that confirm their theories while failing to
seek facts that contradict them. For science to work properly, it is vital to stress
one’s model to its fullest capacity [Feynman (1974)]. The results presented in Fig-
ures 9 and 10 of the paper, in Figure 5 here, and in various figures in the SI, sug-
gest that maybe our models are in fact not strong enough (or that proxies are too
weak). Furthermore, in contexts where the response series has substantial ability
to locally self-predict, it is vital to recognize this and make sure the model and co-
variates provide incremental value over that [otherwise, “a particular covariate that
is independent of the response, but is able to mimic the dependence structure of the
response can lead to spurious results” (DL)]. Methods like the Lasso coupled with
pseudoproxies like AR1(Empirical) and Brownian motion will naturally account
for this self-prediction (see also Kaplan), whereas naive CPS with latitude-based
weighting will not (CPS using univariate correlation weights does, however; see
SI).
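For completeness, the composite-plus-scale idea itself can be sketched in a few lines (in Python, on synthetic data; this generic sketch is not the implementation described in Section 2 of the long form rejoinder): proxies are standardized, averaged with some set of weights, and the composite is rescaled to match the calibration-period mean and standard deviation of temperature. The temperature series enters only through the weights (when correlation weights are used) and through the final rescaling.

    import numpy as np

    rng = np.random.default_rng(3)
    n_cal, n_rec, p = 149, 850, 60                  # calibration years, reconstruction years, proxies
    signal = np.cumsum(rng.normal(0, 0.05, n_cal + n_rec))             # toy "climate" signal
    proxies = signal[:, None] + rng.normal(0, 1.0, (n_cal + n_rec, p)) # noisy proxies
    temp = signal[-n_cal:] + rng.normal(0, 0.1, n_cal)                 # instrumental overlap

    def cps(proxies, temp, weights):
        """Composite-plus-scale: weighted average of standardized proxies, rescaled to the
        calibration-period mean and standard deviation of temperature."""
        Z = (proxies - proxies.mean(axis=0)) / proxies.std(axis=0)
        comp = Z @ (weights / np.sum(np.abs(weights)))
        c_cal = comp[-len(temp):]                   # calibration-period composite
        return (comp - c_cal.mean()) * (temp.std() / c_cal.std()) + temp.mean()

    # Correlation weights use the response; naive weights (e.g., latitude-based area weights)
    # do not and so cannot adapt to how informative each series actually is.
    corr_w = np.array([np.corrcoef(proxies[-n_cal:, j], temp)[0, 1] for j in range(p)])
    recon = cps(proxies, temp, corr_w)
    print(np.sqrt(np.mean((recon[-n_cal:] - temp) ** 2)))   # calibration-period RMSE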
3. Two points on Section 4. Only two of the discussants (MM and SMR)
commented on Section 4 of our paper. In it, we showed that 27 methods have
very similar instrumental period holdout RMSEs yet provide extremely different
temperature reconstructions [see also Burger and Cubasch (2005)]. This remains
true whether one uses the dataset of 93 proxies from the paper or the dataset of
55 proxies favored by SMR, or whether one uses 30- or 60-year holdout blocks
(see SI; it also appears to hold broadly for data simulated from climate models
[Lee, Zwiers and Tsao (2008), Christiansen, Schmith and Thejll (2009),
Smerdon et al. (2010)]). Thus, based on predictive ability, one has no reason to
prefer “hockey sticks” to “inverted check marks” or other shapes and MM are
correct to label this “a phenomenon that very much complicates the uncertainty
analysis.”
Also unremarked upon was our point that the proxies seem unable to capture
the high levels of and sharp run-up in temperatures experienced in the 1990s, even
in-sample or in contiguous holdout blocks. It is thus highly improbable that they
would be able to detect such high levels and sharp run-ups if they indeed occurred
in the more distant past. That is, we lack statistical evidence that the recently ob-
served rapid rise in temperature is historically anomalous.
sophisticated models used to produce reconstructions from real proxy data. Thus
far, other more sophisticated approaches have only been applied to simulated data.
We now turn to the putatively more sophisticated approaches advocated by our
critics [see NL for a very clear exposition; see also Tingley and Huybers (2010) and
Li, Nychka and Amman (2010)]. While these models have potential advantages,
such as a richer spatio-temporal structure, our experience with real temperature
and proxy data causes us to be a bit more circumspect. These models make a
large number of assumptions about the relationships among global temperature,
local temperatures, proxies, and external forcings. We would like to see a more
thorough investigation of these assumptions because they do not seem to apply to
real data (e.g., how does DL’s finding that proxies appear to lead temperature by 14
years square with such models?). Furthermore, there are even deeper assumptions
embedded in these models which are difficult to tease out and test on real data.
Hence, we strongly believe that these models need to be rigorously examined
in light of real data. How do they perform in terms of holdout RMSE and cali-
bration of their posterior intervals? How about when they are fit to various noise
pseudoproxies as in our Section 3? When replicated data is drawn from the model
conditional on the observed data, does the replicated data “look” like the observed
data, especially in terms of prominent features? In sum, while we believe these
models have much to recommend for themselves and applaud those working on
them, we also strongly believe that tests like those employed in Section 3 of our
paper, Section 1.4 of this rejoinder, and various other posterior predictive checks
[Gelman et al. (2003), Gelman and Hill (2006)] are absolutely vital in the context
of such assumption-laden models.
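Such a posterior predictive check requires nothing more than the ability to simulate replicated series from the fitted model. The schematic below (in Python; the function names are ours and the toy model is deliberately simple) computes a posterior predictive tail probability for any user-supplied statistic. In the toy demonstration, a white noise model is correctly flagged as inconsistent with a strongly autocorrelated observed series because no replicate reproduces the observed lag-one autocorrelation.

    import numpy as np

    def posterior_predictive_pvalue(y_obs, theta_draws, draw_replicate, statistic):
        """Generic posterior predictive check: for each posterior draw, simulate a replicated
        series and compare statistic(replicate) with statistic(y_obs)."""
        t_obs = statistic(y_obs)
        t_rep = np.array([statistic(draw_replicate(th, len(y_obs))) for th in theta_draws])
        return np.mean(t_rep >= t_obs)          # values near 0 or 1 signal misfit

    # Toy demonstration: a white noise "model" checked against an autocorrelated series.
    rng = np.random.default_rng(4)
    y_obs = np.cumsum(rng.standard_normal(149)) * 0.1             # autocorrelated "observed" series
    lag1 = lambda y: np.corrcoef(y[:-1], y[1:])[0, 1]             # lag-one autocorrelation statistic
    sigma_draws = np.abs(rng.normal(1.0, 0.1, 200))               # stand-in posterior draws of sigma
    white_noise = lambda sigma, n: rng.normal(0, sigma, n)        # replicate generator for the model
    print(posterior_predictive_pvalue(y_obs, sigma_draws, white_noise, lag1))   # approximately 0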
As for the indirect “multivariate calibration” approach suggested by some of
the discussants, we point out that it was designed for highly controlled, almost
laboratory-like settings (e.g., chemistry) with very tight causal relationships. The
relationship between temperature and proxies is considerably dissimilar. Further-
more, we believe the two approaches, direct and indirect, ought not differ much
in terms of ŷ, suggesting that “both types of procedures should be able to yield
similar results, else we have reason for skepticism” [Sundberg (1999)]. While one
approach or the other might give better predictions or confidence intervals in this
application or that [a fact that has been observed even in climate settings; ter Braak
(1995)], we believe Sections 4 and 5 of the paper and Section 1 of the Rejoinder
suggest our model is adept at both prediction and interval estimation.
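For readers unfamiliar with the distinction, the two approaches can be sketched with a single proxy (in Python, with synthetic numbers; real reconstructions are multivariate and far noisier). The direct approach regresses temperature on the proxy and predicts; the indirect (classical) approach regresses the proxy on temperature and inverts the fitted line. With a strong signal the two nearly coincide; with a weak signal the direct prediction shrinks toward the calibration mean while the inverted one does not, which is where the two approaches, and their intervals, can part ways.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 149
    y = np.cumsum(rng.normal(0, 0.1, n))            # "temperature" (toy)
    x = 2.0 * y + rng.normal(0, 0.5, n)             # one "proxy" responding to temperature
    x_new = 1.3                                     # a proxy value to be calibrated

    # Direct (inverse regression) approach: regress y on x and predict.
    b_dir = np.polyfit(x, y, 1)                     # [slope, intercept]
    y_direct = np.polyval(b_dir, x_new)

    # Indirect (classical calibration) approach: regress x on y, then invert the fitted line.
    b_ind = np.polyfit(y, x, 1)                     # x is approximately b_ind[0]*y + b_ind[1]
    y_indirect = (x_new - b_ind[1]) / b_ind[0]

    print(y_direct, y_indirect)                     # close here because the signal is strong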
We return to Kaplan’s subtle point that the proxies do not necessarily need to
out-predict ARMA-like models, rather that they must simply provide additional
benefits when added to such models. This is a trenchant point and the dangers of
not evaluating proxy reconstructions in light of ARMA models are illustrated in
Figure 6. The Bayesian AR2 + PC10 model in the upper left and the Bayesian
PC10 model in the upper right provide essentially identical reconstructions. While
the PC10 model has a somewhat smaller total posterior interval, the more striking
feature is the disparity in the decomposition. In the AR2+PC10 model, most of the
FIG. 6. Bayesian backcasts and uncertainty decompositions. In the upper left panel, we re-plot
the Bayesian AR2 + PC10 model from Figure 16 of the paper: CRU Northern Hemisphere annual
mean land temperature is given by the thin black line and a smoothed version is given by the thick
black line. The forecast is given by the thin red line and a smoothed version is given by the thick
red line. The model is fit on 1850–1998 AD and backcasts 998–1849 AD. The cyan region indicates
uncertainty due to εt, the green region indicates uncertainty due to β, and the gray region indicates
total uncertainty. In the upper right panel, we give the same plot for a Bayesian PC10 model with
no AR coefficients. In the bottom panels, we re-plot each model’s backcast and total uncertainty. We
also provide smooths of each posterior reconstruction path in yellow.
uncertainty is due to β as indicated in green; for the PC10 model, the uncertainty
due to εt and β are more equal in their contribution to total uncertainty.
This has dramatic implications for, among other things, smoothed reconstruc-
tions as shown in the bottom two panels. Smoothing has the effect of essentially
eliminating all uncertainty due to εt . Thus, the yellow region in the bottom right
plot is extremely narrow (which explains why confidence intervals in the climate
science literature are typically so narrow; see Figure 17 of the paper). On the other
hand, when an AR2 structure is added, even smoothed confidence bands are quite
wide. This is a profoundly important point which highlights the necessity of mod-
eling the temporal dependence of the NH temperature series and we thank Kaplan
for raising it.
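A small Monte Carlo sketch makes the mechanism transparent (in Python; all numbers are synthetic and iid innovations stand in for εt, so this illustrates the decomposition rather than refitting either model). Smoothing each posterior path with a moving average shrinks the contribution of the annual innovations by roughly the window length but leaves the contribution of the uncertain slope essentially untouched, so smoothed bands remain wide only when parameter uncertainty is present.

    import numpy as np

    rng = np.random.default_rng(6)
    n_draws, n_years, window = 2000, 800, 31        # illustrative sizes only

    # Stand-in posterior paths: an uncertain slope (beta) times a fixed predictor-driven
    # path, plus iid annual innovations standing in for the epsilon_t term.
    x = np.linspace(-1.0, 1.0, n_years)             # deterministic stand-in predictor path
    beta = rng.normal(1.0, 0.2, n_draws)            # posterior draws of the slope
    eps = rng.normal(0, 0.25, (n_draws, n_years))   # posterior draws of the annual noise
    paths = beta[:, None] * x[None, :] + eps

    kernel = np.ones(window) / window
    smooth = lambda P: np.apply_along_axis(lambda p: np.convolve(p, kernel, mode="same"), 1, P)
    inner = slice(window, n_years - window)         # avoid edge effects of the moving average

    width = lambda P: P[:, inner].std(axis=0).mean()   # average pointwise posterior spread
    # With slope uncertainty, smoothing narrows the bands but they remain comparatively wide.
    print(width(paths), width(smooth(paths)))
    # Without slope uncertainty, smoothing collapses the bands almost entirely.
    paths_fixed = 1.0 * x[None, :] + eps
    print(width(paths_fixed), width(smooth(paths_fixed)))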
5. Statistical power. Our results of Section 3 do not depend on the Lasso and
are robust to changes in the null distribution (i.e., the pseudoproxies), the fitting
algorithm, the holdout period length and location, and the calibration target series.
5.1. Tingley. Tingley asserts that the Lasso “is simply not an appropriate tool
for reconstructing paleoclimate” and purports to show this via a simulation study
which has two components. The second of the two components (featured in the
right-hand plots of Tingley Figure 1 and in Tingley Figure 2) does an exemplary
job of showing how the Lasso can use autocorrelated predictors in order to provide
excellent fits of an autocorrelated response series—even when the response and
predictors are generated independently.
The first component of the simulation (featured in the left-hand plots of Tingley
Figure 1) is problematic for a number of reasons. First, it is not clear to us how
this simulation relates to proxy-based reconstruction of temperature. If one com-
pares and contrasts plots of Tingley’s simulated data (see SI) to Figures 5 and 6
of the paper, one sees that his target “temperature” series fails to look like the real
temperature series and his pseudo-proxies fail to look like the real proxies.
Second, Tingley implements the Lasso in a completely nonstandard way: “The
Lasso penalization parameter [λ on page 13 of McShane and Wyner (2011a)] is set
to be 0.05 times the smallest value of λ for which all coefficients are zero.” There is
no apparent statistical justification for this choice, and, when λ is selected through
ten repetitions of five-fold cross-validation (as is done throughout our paper), the
Lasso RMSE is twice as good as in Tingley’s Figure 1 (see SI).
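The contrast between the two choices of penalty is easy to exhibit. The sketch below (in Python with scikit-learn, on synthetic data; scikit-learn's alpha differs from the λ of the paper by a scaling convention) computes the smallest penalty at which all Lasso coefficients are zero, takes 0.05 times that value as in Tingley's discussion, and compares it with a penalty chosen by ten repetitions of five-fold cross-validation.

    import numpy as np
    from sklearn.linear_model import LassoCV, lasso_path
    from sklearn.model_selection import RepeatedKFold

    rng = np.random.default_rng(7)
    n, p = 149, 93                                   # illustrative sizes
    X = rng.standard_normal((n, p))
    y = X[:, :5] @ rng.normal(0.3, 0.1, 5) + rng.normal(0, 0.5, n)   # toy data

    # Fixed choice: 0.05 times the smallest penalty that zeroes out every coefficient.
    alphas, _, _ = lasso_path(X, y)
    lam_max = alphas.max()                           # all coefficients are zero at this penalty
    lam_fixed = 0.05 * lam_max

    # Data-driven choice: penalty selected by ten repetitions of five-fold cross-validation.
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
    lam_cv = LassoCV(cv=cv, max_iter=5000).fit(X, y).alpha_

    print(lam_fixed, lam_cv)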
Third, we must examine the two methods under consideration. This simula-
tion is exactly the kind of situation where the Lasso is known to perform poorly.
When one has identical predictors each with the same coefficient, “the Lasso prob-
lem breaks down” and methods like ridge regression are superior [Zou and Hastie
(2005), Friedman, Hastie and Tibshirani (2010)]. Furthermore, Tingley’s bench-
mark of composite regression is both unrealistically good for this simulation and
utterly nonrobust (it also fails to reject the null tests of Section 3; see
SI).
Composite regression performs unrealistically well because it is a univariate
linear regression of yt on a series which roughly equals yt + νt where the νt are
i.i.d. N(0, σω /√1138) (where σω is set to various levels by Tingley). It is impossible
for any method to perform comparably well against such an ideal procedure (one
that has asymptotically zero RMSE as the number of pseudoproxies goes to infinity).
Ridge regression, known to perform well in settings like this simulation, does 1–6
times worse than composite regression (see SI) and is therefore not much better
than the Lasso (in fact, it is worse for the high noise settings). Even the true data-
generating model for yt with the true parameters—another unrealistically strong
FIG. 7. Tingley simulation perturbed with σβ = 3. We plot the ratio of the RMSE of the Lasso to
that of composite regression. Compare to the left-hand plots of Tingley Figure 1. In the SI, we give the
raw RMSEs of the two methods as well as their ratio for this value of σβ and others.
8 Though it is beyond the scope of our work, we note that, by making use of additional information
(e.g., the spatial locations of the proxies and local temperatures), it is possible that the proxies might
become considerably more predictive/informative than they have so far proven to be.
9 Smerdon samples the temperature grid only once and he samples from the whole globe as op-
posed to either sampling from the NH or using the locations of the Mann et al. (2008) proxies. He
also samples the noise series only once for each setting. Furthermore, he conducts only one repetition
for each holdout block. Finally, he compares the Lasso performance on 283 predictors constructed
from local temperatures to performance on 1138 proxies and noise pseudoproxies, thus lowering p
substantially. We believe consideration of these factors is unlikely to alter the basic picture pre-
sented in the top left panel of Figure 8. However, it would likely increase the variance of the various
boxplots thus making the differences less stark. Moreover, it would be interesting to see the RMSE
distributions from holdout block to holdout block along with intervals for resampled temperatures
and noise pseudoproxies (i.e., in the style of Figure 10 versus Figure 9 of the paper).
FIG. 8. Comparison of Lasso and CPS on various pseudoproxies. In the top left, we give holdout
RMSEs for the Lasso trained using Smerdon’s instrumental period local temperature pseudo-
proxies (this figure replicates Smerdon Figure 1a). In the top right, we give the same plot for CPS. In
the middle and bottom rows, we consider two variants of Smerdon’s pseudoproxies (see text for
more detail). Compare to Smerdon Figure 1a and the SI.
10 The eight right-most boxplots of the top panel of Figure 8 should be reminiscent of Figure 5.
They are identical except the former is based on one repetition for each holdout block whereas the
latter averages over 100 such repetitions thus decreasing variation.
to perform better than these noise pseudoproxies. So it seems that CPS passes
Smerdon’s test. On the other hand, the Lasso performs equivalently to or better
than CPS in all 13 cases. Hence, it does not seem that CPS reconstructions are so
skillful after all since they are worse than those of the Lasso.
As mentioned in the response to Tingley, the Lasso is known to perform poorly
in situations where all the predictors are the same. This is approximately the situ-
ation governing the five leftmost boxplots in the two upper panels of Figure 8. In
contrast, we consider two variants of Smerdon’s simulation. The first variant, plot-
ted in the second row of Figure 8, defines noise percentage differently. Rather than
adding noise to local temperatures such that the variance of the noise accounts for
some fixed percentage of the variance of the sum, we add predictors to the matrix
of local temperatures such that a fixed percentage of the predictors are pure noise
(e.g., for 50% white noise, the matrix of predictors has 566 columns, 283 of local
temperatures and 283 of white noise). The Lasso performs spectacularly well in
this setting and seems fully powered with the local temperatures always outper-
forming various noise series. On the other hand, CPS performs similarly in this
case as in Smerdon’s example: local temperatures outperform noise series but the
predictions on the whole are quite poor.
The second variant of Smerdon’s simulation replicates what we did with Tin-
gley’s simulation. Rather than using local temperature plus noise as predictors, we
used a random slope times local temperature plus noise where the slopes were i.i.d.
N(1, σβ = 3). In some sense, this better reflects the relationship between proxies
and temperature. In this setting, the Lasso again performs very strongly with the
corrupted local temperatures always outperforming various noise series. CPS gives
worse results across the board. For results using different values of σβ , see SI.
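Both variants amount to simple modifications of the predictor matrix, sketched below (in Python; the local temperature fields are synthetic stand-ins with sizes chosen to match the description above, not Smerdon’s actual model fields).

    import numpy as np

    rng = np.random.default_rng(8)
    n, p_temp = 149, 283                              # instrumental years x local temperature series
    local_temps = np.cumsum(rng.normal(0, 0.1, (n, p_temp)), axis=0)   # stand-in grid-cell series

    # Variant 1: a fixed fraction of the *columns* is pure noise (here 50%, giving 566 columns),
    # rather than adding noise of a fixed variance fraction to every column.
    X_variant1 = np.hstack([local_temps, rng.standard_normal((n, p_temp))])

    # Variant 2: each pseudoproxy is a random slope times its local temperature plus noise,
    # with slopes drawn i.i.d. N(1, sd = 3).
    slopes = rng.normal(1.0, 3.0, p_temp)
    X_variant2 = local_temps * slopes + rng.standard_normal((n, p_temp))

    print(X_variant1.shape, X_variant2.shape)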
In sum, when predictors have approximately the same coefficient and there is a
very high noise level (e.g., the 86% and 94% noise conditions of the top left panel
of Figure 8), the Lasso is perhaps underpowered. In variants of the simulation that
might be more true to real data (the middle left and bottom left panels), the Lasso
performs very well. On the other hand, CPS performs weakly in all settings: it
simply does not provide particularly good out of sample predictions compared to
methods like the Lasso.
We are left wondering why the Lasso “fails” Smerdon’s test, suggesting a lack of
power. Low power can be explained by recognizing an unavoidable reality: the NH
temperature sequence is short, highly autocorrelated, and “blocky” (DL observe
“3 or possibly 4 segments”). Thus, the effective sample size is far smaller than
n = 149. Consequently, as is always the case with small samples, one lacks power.
It therefore follows (as shown in Section 2.2 and the figures in the Appendix to
the SI) that failure to reject the null against AR1(Empirical) and Brownian motion
pseudo-proxies is not specific to the Lasso. Rather, it is endemic to the problem.
Unless we can find proxies that strongly predict temperature at an annual level,
power will necessarily be low and uncertainty high.
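A back-of-the-envelope calculation illustrates the severity of the issue. Under an AR1 working model, the effective sample size for estimating a mean is roughly n(1 − ρ)/(1 + ρ); the values of ρ below are purely illustrative and are not estimates from the CRU series.

    # Effective sample size for a mean under AR1 dependence: n_eff = n * (1 - rho) / (1 + rho).
    n = 149
    for rho in [0.5, 0.8, 0.9]:
        print(rho, round(n * (1 - rho) / (1 + rho), 1))
    # rho = 0.9 leaves an effective sample of roughly eight "independent" years.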
SUPPLEMENTARY MATERIAL
Supplement A: Long form rejoinder: A statistical analysis of multiple tem-
perature proxies: Are reconstructions of surface temperatures over the last
1000 years reliable? (DOI: 10.1214/10-AOAS398REJSUPPA; .zip). This docu-
ment is the long form of “Rejoinder: A statistical analysis of multiple temperature
proxies: Are reconstructions of surface temperatures over the last 1000 years reli-
able?” It contains all the text from the short form which appeared in print as well
as the supporting details and figures for the claims made in that document.
Supplement B: Code repository for “Rejoinder: A statistical analysis of
multiple temperature proxies: Are reconstructions of surface temperatures
over the last 1000 years reliable?” (DOI: 10.1214/10-AOAS398REJSUPPB;
.zip). This repository archives all data and code used for “Rejoinder: A statistical
analysis of multiple temperature proxies: Are reconstructions of surface tempera-
tures over the last 1000 years reliable?” In particular, it contains code to make all
figures and tables featured in the long form (which is a superset of those in the
short form).
REFERENCES
Brown, P. J. and Sundberg, R. (1987). Confidence and conflict in multivariate calibration. J. Roy. Statist. Soc. Ser. B 49 46–57. MR0893336
Burger, G. and Cubasch, U. (2005). Are multiproxy climate reconstructions robust? Geophysical Research Letters 32 L23711.
Christiansen, B., Schmith, T. and Thejll, P. (2009). A surrogate ensemble study of climate reconstruction methods: Stochasticity and robustness. Journal of Climate 22 951–976.
Christiansen, B., Schmith, T. and Thejll, P. (2010). Reply. Journal of Climate 23 2839–2844.
Diaconis, P. (1985). Theories of data analysis: From magical thinking through classical statistics. In Exploring Data Tables, Trends and Shapes 1–36. Wiley, New York.
Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. Chapman & Hall, New York.
Feynman, R. (1974). Cargo cult science. Engineering and Science 37 7.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1–22.
Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge Univ. Press, New York.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis, 2nd ed. Chapman & Hall, Boca Raton, FL.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
Izenman, A. J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York. MR2445017
Lee, T. C. K., Zwiers, F. W. and Tsao, M. (2008). Evaluation of proxy-based millennial reconstruction methods. Climate Dynamics 31 263–281.
Li, B., Nychka, D. W. and Amman, C. M. (2010). The value of multi-proxy reconstruction of past climate. J. Amer. Statist. Assoc. 105 883–895.
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley, Hoboken, NJ. MR1925014
Mann, M. E., Bradley, R. E. and Hughes, M. K. (1998). Global-scale temperature patterns and climate forcing over the past six centuries. Nature 392 779–787.
Mann, M. E., Rutherford, S., Wahl, E. and Ammann, C. (2005). Testing the fidelity of methods used in proxy-based reconstructions of past climate. Journal of Climate 18 4097–4107.
Mann, M. E., Rutherford, S., Wahl, E. and Ammann, C. (2007). Robustness of proxy-based climate field reconstruction methods. Journal of Geophysical Research 112 D12109.
Mann, M. E., Zhang, Z., Hughes, M. K., Bradley, R. S., Miller, S. K., Rutherford, S. and Ni, F. (2008). Proxy-based reconstructions of hemispheric and global surface temperature variations over the past two millennia. Proc. Natl. Acad. Sci. USA 105 36.
Maro, P. V. (29BC). Aeneid. Octavius Press, Rome.
McIntyre, S. and McKitrick, R. (2005a). Hockey sticks, principal components, and spurious significance. Geophysical Research Letters 32 L03710.
McIntyre, S. and McKitrick, R. (2005b). Reply to comment by von Storch and Zorita on “Hockey sticks, principal components, and spurious significance.” Geophysical Research Letters 32 L20714.
McShane, B. B. and Wyner, A. J. (2011a). A statistical analysis of multiple temperature proxies: Are reconstructions of surface temperatures over the last 1000 years reliable? Ann. Appl. Stat. 5 5–44.
McShane, B. B. and Wyner, A. J. (2011b). Supplement to “Rejoinder on A statistical analysis of multiple temperature proxies: Are reconstructions of surface temperatures over the last 1000 years reliable?” DOI: 10.1214/10-AOAS398REJSUPPA.
McShane, B. B. and Wyner, A. J. (2011c). Supplement to “Rejoinder on A statistical analysis of multiple temperature proxies: Are reconstructions of surface temperatures over the last 1000 years reliable?” DOI: 10.1214/10-AOAS398REJSUPPB.
NRC (2006). Surface Temperature Reconstructions. The National Academies Press, Washington, DC. Available at http://www.nap.edu/catalog.php?record_id=11676.
Rutherford, S., Mann, M. E., Osborn, T. J., Bradley, R. S., Briffa, K. R., Hughes, M. K. and Jones, P. D. (2005). Proxy-based northern hemispheric surface reconstructions: Sensitivity to method, predictor network, target season, and target domain. Journal of Climate 18 2308–2329.
Rutherford, S., Mann, M. E., Ammann, C. and Wahl, E. (2010). Comments on “A surrogate ensemble study of climate reconstruction methods: Stochasticity and robustness.” Journal of Climate 23 2832–2838.
Smerdon, J., Gonzalez-Rouco, J. F. and Zorita, E. (2008). Comment on “Robustness of proxy-based climate field reconstruction methods” by Michael E. Mann et al. Journal of Geophysical Research — Atmospheres 113 D18106.
Smerdon, J. and Kaplan, A. (2007). Comment on “Testing the fidelity of methods used in proxy-based reconstructions of past climate”: The role of the standardization interval, by M. E. Mann, S. Rutherford, E. Wahl, and C. Ammann. Journal of Climate 20 5666–5670.
Smerdon, J., Kaplan, A. and Chang, D. (2008). On the origin of the standardization sensitivity in RegEM climate field reconstructions. Journal of Climate 21 6710–6723.
Smerdon, J., Kaplan, A. and Amrhein, D. E. (2010). Erroneous model field representations in multiple pseudoproxy studies: Corrections and implications. Journal of Climate 23 5548–5554.
Smerdon, J., Kaplan, A., Chang, D. and Evans, M. (2010). A pseudoproxy evaluation of the CCA and RegEM methods for reconstructing climate fields of the last millennium. Journal of Climate 23 4856–4880.
Sundberg, R. (1999). Multivariate calibration—direct and indirect regression methodology. Scand. J. Statist. 26 161–207. MR1707599
ter Braak, C. (1995). Non-linear methods for multivariate statistical calibration and their use in palaeoecology: A comparison of inverse (k-nearest neighbours, partial least squares and weighted averaging partial least squares) and classical approaches. Chemometrics and Intelligent Laboratory Systems 28 165–180.
Tingley, M. and Huybers, P. (2010). A Bayesian algorithm for reconstructing climate anomalies in space and time. Part I: Development and applications to paleoclimate reconstruction problems. Journal of Climate 23 2759–2781.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320. MR2137327