Amelia Imputation
March 2001
We propose a remedy for the discrepancy between the way political scientists analyze data with missing values and the recommendations of the statistics community. Methodologists and statisticians agree that multiple imputation is a superior approach to the problem of missing data scattered through one's explanatory and dependent variables than the methods currently used in applied data analysis. The discrepancy occurs because the computational algorithms used to apply the best multiple imputation models have been slow, difficult to implement, impossible to run with existing commercial statistical packages, and have demanded considerable expertise. We adapt an algorithm and use it to implement a general-purpose, multiple imputation model for missing data. This algorithm is considerably faster and easier to use than the leading method recommended in the statistics literature. We also quantify the risks of current missing data practices, illustrate how to use the new procedure, and evaluate this alternative through simulated data as well as actual empirical examples. Finally, we offer easy-to-use software that implements all methods discussed.
On average, about half the respondents to surveys do not answer one or more questions analyzed in the average survey-based political science article. Almost all analysts contaminate their data at least partially by filling in educated guesses for some of these items (such as coding "don't know" on party identification questions as "independent"). Our review of a large part of the recent literature suggests that approximately 94% use listwise deletion to eliminate entire observations (losing about one-third of their data, on average) when any one variable remains missing after filling in guesses for some.1 Of course,
Gary King ([email protected], http://GKing.Harvard.Edu) is Professor of Government, Harvard University, and Senior Advisor,
Global Programme on Evidence for Health Policy, World Health
Organization, Center for Basic Research in the Social Sciences,
Harvard University, Cambridge, MA 02138. James Honaker
([email protected], http://www.gov.harvard.edu/graduate/tercer/) is a Ph.D. candidate, Department of Government, Harvard
University, Center for Basic Research in the Social Sciences, and
Anne Joseph ([email protected]) is a Ph.D. candidate in
Political Economy and Government, Harvard University, Cambridge, MA 02138. Kenneth Scheve ([email protected],
http://pantheon.yale.edu/ks298/) is Assistant Professor, Department of Political Science, Institution for Social and Policy Studies,
Yale University, New Haven, CT 06520.
The authors thank Tim Colton and Mike Tomz for participating in
several of our meetings during the early stages of this project; Chris
Achen, Jim Alt, Micah Altman, Mike Alvarez, John Barnard, Larry
Bartels, Neal Beck, Adam Berinsky, Fred Boehmke, Ted Brader,
Charles Franklin, Rob Van Houweling, Jas Sekhon, Brian Silver, Ted
Thompson, and Chris Winship for helpful discussions; Joe Schafer
for a prepublication copy of his extremely useful book; Mike Alvarez,
Paul Beck, John Brehm, Tim Colton, Russ Dalton, Jorge
Domínguez, Bob Huckfeldt, Jay McCann, and the Survey Research
Center at the University of California, Berkeley, for data; and the
National Science Foundation (SBR-9729884), the Centers for Disease Control and Prevention (Division of Diabetes Translation), the
National Institutes on Aging (P01 AG17625-01), and the World
Health Organization for research support. Our software is available
at http://GKing.Harvard.Edu.
1 These data come from our content analysis of five years (1993–97)
TABLE 1.

Assumption                     Acronym   You Can Predict M with:
Missing completely at random   MCAR
Missing at random              MAR       D_obs
Nonignorable                   NI        D_obs and D_mis
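The distinctions in Table 1 can be made concrete with a small simulation. This is our own illustration, not the article's code, and the variable names (income, savings) are invented. Under MCAR, the missingness indicator is unrelated to the data; under MAR, it depends only on observed values, which biases complete-case (listwise) analyses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
income = rng.normal(50, 10, n)           # always observed
savings = income + rng.normal(0, 5, n)   # subject to missingness

# MCAR: every savings value is equally likely to be missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on the observed income variable.
mar_mask = rng.random(n) < np.where(income > 55, 0.5, 0.1)

full_mean = savings.mean()
mcar_mean = savings[~mcar_mask].mean()   # stays close to full_mean
mar_mean = savings[~mar_mask].mean()     # biased downward: high earners drop out
```

Under MCAR the retained cases remain a random subsample, so their mean is unbiased; under MAR the complete cases over-represent low-income respondents, so the complete-case mean of savings is systematically too low even though the missingness never depends on savings itself.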
d = MSE(b_1^L) - MSE(b_1^O) = V(b_1^L) - V(b_1^I) - F[V(b_2^I) + \beta_2 \beta_2']F'.  (1)
\bar{q} = \frac{1}{m} \sum_{j=1}^{m} q_j.  (2)
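Concretely, the combining rule in equation 2, together with the companion variance formula, can be sketched as follows. This is a generic illustration of the standard multiple-imputation combining rules; the function and variable names are ours:

```python
import math

def combine(estimates, std_errors):
    """Combine m completed-data estimates (q_j) and their standard errors
    into one point estimate and one standard error."""
    m = len(estimates)
    q_bar = sum(estimates) / m                       # eq. 2: simple average
    within = sum(se ** 2 for se in std_errors) / m   # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    return q_bar, math.sqrt(within + between * (1 + 1 / m))

q, se = combine([1.1, 0.9, 1.0, 1.2, 0.8], [0.20, 0.21, 0.19, 0.20, 0.20])
# q = 1.0; se is about 0.26, larger than any single within-imputation SE.
```

The combined standard error exceeds each completed-data standard error because the between-imputation spread, inflated by (1 + 1/m) to correct for using a finite m, reflects the extra uncertainty due to the missing values.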
SE(q)^2 = \frac{1}{m} \sum_{j=1}^{m} SE(q_j)^2 + S_q^2 (1 + 1/m).  (3)

An Imputation Model

Implementing multiple imputation requires a statistical model from which to compute the m imputations for each missing value in a data set. Our approach assumes that the data are MAR, conditional on the imputation model. The literature on multiple imputation suggests that in practice most data sets include sufficient information, so that the additional outside information in an application-specific NI model (see Appendix A) will not add much and may be outweighed by the costs of nonrobustness and difficulty of use (Rubin 1996; Schafer 1997). Although this is surely not true in every application, the advantages make this approach an attractive option for a wide range of potential uses. The MAR assumption can also be made more realistic by including more informative variables and information in the imputation process, about which more below. Finally, note that the purpose of an imputation model is to create predictions for the distribution of each of the missing values, not causal explanation or parameter interpretation.

One model that has proven useful for missing data problems in a surprisingly wide variety of situations assumes that the variables are jointly multivariate normal. This model obviously is an approximation, as few data sets have variables that are all continuous and unbounded, much less multivariate normal. Yet, many researchers have found that it works as well as more complicated alternatives specially designed for categorical or mixed data (Ezzati-Rice et al. 1995; Graham and Schafer 1999; Rubin and Schenker 1986; Schafer 1997; Schafer and Olsen 1998). Transformations and other procedures can be used to improve the fit of the model.10 For our purposes, if there exists information in the observed data that can be used to predict the missing data, then multiple imputations from this normal model will almost always dominate current practice. Therefore, we discuss only this model, although the algorithms we discuss might also work for some of the more specialized models as well.

For observation i (i = 1, ..., n), let D_i denote the vector of values of the p (dependent Y_i and explanatory X_i) variables, which if all observed would be distributed normally, with mean vector \mu and variance matrix \Sigma. The off-diagonal elements of \Sigma allow the variables within D to depend on one another. The likelihood function for complete data is

L(\mu, \Sigma \mid D) \propto \prod_{i=1}^{n} N(D_i \mid \mu, \Sigma).  (4)

By assuming the data are MAR, we form the observed data likelihood. The procedure is exactly as for application-specific methods (equations 11 and 12 in Appendix A, where with the addition of a prior this likelihood is proportional to P(\mu, \Sigma \mid D_obs)). We denote D_{i,obs} as the observed elements of row i of D, and \mu_{i,obs} and \Sigma_{i,obs} as the corresponding subvector and submatrix of \mu and \Sigma (which do not vary over i), respectively. Then, because the marginal densities are normal, the observed data likelihood is

L(\mu, \Sigma \mid D_obs) \propto \prod_{i=1}^{n} N(D_{i,obs} \mid \mu_{i,obs}, \Sigma_{i,obs}).  (5)
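Once \mu and \Sigma are estimated, each incomplete row is imputed from the conditional normal distribution of its missing entries given its observed entries. A minimal sketch of that conditional draw, with names of our own choosing and parameters assumed to be already estimated:

```python
import numpy as np

def impute_row(d, mu, sigma, rng):
    """Draw the missing entries of one row d (np.nan marks missingness)
    from the conditional normal of the missing block given the observed
    block: N(mu_m + B(d_o - mu_o), S_mm - B S_om), with B = S_mo S_oo^-1."""
    m = np.isnan(d)
    if not m.any():
        return d.copy()
    o = ~m
    B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
    cond_mean = mu[m] + B @ (d[o] - mu[o])
    cond_cov = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
    out = d.copy()
    out[m] = rng.multivariate_normal(cond_mean, cond_cov)
    return out

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
imputed = impute_row(np.array([2.0, np.nan]), mu, sigma, rng)
# Draws for the second entry center at 0.9 * 2.0 = 1.8, variance 1 - 0.81 = 0.19.
```

Note that the imputation is a random draw from the conditional distribution, not the conditional mean itself; drawing (rather than plugging in the mean) is what lets the m completed data sets reflect imputation uncertainty.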
Computational Algorithms
Computing the observed data likelihood in equation 5,
and taking random draws from it, is computationally
infeasible with classical methods. Even maximizing the
function takes inordinately long with standard optimization routines. In response to such difficulties, the
Imputation-Posterior (IP) and Expectation-Maximization (EM) algorithms were devised and subsequently
applied to this problem.12 From the perspective of
statisticians, IP is now the gold standard of algorithms
for multivariate normal multiple imputations, in large
part because it can be adapted to numerous specialized
models. Unfortunately, from the perspective of users, it
is slow and hard to use. Because IP is based on Markov
Chain Monte Carlo (MCMC) methods, considerable
expertise is needed to judge convergence, and there is
11 Since the number of parameters, p(p + 3)/2, increases rapidly with the number of variables p, priors help avoid overfitting and numerical instability in all the algorithms discussed here.
12 Gelman et al. (1995), Jackman (2000), McLachlan and Krishnan (1997), and Tanner (1996) provide excellent introductions to the literature on these algorithms and on Bayesian methods more generally.
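The EM side of this comparison alternates between filling in expected sufficient statistics under the current (\mu, \Sigma) and re-estimating the parameters from the completed statistics. The following compact implementation for the multivariate normal model is our own illustrative sketch, not the article's code:

```python
import numpy as np

def em_mvnorm(D, n_iter=100):
    """EM estimates of the mean and covariance of a multivariate normal
    when some entries of D (an n-by-p array) are np.nan, assuming MAR."""
    D = np.asarray(D, dtype=float)
    n, p = D.shape
    mu = np.nanmean(D, axis=0)                  # crude starting values
    X0 = np.where(np.isnan(D), mu, D)
    sigma = np.cov(X0, rowvar=False, bias=True) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        X = D.copy()
        C = np.zeros((p, p))                    # accumulated conditional covariances
        for i in range(n):
            m = np.isnan(D[i])
            if not m.any():
                continue
            o = ~m
            B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            X[i, m] = mu[m] + B @ (D[i, o] - mu[o])   # E-step: conditional means
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
        mu = X.mean(axis=0)                           # M-step: re-estimate moments
        sigma = (X - mu).T @ (X - mu) / n + C / n
    return mu, sigma
```

The C term is what distinguishes EM from naive mean-filling: the covariance update adds the conditional variance of the filled-in values, so the estimated \Sigma is not attenuated toward zero by the imputations.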
IR = \frac{L(\theta \mid D_obs)}{N(\theta \mid \hat{\theta}, \hat{V})}.  (9)

P(Q \mid D_obs) = \int P(Q \mid D_obs, D_mis) P(D_mis \mid D_obs) \, dD_mis.  (10)
14 For difficult cases, our software allows the user to substitute the heavier-tailed t for the approximating density. The normal or t with a larger variance matrix, scaled up by some additional factor (1.1–1.5 seems to work well), can also help.
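The importance-resampling step built around the ratio in equation 9 can be sketched generically: draw candidates from the normal approximation, weight each by the ratio of the target density to the approximating density, and resample in proportion to the weights. The toy one-dimensional target below (a Gamma density, with names of our own choosing) stands in for the observed-data posterior:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy skewed "observed-data posterior": Gamma(3, 1), which has mean 3.
# A normal approximation fits it only roughly.
def log_target(theta):
    with np.errstate(invalid="ignore"):
        return np.where(theta > 0, 2 * np.log(theta) - theta, -np.inf)

# Step 1: draw from the normal approximation, with the variance inflated
# a bit (as the footnote above suggests) so the tails are covered.
prop_mean, prop_sd = 2.0, 3.0
draws = rng.normal(prop_mean, prop_sd, 100_000)

# Step 2: importance ratio (eq. 9): target density over approximating density.
log_prop = -0.5 * ((draws - prop_mean) / prop_sd) ** 2
log_w = log_target(draws) - log_prop
log_w -= log_w[np.isfinite(log_w)].max()      # stabilize before exponentiating
w = np.exp(log_w)
w /= w.sum()

# Step 3: resample in proportion to the weights.
resampled = rng.choice(draws, size=20_000, replace=True, p=w)
# resampled now approximates the skewed target despite the normal proposal.
```

Candidates in regions where the approximation is too thin receive larger weights, which is why inflating the proposal variance (or switching to a t) guards against a proposal that misses part of the posterior.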
Practical Suggestions
As with any statistical approach, if the model-based
estimates of EMis are wrong, then there are circumstances in which the procedure will lead one astray. At
the most basic level, the point of inference is to learn
something about facts we do not observe by using facts
we do observe; if the latter have nothing to do with the
former, then we can be misled with any statistical
method that assumes otherwise. In the present context,
our method assumes that the observed data can be
used to predict the missing data. For an extreme counterexample, consider an issue scale with integer responses 1–7, and what you think is a missing value code of 9. If, unbeknownst to you, the "9" is actually an extreme point on the same scale, then imputing values for it based on the observed data and rounding to 1–7 will obviously be biased.17 Of course, in this case
16 Because the imputation and analysis stages are separate, proponents of multiple imputation argue that imputations for public use data sets could be created by a central organization, such as the data provider, so that analysts could ignore the missingness problem altogether. This strategy would be convenient for analysts and can be especially advantageous if the data provider can use confidential information in making the imputations that otherwise would not be available. The strategy is also convenient for those able to hire consultants to make the imputations for them. Others are not enthusiastic about this idea (even if they have the funds) because it can obscure data problems that overlap the two stages and can provide a comforting but false illusion to analysts that missingness problems were "solved" by the imputer (in ways to which analysts may not even have access). The approach also is not feasible for large data sets, such as the National Election Studies, because existing computational algorithms cannot reliably handle so many variables, even in theory. Our alternative but complementary approach is to make the tools of imputation very easy to use and available directly to researchers to make their own decisions and control their own analyses.
17 In this sense, the problem of missing data is theoretically more difficult than ecological inference, for example, since both involve filling in missing cells, but in missing data problems deterministic bounds on the unknown quantities cannot be computed. In practice, dealing with the missing data problem may be relatively easier, since its assumption (that observed data will not drastically mislead in predicting the missing data) is very plausible in most applications.
FIGURE 1.

Note: These graphs show, for one mean parameter, how the correct posterior (marked "IP") is approximated poorly by EM, which matches only the mode, and by EMs when n is small (top left). IP is approximated well by EMs for a larger n (bottom left) and by EMis for both sample sizes (top and bottom right).
FIGURE 2.

Note: This figure plots the average root mean square error for four missing data procedures (listwise deletion, multiple imputation with IP and EMis, and the true complete data) and the five data-generation processes described in the text. Each point in the graph represents the root mean square error averaged over two regression coefficients in each of 100 simulated data sets. Note that IP and EMis have the same root mean square error, which is lower than that of listwise deletion and higher than that of the complete data.
FIGURE 3.

Note: T statistics are given for the constant (b_0) and the two regression coefficients (b_1, b_2) for the MAR-1 run in Figure 2. Listwise deletion gives the wrong results, whereas EMis and IP recover the relationships accurately.
EXAMPLES
We present two examples that demonstrate how
switching from listwise deletion can markedly change
substantive conclusions.
TABLE 2. [title lost] in Russia

             Listwise     Multiple
             Deletion     Imputation
             .06 (.06)    .10 (.04)
             .08 (.08)    .12 (.05)
             .06 (.08)    .12 (.04)
                    Listwise        Multiple
                    Deletion        Imputation
Modern racism       .248* (.046)    .041 (.045)
Individualism       .005 (.047)     .026 (.047)
Antiblack           .011 (.042)     .050 (.045)
Authoritarianism    .068* (.035)
Anti-Semitism       .097 (.047)     .115* (.045)
Egalitarianism      .201* (.049)    .236* (.053)
Ideology            .076 (.054)     .133* (.063)
N                   1,575           2,009
χ²                  8.46            11.21*
p(χ²)               .08             .02
significant.29 This test measures whether their sophisticated analysis model is superior to a simple probit model, and thus whether the terms in the variance model warrant our attention. Under their treatment of missing values, the variance component of the model does not explain the between-respondent variances, which implies that their methodological complications were superfluous. Our approach, however, rejects the simpler probit in favor of the more sophisticated model and explanation.30

CONCLUSION

For political scientists, almost any disciplined statistical model of multiple imputation would serve better than current practices. The threats to the validity of inferences from listwise deletion are of roughly the same magnitude as those from the much better known problems of omitted variable bias. We have emphasized the use of EMis for missing data problems in a survey context, but it is no less appropriate and needed in fields that are not survey based, such as international relations. Our method is much faster and far easier to use than existing multiple imputation methods, and it allows the use of about 50% more information than is currently possible. Political scientists also can jettison the nearly universal but biased practice of making up the answers for some missing values. Although any statistical method can be fooled, including this one, and although we generally prefer application-specific methods when available, EMis normally will outperform current practices. Multiple imputation was designed to make statistical analysis easier for applied researchers, but the methods are so difficult to use that in the twenty years since the idea was put forward it has been applied by only a few of the most sophisticated statistical researchers. We hope EMis will bring this powerful idea to those who can put it to best use.

29 See Meng (1994b) and Meng and Rubin (1992) for procedures and theory for p values in multiply imputed data sets. We ran the entire multiple imputation analysis of m = 10 data sets 100 times, and this value never exceeded 0.038.
30 Sometimes, of course, our approach will strengthen rather than reverse existing results. For example, we also reanalyzed Domínguez and McCann's (1996) study of Mexican elections and found that their main argument (voters focus primarily on the potential of the ruling party and viability of the opposition rather than specific issues) came through stronger under multiple imputation. We also found that several of the results on issue positions that Domínguez and McCann were forced to justify ignoring or attempting to explain away turned out to be artifacts of listwise deletion.
31 We also replicated Dalton, Beck, and Huckfeldt's (1998) analysis of partisan cues from newspaper editorials, which examined a merged data set of editorial content analyses and survey responses. Most missing data resulted from the authors' inability to content analyze the numerous newspapers that respondents reported reading. Because the survey variables contained little information useful for predicting content analyses that were not completed, an MCAR missingness mechanism could not be rejected, and the point estimates did not substantially change under EMis, although confidence intervals and standard errors were reduced. Since Dalton, Beck, and Huckfeldt's analysis was at the county level, it would be possible to gather additional variables from census data and add them to the imputation stage, which likely would substantially improve the analysis.

Application-Specific Approaches

Application-specific approaches usually assume MAR or NI. The most common examples are models for selection bias, such as truncation or censoring (Achen 1986; Amemiya 1985, chap. 10; Brehm 1993; Heckman 1976; King 1989, chap. 7; Winship and Mare 1992). Such models have the advantage of including all information in the estimation, but almost all allow missingness only in or related to Y rather than scattered throughout D.

When the assumptions hold, application-specific approaches are consistent and maximally efficient. In some cases, however, inferences from these models tend to be sensitive to small changes in specification (Stolzenberg and Relles 1990). Moreover, different models must be used for each type of application. As a result, with new types of data, application-specific approaches are most likely to be used by

P(D_obs, M) = \int P(D) P(M \mid D) \, dD_mis  (11)

= P(M \mid D_obs) P(D_obs)  (12)
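A sketch of the standard argument behind equations 11 and 12, which is compressed in the extracted text, may help. Under MAR, P(M | D) depends only on D_obs, so the missingness mechanism factors out of the integral:

P(D_obs, M) = \int P(D) \, P(M \mid D) \, dD_mis
            = P(M \mid D_obs) \int P(D) \, dD_mis    [MAR: P(M \mid D) = P(M \mid D_obs)]
            = P(M \mid D_obs) \, P(D_obs),

so likelihood-based inference about \mu and \Sigma can be carried out from P(D_obs) alone, ignoring the model for M.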
Estimators

Finally, let b^L = A^L Y^L (with b^L = (b_1^{L\prime}, b_2^{L\prime})'), where A^L = (X^{L\prime} X^L)^{-1} X^{L\prime}, and where the superscript L denotes listwise deletion applied to X and Y. So b_1^L is the listwise deletion estimator of \beta_1.

Bias

The infeasible estimator is unbiased, E(b^I) = E(AY) = AX\beta = \beta, and thus bias(b_1^I) = 0. The omitted variable estimator is biased, as per the usual calculation: E(b_1^O) = E(b_1^I + F b_2^I) = \beta_1 + F \beta_2, where each column of F is a vector of coefficients from a regression of a column of X_2 on X_1, so bias(b_1^O) = F \beta_2. If MCAR holds, then listwise deletion is also unbiased, E(b^L) = E(A^L Y^L) = A^L X^L \beta = \beta, and thus bias(b_1^L) = 0.

Variance

V(b_1^O) = V(b_1^I) + F V(b_2^I) F'. Because V(b^L) = V(A^L Y^L) = A^L (\sigma^2 I) A^{L\prime} = \sigma^2 (X^{L\prime} X^L)^{-1}, the variance of the listwise deletion estimator is V(b_1^L) = \sigma^2 (Q^L)^{11}, where (Q^L)^{11} is the upper left portion of the (X^{L\prime} X^L)^{-1} matrix corresponding to X_1^L.

MSE

Putting together the (squared) bias and variance results gives the MSE computations: MSE(b_1^O) = V(b_1^I) + F[\beta_2 \beta_2' + V(b_2^I)]F', and MSE(b_1^L) = \sigma^2 (Q^L)^{11}.

Comparison

In order to evaluate when listwise deletion outperforms the omitted variable bias estimator, we compute the difference d in MSE:

d = MSE(b_1^L) - MSE(b_1^O) = V(b_1^L) - V(b_1^I) - F[V(b_2^I) + \beta_2 \beta_2']F'.  (13)
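The bias expression above, bias(b_1^O) = F \beta_2, is straightforward to verify by simulation. The numbers below (coefficients, noise scales, F) are invented for illustration and are not the article's setup:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 1_000, 500
beta1, beta2 = 1.5, 2.0
F = 0.6                      # coefficient from regressing X2 on X1

b1_omitted = []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = F * x1 + rng.normal(scale=0.5, size=n)   # X2 correlated with X1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    b1_omitted.append((x1 @ y) / (x1 @ x1))       # regress y on x1 alone

bias = np.mean(b1_omitted) - beta1
# bias is close to F * beta2 = 1.2, as bias(b_1^O) = F beta_2 predicts.
```

Setting F = 0 (so X_2 is unrelated to X_1) drives the simulated bias to zero, matching the usual result that omitting an orthogonal regressor leaves b_1 unbiased.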
APPENDIX C. SOFTWARE
To implement our approach, we have written easy-to-use
software, Amelia: A Program for Missing Data (Honaker et al.
1999). It has many features that extend the methods discussed here, such as special modules for high levels of missingness, small n, high correlations, discrete variables, data sets with some fully observed covariates, compositional data (such as for multiparty voting), time-series data, time-series cross-sectional data, t-distributed data (such as data with many outliers), and data with logical constraints. We
intend to add other modules, and the code is open so that
others can add modules themselves.
The program comes in two versions: for Windows and for
GAUSS. Both implement the same key procedures. The
Windows version requires a Windows-based operating system and no other commercial software, is menu oriented and
thus has few startup costs, and includes some data input
procedures not in the GAUSS version. The GAUSS version
requires the commercial program (GAUSS for Unix 3.2.39 or
later, or GAUSS for Windows NT/95 3.2.33 or later), runs on
any computer hardware and operating system that runs the
most recent version of GAUSS, is command oriented, and
has some statistical options not in the Windows version. The
software and detailed documentation are freely available at
http://GKing.Harvard.Edu.