Advanced Handling of Missing Data: One-Day Workshop

Advanced Handling of
Missing Data
One-day Workshop
Nicole Janz
[email protected]
Goals
•  Discuss types of missingness
•  Know advantages & disadvantages of

missing data methods
•  Learn multiple imputation
•  Practical: diagnose, visualize and handle

missing data in R
2
Steps in the research process
1.  Identify patterns of missingness for each variable
2.  Why are data missing? Could this bias your sample?
3.  How do other scholars in your field handle

missingness?
4.  Decide on method to handle missingness for your

particular variables
5.  Robustness: try different missing data methods, run

your analysis, compare the results
3
Proportions of missingness
A SIMPLIFIED per
BIVARIATE TEST GUIDE
variable in a table
variable nmiss n propmiss
country 0 5568 0.00000000

year 0 5568 0.00000000
UN_FDI_flow 477 5568 0.08566810
US_fdi_electrical 1896 5568 0.34051724
US_fdi_machinery 1922 5568 0.34518678
US_fdi_transport 1968 5568 0.35344828
US_fdi_mining 3908 5568 0.70186782
US_fdi_services 3955 5568 0.71030891
US_fdi_petrol 4258 5568 0.76472701
US_fdi_utilities 4984 5568 0.89511494
4
Proportions of missingness
A SIMPLIFIED per
variable in a graph
Proportion of missingness
Petrol/GDP
Mining/GDP
Other FDI/GDP
Deposit./GDP
Finance/GDP
US FDI/GDP
Wh.Trade/GDP
Food/GDP
Chemical/GDP
Metal/GDP
Transp./GDP
Machinery/GDP
Mosley Law
Mosley Prac.
Mosley Labor
Electr./GDP
PTS
Democracy
CIRI Women
CIRI Phys.
CIRI Emp.
CIRI Worker
Trade
GDP p. capita
Population
Conflict
Fariss
Life exp.
Inf.mort.
0.0 0.2 0.4 0.6 0.8 1.0

5
Time series: number
A SIMPLIFIED of years
BIVARIATE TESTwith
GUIDE
existing data
6
Heatmap per country-year
A SIMPLIFIED andGUIDE
BIVARIATE TEST
variable
7
yellow=missing
Why are my data
A SIMPLIFIED missing?
Due to social/natural processes
•  school graduation, dropout, death
•  a country does not exist anymore e.g. GDR
•  statistics office reclassified variables
•  intentional non-disclosure
Skip patterns in surveys

•  E.g. only married respondents are asked certain
follow-up questions
Respondent refusal
•  income
8
Why are my data
A SIMPLIFIED missing?
variable nmiss n propmiss
US_fdi_mining 3908 5568 0.70186782

US_fdi_petrol 4258 5568 0.76472701
US_fdi_utilities 4984 5568 0.89511494
•  Mining FDI is available until 1999

•  Petrol FDI is available from 2000
•  Utilities FDI is a new category was introduced after 2000
9
Three types of missingness
A SIMPLIFIED BIVARIATE TEST GUIDE
1.  MCAR - Missing Completely at Random
2.  MAR - Missing at Random
3.  MNAR Missing not at Random
10
MCAR: Missing Completely
A SIMPLIFIED at Random
Missing value (y) neither depends on x nor y. Probability of
missingness is the same for all units.
Survey respondent decides whether to answer the

“earnings” question by rolling a die and refusing to answer if
a “6” shows up
Some survey questions asked of a simple random sample of

original sample
What to do:
If data are missing completely at random, then throwing out
cases with missing data does not bias your inferences
-> do listwise deletion, then run analysis 11

MAR: Missing at
A SIMPLIFIED RandomTEST GUIDE
BIVARIATE
Probability that a variable is missing depends only observed
data, but not the missing data itself, or unobserved data.
If sex, race, education, and age are recorded for all the
people in the survey, then “earnings” is MAR if the
probability of nonresponse depends only on these variables
If men are more likely to tell you their weight than women,
and we record gender, then weight is MAR.
What to do?
Some say listwise deletion is fine, but only if regression
controls for all variables that affect probability of missingness.
More common: use multiple imputation (MI) because listwise

deletion introduces bias.
12
MNAR: MissingBIVARIATE
A SIMPLIFIED not at Random
TEST GUIDE
(non-ignorable missingness)
Missingness depends at least in part on unobserved factors.
Special case: Missingness depends on variable that is missing
People with college degrees are less likely to reveal their

earnings, we don’t have education data for all respondents
If a particular treatment causes discomfort, a patient is more

likely to drop out of the study. We don’t have a measure for
discomfort for all patients.
Respondents with high income less likely to report income.
13
MNAR: MissingBIVARIATE
A SIMPLIFIED not at Random
TEST GUIDE
(non-ignorable missingness)
What to do?
Most problematic case. Potential lurking variables are often
unobserved.
MI based on auxiliary, external data e.g. estimate race based on

Census data associated with the address of the respondent.
Try to include as many predictors as possible in a model to get

MNAR closer to MAR.
14
How to distinguish
A SIMPLIFIED between
BIVARIATE MNAR
TEST GUIDE
and MAR?
Think about your variables and use your substantive scientific
knowledge of the data and your field.
Can you collect more data that explain missingness, or is it

very likely that they will remain unobserved?
What does the literature say about predictors of that particular

missing variable?
15
How to distinguish
BIVARIATE MAR
TEST GUIDE
and MCAR?
Again, think about the data. Some indication (but no definitive
answer) can be gained from two tests:
1)  Little’s test for MCAR (Little 1988)
Maximum likelihood chi-square test for missing completely at

random. H0 is that the data is MCAR.
If the p value for Little's MCAR test is not significant, then

the data may be assumed to be MCAR and missingness is
ignorable (do listwise deletion).
mcartest in STATA; EM option in SPSS; in R see lab

16
How to distinguish
BIVARIATE MAR
TEST GUIDE
and MCAR?
2. Dummy variable approach for MCAR
create dummy variables for whether a variable is missing:

1 = missing
0 = observed
Run t-tests (continuous) and chi-square (categorical) tests

between this dummy and other variables to see if the
missingness is related to the values of other variables
Tests which return a finding of significance indicate MAR

rather than MCAR (-> use multiple imputation)
(SPSS: MVA option, R see lab) 17

Ad-hoc methods
Listwise deletion (complete case analysis)
Automatically done in regression in most software; or by hand;
assumes MCAR
•  If MAR or MNAR: introduces biased sample
•  reduces sample size
Pairwise deletion (available case analysis)

different aspects of a problem are studied with different
subsets of the data
•  Results between subsets not consistent / comparable
•  if the non-respondents differ systematically from the
respondents, this will bias the available-case summaries
•  Potential omitted variable bias if excludes a complete
variable because its high missingness
18
Ad-hoc methods
Last value carried forward
replace missing outcome values with pre-treatment measure
•  would lead to underestimates of the true treatment effect
•  ignores changes over time
Mean imputation
easiest way to impute is to replace each NA with the mean
•  distorts distribution for this variable, e.g. underestimates sd
•  ignores changes over time
Filling in values manually based on case-based

knowledge from other sources
•  time-consuming
•  prone to measurement error
19
Single imputation
Impute missing values from predicted values results from
regression
•  the error in these cases becomes zero. However, random

errors are a feature of the real world and one variable
treated with single imputation will be fundamentally different
from the other variables.
•  leads to overconfidence in our models and biases the
coefficients upwards
20
Multiple Imputation
A SIMPLIFIED Techniques
Multiple imputation (MI) is also based on the idea of using
predicted values, but it builds in mechanisms to incorporate
uncertainty about the predicted values.
MI imputes values for each missing data point, but it does so n

times (usually 5). It then creates n (5) completed data sets.
The observed values remain the same, but the imputed value
varies across these 5 data sets, reflecting uncertainty.
MI is much closer to reality when calculating new values.
MI is a good alternative to listwise deletion because the main

assumption is that data are MAR, meaning that some other
variables in the data set may (and should) explain why an 21
observation is missing
Multiple Imputation
A SIMPLIFIED Techniques
Figure: https://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf
22
Details on expectation maximization (EM) algorithm, see King et al. (2001).
Combination ofBIVARIATE
A SIMPLIFIED results TEST GUIDE
Run each analysis (e.g. regression) on all 5 imputed data
sets.
Collect all 5 coefficients and standard errors (and other

measures of interest), and combine them into one estimate
according to Rubin’s Rule (1987):
•  Estimates: average of the individual estimates
•  Standard error: combine between-imputation variance and

within-imputation variance
23
See King et al. (2001).
Multiple Imputation
A SIMPLIFIED Software
{Amelia} in R (by Gary King and collaborators)
{mi} in R (by Andrew Gelman and collaborators)
{mice} in R (by Stef van Buuren and

collaborators)
SPSS (Analyze > Multiple Imputation)
STATA mi estimate
24
Social Sciences Research
Methods Centre
Lab
Summarizing and
A SIMPLIFIED Visualizing
Missingness in R
% of missingness per variable and subsets of variables
Graphical display
Using Amelia for diagnosis of missingness
26
MCAR patterns?
1)  BaylorEdPsych (Little’s Test to diagnose MCAR)

https://cran.r-project.org/web/packages/BaylorEdPsych/
BaylorEdPsych.pdf
2) Creating a dummy variable for missingness 0/1, then

running correlations among variables
27
Ad-hoc measures
A SIMPLIFIED in R TEST GUIDE
BIVARIATE
1)  Listwise deletion, pairwise
deletion
2) Carry last value forward
3) Mean imputation
4) Manually recoding particular

variables
5) Replace NAs with predicted

values from regression
28
Example 1
Adapted from Schlomer et al. (2010)
60 clients under age 21 years at a large university counseling

center were referred for counseling by the dean of students
due to underage drinking violations. The counseling center
randomly assigned the students to one of two treatment
programs (independent variable: Group), one of which uses
the harm reduction approach, and the other of which is based
on a 12-step model. Participants’ self-efficacy for sobriety
was measured before (covariate) and after (dependent
variable) the counseling.
•  7 variations of the DV: DV with no missing; DV with 10%,

20%, and 50% MCAR, and DV with 10%, 20%, and 50%
MAR
29
Example 1
Goal: Compare biases in estimates of mean, standard

deviation, regression coefficient, and standard error when the
DV has 20% missing at random with when the DV has 0%
missing using different missing data handling techniques.
Step 1:
Calculate M, SD, B, and SE with DV0Miss
Step 2:
Create the target data set with DV20MAR
30
Example 1
Describe missing patterns
Summarize and Little's (1998) MCAR Dummy code
visualize missingness test missingness
Ad-hoc methods
Delete listwise Carry last Substitute Recode Predict from
or pairwise value forward with mean manually regression
Multiple imputation
Amelia II
31
Multiple Imputation
A SIMPLIFIED with Amelia
BIVARIATE II
TEST GUIDE
How to run an imputation in R incl diagnostics
- run Amelia on a data set

- saving an imputed data set
- combining several data into an amelia object
- how to deal with ordinal, nominal, natural log data
- time series cross-section
- lags and leads
- overimputation
- time series plots
32
Reproducibility
•  Set seed (!!!) – for yourself and others
When you re-run Amelia after diagnostics and want to

make changes, it’s best to re-use exactly what you had
with minimal changes
•  Work in R, not the GUI version
•  Keep your Rscript well commented; make a note of

sessionInfo(), especially the Amelia and R version used
33
Reproducibility
On 12/4/2012 5:40 AM, Nicole Janz wrote:
Dear _____,
I'm a PhD student at Cambridge University, and I work on foreign
investment and labor standards. I read your with great interest. I
was wondering if you could make the imputation Rcode available
to me? I am asking this because I am using Amelia as well, and I
would like to try and replicate your imputation with the same
specifications.
Hi Nicole - Thanks for the note. Unfortunately, we did this in

AmeliaView, so we don't have R code available (I assume you've
found the replication data and Stata code on my website). 34
More practical tips
•  Set the seed!
•  Include any variable in the analysis model in your imputation

model. Maybe use auxiliary variables if they make sens.
•  Include variables in the form they enter the model (lags, logs,
leads, transformations).
•  Don’t impute things that don’t make sense! Don’t impute

decades of missing data.
•  Check diagnostics
35
Literature and tutorials
Amelia mailing list
https://lists.gking.harvard.edu/mailman/listinfo/amelia
Tutorial for three MI software packages by Thomas Leeper

http://thomasleeper.com/Rcourse/Tutorials/mi.html
MISSING VALUES ANALYSIS & DATA IMPUTATION

http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf
James Honaker and Gary King, What to do About Missing Values in Time
Series Cross-Section Data American Journal of Political Science Vol. 54,
No. 2 (April, 2010): Pp. 561-581.
Gary King, James Honaker, Anne Joseph, and Kenneth Scheve. Analyzing
Incomplete Political Science Data: An Alternative Algorithm for Multiple
Imputation, American Political Science Review, Vol. 95, No. 1 (March,
2001): Pp. 49-69.
36
Literature and tutorials
Andrew Gelman and Jeniffer Hill, Data Analysis Using Regression and
Multilevel/Hierarchical Models, CHAPTER 25: Missing-data imputation.
Cambridge University Press, Cambridge (2006).
Much Ado About Nothing: A Comparison of Missing Data Methods and

Software to Fit Incomplete Data Regression Models
www.math.smith.edu/~nhorton/muchado.pdf
Allison, Paul D. 2001. Missing Data. Sage University Papers Series on

Quantitative Applications in the Social Sciences. Thousand Oaks: Sage.
Enders, Craig. 2010. Applied Missing Data Analysis. Guilford Press: New
York.
Little, Roderick J., Donald Rubin. 2002. Statistical Analysis with Missing
Data. John Wiley & Sons, Inc: Hoboken.
Schafer, Joseph L., John W. Graham. 2002. “MissingData: Our View of the
State of the Art.” Psychological Methods. 37
Thank you !
Nicole Janz
www.nicolejanz.de

Advanced Handling of Missing Data: One-Day Workshop

Uploaded by

Copyright:

Available Formats

Advanced Handling of Missing Data: One-Day Workshop

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Advanced Handling of Missing Data: One-Day Workshop

Uploaded by

Copyright:

Available Formats

Advanced Handling of

• Discuss types of missingness

• Know advantages & disadvantages of

• Learn multiple imputation

• Practical: diagnose, visualize and handle

3. How do other scholars in your field handle

4. Decide on method to handle missingness for your

5. Robustness: try different missing data methods, run

country 0 5568 0.00000000

0.0 0.2 0.4 0.6 0.8 1.0

Skip patterns in surveys

US_fdi_mining 3908 5568 0.70186782

• Mining FDI is available until 1999

1. MCAR - Missing Completely at Random

2. MAR - Missing at Random

3. MNAR Missing not at Random

Survey respondent decides whether to answer the

Some survey questions asked of a simple random sample of

-> do listwise deletion, then run analysis 11

More common: use multiple imputation (MI) because listwise

People with college degrees are less likely to reveal their

If a particular treatment causes discomfort, a patient is more

Respondents with high income less likely to report income.

MI based on auxiliary, external data e.g. estimate race based on

Try to include as many predictors as possible in a model to get

Can you collect more data that explain missingness, or is it

What does the literature say about predictors of that particular

1) Little’s test for MCAR (Little 1988)

Maximum likelihood chi-square test for missing completely at

If the p value for Little's MCAR test is not significant, then

mcartest in STATA; EM option in SPSS; in R see lab

create dummy variables for whether a variable is missing:

Run t-tests (continuous) and chi-square (categorical) tests

Tests which return a finding of significance indicate MAR

(SPSS: MVA option, R see lab) 17

Pairwise deletion (available case analysis)

Filling in values manually based on case-based

• the error in these cases becomes zero. However, random

MI imputes values for each missing data point, but it does so n

MI is much closer to reality when calculating new values.

MI is a good alternative to listwise deletion because the main

Collect all 5 coefficients and standard errors (and other

• Estimates: average of the individual estimates

• Standard error: combine between-imputation variance and

{Amelia} in R (by Gary King and collaborators)

{mi} in R (by Andrew Gelman and collaborators)

{mice} in R (by Stef van Buuren and

SPSS (Analyze > Multiple Imputation)

Using Amelia for diagnosis of missingness

1) BaylorEdPsych (Little’s Test to diagnose MCAR)

2) Creating a dummy variable for missingness 0/1, then

2) Carry last value forward

4) Manually recoding particular

5) Replace NAs with predicted

60 clients under age 21 years at a large university counseling

• 7 variations of the DV: DV with no missing; DV with 10%,

Goal: Compare biases in estimates of mean, standard

How to run an imputation in R incl diagnostics

- run Amelia on a data set

•  Discuss types of missingness

•  Know advantages & disadvantages of

•  Learn multiple imputation

•  Practical: diagnose, visualize and handle

3.  How do other scholars in your field handle

4.  Decide on method to handle missingness for your

5.  Robustness: try different missing data methods, run

•  Mining FDI is available until 1999

1.  MCAR - Missing Completely at Random

2.  MAR - Missing at Random

3.  MNAR Missing not at Random

1)  Little’s test for MCAR (Little 1988)

•  the error in these cases becomes zero. However, random

•  Estimates: average of the individual estimates

•  Standard error: combine between-imputation variance and

1)  BaylorEdPsych (Little’s Test to diagnose MCAR)

•  7 variations of the DV: DV with no missing; DV with 10%,

•  Set seed (!!!) – for yourself and others

•  Work in R, not the GUI version

•  Keep your Rscript well commented; make a note of

•  Set the seed!

•  Include any variable in the analysis model in your imputation

•  Don’t impute things that don’t make sense! Don’t impute