Data Cleaning

Jason W. Osborne
North Carolina State University
In quantitative research, it is critical to perform data cleaning to ensure that the conclusions drawn from the
data are as generalizable as possible, yet few researchers report doing so (Osborne JW. Educ Psychol.
2008;28:1-10). Extreme scores are a significant threat to the validity and generalizability of the results. In this
article, I argue that researchers need to examine extreme scores to determine which of many possible causes
contributed to the extreme score. From this, researchers can take appropriate action, which has many
laudatory effects, from reducing error variance and improving the accuracy of parameter estimates to reducing
the probability of errors of inference.
Most authors of peer-reviewed journal articles go to great lengths to describe their study, the research methods, the sample, the statistical analyses used, the results, and the conclusions based on those results. However, few seem to mention data cleaning (which can include screening for extreme scores, missing data, normality, etc). To be sure, some researchers do check their data for these things (and may neglect to report having done so), but when Osborne1 examined 2 years' worth of empirical articles in top-tier Educational Psychology journals, none explicitly discussed any data cleaning. There is no reason to believe that the situation is different in other disciplines.

The goal of this article is to discuss the issue of extreme scores, which can dramatically increase the risk of errors of inference, problems with generalizability (biased estimates), and suboptimal power (some "robust" procedures and nonparametric tests are incorrectly considered to be immune from these sorts of issues; however, even robust and nonparametric tests benefit from clean data2,3). Specifically, this article highlights why it is critical to screen data for extreme scores and offers specific suggestions for how to deal with them.

What Are Extreme Scores and Why Do We Care About Them?

An extreme score, or data point far outside the normal distribution for a variable or population,4-6 has also been described as an observation that "deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism."7 Arguably, if an extreme score has its origins in a different mechanism or population, it does not belong in your analysis. Outliers have also been defined as values that are "dubious in the eyes of the researcher"8 and as contaminants,9 all of which lead to the same conclusion.

So Why Do We Care About Extreme Values?

Extreme values can cause serious problems for statistical analyses. First, they generally serve to increase error variance and reduce the power of statistical tests. Second, if nonrandomly distributed, they can substantially alter the odds of making both type I and type II errors. Third, they can seriously bias or influence estimates that may be of substantive interest because they may not be generated by the population of interest.2,5,10

What Is an Extreme Score?

There is as much controversy over what constitutes an extreme score as over whether to remove extreme scores at all. It is always a good idea to visually inspect data before any other analysis. Simple rules of thumb (eg, data points 3 or more SDs from the mean) are good starting points, unless the sample is particularly small.11,12 I recommend examining scores at or beyond 3 SDs from the mean because, in a normally distributed population, the probability of an individual falling more than 3 SDs from the mean by random chance alone is 0.26%. Because of this, we have a strong basis for suspecting that data points beyond ±3 SD from the mean are not generated by the population of interest and as such should be dealt with in some fashion.
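As a concrete illustration of the ±3 SD rule of thumb, the short Python sketch below flags values that fall more than 3 SDs from the sample mean. The data, variable names, and default threshold are illustrative assumptions for the example, not part of the original analyses.

import numpy as np

def flag_extreme_scores(values, threshold=3.0):
    # Return a boolean mask marking values more than `threshold` SDs from the mean.
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)  # z-scores using the sample SD
    return np.abs(z) > threshold

# Example: a roughly normal variable with two implausible entries appended.
rng = np.random.default_rng(42)
scores = np.concatenate([rng.normal(50, 10, size=200), [120.0, -15.0]])
print(np.where(flag_extreme_scores(scores))[0])  # indices of suspect data points

Flagged cases are candidates for inspection, not automatic deletion; the appropriate action depends on the inferred cause, as discussed below.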
Bivariate and multivariate outliers are typically identified using an index of influence, leverage, or distance. Popular indices include Mahalanobis distance and Cook's D, both of which are frequently used to quantify the leverage (influence) that specific cases may exert on the predicted values of a regression line.13 Standardized or studentized residuals in regression and analysis of variance (ANOVA)-type analyses can also help identify within-group outliers; the z = ±3 rule of thumb works well for standardized residuals as well.
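To make these diagnostics concrete, the following sketch computes Mahalanobis distances, Cook's D, and studentized residuals with NumPy, SciPy, and statsmodels. The simulated data, the planted outliers, and the chi-square cutoff are illustrative assumptions rather than recommendations from the article.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two predictors
X[:2] += np.array([5.0, -5.0])                      # plant two multivariate outliers in X
y = X @ np.array([1.5, -0.8]) + rng.normal(size=200)
y[:2] += 10                                         # make the same cases influential on the fit

# Mahalanobis distance of each case from the multivariate centroid of X.
center = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distances
cutoff = stats.chi2.ppf(0.999, df=X.shape[1])       # one common (illustrative) cutoff
print("Mahalanobis flags:", np.where(d2 > cutoff)[0])

# Cook's D and studentized residuals from an ordinary least squares fit.
fit = sm.OLS(y, sm.add_constant(X)).fit()
influence = fit.get_influence()
cooks_d, _ = influence.cooks_distance
print("Largest Cook's D at case:", int(np.argmax(cooks_d)))
print("Cases with |studentized residual| > 3:",
      np.where(np.abs(influence.resid_studentized_external) > 3)[0])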
What Causes Extreme Scores and What Should We Do About Them?

Extreme scores can arise from (at least) six possible causes. First, note that not all extreme scores are illegitimate contaminants, and not all illegitimate scores show up as extreme scores.14 It is therefore important to consider the range of causes that may be responsible for an extreme score. The inferred cause can then inform what action a researcher should take with a given extreme score.

Extreme Scores From Data Errors

Extreme scores are often caused by errors in data collection, recording, or entry. Data from an interview or survey can be recorded incorrectly or mis-keyed upon data entry (eg, a survey respondent reporting a yearly wage rather than an hourly wage). Errors of this nature can often be corrected by returning to the original documents, recalculating or inferring the correct response, or recontacting the original participant. This can save important data and eliminate a problematic extreme score.
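As a small illustration of catching this kind of entry error before analysis, the sketch below applies a simple plausibility check to a wage variable so that suspect entries can be traced back to the source documents. The column names, values, and plausible range are hypothetical choices for the example, not values from the article.

import pandas as pd

# Hypothetical survey data: one respondent appears to have entered a yearly wage
# in the hourly-wage field.
survey = pd.DataFrame({"respondent_id": [1, 2, 3, 4],
                       "hourly_wage": [14.50, 22.00, 41000.00, 18.75]})

# Flag values outside a plausible range so they can be checked against the
# original documents or the participant rather than silently analyzed.
plausible = survey["hourly_wage"].between(2.00, 500.00)
print(survey.loc[~plausible])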
Extreme Scores From Intentional or Motivated Misreporting

Motivated misreporting by research participants is a long-discussed source of bias in data. A participant may make a conscious effort to sabotage the research,15 or may be acting from social desirability or self-presentation motives. Identifying and reducing this issue is difficult unless researchers take care to triangulate or validate data in some manner. Osborne and Blanchard16 summarize several approaches to identifying response sets such as this. If you suspect motivated misresponding in your data, you should probably remove that participant, because the data are being influenced by more than the phenomena you wish to examine.

Extreme Scores From Standardization Failure

Perhaps unexpectedly, extreme scores can also be caused by the research methodology itself, particularly if something anomalous happened during a particular subject's experience. Unusual phenomena such as construction noise outside a research laboratory, an experimenter feeling particularly grouchy, events outside the context of the research laboratory (such as a student protest or a rape or murder on campus), observations in a classroom the day before a big holiday recess, and so on can produce outliers. Faulty or noncalibrated equipment is another common cause of extreme scores.

Let us consider two possible cases in relation to this source of outliers. We might have a piece of equipment in our laboratory that was miscalibrated, yielding measurements that were extremely different from other days' measurements. If the miscalibration results in a fixed change to the scores that is consistent or predictable across all measurements (eg, all measurements are off by 100), then adjustment of the scores is appropriate. If there is no clear way to defensibly adjust the measurements, they must be discarded.

Extreme Scores From Faulty Distributional Assumptions

Incorrect assumptions about the distribution of the data can also lead to the presence of suspected outliers.18 Blood sugar levels, disciplinary referrals, scores on classroom tests for which students are well prepared, and self-reports of low-frequency behaviors (eg, the number of times a student has been suspended or held back a grade) may give rise to highly nonnormal distributions. These distributions may look like they have a substantial number of extreme scores, but after transformation(s) to improve normality,19 it might be the case that few, if any, of the data points are subsequently identified as outliers.

The data presented in Fig 1, from 180 students taking an examination in an undergraduate psychology class, show a highly skewed distribution with a mean of 87.50 and an SD of 8.78. Although one could argue that the lowest scores on this test are outliers because they are more than 3 SDs below the mean, a transformation to improve normality might leave few, if any, of these scores flagged as extreme.
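The sketch below illustrates this point with synthetic data: a skewed variable shows several |z| > 3 cases before transformation and few, if any, afterward. The use of a log transformation and the lognormal example data are my own illustrative assumptions; the article does not prescribe a specific transformation.

import numpy as np

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=2.0, sigma=0.6, size=500)   # highly skewed synthetic variable

def n_flagged(x, threshold=3.0):
    z = (x - x.mean()) / x.std(ddof=1)
    return int(np.sum(np.abs(z) > threshold))

print("Flagged before transformation:", n_flagged(skewed))
print("Flagged after log transformation:", n_flagged(np.log(skewed)))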
Because extreme scores can unduly influence conventional estimates such as the mean and SD, researchers often turn to robust or "high breakdown" methods to provide alternative estimates of these important aspects of the data.

A common robust estimation method for univariate distributions involves the use of a trimmed mean, which is calculated by temporarily eliminating extreme observations at both ends of the sample.25 Alternatively, researchers may choose to compute a Winsorized mean, for which the highest and lowest observations are temporarily censored and replaced with the adjacent values from the remaining data.14

Assuming that the distribution of prediction errors is close to normal, several common robust regression techniques can help reduce the influence of outlying data points. The least trimmed squares and least median of squares estimators are conceptually similar to the trimmed mean, helping to minimize the scatter of the prediction errors by eliminating a specific percentage of the largest positive and negative outliers,26 whereas Winsorized regression smooths the Y data by replacing extreme residuals with the next closest value in the data set.27
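For the univariate estimators just described, SciPy provides ready-made implementations; the sketch below computes a trimmed mean and a Winsorized mean for a contaminated sample. Note that the robust regression shown is a Huber M-estimator from statsmodels, a different robust technique than the least trimmed squares, least median of squares, or Winsorized regression named above; the trimming proportions and synthetic data are also illustrative assumptions.

import numpy as np
import statsmodels.api as sm
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(2)
sample = np.concatenate([rng.normal(100, 15, size=98), [290.0, 310.0]])  # two gross outliers

print("Ordinary mean:       ", round(float(sample.mean()), 2))
print("10% trimmed mean:    ", round(float(stats.trim_mean(sample, proportiontocut=0.10)), 2))
print("Winsorized (5%) mean:", round(float(winsorize(sample, limits=[0.05, 0.05]).mean()), 2))

# statsmodels does not ship least trimmed squares, but a Huber M-estimator
# (a different robust regression approach) is readily available:
X = sm.add_constant(rng.normal(size=100))
y = 2 + 3 * X[:, 1] + rng.normal(size=100)
y[:3] += 40                                   # plant a few extreme residuals
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("Huber slope estimate:", round(float(robust_fit.params[1]), 2))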
Many options exist for the analysis of nonideal variables. In addition to the options mentioned above, analysts can choose nonparametric analyses, because these types of analyses have few, if any, distributional assumptions, although research by Zimmerman3,28 points out that even nonparametric analyses suffer from outlier cases.

To illustrate what extreme scores can do, consider a variable from the National Centers for Educational Statistics NELS 88 data set, measured in a population of more than 24 000 students. The population distribution shows strong normality, with a skew of −0.001 (0.00 is perfectly symmetrical; as depicted in Fig 2). Samples from this distribution should share these distributional traits, especially large samples. Yet in a relatively large sample of n = 416 that included 4% extreme scores on one side of the distribution (high-poverty students), the distributional properties changed substantially (as depicted in Fig 3): the skew is now −2.18, and substantial error has been added to the variable (the SD increased 56%). It is clear that those 16 students at the very bottom of the distribution do not belong to the normal population of interest. Removal of these outliers returned the distribution to a mean of −0.02, SD = 0.78, and skew = 0.01, values not significantly different from those of the original population of over 24 000.

Osborne and Overbay24 performed similar simulations of the effects of small numbers of outliers on repeated samples from a known population in the context of correlation and ANOVA-type analyses. The effects were striking.
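A minimal simulation in the same spirit (using synthetic normal data rather than the NELS 88 data) shows how a handful of one-sided extreme scores inflates the SD and skew of a sample and how removing them restores the sample's resemblance to the population; all numbers here are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
clean = rng.normal(loc=0.0, scale=0.8, size=400)                          # sample from a normal population
contaminated = np.concatenate([clean, rng.normal(-4.0, 0.5, size=16)])    # ~4% low extreme scores

for label, x in [("clean", clean), ("contaminated", contaminated)]:
    print(f"{label:13s} mean={x.mean():6.2f}  SD={x.std(ddof=1):5.2f}  skew={stats.skew(x):6.2f}")

# Drop cases beyond |z| = 3 (computed against the contaminated sample itself).
z = (contaminated - contaminated.mean()) / contaminated.std(ddof=1)
trimmed = contaminated[np.abs(z) <= 3]
print(f"after removal mean={trimmed.mean():6.2f}  SD={trimmed.std(ddof=1):5.2f}  skew={stats.skew(trimmed):6.2f}")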
Fig 4. Correlation of SES and achievement, with 4% outliers.

Table 2. The effects of outliers on t test analyses.

Condition | n | Mean difference (with outliers) | Mean difference (cleaned) | t | % more accurate | Mean t (with outliers) | Mean t (cleaned) | t | % errors of inference (with outliers) | % errors of inference (cleaned) | t
Equal group means, outliers in one cell | 52 | 0.34 | 0.18 | 3.70‡ | 66.0 | −0.20 | −0.12 | 1.02 | 2.0 | 1.0 | <1
Equal group means, outliers in one cell | 104 | 0.22 | 0.14 | 5.36‡ | 67.0 | 0.05 | −0.08 | 1.27 | 3.0 | 3.0 | <1
Equal group means, outliers in one cell | 416 | 0.09 | 0.06 | 4.15‡ | 61.0 | 0.14 | 0.05 | 0.98 | 2.0 | 3.0 | <1
Equal group means, outliers in both cells | 52 | 0.27 | 0.19 | 3.21‡ | 53.0 | 0.08 | −0.02 | 1.15 | 2.0 | 4.0 | <1
Equal group means, outliers in both cells | 104 | 0.20 | 0.14 | 3.98‡ | 54.0 | 0.02 | −0.07 | 0.93 | 3.0 | 3.0 | <1
Equal group means, outliers in both cells | 416 | 0.15 | 0.11 | 2.28⁎ | 68.0 | 0.26 | 0.09 | 2.14⁎ | 3.0 | 2.0 | <1
Unequal group means, outliers in one cell | 52 | 4.72 | 4.25 | 1.64 | 52.0 | 0.99 | 1.44 | −4.70‡ | 82.0 | 72.0 | 2.41†
Unequal group means, outliers in one cell | 104 | 4.11 | 4.03 | 0.42 | 57.0 | 1.61 | 2.06 | −2.78† | 68.0 | 45.0 | 4.70‡
Unequal group means, outliers in one cell | 416 | 4.11 | 4.21 | −0.30 | 62.0 | 2.98 | 3.91 | −12.97‡ | 16.0 | 0.0 | 4.34‡
Unequal group means, outliers in both cells | 52 | 4.51 | 4.09 | 1.67 | 56.0 | 1.01 | 1.36 | −4.57‡ | 81.0 | 75.0 | 1.37
Unequal group means, outliers in both cells | 104 | 4.15 | 4.08 | 0.36 | 51.0 | 1.43 | 2.01 | −7.44‡ | 71.0 | 47.0 | 5.06‡
Unequal group means, outliers in both cells | 416 | 4.17 | 4.07 | 1.16 | 61.0 | 3.06 | 4.12 | −17.55‡ | 10.0 | 0.0 | 3.13‡

One hundred samples were drawn for each row. Outliers were actual members of the population who scored at least z = ±3 on the relevant variable. Data are from the National Centers for Educational Statistics NELS 88 data set.
⁎P < .05. †P < .01. ‡P < .001.
The Effect of Extreme Scores on Correlations and Regression

As Table 1 demonstrates, outliers had adverse effects on correlations. Removal of the outliers produced more accurate (ie, closer to the known "population" correlation) estimates of the population correlation 70% to 100% of the time. Furthermore, in most cases, errors of inference were significantly less common with cleaned than with uncleaned data (between 89.7% and 100% of errors of inference were eliminated for all but the largest data sets, which had few errors of inference).

As Fig 4 shows, a few randomly chosen outliers in a sample of 100 can cause substantial misestimation of the population correlation. In the sample of almost 24 000 students, these two variables were correlated very strongly, r = 0.46. In this particular example, the correlation with 4% outliers in the analysis was r = 0.16 and was not significant, whereas after removal of the extreme scores, the correlation closely estimated the expected magnitude (r = 0.48).
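The sketch below reproduces the flavor of this demonstration with synthetic data rather than the SES and achievement variables: a correlated sample is contaminated with 4% extreme scores and the correlation is computed before and after cleaning. The specific values will differ from those reported above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.9, size=n)            # moderately strong true relationship

# Contaminate 4% of cases with extreme, pattern-breaking values.
idx = rng.choice(n, size=4, replace=False)
x_dirty, y_dirty = x.copy(), y.copy()
x_dirty[idx] += 6
y_dirty[idx] -= 6

r_dirty, _ = stats.pearsonr(x_dirty, y_dirty)
keep = (np.abs(stats.zscore(x_dirty)) <= 3) & (np.abs(stats.zscore(y_dirty)) <= 3)
r_clean, _ = stats.pearsonr(x_dirty[keep], y_dirty[keep])
print(f"r with 4% outliers: {r_dirty:.2f};  r after removing |z| > 3 cases: {r_clean:.2f}")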
The Effect of Outliers on t Tests and ANOVAs

The second example deals with analyses that examine group mean differences, such as t tests and ANOVA. For simplicity, these analyses are simple t tests, but the results generalize easily to more complex analyses such as ANOVA. Two different conditions were examined: one in which there were no significant differences between the groups in the population (sex differences in SES produced a mean group difference of 0.0007 with an SD of 0.80 and, with 24 501 df, a t of 0.29) and one in which there were significant group differences in the population (sex differences in mathematics achievement test scores produced a mean difference of 4.06 with an SD of 9.75 and, with 24 501 df, a t of 10.69, P < .0001).

The results in Table 2 again illustrate the expected effects of outliers on t test analyses. Removal of outliers had beneficial effects, in that the results tended to become more like the population: in both conditions, group differences and t statistics became more accurate in most of the samples.
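In the same spirit, the brief sketch below plants a few extreme scores in one group of a simulated two-group comparison and contrasts the resulting t statistics; the group sizes, effect size, and outlier values are illustrative assumptions rather than the conditions used in the simulations above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(loc=0.0, scale=1.0, size=52)
group_b = rng.normal(loc=0.6, scale=1.0, size=52)       # true group difference of 0.6 SD

group_a_dirty = group_a.copy()
group_a_dirty[:2] += 8                                   # two extreme scores in one cell

t_dirty, p_dirty = stats.ttest_ind(group_a_dirty, group_b)
keep = np.abs(stats.zscore(group_a_dirty)) <= 3
t_clean, p_clean = stats.ttest_ind(group_a_dirty[keep], group_b)
print(f"with outliers:  t = {t_dirty:.2f}, p = {p_dirty:.3f}")
print(f"after cleaning: t = {t_clean:.2f}, p = {p_clean:.3f}")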
Missing Data as a Special Case of Extreme Score

Missing data can be thought of as another potential type of extreme score, and much of the previous discussion applies. There are multiple reasons why data might be missing, and it is important to attempt to ascertain the reason for missingness, just as with extremeness. Cole gives a much more thorough treatment of how to analyze and deal with missing data, for those interested.29

However, one underutilized technique is analyzing differences between those with missing data and those with complete data. For example, researchers can code a variable that represents which category each subject falls into and then analyze other data as a function of missingness to determine whether missingness is associated with particular subgroups or other variables. This can shed important light on whether missingness may be causing significant bias.
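A minimal sketch of this technique in pandas follows: a missingness indicator is coded for one variable, and the two groups are then compared on another variable. The data frame, column names, simulated missingness mechanism, and choice of a t test are hypothetical illustrations.

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(6)
df = pd.DataFrame({"income": rng.normal(50, 12, size=300),
                   "anxiety": rng.normal(20, 5, size=300)})
# Suppose higher-anxiety respondents were less likely to report income.
drop = (df["anxiety"] > 26) & (rng.random(300) < 0.5)
df.loc[drop, "income"] = np.nan

# Code a missingness indicator and compare the groups on another variable.
df["income_missing"] = df["income"].isna().astype(int)
missing = df.loc[df["income_missing"] == 1, "anxiety"]
complete = df.loc[df["income_missing"] == 0, "anxiety"]
t, p = stats.ttest_ind(missing, complete)
print(f"Anxiety by income missingness: t = {t:.2f}, p = {p:.4f}")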
Summary

In sum, even the best, most sophisticated analyses must be considered flawed if quantitative researchers do not take the time to thoroughly understand and examine their data to ensure the best possible outcome (ie, the most accurate, generalizable representation of the population). Although over a century of writing on quantitative methods has yielded a diverse set of opinions about this topic, the analyses and principles summarized herein should convince readers that it is in their best interest to thoroughly clean their data before analysis.

References

1. Osborne JW. Sweating the small stuff in educational psychology: how effect size and power reporting failed to change from 1969 to 1999, and what that means for the future of changing practices. Educ Psychol. 2008;28:1-10.
2. Zimmerman DW. A note on the influence of outliers on parametric and nonparametric tests. J Gen Psychol. 1994;121:391-401.
3. Zimmerman DW. Increasing the power of nonparametric tests by detecting and downweighting outliers. J Exp Educ. 1995;64:71-78.
4. Jarrell MG. A comparison of two procedures, the Mahalanobis Distance and the Andrews-Pregibon Statistic, for identifying multivariate outliers. Res Sch. 1994;1:49-58.
5. Rasmussen JL. Evaluating outlier identification tests: Mahalanobis D Squared and Comrey D. Multivariate Behav Res. 1988;23:189-202.
6. Stevens JP. Outliers and influential data points in regression analysis. Psychol Bull. 1984;95:334-344.
7. Hawkins DM. Identification of Outliers. New York: Chapman and Hall; 1980.
8. Dixon WJ. Analysis of extreme values. Ann Math Stat. 1950;21:488-506.
9. Wainer H. Robust statistics: a survey and some prescriptions. J Educ Stat. 1976;1:285-312.
10. Schwager SJ, Margolin BH. Detection of multivariate outliers. Ann Stat. 1982;10:943-954.
11. Miller J. Reaction time analysis with outlier exclusion: bias varies with sample size. Q J Exp Psychol. 1991;43:907-912.
12. Van Selst M, Jolicoeur P. A solution to the effect of sample size on outlier elimination. Q J Exp Psychol. 1994;47:631-650.
13. Newton RR, Rudestam KE. Your Statistical Consultant: Answers to Your Data Analysis Questions. Thousand Oaks, CA: Sage; 1999.