2009 Data Cleaning




Data Cleaning Basics: Best Practices in
Dealing with Extreme Scores
Jason W. Osborne, PhD

In quantitative research, it is critical to perform data cleaning to ensure that the conclusions drawn from the
data are as generalizable as possible, yet few researchers report doing so (Osborne JW. Educ Psychol.
2008;28:1-10). Extreme scores are a significant threat to the validity and generalizability of the results. In this
article, I argue that researchers need to examine extreme scores to determine which of many possible causes
contributed to the extreme score. From this, researchers can take appropriate action, which has many
laudatory effects, from reducing error variance and improving the accuracy of parameter estimates to reducing
the probability of errors of inference.

Keywords: Data cleaning; Extreme scores; Outliers; Parameter estimates

From North Carolina State University.
Address correspondence to Jason W. Osborne, PhD, North Carolina State University, Curriculum and
Instruction and Counselor Education, Poe 602c, Campus Box 7801, NCSU, Raleigh, NC 27695-7801.
E-mail: [email protected].
© 2010 Elsevier Inc. All rights reserved.
1527-3369/09/1001-0343$36.00/0
doi:10.1053/j.nainr.2009.12.009

Most authors of peer-reviewed journal articles go to great lengths to describe their study, the
research methods, the sample, the statistical analyses used, results, and conclusions based on
those results. However, few seem to mention data cleaning (which can include screening for extreme
scores, missing data, normality, etc). To be sure, some researchers do check their data for these
things (and may neglect to report having done so), but when Osborne1 examined 2 years' worth of
empirical articles in top-tier educational psychology journals, none explicitly discussed any data
cleaning. There is no reason to believe that the situation is different in other disciplines.

This article discusses the issue of extreme scores, which can dramatically increase the risk of
errors of inference, problems with generalizability (biased estimates), and suboptimal power (some
“robust” procedures and nonparametric tests are incorrectly considered to be immune from these
sorts of issues; however, even robust and nonparametric tests benefit from clean data2,3). It
highlights why it is critical to screen data for extreme scores and offers specific suggestions for
how to deal with them.

What Are Extreme Scores and Why Do We Care About Them?

An extreme score, or data point far outside the normal distribution for a variable or
population,4-6 is also described as an observation that “deviates so much from other observations
as to arouse suspicions that it was generated by a different mechanism.”7 Arguably, if an extreme
score has origins in a different mechanism or population, it does not belong in your analysis.
Outliers have also been defined as values that are “dubious in the eyes of the researcher”8 and
contaminants,9 all of which lead to the same conclusion.

So Why Do We Care About Extreme Values?

Extreme values can cause serious problems for statistical analyses. First, they generally serve to
increase error variance and reduce the power of statistical tests. Second, if nonrandomly
distributed, they can substantially alter the odds of making both type I and type II errors. Third,
they can seriously bias or influence estimates that may be of substantive interest because they may
not be generated by the population of interest.2,5,10

What Is an Extreme Score?

There is as much controversy over what constitutes an extreme score as over whether to remove them.
It is always a good idea to visually inspect data before any other analysis. Simple rules of thumb
(eg, data points 3 or more SDs from the mean) are good starting points, unless the sample is
particularly small.11,12 I recommend examining scores at or beyond 3 SDs from the mean because, in
a normally distributed population, the probability of an individual being more than 3 SDs from the
mean by random chance alone is about 0.27%. Because of this, we have a strong basis for suspecting
that data points beyond ±3 SD from the mean are not generated by the population of interest and as
such should be dealt with in some fashion.

Bivariate and multivariate outliers are typically measured using either an index of influence or
leverage, or distance. Popular indices such as Mahalanobis' distance and Cook's D are both
frequently used to calculate the leverage (influence) that specific cases may exert on the
predicted value of the regression line.13 Standardized or studentized residuals in regression and
analysis of variance (ANOVA)-type analyses can also help
identify within-group outliers. The z = ±3 rule works well for standardized residuals as well.

What Causes Extreme Scores and What Should We Do About Them?

There are (at least) six possible reasons why a data point may be suspect. First note that not all
extreme scores are illegitimate contaminants, and not all illegitimate scores show up as extreme
scores.14 It is therefore important to consider the range of causes that may be responsible for
extreme scores. The inferred cause can then inform what action a researcher should take with a
given extreme score.

Extreme Scores From Data Errors

Extreme scores are often caused by errors in data collection, recording, or entry. Data from an
interview or survey can be recorded incorrectly, or mis-keyed upon data entry (eg, a survey
respondent reporting yearly wage rather than hourly wage). Errors of this nature can often be
corrected by returning to the original documents, recalculating or inferring the correct response,
or recontacting the original participant. This can save important data and eliminate a problematic
extreme score.

Extreme Scores From Intentional or Motivated Misreporting

Motivated misreporting by research participants is a long-discussed source of bias in data. A
participant may make a conscious effort to sabotage the research,15 or may be acting from social
desirability or self-presentation motives. Identifying and reducing this issue is difficult, unless
researchers take care to triangulate or validate data in some manner. Osborne and Blanchard16
summarize several approaches to identifying response sets such as this. If you suspect motivated
misresponding in your data, you should probably remove that participant, because the data are being
influenced by more than the phenomena you wish to examine.

Extreme Scores From Sampling Error or Bias

No sampling framework is perfect, and sampling error or bias can produce extreme scores by
erroneously including individuals from populations not intended to be sampled. For example, some
colleagues and I17 randomly sampled registered nurses from licensure rolls for a survey on
organizational commitment. As part of this survey, we asked nurses to report their salary. Upon
examining some very extreme scores, we discovered we had inadvertently surveyed some registered
nurses who had moved into hospital administration (with a much higher salary) but who had also
maintained their nursing license. These cases, being extreme and not of the population of interest
(floor nurses), were removed.

Extreme Scores From Standardization Failure

Unexpectedly, extreme scores can be caused by research methodology, particularly if something
anomalous happened during a particular subject's experience. Unusual phenomena such as construction
noise outside a research laboratory or an experimenter feeling particularly grouchy, or even events
outside the context of the research laboratory, such as a student protest, a rape, or murder on
campus, observations in a classroom the day before a big holiday recess, and so on can produce
outliers. Faulty or noncalibrated equipment is another common cause of extreme scores.

Let us consider two possible cases in relation to this source of outliers. In the first case, we
might have a piece of equipment in our laboratory that was miscalibrated, yielding measurements
that were extremely different from other days' measurements. If the miscalibration results in a
fixed change to the score that is consistent or predictable across all measurements (eg, all
measurements are off by 100), then adjustment of the scores is appropriate. If there is no clear
way to defensibly adjust the measurements, they must be discarded.

Extreme Scores From Faulty Distributional Assumptions

Incorrect assumptions about the distribution of the data can also lead to the presence of suspected
outliers.18 Blood sugar levels, disciplinary referrals, scores on classroom tests where students
are well-prepared, and self-reports of low-frequency behaviors (eg, number of times a student has
been suspended or held back a grade) may give rise to highly nonnormal distributions. These
distributions may look like they have a substantial number of extreme scores, but after
transformation(s) to improve normality,19 it might be the case that few, if any, of the data points
are subsequently identified as outliers.

Fig 1. Performance on class unit examination, Undergraduate Education Psychology Course.

The data presented in Fig 1 on 180 students taking an examination in an undergraduate psychology
class show a highly skewed distribution with a mean of 87.50 and an SD of 8.78. Although one could
argue that the lowest scores on this test are outliers because they are more than 3 SDs below the

38 VOLUME 10, NUMBER 1, www.nainr.com


mean, a better interpretation is that the data are not normally distributed. In this case, either a
transformation should be used to normalize the data before extreme scores are analyzed, or analyses
appropriate for nonnormal distributions should be used.

Extreme Scores as Legitimate Cases Sampled From the Correct Population

Finally, it is possible that an outlier can come from the population being sampled legitimately
through random chance. It is important to note that sample size plays a role in the probability of
outlying values. Within a normally distributed population, it is more probable that a given data
point will be drawn from the most densely concentrated area of the distribution, rather than one of
the tails.20,21 As a researcher casts a wider net and the data set becomes larger, the more the
sample resembles the population from which it was drawn, and thus, the likelihood of legitimate
individual outlying values becomes greater, although as a percentage of the sample, they become
less significant overall.

When extreme scores occur as a function of the inherent variability of the data, opinions differ
widely on what to do. When legitimate extreme scores are in a data set, they can have deleterious
effects on power, accuracy, and type I/II error rates. One way to deal with them is to use
truncation, in which you specify an upper reasonable limit to your data and recode higher scores to
that number (eg, in a study of adolescents one indicated he had 99 close friends, yet by our
definition that would be impossible; thus, we recoded all responses above 15 (the highest
reasonable number of close friends) to 15). This keeps all data in the sample while at the same
time reducing the influence of these scores. Data transformations (eg, square root and log) also
have the effect of reducing the effect of extreme scores when used appropriately.19

Alternatively, extreme scores can present an opportunity for inquiry. When researchers in Africa
discovered some women who had been repeatedly exposed to human immunodeficiency virus over several
years but remained uninfected,22 these women represented the potential for an important advance in
understanding. Thus, before discarding outliers, researchers need to consider whether those data
contain valuable information that may not necessarily relate to the intended study but have
importance in a more global sense.

To be clear on this point, no matter the inferred cause of the extreme score, it must be dealt with
in some fashion and that decision should be reported and defended in any research reports that
involve the data. Extreme scores should be corrected, removed, truncated, reduced in importance
through data transformation, or separated from the rest of the sample for separate study. This
affords the most replicable, honest estimate of the population parameters possible.23,24 Not only
are basic parameter estimates closer to population values when illegitimate extreme values are
removed, but inferential statistics (correlations, t tests, etc) have substantially lower error
rates.24

Fig 2. Distribution of SES.

Fig 3. Distribution of SES with 4% outliers.

Advanced Techniques for Dealing With Extreme Scores: Robust Methods

Instead of transformations or truncation, researchers sometimes use various “robust” procedures to
protect their data from being distorted by the presence of outliers. Certain parameter estimates,
especially the mean and least squares estimations, are particularly vulnerable to outliers, or have
“low breakdown”

NEWBORN & INFANT NURSING REVIEWS, MARCH 2010 39


Table 1. The effects of outliers on correlations

                      Average    Average              % more    % errors         % errors after
Population (r)  N     initial r  cleaned r  t         accurate  before cleaning  cleaning        t
−0.06           52    0.01       −0.08      2.5⁎      95        78               8               13.40†
                104   −0.54      −0.06      75.44†    100       100              6               39.38†
                416   0          −0.06      16.09†    70        0                21              5.13†
0.46            52    0.27       0.52       8.1†      89        53               0               10.57†
                104   0.15       0.50       26.78†    90        73               0               16.36†
                416   0.30       0.50       54.77†    95        0                0               –

One hundred samples were randomly drawn for each row. Outliers were actual members of the
population who scored at least z = ±3 on the relevant variable. With n = 52, a correlation of 0.274
is significant at P < .05. With n = 104, a correlation of 0.196 is significant at P < .05. With
n = 416, a correlation of 0.098 is significant at P < .05, two tailed.
⁎P < .01.
†P < .001.

values. For this reason, researchers turn to robust or “high breakdown” methods to provide
alternative estimates for these important aspects of the data.

A common robust estimation method for univariate distributions involves the use of a trimmed mean,
which is calculated by temporarily eliminating extreme observations at both ends of the sample.25
Alternatively, researchers may choose to compute a Winsorized mean, for which the highest and
lowest observations are temporarily censored and replaced with adjacent values from the remaining
data.14

Assuming that the distribution of prediction errors is close to normal, several common robust
regression techniques can help reduce the influence of outlying data points. The least trimmed
squares and the least median of squares estimators are conceptually similar to the trimmed mean,
helping to minimize the scatter of the prediction errors by eliminating a specific percentage of
the largest positive and negative outliers,26 whereas Winsorized regression smoothes the Y-data by
replacing extreme residuals with the next closest value in the dataset.27

Many options exist for analysis of nonideal variables. In addition to the abovementioned options,
analysts can choose from nonparametric analyses, because these types of analyses have few if any
distributional assumptions, although research by Zimmerman3,28 does point out that even
nonparametric analyses suffer from outlier cases.

The Effects of Extreme Scores and Their Removal on Individual Variables

Extreme scores have several specific effects on variables that are otherwise normally distributed.
To illustrate this, we will use socioeconomic status (SES),⁎ which represents a composite of family
income and social status based on parent occupation. In this data set, the scores were transformed
to z scores. This variable shows strong normality, with a skew of −0.001 (0.00 is perfectly
symmetrical; as depicted in Fig 2).

⁎ From the National Center for Education Statistics NELS 88 data set.

Samples from this distribution should also share these distributional traits, especially large
samples. For example, in a relatively large sample of n = 416 that included 4% extreme scores on
one side of the distribution (high-poverty students), the distribution properties changed
substantially (as depicted in Fig 3). The skew is now −2.18. Substantial error has been added to
the variable (SD is increased 56%), and it is clear that those 16 students at the very bottom of
the distribution do not belong to the normal population of interest. Removal of these outliers
returned the distribution to a mean of −0.02, SD = 0.78, skew = 0.01, not significantly different
from the original population of over 24 000.

Osborne and Overbay24 performed similar simulations of the effects of small numbers of outliers on
repeated samples from a known population in the context of correlation and ANOVA-type analyses. The
effects were striking.

Fig 4. Correlation of SES and achievement, with 4% outliers.
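As a rough illustration of the effects on an individual variable, the following Python sketch (standard library only) simulates a roughly normal variable, adds about 4% extreme low scores, and compares the mean, SD, and skew before and after screening at ±3 SDs, alongside trimmed and Winsorized means. The data are simulated stand-ins, not the NELS 88 SES variable; the function names, the seed, the 5% trimming proportion, and the placement of the extreme scores are all assumptions made for this example.

```python
import random
import statistics


def skewness(values):
    """Moment-based skewness (0.0 for a perfectly symmetrical sample)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the cases from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.fmean(ordered[k:len(ordered) - k])


def winsorized_mean(values, proportion):
    """Mean after replacing the k lowest and k highest cases with the
    nearest retained values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    if k:
        ordered[:k] = [ordered[k]] * k
        ordered[-k:] = [ordered[-k - 1]] * k
    return statistics.fmean(ordered)


def screen_by_z(values, cutoff=3.0):
    """Keep only cases within `cutoff` SDs of the sample mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= cutoff * sd]


# Hypothetical data: 400 legitimate cases plus 16 extreme low scores (16 of 416, about 4%).
rng = random.Random(2010)  # fixed seed so the sketch is repeatable
clean = [rng.gauss(0.0, 1.0) for _ in range(400)]
extreme = [rng.uniform(-7.0, -5.5) for _ in range(16)]
contaminated = clean + extreme
rescreened = screen_by_z(contaminated)

for label, sample in (("clean", clean),
                      ("contaminated", contaminated),
                      ("rescreened", rescreened)):
    print(label, len(sample),
          round(statistics.fmean(sample), 2),
          round(statistics.stdev(sample), 2),
          round(skewness(sample), 2))

print("5% trimmed mean:", round(trimmed_mean(contaminated, 0.05), 2))
print("5% Winsorized mean:", round(winsorized_mean(contaminated, 0.05), 2))
```

Note that the ±3 SD cutoff here is computed from a mean and SD that the extreme scores have already inflated, so less extreme contaminants than the ones simulated could slip under the screen. That is one reason to inspect the data visually as well.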

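The attenuation of a correlation by a few extreme cases, as in Fig 4, can be mimicked with a small simulation. This sketch is not the procedure used by Osborne and Overbay; it draws 96 hypothetical cases from a bivariate normal population with a correlation of 0.46 and then adds 4 hypothetical contaminant cases (4% of 100) that are very high on one variable and very low on the other. The helper names, the seed, and the contaminant values are assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def draw_pairs(rng, n, rho):
    """Draw n (x, y) pairs from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def r_of(pairs):
    """Correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)


rng = random.Random(24)  # fixed seed so the sketch is repeatable
legit = draw_pairs(rng, 96, 0.46)
# Four hypothetical contaminants: beyond z = 3 on x and beyond z = -3 on y.
contaminants = [(rng.uniform(3.0, 4.0), rng.uniform(-4.0, -3.0)) for _ in range(4)]

r_with = r_of(legit + contaminants)
r_without = r_of(legit)
print("r with contaminants:   ", round(r_with, 2))
print("r without contaminants:", round(r_without, 2))
```

Repeating the draw many times and counting how often the cleaned estimate is closer to 0.46 would give a summary in the spirit of Table 1.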


Table 2. The effects of outliers on t tests

                                                                                        % type         % type
      Initial     Cleaned              % more                                           I or II        I or II
      mean        mean                 accurate         Average    Average              errors before  errors after
n     difference  difference  t        mean difference  initial t  cleaned t  t         cleaning       cleaning      t

Equal group means, outliers in one cell
52    0.34        0.18        3.70     66.0             −0.20      −0.12      1.02      2.0            1.0           <1
104   0.22        0.14        5.36‡    67.0             0.05       −0.08      1.27      3.0            3.0           <1
416   0.09        0.06        4.15‡    61.0             0.14       0.05       0.98      2.0            3.0           <1

Equal group means, outliers in both cells
52    0.27        0.19        3.21‡    53.0             0.08       −0.02      1.15      2.0            4.0           <1
104   0.20        0.14        3.98‡    54.0             0.02       −0.07      0.93      3.0            3.0           <1
416   0.15        0.11        2.28⁎    68.0             0.26       0.09       2.14⁎     3.0            2.0           <1

Unequal group means, outliers in one cell
52    4.72        4.25        1.64     52.0             0.99       1.44       −4.70‡    82.0           72.0          2.41†
104   4.11        4.03        0.42     57.0             1.61       2.06       −2.78†    68.0           45.0          4.70‡
416   4.11        4.21        −0.30    62.0             2.98       3.91       −12.97‡   16.0           0.0           4.34‡

Unequal group means, outliers in both cells
52    4.51        4.09        1.67     56.0             1.01       1.36       −4.57‡    81.0           75.0          1.37
104   4.15        4.08        0.36     51.0             1.43       2.01       −7.44‡    71.0           47.0          5.06‡
416   4.17        4.07        1.16     61.0             3.06       4.12       −17.55‡   10.0           0.0           3.13‡

One hundred samples were drawn for each row. Outliers were actual members of the population who
scored at least z = ±3 on the relevant variable.
⁎P < .05.
†P < .01.
‡P < .001.
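The kind of comparison summarized in Table 2 can be sketched for a single sample. Assuming two groups drawn from the same hypothetical population (equal group means, n = 52) and two extreme scores placed in one cell, the code below computes the mean difference and a pooled-variance t statistic with and without the extreme scores. The function name, the seed, and the simulated values are assumptions for illustration; this is not the original simulation code.

```python
import math
import random
import statistics


def pooled_t(group_a, group_b):
    """Independent-samples t statistic using a pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    mean_diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return mean_diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


rng = random.Random(52)  # fixed seed so the sketch is repeatable
group_a = [rng.gauss(0.0, 1.0) for _ in range(26)]
group_b_clean = [rng.gauss(0.0, 1.0) for _ in range(24)]
# Two hypothetical extreme scores (z >= 3) landing in one cell only.
extreme = [rng.uniform(3.0, 4.0) for _ in range(2)]
group_b = group_b_clean + extreme

for label, b in (("with extreme scores", group_b),
                 ("after removal", group_b_clean)):
    diff = statistics.fmean(group_a) - statistics.fmean(b)
    print(label, "mean difference:", round(diff, 2),
          "t:", round(pooled_t(group_a, b), 2))
```

Because both groups come from the same population, any mean difference in this sketch is sampling error plus the pull of the two extreme scores; comparing the two printed lines shows how much of it the extreme scores contribute.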
The Effect of Extreme Scores on Correlations and Regression

As Table 1 demonstrates, outliers had adverse effects on correlations. Removal of the outliers
produced more accurate (ie, closer to the known “population” correlation) estimates of the
population correlation 70% to 100% of the time. Furthermore, in most cases errors of inference were
significantly less common with cleaned than with uncleaned data (between 89.7% and 100% of errors
of inference were eliminated for all but the largest data sets, which had few errors of inference).

As Fig 4 shows, a few randomly chosen outliers in a sample of 100 can cause substantial
mis-estimation of the population correlation. In the sample of almost 24 000 students, these two
variables were correlated very strongly, r = 0.46. In this particular example, the correlation with
4% outliers in the analysis was r = 0.16 and was not significant, whereas after removal of the
extreme scores, the correlation closely estimated the expected magnitude (r = 0.48).

The Effect of Outliers on t Tests and ANOVAs

The second example deals with analyses that look at group mean differences, such as t tests and
ANOVA. For the purpose of simplicity, these analyses are simple t tests, but these results easily
generalize to more complex analyses such as ANOVA.

For these analyses, two different conditions were examined: when there were no significant
differences between the groups in the population (sex differences in SES produced a mean group
difference of 0.0007 with an SD of 0.80 and with 24 501 df produced a t of 0.29) and when there
were significant group differences in the population (sex differences in mathematics achievement
test scores produced a mean difference of 4.06 and an SD of 9.75 and 24 501 df produced a t of
10.69, P < .0001).

The results in Table 2 again illustrate the expected effects of outliers on t test analyses.
Removal of outliers had beneficial effects, in that the results tended to become more like the
population: for both groups, differences and t statistics became more accurate in most of the
samples.

Missing Data as a Special Case of Extreme Score

Missing data can be thought of as another potential type of extreme score. As such, much of the
previous discussion applies. There are multiple reasons why data might be missing, and it is
important to attempt to ascertain the reason for missingness, just as for extremeness. Cole gives a
much more thorough treatment of how to analyze and deal with missing data, for those interested.29

However, one underutilized technique is analyzing differences between those with missing data and
those with complete data. For example, researchers can code a variable that represents which
category each subject falls into and then can analyze other data as a function of missingness to
determine if missingness is associated with particular subgroups or other variables. This can shed
important light onto whether missingness can be causing significant bias.

Summary

In sum, the best, most sophisticated analyses must be considered flawed if quantitative researchers
do not take the time to thoroughly understand and examine their data to ensure the best possible
outcome (ie, the most accurate, generalizable representation of the population). Although over a
century of writings on quantitative methods has yielded a very diverse set of opinions about this
topic, analyses and principles summarized herein should convince the readers that it is in their
best interest to thoroughly clean their data before analysis.

References

1. Osborne JW. Sweating the small stuff in educational psychology: how effect size and power
   reporting failed to change from 1969 to 1999, and what that means for the future of changing
   practices. Educ Psychol. 2008;28:1-10.
2. Zimmerman DW. A note on the influence of outliers on parametric and nonparametric tests. J Gen
   Psychol. 1994;121:391-401.
3. Zimmerman DW. Increasing the power of nonparametric tests by detecting and downweighting
   outliers. J Exp Educ. 1995;64:71-78.
4. Jarrell MG. A comparison of two procedures, the Mahalanobis Distance and the Andrews-Pregibon
   Statistic, for identifying multivariate outliers. Res Sch. 1994;1:49-58.
5. Rasmussen JL. Evaluating outlier identification tests: Mahalanobis D Squared and Comrey D.
   Multivariate Behav Res. 1988;23:189-202.
6. Stevens JP. Outliers and influential data points in regression analysis. Psychol Bull.
   1984;95:334-344.
7. Hawkins DM. Identification of Outliers. New York: Chapman and Hall; 1980.
8. Dixon WJ. Analysis of extreme values. Ann Math Stat. 1950;21:488-506.
9. Wainer H. Robust statistics: a survey and some prescriptions. J Educ Stat. 1976;1:285-312.
10. Schwager SJ, Margolin BH. Detection of multivariate outliers. Ann Stat. 1982;10:943-954.
11. Miller J. Reaction time analysis with outlier exclusion: bias varies with sample size. Q J Exp
    Psychol. 1991;43:907-912.
12. Van Selst M, Jolicoeur P. A solution to the effect of sample size on outlier elimination. Q J
    Exp Psychol. 1994;47:631-650.
13. Newton RR, Rudestam KE. Your Statistical Consultant: Answers to Your Data Analysis Questions.
    Thousand Oaks, CA: Sage; 1999.



14. Barnett V, Lewis T. Outliers in Statistical Data. New York: Wiley; 1994.
15. Huck SW, Sutton CO. Some comments concerning the use of monotonic transformations to remove the
    interaction in two-factor ANOVA's. Educ Psychol Meas. 1975;35:789-791.
16. Osborne JW, Blanchard MR. Random responding from students is a threat to the validity of
    educational research results. Educational Psychology. in press.
17. Brewer CS, Nauenberg E, Osborne JW. Differences among hospital and non-hospital RNs
    participation, satisfaction, and organizational commitment in western New York. Paper presented
    at: National meeting of the Association for Health Service Research; June, 1998; Washington DC;
    1998.
18. Iglewicz B, Hoaglin DC. How to Detect and Handle Outliers. Milwaukee, WI: ASQC Quality Press;
    1993.
19. Osborne JW. Notes on the use of data transformations. Practical Assessment, Research, and
    Evaluation; 2002. p. 8. Available online at http://ericae.net/pare/getvn.asp?v=8&n=6.
20. Evans VP. Strategies for detecting outliers in regression analysis: an introductory primer. In:
    Thompson B, editor. Advances in Social Science Methodology, Vol. 5. Stamford, CT: JAI Press;
    1999. p. 213-233.
21. Sachs L. Applied Statistics: A Handbook of Techniques. 2nd ed. New York: Springer-Verlag; 1982.
22. Rowland-Jones S, Sutton J, Ariyoshi K, et al. HIV-specific cytotoxic T-cells in HIV-exposed but
    uninfected Gambian women. Nat Med. 1995;1:59-64.
23. Judd CM, McClelland GH. Data Analysis: A Model Comparison Approach. San Diego, CA: Harcourt
    Brace Jovanovich; 1989.
24. Osborne JW, Overbay A. The power of outliers (and why researchers should ALWAYS check for
    them). Practical Assessment, Research, and Evaluation; 2004. p. 9.
25. Anscombe FJ. Rejection of outliers. Technometrics. 1960;2:123-147.
26. Rousseeuw P, Leroy A. Robust Regression and Outlier Detection. New York: Wiley; 1987.
27. Lane K. What Is Robust Regression and How Do You Do It? Annual meeting of the Southwest
    Educational Research Association. Austin, TX; 2002.
28. Zimmerman DW. Invalidation of parametric and nonparametric statistical tests by concurrent
    violation of two assumptions. J Exp Educ. 1998;67:55-68.
29. Cole JC. How to deal with missing data. In: Osborne JW, editor. Best Practices in Quantitative
    Methods. Thousand Oaks, CA: Sage Publishing; 2008.

