Jason W Osborne
Miami University
In quantitative research, it is critical to perform data cleaning to ensure that the conclusions drawn from the
data are as generalizable as possible, yet few researchers report doing so (Osborne JW. Educ Psychol.
2008;28:1-10). Extreme scores are a significant threat to the validity and generalizability of the results. In this
article, I argue that researchers need to examine extreme scores to determine which of many possible causes
contributed to the extreme score. From this, researchers can take appropriate action, which has many
beneficial effects, from reducing error variance and improving the accuracy of parameter estimates to reducing
the probability of errors of inference.
Most authors of peer-reviewed journal articles go to great lengths to describe their study, the research methods, the sample, the statistical analyses used, results, and conclusions based on those results. However, few seem to mention data cleaning (which can include screening for extreme scores, missing data, normality, etc). To be sure, some researchers do check their data for these things (and may neglect to report having done that), but when Osborne1 examined 2 years' worth of empirical articles in top-tier educational psychology journals, none explicitly discussed any data cleaning. There is no reason to believe that the situation is different in other disciplines.
The goal of this article is to discuss the issue of extreme scores, which can dramatically increase risk for errors of inference, problems with generalizability (biased estimates), and suboptimal power (some “robust” procedures and nonparametric tests are incorrectly considered to be immune from these sorts of issues; however, even robust and nonparametric tests benefit from clean data2,3).
The goal of this article is to highlight why it is critical to screen data for extreme scores and to offer specific suggestions for how to deal with them.
What Are Extreme Scores and Why Do We Care About Them?
An extreme score, or data point far outside the normal distribution for a variable or population,4-6 is also described as an observation that “deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”7 Arguably, if an extreme score has origins in a different mechanism or population, it does not belong in your analysis. Outliers have also been defined as values that are “dubious in the eyes of the researcher”8 and contaminants,9 all of which lead to the same conclusion.
So Why Do We Care About Extreme Values?
Extreme values can cause serious problems for statistical analyses. First, they generally serve to increase error variance and reduce the power of statistical tests. Second, if nonrandomly distributed, they can substantially alter the odds of making both type I and type II errors. Third, they can seriously bias or influence estimates that may be of substantive interest because they may not be generated by the population of interest.2,5,10
What Is an Extreme Score?
There is as much controversy over what constitutes an extreme score as over whether to remove them or not. It is always a good idea to visually inspect data before any other analysis. Simple rules of thumb (eg, data points 3 or more SDs from the mean) are good starting points, unless the sample is particularly small.11,12 I recommend examining scores at or beyond 3 SDs from the mean because, in a normally distributed population, the probability of an individual being more than 3 SDs from the mean by random chance alone is approximately 0.27%. Because of this, we have a strong basis for suspecting that data points beyond ±3 SD from the mean are not generated by the population of interest and as such should be dealt with in some fashion.
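As a minimal sketch of the 3-SD screening rule described above (an illustration, not a procedure taken from the cited sources), the following Python code computes the two-tailed probability of a score falling beyond a given z under a normal distribution and lists the scores in a sample that lie at or beyond ±3 SDs from the mean. The function names and the example data are hypothetical and were chosen only for illustration.

```python
import math
import statistics


def two_tailed_probability(z):
    """P(|Z| > z) for a standard normal variable, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


def flag_extreme_scores(scores, threshold=3.0):
    """Return (index, score, z) for every score at or beyond +/- threshold SDs from the mean."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (n - 1 in the denominator)
    flagged = []
    for index, score in enumerate(scores):
        z = (score - mean) / sd
        if abs(z) >= threshold:
            flagged.append((index, score, z))
    return flagged


# Hypothetical data: 19 ordinary scores and one suspicious value.
example_scores = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
                  49, 52, 48, 50, 51, 49, 50, 52, 48, 120]

if __name__ == "__main__":
    print(f"P(|z| > 3) = {two_tailed_probability(3):.4%}")
    for index, score, z in flag_extreme_scores(example_scores):
        print(f"case {index}: score = {score}, z = {z:.2f}")
```

Flagged scores are candidates for closer examination, not automatic deletions. Note also that with 10 or fewer cases no score can lie 3 sample SDs from the mean, which is one reason the rule is of little use in very small samples.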
Bivariate and multivariate outliers are typically measured
identify within-group outliers. The z = ±3 rule works well for standardized residuals as well.
What Causes Extreme Scores and What Should We Do About Them?
Extreme scores can arise from (at least) six possible causes. First, note that not all extreme scores are illegitimate contaminants, and not all illegitimate scores show up as extreme scores.14 It is therefore important to consider the range of causes that may be responsible for extreme scores. The inferred cause can then inform what action a researcher should take with a given extreme score.
Extreme Scores From Data Errors
Extreme scores are often caused by errors in data collection, recording, or entry. Data from an interview or survey can be recorded incorrectly, or mis-keyed upon data entry (eg, a survey respondent reporting yearly wage rather than hourly wage). Errors of this nature can often be corrected by returning to the original documents, recalculating or inferring the correct response, or recontacting the original participant. This can save important data and eliminate a problematic extreme score.
Extreme Scores From Intentional or Motivated Misreporting
Motivated misreporting by research participants is a long-discussed source of bias in data. A participant may make a conscious effort to sabotage the research,15 or may be acting from social desirability or self-presentation motives. Identifying and reducing this issue is difficult, unless researchers take care to triangulate or validate data in some manner. Osborne and Blanchard16 summarize several approaches to identifying response sets such as this. If you suspect motivated misresponding in your data, you should probably remove that participant, because the data are being influenced by more than the phenomena you wish to examine.
Extreme Scores From Standardization Failure
Unexpectedly, extreme scores can be caused by research methodology, particularly if something anomalous happened during a particular subject's experience. Unusual phenomena such as construction noise outside a research laboratory or an experimenter feeling particularly grouchy, or even events outside the context of the research laboratory, such as a student protest, a rape or murder on campus, observations in a classroom the day before a big holiday recess, and so on can produce outliers. Faulty or noncalibrated equipment is another common cause of extreme scores.
Let us consider two possible cases in relation to this source of outliers. In the first case, we might have a piece of equipment in our laboratory that was miscalibrated, yielding measurements that were extremely different from other days' measurements. If the miscalibration results in a fixed change to the score that is consistent or predictable across all measurements (eg, all measurements are off by 100), then adjustment of the scores is appropriate. If there is no clear way to defensibly adjust the measurements, they must be discarded.
Extreme Scores From Faulty Distributional Assumptions
Incorrect assumptions about the distribution of the data can also lead to the presence of suspected outliers.18 Blood sugar levels, disciplinary referrals, scores on classroom tests where students are well-prepared, and self-reports of low-frequency behaviors (eg, number of times a student has been suspended or held back a grade) may give rise to highly nonnormal distributions. These distributions may look like they have a substantial number of extreme scores, but after transformation(s) to improve normality,19 it might be the case that few, if any, of the data points are subsequently identified as outliers.
The data presented in Fig 1 on 180 students taking an examination in an undergraduate psychology class show a highly skewed distribution with a mean of 87.50 and an SD of 8.78. Although one could argue that the lowest scores on this test are outliers because they are more than 3 SDs below the
values. For this reason, researchers turn to robust or “high breakdown” methods to provide alternative estimates for these important aspects of the data.
A common robust estimation method for univariate distributions involves the use of a trimmed mean, which is calculated by temporarily eliminating extreme observations at both ends of the sample.25 Alternatively, researchers may choose to compute a Winsorized mean, for which the highest and lowest observations are temporarily censored and replaced with adjacent values from the remaining data.14
Assuming that the distribution of prediction errors is close to normal, several common robust regression techniques can help reduce the influence of outlying data points. The least trimmed squares and the least median of squares estimators are conceptually similar to the trimmed mean, helping to minimize the scatter of the prediction errors by eliminating a specific percentage of the largest positive and negative outliers,26 whereas Winsorized regression smooths the Y-data by replacing extreme residuals with the next closest value in the dataset.27
Many options exist for analysis of nonideal variables. In addition to the abovementioned options, analysts can choose from nonparametric analyses, because these types of analyses have few if any distributional assumptions, although research by Zimmerman3,28 does point out that even nonparametric analyses suffer from outlier cases.
shows strong normality, with a skew of −0.001 (0.00 is perfectly symmetrical; as depicted in Fig 2).
Samples from this distribution should also share these distributional traits, especially large samples. For example, in a relatively large sample of n = 416 that included 4% extreme scores on one side of the distribution (high-poverty students), the distribution properties changed substantially (as depicted in Fig 3):
The skew is now −2.18. Substantial error has been added to the variable (SD is increased 56%), and it is clear that those 16 students at the very bottom of the distribution do not belong to the normal population of interest. Removal of these outliers returned the distribution to a mean of −0.02, SD = 0.78, skew = 0.01, not significantly different from the original population of over 24 000.
Osborne and Overbay24 performed similar simulations of the effects of small numbers of outliers on repeated samples from a known population in the context of correlation and ANOVA-type analyses. The effects were striking.
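To make the preceding points concrete, the sketch below contaminates a simulated, roughly normal sample with a small cluster of extreme low scores and then compares the ordinary mean, SD, and skew with a trimmed mean and a Winsorized mean. It is a minimal illustration under stated assumptions, not a reanalysis of the data discussed here: the simulated scores, the placement of the 16 extreme cases, the 5% trimming proportion, and all function names are hypothetical.

```python
import random
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of cases from EACH end of the sorted data."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # number of cases removed per tail
    return statistics.mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, proportion):
    """Mean after replacing the k lowest and k highest cases with the nearest retained values."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    n = len(ordered)
    k = int(proportion * n)
    low, high = ordered[k], ordered[n - k - 1]
    return statistics.mean(min(max(v, low), high) for v in ordered)


def skewness(values):
    """Moment-based skew: third central moment divided by the SD cubed (0 = symmetrical)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def contamination_demo(seed=2010, n_clean=400, n_extreme=16):
    """Simulate a clean sample, add extreme low scores, and summarize both versions."""
    rng = random.Random(seed)
    clean = [rng.gauss(0, 1) for _ in range(n_clean)]
    # Assumption: the extreme cases sit 4 to 5 SDs below the mean of the clean scores.
    extreme = [rng.uniform(-5.0, -4.0) for _ in range(n_extreme)]
    contaminated = clean + extreme
    return {
        "sd_clean": statistics.stdev(clean),
        "sd_contaminated": statistics.stdev(contaminated),
        "skew_clean": skewness(clean),
        "skew_contaminated": skewness(contaminated),
        "mean_contaminated": statistics.mean(contaminated),
        "trimmed_mean_contaminated": trimmed_mean(contaminated, 0.05),
        "winsorized_mean_contaminated": winsorized_mean(contaminated, 0.05),
    }


if __name__ == "__main__":
    for name, value in contamination_demo().items():
        print(f"{name}: {value:.3f}")
```

The trimming proportion is a judgment call: it needs to be at least as large as the suspected share of extreme cases in a tail, yet every case trimmed from the opposite tail is legitimate information given up.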
⁎ From the National Centers for Educational Statistics NELS 88 data set.
Fig 4. Correlation of SES and achievement, with 4% outliers.
Equal group means, outliers in one cell
   52   0.34   0.18    3.70     66.0   −0.20   −0.12     1.02      2.0    1.0    <1
  104   0.22   0.14    5.36 ‡   67.0    0.05   −0.08     1.27      3.0    3.0    <1
  416   0.09   0.06    4.15 ‡   61.0    0.14    0.05     0.98      2.0    3.0    <1
Equal group means, outliers in both cells
   52   0.27   0.19    3.21 ‡   53.0    0.08   −0.02     1.15      2.0    4.0    <1
  104   0.20   0.14    3.98 ‡   54.0    0.02   −0.07     0.93      3.0    3.0    <1
  416   0.15   0.11    2.28 ⁎   68.0    0.26    0.09     2.14 ⁎    3.0    2.0    <1
Unequal group means, outliers in one cell
   52   4.72   4.25    1.64     52.0    0.99    1.44    −4.70 ‡   82.0   72.0    2.41 †
  104   4.11   4.03    0.42     57.0    1.61    2.06    −2.78 †   68.0   45.0    4.70 ‡
  416   4.11   4.21   −0.30     62.0    2.98    3.91   −12.97 ‡   16.0    0.0    4.34 ‡
Unequal group means, outliers in both cells
   52   4.51   4.09    1.67     56.0    1.01    1.36    −4.57 ‡   81.0   75.0    1.37
  104   4.15   4.08    0.36     51.0    1.43    2.01    −7.44 ‡   71.0   47.0    5.06 ‡
  416   4.17   4.07    1.16     61.0    3.06    4.12   −17.55 ‡   10.0    0.0    3.13 ‡
One hundred samples were drawn for each row. Outliers were actual members of the population who scored at least z = ±3 on the relevant variable.
⁎ P < .05.
† P < .01.
‡ P < .001.
The Effect of Extreme Scores on Correlations and Regression
As Table 1 demonstrates, outliers had adverse effects on correlations. Removal of the outliers produced more accurate (ie, closer to the known “population” correlation) estimates of the population correlation 70% to 100% of the time. Furthermore, in most cases errors of inference were significantly less common with cleaned than with uncleaned data (89.7% to 100% of errors of inference were eliminated for all but the largest data sets, which had few errors of inference).
As Fig 4 shows, a few randomly chosen outliers in a sample of 100 can cause substantial mis-estimation of the population correlation. In the sample of almost 24 000 students, these two variables were correlated very strongly, r = 0.46. In this particular example, the correlation with 4% outliers in the analysis was r = 0.16 and was not significant, whereas after removal of the extreme scores, the correlation closely estimated the expected magnitude (r = 0.48).
represents which category each subject falls into and then can analyze other data as a function of missingness to determine if missingness is associated with particular subgroups or other variables. This can shed important light on whether missingness may be causing significant bias.
Summary
In sum, the best, most sophisticated analyses must be considered flawed if quantitative researchers do not take the time to thoroughly understand and examine their data to ensure the best possible outcome (ie, the most accurate, generalizable representation of the population). Although over a century of writings on quantitative methods has yielded a very diverse set of opinions about this topic, the analyses and principles summarized herein should convince readers that it is in their best interest to thoroughly clean their data before analysis.
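The kind of mis-estimation described in the Fig 4 example (a sample of 100 with 4% outliers) can be imitated with a small simulation. The sketch below is illustrative only and rests on loud assumptions: the data are simulated rather than drawn from NELS 88, the four contaminating cases are placed by hand so that they run against the true relationship, and the function names are hypothetical. It computes Pearson's r with the contaminants included and again after removing every case at or beyond ±3 SDs on either variable.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_extreme_cases(x, y, threshold=3.0):
    """Keep only the pairs whose x AND y both lie within +/- threshold SDs of their means."""
    mean_x, sd_x = statistics.mean(x), statistics.stdev(x)
    mean_y, sd_y = statistics.mean(y), statistics.stdev(y)
    kept = [(a, b) for a, b in zip(x, y)
            if abs((a - mean_x) / sd_x) < threshold
            and abs((b - mean_y) / sd_y) < threshold]
    return [a for a, _ in kept], [b for _, b in kept]


def correlation_demo(seed=1988, n_clean=96, rho=0.46):
    """Simulate correlated scores, add 4 contaminating cases, and compare r before and after cleaning."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n_clean):
        a = rng.gauss(0, 1)
        b = rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        x.append(a)
        y.append(b)
    # Assumption: hand-placed contaminants from a "different mechanism" (4% of n = 100).
    for a, b in [(6.0, -5.0), (6.5, -5.5), (-6.0, 5.0), (-6.5, 5.5)]:
        x.append(a)
        y.append(b)
    clean_x, clean_y = drop_extreme_cases(x, y)
    return {
        "r_contaminated": pearson_r(x, y),
        "r_cleaned": pearson_r(clean_x, clean_y),
        "n_removed": len(x) - len(clean_x),
    }


if __name__ == "__main__":
    for name, value in correlation_demo().items():
        print(f"{name}: {value:.2f}")
```

Because the contaminants here are deliberately severe, the drop in r is larger than in the Fig 4 example; milder contaminants would give a smaller but still visible attenuation.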
The Effect of Outliers on t Tests
