Detection of Outliers: Iglewicz and Hoaglin


1.3.5.17.

Detection of Outliers
Introduction
An outlier is an observation that appears to deviate markedly
from other observations in the sample.

Identification of potential outliers is important for the
following reasons.

1. An outlier may indicate bad data. For example, the
data may have been coded incorrectly or an
experiment may not have been run correctly. If it can
be determined that an outlying point is in fact
erroneous, then the outlying value should be deleted
from the analysis (or corrected if possible).
2. In some cases, it may not be possible to determine if
an outlying point is bad data. Outliers may be due to
random variation or may indicate something
scientifically interesting. In any event, we typically
do not want to simply delete the outlying observation.
However, if the data contains significant outliers, we
may need to consider the use of robust statistical
techniques.

Labeling, Accommodation, Identification
Iglewicz and Hoaglin distinguish the three following issues
with regard to outliers.
1. outlier labeling - flag potential outliers for further
investigation (i.e., are the potential outliers erroneous
data, indicative of an inappropriate distributional
model, and so on).
2. outlier accommodation - use robust statistical
techniques that will not be unduly affected by
outliers. That is, if we cannot determine that potential
outliers are erroneous observations, do we need to
modify our statistical analysis to more appropriately
account for these observations?
3. outlier identification - formally test whether
observations are outliers.

This section focuses on the labeling and identification issues.

Normality Assumption
Identifying an observation as an outlier depends on the
underlying distribution of the data. In this section, we limit
the discussion to univariate data sets that are assumed to
follow an approximately normal distribution. If the normality
assumption for the data being tested is not valid, then a
determination that there is an outlier may in fact be due to the
non-normality of the data rather than the presence of an
outlier.

For this reason, it is recommended that you generate
a normal probability plot of the data before applying an
outlier test. Although you can also perform formal tests for
normality, the presence of one or more outliers may cause
the tests to reject normality when it is in fact a reasonable
assumption for applying the outlier test.

In addition to checking the normality assumption, the lower
and upper tails of the normal probability plot can be a useful
graphical technique for identifying potential outliers. In
particular, the plot can help determine whether we need to
check for a single outlier or whether we need to check for
multiple outliers.
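
As a rough illustration of what a normal probability plot contains,
the sketch below pairs each ordered observation with a standard normal
quantile, using only the Python standard library. The plotting
positions (i - 0.5)/n, the function names, and the sample values are
assumptions made for this example; they are not taken from the text,
and other plotting-position conventions are in common use. The
correlation of the plotted points is printed only as a crude summary
of how straight the plot is, not as a formal test.

```python
from statistics import NormalDist, mean


def normal_probability_points(data):
    """Pair each ordered observation with a standard normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions (i - 0.5) / n; other conventions exist.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), value)
        for i, value in enumerate(ordered, start=1)
    ]


def plot_correlation(points):
    """Correlation between theoretical quantiles and ordered data."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical sample: nine similar values plus one suspect value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 15.0]
points = normal_probability_points(sample)
trimmed_points = normal_probability_points(sorted(sample)[:-1])

# Each row is one point of the plot: (theoretical quantile, observation).
for quantile, value in points:
    print(f"{quantile:7.3f}  {value:6.2f}")

print("correlation, full sample:    ", round(plot_correlation(points), 3))
print("correlation, suspect removed:", round(plot_correlation(trimmed_points), 3))
```

In the printed table, a point in the upper or lower tail that sits far
from the line suggested by the central points is a candidate for
further investigation. Whether one point or several depart from the
line is the visual cue for checking a single outlier versus multiple
outliers.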

The box plot and the histogram can also be useful graphical
tools in checking the normality assumption and in identifying
potential outliers.
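
The text does not say how a box plot marks potential outliers. One
widely used convention, assumed here, places fences 1.5 interquartile
ranges beyond the lower and upper quartiles and labels any point
outside them. The sketch below applies that convention with the Python
standard library; the multiplier k, the quartile method, the function
names, and the sample values are all assumptions for illustration.

```python
from statistics import quantiles


def box_plot_fences(data, k=1.5):
    """Return (lower, upper) fences k interquartile ranges beyond the quartiles."""
    # "inclusive" treats the data as the full population of interest;
    # other quartile definitions give slightly different fences.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def label_potential_outliers(data, k=1.5):
    """Return the observations that fall outside the fences."""
    lower, upper = box_plot_fences(data, k)
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample: nine similar values plus one suspect value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 15.0]
flagged = label_potential_outliers(sample)
print("fences:", box_plot_fences(sample))
print("flagged for further investigation:", flagged)
```

This is outlier labeling in the sense used above: the flagged values
are candidates for investigation, not observations that have been
formally tested and declared outliers.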
