
In a blog post Andrew Gelman writes:

Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke.

I understand the reference to pie charts, but why is outlier detection looked down upon by statisticians according to Gelman? Is it just that it might cause people to over-prune their data?

  • If you look at the comments on that same page you linked to, you'll find an answer from Andrew himself, as well as further discussion. See for example this comment: andrewgelman.com/2014/06/02/hate-stepwise-regression/…
    – Anonymous
    Commented Feb 7, 2015 at 0:57
  • The framing here of statisticians versus non-statisticians is unfortunate. Look through e.g. Barnett and Lewis's treatise on outliers and you will see test after test suggested mostly by statisticians, focusing on implausible situations. It's true that (e.g.) in physics people often still follow ancient rules proposed by Peirce and Chauvenet, but much of the dopeyness here is associated with statisticians too. Disclosure: I am not a statistician, and I tend to believe that outliers are often genuine and that finding the right scale on which to work makes almost all of them tractable.
    – Nick Cox
    Commented Mar 4, 2016 at 0:56
  • @NickCox: I think Gelman may have been referring to different statistician vs. non-statistician conversations. For example, when looking for malicious behavior on networks, lots of non-statisticians are fired up about outlier detection: "of course I want to know about unusual behavior!!" Reading through the statistical literature, many statisticians start and end their papers with "well, this can be done, and here's how, but..."
    – Cliff AB
    Commented Mar 4, 2016 at 17:57
  • ...or alternatively, biologists are often okay with dropping outliers, because they believe these outliers are due to procedural errors rather than an unusual result from a properly executed experiment. So to them, a procedure that automatically drops procedural errors sounds great, but a statistician is not so happy with what actually happens in practice.
    – Cliff AB
    Commented Mar 4, 2016 at 18:09

2 Answers


@Jerome Baum's comment is spot on. To bring the Gelman quote here:

Outlier detection can be a good thing. The problem is that non-statisticians seem to like to latch on to the word “outlier” without trying to think at all about the process that creates the outlier, also some textbooks have rules that look stupid to statisticians such as myself, rules such as labeling something as an outlier if it more than some number of sd’s from the median, or whatever. The concept of an outlier is useful but I think it requires context—if you label something as an outlier, you want to try to get some sense of why you think that.

To add a little more: how about we first try to define "outlier"? Try to do so rigorously, without referring to anything visual like "it looks far away from the other points". It's actually quite hard.

I'd say that an outlier is a point that is highly unlikely given a model of how the points are generated. In most situations, people don't actually have such a model, or if they do it is so over-simplified as to be wrong much of the time. So, as Andrew says, people will do things like assume that the points come from some kind of Gaussian process, and declare a point an outlier if it is more than a certain number of SDs from the mean. Mathematically convenient, but not very principled.
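
As a minimal sketch of why that rule is fragile (my own illustration, not from the answer, assuming NumPy; the numbers are made up): a single extreme value inflates the mean and SD enough to hide itself, while a median/MAD-based cutoff still flags it.

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0])  # 25.0 is the suspect point

# Naive rule: flag anything more than 3 SDs from the mean.
# The extreme point inflates both the mean and the SD, so it masks itself.
z = (x - x.mean()) / x.std(ddof=1)
print(np.abs(z) > 3)                  # nothing is flagged

# Robust variant: distance from the median, scaled by the MAD.
med = np.median(x)
mad = np.median(np.abs(x - med))
robust_z = 0.6745 * (x - med) / mad   # 0.6745 makes the MAD comparable to an SD under normality
print(np.abs(robust_z) > 3)           # only 25.0 is flagged
```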

And we haven't even gotten into what people do with outliers once they are identified. Most people want to throw these inconvenient points away, for example. In many cases, it's the outliers that lead to breakthroughs and discoveries, not the non-outliers!

There's a lot of ad-hoc'ery in outlier detection, as practiced by non-statisticians, and Andrew is uncomfortable with that.


This demonstrates the classic tug of war between the two types of objectives for statistical analyses such as regression: descriptive vs. predictive. (Pardon the generalizations in my comments below.)

From the statistician's point of view, description usually matters more than prediction; hence they are inherently "biased" towards explanation. Why is there an outlier? Is it really a data-entry error (extra zeros at the end of a value), or is it a valid data point that just happens to be extreme? These are important questions for a statistician.

Data scientists, on the other hand, are more interested in prediction than in description. Their objective is to develop a strong model that does a great job of predicting a future outcome (e.g., purchase, attrition). If there's an extreme value in one of the fields, a data scientist would happily cap that value (at the 98th percentile, for instance) if doing so improves the predictive accuracy of the model.
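
A minimal sketch of that kind of percentile capping (my own illustration, not from the answer, assuming NumPy; the synthetic data and the 98th-percentile threshold are just placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.6, size=1000)   # skewed feature with a heavy right tail

cap = np.percentile(income, 98)            # 98th-percentile cap, as in the answer
income_capped = np.minimum(income, cap)    # values above the cap are pulled down to it

print(income.max(), cap, income_capped.max())  # the maximum shrinks to the cap; the bulk of the data is untouched
```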

I don't have a general preference for either of these two approaches. However, whether methods such as stepwise regression and outlier treatment are "a bit of a joke" or not depends on which side of the fence you are standing on.
