Add Spearman correlation support to statistics.correlation() #95861

rhettinger · 2022-08-10T21:42:52Z

Spearman correlation would be easily supported by adding a ranked option to statistics.correlation(). It is appropriate for ordinal data or for continuous data that doesn't meet the linear proportionality assumption required for Pearson correlation.

>>> # Example from https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php
>>> eng = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
>>> math = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

>>> # Pearson correlation reports the strength of a linear relationship
>>> correlation(math, eng)      
0.8005386765540159

>>> # Spearman correlation reports strength of a monotonic relationship
>>> correlation(math, eng, ranked=True)    
0.6686960980480711

The code would be mostly unchanged. If ranked is true, then the data is replaced by rankings using a helper function.

stevendaprano · 2022-08-13T01:16:36Z

Hi Raymond, this is a great idea!

I agree with you that this is a useful function to have, but I'm going to suggest a change in the interface. The statistics module doesn't usually go in for flag arguments that change the behaviour of the function, preferring separate named functions.

I suggest we add an explicit rank() function, and a spearman() function that is a thin wrapper around rank() + correlation().

Explicit ranking is itself useful, e.g. Excel and R both provide it. One of my stats textbooks specifically talks about students often having trouble ranking data correctly, especially when there are ties.
This would allow people to also compute the ranked covariance using covariance(rank(xs), rank(ys)). (I don't think this is needed often enough to provide it as a named function.)

If we take this approach, Spearman's becomes just:

def spearman(xdata, ydata):
    return correlation(rank(xdata), rank(ydata))

I don't think either Excel or R provide an explicitly named Spearman function, but Stata and Mathematica do.

rhettinger · 2022-08-13T06:23:33Z

I looked at making _rank() public. It was attractive at first because it isn't that easy for a user to implement themselves, and it seemed like an obvious thing to do. However, I recommend leaving this for a future discussion. Someone else can push it forward if they are so inspired. When I went down this path, I found that it is a can of worms because it would need a zoo of options:

Rank ascending or descending (Excel offers both of these).
Rank from zero or from one. The latter is the norm in statistics, but the former is needed in a zero-indexed language like Python so that the rankings can be related back to the input data.
Options for handling ties. SciPy's rankdata() has five options for ties: average, min, max, dense, and ordinal.
Anything that sorts in Python will be expected to accept a key-function: rankdata(employees, key=attrgetter('salary')).
A rankdata() function that returns a list of positions (or fractional positions in the case of ties) doesn't meet the needs of people who need to rank just a single item. Excel's function computes only one rank at a time, allowing it to answer questions like, "What is this particular movie's ranking?".
Keeping _rank() private means that we can keep our flexibility for the function signature. With the min or max option for ties, we might want to return a list of ints. With the default average option, it currently returns a list of floats, but could reasonably be changed to a list of fractions.
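To make the tie-handling options concrete, here is a rough sketch of the five strategies named above, applied to a small list with one tie. The function `rank_with_ties` and its `method` argument are hypothetical illustrations written for this thread; they are not SciPy's implementation and not the private `_rank()` helper.

```python
from bisect import bisect_left, bisect_right


def rank_with_ties(data, method="average"):
    """Hypothetical sketch: 1-based ranks under different tie-handling rules."""
    if method == "ordinal":
        # Ties get distinct ranks in order of appearance (sorted() is stable).
        order = sorted(range(len(data)), key=data.__getitem__)
        ranks = [0] * len(data)
        for rank, i in enumerate(order, start=1):
            ranks[i] = rank
        return ranks
    if method == "dense":
        # Ties share a rank and the next distinct value gets the next integer.
        distinct = sorted(set(data))
        return [bisect_left(distinct, v) + 1 for v in data]
    ordered = sorted(data)
    lows = [bisect_left(ordered, v) + 1 for v in data]   # lowest tied position
    highs = [bisect_right(ordered, v) for v in data]     # highest tied position
    if method == "min":
        return lows
    if method == "max":
        return highs
    if method == "average":
        return [(lo + hi) / 2 for lo, hi in zip(lows, highs)]
    raise ValueError(f"unknown method: {method!r}")


scores = [61, 45, 61, 80]
for name in ("average", "min", "max", "dense", "ordinal"):
    print(name, rank_with_ties(scores, name))
```

Only the average rule needs fractional ranks here; the other four return ints, which is the return-type question raised above.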

Also, I put some thought into why no one had ever requested rankdata() even though it is an obvious thing to want. The answer that came to mind was that just sorting the data tends to give people almost everything they want. Our users might not need rankdata() at all.

I looked at having a separate spearman() function but don't think it would serve our users well. We're targeting people who think of the concept of "correlation" and look that up. Our users are much less likely to know the names Pearson or Spearman or remember which test is which. Also, there is a trend now to stop naming things after people — already I've seen efforts to rename the Pythagorean Theorem and the Chinese Remainder Theorem.

I prefer the API in the PR. It is easy to teach and offers a simple decision point. First, you decide to investigate correlation, and then you ask whether you want to measure a linear relationship or a monotonic relationship. That kind of choice worked out well for quantiles(), which offered a simple decision point: "does my data include or exclude the end-points?".

IMO, SciPy and Mathematica do a great job of serving experts who know exactly what they want (here are a thousand functions, each with a highly technical name and long list of options); however, neither is friendly for casual users. In our case, we can do better: Here is one function for correlation; its default is the one you want most of the time, but if your data is ordinal, just say the word and we'll convert it to ranks for you, and if you read our friendly docstring, we'll introduce you to Pearson and Spearman, tell you when which is appropriate, and give a nice example showing how these are used in practice and the differences between them.

I designed the API as shown in the PR because it is the API that I think would be easiest to teach (which is what I do for a living) and the easiest for a beginner to learn directly from the docs. Also, it works well for people reading the code later by having a self-explanatory flag — no more mysterious than reverse=False for the sorted() function or the strict/ignore/replace error options for encode().

All that said, I still have doubts about whether the keyword argument should be by_rank=True, ranked=True, or apply_rank=True. There is some merit to the adverbial phrase which reads like English, "correlate this data by rank". There is merit to the adjective which matches the wording in "Spearman's ranked correlation coefficient". And there is merit to a strong verb, as in "Apply ranking to the data before computing the correlation".

stevendaprano · 2022-08-18T10:30:30Z

Thanks Raymond, your arguments have persuaded me that we should keep `_rank()` private.

I still urge you to reconsider the bool flag argument and use a separate named function rather than a flag. If you don't like the name "spearman", `rank(ed)_correlation()` works for me.

(Aside: your argument about users not knowing the terms "pearson" or "spearman" is not convincing to me. "Pearson's correlation coefficient" is taught as part of the secondary school statistics in Australia. Textbooks, Wikipedia, even your own source refer to "Spearman's". When I search for "rank correlation" on Google, 10 out of the top 12 search results refer to "Spearman's ..." in the page title.)

Otherwise, let's follow the precedent of quantiles and use a named "method" enum (a string is fine): `correlation(x, y, *, method='linear')` with `'ranked'` as the alternative. Do we have an agreement?
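For illustration only, a string-valued `method` keyword along these lines could dispatch as sketched below. The names `correlation_sketch` and `_average_ranks` are hypothetical, the tie rule (average ranks) is an assumption, and this is not the code from the PR.

```python
from bisect import bisect_left, bisect_right
from math import fsum, sqrt


def _average_ranks(data):
    """Hypothetical helper: 1-based ranks, ties share their average rank."""
    ordered = sorted(data)
    return [(bisect_left(ordered, v) + 1 + bisect_right(ordered, v)) / 2
            for v in data]


def correlation_sketch(x, y, *, method="linear"):
    """Pearson correlation by default; Spearman when method='ranked'."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    if len(x) < 2:
        raise ValueError("at least two data points are required")
    if method not in ("linear", "ranked"):
        raise ValueError(f"unknown method: {method!r}")
    if method == "ranked":
        # Replace the raw values by their ranks, then proceed as usual.
        x, y = _average_ranks(x), _average_ranks(y)
    xbar = fsum(x) / len(x)
    ybar = fsum(y) / len(y)
    sxy = fsum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = fsum((a - xbar) ** 2 for a in x)
    syy = fsum((b - ybar) ** 2 for b in y)
    # A constant input makes the denominator zero and raises ZeroDivisionError.
    return sxy / sqrt(sxx * syy)


eng_scores = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
math_scores = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]
print(correlation_sketch(math_scores, eng_scores))                   # about 0.8005
print(correlation_sketch(math_scores, eng_scores, method="ranked"))  # about 0.6687
```

A string leaves room for further methods later without adding more boolean flags, which is the same reasoning behind the `method` argument of `quantiles()`.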

rhettinger · 2022-08-18T16:34:07Z

Yes, I think that is a nice solution. Thanks again for putting thought into this.

rhettinger added the type-feature (A feature request or enhancement), stdlib (Python modules in the Lib dir), and 3.12 (bugs and security fixes) labels Aug 10, 2022

rhettinger assigned stevendaprano Aug 10, 2022

bedevere-bot mentioned this issue Aug 10, 2022

GH-95861: Add support for Spearman's rank correlation coefficient #95863

Merged

rhettinger added a commit that referenced this issue Aug 18, 2022

GH-95861: Add support for Spearman's rank correlation coefficient (GH…

29c8f80

…-95863)

rhettinger closed this as completed Aug 18, 2022
