Add Spearman correlation support to statistics.correlation() #95861

Closed
rhettinger opened this issue Aug 10, 2022 · 4 comments

@rhettinger
Contributor

rhettinger commented Aug 10, 2022

Spearman correlation would be easily supported by adding a ranked option to statistics.correlation(). It is appropriate for ordinal data or for continuous data that doesn't meet the linear proportionality assumption required for Pearson correlation.

>>> from statistics import correlation

>>> # Example from https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php
>>> eng = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
>>> math = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

>>> # Pearson correlation reports the strength of a linear relationship
>>> correlation(math, eng)
0.8005386765540159

>>> # Spearman correlation reports the strength of a monotonic relationship
>>> # (ranked=True is the option proposed in this issue)
>>> correlation(math, eng, ranked=True)
0.6686960980480711

The code would be mostly unchanged. If ranked is true, then the data is replaced by rankings using a helper function.

@rhettinger added the type-feature, stdlib, and 3.12 labels on Aug 10, 2022
@stevendaprano
Member

Hi Raymond, this is a great idea!

I agree with you that this is a useful function to have, but I'm going to suggest a change in the interface. The statistics module doesn't usually go in for flag arguments that change the behaviour of the function, preferring separate named functions.

I suggest we add an explicit rank() function, and a spearman() function that is a thin wrapper around rank() + correlation().

  1. Explicit ranking is itself useful, e.g. Excel and R both provide it. One of my stats textbooks specifically talks about students often having trouble ranking data correctly, especially when there are ties.
  2. This would allow people to also compute the ranked covariance using covariance(rank(xs), rank(ys)). (I don't think this is needed often enough to provide it as a named function.)

If we take this approach, Spearman's becomes just:

def spearman(xdata, ydata):
    return correlation(rank(xdata), rank(ydata))

I don't think either Excel or R provide an explicitly named Spearman function, but Stata and Mathematica do.

@rhettinger
Contributor Author

I looked at making _rank() public. It was attractive at first because it isn't that easy for a user to implement themselves, and it seemed like an obvious thing to do. However, I recommend leaving this for a future discussion. Someone else can push it forward if they are so inspired. When I went down this path, I found that it is a can of worms because it would need a zoo of options:

  1. Rank ascending or descending (Excel offers both of these).
  2. Rank from zero or from one. The latter is the norm in statistics, but the former is needed in a zero-indexed language like Python so that the rankings can be related back to the input data.
  3. Options for handling ties. SciPy's rankdata() has five options for ties: average, min, max, dense, and ordinal.
  4. Anything that sorts in Python will be expected to accept a key-function: rankdata(employees, key=attrgetter('salary')).
  5. A rankdata() function that returns a list of positions (or fractional positions in the case of ties) doesn't meet the needs of people who need to rank just a single item. Excel's function computes only one rank at a time, allowing it to answer questions like, "What is this particular movie's ranking?".
  6. Keeping _rank() private means that we can keep our flexibility for the function signature. With the min or max option for ties, we might want to return a list of ints. With the default average option, it currently returns a list of floats, but could reasonably be changed to a list of fractions.
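To make the size of that option space concrete, here is a rough pure-Python sketch of what a public ranking function might have to accept. Everything in it is an assumption for illustration: the name rank_options_sketch and its keyword parameters are hypothetical, and the tie names simply mirror the five listed above rather than reproducing SciPy's implementation.

```python
from itertools import groupby
from operator import itemgetter


def rank_options_sketch(data, *, key=None, reverse=False, start=1, ties="average"):
    """Hypothetical ranking helper showing the option space; not a stdlib API."""
    keys = [item if key is None else key(item) for item in data]
    # sorted() is stable, so equal keys keep their order of appearance.
    order = sorted(range(len(keys)), key=keys.__getitem__, reverse=reverse)
    ranks = [None] * len(keys)
    position = 0  # items ranked so far
    groups = 0    # distinct values ranked so far (used by "dense")
    for _, group in groupby(order, key=keys.__getitem__):
        tied = list(group)
        first = start + position      # lowest rank in this tie group
        last = first + len(tied) - 1  # highest rank in this tie group
        for offset, index in enumerate(tied):
            if ties == "average":
                ranks[index] = (first + last) / 2  # floats
            elif ties == "min":
                ranks[index] = first
            elif ties == "max":
                ranks[index] = last
            elif ties == "dense":
                ranks[index] = start + groups      # no gaps after ties
            elif ties == "ordinal":
                ranks[index] = first + offset      # ties broken by input order
            else:
                raise ValueError(f"unknown ties option: {ties!r}")
        position += len(tied)
        groups += 1
    return ranks


scores = [10, 20, 20, 30]
for option in ("average", "min", "max", "dense", "ordinal"):
    print(option, rank_options_sketch(scores, ties=option))

# Zero-based, descending ranks computed through a key function.
employees = [("ann", 50), ("bob", 70), ("cy", 60)]
print(rank_options_sketch(employees, key=itemgetter(1), reverse=True, start=0, ties="min"))
```

Even this small sketch needs four keyword arguments and returns floats or ints depending on the tie rule, which is the signature flexibility that keeping _rank() private preserves.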

Also, I put some thought into why no one had ever requested rankdata() even though it is an obvious thing to want. The answer that came to mind was that just sorting the data tends to give people almost everything they want. Our users might not need rankdata() at all.

I looked at having a separate spearman() function but don't think it would serve our users well. We're targeting people who think of the concept of "correlation" and look that up. Our users are much less likely to know the names Pearson or Spearman or remember which test is which. Also, there is a trend now to stop naming things after people — already I've seen efforts to rename the Pythagorean Theorem and the Chinese Remainder Theorem.

I prefer the API in the PR. It is easy to teach and offers a simple decision point. First, you decide to investigate correlation, and then you ask, "Do I want to measure a linear relationship or a monotonic relationship?" That kind of choice worked out well for quantiles(), which offered a simple decision point: "Does my data include or exclude the end-points?"
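For comparison, the quantiles() decision point mentioned above looks like this in practice. The sample data is made up for illustration; the method keyword and its two values are the ones statistics.quantiles() already accepts.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Default ("exclusive"): the data is a sample, and the population may
# contain values more extreme than anything observed.
exclusive_cuts = quantiles(data, n=4)

# "inclusive": the data is known to contain the smallest and largest
# possible values, so the cut points stay further inside the observed range.
inclusive_cuts = quantiles(data, n=4, method="inclusive")

print(exclusive_cuts)
print(inclusive_cuts)
```

One function with one keyword covers both cases, and the caller only has to answer a question about their data rather than recall the name of an estimation method.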

IMO, SciPy and Mathematica do a great job of serving experts who know exactly what they want (here are a thousand functions, each with a highly technical name and long list of options); however, neither is friendly for casual users. In our case, we can do better: Here is one function for correlation; its default is the one you want most of the time, but if your data is ordinal, just say the word and we'll convert it to ranks for you, and if you read our friendly docstring, we'll introduce you to Pearson and Spearman, tell you when which is appropriate, and give a nice example showing how these are used in practice and the differences between them.

I designed the API as shown in the PR because it is the API that I think would be easiest to teach (which is what I do for a living) and the easiest for a beginner to learn directly from the docs. Also, it works well for people reading the code later by having a self-explanatory flag — no more mysterious than reverse=False for the sorted() function or the strict/ignore/replace error options for encode().

All that said, I still have doubts about whether the keyword argument should be by_rank=True, ranked=True, or apply_rank=True. There is some merit to the adverbial phrase which reads like English, "correlate this data by rank". There is merit to the adjective which matches the wording in "Spearman's ranked correlation coefficient". And there is merit to a strong verb, as in "Apply ranking to the data before computing the correlation".

@stevendaprano
Member

stevendaprano commented Aug 18, 2022 via email

@rhettinger
Contributor Author

Yes, I think that is a nice solution. Thanks again for putting thought into this.
