Add Spearman correlation support to statistics.correlation() #95861
Comments
Hi Raymond, this is a great idea! I agree with you that this is a useful function to have, but I'm going to suggest a change in the interface. The statistics module doesn't usually go in for flag arguments that change the behaviour of the function, preferring separate named functions. I suggest we add an explicit
If we take this approach, Spearman's becomes just:
I don't think either Excel or R provide an explicitly named Spearman function, but Stata and Mathematica do.
I looked at making
Also, I put some thought into why no one had ever requested rankdata() even though it is an obvious thing to want. The answer that came to mind was that just sorting the data tends to give people almost everything they want. Our users might not need rankdata() at all.
I looked at having a separate
I prefer the API in PR. It is easy to teach and offers a simple decision point. First, you decide to investigate correlation and then you ask "whether you want to measure a linear relationship or a monotonic relationship?" That kind of choice worked out well for quantiles() which offered a simple decision point, "does my data include or exclude the end-points?".
IMO, SciPy and Mathematica do a great job of serving experts who know exactly what they want (here are a thousand functions, each with a highly technical name and long list of options); however, neither is friendly for casual users. In our case, we can do better: Here is one function for correlation; its default is the one you want most of the time, but if your data is ordinal, just say the word and we'll convert it to ranks for you, and if you read our friendly docstring, we'll introduce you to Pearson and Spearman, tell you when which is appropriate, and give a nice example showing how these are used in practice and the differences between them.
I designed the API as shown in the PR because it is the API that I think would be easiest to teach (which is what I do for a living) and the easiest for a beginner to learn directly from the docs. Also, it works well for people reading the code later by having a self-explanatory flag — no more mysterious than
All that said, I still have doubts about whether the keyword argument should be
Thanks Raymond, your arguments have persuaded me that we should keep `_rank()` private.
I still urge you to reconsider the bool flag argument and use a separate named function rather than a flag. If you don't like the name "spearman", `rank(ed)_correlation()` works for me.
(Aside: your argument about users not knowing the terms "pearson" or "spearman" is not convincing to me. "Pearson's correlation coefficient" is taught as part of the secondary school statistics in Australia. Textbooks, Wikipedia, even your own source refer to "Spearman's". When I search for "rank correlation" on Google, 10 out of the top 12 search results refer to "Spearman's ..." in the page title.)
Otherwise, let's follow the precedent of quantiles and use a named "method" enum (a string is fine):
correlation(x, y, *, method='linear')
with 'ranked' as the alternative.
Do we have an agreement?
Yes, I think that is a nice solution. Thanks again for putting thought into this.
Spearman correlation would be easily supported by adding a `ranked` option to `statistics.correlation()`. It is appropriate for ordinal data or for continuous data that doesn't meet the linear proportionality assumption required for Pearson correlation.
The code would be mostly unchanged. If `ranked` is true, then the data is replaced by rankings using a helper function.
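As a rough sketch of the idea described above (not the code from the PR), a ranking helper plus an otherwise unchanged Pearson computation could look like this. The names `rank` and `correlation_sketch` are hypothetical, the Pearson formula is written out by hand so the block is self-contained, and averaging the ranks of tied values is an assumption about how ties would be handled.

```python
from itertools import groupby
from math import fsum, sqrt


def rank(data):
    """Hypothetical helper: replace values by 1-based ranks, averaging ties."""
    indexed = sorted(enumerate(data), key=lambda pair: pair[1])
    ranks = [0.0] * len(data)
    position = 0  # number of values already ranked
    for _, group in groupby(indexed, key=lambda pair: pair[1]):
        group = list(group)
        size = len(group)
        # Tied values occupy ranks position+1 .. position+size; use their mean.
        average = position + (size + 1) / 2
        for index, _ in group:
            ranks[index] = average
        position += size
    return ranks


def correlation_sketch(x, y, *, ranked=False):
    """Pearson's r, or Spearman's coefficient when ranked is true."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError('inputs must have equal length of at least two')
    if ranked:
        x, y = rank(x), rank(y)  # the only step that differs from Pearson
    n = len(x)
    xbar, ybar = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = fsum((a - xbar) ** 2 for a in x)
    syy = fsum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # raises ZeroDivisionError for constant input


hours = [1, 2, 3, 4, 5]
score = [2, 4, 8, 16, 32]  # doubles each step: monotonic, not linear
pearson_r = correlation_sketch(hours, score)
spearman_r = correlation_sketch(hours, score, ranked=True)
```

Because the ranks of both inputs are identical here, the ranked result is 1 up to rounding, while the unranked result is lower since the relationship is not a straight line.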