is_in and == treat nulls differently #9247

Closed
2 tasks done
mcrumiller opened this issue Jun 5, 2023 · 6 comments
Labels
bug (Something isn't working), python (Related to Python Polars)

Comments

@mcrumiller
Contributor

mcrumiller commented Jun 5, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

is_in([x]) should be equivalent to == x for single items, but the treatment of nulls is different. I'm not sure if this is intended.

Similarly, .is_in(None) evaluates to True, whereas == None raises an exception.

Reproducible example

import polars as pl

s = pl.Series([None], dtype=pl.Utf8)

s == 'a'        # null
s.is_in(['a'])  # false

s == None       # error
s.is_in(None)   # true
s.is_in([None]) # false

Expected behavior

Both false or both null.

Installed versions

--------Version info---------
Polars:      0.18.0
Index type:  UInt32
Platform:    Windows-10-10.0.19045-SP0
Python:      3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
numpy:       1.24.3
pandas:      2.0.0
pyarrow:     11.0.0
connectorx:  0.3.2a3
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  3.7.1
xlsx2csv:    0.8.1
xlsxwriter:  3.1.0
mcrumiller added the bug and python labels Jun 5, 2023
@avimallu
Contributor

avimallu commented Jun 5, 2023

I would add this to a growing list of things to fix with is_in: #9105.

I don't know if it's worth my 2c here, but I don't think is_in strictly translates to == for single items. When I read "is X in Y?", I interpret it as, "Is the set of values in Y in the set of values in X", and assign true and false as appropriate for the list of X. The last example in your case works fine by this definition if you do:

s.is_in([None, "a"]) # true

Going by this understanding, s.is_in([None]) returning False is the bug. I recognize that not everyone will share my opinion about is_in, but I did want to highlight that the output of this must be an intentional design decision.

@mcrumiller
Contributor Author

Is the set of values in Y in the set of values in X

I disagree; this would return a single boolean for the entire series, i.e. it means x.is_in(y) returns true if the entirety of x is in y, and that's a single true/false result. The general interpretation for x.is_in(y) is "for each element in x, is that element found in the set y?" This returns a boolean array the same length as x.

To ask if an element x_i is in the set y is to ask "is x_i equal to any of the elements in y", which is to say x_i == y_j for some j. Thus, the result for a null element comes down to whether the comparison None == None returns True or False.
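
To make the two readings above concrete, here is a minimal pure-Python sketch. It is not Polars code: the helper names is_subset and is_in_elementwise are hypothetical, and it relies on plain Python equality, where None == None happens to be True.

```python
from typing import List, Optional, Sequence


def is_subset(xs: Sequence[Optional[str]], ys: Sequence[Optional[str]]) -> bool:
    """Whole-set reading: a single boolean for the entire input."""
    return all(x in ys for x in xs)


def is_in_elementwise(
    xs: Sequence[Optional[str]], ys: Sequence[Optional[str]]
) -> List[bool]:
    """Element-wise reading: one boolean per element of xs.

    Each result is "x == y_j for some j", using plain Python equality
    (an assumption of this sketch, not a statement about Polars).
    """
    return [any(x == y for y in ys) for x in xs]


xs = ["a", "b", None]
ys = ["a", None]

print(is_subset(xs, ys))          # False
print(is_in_elementwise(xs, ys))  # [True, False, True]
```

Under the element-wise reading, is_in_elementwise(xs, ["a"]) gives the same list as [x == "a" for x in xs], which is the single-item equivalence the issue asks about. Whether the None entry should come out true, false, or null is exactly the convention in question.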

@avimallu
Contributor

avimallu commented Jun 5, 2023

for each element in x, is that element found in the set y

This is a much clearer interpretation, thank you.

I don't quite know how to convey well why I don't think == and is_in are the same for a single element, and I can't find a clear reason to contradict yours either, so I'll leave it to those who do know better! 😅

@mcrumiller
Contributor Author

Well, is_in sort of connotes a set-theory view in which there is a lot to be said about the null set. But the Python None value shouldn't be confused with the null set; it's an actual value whose interpretation depends on its context. It usually is meant to stand for "nothing" but can also mean "invalid." With that in mind, asking if None is in some set X in Python depends on how you define None == None.

@avimallu
Contributor

avimallu commented Jun 5, 2023

I was leaning towards set theory in my mental explanation, but I learnt that so long ago I couldn't really articulate it well. Thanks for the patient listening, @mcrumiller!

@ritchie46
Member

So we changed equality to mimic other engines. That means that null will always propagate in the default equality operations.

I think for is_in we should follow the join and groupby rules, where null == null.

# this should error. We should only allow `Series` or `list<T>`.
s.is_in(None)

This returns true on main which seems correct. 🤔 Shall I add a test for that?

s = pl.Series([None], dtype=pl.Utf8)
s.is_in([None])
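
The rules described in this comment can be sketched in plain Python, with None standing in for null. This is only an illustration under stated assumptions, not the Polars implementation: eq_propagating, eq_null_matches and this is_in are hypothetical helpers, and accepting only list or tuple is a stand-in for "only allow Series or list<T>".

```python
from typing import List, Optional, Sequence


def eq_propagating(a: Optional[str], b: Optional[str]) -> Optional[bool]:
    """Default equality: a null on either side propagates as null (None)."""
    if a is None or b is None:
        return None
    return a == b


def eq_null_matches(a: Optional[str], b: Optional[str]) -> bool:
    """Join/group-by style equality: null compares equal to null."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def is_in(values: Sequence[Optional[str]], candidates) -> List[bool]:
    """Element-wise membership built on the null == null rule."""
    # Assumption for this sketch: reject anything that is not a list or
    # tuple, so that is_in(None) raises instead of returning a value.
    if not isinstance(candidates, (list, tuple)):
        raise TypeError("is_in expects a list-like of candidates")
    return [any(eq_null_matches(v, c) for c in candidates) for v in values]


column = [None, "a", "b"]

print([eq_propagating(v, "a") for v in column])  # [None, True, False]
print(is_in(column, ["a"]))                      # [False, True, False]
print(is_in(column, [None]))                     # [True, False, False]
print(is_in(column, [None, "a"]))                # [True, True, False]
```

With these conventions, == and is_in intentionally differ on nulls: equality yields null for a null element, while membership always yields true or false.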
