`is_in` and `==` treat nulls differently #9247

mcrumiller · 2023-06-05T20:48:48Z

Polars version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Issue description

is_in([x]) should be equivalent to == x for single items, but the treatment of nulls is different. I'm not sure if this is intended.

Similarly, .is_in([None]) evaluates to True, whereas == None raises an exception.

Reproducible example

import polars as pl

s = pl.Series([None], dtype=pl.Utf8)

s == 'a'        # null
s.is_in(['a'])  # false

s == None       # error
s.is_in(None)   # true
s.is_in([None]) # false

Expected behavior

Both false or both null.

Installed versions

--------Version info---------
Polars:      0.18.0
Index type:  UInt32
Platform:    Windows-10-10.0.19045-SP0
Python:      3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
numpy:       1.24.3
pandas:      2.0.0
pyarrow:     11.0.0
connectorx:  0.3.2a3
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  3.7.1
xlsx2csv:    0.8.1
xlsxwriter:  3.1.0


avimallu · 2023-06-05T21:54:03Z

I would add this to a growing list of things to fix with is_in: #9105.

I don't know if it's worth adding my 2c here, but I don't think is_in strictly translates to == for single items. When I read "is X in Y?", I interpret it as, "Is the set of values in Y in the set of values in X", and assign true and false as appropriate for the list of X. The last example in your case works fine by this definition if you do:

s.is_in([None, "a"]) # true

Going by this understanding, s.is_in([None]) returning False is the bug. I recognize that not everyone will share my opinion about is_in, but I did want to highlight that the output of this must be an intentional design decision.

mcrumiller · 2023-06-05T21:59:38Z

> Is the set of values in Y in the set of values in X

I disagree; this would return a single boolean for the entire series, i.e. it means x.is_in(y) returns true if the entirety of x is in y, and that's a single true/false result. The general interpretation for x.is_in(y) is "for each element in x, is that element found in the set y?" This returns a boolean array the same length as x.

To ask if an element x_i is in the set y is to ask "is x_i equal to any of the elements in y", which is to say x_i == y_j for some j. Thus, the result for a null element comes down to whether the comparison None == None returns True or False.
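The per-element reading described above can be sketched in plain Python. This is only an illustration, not Polars code: `is_in_elementwise` and its `nulls_equal` flag are hypothetical names, and `None` stands in for a null. The flag makes explicit that the answer for a null element depends entirely on how None == None is defined.

```python
from typing import Any, List, Sequence


def is_in_elementwise(xs: Sequence[Any], ys: Sequence[Any], nulls_equal: bool) -> List[bool]:
    """For each x in xs, report whether x equals some element of ys.

    None stands in for a null. `nulls_equal` is a hypothetical switch:
    True treats None == None as a match, False treats it as no match.
    """
    result = []
    for x in xs:
        if x is None:
            # A null can only match another null, and only if nulls compare equal.
            result.append(nulls_equal and any(y is None for y in ys))
        else:
            # A non-null value matches any equal non-null value.
            result.append(any(y is not None and x == y for y in ys))
    return result


xs = [None, "a"]
print(is_in_elementwise(xs, ["a"], nulls_equal=True))    # [False, True]
print(is_in_elementwise(xs, [None], nulls_equal=True))   # [True, False]
print(is_in_elementwise(xs, [None], nulls_equal=False))  # [False, False]
```

The output always has the same length as `xs`, which is the point of the per-element interpretation.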

avimallu · 2023-06-05T22:13:47Z

> for each element in x, is that element found in the set y

This is much clearer an interpretation, thank you.

I don't quite know how to convey my interpretation of why I don't think == and is_in are the same for a single element well, and I cannot find a reason to contradict yours clearly either, so I'll leave it to those who do know better! 😅

mcrumiller · 2023-06-05T22:21:00Z

Well, is_in sort of connotes a set-theory view in which there is a lot to be said about the null set. But the python None value shouldn't be confused with the null set, it's an actual value whose interpretation depends on its context. It usually is meant to stand for "nothing" but can also mean "invalid." With that in mind, asking if None is in some set X in python depends on how you define None == None.

avimallu · 2023-06-05T22:28:22Z

I was leaning towards set theory in my mental explanation, but I learnt that so long ago that I couldn't really articulate it well. Thanks for the patient listening, @mcrumiller!

ritchie46 · 2023-06-06T07:08:35Z

So we changed equality to mimic other engines. That means that null will always propagate in the default equality operations.

I think for the is_in we should follow the join and groupby rules where the null == null.

# this should error. We should only allow `Series` or `list<T>`.
s.is_in(None)

This returns true on main which seems correct. 🤔 Shall I add a test for that?

s = pl.Series([None], dtype=pl.Utf8)
s.is_in([None])
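As a rough illustration of the two rules mentioned in this comment, here is a plain-Python sketch (not the Polars implementation). The helper names `eq_propagating`, `eq_null_safe`, and `is_in_null_safe` are hypothetical, and `None` stands in for a null. The first helper models default equality, where a null on either side propagates; the second models the join/groupby-style rule where null == null.

```python
from typing import Any, Optional, Sequence


def eq_propagating(a: Any, b: Any) -> Optional[bool]:
    """Default-equality model: a null on either side yields null."""
    if a is None or b is None:
        return None
    return a == b


def eq_null_safe(a: Any, b: Any) -> bool:
    """Join/groupby-style model: null matches null, never a non-null value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def is_in_null_safe(x: Any, values: Sequence[Any]) -> bool:
    """Membership built on the null-safe comparison."""
    return any(eq_null_safe(x, v) for v in values)


print(eq_propagating(None, "a"))      # None
print(eq_null_safe(None, None))       # True
print(is_in_null_safe(None, [None]))  # True
print(is_in_null_safe(None, ["a"]))   # False
```

Under this model, a membership test against `[None]` is true for a null element, which matches the result reported on main above.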

mcrumiller added the bug (Something isn't working) and python (Related to Python Polars) labels Jun 5, 2023

mcrumiller closed this as completed Jun 14, 2023

avimallu mentioned this issue Jul 24, 2023

is_in for boolean gives wrong result with Null #10057

Closed

2 tasks

mcrumiller mentioned this issue Jul 25, 2023

fix(rust, python): fix Boolean::isin(null values) #10074

Merged
