Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(python): Improve handling of decimal conversion with to_numpy in the absence of pyarrow #12888

Merged

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Dec 4, 2023

Closes #12730.

Several things to note here:

  1. numpy does not actually support decimal! Consequently the caller should either cast to Float64 first (with the potential loss of exactness that implies) and then export to numpy, or accept that the resulting array will contain decimal values in an object-dtype array, which will be inefficient.

  2. The Series to_numpy method exposed a "use_pyarrow" parameter that the DataFrame method did not (but should have done). When converting from DataFrame the use_pyarrow parameter was implicitly True in the underlying Series passthrough; now it is exposed explicitly with the same default. Note that if the dtypes don't require it then we still do the conversion ourselves without pyarrow, so there is no actual change from the previous default behaviour.

  3. Added a dedicated is_decimal() to the Python-side DataType methods (I already added the same for Rust a while back) as we frequently need to treat decimal a little differently to the other numeric types.

Update

Can now do the decimal conversions with/without pyarrow installed; in both cases the result is an ndarray of "object" dtype.

@github-actions github-actions bot added fix Bug fix python Related to Python Polars labels Dec 4, 2023
@ritchie46
Copy link
Member

numpy does not actually support decimal! Consequently the caller should either cast to Float64 first (with the potential loss of exactness that implies) and then export to numpy, or accept that the resulting array will contain decimal values in an object-dtype array, which will be inefficient.

What does pyarrow do here then?

The Series to_numpy method exposed a "use_pyarrow" parameter that the DataFrame method did not (but should have done). When converting from DataFrame the use_pyarrow parameter was implicitly True in the underlying Series passthrough; now it is exposed explicitly with the same default. Note that if the dtypes don't require it (eg: all int/float) we still do the conversion ourselves without pyarrow, so there is no actual change from the previous default behaviour.

Can we default to False?

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Dec 5, 2023

What does pyarrow do here then?

Creates an object-dtype numpy array - which it occurs to me we can trivially do ourselves so we can handle this with/without pyarrow; updating the PR accordingly (will update the change description in a sec) ;)

Can we default to False?

Not really; that would result in a potential change of behaviour since DataFrame.to_numpy currently calls Series.to_numpy directly with the default param values - which include use_pyarrow=True. Note that we only actually use the pyarrow conversion if/when our own conversion fails, (or we're exporting to structured arrays).

If/when we want to change that then it should be done in conjunction with changing the default for Series, which would require some thinking (such as categorising exactly where/why we need the fallback and adding equivalent native paths as necessary, etc).

@ritchie46
Copy link
Member

Creates an object-dtype numpy array - which it occurs to me we can trivially do ourselves so we can handle this with/without pyarrow; updating the PR accordingly (will update the change description in a sec) ;)

Yeap, not ideal, but no data-loss, which I guess is what you care about when you use decimals in the first place.

@ritchie46 ritchie46 merged commit 314b912 into pola-rs:main Dec 5, 2023
18 checks passed
@alexander-beedie
Copy link
Collaborator Author

Yeap, not ideal, but no data-loss, which I guess is what you care about when you use decimals in the first place.

Exactly; can cast to float before export for performance if exactness isn't the #1 concern.

@stinodego stinodego changed the title fix(python): improve handling of decimal conversion with to_numpy in the absence of pyarrow fix(python): Improve handling of decimal conversion with to_numpy in the absence of pyarrow Dec 15, 2023
@alexander-beedie alexander-beedie deleted the decimal-numpy-conversion-errors branch December 30, 2023 12:46
@alexander-beedie alexander-beedie added the A-dtype-decimal Area: decimal data type label Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dtype-decimal Area: decimal data type fix Bug fix python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

decimal.Decimal -> to_numpy regressed in 0.19.17
2 participants