-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect assumptions about Binary being a String representation when parsing to UInt32 (or any other numerical) type #18991
Comments
I think that's the only parsing of strings/ binary to number we should support. We probably should remove the cast from binary to integer support. This was never intended to work. |
Personally, I think this is a useful capability to have. My use case stems from the fact that Delta Tables do not have unsigned ints, so oftentimes storing them as a binary type is a valid alternative. FWIW: Databricks does support casting from binary to numerical types source. |
It just seems to ambigous to me. Endianness, encoding etc all must be assumed. Maybe we can support it, but as a specific method, not as a generic cast. |
I agree that it is ambiguous, but I think the only thing that is ambiguous is in endianness. I understand not wanting to introduce arguments to There is also no ambiguity for single byte sizes, so this could potentially be part of At the least, removing current behavior for Binary -> Numerical would be appreciated, since it is not I expected, and leads to silent errors. |
I vote for removing the binary -> numerical cast and possibly add integer<->byte reinterpret function(s) in the future. |
Will this issue and related PR fix the ability to cast I'm trying to import data from postgres table with
|
Not yet. For now it is just numerical types. That being said, it can be extended to lists/arrays. My plan for now is:
|
Checks
Reproducible example
For minimal just atoi example:
For minimal rust example:
in crates/polars-core/src/series/mod.rs
Log output
No response
Issue description
There is an incorrect assumption that all binary types are in string/UTF8 representation. As such parsing of actual UTF8 LE byte representation is incorrect, as shown in the example.
Furthermore, potentially worse, sometimes parsing will succeed but results in arguably incorrect value.
The issue seems to step from the fact that polars internally uses
atoi_simd
which is fine for String -> Numerical parsing, but it is not a correct choice for BinaryView -> Numerical parsing.This in turns brings a question on which endianness should be supported/whether there should be an option. My guess would be to first switch it so that parsing from binary type is correct with fixed (low?) endianness, and in a separate MR adding choice of endianess.
Expected behavior
Any {1, 2, 4, 8} byte array should be parse-able to UInt{8, 16, 32, 64} respectively, and similarly for Int, Float and potentially Timestamp types.
Installed versions
This is on main (commit: 901b243), with default features. Not sure when the issue started, but my guess is since change to
atoi_simd
.The text was updated successfully, but these errors were encountered: