Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect assumptions about Binary being a String representation when parsing to UInt32 (or any other numerical) type #18991

Open
2 tasks done
balbok0 opened this issue Sep 28, 2024 · 7 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars

Comments

@balbok0
Copy link
Contributor

balbok0 commented Sep 28, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

For minimal just atoi example:

use atoi_simd;

fn main() {
    let slice = &[0, 1, 2, 5]; // Doesn't work
    // let slice = b"0125"; // Works, but array is different
    println!("{:?}", slice);
    let result = atoi_simd::parse::<u32>(
        slice
    );
    println!("{:#?}", result);
}

For minimal rust example:

    #[test]
    fn cast_strict() {
        // let ar = BinaryChunked::new("a".into(), [vec![0u8, 1, 2, 5]]);
        let ar = BinaryChunked::new("a".into(), [vec![0u8, 1, 2, 5]]);
        // let ar = UInt32Chunked::new("a".into(), &[16]);
        let s = ar.into_series();
        let s2 = s.strict_cast(&DataType::UInt32).unwrap();

        assert!(s2.u32().is_ok());
        println!("{:?}", s2.u32());
    }

in crates/polars-core/src/series/mod.rs

Log output

No response

Issue description

There is an incorrect assumption that all binary types are in string/UTF8 representation. As such parsing of actual UTF8 LE byte representation is incorrect, as shown in the example.
Furthermore, potentially worse, sometimes parsing will succeed but results in arguably incorrect value.

The issue seems to step from the fact that polars internally uses atoi_simd which is fine for String -> Numerical parsing, but it is not a correct choice for BinaryView -> Numerical parsing.

This in turns brings a question on which endianness should be supported/whether there should be an option. My guess would be to first switch it so that parsing from binary type is correct with fixed (low?) endianness, and in a separate MR adding choice of endianess.

Expected behavior

Any {1, 2, 4, 8} byte array should be parse-able to UInt{8, 16, 32, 64} respectively, and similarly for Int, Float and potentially Timestamp types.

Installed versions

This is on main (commit: 901b243), with default features. Not sure when the issue started, but my guess is since change to atoi_simd.

@balbok0 balbok0 added bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars labels Sep 28, 2024
@ritchie46
Copy link
Member

I think that's the only parsing of strings/ binary to number we should support.

We probably should remove the cast from binary to integer support. This was never intended to work.

@balbok0
Copy link
Contributor Author

balbok0 commented Sep 28, 2024

Personally, I think this is a useful capability to have. My use case stems from the fact that Delta Tables do not have unsigned ints, so oftentimes storing them as a binary type is a valid alternative.

FWIW: Databricks does support casting from binary to numerical types source.

@ritchie46
Copy link
Member

ritchie46 commented Sep 29, 2024

It just seems to ambigous to me. Endianness, encoding etc all must be assumed. Maybe we can support it, but as a specific method, not as a generic cast.

@balbok0
Copy link
Contributor Author

balbok0 commented Sep 30, 2024

I agree that it is ambiguous, but I think the only thing that is ambiguous is in endianness. I understand not wanting to introduce arguments to cast function. Maybe a separate function similar to decode could work in this case? This would allow for arbitrary endianness.

There is also no ambiguity for single byte sizes, so this could potentially be part of cast.

At the least, removing current behavior for Binary -> Numerical would be appreciated, since it is not I expected, and leads to silent errors.

@orlp
Copy link
Collaborator

orlp commented Oct 1, 2024

I vote for removing the binary -> numerical cast and possibly add integer<->byte reinterpret function(s) in the future.

@AndBondStyle
Copy link

Will this issue and related PR fix the ability to cast pl.Binary series to pl.Array or pl.List? Please redirect me if it was asked somewhere else before.

I'm trying to import data from postgres table with bytea columns, which need to be interpreted as fixed-size float32 arrays. Using read_database_uri with connectorx engine, I get bytea columns as pl.Binary series, however it looks like it's not possible to cast them into any other usable data type:

import polars as pl
import numpy as np

# This data is actually coming from postgres bytea column via connectorx
a = np.random.rand(512).astype(np.float32).tobytes()
b = np.random.rand(512).astype(np.float32).tobytes()
df = pl.DataFrame(data=[{"data": a}, {"data": b}], schema={"data": pl.Binary})

print(df["data"].cast(pl.Array(pl.Float32, 512)))  # Not supported
print(df["data"].cast(pl.List(pl.Float32)))  # Not supported
print(df["data"].to_numpy().view("<f4"))  # Cannot change data-type for object array

@balbok0
Copy link
Contributor Author

balbok0 commented Dec 12, 2024

Not yet. For now it is just numerical types. That being said, it can be extended to lists/arrays. My plan for now is:

  • Merging currently open MR
  • Adding support for lists/arrays
  • Adding tobytes which is an inversion of from_buffer, following numpy convention

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars
Projects
None yet
Development

No branches or pull requests

4 participants