Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign logical types #1489

Merged
merged 5 commits into from
Oct 6, 2021
Merged

Redesign logical types #1489

merged 5 commits into from
Oct 6, 2021

Conversation

ritchie46
Copy link
Member

This sets up a new architecture for logical types. Allowing for more data type support without increasing compile times. The basic ideas, is that polars only uses the primitive types and converts to the logical arrays upon the boundary between polars - arrow.

@ritchie46 ritchie46 changed the title WIP: Redesign logical types Redesign logical types Oct 5, 2021
@jorgecarleitao
Copy link
Collaborator

Looks solid, great work!

One general comment is that this approach (the previous also) may make it difficult to support timestamps with timezones (more generally, logical types whose number of variants is very large and can't be enumerated).

To address this in arrow2, what I did was to not rely on a trait to map logical information. For example:

fn subtract<T: PhysicalType>(lhs: &PrimitiveArray<T>, rhs: PrimitiveArray<T>) -> PrimitiveArray<T> {
    // ...
    let data_type = lhs.data_type().clone();
    PrimitiveArray::from_data(data_type, values, validity)
};

if we use something like

fn subtract<T: LogicalType>(lhs: &PrimitiveArray<T>, rhs: PrimitiveArray<T>) -> PrimitiveArray<T> {
    // ...
    PrimitiveArray::from_data(LogicalType::DATA_TYPE, values, validity)
};

the DATA_TYPE can't be described in timestamps with timezones because we need more information (the timezone).

In other words, for timezones, we will need to implement something like

impl Int64Chunked {
    pub fn into_datetime(self, tz: Option<String>) -> DatetimeChunked {
        DatetimeChunked::new(self, tz)
    }
}

and there won't be any .into(). Also means that

fn subtract<T: LogicalType>(lhs: &PrimitiveArray<T>, rhs: PrimitiveArray<T>) -> PrimitiveArray<T> {
    // ...
};

may not make sense when two timezones are not equal (i.e. when lhs.dtype() != rhs.dtype()).

I think we can figure out this later, just pointing out that this was one of the things that I went back and forth in arrow2.

@ritchie46
Copy link
Member Author

Agreed, the runtime DataType should dictate, not the statically known T. What I am thinking about for timestamp, is to have Logical<ChunkedArray<T>, Field>, where Field holds the runtime dtype. The statically known into_date would not work indeed. I am thinking to do the same as you did in arrow2: ChunkedArray<T>::to(DataType). Then DataType can hold all the timezone and precision information.
.
Note that almost all operations are dispatched to the physical type now and that only when going to and from arrow, and casting we need to check the logical type.

I think we can figure out this later, just pointing out that this was one of the things that I went back and forth in arrow2.

Yes, there are probably a lot of edge cases, once timezones come in to play. Maybe its a good idea to cast the ArrayRef to their logical type before using arrow2 kernels (now I only use the physical arrays). Then these edge cases can be maintained in one code base.

Thanks for the review, good to know we are on the good path for this.

@ritchie46 ritchie46 merged commit f167d3f into master Oct 6, 2021
@ritchie46 ritchie46 deleted the refactor_dates branch October 6, 2021 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants