Redesign logical types #1489

ritchie46 · 2021-10-05T12:22:30Z

This sets up a new architecture for logical types. Allowing for more data type support without increasing compile times. The basic ideas, is that polars only uses the primitive types and converts to the logical arrays upon the boundary between polars - arrow.

jorgecarleitao · 2021-10-06T04:39:02Z

Looks solid, great work!

One general comment is that this approach (the previous also) may make it difficult to support timestamps with timezones (more generally, logical types whose number of variants is very large and can't be enumerated).

To address this in arrow2, what I did was to not rely on a trait to map logical information. For example:

fn subtract<T: PhysicalType>(lhs: &PrimitiveArray<T>, rhs: PrimitiveArray<T>) -> PrimitiveArray<T> {
    // ...
    let data_type = lhs.data_type().clone();
    PrimitiveArray::from_data(data_type, values, validity)
};

if we use something like

fn subtract<T: LogicalType>(lhs: &PrimitiveArray<T>, rhs: PrimitiveArray<T>) -> PrimitiveArray<T> {
    // ...
    PrimitiveArray::from_data(LogicalType::DATA_TYPE, values, validity)
};

the DATA_TYPE can't be described in timestamps with timezones because we need more information (the timezone).

In other words, for timezones, we will need to implement something like

impl Int64Chunked {
    pub fn into_datetime(self, tz: Option<String>) -> DatetimeChunked {
        DatetimeChunked::new(self, tz)
    }
}

and there won't be any .into(). Also means that

fn subtract<T: LogicalType>(lhs: &PrimitiveArray<T>, rhs: PrimitiveArray<T>) -> PrimitiveArray<T> {
    // ...
};

may not make sense when two timezones are not equal (i.e. when lhs.dtype() != rhs.dtype()).

I think we can figure out this later, just pointing out that this was one of the things that I went back and forth in arrow2.

ritchie46 · 2021-10-06T07:11:38Z

Agreed, the runtime DataType should dictate, not the statically known T. What I am thinking about for timestamp, is to have Logical<ChunkedArray<T>, Field>, where Field holds the runtime dtype. The statically known into_date would not work indeed. I am thinking to do the same as you did in arrow2: ChunkedArray<T>::to(DataType). Then DataType can hold all the timezone and precision information.
.
Note that almost all operations are dispatched to the physical type now and that only when going to and from arrow, and casting we need to check the logical type.

I think we can figure out this later, just pointing out that this was one of the things that I went back and forth in arrow2.

Yes, there are probably a lot of edge cases, once timezones come in to play. Maybe its a good idea to cast the ArrayRef to their logical type before using arrow2 kernels (now I only use the physical arrays). Then these edge cases can be maintained in one code base.

Thanks for the review, good to know we are on the good path for this.

ritchie46 added 3 commits October 5, 2021 14:45

logical struct

557bb75

compile python

47d6c2a

impl serde

294c539

ritchie46 force-pushed the refactor_dates branch from aea523e to 294c539 Compare October 5, 2021 13:40

keep dates between date/datetime casting

d563dac

ritchie46 force-pushed the refactor_dates branch from be47a97 to f9d0ee0 Compare October 5, 2021 20:32

ritchie46 changed the title ~~WIP: Redesign logical types~~ Redesign logical types Oct 5, 2021

fix tests

ae7d5e6

ritchie46 force-pushed the refactor_dates branch from f9d0ee0 to ae7d5e6 Compare October 6, 2021 07:17

ritchie46 merged commit f167d3f into master Oct 6, 2021

ritchie46 deleted the refactor_dates branch October 6, 2021 08:37

ritchie46 mentioned this pull request Oct 8, 2021

Migrated Date64 to Timestamp() #1474

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redesign logical types #1489

Redesign logical types #1489

ritchie46 commented Oct 5, 2021

jorgecarleitao commented Oct 6, 2021

ritchie46 commented Oct 6, 2021

Redesign logical types #1489

Redesign logical types #1489

Conversation

ritchie46 commented Oct 5, 2021

jorgecarleitao commented Oct 6, 2021

ritchie46 commented Oct 6, 2021