Polars is a blazingly fast DataFrame library implemented in Rust. Its memory model is backed by Apache Arrow.
It currently consists of an eager API similar to pandas and a lazy API that is somewhat similar to Spark. Among other things, Polars offers the following functionality.
To learn more about the inner workings of Polars, read the WIP book.
Functionality | Eager | Lazy (DataFrame) | Lazy (Series) |
---|---|---|---|
Filters | ✔ | ✔ | ✔ |
Shifts | ✔ | ✔ | ✔ |
Joins | ✔ | ✔ | |
GroupBys + aggregations | ✔ | ✔ | |
Comparisons | ✔ | ✔ | ✔ |
Arithmetic | ✔ | ✔ | |
Sorting | ✔ | ✔ | ✔ |
Reversing | ✔ | ✔ | ✔ |
Closure application (User Defined Functions) | ✔ | ✔ | |
SIMD | ✔ | ✔ | |
Pivots | ✔ | ✗ | |
Melts | ✔ | ✗ | |
Filling nulls + fill strategies | ✔ | ✗ | ✔ |
Aggregations | ✔ | ✔ | ✔ |
Moving Window aggregates | ✔ | ✗ | ✗ |
Find unique values | ✔ | ✗ | |
Rust iterators | ✔ | ✔ | |
IO (csv, json, parquet, Arrow IPC) | ✔ | ✗ | |
Query optimization: (predicate pushdown) | ✗ | ✔ | |
Query optimization: (projection pushdown) | ✗ | ✔ | |
Query optimization: (type coercion) | ✗ | ✔ | |
Query optimization: (simplify expressions) | ✗ | ✔ | |
Query optimization: (aggregate pushdown) | ✗ | ✔ | |
Note that almost all eager operations supported on Series/ChunkedArrays can be used in the lazy API via UDFs.
Want to know about all the features Polars supports? Read the docs!
- Installation guide: pip install py-polars
- The book
- Reference guide
Polars is written to be performant, and it is! But don't take my word for it, take a look at the results in h2oai's db-benchmark.
Additional cargo features:
temporal (default)
- Conversions between Chrono and Polars for temporal data
simd (nightly)
- SIMD operations
parquet
- Read Apache Parquet format
json
- JSON serialization
ipc
- Arrow's IPC format serialization
random
- Generate arrays with randomly sampled values
ndarray
- Convert from DataFrame to ndarray
lazy
- Lazy API
strings
- String utilities for Utf8Chunked
object
- Support for generic ChunkedArrays called ObjectChunked<T> (generic over T). These are downcastable from Series through the Any trait.
parallel
- ChunkedArrays can be used by rayon::par_iter()
[plain_fmt | pretty_fmt] (mutually exclusive)
- One of them should be chosen to format DataFrames. pretty_fmt can deal with overflowing cells and looks nicer but has more dependencies. plain_fmt (default) is plain formatting.
Want to contribute? Read our contribution guideline.
- POLARS_PAR_SORT_BOUND -> sets the lower bound of rows at which Polars will use a parallel sorting algorithm. Default is 1M rows.
- POLARS_FMT_MAX_COLS -> maximum number of columns shown when formatting DataFrames.
- POLARS_FMT_MAX_ROWS -> maximum number of rows shown when formatting DataFrames.
- POLARS_TABLE_WIDTH -> width of the tables used during DataFrame formatting.
- POLARS_MAX_THREADS -> maximum number of threads used in the join algorithm. Default is unbounded.
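As a rough illustration of how these environment variables might be set from Python, the sketch below builds an environment mapping that could be passed to a child process. The variable names come from the list above. The helper `apply_polars_settings`, the chosen values, and the assumption that each variable is parsed as a plain integer are illustrative only and not confirmed by this README.

```python
import os

# Variable names are taken from the list above; the values are illustrative only.
POLARS_SETTINGS = {
    "POLARS_PAR_SORT_BOUND": 500_000,  # assumed to be a row count
    "POLARS_FMT_MAX_COLS": 8,
    "POLARS_FMT_MAX_ROWS": 20,
    "POLARS_TABLE_WIDTH": 120,
    "POLARS_MAX_THREADS": 4,
}


def apply_polars_settings(settings, environ):
    """Hypothetical helper: write each setting into `environ` as a string."""
    for name, value in settings.items():
        if not name.startswith("POLARS_"):
            raise ValueError(f"not a Polars variable: {name}")
        # Environment variables are always strings, so convert explicitly.
        environ[name] = str(value)
    return environ


# Work on a copy so the current process environment is left untouched.
env = apply_polars_settings(POLARS_SETTINGS, dict(os.environ))
```

The resulting `env` mapping can be handed to `subprocess.run(..., env=env)` to start a program that uses Polars. Setting the variables before that program starts avoids depending on when Polars actually reads them, which this README does not specify.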
If you want a bleeding-edge release or maximal performance, you should compile py-polars from source.
This can be done by going through the following steps in sequence:
- Install the latest Rust compiler
- $ pip3 install maturin
- $ cd py-polars && maturin develop --release