Support sink_parquet for anonymous scan #8719
Comments
@sid-6581 There's a really good example here: https://github.com/universalmind303/polars-mongo
Thanks for the reply! I had seen that example as well, but I didn't see anything significantly different about it that would make it work with
Just want to add that I have basically the same use case: downloading large volumes of data via HTTP and streaming it to a parquet file. I'm trying to use the Python bindings and can reduce the error to the following minimal example:

import pickle
import polars as pl

SCHEMA = {"foo": pl.Int64, "bar": pl.Utf8}

# Stand-in for a real data source: returns one small in-memory frame.
def _pseudo_scan(*args, **kwargs):
    return pl.DataFrame(
        {"foo": [1, 2], "bar": ["a", "b"]},
        schema=SCHEMA
    )

# Wrap the Python function as a lazy scan (underscore-prefixed, so a private API).
def pseudo_scan():
    return pl.LazyFrame._scan_python_function(SCHEMA, _pseudo_scan)

pseudo_scan().sink_parquet("test.parquet")
Hi 😄 I just started contributing to Polars and want to help with this issue. I noticed that neither AnonymousScan nor PythonScan is currently supported in streaming mode, so the scan operation is called only once in both cases.
Any update on this?
Streaming support for AnonymousScan would be a great feature for parsing large files in custom formats.
Problem description
I have a use case that I would imagine wouldn't be too out of the ordinary. I have many files in a format that doesn't already have a reader, and I would like to convert them to a parquet file in a streaming fashion. They don't all fit in memory at the same time, so it's important that they are read individually and appended to the parquet file. I tried writing a lazy reader using AnonymousScan, but I get the error "sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'" with the following minimal reproduction:

I found barely any examples of using AnonymousScan, so it's possible I missed something, but I don't know what that might be based on the example in the polars repo. It uses .collect().write_parquet(), which won't work for me.