Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File protocol not supported #13669

Open
2 tasks done
teaguesterling opened this issue Aug 31, 2024 · 10 comments · May be fixed by #13829
Open
2 tasks done

File protocol not supported #13669

teaguesterling opened this issue Aug 31, 2024 · 10 comments · May be fixed by #13829

Comments

@teaguesterling
Copy link

teaguesterling commented Aug 31, 2024

What happens?

DuckDB cannot load local files when the file:// prefix is provided.

In developing the duckdb_iceberg extension with test data created by pyiceberg, @ramonvermeulen observed that files were not loading. He traced this to the file paths in metadata being prefixed by file:// and removing the protocol resolved the issue.

In further testing to try and address the issue, I noticed that there is no fs subsystem for handling the file protocol and that a fix to this should probably be moved outside of the duckdb_iceberg extension.

To Reproduce

-- Failure
from 'file://data/iceberg/generated_spec1_0_001/pyspark_iceberg_table/data/00000-5-bd694195-a731-4121-be17-0a6b13d4e9fb-00001.parquet' LIMIT 1;
IO Error: No files found that match the pattern "file://data/iceberg/generated_spec1_0_001/pyspark_iceberg_table/data/00000-5-bd694195-a731-4121-be17-0a6b13d4e9fb-00001.parquet"
-- Success (without prefix)
from 'data/iceberg/generated_spec1_0_001/pyspark_iceberg_table/data/00000-5-bd694195-a731-4121-be17-0a6b13d4e9fb-00001.parquet' LIMIT 1;
┌─────────────────┬───────────────┬───┬──────────────────┬──────────────────────┬────────────────┐
│ l_orderkey_bool │ l_partkey_int │ … │ l_comment_string │         uuid         │ l_comment_blob │
│     boolean     │     int32     │   │     varchar      │       varchar        │      blob      │
├─────────────────┼───────────────┼───┼──────────────────┼──────────────────────┼────────────────┤
│                 │               │ … │                  │ 953a7daf-8493-4b7f…  │                │
├─────────────────┴───────────────┴───┴──────────────────┴──────────────────────┴────────────────┤
│ 1 rows                                                                    15 columns (5 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────┘

This is tested with data from the https://github.com/duckdb/duckdb_iceberg/ repository

OS:

PopOS 22.04

DuckDB Version:

1.0.0

DuckDB Client:

CLI

Full Name:

Teague Sterling

Affiliation:

23andMe

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@ramonvermeulen
Copy link

Thanks for opening the issue @teaguesterling! I think it would be nice if DuckDB would add support for file:// protocol.

@djouallah
Copy link

I would really appreciate if you can add support to file:/ too or file:/// as this is the format used by our spark engine, just for compatibility reason

@kevinjqliu
Copy link

FYI, Duckdb's delta_scan function currently handles the file:// prefix, as mentioned in duckdb/duckdb-iceberg#38 (comment)

It would be great to support the file:// prefix natively!

@kevinjqliu
Copy link

I see support for thehf://prefix (HuggingFace urls) was added in #11831. Perhaps adding support for file:// is similar

@kevinjqliu kevinjqliu linked a pull request Sep 9, 2024 that will close this issue
@wssbck
Copy link

wssbck commented Sep 11, 2024

I can confirm this issue as well, in my case for the file:/ prefix. I wrote to an Iceberg table stored locally and this prefix is used in the table's metadata files. iceberg_scan does not seem to support it:

IO Error: Cannot open file "file:/usr/local/hadoop/warehouse/bank_transfers/metadata/00002-6486f14f-b8db-43e8-aa09-a84c2c8bdcf8.metadata.json": No such file or directory

@teaguesterling
Copy link
Author

I can confirm this issue as well, in my case for the file:/ prefix. I wrote to an Iceberg table stored locally and this prefix is used in the table's metadata files. iceberg_scan does not seem to support it:

IO Error: Cannot open file "file:/usr/local/hadoop/warehouse/bank_transfers/metadata/00002-6486f14f-b8db-43e8-aa09-a84c2c8bdcf8.metadata.json": No such file or directory

It may be worth adding support for file:. As I understand it, this would actually be the correct implementation as the // prefix is optional in the spec. The /, //, /// bits are all variations of the same thing but with subtly different interpretations.

Here's my understanding:

  • file:some/thing.csv: refers to ./some/thing.csv relative to the current working directory.
  • file://some/thing.csv: refers to ./some/thing.csv relative to the current working directory but with an implicit empty host name.
  • file:/some/thing.csv: refers to /some/thing.csv as an absolute path
  • file:///some/thing.csv: refers to /some/thing.csv as an absolute path but with an implicit empty host name.

@kevinjqliu
Copy link

kevinjqliu commented Sep 13, 2024

I used pyarrow to test out the urls above
https://gist.github.com/kevinjqliu/912a4c0e33d22e8ed5cba35a2f170296

This might get tricky to implement since Glob already has some path prefix handling
https://github.com/duckdb/duckdb/blob/main/src/common/local_file_system.cpp#L1231-L1269

@teaguesterling
Copy link
Author

Thanks for testing this! It seems the first two that I wrote are not what was expected

There may be an "easy" short circuit we can add in the glob code since the logic is a bit simpler. It will always need to begin with file:/ and be an absolute path, either with 1 or 3 /, but they all mean the same thing.

@2blo
Copy link

2blo commented Dec 7, 2024

i have the same issue

@djouallah
Copy link

@samansmink any chance that this get merged before 1.2 please :)

Mytherin added a commit that referenced this issue Jan 9, 2025
Takes a stab at #13669 @djouallah

This PR adds support for [`file://`
urls](https://en.wikipedia.org/wiki/File_URI_scheme) to the
LocalFileSystem.

It currently supports urls of 3 different formats:

- `file:/some/path` (host omitted completely)
- `file:///some/path` (empty host)
- `file://localhost/some/path` (localhost as host)

Note that the following is not supported because they are non-standard
(and actually forbidden by the spec) formats:
- relative paths (`file:some/relative/path`)
- double-slash paths (`file://some/path`)

Additionally, we also don't support
- non-localhost hosts (`file://somehostsomewhere/some/path`)

For the non-standard formats we could consider implementing them anyway
if they show up a lot.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants