Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust,python): Add run-length encoding to Parquet writer #16125

Merged
merged 7 commits into from
May 10, 2024

Conversation

thalassemia
Copy link
Contributor

@thalassemia thalassemia commented May 9, 2024

Second attempt at #15959. This adds two commits on top of my original PR. One contains a fix for #16109 and better code comments (bb09b83). The other updates my test to use more realistic random data and adds new columns to check each possible branch condition (ee90a5e). I verified that these columns invoke the branch(es) mentioned in their associated comments by manually stepping through with a debugger. If you remove just the first new commit and run the test suite (running the new test on my original PR), it fails as expected.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 9, 2024
}
// If buffer is full, bit-pack as literal run and reset
if buffer_idx == MAX_VALUES_PER_LITERAL_RUN {
T::bitpacked_encode(writer, buffered_bits.iter().copied(), num_bits as usize)?;
Copy link
Contributor Author

@thalassemia thalassemia May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with my original PR was here. Instead of buffered_bits.iter().copied(), I had buffered_bits.iter().copied().take(literal_run_idx).

When the buffer filled up in the middle of a consecutive run, literal_run_idx != MAX_VALUES_PER_LITERAL_RUN so data was lost. The solution was to remove the take clause and always flush the entire buffer when full.

I added a debug_assert to check that the right number of repeats is consolidated into the buffered literal run.

n, with_replacement=True
),
# Fill up bit-packing buffer in middle of consecutive run
"large_bit_pack": np.concatenate(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This column tests the failure mechanism in #16109. By changing the value every 5 elements, we never have a run long enough to run-length encode. Eventually, all 8192 elements of the bit-packing buffer are filled. At that point, literal_run_idx = 8190 because it was last updated when the value last changed and consecutive_repeats = 3. Two of the most recent consecutive repeats are buffered as the 8191st and 8192nd elements of the bit-packing buffer. That means they are bit-packed with the rest of the buffer and consecutive_repeats is reset to 1.

Copy link

codspeed-hq bot commented May 9, 2024

CodSpeed Performance Report

Merging #16125 will not alter performance

Comparing thalassemia:rle (ee90a5e) with main (12b40b9)

Summary

✅ 35 untouched benchmarks

@thalassemia thalassemia force-pushed the rle branch 2 times, most recently from b91f064 to 1a676bd Compare May 9, 2024 01:30
Copy link

codecov bot commented May 9, 2024

Codecov Report

Attention: Patch coverage is 98.39572% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 80.98%. Comparing base (12b40b9) to head (ee90a5e).
Report is 1 commits behind head on main.

Files Patch % Lines
crates/polars-arrow/src/compute/cast/binary_to.rs 0.00% 1 Missing ⚠️
crates/polars-arrow/src/compute/cast/binview_to.rs 50.00% 1 Missing ⚠️
crates/polars-arrow/src/compute/cast/utf8_to.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16125      +/-   ##
==========================================
+ Coverage   80.95%   80.98%   +0.02%     
==========================================
  Files        1386     1386              
  Lines      178434   178511      +77     
  Branches     2878     2878              
==========================================
+ Hits       144455   144562     +107     
+ Misses      33487    33457      -30     
  Partials      492      492              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// * Final run: Extra values from buffer will be included in
// packed[..compressed_remainder_size] but ignored when decoding
// because they extend beyond known column length
bitpacked::encode_pack(&buffer, num_bits, packed.as_mut());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my original PR, I changed this to &buffer[..remainder] because I was worried about other data being included from the buffer (same buffer is used before reaching this branch).

I now realize that this is a non-issue for the reasons described in my comment and changed it back to &buffer. This theoretically avoids an extra memory allocation in bitpacked::encode_pack. Unfortunately, the current bitpacked::encode_pack always copies input data to a new array because it compares the input slice length to T::Packed::LENGTH instead of T::Unpacked::LENGTH. I'll fix that and optimize bit-packing in a separate PR.

@ritchie46
Copy link
Member

Alright, thank you for the retry @thalassemia.

I'll fix that and optimize bit-packing in a separate PR.

Waiting for that follow up. :)

@ritchie46 ritchie46 merged commit be57336 into pola-rs:main May 10, 2024
28 checks passed
@s-banach s-banach mentioned this pull request May 14, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants