feat(rust,python): Add run-length encoding to Parquet writer #16125
Conversation
}
// If buffer is full, bit-pack as literal run and reset
if buffer_idx == MAX_VALUES_PER_LITERAL_RUN {
    T::bitpacked_encode(writer, buffered_bits.iter().copied(), num_bits as usize)?;
The issue with my original PR was here. Instead of buffered_bits.iter().copied(), I had buffered_bits.iter().copied().take(literal_run_idx). When the buffer filled up in the middle of a consecutive run, literal_run_idx != MAX_VALUES_PER_LITERAL_RUN, so data was lost. The solution was to remove the take clause and always flush the entire buffer when full. I added a debug_assert to check that the right number of repeats is consolidated into the buffered literal run.
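For readers following along, a minimal sketch of the fixed flush path. Only MAX_VALUES_PER_LITERAL_RUN comes from the diff; flush_full_buffer, the counter layout, and the exact assertion are illustrative assumptions, not the actual encoder code.

// Assumed names throughout; only MAX_VALUES_PER_LITERAL_RUN comes from the diff.
const MAX_VALUES_PER_LITERAL_RUN: usize = 8192;

/// Returns the values to bit-pack as one literal run, or None if the buffer
/// is not yet full. The whole buffer is flushed, including trailing repeats
/// of the value that is still being counted.
fn flush_full_buffer(
    buffered: &mut Vec<u32>,
    literal_run_idx: &mut usize,
    consecutive_repeats: &mut usize,
) -> Option<Vec<u32>> {
    if buffered.len() < MAX_VALUES_PER_LITERAL_RUN {
        return None;
    }
    // In the spirit of the added debug_assert: the literal run plus the
    // already-buffered repeats of the current value account for the whole
    // buffer (e.g. 8190 + (3 - 1) == 8192 in the #16109 scenario).
    debug_assert_eq!(
        *literal_run_idx + (*consecutive_repeats - 1),
        MAX_VALUES_PER_LITERAL_RUN
    );
    // The buggy version flushed only buffered[..literal_run_idx], dropping the
    // trailing repeats; the fix takes everything and resets the counters.
    let run = std::mem::take(buffered);
    *literal_run_idx = 0;
    *consecutive_repeats = 1;
    Some(run)
}

fn main() {
    // State from the #16109 scenario: buffer full, value last changed at 8190,
    // current value seen 3 times so far (2 of them already buffered).
    let mut buffered: Vec<u32> = (0..MAX_VALUES_PER_LITERAL_RUN as u32).collect();
    let (mut literal_run_idx, mut consecutive_repeats) = (8190, 3);
    let run = flush_full_buffer(&mut buffered, &mut literal_run_idx, &mut consecutive_repeats);
    assert_eq!(run.map(|r| r.len()), Some(MAX_VALUES_PER_LITERAL_RUN));
}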
n, with_replacement=True
),
# Fill up bit-packing buffer in middle of consecutive run
"large_bit_pack": np.concatenate(
This column tests the failure mechanism in #16109. By changing the value every 5 elements, we never have a run long enough to run-length encode. Eventually, all 8192 elements of the bit-packing buffer are filled. At that point, literal_run_idx = 8190 because it was last updated when the value last changed, and consecutive_repeats = 3. Two of the most recent consecutive repeats are buffered as the 8191st and 8192nd elements of the bit-packing buffer. That means they are bit-packed with the rest of the buffer and consecutive_repeats is reset to 1.
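As a quick check on those numbers, a tiny sketch (illustrative only, not part of the test) of the arithmetic for a column whose value changes every 5 elements:

fn main() {
    let buffer_len = 8192_usize; // MAX_VALUES_PER_LITERAL_RUN
    let group = 5_usize;         // elements per distinct value in this column
    // The value last changes after 8190 buffered elements (8192 = 5 * 1638 + 2),
    // so the two trailing slots (the 8191st and 8192nd) hold repeats of the
    // newest value and are bit-packed along with the literal run.
    let last_change = buffer_len - buffer_len % group;
    assert_eq!(last_change, 8190);
    assert_eq!(buffer_len - last_change, 2);
}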
CodSpeed Performance Report: Merging #16125 will not alter performance.
Force-pushed from b91f064 to 1a676bd.
Codecov Report: Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16125      +/-   ##
==========================================
+ Coverage   80.95%   80.98%   +0.02%
==========================================
  Files        1386     1386
  Lines      178434   178511      +77
  Branches     2878     2878
==========================================
+ Hits       144455   144562     +107
+ Misses      33487    33457      -30
  Partials      492      492
☔ View full report in Codecov by Sentry.
// * Final run: Extra values from buffer will be included in
//   packed[..compressed_remainder_size] but ignored when decoding
//   because they extend beyond known column length
bitpacked::encode_pack(&buffer, num_bits, packed.as_mut());
In my original PR, I changed this to &buffer[..remainder] because I was worried about other data being included from the buffer (the same buffer is used before reaching this branch). I now realize that this is a non-issue for the reasons described in my comment and changed it back to &buffer. This theoretically avoids an extra memory allocation in bitpacked::encode_pack. Unfortunately, the current bitpacked::encode_pack always copies input data to a new array because it compares the input slice length to T::Packed::LENGTH instead of T::Unpacked::LENGTH. I'll fix that and optimize bit-packing in a separate PR.
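To make the "extra values are ignored" argument concrete, here is a toy 1-bit packer (purely illustrative, not the real bitpacked::encode_pack) showing that packing the whole buffer and packing only buffer[..remainder] decode identically once the reader stops at the known column length:

// Toy 1-bit bit-packing; stale values past `remainder` never affect the
// first `remainder` decoded values.
fn pack1(values: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        out[i / 8] |= (v & 1) << (i % 8);
    }
    out
}

fn unpack1(bytes: &[u8], n: usize) -> Vec<u8> {
    (0..n).map(|i| (bytes[i / 8] >> (i % 8)) & 1).collect()
}

fn main() {
    let remainder = 3;
    // The buffer still holds stale values after index `remainder`.
    let buffer = [1u8, 0, 1, 1, 1, 1, 1, 1];
    let full = pack1(&buffer);
    let trimmed = pack1(&buffer[..remainder]);
    // The decoder only reads `remainder` values, so both encodings agree.
    assert_eq!(unpack1(&full, remainder), unpack1(&trimmed, remainder));
}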
Alright, thank you for the retry @thalassemia.
Waiting for that follow-up. :)
Second attempt at #15959. This adds two commits on top of my original PR. One contains a fix for #16109 and better code comments (bb09b83). The other updates my test to use more realistic random data and adds new columns to check each possible branch condition (ee90a5e). I verified that these columns invoke the branch(es) mentioned in their associated comments by manually stepping through with a debugger. If you remove just the first new commit and run the test suite (running the new test on my original PR), it fails as expected.