I have a 100 GB text file in a 7z archive. I can find the pattern 'hello' in it by reading it in 1 MB blocks (7z writes the data to stdout):

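from subprocess import Popen, PIPE

# 7z extracts the file to stdout (-so); we read it through a pipe
proc = Popen(["7z", "e", "-so", "archive.7z", "big100gb_file.txt"], stdout=PIPE)
i = 0
while True:
    block = proc.stdout.read(1024*1024)    # 1 MB block
    if not block:                          # empty read: end of the extracted data
        break
    i += 1
    ...
    if b'hello' in block:      # omitting other details for search pattern split in consecutive blocks...
        print('pattern found in block %i' % i)
    ...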

Now that we have found, after 5 minutes of searching, that the pattern 'hello' is in, say, the 23456th block, how can we access this block or line very quickly in the future inside the 7z file?

(if possible, without saving this data in another file/index)

With 7z, how can we seek to the middle of the file?

Note: I have already read "Indexing / random access to 7zip .7z archives" and "random seek in 7z single file archive", but these questions don't discuss a concrete implementation.

  • @TDG a .7z file surely has headers and a table of files at the end of file, so I'm nearly sure we cannot seek in the middle simply like this
    – Basj
    Commented May 7, 2022 at 8:07

1 Answer

It is possible, in principle, to build an index to compressed data. You would pick, say, a block size of uncompressed data, where the start of each block would be an entry point at which you would be able to start decompressing. The index would be a separate file or a large structure in memory that you would build, with the entire decompression state saved for each entry point. You would need to decompress all of the compressed data once to build the index. The choice of block size would be a balance between how quickly you want to access any given byte in the compressed data and the size of the index.
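As a minimal sketch of that idea, here is an in-memory index over a zlib stream, used only because Python's standard library can snapshot its decompressor with `copy()`; this is not something 7z offers. The names `build_index`, `read_at` and `step` are made up for illustration, and the entry points are placed every `step` compressed bytes rather than every fixed number of uncompressed bytes, to keep the code short.

```python
import zlib

def build_index(compressed, step=64 * 1024):
    """Decompress once, saving a checkpoint after every `step` compressed bytes.

    Each checkpoint is (compressed_offset, uncompressed_offset, saved_state).
    """
    d = zlib.decompressobj()
    index = [(0, 0, d.copy())]          # entry point at the very start
    uncompressed_offset = 0
    for pos in range(0, len(compressed), step):
        uncompressed_offset += len(d.decompress(compressed[pos:pos + step]))
        if not d.eof:                   # no entry point after the end of the stream
            index.append((pos + step, uncompressed_offset, d.copy()))
    return index

def read_at(compressed, index, offset, size, step=64 * 1024):
    """Return `size` uncompressed bytes starting at uncompressed `offset`."""
    # Last checkpoint at or before the requested uncompressed offset.
    comp_off, uncomp_off, state = max(
        (entry for entry in index if entry[1] <= offset),
        key=lambda entry: entry[1])
    d = state.copy()                    # copy so the saved checkpoint stays reusable
    skip = offset - uncomp_off          # bytes to discard before the wanted range
    out = bytearray()
    pos = comp_off
    while len(out) < skip + size and pos < len(compressed):
        out += d.decompress(compressed[pos:pos + step])
        pos += step
    return bytes(out[skip:skip + size])
```

A smaller `step` means less data to decompress and discard per lookup, but more saved states to keep, which is the trade-off between access speed and index size.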

There are several different compression methods that 7z can use (deflate, lzma2, bzip2, ppmd). What you would need to do to implement this sort of random access would be entirely different for each method.

Also for each method there are better places to pick entry points than some fixed uncompressed block size. Such choices would greatly reduce the size of the index, taking advantage of the internal structure of the compressed data used by that method.

For example, bzip2 has natural entry points with no history at the start of each bzip2 block, by default each holding about 900 kB of uncompressed data. This allows the index to be quite small, since only the compressed and uncompressed offsets need to be saved.
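The blocks inside a single bzip2 stream start at arbitrary bit positions, which Python's `bz2` module cannot jump to directly. The sketch below therefore makes a simplifying assumption: we control the compression and write every chunk as its own independent bzip2 stream, so each entry point is byte-aligned. The names `compress_indexed`, `read_chunk` and `CHUNK` are hypothetical.

```python
import bz2

CHUNK = 900_000  # uncompressed bytes per independent stream (an assumed value)

def compress_indexed(data, chunk=CHUNK):
    """Compress `data` as concatenated bzip2 streams and record their offsets."""
    out = bytearray()
    index = []  # (compressed_offset, uncompressed_offset) per entry point
    for start in range(0, len(data), chunk):
        index.append((len(out), start))
        out += bz2.compress(data[start:start + chunk])
    return bytes(out), index

def read_chunk(compressed, index, i):
    """Decompress only the i-th chunk, without touching the others."""
    start = index[i][0]
    end = index[i + 1][0] if i + 1 < len(index) else len(compressed)
    return bz2.decompress(compressed[start:end])
```

The result is still a valid multi-stream bzip2 file, and the index holds only two integers per entry point because no history has to be saved.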

For deflate, the entry points can be deflate blocks, where the index holds the compressed and uncompressed offsets of selected deflate blocks, along with the 32 KiB dictionary for each entry point. zran.c implements such an index for deflate compressed data.
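To illustrate the role of the dictionary, here is a rough Python sketch. It is not how zran.c works: zran.c indexes existing data whose block boundaries may fall in the middle of a byte, whereas this sketch assumes we produce the raw deflate data ourselves and use `Z_SYNC_FLUSH` to end a block on a byte boundary at each entry point while keeping the history. The names `compress_with_entry_points` and `read_from` are hypothetical.

```python
import zlib

WINDOW = 32 * 1024  # deflate can refer back at most this far

def compress_with_entry_points(data, chunk=256 * 1024):
    """Raw-deflate `data`, recording an entry point at the start of each chunk."""
    c = zlib.compressobj(6, zlib.DEFLATED, -15)      # -15: raw deflate, no header
    out = bytearray()
    index = []  # (compressed_offset, uncompressed_offset, dictionary)
    for start in range(0, len(data), chunk):
        # The dictionary is the uncompressed data just before the entry point.
        index.append((len(out), start, data[max(0, start - WINDOW):start]))
        out += c.compress(data[start:start + chunk])
        out += c.flush(zlib.Z_SYNC_FLUSH)            # byte-align, history is kept
    out += c.flush()
    return bytes(out), index

def read_from(compressed, entry, size):
    """Decompress up to `size` bytes starting at the given entry point."""
    comp_off, _, dictionary = entry
    if dictionary:
        d = zlib.decompressobj(-15, zdict=dictionary)  # preload the history
    else:
        d = zlib.decompressobj(-15)                    # first entry: no history
    return d.decompress(compressed[comp_off:], size)
```

Each entry costs up to 32 KiB of index space, which is why the spacing of the entry points matters much more here than for bzip2. For a very large file the slice `compressed[comp_off:]` would be replaced by a seek and incremental reads.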

The decompression state at any point in an lzma2 or ppmd compressed stream is extremely large. I do not believe that such a random access approach could be practical for those compression methods. The compressed data formats would need to be modified to break the data up into blocks at the time of compression, at some cost to the compression ratio.
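The cost to the compression ratio can be seen with a small experiment. This sketch uses Python's `lzma` module and writes each block as an independent .xz stream, which is an assumption made for illustration and not what 7z does; `compare_sizes` is a hypothetical helper.

```python
import lzma

def compare_sizes(data, block):
    """Return (size compressed as one stream, total size as independent blocks)."""
    whole = len(lzma.compress(data))
    blocked = sum(
        len(lzma.compress(data[i:i + block]))   # each block starts with no history
        for i in range(0, len(data), block))
    return whole, blocked
```

When the data repeats content across block boundaries, the single stream can refer back to the earlier occurrence, while the independent blocks have to encode it again. Larger blocks reduce that loss but make each random access slower.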

  • In case we don't need a precise index or an exact seek location, would it be possible, for a .7z containing a single .txt file, to start decompression roughly in the middle of a 100 GB .7z file? Let's say we don't care whether the decompression covers precisely the range 49-100 GB or 51-100 GB; all we want is to "search for this pattern in approximately the second half of the 7z file". How can this be done? By looking at the 7z headers, can we find the block size and learn that, for example, the compression "state" begins from scratch at file offset, say, 49 500 010 128?
    – Basj
    Commented Oct 7, 2022 at 7:55
  • No. You can't start decompressing in the middle of normally produced compressed data on the first pass, regardless of where you start. You would need to either a) specially prepare the compressed data to have historyless entry points, or b) decompress the entirety of the compressed data once to prepare an index as described above, which would allow subsequent decompression to start at the index points.
    – Mark Adler
    Commented Oct 7, 2022 at 8:03
  • @MarkAdler, would you know how to do b) with 7z.exe for a .7z LZMA2 file? I am OK with running a first pass on the whole file and logging, in an index file, the offsets where the state begins from scratch (with no history needed). Do you think such offset points always exist? Is there a command line 7z.exe ... that could help with this?
    – Basj
    Commented Oct 7, 2022 at 8:29
  • No, such historyless offset points never exist. You would need to write your own software to create such a .7z file.
    – Mark Adler
    Commented Oct 7, 2022 at 15:30
  • How can we back up the compression state at given points and reuse it later, as you suggested? Can 7z.exe with a special parameter give the "compression state", @MarkAdler?
    – Basj
    Commented Oct 11, 2022 at 8:27