
I have a tar archive which contains many zip files, each of which contains XML files. I would like to untar, unzip, and then do some processing on the plaintext in the XML files. The entire program is being written as a bash pipeline.

I need the unzip command to return:

  1. uncompressed file contents of each file inside the zip
  2. file delimiter (to know when file contents of one file stop and next one start)
  3. name of each file inside the zip

The next command in the pipeline needs these three things to do its job correctly. The file contents and delimiter have to be in a pipe (stdout); the filenames can either be in the same pipe, or in variables, or something similar.

Question: Which implementation of unzip should I use, and how do I do this?

bsdtar works (but AFAIK can't return filenames):

tar -xf ~/tar/0.tar --to-command 'bsdtar -xO --include="*.html" --include="*.xhtml" | iconv -f UTF-8 -t UTF-8//IGNORE | htmlq -tw'

P.S. I am doing the untar, unzip and everything else inside a bash pipeline rather than on disk because writing to disk slows the program down by about 30x. Each zip contains a lot of small files, and finding and reading them back later hits a disk I/O bottleneck.

P.P.S. I am aware that unzipping requires reading to the end first, so in theory pipes shouldn't help much. In practice this is not slowing down the program much (I'm assuming the entire zip file gets stored in RAM, which is fine).

  • Handling this in a pipe is overcomplicated; why not try a "RAM-disk" instead? e.g. scaler.com/topics/linux-ramdisk
    – Hannu
    Commented Oct 19 at 9:04
  • @Hannu Thanks, I will try this if I don't find another solution. The rest of my program is written as pipeline already, it is just this one command that I am stuck on. Commented Oct 19 at 9:25
  • (1) In a ramdisk or on a HDD/SSD, temporary directories and regular files were invented to be used. Pipelines are elegant for problems that fit them. There is no virtue in using a pipeline with programs or formats clearly not designed to be used in a pipeline. (2) Is this a recurring task? or a one-time (or rare) job? If the latter then possibly concocting a robust pipeline may take longer than just working with regular files. If the former then maybe you should consider switching/converting to a storage format that fits your needs better. Commented Oct 19 at 9:46
  • @KamilMaciorowski 1. Fair, I will try using ramfs just for this one step. I wonder if there's any standard practice for when to use ramfs versus a pipeline. 2. It's a recurring task; I need to process many TB of data this way. Each tarball is 5 GB. Re: switching storage format: the time required to download all the tarballs and untar/unzip them is large enough that I might as well process them fully on the first pass. Commented Oct 19 at 9:52
  • About (2): "Switching": I was hoping that whoever creates the files might create them in a better format in the first place from now on. "Converting": in case you need to process the same archive many times (possibly slightly differently each time) and yet you choose to store the input as a single archive, not as separate files. Note my questions are because your post does not tell us: (a) that the task is recurring; (b) if all the input files are already created or some of them are about to be created and you can request changes; (c) if you will need to process each file multiple times. Commented Oct 19 at 10:25

1 Answer


This should be a comment, but would be overlong. It might provide some guidance.

zip is fairly smart. It cannot assume how much memory is available (so it won't read the whole input into memory), and it does not know how much compression it has achieved until that phase is complete. (It will of course benefit from normal caching.)

While compressing, it accumulates the list of files in the archive (and the stats for each) and appends those after all the data (this is the zip "central directory"). Then it cunningly writes an epilogue (the end-of-central-directory record) which contains the seek address of the start of that file list. Each directory entry also records the offset in the zip file of the start of its subfile, which optimises partial extracts too.

So unzip with the -l or -v option can seek to the end, seek back over the end-of-central-directory record, and report the archive listing without reading anything else. You could verify this by running a small test file under strace.
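
For example, something along these lines should show the seek/read pattern (assuming strace is installed; Test1.zip is just a placeholder name for a small test archive):

# Trace only the open, seek and read calls made while listing the archive.
strace -e trace=openat,lseek,read unzip -l Test1.zip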

You might find it is sufficiently fast to get the filename list with unzip -v, parse that with awk, and extract each file separately, maybe prepending the file details. This would also be an interesting exercise under strace, which would show the seek/read strategy.
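
A minimal sketch of that idea, assuming Info-ZIP's unzip plus its companion zipinfo (zipinfo -1 prints just the member names, which sidesteps parsing the columned listing), and assuming each zip has already been written to a seekable file such as one on a ram-disk. The '==== name ====' line is an arbitrary delimiter for the next pipeline stage to split on:

zip="$1"    # path to one zip file (placeholder)
# For each member: print a delimiter line carrying the filename,
# then stream the uncompressed contents to stdout.
zipinfo -1 "$zip" | while IFS= read -r name; do
    printf '==== %s ====\n' "$name"
    unzip -p "$zip" "$name"
done

Note that unzip treats the member argument as a pattern, so names containing characters like [ or * would need escaping; filtering to *.html members could be done in the zipinfo call or inside the loop.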

You might find it needs to write files to disk, but you could optimise that with a ram-disk, as the individual files are small. Also, some members of the zip family can delete files once they have been processed; test in a development directory before attempting production.
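
For reference, a tmpfs mount is one way to get a RAM-backed scratch directory (the mount point and size here are only examples):

sudo mkdir -p /mnt/zipscratch
sudo mount -t tmpfs -o size=2G tmpfs /mnt/zipscratch    # RAM-backed filesystem
# ... extract and process under /mnt/zipscratch ...
sudo umount /mnt/zipscratch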

If this is a production job, I would probably decouple the extract and the upload. Extract files into a pend subdirectory, move completed files into a live subdirectory, and let the uploader (possibly running in parallel) move uploaded files into a done directory.
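
A rough sketch of that hand-off, using those directory names (archive.zip and the upload command are placeholders; mv within one filesystem is atomic, so each stage only ever sees complete files):

mkdir -p pend live done
# Extractor: unpack into pend/, then publish the finished files atomically.
unzip -q archive.zip -d pend && mv pend/* live/
# Uploader (can run concurrently): consume from live/, archive into done/.
for f in live/*; do
    upload "$f" && mv "$f" done/    # 'upload' stands in for the real transfer step
done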

EDIT: Note zcat and gunzip do not fully handle zip archives. zip and unzip are required.

I made a test .zip file and ran a couple of commands inside strace. The results are somewhat confusing because the file structure aligns some things (like the file list) to block boundaries, and unzip also reads complete aligned 8192-byte chunks.

However, I can confirm:

(a) unzip -v Test1.zip makes two seeks and one read to produce the verbose file listing.

(b) unzip Test1.zip csvParse.c makes five seeks and six reads to search for, find and extract one file. It does not read anything that is not required.

My conclusion is that dealing with each extracted file individually will not impact performance significantly, and will simplify the required processing.

The archive contents list appears to be very regularly formatted in columns, and should be simple to parse to get the filenames.
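
For example, with Info-ZIP's unzip -l output (an archive header, a column header, one line per member, then a summary), an awk one-liner along these lines pulls out the name column. It is only a sketch and will mis-split names containing spaces, which is one reason the zipinfo -1 listing above is safer:

# Skip the header lines, keep only rows that start with a numeric length field.
unzip -l Test1.zip | awk 'NR > 3 && NF >= 4 && $1 ~ /^[0-9]+$/ { print $4 }'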

  • Note: a tar file contains some sort of directory(?) of the files inside it. A ".tar.gz" file is the same tar file, compressed as a single file, with no "table of contents"; it has to be decompressed to read the directory. pkzip files and other "archiver" files, on the other hand, have a table of contents. [ Not researched! Old info; may be wrong ]
    – Hannu
    Commented Oct 19 at 13:37
  • Thanks for taking the time to reply, I will try to do this. (parse list and then extract each file individually) Commented Oct 19 at 14:16
  • @Hannu Fundamentally correct. tar writes a control block (normally 512 bytes) inline before every archived file, and each archive is rounded up to a multiple of 512 bytes. This is a throw-back to mag-tape, where you can skip blocks but not seek. The whole thing then gets compressed together, destroying any kind of alignment that might help to select any particular file. If it is not compressed, and on a seekable medium, tar can skip to the next control block (and I assume it can do this on some tape drives, but not all). Commented Oct 19 at 17:21
  • @ghosts_in_the_code Final statistic: I made a .zip of 215 MB (four files at 128M with a 340-byte file in the middle, 58% compression). unzip -p and send to /dev/null takes 10.3 secs. Extracting the small file alone takes just real 0m0.008s. Commented Oct 20 at 9:39
  • @Paul_Pedant Thanks I will look into this! I'm considering just skipping the filenames altogether. But your solution could definitely be useful for someone. Thanks for taking the time to type it out. Commented Oct 23 at 18:46
