I have a tar archive containing many zip files, each of which contains XML files. I would like to untar, unzip, and then process the plain text in the XML files. The entire program is written as a bash pipeline.
I need the unzip command to return:
- the uncompressed contents of each file inside the zip
- a delimiter between files (to know where one file's contents end and the next begin)
- the name of each file inside the zip
The next command in the pipeline needs these 3 things to do its job correctly. The file contents and delimiter have to be in a pipe (stdout), the filenames can either be in the same pipe or in variables or something.
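To make the requirement concrete, here is a minimal sketch of one stream layout that would carry all three things in a single pipe. It is only an assumption about what the format could look like, not something any unzip implementation produces by default: each file is sent as its name, a NUL byte, its contents, and another NUL byte. The function name read_records is hypothetical, and the sketch assumes that neither file names nor the XML text contain NUL bytes.

```shell
#!/usr/bin/env bash
# Hypothetical framing (an assumption, not the output of any unzip tool):
#   <name> NUL <contents> NUL <name> NUL <contents> NUL ...
# Assumes neither file names nor XML text contain NUL bytes.

# read_records: read the framed stream from stdin and print one
# summary line per file (name and number of characters in its contents).
read_records() {
  local name body
  # IFS= and -r keep whitespace and backslashes intact; -d '' splits on NUL.
  while IFS= read -r -d '' name && IFS= read -r -d '' body; do
    printf '%s: %d chars\n' "$name" "${#body}"
  done
}
```

For example, printf '%s\0' a.xml '<a>1</a>' b.xml '<b>22</b>' | read_records prints "a.xml: 8 chars" and "b.xml: 9 chars". The real consumer would do its processing where the printf summary line is.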
Question: Which implementation of unzip should I use, and how do I do this?
bsdtar works (but AFAIK it can't return filenames along with the contents):
tar -xf ~/tar/0.tar --to-command 'bsdtar -xO --include="*.html" --include="*.xhtml" | iconv -f UTF-8 -t UTF-8//IGNORE | htmlq -tw'
P.S. I am doing the untar, unzip, and everything else inside a bash pipeline rather than on disk because writing to disk slows the program down by 30x. Each zip contains a lot of small files, and trying to find them later hits the disk I/O bottleneck.
P.P.S. I am aware that unzipping normally requires reading to the end of the file first (the zip central directory is stored at the end), so in theory pipes shouldn't help much. In practice this is not slowing the program down much (I'm assuming the entire zip file gets stored in RAM, which is fine).