I have some huge files that are currently gzipped, and I would like to xz them. I want to set up a script to do this, but I want to be careful not to lose data; that is, I should never delete the gzipped version unless the xz version was definitely created correctly. Since these are big files, I'd also prefer not to decompress the file to disk first. I was thinking a pipe along the lines of
set -o pipefail; gzip -dc file.gz | xz > file.xz && rm file.gz
might be close to what I want. What's the right way to do this? Is this guaranteed to catch any failures that occur before removing the final file?
1 Answer
Adding a SHA1 sum (which guarantees, to a ridiculously high degree of certainty, that the files match when the hashes match and differ when the hashes differ) adds a measure of data integrity to guard against cases where the disk subsystem might have made a (silent) mistake while writing. Silent corruption is rare, but insidious when it happens.
Of course, you could still get confused results if you have random errors while reading, but in that case the sums won't match anyway, to an extremely high degree of certainty. In other words, if the system is corrupt (either the RAM or the disk producing wrong bits / flipped bits / corrupted data), then this will fail where a simple && might succeed, and the chances of reaching the rm line with corrupt data are vanishingly small (because most errors corrupt data in random ways, the chance of a random change producing a SHA1 hash collision during the readback is breathtakingly tiny).
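To illustrate the point (a quick sketch, separate from the script below; the sample strings are my own), even a one-character difference in the input produces a completely different SHA1 digest, so corruption during the round trip is overwhelmingly likely to make the sums differ:

```shell
# Flipping a single character yields an entirely different digest,
# so corrupted data will almost certainly fail the comparison.
printf 'some big file contents\n' | sha1sum
printf 'some big file content!\n' | sha1sum
```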
#!/bin/bash
set -e           # exit as soon as any command fails
set -o pipefail  # a pipeline fails if any command in it fails
# Decompress once, tee-ing the stream to both the xz compressor and sha1sum.
ORIGSUM=$(gzip -dc file.gz | tee >(xz > file.xz) | sha1sum)
# Decompress the freshly written .xz and hash it for comparison.
NEWSUM=$(unxz -c file.xz | sha1sum)
# Remove the original only if the two sums are identical.
if [ "${ORIGSUM}" = "${NEWSUM}" ]; then rm file.gz; fi
The set -e makes the shell script exit as soon as any line of the script returns a nonzero exit code.
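As a quick aside (a minimal sketch, not part of the script above), set -o pipefail is what makes failures in the middle of a pipeline visible; without it, a pipeline's exit status is that of its last command only:

```shell
# Without pipefail, the pipeline's status comes from the last command,
# so the failing `false` is masked by the succeeding `true`.
false | true
echo "without pipefail: $?"    # prints 0 - the failure is hidden

set -o pipefail
false | true
echo "with pipefail: $?"       # prints 1 - the failure is reported
```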
Then we use the tee command to copy the un-gzipped output of the file to both the xz compressor and the sha1sum program. sha1sum calculates the SHA1 sum of the original data contained within the gzipped archive by un-gzipping it temporarily into the sha1sum program, which reads the data to compute the sum and then discards it. By using tee, we only have to pay the CPU cost of un-gzipping the file once.
Then we perform an additional computationally expensive step (for super-extra verification): we decompress the xz file (temporarily, into a stream) and pipe it to sha1sum to get our "new file" SHA1 sum.
Then we compare the two sums; if they aren't equal strings, or if one or both of them is zero-length, we will either get a script error (which exits, thanks to set -e) or the file won't be removed. You can implement an else clause for user-friendly error handling if you want, but this essential script as-is will be extremely safe, albeit not very informative to a user running the command interactively.
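For instance, a multi-file variant with such an else clause might look like the sketch below. This is my own elaboration, not part of the answer's script: the scratch directory, sample data, and warning message are assumptions for demonstration.

```shell
#!/bin/bash
set -e
set -o pipefail

# Demo setup (assumption: a throwaway scratch directory is acceptable).
mkdir -p /tmp/gz2xz-demo && cd /tmp/gz2xz-demo
printf 'some sample data' | gzip > sample.gz

for f in *.gz; do
    out="${f%.gz}.xz"
    # Same verify-then-remove approach as the single-file script.
    origsum=$(gzip -dc "$f" | tee >(xz > "$out") | sha1sum)
    newsum=$(unxz -c "$out" | sha1sum)
    if [ "$origsum" = "$newsum" ]; then
        rm "$f"
    else
        echo "verification failed for $f; keeping $f and $out" >&2
    fi
done
```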
In the end, the file.gz will be unlinked if and only if the uncompressed contents of file.gz and file.xz were exactly identical at the point in time the hashes were computed, to an astronomically high degree of certainty (with a 160-bit hash like SHA1, the odds of corrupt data producing a matching sum are on the order of 1 in 2^160, i.e. about 1 followed by 48 zeroes). At that point you only have to worry about the data getting corrupted after this script exits. ;)
Performance
This script will run at nearly the same speed as the original script in the question, except for the part that runs unxz. Fortunately, decompressing LZMA is extremely fast, almost as fast as regular Zip, and something like an order of magnitude faster than compressing to LZMA. If you have a fast CPU and the files are sufficiently small, this shouldn't add too much runtime to the script; and if you value data integrity over performance, it's a clear win.
Credit where credit is due
This answer on StackOverflow aided me substantially in writing this script.
-
That's a really nice way to make sure the compression worked correctly and keep runtime roughly the same. And, yes, I timed some of these operations, and the xz step was an order of magnitude slower than the unxz step, so I'm not concerned about that part. Commented Jan 31, 2014 at 1:46
-
I should add that, on a system with ECC RAM running a filesystem with built-in integrity checks such as btrfs or zfs, the operating system and hardware are already working together to do something quite similar to my sha1sum test, so if you are running such a configuration, the risk of removing the sha1sum check from this script is pretty negligible. On the other hand, if you have non-ECC RAM and a filesystem without built-in integrity checks, this will make your script significantly safer. Commented Jan 31, 2014 at 1:54
-
You want && not || in your && rm file.gz. Otherwise, with ||, the file.gz would be removed even when xz fails, which is what you don't want.
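A quick sketch of the difference the commenter is pointing out: the command after && runs only when the preceding pipeline succeeded, while the command after || runs only when it failed, which here would delete file.gz in exactly the wrong case.

```shell
# `&&` runs the right-hand side only on success:
true  && echo "runs on success"
false && echo "never printed"

# `||` runs the right-hand side only on failure:
false || echo "runs on failure"
```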