[kwlug-disc] So why not tar -cf tarball.tar a.xz b.xz c.xz, instead of tar -cJf tarball.tar.xz a b c ?

B. S. bs27975 at gmail.com
Thu Nov 3 18:10:34 EDT 2016


On 11/03/2016 12:37 PM, bob+kwlug at softscape.ca wrote:
> To turn this on its head a bit.....
>
> Are you lamenting the shortcomings of tar and compression or are you
> trying to solve for bit-rot within archives that are in this format?
>
> If the objective is to factor and mitigate bit-rot within compressed
> tar files, perhaps we should look at the medium it's being stored on
> instead.

No choice: today's medium, particularly given the sizes involved, is 
disk. [Tape is, and always was, even worse lifespan-wise. The same is 
true for optical, but more important for both is cost (given the 
capacities involved).] ['Worse' here, for tape, means number of 
passes, not necessarily longevity per pass.]

The objective has always been: given that it can be a very long time 
before it matters, and stuff happens (files go bad, let alone entire 
disks), how to know a backup tar is good. And, being many files, one 
error does not necessarily mean all files are inaccessible. The 
problem: how to determine which files within the 'bad' archive are 
still good?

And remember the mantra ... test your backups!
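A crude end-to-end read test might look something like the following 
(using the tarball.tar.xz name from the subject line, and assuming 
GNU tar and xz). It only tells you the archive is readable, not which 
member went bad:

    xz -t tarball.tar.xz                 # verify the xz container checksums
    tar -xOf tarball.tar.xz > /dev/null  # force a full decompress / read pass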

> If the compressed tar sits on disk, then you have various options.
> ZFS and BTRFS have the notion of checksumming disk blocks plus
> redundancy and logic to "heal" bit-rotted sectors.

BTRFS has been mentioned throughout, for the reasons you again state. 
(Let alone the deduplication possibilities.)

However ... btrfs does not necessarily, and in the home environment 
probably doesn't, bring anything more than (possible) detection when 
something has gone bad. Which still happens: damaged is still damaged, 
even with btrfs. Moreover, one still doesn't know which files within 
the archive are damaged - or conversely, which files are still 
entirely accurate.

Note to selves: SCRUB YOUR DISKS REGULARLY! Detect that something has 
gone bad sooner rather than later, when you are more likely to be able 
to do something about it. (Like still having a backup, or source copy, 
around.)
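On btrfs that might look something like this (the mount point is 
hypothetical):

    btrfs scrub start -B /mnt/backup   # -B: run in foreground and report
    btrfs scrub status /mnt/backup     # or check on a background scrub later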

> So your compressed tar file _should_ never fault.

No. They will. Count on it. It's only a matter of when. The problem 
is: will it ever matter? e.g. the archived files are no longer 
important - an outdated copy, or superseded by a subsequent backup.

When / if it should matter - how to know which files within the 
archive are still integral?

> Granted, you've traded your space
> gains with compression for losses with redundant data having to be
> being stored (ie: mirrored or erasure coded blocks). (I guess you're
> still benefiting since you're not having to keep redundancy for your
> uncompressed data.)

ONLY if your btrfs is mirrored AND you scrub regularly. In that case 
it will self-heal.
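For instance, something along these lines (device names and mount 
point are hypothetical):

    mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc   # mirror metadata and data
    mount /dev/sdb /mnt/backup
    btrfs scrub start -B /mnt/backup   # with a mirror, scrub can repair from the good copy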

But ... what about that backup disk (unmirrored) you've put off-site, or 
under the stairs in the basement? Perhaps in a slightly moist 
environment, perhaps you see a touch of rust ...

When you DO have to haul it out, and it's being persnickety, WHICH files 
within the tar are specifically bad, and thus conversely, which are OK? 
(And prove it.)


Tape is no longer in the picture in today's environment. 'Worst case' 
would be a robotic optical media library. I expect such libraries are 
less and less prevalent given the 'cheapness' of disk space, let alone 
of the cloud - where data redundancy and certainty -should- be built in.

[Again, though, how do you know? (The backup is good, wherever it resides.)]

> ...  RAID5

RAID 0 is for speed, RAID 1 is for safety (above), otherwise RAID is for 
uptime. Not data integrity.

Thus off-line archives, let alone off-site backups.

> Ie: tar cv some_source | gzip -9v | erasure_encode > /dev/st0 And:
> cat /dev/st0 | erasure_decode | gunzip | tar tv

Is there any difference here from the built-in de/compression options, 
and/or de/compressing after the fact with 'xz file' -> file.xz?

Where, for the purposes of this thread, file is myfile.tar
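i.e., modulo xz version and settings, these two should end up with 
the same compressed stream:

    tar -cJf tarball.tar.xz a b c                  # compress while creating
    tar -cf tarball.tar a b c && xz tarball.tar    # compress after the fact -> tarball.tar.xz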

> Does pushing the problem down lower into the storage stack help?

No, because ultimately it is the integrity of the whole, from tar's 
perspective, that matters (to be able to extract it with the tool used, 
in this case, tar).

These other steps merely make it more likely that tar will always be 
presented with an integral file to operate upon.

But btrfs bit rot demonstrates that such presentation, even today, 
isn't 100%. [Btrfs will, however, let one know when rot has happened, 
and prevent further writing from making it worse.]

As said previously, the solution is --to-command='md5sum -' at 
creation time. Periodically rerun and diff the two.
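A minimal sketch of how that might look, assuming GNU tar (which 
exports TAR_FILENAME to each --to-command child). The helper script 
name and path (~/bin/tar-member-md5) are made up for illustration:

    #!/bin/sh
    # ~/bin/tar-member-md5: stdin is the member's content; label the
    # hash with the member name GNU tar supplies in TAR_FILENAME.
    printf '%s  %s\n' "$(md5sum | cut -d' ' -f1)" "$TAR_FILENAME"

Then:

    # at creation time, record per-member checksums:
    tar -xf tarball.tar --to-command="$HOME/bin/tar-member-md5" > tarball.md5

    # later: regenerate and diff; differing lines point at damaged members:
    tar -xf tarball.tar --to-command="$HOME/bin/tar-member-md5" > tarball.md5.new
    diff tarball.md5 tarball.md5.new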


In my recent exercising of these things ... I have seen some pretty 
startling differences in compressed file sizes, even though btrfs is 
already compressing - i.e. file sizes before / after running xz.

The problem, as noted, is that compressing the whole archive, and 
thus putting various control data (e.g. checksums) -throughout- the 
file, means that one sector of bit rot, instead of taking out one file 
in the tar, likely takes out the entire tar.
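Which is exactly the subject line's alternative - compress the members 
first, then tar the already-compressed files, so a bad sector damages 
at most one member:

    xz -k a b c                          # produces a.xz b.xz c.xz (-k keeps the originals)
    tar -cf tarball.tar a.xz b.xz c.xz

rather than:

    tar -cJf tarball.tar.xz a b c        # one xz stream over the whole archive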

The dearth of checksumming within tar feels predicated on a certainty 
of storage integrity once stored. Experience has demonstrated that 
such certainty absolutely does not exist.

So back to the start - detecting which files within a tar are bad.




