[kwlug-disc] So why not tar -cf tarball.tar a.xz b.xz c.xz, instead of tar -cJf tarball.tar.xz a b c ?

B.S. bs27975.2 at gmail.com
Fri Oct 28 19:00:20 EDT 2016


On 10/28/2016 02:16 AM, Chris Frey wrote:
> On Fri, Oct 28, 2016 at 01:54:25AM -0400, Chris Frey wrote:
>> In my tests, a 1M split size seemed too big for tar to recover from
>> by itself, but fixtar was able to do it:
>>
>> 	https://github.com/BestSolution-at/fixtar
>
> Putting this together with gzrecover (gzrt) may be enough to not
> have to worry about this question at all.  If it is possible to
> recover from both corrupt gzip blocks and corrupt tar blocks to
> get at the data on the other side of the file, then going to the
> trouble of compressing first may not gain much.

You lost me there, at "then going to the trouble of compressing first 
may not gain much." - that seems to be saying it's not worth the 
trouble of compressing at all.

Which seems inconsistent with where you've been coming from / what 
you've been saying - so I doubt that's what you mean. (Not that you've 
been beating any particular drum.)

It does seem arguable not to compress at all, though, given compressing 
/ deduping filesystems. (Example below.)
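
For example, on ZFS the filesystem itself can be told to do the 
compressing - a sketch only, with "pool/backups" as a placeholder 
dataset name, not anything from this thread:

    # let the filesystem handle compression transparently
    zfs set compression=lz4 pool/backups
    # dedup exists too, though it's famously RAM-hungry
    zfs set dedup=on pool/backups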

Let's bear in mind that none of this conversation has been about mere 
file transfer. I.e. I can see value in compressing for transmission over 
slow links: uncompress on the other side; if there are errors, 
retransmit. As links get faster, though, e.g. within premises, I expect 
there must come some point where time to compress + time to decompress > 
time to transmit uncompressed. Let alone with today's fast processors, 
or net traffic that's already compressed in the first place.
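
A rough way to find that crossover on any given link - a sketch only, 
with "remotehost" and "bigdir" as placeholder names:

    # uncompressed: just ship the bytes
    time tar -cf - bigdir | ssh remotehost 'cat > /dev/null'
    # compressed: compress here, decompress there
    time tar -czf - bigdir | ssh remotehost 'gunzip > /dev/null'

Whichever finishes first wins, for that link and that data.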

But in this problem case - long-term storage (on a compressing / deduping 
filesystem), bit-rotted along the way - I've not yet encountered anything 
contesting the intuitive idea of not compressing.

And let's also not lose sight that what's compelling about compressing is 
more about integrity failure detection than file size. And that 
de/compressing often inherently keeps but one file - i.e. no sidecar 
files to additionally keep track of. For those purposes of rot detection, 
things like md5sums serve just as well. So, --to-command='md5sum -' at 
time of creation, periodically re-run and the outputs diff'ed, seems to 
take us to an equivalent place. (Regardless of compressing on the fly to 
file.tar.gz, or gzip file.tar.)
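
A minimal sketch of that, assuming GNU tar and an archive named 
backup.tar (a placeholder name):

    # baseline: pipe every member through md5sum instead of extracting to disk
    tar -xf backup.tar --to-command='md5sum -' > baseline.md5

    # later: repeat and diff; any change means rot somewhere in the data
    tar -xf backup.tar --to-command='md5sum -' > current.md5
    diff baseline.md5 current.md5

Note the output is one hash per member, in archive order, with no 
filenames - so a diff tells you something rotted, and tar -tf backup.tar 
gives you the member order to figure out which.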

What isn't compelling is 'gzip file.tar' going bad - with gzip metadata 
throughout, the entire tar is rendered broken. Versus individually 
gzipped files tar'red together - broken gzip files being easier to skip 
over, by having tar just skip to the next file header.
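
For concreteness, the two layouts from the subject line - a sketch only, 
with a, b, c as placeholder file names (xz as in the subject, but gzip 
behaves the same way):

    # compress each file first, then tar the compressed files
    xz -k a b c
    tar -cf tarball.tar a.xz b.xz c.xz

    # versus compressing the whole archive as one stream
    tar -cJf tarball.tar.xz a b c

In the first form a corrupted member costs you that one .xz, and tar can 
still seek to the next header. In the second, damage in the compressed 
stream can take out everything after it.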

And gzip rot is off the table entirely (if on a compressing / deduping 
filesystem), while file rot detection is maintained via md5sum.

P.S. I did just note from the tar manual that tar keeps an internal 
checksum (in each member's header), and that tar --list will report a 
checksum failure - so at least you'll know if you have a broken tar. You 
won't know where the data damage is, though - which is where md5 or sha1 
comes into play. (Or compressing.)
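
A quick sketch of that check, again with backup.tar as a placeholder name:

    # listing forces tar to read every header; a damaged header shows up
    # as a checksum / "skipping to next header" error and a nonzero exit
    tar -tvf backup.tar > /dev/null && echo "headers look OK"

That only vouches for the headers, though - rot inside a member's data 
still needs the md5/sha1 (or compression) check above.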
