[kwlug-disc] So why not tar -cf tarball.tar a.xz b.xz c.xz, instead of tar -cJf tarball.tar.xz a b c ?

B.S. bs27975.2 at gmail.com
Wed Oct 26 00:10:35 EDT 2016



On 10/25/2016 09:02 PM, bob+kwlug at softscape.ca wrote:
> I know this is not what you meant, but I submit this for
> consideration and discussion:
>
> $ find . -type f -print0 | xargs -0 sha1sum > 00_tar.sha1sum
> $ tar cvzf /dev/st0 .

Yep. Came to a similar conclusion myself, using --to-command.

> Also with the notion that gzip itself will signal integrity of the
> entire archive. And in cases where it doesn't, you can always fall
> back on this: http://www.gzip.org/recover.txt

{Herein, swap out the word 'zip' for your favourite compression program. 
Or encryption, for that matter.}

Saw that. Ain't pretty. 'Fall back' is a very generous description. Step 
1: get and modify the source ...

And ... that article exactly explains why we should be tarring zips, not 
zipping tars. Recovering a damaged zip is problematic: once your zipped 
tarball has issues, all bets are off. But once your non-zipped tarball 
has issues, you just skip ahead to the next file header block and 
extract the remaining zipped files. Etc.

At least if you have a botched zip, it isn't likely that all the zips in 
the tar are botched.
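As a concrete contrast (a sketch; the file names are just illustrative):

```shell
# make some sample files
printf 'alpha\n' > a; printf 'beta\n' > b; printf 'gamma\n' > c

# compress first, then tar: each .gz carries its own CRC-32,
# so damage to one member doesn't poison its neighbours
for f in a b c; do gzip -c "$f" > "$f.gz"; done
tar -cf tarball.tar a.gz b.gz c.gz

# versus the conventional way: one compressed stream, one failure domain
tar -czf tarball.tar.gz a b c
```

With the first form, `gzip -t` on any single extracted member verifies 
just that member - no sidecar checksum file needed.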

Here's the rub ... conversely, zips don't have the filesystem smarts 
that tar, it appears, is kept current with. So if you tar zips, the zips 
themselves have already dropped all but the most basic filesystem 
metadata of the original files.

Further, with filesystems like btrfs that compress and de-dup, it's 
arguable there's no point to zip'ping anything at all.

But then there go your integrity checks. (Not really - solution for 
another day.)

> On a slightly different approach, I think that if you were to use the
> option
>
> -I, --use-compress-program PROG filter through PROG (must accept -d)
>
> and write your own version of PROG ...

D'OH! My eyes have passed over that I don't know how many times. I've 
been saying there oughta be a --in-command or --from-command to match 
the --to-command. Double D'OH!
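For the record, a minimal PROG for -I can be a tiny shell wrapper (the 
name `myprog` is made up here); GNU tar passes it -d when it wants 
decompression, otherwise pipes the archive through it for compression:

```shell
# write the wrapper PROG (hypothetical name: myprog)
cat > myprog <<'EOF'
#!/bin/sh
# decompress when tar passes -d, otherwise compress
[ "$1" = "-d" ] && exec gzip -d
exec gzip
EOF
chmod +x myprog

# round-trip a file through tar -I with our PROG
mkdir -p demo && printf 'hello\n' > demo/f.txt
tar -I ./myprog -cf demo.tar.gz demo
rm -r demo
tar -I ./myprog -xf demo.tar.gz
```

A real checksumming PROG would do more than exec gzip, but the -d 
dispatch is the whole contract tar asks of it.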

Two or three REAL beauties of tar:
- understands most all metadata.
- REALLY good file set facilities. Perhaps superior, even.
- --remove-files: 'pack it away' functionality. Versus duplicating the 
file set (e.g. to iterate gzip over every file before handing it to 
tar), then manually removing the originals after packing - tracking 
whether each is a file, a dir, a pipe, a link (hard or soft), a socket, 
a ...
= Conversely, unzip'ping them upon extraction from tar.
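The --remove-files behaviour, sketched (GNU tar; file name is 
illustrative):

```shell
# GNU tar's --remove-files deletes each source once it's safely archived
printf 'data\n' > x.dat
tar -cf stash.tar --remove-files x.dat
# x.dat is now gone; stash.tar holds the only copy
```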

> that would augment a tar stream,
> you could inject your own checksums in there. You could even probably
> craft it such that you only compressed the actual files themselves
> and not the tar headers and meta info. So that lacking PROG, you can
> extract gzipped versions of the actual files. Although, tar itself
> might have some issues with the extra bytes in its stream if it keeps
> some sort of integrity checks on data length for files.

Shouldn't matter - it's just a stream of data to tar. Even if it does 
... perhaps one could create a sidecar file on the fly with the 
associated data. One would have to craft an accompanying --to-command 
prog, though - but given the one, the other should be little work. In 
the end it remains user transparent. (Albeit apparent?) Transparent as 
in automated / hands free.

(Moreover, the original and sidecar files could be tar'red together, and 
that tar itself be the one that goes into the main tar.)

> You would inject a hash on tar -c and verify on tar -x (by way of the
> -d being passed to PROG), although I'm not sure how you would signal
> a problem (maybe output to stderr from PROG??).
>
> My understanding of PROG is that it is just pipelined into the data
> stream just before the output handle of the tar command. I've never
> really played with this, so I'm just theorizing. It might be an
> interesting experiment.

Definitely.

At some point I'll post my conclusions. I've been beating on things in 
various ways, but nothing is at the point of summarizing or posting yet.

>
> (the other) Bob.
>

THANK YOU! Good talk!


>> -----Original Message-----
>> From: kwlug-disc [mailto:kwlug-disc-bounces at kwlug.org] On Behalf Of B.S.
>> Sent: Friday, October 21, 2016 10:07 AM
>> To: Kwlug-Disc
>> Subject: [kwlug-disc] So why not tar -cf tarball.tar a.xz b.xz c.xz,
>> instead of tar -cJf tarball.tar.xz a b c ?
>>
>> By itself, tar has no individual file integrity (checksumming)
>> ability - albeit the entire tarball itself is checksummed when
>> used, as traditionally, with compressors such as gzip.
>>
>> I've read that tarballs can be fragile. And when damaged, there's
>> no way to know which files remain undamaged. [Yet tar is the only
>> archiver kept current with filesystem enhancements, such as ACLs,
>> xattrs, links, pipes, devices, etc. Zip (also an archiver) isn't,
>> nor gzip, et al.]
>>
>> Yes, md5's would also confirm integrity - while adding awkwardness
>> and sidecar files to also track and validate.
>>
>> So why is tar'ring gzip's instead of gzip'ping tarballs not more
>> popular? [Google-fu fail.] Yes, it won't compress as much, but
>> that seems a small price to pay for individual file integrity
>> assurance.
