[kwlug-disc] So why not tar -cf tarball.tar a.xz b.xz c.xz, instead of tar -cJf tarball.tar.xz a b c ?

B.S. bs27975.2 at gmail.com
Wed Oct 26 15:49:45 EDT 2016



On 10/26/2016 11:24 AM, bob+kwlug at softscape.ca wrote:
 >> Yep. Came to a similar conclusion myself, using --to-command.
 >>
 > Cool. I hadn't noticed this option before. Will have to grok that for
 > a bit and see how it could be useful.

--to-command='md5sum -' ... run and saved at creation time, run again any 
other time, diff the two ... fini.
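A minimal sketch of that idea (paths here are made up for illustration; 
assumes GNU tar and md5sum):

```shell
# Make a small archive to play with (hypothetical paths).
mkdir -p /tmp/tcdemo && printf 'hello\n' > /tmp/tcdemo/a.txt
tar -cf /tmp/tcdemo.tar -C /tmp tcdemo

# At creation time: pipe each member through md5sum and save the output.
tar -xf /tmp/tcdemo.tar --to-command='md5sum -' > /tmp/tcdemo.md5.orig

# Any later time: recompute and diff. Silence from diff means no bit rot.
tar -xf /tmp/tcdemo.tar --to-command='md5sum -' > /tmp/tcdemo.md5.now
diff /tmp/tcdemo.md5.orig /tmp/tcdemo.md5.now
```

Note that with plain 'md5sum -' every line is labelled '-' rather than 
with the member name, which is fine for a straight diff of the whole list.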

 > Yeah, this always bothered me about tar | gzip but not enough to do
 > anything about it. I guess I've never been burned by it yet to have
 > spent cycles thinking about it.

And should / when burned ... TOO LATE! <sigh>


>> D'OH! My eyes have passed over that I don't know how many times.
>> I've been saying there oughta be a --in-command or --from-command
>> to match the --to-command. Double D'OH!
>
> Did a quick experiment today as follows:
>
> tar -cvf /tmp/1.tar --use-compress-program /tmp/PROG.sh  /some/dir
>
> where PROG.sh was just 'tee /tmp/2.out'
>
> What was cool was that 1.tar and 2.out were IDENTICAL!

Check me on this, please:

Eventually I got around to reading the blurb on this (man tar) - note, 
pay particular attention to 'bugs' - I skipped over it too often. In 
particular, use https://www.gnu.org/software/tar/manual/ rather than man 
- although it's not much better. (Lines in man were curiously cut off 
... then I read the bugs bit.)

I have not yet had a chance to conduct my own similar experiments, but 
when I read the man tar blurb, I came away with the impression that this 
isn't going to do what we're talking about.

My expectation (your experience bears out?) is that this gets called 
once per invocation - to un/zip it all up - NOT once per file within the 
archive.
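One way to check that expectation is to count invocations - a sketch, 
with made-up paths and a stand-in "compressor" that logs each run and 
then just passes the stream through:

```shell
# Three files in, one archive out; how many times does the program run?
mkdir -p /tmp/ucpdemo/src
touch /tmp/ucpdemo/src/f1 /tmp/ucpdemo/src/f2 /tmp/ucpdemo/src/f3

cat > /tmp/ucpdemo/count.sh <<'EOF'
#!/bin/sh
# Log one line per invocation, then act as a no-op "compressor".
echo run >> /tmp/ucpdemo/invocations
exec cat
EOF
chmod +x /tmp/ucpdemo/count.sh

: > /tmp/ucpdemo/invocations
tar -cf /tmp/ucpdemo/out.tar --use-compress-program=/tmp/ucpdemo/count.sh \
    -C /tmp/ucpdemo src
wc -l < /tmp/ucpdemo/invocations   # 1 line if per archive, 3+ if per file
```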

Moreover, IIReadC, the facility is too limited (it receives stdin and 
emits stdout) to do what I wrote, such as add sidecar files 
automatically. (Although one could tar the input and sidecar files into 
one stdout stream.) And, in the process, probably lose the metadata. 
Only testing will show - my guess is that tar must re-wrap the filename 
around things, which would rename this 'sub-'tar above to the original 
filename; and how would the untarrer know to treat it specially when 
processing? Also unclear is whether tar re-wraps the metadata around it, 
e.g. if links are just pointers (strings) within the tar stream to 
elsewhere, ...

>
> I don't think it would take much to add some computation of the
> stream and data into the stream inside of PROG.sh to checksum the
> individual files. Just need to know enough about the data structures
> of a tar file to do this as it flies by.

Probably not (need to know the internal structure) - although details 
are at that link; search for TAR_FILENAME within it and you'll find the 
env. vars. that are probably the only ones of use.
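A sketch of using those vars (GNU tar exports TAR_FILENAME, TAR_SIZE, 
etc. to the --to-command child; paths here are hypothetical):

```shell
# Helper that labels each checksum with the member name from the
# environment, instead of md5sum's '-' placeholder.
mkdir -p /tmp/tfdemo && printf 'abc\n' > /tmp/tfdemo/x.txt

cat > /tmp/tfsum.sh <<'EOF'
#!/bin/sh
# GNU tar sets TAR_FILENAME for each member fed to --to-command.
printf '%s  %s\n' "$(md5sum - | cut -d' ' -f1)" "$TAR_FILENAME"
EOF
chmod +x /tmp/tfsum.sh

tar -cf /tmp/tfdemo.tar -C /tmp tfdemo
tar -xf /tmp/tfdemo.tar --to-command=/tmp/tfsum.sh > /tmp/tfdemo.md5
cat /tmp/tfdemo.md5
```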


> Hear, hear! Tar, cpio and rsync. Essentials.

Some day I'll dig into cpio; it took me long enough to dig into tar, and 
I sense much more there than I can grok at the moment - see backups, and 
dump. (Which may incorporate the --from-command functionality?)

Perhaps instead of, or in addition to, cpio, afio. IIRC, mondoarchive 
uses it over cpio, and the reasoning for it I came across somewhere in 
the forum at some point. (Bruno, I trust. Way too much goodness and 
thinking obviously evident within the mondo scripts to not have 100% 
confidence in him.)

However, when I glanced over man afio and cpio a couple days back, I 
didn't quickly see the --remove-files functionality of tar, which I 
think is a deal breaker.


> Although, I think I'd prefer to have the check summing in-line rather
> than in a sidecar file. I'll need to think about that for a bit.

So would I, but it's absolutely not possible / not going to happen.

As commented earlier in the thread, tar is WAY too widespread for such 
radical changes. Any change proposal would have to suit everything from 
DOS to Win to HP to Sun to Unix to ...

Good luck with that.

They can't even keep stdout / stderr straight - and there's no way I 
could possibly be the first to bring that up, yet it is the way it is.

e.g. tar --checkpoint-action=dots --verbose -cf mytar.tar files* | tee 
mytar.filelist ... sends what are supposed to be stderr dots into 
mytar.filelist. (Append another --verbose in there for an ls -l form of 
filelist. FWIW.)


> Although, I wonder if any of these:
>
> -H, --format FORMAT   create archive of the given format. FORMAT is
> one of the following:
...
> --format=v7 old V7 tar format

I looked at that, thought it was =sysv7 that was noted as having 
checksums, but when I tried something like tar -H=sysv7 -acf 
mytar.tar.xz files, it fell over saying it couldn't compress such a 
format, I think.

Perhaps I need to revisit, if compressing is off the table (with btrfs 
checksumming, compressing, and dedup'ing.)

A quick look through the link shows no 'sysv7' or 'compress'-related 
search result, so sysv7 wasn't it. 'compress' is very present as a 
search term, and its problematic nature is discussed.

Note: Just saw 'Secondly, multi-volume archives cannot be compressed.'

Let alone: 'Compressed archives are easily corrupted, because compressed 
files have little redundancy. The adaptive nature of the compression 
scheme means that the compression tables are implicitly spread all over 
the archive. If you lose a few blocks, the dynamic construction of the 
compression tables becomes unsynchronized, and there is little chance 
that you could recover later in the archive.'

Read: lose a block, lose the (tar) file - the compression tables being 
spread throughout means a horrible death upon the bad block / likely no 
getting past it. Now go back to your 'recover a damaged gzip' article.

> The '-I PROG.sh' approach might be a valuable plug-in unto itself
> such that it could make a tar archive with compression potentially
> mostly salvageable and bakes in integrity checks.

Only if called per file, not once per archive. (I expect.)
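Per-archive is still enough for the integrity-check half, though: pass 
the stream through untouched while keeping a side copy, then checksum 
the copy when the stream ends. A sketch (hypothetical paths; assumes 
GNU tar):

```shell
mkdir -p /tmp/spdemo && echo hi > /tmp/spdemo/f

cat > /tmp/sumpipe.sh <<'EOF'
#!/bin/sh
# Pass the archive stream through unchanged, keeping a side copy;
# checksum the copy once tar closes the pipe.
tee /tmp/sp.copy
md5sum /tmp/sp.copy > /tmp/sp.md5
EOF
chmod +x /tmp/sumpipe.sh

# /tmp/sp.tar comes out a plain tar (our "compressor" doesn't compress);
# /tmp/sp.md5 records the whole-stream checksum as a sidecar.
tar -cf /tmp/sp.tar --use-compress-program=/tmp/sumpipe.sh -C /tmp spdemo
```

It is one checksum per archive, not per member - so it locates *that* 
damage exists, not *where*.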




