[kwlug-disc] BASH compare items in two files

Wed Nov 3 11:24:47 EDT 2010

A program in a higher-level language (Python, Perl, C, etc.) would be your best bet, as these already have XML parsing and data handling libraries that will make this task a cinch.

1. Load the list of IDs into a searchable list
2. Using a SAX parser, compare the ID of every node against the list
3. Done!
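
Roughly, the steps above could look like the following in Python. This is only a minimal sketch, assuming the IDs file uses '#' for comment lines and that the attribute is literally named "id", as in the original post. The names load_ids, IdCounter and summarize are made up for illustration. Elements without an id attribute are simply skipped here.

```python
import io
import xml.sax
from collections import Counter

# Element names taken from the original post.
TAGS = ("foo", "bar", "baz")


def load_ids(lines):
    """Step 1: build a set of IDs, skipping blank lines and '#' comments."""
    ids = set()
    for line in lines:
        line = line.strip()  # also drops the leading padding whitespace
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


class IdCounter(xml.sax.ContentHandler):
    """Step 2: count matching and non-matching IDs per element type."""

    def __init__(self, ids):
        super().__init__()
        self.ids = ids
        self.matched = Counter()
        self.unmatched = Counter()

    def startElement(self, name, attrs):
        # Only look at the interesting elements that carry an id attribute.
        if name in TAGS and "id" in attrs:
            if attrs["id"] in self.ids:
                self.matched[name] += 1
            else:
                self.unmatched[name] += 1


def summarize(xml_source, id_lines):
    """xml_source is a filename or a binary file object; id_lines is any
    iterable of lines, such as an open IDs file."""
    handler = IdCounter(load_ids(id_lines))
    xml.sax.parse(xml_source, handler)  # streams; never loads the whole file
    return handler.matched, handler.unmatched


if __name__ == "__main__":
    # Tiny made-up sample standing in for the real files.
    sample = b'<root><foo x="1" id="12"/><foo id="99"/><bar id="1"/></root>'
    matched, unmatched = summarize(io.BytesIO(sample), ["# comment", "   1", "  12"])
    for tag in TAGS:
        print(tag, "matched:", matched[tag], "unmatched:", unmatched[tag])
```

Because SAX handles the document as a stream of events, the multi-GB XML file never has to fit in memory, and a set of 100,000 short strings is small, with each lookup taking roughly constant time.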

Raul Suarez

Technology consultant
Software, Hardware and Practices
_________________
Twitter: rarsamx
http://rarsa.blogspot.com/ 
An eclectic collection of random thoughts

--- On Tue, 11/2/10, Richard Weait <richard at weait.com> wrote:

> From: Richard Weait <richard at weait.com>
> Subject: [kwlug-disc] BASH compare items in two files
> To: "KWLUG discussion" <kwlug-disc at kwlug.org>
> Received: Tuesday, November 2, 2010, 1:47 PM
> Dear all.
> 
> Every year I try again.  I write a shell script of one description or
> another.  Every year it ends the same way.  A terrible script and
> tears and recriminations for everybody.  This year I'm trying it a
> little differently.  This year, I'm asking for guidance in advance.
> 
> Here is the problem.
> 
> I am given two files: one is an xml file and the other is a list of IDs.
> 
> The ID file is a list of IDs, one per line.  Not all possible IDs are
> included in the file.
> 
> # part of the IDs file
> # includes comments
> # the real file includes leading whitespace to pad numbers to the same length.
> 1
> 12
> 1930272
> etc
> 
> The xml file has three types of xml items in it that I care about,
> foos, bars and bazes.  Each item has an ID associated with it.
> 
> Both files change over time, so I'll be running this script
> periodically.  Or rather, cron will be running this script
> periodically.  The files are of manageable size, though the xml file
> can be large-ish, up to several GB at times.  The IDs file shouldn't
> exceed 100,000 items / lines.
> 
> My desire is to create a summary so that for each of foo, bar and baz,
> I'll count items that match the ID list and items that don't.
> 
> Simplifying assumption: in the xml file the interesting items have a
> nice xml opening tag that includes the ID.
> 
> <foo [... stuff] id="IDnumber" ... >
> 
> I can imagine an unwieldy combination of greps that will do this, but
> that sounds, um, inelegant.  And bad for the disks, and
> brute-force-ish.  What would a real programmer do?  I would imagine
> the answer is something along the lines of
> 
> put the IDs file into a variable
> read a portion of the xml file
> for each line
>   is the line interesting? foo, bar, baz
>     is the ID in the ID file or not
>       update the counters in a smart way
> next line
> output totals
> 
> Helpful pointers?
> 
> Best regards,
> Richard