[kwlug-disc] BASH compare items in two files
rarsa at yahoo.com
Wed Nov 3 11:24:47 EDT 2010
A program in a higher-level language (Python, Perl, C, etc.) would be your best bet, as those languages already have the XML parsing and data handling libraries that will make this task a cinch.
1. Load the list of IDs into a searchable list
2. Using a SAX parser, compare the ID of every node against the list
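The two steps above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the element names foo, bar and baz and the attribute name "id" are taken from the question, the IDs file is assumed to hold one ID per line with optional "#" comment lines, and the names load_ids, IdCounter and summarize are made up for illustration. A set gives constant-time lookups for the 100,000 or so IDs, and a SAX parser streams the XML, so a multi-GB file never has to fit in memory.

```python
import xml.sax
from collections import Counter

# Element names taken from the original question.
INTERESTING = {"foo", "bar", "baz"}


def load_ids(lines):
    """Step 1: build a set of IDs, skipping blank lines and '#' comments."""
    ids = set()
    for line in lines:
        entry = line.strip()  # drops padding whitespace and the newline
        if entry and not entry.startswith("#"):
            ids.add(entry)
    return ids


class IdCounter(xml.sax.ContentHandler):
    """Step 2: count interesting elements by whether their id is listed."""

    def __init__(self, ids):
        super().__init__()
        self.ids = ids
        self.counts = Counter()

    def startElement(self, name, attrs):
        # Only opening tags of interesting elements that carry an id count.
        if name in INTERESTING and "id" in attrs:
            status = "listed" if attrs["id"] in self.ids else "unlisted"
            self.counts[name, status] += 1


def summarize(xml_source, id_lines):
    """Return a Counter keyed by (element name, 'listed' or 'unlisted')."""
    handler = IdCounter(load_ids(id_lines))
    xml.sax.parse(xml_source, handler)  # accepts a filename or a file object
    return handler.counts
```

Calling summarize with the path of the XML file and an open handle on the IDs file returns the six counters (listed and unlisted for each of foo, bar and baz), which a cron job could then print or log. IDs are compared as strings, so an ID written differently in the two files (for example with leading zeros in only one of them) would need normalizing first.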
Software, Hardware and Practices
An eclectic collection of random thoughts
--- On Tue, 11/2/10, Richard Weait <richard at weait.com> wrote:
> From: Richard Weait <richard at weait.com>
> Subject: [kwlug-disc] BASH compare items in two files
> To: "KWLUG discussion" <kwlug-disc at kwlug.org>
> Received: Tuesday, November 2, 2010, 1:47 PM
> Dear all.
> Every year I try again. I write a shell script of one description or
> another. Every year it ends the same way. A terrible script and tears
> and recriminations for everybody. This year I'm trying it a little
> differently. This year, I'm asking for guidance in advance.
> Here is the problem.
> I am given two files, one is an xml file and the other is a list of IDs.
> The ID file is a list of IDs, one per line. Not all possible IDs are
> included in the file.
> # part of the IDs file
> # includes comments
> # the real file includes leading whitespace to pad numbers to the same length.
> The xml file has three types of xml items in it that I care about:
> foos, bars and bazes. Each item has an ID associated with it.
> Both files change over time, so I'll be running this periodically. Or
> rather, cron will be running this periodically. The files are of
> manageable size, though the xml file can be large-ish, up to several GB
> at times. The IDs file shouldn't exceed 100,000 items / lines.
> My desire is to create a summary so that for each of foo, bar and baz,
> I'll count items that match the ID list and items that do not.
> Simplifying assumption: in the xml file the interesting items have a
> nice xml opening tag that includes the ID.
> <foo [... stuff] id="IDnumber" ... >
> I can imagine an unwieldy combination of greps that will do this, but
> that sounds, um, inelegant. And bad for the disks, brute-force-ish.
> What would a real programmer do? I would imagine the answer is
> something along the lines of
> put the IDs file into a variable
> read a portion of the xml file
>   for each line
>     is the line interesting? foo, bar, baz
>     is the ID in the ID file or not
>     update the counters in a smart way
>   next line
> output totals
> Helpful pointers?
> Best regards,