[kwlug-disc] BASH compare items in two files
William Park
opengeometry at yahoo.ca
Thu Nov 4 02:16:18 EDT 2010
On Tue, Nov 02, 2010 at 01:47:26PM -0400, Richard Weait wrote:
> Dear all.
>
> Every year I try again. I write a shell script of one description or
> another. Every year it ends the same way. A terrible script and
> tears and recriminations for everybody. This year I'm trying it a
> little differently. This year, I'm asking for guidance in advance.
>
> Here is the problem.
>
> I am given two files, one is an xml file and the other is a list of IDs.
>
> The ID file is a list of IDs, one per line. Not all possible IDs are
> included in the file.
>
> # part of the IDs file
> # includes comments
> # the real file includes leading whitespace to pad numbers to the same length.
> 1
> 12
> 1930272
> etc
>
> The xml file has three types of xml items in it that I care about,
> foos, bars and bazes. Each item has an ID associated with it.
>
> Both files change over time, so I'll be running this script
> periodically. Or rather, cron will be running this script
> periodically. The files are of manageable size, though the xml file
> can be large-ish, up to several GB at times. The IDs file shouldn't
> exceed 100,000 items / lines.
>
> My desire is to create a summary so that for each of foo, bar and baz,
> I'll count items that match the ID list and items that don't.
>
> Simplifying assumption: in the xml file the interesting items have a
> nice xml opening tag that includes the ID.
>
> <foo [... stuff] id="IDnumber" ... >
>
> I can imagine an unwieldy combination of greps that will do this, but
> that sounds, um, inelegant. And bad for the disks, and
> brute-force-ish. What would a real programmer do? I would imagine
> the answer is something along the lines of
>
> put the IDs file into a variable
> read a portion of the xml file
> for each line
> is the line interesting? foo, bar, baz
> is the ID in the ID file or not
> update the counters in a smart way
> next line
> output totals
>
> Helpful pointers?
>
> Best regards,
> Richard
From the top of my head,
1. Clean up the ID file by removing everything that is not an "IDnumber"
(comment lines and the leading padding).
sed -e '/^#/d' -e 's/^ */id="/' -e 's/$/"/'
So, ID file will contain
id="1"
id="12"
id="1930272"
2. Clean up the XML file by reformatting it so that each
<foo ... id="IDnumber" ...>
is on a single line, and only one per line. Say, something like
tr '\n' ' ' | sed -e 's/<foo [^>]*>/\n&\n/g' | grep '^<foo '
So, XML file will contain
<foo ... id="1" ... >
<foo ... id="12" ... >
<foo ... id="1930272" ... >
3. Do grep.
fgrep -f ID_file XML_file
will give you all the "foo" nodes with IDs listed in the ID file.
fgrep -v -f ID_file XML_file
will give you the inverse, or subtract it from the total line count.
4. Rinse, and repeat for the "bar" and "baz" nodes.
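Putting steps 1 to 4 together, a rough sketch might look like the
following. The function names are made up for illustration, GNU sed is
assumed (for the \n in the replacement text), and grep -F is used as the
equivalent of fgrep. It also assumes the attribute is always written
exactly as id="N" with double quotes, and that no other attribute in the
tag ends in id="N".

```shell
#!/bin/sh
# Sketch only: make_patterns, split_tags and count_tag are hypothetical names.

# Step 1: turn the ID list into fixed-string patterns of the form id="N".
# Comment lines and blank lines are dropped, padding whitespace is stripped.
make_patterns() {
    sed -e '/^[[:space:]]*#/d' -e '/^[[:space:]]*$/d' \
        -e 's/^[[:space:]]*/id="/' -e 's/[[:space:]]*$/"/' "$1"
}

# Step 2: print every opening <tag ...> of the XML file on its own line.
# $1 is the tag name, $2 is the XML file.
split_tags() {
    tr '\n' ' ' < "$2" | sed -e "s/<$1 [^>]*>/\n&\n/g" | grep "^<$1 "
}

# Steps 3 and 4: print "tag matched unmatched" for one tag name.
# $1 is the tag name, $2 is the ID file, $3 is the XML file.
count_tag() {
    pat=$(mktemp) && tags=$(mktemp) || return 1
    make_patterns "$2" > "$pat"
    split_tags "$1" "$3" > "$tags" || true     # no such tag is not an error
    total=$(wc -l < "$tags")
    matched=$(grep -c -F -f "$pat" "$tags" || true)
    echo "$1 $matched $((total - matched))"
    rm -f "$pat" "$tags"
}

# Example use, with hypothetical file names:
#   for t in foo bar baz; do count_tag "$t" ids.txt data.xml; done
```

This reads the XML file once per tag name, and because the newlines are
removed first, sed sees the whole file as one very long line, which may
need a lot of memory on a file of several GB. Try it on a small sample
first.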
--
William