[kwlug-disc] BASH compare items in two files

Thu Nov 4 02:16:18 EDT 2010

On Tue, Nov 02, 2010 at 01:47:26PM -0400, Richard Weait wrote:
> Dear all.
> 
> Every year I try again.  I write a shell script of one description or
> another.  Every year it ends the same way.  A terrible script and
> tears and recriminations for everybody.  This year I'm trying it a
> little differently.  This year, I'm asking for guidance in advance.
> 
> Here is the problem.
> 
> I am given two files, one is an xml file and the other is a list of IDs.
> 
> The ID file is a list of IDs, one per line.  Not all possible IDs are
> included in the file.
> 
> # part of the IDs file
> # includes comments
> # the real file includes leading whitespace to pad numbers to the same length.
> 1
> 12
> 1930272
> etc
> 
> The xml file has three types of xml items in it that I care about,
> foos, bars and bazes.  Each item has an ID associated with it.
> 
> Both files change over time, so I'll be running this script
> periodically.  Or rather, cron will be running this script
> periodically.  The files are of manageable size, though the xml file
> can be large-ish, up to several GB at times.  The IDs file shouldn't
> exceed 100,000 items / lines.
> 
> My desire is to create a summary so that for each of foo, bar and baz,
> I'll count items that match the ID list and items that don't.
> 
> Simplifying assumption: in the xml file the interesting items have a
> nice xml opening tag that includes the ID.
> 
> <foo [... stuff] id="IDnumber" ... >
> 
> I can imagine an unwieldy combination of greps that will do this, but
> that sounds, um, inelegant.  And bad for the disks, and
> brute-force-ish.  What would a real programmer do?  I would imagine
> the answer is something along the lines of
> 
> put the IDs file into a variable
> read a portion of the xml file
> for each line
>   is the line interesting? foo, bar, baz
>     is the ID in the ID file or not
>       update the counters in a smart way
> next line
> output totals
> 
> Helpful pointers?
> 
> Best regards,
> Richard

From the top of my head,

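1. Clean up the ID file by removing everything that is not an "IDnumber"
    (comments, blank lines, leading whitespace) and wrapping each ID.
	sed -e '/^[[:space:]]*#/d' -e '/^[[:space:]]*$/d' \
	    -e 's/^[[:space:]]*/id="/' -e 's/$/"/'
    So, the ID file will contain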
	id="1"
	id="12"
	id="1930272"

2. Clean up the XML file by reformatting it so that each
	<foo ... id="IDnumber" ...>
    tag is on a single line, one per line, and nothing else is left.
    Say, something like (GNU sed, for the \n in the replacement)
	tr '\n' ' ' | sed -e 's/<foo [^>]*>/\n&\n/g' | grep '^<foo '
    So, the XML file will contain
	<foo ... id="1" ... >
	<foo ... id="12" ... >
	<foo ... id="1930272" ... >

3. Do grep.
	fgrep -f ID_file XML_file
    will give you all the "foo" nodes with IDs listed in the ID file.
	fgrep -v -f ID_file XML_file
    will give you the inverse, or subtract the first count from the total
    line count.  Add -c to either command to get a count instead of lines.

4. Rinse and repeat for the "bar" and "baz" nodes.
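
Put together, steps 1 to 4 might look like the sketch below.  It is a
sketch, not a tested tool: the function names (make_patterns,
extract_tags, count_tag) and the output format are made up for
illustration.  It also makes two assumptions that go beyond the steps
above.  The patterns carry a leading space, so that an attribute such as
uid="1" does not count as a hit for ID 1.  And step 2 is done by turning
every ">" into a newline with tr rather than joining the whole file, so a
multi-GB file never becomes one giant line for sed to hold in memory.

```sh
#!/bin/sh

# Step 1: IDs file -> fixed-string patterns such as ' id="12"'.
# Drops comment lines, blank lines and the whitespace padding.
make_patterns() {   # usage: make_patterns IDS_FILE > PATTERNS_FILE
    sed -e '/^[[:space:]]*#/d' -e '/^[[:space:]]*$/d' \
        -e 's/^[[:space:]]*/ id="/' -e 's/[[:space:]]*$/"/' "$1"
}

# Step 2: print one opening tag of type TAG per line (without the
# closing ">").  Newlines and tabs become spaces, every ">" ends a line.
extract_tags() {    # usage: extract_tags TAG < XML_FILE
    tr '\n\t>' '  \n' | sed -n "s|.*\(<$1 .*\)|\1|p"
}

# Steps 3 and 4: count the TAG items whose id is, or is not, listed.
count_tag() {       # usage: count_tag TAG PATTERNS_FILE XML_FILE
    tags=$(mktemp) || return 1
    extract_tags "$1" < "$3" > "$tags"
    total=$(wc -l < "$tags")
    matched=$(grep -F -c -f "$2" "$tags")
    rm -f "$tags"
    echo "$1 matched=$matched unmatched=$((total - matched))"
}
```

With hypothetical file names ids.txt and data.xml, a run would be
	make_patterns ids.txt > patterns.txt
	for t in foo bar baz; do count_tag "$t" patterns.txt data.xml; done
Note that this reads the XML file once per tag type, so three times in all.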

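The outline in the quoted message (load the IDs, walk the XML once,
update counters, print totals) can also be done in a single pass with
awk.  The following is only a sketch under assumptions: the function
name summarize and the output format are invented here, IDs are taken to
be plain digits in an attribute written exactly as id="...", and ">" is
used as the record separator so that each record holds at most one tag.

```sh
#!/bin/sh

summarize() {   # usage: summarize IDS_FILE XML_FILE
    awk -v ids="$1" '
        BEGIN {
            # Load the IDs into an array, dropping padding, comments
            # and blank lines.
            while ((getline line < ids) > 0) {
                gsub(/[ \t\r]/, "", line)
                if (line != "" && line !~ /^#/) wanted[line] = 1
            }
            close(ids)
            RS = ">"    # from here on, one record per tag
        }
        {
            # Is this record an opening foo, bar or baz tag?
            if (!match($0, /<(foo|bar|baz)[ \t\n]/)) next
            name = substr($0, RSTART + 1, RLENGTH - 2)
            tag = substr($0, RSTART)
            # Pull the number out of id="...".
            if (!match(tag, /[ \t\n]id="[0-9]+"/)) next
            id = substr(tag, RSTART + 5, RLENGTH - 6)
            if (id in wanted) hit[name]++; else miss[name]++
        }
        END {
            n = split("foo bar baz", names, " ")
            for (i = 1; i <= n; i++)
                printf "%s matched=%d unmatched=%d\n", names[i], hit[names[i]], miss[names[i]]
        }
    ' "$2"
}
```

The XML is read once and only the ID list (at most about 100,000
entries) is kept in memory, so this should suit a cron job better than
repeated greps.  It is still pattern matching, not a real XML parser.
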
-- 
William