[kwlug-disc] BASH compare items in two files

Tue Nov 2 13:47:26 EDT 2010

Dear all.

Every year I try again.  I write a shell script of one description or
another.  Every year it ends the same way.  A terrible script and
tears and recriminations for everybody.  This year I'm trying it a
little differently.  This year, I'm asking for guidance in advance.

Here is the problem.

I am given two files: one is an xml file and the other is a list of IDs.

The ID file is a list of IDs, one per line.  Not all possible IDs are
included in the file.

# part of the IDs file
# includes comments
# the real file includes leading whitespace to pad numbers to the same length.
1
12
1930272
etc
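
To make that format concrete, here is a rough sketch of how the IDs
file might be normalized before any comparison.  The function name
clean_ids is made up for illustration, and I am assuming a "#" starts a
comment that runs to the end of the line.

```shell
# Hypothetical helper: print one bare ID per line from an IDs file.
# Assumes '#' begins a comment that runs to the end of the line.
clean_ids() {
    sed -e 's/#.*//' \
        -e 's/^[ 	]*//' \
        -e 's/[ 	]*$//' \
        -e '/^$/d' "$1"
}
```

The bracket expressions contain a space and a literal tab, so the
padding is stripped from both ends and blank lines are dropped.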

The xml file has three types of xml items in it that I care about,
foos, bars and bazes.  Each item has an ID associated with it.

Both files change over time, so I'll be running this script
periodically.  Or rather, cron will be running this script
periodically.  The files are of manageable size, though the xml file
can be large-ish, up to several GB at times.  The IDs file shouldn't
exceed 100,000 items / lines.

My desire is to create a summary so that for each of foo, bar and baz,
I'll count items that match the ID list and items that don't.

Simplifying assumption: in the xml file the interesting items have a
nice xml opening tag that includes the ID.

<foo [... stuff] id="IDnumber" ... >

I can imagine an unwieldy combination of greps that will do this, but
that sounds, um, inelegant.  And bad for the disks, and
brute-force-ish.  What would a real programmer do?  I would imagine
the answer is something along the lines of

put the IDs file into a variable
read a portion of the xml file
for each line
  is the line interesting? foo, bar, baz
    is the ID in the ID file or not
      update the counters in a smart way
next line
output totals
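
The outline above could be sketched roughly as follows, using awk from
a shell function.  This is only a guess at a workable shape, not a
tested solution.  The name count_ids is made up, and the sketch assumes
that each interesting opening tag sits on a single line, that there is
at most one such tag per line, and that the id attribute is written
with double quotes.

```shell
# Hypothetical sketch: count_ids IDS_FILE XML_FILE
# Assumes one interesting opening tag per line, with id="..." in double quotes.
count_ids() {
    awk '
        # First file: remember every ID, ignoring comments and padding.
        FILENAME == ARGV[1] {
            sub(/#.*/, "")
            gsub(/[ \t\r]/, "")
            if ($0 != "") ids[$0] = 1
            next
        }
        # Second file: look for an opening <foo, <bar or <baz tag with an id.
        match($0, /<(foo|bar|baz)([ \t][^>]*)?[ \t]id="[^"]*"/) {
            tag = substr($0, RSTART, RLENGTH)
            name = substr(tag, 2, 3)      # all three tag names are 3 letters
            id = tag
            sub(/.*[ \t]id="/, "", id)    # drop everything up to the value
            sub(/"$/, "", id)             # drop the closing quote
            if (id in ids) hit[name]++; else miss[name]++
        }
        # Output totals, one line per tag name.
        END {
            n = split("foo bar baz", names, " ")
            for (i = 1; i <= n; i++)
                printf "%s matched=%d unmatched=%d\n", names[i], hit[names[i]], miss[names[i]]
        }
    ' "$1" "$2"
}
```

It would be called as "count_ids ids.txt data.xml" (both file names
are placeholders).  The IDs are held in an awk array and the xml file
is read once, line by line, so the large file is never held in memory.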

Helpful pointers?

Best regards,
Richard