[kwlug-disc] Image Comparison

Chris Irwin chris at chrisirwin.ca
Thu Oct 15 03:48:10 EDT 2009


On Tue, 2009-08-11 at 20:32 -0400, Chris Irwin wrote:
> Does anybody know of any way of mass comparing jpeg files by image
> content rather than a file sum or name? I've got two sets of several
> thousand images I need to sort through. Basically I want to find unique
> files in directories A & B, deleting duplicates from B.

I figured I'd ping back to the list regarding my status on this. I've
decided to merge a few todo list items together (why stress yourself out
trying to learn only *one* thing?) and have started a project to tackle
my jpg problem.

Basically, I created a small Python utility that scans over the provided
directories, pulls the image data out of each file, and calculates a sha1
sum of it. At the end, it outputs a short (if you're lucky) list of files
that have exact duplicate content, regardless of differences in Comment
tags or EXIF. It also outputs each file's actual sha1 sum, and with -e
it lists only exact file matches.
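
For anyone curious what "pulls the image data out of the file" could look
like, here is a rough sketch. It is not the code from the repository: the
function name content_sha1 is made up for illustration, it is written for
Python 3, and it assumes that "image data" means everything in the JPEG
stream except the APPn segments (where EXIF lives) and the COM (Comment)
segments.

```python
import hashlib
import struct


def content_sha1(data):
    """Return a SHA-1 hex digest of JPEG bytes, ignoring metadata segments.

    Hypothetical helper: APPn (0xE0-0xEF) and COM (0xFE) segments are
    skipped; every other segment and the scan data are hashed.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    digest = hashlib.sha1()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:
            # Fill byte before a marker; step over it.
            pos += 1
            continue
        if marker == 0xDA:
            # Start of Scan: hash the rest of the stream as-is.
            digest.update(data[pos:])
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            raise ValueError("bad segment length at offset %d" % pos)
        end = pos + 2 + length
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            digest.update(data[pos:end])
        pos = end
    return digest.hexdigest()
```

Reading a file in binary mode and passing its bytes to content_sha1()
gives the content sum, while hashlib.sha1() over the same bytes gives the
plain file sum. Two files that differ only in a Comment or EXIF segment
should then share a content sum but not a file sum.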

It is still rather early: it doesn't yet remove duplicates, has no diff
or merge functionality, and does not cache its calculations, so it is
really only worthwhile on a small set of images.

The planned workflow would be to delete exact file duplicates (same file
sum), and have some sort of diff/merge functionality for the content
duplicates.
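
As a sketch of how that split might work (again not the repository code;
the classify function and the record layout are invented for
illustration), assume each file has already been reduced to a
(path, file sum, content sum) tuple:

```python
from collections import defaultdict


def classify(records):
    """Split duplicates into exact-file groups and content-only groups.

    records is an iterable of (path, file_sum, content_sum) tuples; this
    layout is an assumption for the sketch. Returns a pair
    (exact_groups, content_groups), each a list of sorted path lists.
    """
    by_content = defaultdict(list)
    for path, file_sum, content_sum in records:
        by_content[content_sum].append((path, file_sum))

    exact_groups = []
    content_groups = []
    for members in by_content.values():
        if len(members) < 2:
            continue  # unique image, nothing to do
        by_file = defaultdict(list)
        for path, file_sum in members:
            by_file[file_sum].append(path)
        # Same file sum: byte-for-byte copies, safe to delete all but one.
        for paths in by_file.values():
            if len(paths) > 1:
                exact_groups.append(sorted(paths))
        # Same content, different file sums: candidates for diff/merge.
        # One representative path is kept per distinct file sum.
        if len(by_file) > 1:
            content_groups.append(
                sorted(min(paths) for paths in by_file.values())
            )
    return exact_groups, content_groups
```

The first list holds files that could be deleted outright; the second
holds files whose metadata differs and would need a diff/merge step
before anything is removed.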

If anybody more familiar with Python than I am has a few minutes and
could take a look at it for the purpose of feedback, I would really
appreciate it. This is the first time I've actually *really* used
Python; I've only written patches before, so I may have done things
wrong. I've only tested it in Python 2.6, so I know there will be
issues with Python 3 still to tackle.

The code is on gitorious, and is under GPLv3:
        http://gitorious.org/imagecompare

Included are four small images to test with. The first two are identical
copies, the third is identical in content to the first two but has a
JPEG Comment field that changes the file sum. The fourth is a unique
photo.

-- 
Chris Irwin
e:  chris at chrisirwin.ca
w: http://chrisirwin.ca