If you want more background on ways to do this, the forensics community has discussed the problem.<div><br></div><div>Here is a paper discussing it: <a href="http://dx.doi.org/10.1016/j.diin.2006.06.015">http://dx.doi.org/10.1016/j.diin.2006.06.015</a></div>

<div><br></div><div>Searching for "fuzzy hashing" will also turn up more information.</div><div><br></div><div>ssdeep is a program written for the forensics community that you may want to check out for file comparison.</div>

<div><br></div><div><br></div><div>Hope this helps!</div><div>Brad<br><br><div class="gmail_quote">On Thu, Oct 15, 2009 at 3:48 AM, Chris Irwin <span dir="ltr">&lt;<a href="mailto:chris@chrisirwin.ca">chris@chrisirwin.ca</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">On Tue, 2009-08-11 at 20:32 -0400, Chris Irwin wrote:<br>

> Does anybody know of any way of mass comparing jpeg files by image<br>

> content rather than a file sum or name? I've got two sets of several<br>

> thousand images I need to sort through. Basically I want to find unique<br>

> files in directories A & B, deleting duplicates from B.<br>

<br>

I figured I'd ping back to the list regarding my status on this. I've<br>

decided to merge a few todo list items together (why stress yourself out<br>

trying to only learn *one* thing) and have started a project to tackle<br>

my jpg problem.<br>

<br>

Basically, I created a small python utility that scans over provided<br>

directories, pulls the image data out of the file, and calculates a sha1<br>

sum. At the end, it outputs a short (if you're lucky) list of files that<br>

have exact duplicate content, regardless of differences in Comment tags<br>

or EXIF. It also outputs the file's actual sha1 sum, and with -e<br>

it will only list exact file matches.<br>

<br>
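To make the idea above concrete, here is a rough standard-library sketch of hashing JPEG content while ignoring metadata. It is not the imagecompare code: it is written for Python 3, the names content_sha1 and find_content_duplicates are made up for illustration, and it assumes that "image data" can be approximated by skipping the APPn (EXIF/JFIF) and COM (Comment) segments and hashing everything else. It does not handle fill bytes between segments or other unusual files.<br>

```python
import hashlib
import os
import struct
import sys
from collections import defaultdict

# Marker bytes that carry metadata rather than picture data:
# APP0..APP15 (0xE0-0xEF, where EXIF and JFIF live) and COM (0xFE).
METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}
SOS = 0xDA  # Start of Scan: the compressed image data follows this marker


def content_sha1(data):
    """Return the SHA-1 hex digest of JPEG bytes with metadata segments skipped."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    digest = hashlib.sha1()
    pos = 2
    while len(data) - pos >= 4:
        if data[pos] != 0xFF:
            raise ValueError("expected a segment marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == SOS:
            # Hash the scan header and everything after it, up to end of file.
            digest.update(data[pos:])
            return digest.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            digest.update(data[pos:end])
        pos = end
    raise ValueError("no Start of Scan segment found")


def find_content_duplicates(directories):
    """Return lists of JPEG paths whose content digests match."""
    groups = defaultdict(list)
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in sorted(files):
                if not name.lower().endswith((".jpg", ".jpeg")):
                    continue
                path = os.path.join(root, name)
                with open(path, "rb") as handle:
                    try:
                        groups[content_sha1(handle.read())].append(path)
                    except ValueError:
                        continue  # skip files that do not parse as JPEG
    return [paths for paths in groups.values() if len(paths) > 1]


if __name__ == "__main__":
    # Print one line per group of content duplicates.
    for paths in find_content_duplicates(sys.argv[1:]):
        print(" ".join(paths))
```

Saved under a hypothetical name such as dupes.py, it would be run as "python3 dupes.py A B". Because it hashes the compressed data rather than decoded pixels, it only matches files whose image bytes are identical, so a re-saved or resized copy will not match.<br>

<br>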

It is still rather early as it doesn't yet remove duplicates, have any<br>

diff or merge functionality, and does not cache the calculations, so it<br>

is really only worthwhile on a small set of images.<br>

<br>

The planned workflow would be to delete exact file duplicates (same file<br>

sum), and have some sort of diff/merge functionality for the content<br>

duplicates.<br>

<br>

If anybody more familiar with Python than I had a few minutes and could<br>

take a look at it for the purpose of feedback, I would really<br>

appreciate it. This is the first time I've actually *really* used<br>

python -- I've only written patches before. I may have done things<br>

wrong. I've only tested it in Python 2.6, so I know there will be<br>

issues with Python 3 still to tackle.<br>

<br>

The code is on Gitorious, and is under GPLv3:<br>

        <a href="http://gitorious.org/imagecompare" target="_blank">http://gitorious.org/imagecompare</a><br>

<br>

Included are four small images to test with. The first two are identical<br>

copies, the third is identical in content to the first two but has a<br>

JPEG Comment field that changes the file sum. The fourth is a unique<br>

photo.<br>

<font color="#888888"><br>

--<br>

Chris Irwin<br>

e:  <a href="mailto:chris@chrisirwin.ca">chris@chrisirwin.ca</a><br>

w: <a href="http://chrisirwin.ca" target="_blank">http://chrisirwin.ca</a><br>

</font><br>

<br></blockquote></div><br><br clear="all"><br>-- <br><a href="http://www.google.com/profiles/bbierman42">http://www.google.com/profiles/bbierman42</a><br>

</div>