If you are looking for more information on ways of doing this, there has been some discussion in the forensics community.<div><br></div><div>Here is a paper discussing it: <a href="http://dx.doi.org/10.1016/j.diin.2006.06.015">http://dx.doi.org/10.1016/j.diin.2006.06.015</a></div>
<div><br></div><div>Searching for "fuzzy hashing" will also turn up useful information.</div><div><br></div><div>ssdeep is a program written for the forensics community that you may want to check out for file comparison.</div>
<div><br></div><div><br></div><div>Hope this helps!</div><div>Brad<br><br><div class="gmail_quote">On Thu, Oct 15, 2009 at 3:48 AM, Chris Irwin <span dir="ltr"><<a href="mailto:chris@chrisirwin.ca">chris@chrisirwin.ca</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">On Tue, 2009-08-11 at 20:32 -0400, Chris Irwin wrote:<br>
> Does anybody know of any way of mass comparing jpeg files by image<br>
> content rather than a file sum or name? I've got two sets of several<br>
> thousand images I need to sort through. Basically I want to find unique<br>
> files in directories A & B, deleting duplicates from B.<br>
<br>
I figured I'd ping back to the list regarding my status on this. I've<br>
decided to merge a few todo-list items together (why stress yourself out<br>
trying to learn only *one* thing?) and have started a project to tackle<br>
my jpg problem.<br>
<br>
Basically, I created a small python utility that scans over provided<br>
directories, pulls the image data out of the file, and calculates a sha1<br>
sum. At the end, it outputs a short (if you're lucky) list of files that<br>
have exact duplicate content, regardless of differences in Comment tags<br>
or EXIF. It also outputs each file's actual sha1 sum, and with -e it<br>
will list only exact file matches.<br>
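The core idea can be sketched with the standard library alone. This is a toy illustration, not the actual imagecompare code: the real utility would use an imaging library (e.g. PIL) to pull the decoded pixel data out of the JPEG, whereas here a made-up marker byte pair stands in for the Comment/EXIF sections:<br>
<br>
```python
import hashlib

# Hypothetical marker separating "pixel" payload from trailing "metadata".
SEP = b"\xff\xfe"

def content_sha1(data: bytes) -> str:
    """Hash only the payload before the (toy) metadata marker.

    In the real tool the payload would be the decoded image data;
    the marker here just stands in for Comment/EXIF sections that
    change the file sum without changing the picture.
    """
    payload = data.split(SEP, 1)[0]
    return hashlib.sha1(payload).hexdigest()

# Two "images" with identical pixels but different comments:
img1 = b"pixels..." + SEP + b"comment one"
img2 = b"pixels..." + SEP + b"another comment"

# Whole-file sums differ, but content sums match:
assert hashlib.sha1(img1).hexdigest() != hashlib.sha1(img2).hexdigest()
assert content_sha1(img1) == content_sha1(img2)
```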
<br>
It is still rather early: it doesn't yet remove duplicates, offer any<br>
diff or merge functionality, or cache the calculations, so it is really<br>
only worthwhile on a small set of images.<br>
<br>
The planned workflow would be to delete exact file duplicates (same file<br>
sum), and have some sort of diff/merge functionality for the content<br>
duplicates.<br>
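The dedup half of that workflow amounts to grouping files by their content sum and flagging any group larger than one. A minimal sketch (file names and data are illustrative, not from the actual project):<br>
<br>
```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names by the sha1 of their content.

    Groups of size > 1 are duplicate sets; the workflow would keep
    one copy (say, the one in directory A) and delete the rest.
    """
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha1(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

files = {
    "A/one.jpg": b"same bytes",
    "B/one-copy.jpg": b"same bytes",
    "B/unique.jpg": b"different",
}
print(find_duplicates(files))  # [['A/one.jpg', 'B/one-copy.jpg']]
```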
<br>
If anybody more familiar with Python than I am has a few minutes to<br>
take a look and offer feedback, I would really appreciate it. This is<br>
the first time I've actually *really* used<br>
python -- I've only written patches before. I may have done things<br>
wrong. I've only tested it in Python 2.6, so I know there will be<br>
issues with Python3 still to tackle.<br>
<br>
The code is on Gitorious under the GPLv3:<br>
<a href="http://gitorious.org/imagecompare" target="_blank">http://gitorious.org/imagecompare</a><br>
<br>
Included are four small images to test with. The first two are identical<br>
copies, the third is identical in content to the first two but has a<br>
JPEG Comment field that changes the file sum. The fourth is a unique<br>
photo.<br>
<font color="#888888"><br>
--<br>
Chris Irwin<br>
e: <a href="mailto:chris@chrisirwin.ca">chris@chrisirwin.ca</a><br>
w: <a href="http://chrisirwin.ca" target="_blank">http://chrisirwin.ca</a><br>
</font><br>_______________________________________________<br>
kwlug-disc_kwlug.org mailing list<br>
kwlug-disc_kwlug.org@<a href="http://kwlug.org" target="_blank">kwlug.org</a><br>
<a href="http://astoria.ccjclearline.com/mailman/listinfo/kwlug-disc_kwlug.org" target="_blank">http://astoria.ccjclearline.com/mailman/listinfo/kwlug-disc_kwlug.org</a><br>
<br></blockquote></div><br><br clear="all"><br>-- <br><a href="http://www.google.com/profiles/bbierman42">http://www.google.com/profiles/bbierman42</a><br>
</div>