[kwlug-disc] Image Comparison
bbierman42 at gmail.com
Thu Oct 15 20:04:33 EDT 2009
If you are looking for more information on ways to do this, the forensics
community has discussed the problem.
Here is a paper discussing it: http://dx.doi.org/10.1016/j.diin.2006.06.015
If you search for "fuzzy hashing" you will also find some information.
ssdeep is a program written for the forensics community that you may want
to check out for file comparison.
Hope this helps!
On Thu, Oct 15, 2009 at 3:48 AM, Chris Irwin <chris at chrisirwin.ca> wrote:
> On Tue, 2009-08-11 at 20:32 -0400, Chris Irwin wrote:
> > Does anybody know of any way of mass comparing jpeg files by image
> > content rather than a file sum or name? I've got two sets of several
> > thousand images I need to sort through. Basically I want to find unique
> > files in directories A & B, deleting duplicates from B.
> I figured I'd ping back to the list regarding my status on this. I've
> decided to merge a few todo list items together (why stress yourself out
> trying to only learn *one* thing) and have started a project to tackle
> my jpg problem.
> Basically, I created a small python utility that scans over provided
> directories, pulls the image data out of the file, and calculates a sha1
> sum. At the end, it outputs a short (if you're lucky) list of files that
> have exact duplicate content, regardless of differences in Comment tags
> or EXIF. It also outputs the file's actual sha1 sum, and with -e
> it will only list exact file matches.
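For anyone following along, here is a rough sketch of that idea, not Chris's actual code. It assumes Python 3 and well-formed JPEG files, and it treats everything from the first start-of-scan marker onward as image data, which is a simplification. The names jpeg_content_sha1 and group_by_content are made up for illustration.

```python
import hashlib
import struct
from collections import defaultdict

# Segments that carry metadata rather than image data:
# APP0..APP15 (0xE0-0xEF, where JFIF and EXIF live) and COM (0xFE).
METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}


def jpeg_content_sha1(data):
    """Return the SHA-1 hex digest of JPEG bytes with metadata segments skipped."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    digest = hashlib.sha1()
    pos = 2  # start right after the SOI marker
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: assume the rest of the file is image data.
            digest.update(data[pos:])
            break
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            digest.update(data[pos:end])
        pos = end
    return digest.hexdigest()


def group_by_content(paths):
    """Map each content digest to the list of files that share it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as handle:
            groups[jpeg_content_sha1(handle.read())].append(path)
    return groups
```

Any digest in group_by_content() that maps to more than one path is a set of files with the same image data, even if their Comment or EXIF segments differ.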
> It is still rather early as it doesn't yet remove duplicates, have any
> diff or merge functionality, and does not cache the calculations, so it
> is really only worthwhile on a small set of images.
> The planned workflow would be to delete exact file duplicates (same file
> sum), and have some sort of diff/merge functionality for the content
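The exact-duplicate step of that workflow could look roughly like the sketch below. It assumes Python 3 and two directory roots, A and B, as in the original question; the function names are hypothetical. It only reports files in B whose whole-file sha1 sum also appears in A, and deliberately deletes nothing.

```python
import hashlib
import os


def file_sha1(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a whole file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_files(root):
    """Yield the path of every regular file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def exact_duplicates_in_b(dir_a, dir_b):
    """List files under dir_b whose file sum matches some file under dir_a."""
    sums_in_a = {file_sha1(path) for path in iter_files(dir_a)}
    return sorted(p for p in iter_files(dir_b) if file_sha1(p) in sums_in_a)
```

Reviewing the returned list before removing anything keeps the step reversible; files that differ only in metadata will not show up here and would still need the content comparison.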
> If anybody more familiar with Python than I am has a few minutes and
> could take a look at it and give feedback, I would really
> appreciate it. This is the first time I've actually *really* used
> python -- I've only written patches before. I may have done things
> wrong. I've only tested it in Python 2.6, so I know there will be
> issues with Python3 still to tackle.
> The code is on gitorious, and is under GPLv3:
> Included are four small images to test with. The first two are identical
> copies, the third is identical in content to the first two but has a
> JPEG Comment field that changes the file sum. The fourth is a unique
> Chris Irwin
> e: chris at chrisirwin.ca
> w: http://chrisirwin.ca