views:

238

answers:

4

I've got a bunch of PNG images, and I'm looking for a way to identify duplicates. By duplicates I mean, specifically, two PNG files whose uncompressed image data are identical, not necessarily whose files are identical. This means I can't do something simple like compare CRC hash values.

I figure this can actually be done reliably since PNGs use lossless compression, but I'm worried about speed. I know I can winnow things down a little by testing for equal dimensions first, but when it comes time to actually compare the images against each other, is there any way to do it reasonably efficiently? (ie. faster than the "double-for-loop checking pixel values against each other" brute-force method?)

+13  A: 
  1. filter by identical image size (width & height)
  2. open file
  3. hash uncompressed contents (md5 would do probably)
  4. store hash

  5. compare hashes to find identical ones

ChristopheD
Why sort by identical size?
zneak
ChristopheD
I think this is a solid answer. After the filter, some quick pixel sampling / compares on a few random points might also weed out some images.
SB
Oh, okay. I thought it was "size" as in "file size".
zneak
A: 

I suppose that you might be able to adjust the size of the data being read, even though the storage format is completely different. So, if your image is 24-bit, you can possibly use a 32-bit or 64-bit (if 64-bit compiled) datatype and keep packing the data in two variables of these types from both images and compare the two for equality. This might speed stuff up a bit :)

Chris Dennett
+6  A: 

Instead of looping through all pixels in order to check equality, it might be worth while starting from the middle and working your way outwards. Most pictures have the subject in the middle meaning more feature data is located here. Essentially it will be lots quicker to find out if two pictures are different this way.

Chris
+3  A: 

Unless you're expecting a lot of duplicates, on average you're not going to compare many pixels before determining that 2 files are different. Especially if each pixel you test is located far from pixels already tested. This will help with e.g. line art files that have the same background color.

Also, how accurate do you have to be? For example, if 10 pixels tested in this way are the same, can you safely conclude that the images are identical? 10 RGB pixels = 240 bits, so the false match rate with random images should be 1 in 2^240 = 1 in 10^72!

They're not random images, and I'm expecting a pretty high proportion of duplicates. But the idea of testing random values as a filtering technique is a good one.
Mason Wheeler