I'm putting together a script to find and remove duplicates in a large library of images. At the moment I'm doing a two-pass filter: first finding files of the same size, then computing a SHA-256 over a 10240-byte piece of each file to fingerprint the files that share a size (code here).

It works well, but I'm guessing there are probably checksums built into the JPEG format that I could use instead of computing the SHA-256.
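For readers following along, the two-pass filter described above could be sketched roughly like this. This is not the linked code. It assumes the hashed piece is the first 10240 bytes of each file, and the names `partial_sha256` and `find_candidate_duplicates` are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 10240  # bytes hashed per file (assumed to be the leading bytes)


def partial_sha256(path, chunk_size=CHUNK_SIZE):
    """Return the SHA-256 hex digest of the first chunk_size bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(chunk_size)).hexdigest()


def find_candidate_duplicates(paths):
    """Group files by size, then by partial hash; return groups of 2+ files."""
    # Pass 1: bucket by file size, which needs no file reads at all.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        # Pass 2: bucket the same-sized files by their partial hash.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[partial_sha256(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Note that files in the same group are only candidates: two files of equal size that differ after the hashed piece would still land together, so a full comparison is needed before deleting anything.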

Does anyone know if there are checksums or other components that could act as checksums / fingerprints? If so, is there an efficient way to access them?

+1  A: 

It's been a while since I've dug into the IJG library, but I don't think there's an easy class member or function call you can use there to check for some type of fingerprint. You could use the built-in EXIF tags if you can control the encoding of the images...

jdt141
+3  A: 

I don't think the JPEG specification includes any kind of checksum in the way you're describing.

A JPEG can contain a thumbnail as part of its EXIF metadata, though. It's not a perfect indicator, since it's possible for two different images to have the same thumbnail. There's at least one documented case of a thumbnail not being replaced after the image had undergone substantial modifications, said thumbnail revealing much more than the publisher had intended.

Mark Ransom
A: 

Hi. As far as I know, the JPEG standard (ITU-T T.81) has no field or syntax element that holds a checksum or anything similar for the whole compressed JPEG file, unless a customised application puts such a field in an application segment or in the metadata segments the standard provides for. So what you are doing now is one solution for your purpose. Another would be some kind of application wrapper that calls a binary file compare utility (such as Beyond Compare, or even the Windows command fc /b), checks the result of that comparison, and takes whatever decision you want.
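If the script is in Python, the same byte-for-byte check can be done without shelling out to an external tool. A minimal sketch using the standard library's `filecmp` module (the wrapper name `are_identical` is just for illustration):

```python
import filecmp


def are_identical(path_a, path_b):
    """Return True if both files contain exactly the same bytes."""
    # shallow=False forces a comparison of file contents rather than
    # trusting the os.stat() signature (type, size, modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This is slower than comparing fingerprints, so it probably makes most sense as a final confirmation on files that already matched by size and partial hash.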

-AD

goldenmean
A: 

One approach is to reduce all images to a fixed size and store the result as a thumbnail. Comparing these same-sized thumbnails then tells you how likely two images are to be duplicates, which is useful if you have resized or cropped (but not heavily cropped) images and want to find those 'duplicates' too.
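A rough sketch of the fixed-size thumbnail comparison, under some loud assumptions: the images have already been decoded to grayscale pixel grids (lists of rows of 0-255 values, which in practice would come from an imaging library), the thumbnail is 8x8, and the function names are hypothetical. It only illustrates the resize-then-compare idea and does nothing special for cropped images.

```python
def shrink(pixels, size=8):
    """Box-average a grayscale image (list of rows) down to size x size."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the thumbnail")
    thumb = []
    for ty in range(size):
        # Source rows covered by this thumbnail row.
        y0, y1 = ty * height // size, (ty + 1) * height // size
        row = []
        for tx in range(size):
            # Source columns covered by this thumbnail column.
            x0, x1 = tx * width // size, (tx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        thumb.append(row)
    return thumb


def thumbnail_distance(pixels_a, pixels_b, size=8):
    """Mean absolute difference between two thumbnails (0 means identical)."""
    a, b = shrink(pixels_a, size), shrink(pixels_b, size)
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (size * size)
```

A small distance suggests a probable duplicate. Where to put the cutoff depends on the library and would need tuning against known duplicates.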

graham.reeds