tags:
views: 123
answers: 5

I have over 10K files for products; the problem is that many of the images are duplicates.

If there is no image, there is a standard image that says 'no image'.

How can I detect if the image is this standard 'no image' image file?

Update: The image has a different name, but it is exactly the same image otherwise.

People are saying to use a hash, so would I do this?

im = cStringIO.StringIO(file.read())
img = im.open(im)
md5.md5(img)
+3  A: 

Assuming you are talking about the same images in terms of identical image data.

Compute the hash of the "no image" image and compare it to the hashes of the other images. If the hashes are the same, it is the same file.
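
For example, a minimal sketch using the hashlib module (the 'no_image.jpg' file and the 'products' directory are placeholder names):

import hashlib
import os

def file_md5(path, chunk_size=65536):
    # Hash the file in chunks so large files don't have to fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

no_image_hash = file_md5('no_image.jpg')

for fname in os.listdir('products'):
    path = os.path.join('products', fname)
    if file_md5(path) == no_image_hash:
        print('placeholder image: %s' % path)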

Felix Kling
This would also be a good way to detect duplicates elsewhere. Start computing hashes of the images, and for each image check whether its hash has already been seen. If it has, you have a duplicate. If not, add the hash to the database and move on.
Chris Thompson
@Felix: Actually, if Blankman is looking for duplicates of a particular file (as opposed to finding all sets of duplicates in the collection), hashes are counter-productive — see my answer.
Gilles
@Gilles: Interesting. Yeah, I know that you would have to read all the files completely, but I never said that this is the best or a fast approach ;) Gave you +1.
Felix Kling
So how do I do this hash on an image?
Blankman
@Blankman: Have a look at the hashlib module: http://docs.python.org/library/hashlib.html
Felix Kling
A: 

Hash them. Collisions are duplicates (at least, it's a mathematical impossibility that they aren't the same file).

amphetamachine
I assume you meant _"improbability"_, not "impossibility".
David Zaslavsky
You should _always_ consider the possibility of hash collisions. Multiply the **cost** of a collision by the **probability of a collision** to get the **expected cost**. Usually the expected cost is small because even if the cost is a million dollars, the probability of a collision is tiny. But baby photos etc. are irreplaceable, so maybe some extra effort is required sometimes ;)
gnibbler
@gnibbler This is why we keep backups.
amphetamachine
+2  A: 

If you're looking for exact duplicates of a particular image: load this image into memory, then loop over your image collection; skip any file that doesn't have the same size; compare the contents of the files that have the same size, stopping at the first difference.

Computing a hash in this situation is actually counter-productive because you'd have to read each file completely into memory (instead of being able to stop at the first difference) and perform a CPU-intensive task on it.
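
A rough sketch of that approach (the 'no_image.jpg' file and the 'products' directory are placeholder names):

import os

def is_same(path, ref, chunk_size=65536):
    # Compare a file against the in-memory reference data,
    # stopping at the first difference.
    with open(path, 'rb') as f:
        pos = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return pos == len(ref)
            if chunk != ref[pos:pos + len(chunk)]:
                return False
            pos += len(chunk)

ref = open('no_image.jpg', 'rb').read()

for fname in os.listdir('products'):
    path = os.path.join('products', fname)
    if os.path.getsize(path) == len(ref) and is_same(path, ref):
        print('exact duplicate: %s' % path)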

If there are several sets of duplicates, on the other hand, computing a hash of each file is better.

If you're also looking for visual near-duplicates, findimagedupes can help you.

Gilles
He can compute a hash and also save the image's size, then skip the images with a different size. It would be smart to test which takes more time: computing a hash or comparing two images byte by byte.
Jaka
It may seem like a waste of effort to compute all those hashes, but comparing N files to each other is O(N*N). With a sufficient number of files, the O(N) approach of calculating hashes and comparing them in a `set()` or `dict()` will be more efficient. Note that you don't need to hash the whole file; the first KB or so is likely to be just as useful as a first check.
gnibbler
A: 

As a side note, for images I find raster-data hashes to be far more effective than file hashes.

ImageMagick provides a reliable way to compute such hashes, and several Python bindings are available. This helps detect identical images that use different lossless compression or carry different metadata.
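
For illustration, a minimal sketch of the same idea using PIL/Pillow instead of ImageMagick (tobytes() is the Pillow name; older PIL calls it tostring()):

import hashlib
from PIL import Image

def pixel_md5(path):
    # Hash the decoded pixel data rather than the file bytes, so the
    # container format, compression settings and metadata don't matter.
    im = Image.open(path).convert('RGB')
    return hashlib.md5(im.tobytes()).hexdigest()

Two files that decode to the same pixels (say, the same picture saved as PNG and as an uncompressed TIFF, or with different EXIF data) then get the same digest even though their file hashes differ.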

Daniel Kluev
+1  A: 

I wrote a script for this a while back. First it scans all files, noting their sizes in a dictionary. You end up with:

images[some_size] = ['x/a.jpg', 'b/f.jpg', 'n/q.jpg']
images[some_other_size] = ['q/b.jpg']

Then, for each key (image size) where there's more than 1 element in the dictionary, I'd read some fixed amount of the file and do a hash. Something like:

import os
import md5
from collections import defaultdict

possible_dupes = [size for size in images if len(images[size]) > 1]
for size in possible_dupes:
    hashes = defaultdict(list)
    for fname in images[size]:
        m = md5.new()
        # hash only the first 10 KB as a cheap first-pass check
        m.update(file(fname, 'rb').read(10000))
        hashes[m.digest()].append(fname)
    for k in hashes:
        if len(hashes[k]) <= 1: continue
        # keep the first file, remove the rest
        for fname in hashes[k][1:]:
            os.remove(fname)

This is all off the top of my head, haven't tested the code, but you get the idea.

Parand
All Microsoft Bitmap files without RLE compression which have the same pixel dimensions will be the same size. As will XPMs with the same-length internal name, as will PNGs with no compression, as will Netpbm images... The list goes on and on. But I agree; checking the size will help to avoid meaningless collisions
amphetamachine