tags:
views: 123
answers: 5

I have over 10K files for products; the problem is that many of the images are duplicates.

If there is no image, there is a standard image that says 'no image'.

How can I detect if the image is this standard 'no image' image file?

Update: The image has a different name, but it is exactly the same image otherwise.

People are saying to use a hash, so would I do this?

im = cStringIO.StringIO(file.read())
img = im.open(im)
md5.md5(img)
+3  A: 

Assuming you are talking about the same images in terms of identical image data.

Compute the hash of the "no image" image and compare it to the hashes of the other images. If the hashes are the same, it is the same file.
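
For example, a minimal sketch using the hashlib module (the 'no_image.jpg' file and the 'products' directory are placeholder names):

import hashlib
import os

def file_md5(path, chunk_size=65536):
    # Hash the file in chunks so large files don't have to fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

no_image_hash = file_md5('no_image.jpg')

for fname in os.listdir('products'):
    path = os.path.join('products', fname)
    if file_md5(path) == no_image_hash:
        print('placeholder image: %s' % path)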

Felix Kling
This would also be a good way to detect duplicates elsewhere. Start computing hashes of the images, and for each image check whether its hash has already been seen. If it has, you have a duplicate. If not, add the hash to the database and move on.
Chris Thompson
@Felix: Actually, if Blankman is looking for duplicates of a particular file (as opposed to finding all sets of duplicates in the collection), hashes are counter-productive — see my answer.
Gilles
@Gilles: Interesting. Yeah, I know that you would have to read all the files completely, but I never said that this is the best or a fast approach ;) Gave you +1.
Felix Kling
So how do I do this hash on an image?
Blankman
@Blankman: Have a look at the hashlib module: http://docs.python.org/library/hashlib.html
Felix Kling
A: 

Hash them. Collisions are duplicates (at least, it's a mathematical impossibility that they aren't the same file).

amphetamachine
I assume you meant _"improbability"_, not "impossibility".
David Zaslavsky
You should _always_ consider the possibility of hash collisions. Multiply the **cost** of a collision by the **probability of a collision** to get the **expected cost**. Usually the expected cost is small because even if the cost is a million dollars, the probability of a collision is tiny. But baby photos etc. are irreplaceable, so maybe some extra effort is required sometimes ;)
gnibbler
@gnibbler This is why we keep backups.
amphetamachine
+2  A: 

If you're looking for exact duplicates of a particular image: load this image into memory, then loop over your image collection; skip any file that doesn't have the same size; compare the contents of the files that have the same size, stopping at the first difference.

Computing a hash in this situation is actually counter-productive because you'd have to read each file completely into memory (instead of being able to stop at the first difference) and perform a CPU-intensive task on it.
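
A rough sketch of that approach (the 'no_image.jpg' file and the 'products' directory are placeholder names):

import os

def is_same(path, ref, chunk_size=65536):
    # Compare a file against the in-memory reference data,
    # stopping at the first difference.
    with open(path, 'rb') as f:
        pos = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return pos == len(ref)
            if chunk != ref[pos:pos + len(chunk)]:
                return False
            pos += len(chunk)

ref = open('no_image.jpg', 'rb').read()

for fname in os.listdir('products'):
    path = os.path.join('products', fname)
    if os.path.getsize(path) == len(ref) and is_same(path, ref):
        print('exact duplicate: %s' % path)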

If there are several sets of duplicates, on the other hand, computing a hash of each file is better.

If you're also looking for visual near-duplicates, findimagedupes can help you.

Gilles
He can compute a hash and also save the image's size, then skip the images with a different size. It would be smart to test which takes more time: computing a hash or comparing two images byte by byte.
Jaka
It may seem like a waste of effort to compute all those hashes, but comparing N files to each other is O(N*N). With a sufficient number of files, the O(N) approach of calculating hashes and comparing them in a `set()` or `dict()` will be more efficient. Note that you don't need to hash the whole file; the first KB or so is likely to be just as useful as a first check.
gnibbler
A: 

As a side note, for images I find raster-data hashes to be far more effective than file hashes.

ImageMagick provides a reliable way to compute such hashes, and several Python bindings are available. This helps detect identical images that use different lossless compression or carry different metadata.
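
For illustration, a minimal sketch of the same idea using PIL/Pillow instead of ImageMagick (tobytes() is the Pillow name; older PIL calls it tostring()):

import hashlib
from PIL import Image

def pixel_md5(path):
    # Hash the decoded pixel data rather than the file bytes, so the
    # container format, compression settings and metadata don't matter.
    im = Image.open(path).convert('RGB')
    return hashlib.md5(im.tobytes()).hexdigest()

Two files that decode to the same pixels (say, the same picture saved as PNG and as an uncompressed TIFF, or with different EXIF data) then get the same digest even though their file hashes differ.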

Daniel Kluev
+1  A: 

I wrote a script for this a while back. First it scans all files, noting their sizes in a dictionary. You end up with:

images[some_size] = ['x/a.jpg', 'b/f.jpg', 'n/q.jpg']
images[some_other_size] = ['q/b.jpg']

Then, for each key (image size) where there's more than 1 element in the dictionary, I'd read some fixed amount of the file and do a hash. Something like:

import os
import md5
from collections import defaultdict

possible_dupes = [size for size in images if len(images[size]) > 1]
for size in possible_dupes:
    hashes = defaultdict(list)
    for fname in images[size]:
        m = md5.new()
        # hash only the first 10 KB as a cheap first-pass check
        m.update(file(fname, 'rb').read(10000))
        hashes[m.digest()].append(fname)
    for k in hashes:
        if len(hashes[k]) <= 1: continue
        # keep the first file, remove the rest
        for fname in hashes[k][1:]:
            os.remove(fname)

This is all off the top of my head, haven't tested the code, but you get the idea.

Parand
All Microsoft Bitmap files without RLE compression which have the same pixel dimensions will be the same size. As will XPMs with the same-length internal name, as will PNGs with no compression, as will Netpbm images... The list goes on and on. But I agree; checking the size will help to avoid meaningless collisions
amphetamachine