



I'm looking to create a base table of images and then compare any new images against it to determine whether the new image is an exact (or close) duplicate of one in the base. For example: if you want to avoid storing the same image hundreds of times, you could store one copy of it and provide reference links to it. When a new image is entered, you want to compare it to the existing images to make sure it's not a dup... ideas?

(One of mine was to reduce each image to a small thumbnail and then randomly pick 100 pixel locations and compare...)


Just compare the md5 sums of the files?

That's not addressing the point of similar-but-not-equal images...
Well, in that case you can't really be sure about similarity anyway; the best way would be to use hashes.
If you have a large number of images, look into a Bloom filter, which uses several hash functions to test set membership efficiently, at the cost of occasional false positives. If the number of images is not huge, then a cryptographic hash like md5 should be sufficient.
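For the exact-duplicate case, the hash idea from these comments can be sketched as follows. This is only a minimal illustration, assuming the images are available as raw bytes; `file_digest` and `ImageStore` are hypothetical names made up for the example, and an in-memory dict stands in for whatever table you actually use.

```python
import hashlib


def file_digest(data):
    """Return the MD5 hex digest of the raw file bytes."""
    return hashlib.md5(data).hexdigest()


class ImageStore:
    """Hypothetical store that keeps one copy per distinct byte content."""

    def __init__(self):
        self._by_digest = {}  # digest -> stored bytes

    def add(self, data):
        """Store data if unseen; return (digest, is_new)."""
        digest = file_digest(data)
        is_new = digest not in self._by_digest
        if is_new:
            self._by_digest[digest] = data
        return digest, is_new

    def __len__(self):
        return len(self._by_digest)
```

The returned digest can serve as the reference link to the single stored copy. Note that this only catches byte-for-byte identical files; a re-saved or resized copy gets a different digest.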

So (trying to understand the Bloom filter) - does that mean you select random pixel points on the base image, randomly get either a red/green/blue value of the pixel - then compare to the new image? and then use a probability level (90% match) to determine how similar the two images are?
This isn't a similarity check, it's an equivalence check. If you need similarity, then hashing is not the right approach. The idea behind Bloom is to use multiple hash algorithms to increase the likelihood of unique identification. Selecting random points isn't the best approach for a hashing algorithm because it will yield different results each time.
Picking 100 random points could mean that similar (or occasionally even dissimilar) images would be marked as the same, which I assume is not what you want. MD5 hashes wouldn't work if the images were in different formats (png, jpeg, etc.), had different sizes, or had different metadata. Reducing all images to a smaller size is a good bet; a pixel-for-pixel comparison shouldn't take too long, provided you're using a good image library / fast language and the size is small enough.
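A rough sketch of the "shrink, then compare pixel for pixel" idea, under some loud assumptions: images are already decoded into grayscale lists of rows (values 0-255), each input is at least as large as the thumbnail, and the function names, the 8x8 size, and the tolerance are all placeholders rather than recommended values.

```python
def shrink(pixels, out_w, out_h):
    """Downscale a grayscale image (list of rows) by averaging pixel blocks.

    Assumes the input is at least out_w x out_h so every block is non-empty.
    """
    in_h, in_w = len(pixels), len(pixels[0])
    small = []
    for oy in range(out_h):
        y0, y1 = oy * in_h // out_h, (oy + 1) * in_h // out_h
        row = []
        for ox in range(out_w):
            x0, x1 = ox * in_w // out_w, (ox + 1) * in_w // out_w
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def mean_abs_diff(a, b):
    """Average absolute per-pixel difference of two same-sized images."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def looks_same(img_a, img_b, size=8, tolerance=2.0):
    """Compare two images after shrinking both to size x size thumbnails."""
    small_a = shrink(img_a, size, size)
    small_b = shrink(img_b, size, size)
    return mean_abs_diff(small_a, small_b) <= tolerance
```

Because both images are reduced to the same thumbnail size first, a resized copy of an image can still compare as equal, which a file hash would miss.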

You could try making them tiny, then if they are the same perform another comparison on a larger size - could be a good combination of speed and accuracy...

This is a beautiful problem for which to look at the OpenCV library...

I think that to detect similar but not equal images, you should try a combination of image analysis algorithms (histograms and the like). Anyway, I would take a look at this thread at Gamedev.

As cartman pointed out, you can use any kind of hash value for finding exact duplicates.

One starting point for finding close images could be here. This is a tool used by CG companies to check if revamped images are still showing essentially the same scene.

Below are three approaches to solving this problem (and there are many others).

  • The first is a standard approach in computer vision, keypoint matching. This may require some background knowledge to implement, and can be slow.

  • The second method uses only elementary image processing, and is potentially faster than the first approach, and is straightforward to implement. However, what it gains in understandability, it lacks in robustness -- matching fails on scaled, rotated, or discolored images.

  • The third method is both fast and robust, but is potentially the hardest to implement.

Keypoint Matching

Better than picking 100 random points is picking 100 important points. Certain parts of an image have more information than others (particularly at edges and corners), and these are the ones you'll want to use for smart image matching. Google "keypoint extraction" and "keypoint matching" and you'll find quite a few academic papers on the subject. These days, SIFT keypoints are arguably the most popular, since they can match images under different scales, rotations, and lighting. Some SIFT implementations can be found here.

One downside to keypoint matching is the running time of a naive implementation: O(n^2m), where n is the number of keypoints in each image, and m is the number of images in the database. Smarter spatial data structures, like quadtrees or binary space partitioning, might find the closest match faster.
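To make the O(n^2m) cost concrete, here is a minimal sketch of the naive search. It assumes keypoint descriptors have already been extracted by some other tool and are plain tuples of floats; `count_matches`, `best_image`, and the `max_dist` cutoff are hypothetical, and each database image is assumed to have at least one descriptor.

```python
import math


def count_matches(query, candidate, max_dist=0.5):
    """Count query descriptors whose nearest candidate descriptor is close.

    Every query descriptor is compared with every candidate descriptor,
    which is the O(n^2) part for n keypoints per image.
    """
    matches = 0
    for q in query:
        nearest = min(math.dist(q, c) for c in candidate)
        if nearest <= max_dist:
            matches += 1
    return matches


def best_image(query, database):
    """Return (name, match_count) of the database image with most matches.

    database maps image name -> list of descriptors; looping over all m
    images gives the overall O(n^2 m) running time.
    """
    scores = {name: count_matches(query, descs) for name, descs in database.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Real matchers usually add further filtering of ambiguous matches; this sketch only shows where the quadratic cost comes from.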

Alternative solution: Histogram method

Another less robust but potentially faster solution is to build feature histograms for each image, and choose the image with the histogram closest to the input image's histogram. I implemented this as an undergrad, and we used 3 color histograms (red, green, and blue), and two texture histograms, direction and scale. I'll give the details below, but I should note that this only worked well for matching images VERY similar to the database images. Re-scaled, rotated, or discolored images can fail with this method, but small changes like cropping won't break the algorithm.

Computing the color histograms is straightforward -- just pick the range for your histogram buckets, and for each range, tally the number of pixels with a color in that range. For example, consider the "green" histogram, and suppose we choose 4 buckets for our histogram: 0-63, 64-127, 128-191, and 192-255. Then for each pixel, we look at the green value, and add a tally to the appropriate bucket. When we're done tallying, we divide each bucket total by the number of pixels in the entire image to get a normalized histogram for the green channel.
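The tallying described above might look like this for one channel. It is a sketch assuming 8-bit channel values (0-255) and a bucket count that divides 256 evenly; `channel_histogram` is just an illustrative name.

```python
def channel_histogram(values, buckets=4):
    """Return a normalized histogram of 0-255 channel values.

    With buckets=4 the ranges are 0-63, 64-127, 128-191, and 192-255.
    """
    width = 256 // buckets
    counts = [0] * buckets
    for v in values:
        counts[min(v // width, buckets - 1)] += 1
    total = len(values)
    # Divide by the pixel count so images of different sizes are comparable.
    return [c / total for c in counts]
```

For a list of `(r, g, b)` pixels, the green histogram would be `channel_histogram([g for (r, g, b) in pixels])`, and likewise for red and blue.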

For the texture direction histogram, we started by performing edge detection on the image. Each edge point has a normal vector pointing in the direction perpendicular to the edge. We quantized the normal vector's angle into one of 6 buckets between 0 and PI (since edges have 180-degree symmetry, we converted angles between -PI and 0 to be between 0 and PI). After tallying up the number of edge points in each direction, we have an un-normalized histogram representing texture direction, which we normalized by dividing each bucket by the total number of edge points in the image.
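A possible sketch of the direction tally, assuming edge detection has already produced a gradient vector `(gx, gy)` for each edge point (the gradient points along the edge normal). The function name and input format are assumptions for illustration, not the original implementation.

```python
import math


def direction_histogram(gradients, buckets=6):
    """Normalized histogram of edge-normal angles folded into [0, pi).

    gradients is a non-empty list of (gx, gy) vectors, one per edge point.
    """
    counts = [0] * buckets
    for gx, gy in gradients:
        # atan2 gives an angle in [-pi, pi]; the modulo folds opposite
        # directions together, reflecting the 180-degree edge symmetry.
        angle = math.atan2(gy, gx) % math.pi
        index = min(int(angle / math.pi * buckets), buckets - 1)
        counts[index] += 1
    total = len(gradients)
    return [c / total for c in counts]
```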

To compute the texture scale histogram, for each edge point, we measured the distance to the next-closest edge point with the same direction. For example, if edge point A has a direction of 45 degrees, the algorithm walks in that direction until it finds another edge point with a direction of 45 degrees (or within a reasonable deviation). After computing this distance for each edge point, we dump those values into a histogram and normalize it by dividing by the total number of edge points.
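One way the walk could be sketched, with several simplifying assumptions that are mine rather than the original implementation's: edge points are given as a dict from `(x, y)` to a direction in radians, the walk takes unit steps in one direction only, and `max_steps`, `tolerance`, and the bucket layout are arbitrary illustrative values.

```python
import math


def angle_gap(a, b):
    """Smallest difference between two directions, modulo 180 degrees."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)


def scale_distances(edge_dirs, max_steps=50, tolerance=0.2):
    """For each edge point, walk along its direction to the next edge
    point with a similar direction and record the number of steps."""
    distances = []
    for (x, y), theta in edge_dirs.items():
        dx, dy = math.cos(theta), math.sin(theta)
        for step in range(1, max_steps + 1):
            probe = (round(x + step * dx), round(y + step * dy))
            if probe == (x, y):
                continue  # rounding kept us on the starting pixel
            other = edge_dirs.get(probe)
            if other is not None and angle_gap(theta, other) <= tolerance:
                distances.append(step)
                break
    return distances


def scale_histogram(edge_dirs, buckets=4, bucket_width=5):
    """Bucket the distances and normalize by the number of edge points."""
    counts = [0] * buckets
    for d in scale_distances(edge_dirs):
        counts[min((d - 1) // bucket_width, buckets - 1)] += 1
    total = len(edge_dirs)
    return [c / total for c in counts]
```

Edge points whose walk finds nothing contribute to no bucket, so this histogram can sum to less than 1.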

Now you have 5 histograms for each image. To compare two images, you take the absolute value of the difference between each histogram bucket, and then sum these values. For example, to compare images A and B, we would compute

|A.green_histogram.bucket_1 - B.green_histogram.bucket_1| 

for each bucket in the green histogram, and repeat for the other histograms, and then sum up all the results. The smaller the result, the better the match. Repeat for all images in the database, and the match with the smallest result wins. You'd probably want to have a threshold, above which the algorithm concludes that no match was found.
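The comparison and threshold can be sketched like this, assuming each image is represented as a dict mapping a histogram name to its list of normalized buckets, and that all images use the same names and bucket counts. `find_match` and the threshold value are illustrative only.

```python
def histogram_distance(a, b):
    """Sum of absolute bucket differences over every histogram in a."""
    return sum(abs(x - y) for name in a for x, y in zip(a[name], b[name]))


def find_match(query, database, threshold=0.5):
    """Return (name, score) of the closest image, or (None, score) if
    even the closest image is farther away than the threshold."""
    best_name, best_score = None, float("inf")
    for name, hists in database.items():
        score = histogram_distance(query, hists)
        if score < best_score:
            best_name, best_score = name, score
    if best_score > threshold:
        return None, best_score
    return best_name, best_score
```

A sensible threshold depends on how many histograms and buckets you use, so it would need tuning on your own images.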

Third Choice - Keypoints + Decision Trees

A third approach that is probably much faster than the other two is using semantic texton forests (PDF). This involves extracting simple keypoints and using a collection of decision trees to classify the image. This is faster than simple SIFT keypoint matching, because it avoids the costly matching process, and keypoints are much simpler than SIFT, so keypoint extraction is much faster. However, it preserves the SIFT method's invariance to rotation, scale, and lighting, an important feature that the histogram method lacked.


My mistake -- the Semantic Texton Forests paper isn't specifically about image matching, but rather region labeling. The original paper that does matching is this one: Keypoint Recognition using Randomized Trees. Also, the papers below continue to develop the ideas and represent the state of the art (c. 2010):

The Histogram approach seems to make the most sense. I'm assuming you can rotate the image to perform this on all sides just in case the image being compared to was turned (treating the same image as 4) - thanks
@meade That's right. Something else to consider: depending on your problem, you might not need to use all 5 histograms in your algorithm. Discarding the texture direction histogram will allow you to match rotated versions of the picture. Discarding the texture scale histogram will allow you to match re-scaled versions of the image. You'll lose some ability to compare similarity, but this might not be a problem, depending on your situation. Also, since computing texture information is the most costly part of the algorithm, this will make your algorithm speedy, too.

I have an idea that could work and is likely to be very fast. You can sub-sample an image to, say, 80x60 resolution or comparable, and convert it to grey scale (doing this after subsampling is faster). Process both images you want to compare this way. Then compute the normalised sum of squared differences between the two images (the query image and each one from the db), or even better, Normalised Cross Correlation, which gives a response closer to 1 when the images are similar. If the images are similar, you can then proceed to more sophisticated techniques to verify that they are the same image. Obviously this algorithm is linear in the number of images in your database, but it is going to be very fast: up to 10000 images per second on modern hardware. If you need invariance to rotation, a dominant gradient can be computed for the small image and the whole coordinate system rotated to a canonical orientation; this, though, will be slower. And no, there is no invariance to scale here.
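As a rough illustration of the correlation step, here is a sketch of the zero-mean variant of normalised cross correlation. It assumes both images have already been sub-sampled to the same size, converted to grey scale, and flattened into equal-length lists; `ncc` is a hypothetical helper name.

```python
import math


def ncc(a, b):
    """Zero-mean normalised cross correlation of two equal-length lists.

    The result lies in [-1, 1]; values near 1 mean the images are similar
    up to brightness and contrast changes.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat image has no variance; treat as no correlation
    return sum(x * y for x, y in zip(da, db)) / denom
```

Candidates scoring above some cutoff (to be chosen by experiment) would then go on to the more expensive verification.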

If you want something more general, or you are working with big databases (millions of images), then you need to look into image retrieval theory (loads of papers have appeared in the last 5 years). There are some pointers in other answers. But it might be overkill, and the suggested histogram approach will do the job. Though I would think a combination of many different fast approaches would be even better.

