ansaurus

Question

Efficient way to fingerprint an image (jpg, png, etc)?

Answer 1

A:

You can generate an MD5 checksum. If you're using .NET, you could try this:

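    using System;
    using System.IO;
    using System.Security.Cryptography;

    // Returns the MD5 checksum of the file as an uppercase hex string,
    // or an empty string if anything goes wrong while reading it.
    public string GenerateFileChecksum(string filePath)
    {
        try
        {
            // Both the hash provider and the stream are disposed by the using blocks,
            // so no explicit Close() call is needed.
            using (MD5 md5Provider = MD5.Create())
            using (FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
            {
                byte[] hash = md5Provider.ComputeHash(fileStream);
                return BitConverter.ToString(hash).Replace("-", string.Empty);
            }
        }
        catch (Exception)
        {
            // Missing file, access denied, etc.: report "no checksum".
            return string.Empty;
        }
    }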

This works with a file stream. If you have images stored as byte arrays, you can pass the array straight to ComputeHash, which also has an overload that accepts a byte[].

Sergey 2009-08-11 17:05:38

Metadata changes, such as a modified date in the headers or any other noise, would invalidate this unless you checksum only the image payload of the format. Encoding differences matter too, i.e. MD5(PNG) != MD5(JPG) for the same picture.

Aiden Bell 2009-08-11 17:07:09

Answer 2

A:

Pretty interesting question. The fastest and easiest approach would be to calculate the CRC32 of the content byte array, but that would only work on 100% identical images. For a more intelligent comparison you would probably need some kind of fuzzy logic analysis...
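A minimal sketch of the CRC32 idea, in Python using only the standard zlib module. The byte strings below are made-up stand-ins for file contents, not real images, and crc32_fingerprint is a name chosen just for this example:

```python
import zlib


def crc32_fingerprint(data: bytes) -> int:
    """Return the CRC-32 of the raw bytes as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Hypothetical stand-ins for the bytes of image files.
original = bytes(range(256)) * 4
exact_copy = bytes(original)
one_byte_changed = original[:-1] + b"\x00"  # last byte 0xFF replaced by 0x00

same = crc32_fingerprint(original) == crc32_fingerprint(exact_copy)
differs = crc32_fingerprint(original) != crc32_fingerprint(one_byte_changed)
```

Here same is true and differs is true: the checksum only says "byte-for-byte identical or not", which is exactly the limitation described above.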

Ray 2009-08-11 17:05:47

Answer 3

+15 A:

Stab in the dark, if you are looking to circumvent meta-data and size related things:

Edge Detection and scale-independent comparison
Sampling and statistical analysis of grayscale/RGB values (average lum, averaged color map)
FFT and other transforms (good article: "Classification of Fingerprints using FFT")

And numerous others.

Basically:

Decode the JPG/PNG/GIF (or whatever format) into an RGB byte array, which is independent of the encoding
Use a fuzzy pattern classification method to generate a 'hash of the pattern' in the image ... not a hash of the RGB array as some suggest
Then you want a distributed method of fast hash comparison based on matching threshold on the encapsulated hash or encoding of the pattern. Erlang would be good for this :)
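As a rough illustration of the last two steps, here is a sketch in Python of a "difference hash" compared with a Hamming-distance threshold. This is one simple stand-in for the pattern hash, not a fuzzy classifier and not necessarily what the answer has in mind. It assumes the image has already been decoded and reduced to a small grayscale grid of a fixed size; the function names, the sample grid and the threshold of 2 are all invented for the example:

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 luminance values


def difference_hash(gray: Gray) -> int:
    """Pack one bit per horizontally adjacent pixel pair: 1 if the left pixel is brighter."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(a ^ b).count("1")


def is_probable_duplicate(a: Gray, b: Gray, max_distance: int = 2) -> bool:
    """Threshold match: both grids must have the same dimensions."""
    return hamming_distance(difference_hash(a), difference_hash(b)) <= max_distance


# Hypothetical 4x5 grayscale grid, a uniformly brightened copy, and its negative.
image = [
    [10, 40, 30, 90, 60],
    [200, 150, 160, 20, 80],
    [5, 5, 70, 60, 250],
    [120, 110, 100, 130, 90],
]
brighter = [[p + 5 for p in row] for row in image]
negative = [[255 - p for p in row] for row in image]
```

Because the hash only records which neighbour is brighter, is_probable_duplicate(image, brighter) holds while is_probable_duplicate(image, negative) does not. The hashes are plain integers, so the comparison step is cheap to distribute.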

Advantages are:

With some AI/training, it will spot duplicates regardless of encoding, size, aspect ratio, hue and luminance modification, dynamic range/subsampling differences and, in some cases, perspective

Disadvantages:

Can be hard to code .. something like OpenCV might help
Probabilistic ... false positives are likely but can be reduced with neural networks and other AI
Slow unless you can encapsulate pattern qualities and distribute the search (MapReduce style)

Check out image analysis books such as:

Pattern Classification 2ed
Image Processing Fundamentals
Image Processing - Principles and Applications

And others

If you are scaling the image, then things are simpler. If not, then you have to contend with the fact that scaling is lossy in more ways than sample reduction.

Aiden Bell 2009-08-11 17:06:06

Answer 4

+3 A:

Using the byte size of the image for comparison would be suitable for many applications. Another way would be to:

Strip out the metadata.
Calculate the MD5 (or other suitable hashing algorithm) for the image.
Compare that to the MD5 (or whatever) of the potential dupe image (provided you've stripped out the metadata for that one too)

karim79 2009-08-11 17:06:10

Re-encoding, scaling, hue modification or even a single-pixel change would invalidate this.

Aiden Bell 2009-08-11 17:08:08

@Aiden Bell - I thought we were comparing essentially the exact same images minus the metadata.

karim79 2009-08-11 17:09:45

If you scale, modify the hue, or change a single pixel, it's no longer the same image...

Thomas Owens 2009-08-11 17:11:15

@Thomas Owens .... ;)

Aiden Bell 2009-08-11 17:12:29

@Aiden Bell @Thomas Owens - I understand that :) This will obviously not work if any aspect of the image has changed other than the metadata - but that's how I understood the question.

karim79 2009-08-11 17:14:51

karim79: I agree with you totally. Given what I know of the problem, this is how I would do it, although I would also consider other hash algorithms (such as the SHA-2 family).

Thomas Owens 2009-08-11 17:19:11

Answer 5

A:

I've implemented at least a trivial version of this. I transform and resize all images to a very small (fixed size) black and white thumbnail. I then compare those. It detects exact duplicates, resized duplicates, and duplicates converted to black and white. It catches a lot of duplicates without a lot of cost.
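A sketch of what such a thumbnail comparison might look like, in Python. This is not my actual code: it assumes the image is already decoded to a grayscale grid at least as large as the thumbnail, shrinks it by averaging blocks of pixels, and compares thumbnails cell by cell. The function names, the 4x4 thumbnail size and the tolerance of 8 grey levels are arbitrary choices for the example:

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 luminance values


def block_average_thumbnail(gray: Gray, size: int = 4) -> Gray:
    """Shrink a grayscale grid to size x size by averaging each block of pixels.

    Assumes the grid has at least `size` rows and columns.
    """
    height, width = len(gray), len(gray[0])
    thumb = []
    for ty in range(size):
        y0, y1 = ty * height // size, (ty + 1) * height // size
        row = []
        for tx in range(size):
            x0, x1 = tx * width // size, (tx + 1) * width // size
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) // len(block))
        thumb.append(row)
    return thumb


def thumbnails_match(a: Gray, b: Gray, size: int = 4, tolerance: int = 8) -> bool:
    """True if every thumbnail cell differs by at most `tolerance` grey levels."""
    ta, tb = block_average_thumbnail(a, size), block_average_thumbnail(b, size)
    return all(abs(pa - pb) <= tolerance
               for ra, rb in zip(ta, tb) for pa, pb in zip(ra, rb))


# Hypothetical 8x8 image, the same image enlarged 2x by pixel doubling, and its negative.
small = [[(x * 30 + y * 7) % 256 for x in range(8)] for y in range(8)]
doubled = [[small[y // 2][x // 2] for x in range(16)] for y in range(16)]
negative = [[255 - p for p in row] for row in small]
```

thumbnails_match(small, doubled) is true even though the two grids have different dimensions, while thumbnails_match(small, negative) is false. A tolerance above zero leaves room for the small differences that real resampling and re-encoding introduce.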

Jay 2009-08-11 17:08:17

A better implementation would be block averaging of color/dynamic range.

Aiden Bell 2009-08-11 17:13:58

Answer 6

+1 A:

You want to perform an image hash. Since you didn't specify a particular language I'm guessing you don't have a preference. At the very least there's a Matlab toolbox (beta) that can do it: http://users.ece.utexas.edu/~bevans/projects/hashing/toolbox/index.html. Most of the Google results on this are research results rather than actual libraries or tools.

The problem with MD5ing it is that MD5 is very sensitive to small changes in the input, and it sounds like you want to do something a bit "smarter."

Matt Ball 2009-08-11 17:08:47

Answer 7

A:

The easiest thing to do is to do a hash (like MD5) of the image data, ignoring all other metadata. You can find many open source libraries that can decode common image formats so it's quite easy to strip metadata.

But that doesn't work when the image itself is manipulated in any way, including scaling or rotating.

To do exactly what you want, you have to use Image Watermarking but it's patented and can be expensive.

ZZ Coder 2009-08-11 17:09:44

Answer 8

+3 A:

Check out this paper on Robust Image Hashing.

Adamski 2009-08-11 17:21:53

Answer 9

A:

This is just an idea: possibly the low-frequency components present in the DCT of the JPEG could be used as a size-invariant identifier.
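To make the idea a bit more concrete, here is a hedged sketch in Python. It does not read the DCT coefficients stored in a JPEG file. Instead it assumes the image has been decoded and reduced to a small square grayscale grid (reducing every image to the same grid is where the size independence would come from), computes a naive 2-D DCT, and turns the low-frequency coefficients into bits by comparing them with their median. The function names and the choice of a 4x4 low-frequency corner are assumptions for the example:

```python
import math
from typing import List


def dct_2d(gray: List[List[float]]) -> List[List[float]]:
    """Naive, unnormalized 2-D DCT-II of a square grid (O(n^4); fine for tiny inputs)."""
    n = len(gray)
    cos = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
           for k in range(n)]
    return [[sum(gray[y][x] * cos[u][y] * cos[v][x]
                 for y in range(n) for x in range(n))
             for v in range(n)]
            for u in range(n)]


def low_frequency_hash(gray: List[List[float]], keep: int = 4) -> int:
    """One bit per low-frequency coefficient (DC term skipped): 1 if above their median."""
    coeffs = dct_2d(gray)
    low = [coeffs[u][v] for u in range(keep) for v in range(keep) if (u, v) != (0, 0)]
    median = sorted(low)[len(low) // 2]
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


# Hypothetical 8x8 grayscale grid and a copy with doubled contrast.
image = [[(x * x * 7 + y * y * 13 + x * y * 5) % 128 for x in range(8)] for y in range(8)]
stretched = [[2 * p for p in row] for row in image]
```

Doubling every pixel value doubles every coefficient, so low_frequency_hash(image) equals low_frequency_hash(stretched). For real images one would compare hashes by Hamming distance rather than strict equality, since resampling and compression perturb the coefficients slightly.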

Indeera 2009-08-11 17:24:58
