Users are uploading photos to our PHP-based system. Some of them we mark as forbidden because their content is not relevant. I'm looking to optimize an 'AUTO-COMPARE' algorithm that skips photos marked as forbidden. Every upload needs to be compared against many forbidden ones.

Possible solutions:

1/ Store the forbidden files and compare the whole content - works well, but is slow.

2/ Store a checksum of each image file and compare the checksums - the idea here is to improve speed.

3/ Some intelligent algorithm that is fast enough and can compare similarity between photos. But I have no idea how to do this in PHP.

What is the best solution?

+2  A: 

Image similarity comparison isn't exactly a trivial problem, so unless you really want to devote a lot of effort to image comparison algorithms, your idea of creating some sort of hash of the image data and comparing that will at least allow you to quickly detect exact duplicates. I'd go with your current plan, but make sure it's a decent (but fast) hash so that the likelihood of collisions is low.
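
In PHP, that could be as simple as something like this (a minimal sketch; the forbidden_images table, its hash column, and the $pdo connection are made-up names, not anything from the question):

    <?php
    // Look up the hash of the uploaded file among the stored forbidden hashes.
    // sha1_file() is fast, and its collision likelihood is negligible here.
    function isForbidden(PDO $pdo, $uploadedFile)
    {
        $hash = sha1_file($uploadedFile);

        $stmt = $pdo->prepare('SELECT 1 FROM forbidden_images WHERE hash = ? LIMIT 1');
        $stmt->execute(array($hash));

        return (bool) $stmt->fetchColumn();
    }

With an index on the hash column, each upload costs one file hash plus one indexed lookup, instead of a full-content comparison against every forbidden file.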

Amber
+2  A: 

Don't calculate checksums, calculate hashes!

I once created a simple application that had to look for duplicate images on my hard disk. It would only search for .JPG files, but for every file I would calculate a hash value over the first 1024 bytes, then append the width, height and size of the image to get a string like "875234:640:480:13286", which I would use as the key for the image. As it turns out, I haven't seen any false duplicates with this algorithm, although there is still a chance of them. However, this scheme will miss duplicates when someone appends just one byte to the file or makes very small adjustments to the image.
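
A rough PHP translation of that composite-key idea might look like this (my original was Delphi; the function name and the choice of crc32() here are illustrative):

    <?php
    // Build a "hash:width:height:size" key, e.g. "875234:640:480:13286".
    function imageKey($path)
    {
        // Hash only the first 1024 bytes - fast, and together with the
        // dimensions and file size usually enough to tell files apart.
        $fp = fopen($path, 'rb');
        $head = fread($fp, 1024);
        fclose($fp);
        $hash = crc32($head);

        list($width, $height) = getimagesize($path);
        $size = filesize($path);

        return $hash . ':' . $width . ':' . $height . ':' . $size;
    }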

Another trick could be to reduce the size and number of colors of every image. If you resize every image to 128x128 pixels and reduce the number of colors to 16 (4 bits), you end up with reasonably unique patterns of 8192 bytes each. Calculate a hash value over this pattern and use the hash as the primary key. Once you get a hit, you might still have a false positive, so you would need to compare the pattern of the new image with the pattern stored in your system. This pattern comparison could also be used when the first hash solution indicates that the new image is unique. It's something that I still need to work out for my own tool, though. But it's basically a way of taking fingerprints of images and then comparing them.
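
A sketch of that fingerprinting step using PHP's GD extension (all names and details here are illustrative, not my actual tool's code): resize to 128x128, reduce to 16 colors, pack two 4-bit palette indices per byte for 8192 bytes total, then hash the pattern.

    <?php
    function imageFingerprint($path)
    {
        $src = imagecreatefromstring(file_get_contents($path));
        $thumb = imagecreatetruecolor(128, 128);
        imagecopyresampled($thumb, $src, 0, 0, 0, 0, 128, 128,
                           imagesx($src), imagesy($src));

        // Reduce to a 16-color palette, so each pixel is a 4-bit index.
        imagetruecolortopalette($thumb, false, 16);

        $pattern = '';
        for ($y = 0; $y < 128; $y++) {
            for ($x = 0; $x < 128; $x += 2) {
                // On a palette image, imagecolorat() returns the index (0-15).
                $hi = imagecolorat($thumb, $x, $y);
                $lo = imagecolorat($thumb, $x + 1, $y);
                $pattern .= chr(($hi << 4) | $lo);
            }
        }

        // Use the hash as the primary key; keep $pattern as well, so that
        // hits can still be verified byte by byte against stored patterns.
        return sha1($pattern);
    }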

My first solution will find exact matches. My second solution will find similar images. (Btw, I wrote my hash method in Delphi, but technically any hash method would be good enough.)

Workshop Alex
+1  A: 

The problem with hashes, as suggested, is that if someone changes a single pixel, the hash comes out completely different.

There are excellent frameworks out there that can compare the contents of two files and return, as a percentage, how much they look alike. There is one in particular, a command-line app I once came across, that was built in a scientific environment; it was open source, but I can't remember its name.

This kind of framework could definitely help you out, since they can be extremely fast, even with a large number of files.
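
Since I can't recall the tool's name, here is a minimal sketch of one common technique in that family: a perceptual "average hash", whose bits change only slightly when the image changes slightly, so two images can be scored as a percentage of matching bits. It assumes GD, and everything here is illustrative, not the framework I mentioned.

    <?php
    // 64-bit average hash: shrink to 8x8, grayscale, then one bit per
    // pixel depending on whether it is brighter than the mean.
    function averageHash($path)
    {
        $src = imagecreatefromstring(file_get_contents($path));
        $small = imagecreatetruecolor(8, 8);
        imagecopyresampled($small, $src, 0, 0, 0, 0, 8, 8,
                           imagesx($src), imagesy($src));

        $gray = array();
        for ($y = 0; $y < 8; $y++) {
            for ($x = 0; $x < 8; $x++) {
                $rgb = imagecolorat($small, $x, $y);
                $r = ($rgb >> 16) & 0xFF;
                $g = ($rgb >> 8) & 0xFF;
                $b = $rgb & 0xFF;
                $gray[] = ($r + $g + $b) / 3;
            }
        }

        $avg = array_sum($gray) / count($gray);
        $bits = array();
        foreach ($gray as $value) {
            $bits[] = $value > $avg ? 1 : 0;
        }
        return $bits;
    }

    // Percentage of matching bits; 100 means identical hashes,
    // values near 100 mean "very similar" images.
    function similarityPercent(array $a, array $b)
    {
        $same = 0;
        for ($i = 0; $i < count($a); $i++) {
            if ($a[$i] === $b[$i]) {
                $same++;
            }
        }
        return 100 * $same / count($a);
    }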

MiRAGe