I don't think this problem can be solved by hashing. Here's the difficulty: suppose you have a red pixel, and you want the values 3 and 5 to hash to the same value. Well, then you also want 5 and 7 to hash to the same value, and 7 and 9, and so on... you can construct a chain that says you want all pixel values to hash to the same value.
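To make the chain argument concrete, here is a minimal sketch. The function name `reachable_values` and the tolerance of 2 are my own illustrative choices, not part of the problem statement. It collects every value that is forced to share a hash with a starting value once you require that any two values within the tolerance collide.

```python
def reachable_values(start: int, tolerance: int, max_value: int = 255) -> set[int]:
    """Values forced to share a hash with `start` when any two values
    within `tolerance` of each other must hash identically."""
    seen = {start}
    frontier = [start]
    while frontier:
        v = frontier.pop()
        low = max(0, v - tolerance)
        high = min(max_value, v + tolerance)
        for w in range(low, high + 1):
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return seen


# Starting from 3 with a tolerance of 2, the chain reaches every 8-bit value.
print(len(reachable_values(3, 2)))  # 256
```

So a hash that never separates near-equal values cannot separate anything at all.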
Here's what I would try instead:
- Build a huge B-tree, with 32-way fanout at each node, containing all of the images.
- All images in the tree are the same size, or they're not duplicates.
- Give each colored pixel a unique number starting at zero. Upper left might be numbered 0, 1, 2 for the R, G, B components, or you might be better off with a random permutation, because you're going to compare images in order of that numbering.
- An internal node at depth n discriminates 32 ways on the value of pixel n divided by 8 (dividing by 8 takes out some of the noise between nearby pixel values).
- A leaf node contains some small number of images, let's say 10 to 100. Or maybe the number of images is an increasing function of depth, so that if you have 500 duplicates of one image, after a certain depth you stop trying to distinguish them.
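The steps above might look roughly like this. It is only a sketch under my own assumptions: images are flat lists of 8-bit channel values, all the same length, and the names `Node`, `make_order`, `child_for`, `insert`, and `find_leaf` are hypothetical. The leaf capacity is set to 2 so the demo actually splits a leaf. This version puts each image in exactly one child.

```python
import random
from dataclasses import dataclass, field

BUCKET = 8          # 256 channel values / 8 gives the 32-way fanout
LEAF_CAPACITY = 2   # tiny for the demo; 10 to 100 is the suggested range


@dataclass
class Node:
    depth: int = 0
    images: list = field(default_factory=list)    # (image_id, channels), leaves only
    children: dict = field(default_factory=dict)  # bucket index -> Node


def make_order(num_channels, seed=0):
    """Random permutation of the channel numbers 0..num_channels-1."""
    order = list(range(num_channels))
    random.Random(seed).shuffle(order)
    return order


def child_for(node, channels, order):
    """Child chosen by the channel that this node's depth discriminates on."""
    bucket = channels[order[node.depth]] // BUCKET
    return node.children.setdefault(bucket, Node(depth=node.depth + 1))


def insert(root, image_id, channels, order):
    node = root
    while True:
        if not node.children:  # currently a leaf
            if len(node.images) < LEAF_CAPACITY or node.depth >= len(order):
                node.images.append((image_id, channels))
                return
            # Leaf is full: make it an internal node and push its images down.
            pending, node.images = node.images, []
            for old_id, old_channels in pending:
                child_for(node, old_channels, order).images.append((old_id, old_channels))
        node = child_for(node, channels, order)


def find_leaf(root, channels, order):
    """Leaf an image with these channel values would land in, or None."""
    node = root
    while node.children:
        node = node.children.get(channels[order[node.depth]] // BUCKET)
        if node is None:
            return None
    return node


order = make_order(6)
root = Node()
a, b, c = [10] * 6, [12] * 6, [200] * 6
for name, img in (("a", a), ("b", b), ("c", c)):
    insert(root, name, img, order)

print(find_leaf(root, a, order) is find_leaf(root, b, order))  # True
print(find_leaf(root, a, order) is find_leaf(root, c, order))  # False
```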
Once all two million images are inserted in the tree, two images are duplicates only if they're at the same node. Right? Wrong! If the pixel values in two images are 127 and 128, one goes into outedge 15 and the other goes into outedge 16. So when you discriminate on a pixel, you may actually have to insert the image into one or two children:
- For brightness B, insert at B/8, (B-3)/8, and (B+3)/8. Sometimes all 3 will be equal, and always at least 2 of 3 will be equal. But with probability about 3/4 (whenever B mod 8 is 0, 1, 2, 5, 6, or 7, assuming values are spread evenly), you double the number of outedges on which the image appears. Depending on how deep things go you could have lots of extra nodes.
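A small sketch of that bucket rule, with a hypothetical helper name `child_buckets`. Clamping at 0 and 255 is my own assumption, since the edges of the range are not covered above.

```python
def child_buckets(value: int, slack: int = 3, bucket: int = 8, max_value: int = 255) -> set[int]:
    """Outedges an image is inserted on for one channel value:
    value/8, (value-3)/8 and (value+3)/8, with duplicates merged."""
    low = max(0, value - slack)           # clamp at the bottom of the range (assumption)
    high = min(max_value, value + slack)  # clamp at the top of the range (assumption)
    return {low // bucket, value // bucket, high // bucket}


print(sorted(child_buckets(127)))  # [15, 16]
print(sorted(child_buckets(128)))  # [15, 16]
print(sorted(child_buckets(124)))  # [15]
```

With these settings 127 and 128 now share outedges, and any two values that differ by at most 3 end up with at least one child in common, which is what the extra insertions buy you.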
Someone else will have to do the math and see if you have to divide by something larger than 8 to keep images from being duplicated too much. The good news is that even if the true fanout is only around 4 instead of 32, you only need a tree of depth 10. Four duplications in 10 levels takes the two million images up to 32 million entries at the leaves. I hope you have plenty of RAM at your disposal! If not, you can put the tree in the filesystem.
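Here is the back-of-the-envelope arithmetic spelled out. The variable names are mine, and the inputs (effective fanout of 4, depth 10, four doublings) are just the rough guesses from the paragraph above, not measured values.

```python
num_images = 2_000_000
effective_fanout = 4   # guessed true fanout, instead of the nominal 32
depth = 10
doublings = 4          # times an image is assumed to split into two children

leaf_count = effective_fanout ** depth        # leaves available at depth 10
leaf_entries = num_images * 2 ** doublings    # image copies stored at the leaves

print(leaf_count)                         # 1048576
print(leaf_entries)                       # 32000000
print(round(leaf_entries / leaf_count))   # 31
```

Under those guesses each leaf holds about 31 entries, which sits inside the 10 to 100 range suggested for leaf size.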
Let me know how this goes!