I'm trying to write some code that will do "fuzzy hashing". That is: I want several inputs to hash to the same output so that I can do searches etc quickly and easily. If A hashes to 1 and C hashes to 1, it will be trivial for me to find out that A is equivalent to C.
Designing such a hash function seems hard, so I was wondering if anyone had experience with CMPH or GPERF and could walk me through creating a function that would result in this hash function.
Thanks in advance! Stefan
@Ben
In this case, matrixes of booleans, but I can easily pack them into 64 bit integers. Rotations, translations, etc in the input are irrelevant and need to be weeded out. Thus:
000
111
000
Is equivalent to
111
000
000
and
001
001
001
(simplification)
@Kinopiko
My best bet thus far would be to determine some sort of "canonical" representation and design code that terminates when the transformations reach such a representation (say...packing all the bits at the bottom). Yet this is slow and I'm looking for a better way. My data set is large.
@Jason
These two would not hash to the same value.
000
010
000
000
011
000