views:

52

answers:

3

Hello everybody!

I'm searching for a "bad" hash function: I'd like to hash strings and put similar strings in one bucket.

Can you give me a hint where to start my research? Some methods or algorithm names...

Thnaks in advance!

Sebastian

A: 

It depends on what you mean by "similar string" ?

But if you look for such a bad one, you have to build it yourself.

Example :

  • you can create 10 buckets (0 to 9) and group the strings by theirs length mod 10

  • Use a strcmp() like function and group them by the differences with a defined String

Kami
A: 

Hmm, I'm searching for a function that treats texts with many identical parts as "similar", like the search if a student copied his homework from a webpage and just edited some words ;-)

So I want to treat

ABCD become 1234 ABCE hashed to 1235

Somethings like that.

The problem is, that I don't want to compare only two strings but thousands.

Sebastian
A: 

Your problem is not an easy one. Two ideas:

This solution might be overly complicated but you could try a Fourier transform. Treat your input text as a series of samples of a function and then run a Fourier transform to convert your input to the frequency domain. The low frequency part is the general jist of the text and the high frequency part is the tiny changes.

This is somewhat similar to what jpeg compression does: Throw away the details and just leave the important stuff. If you have two almost-identical images and you jpeg compress them greatly, you usually get the same output.

pHash uses a method similar to this.

Again, this is going to be a pretty complicated way to do it.

Second idea: minHash

The idea for minHash is that you pick some markers that are likely to be the same when the inputs are the same. Then you compute a vector for the outputs of all the markers. If two inputs have similar vectors then the inputs are similar.

For example, count how many times the word "the" appears in the text. If it's even, 0, if it's odd, 1. Now count how many times the word "math" shows up in the text. Again, 0 for even, 1 for odd. Do that for a lot of words.

Now you process all the texts and each one gives you an output like "011100010101" or whatever. If two texts are similar then they will have similar outputs strings, differing by just 1 or two bits. You can use a multi-variate partition trie (MVP) to search the outputs efficiently.

This, too, might be overkill for your problem.

Eyal