views: 337
answers: 5

I have sets of strings in a database. Each set will have fewer than 500 members, there will be tens of thousands of sets, and the strings are natural language. I would like to detect duplicate strings within each set. New strings will be compared against an existing set, and added to the database if they are unique.

Are there hashing algorithms that would be effective at finding (very) similar strings? For example, near-duplicate strings would probably have the same number of words, but the encoding may differ slightly (UTF-8 vs. Latin-1).

+1  A: 

If each set has at most 500 strings, perhaps you can compare a new string directly against each one. First convert to a standard representation (say, UTF-16). The Levenshtein distance can be a nice way of comparing the similarity of two strings.
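A minimal pure-Python sketch of that approach, using Unicode NFC normalization as the standard representation and an arbitrary edit-distance threshold of 2 (both my choices, not the answer's):

    import unicodedata

    def normalize(s):
        # Fold encoding/representation variants to a single Unicode form.
        return unicodedata.normalize("NFC", s)

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance, two rows at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def is_duplicate(new_string, existing_set, max_distance=2):
        new_string = normalize(new_string)
        return any(levenshtein(new_string, normalize(s)) <= max_distance
                   for s in existing_set)

With at most 500 members per set, that is a few hundred distance computations per new string, which may well be fast enough.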

Mr Fooz
Because there will be many sets, computing a similarity distance as provided by difflib and the like is not viable.
wehriam
+2  A: 

For starters, you should probably do some sort of normalization: convert all of your text to a single encoding (e.g., UTF-8), do case-folding and other Unicode normalizations, and perhaps also sort each set (depending on how you're storing them).

It's unclear (to me) from your question whether you want to find exact matches or just string sets that are "similar". If you only care about exact matches once the normalization is taken into account, then you're pretty much done. Just have an index on the normalized forms of your string sets and you can look up new sets quickly by normalizing them as well.
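A small sketch of both steps, with NFKC plus case-folding as the assumed normalization and a plain in-memory dict standing in for the database index:

    import unicodedata

    def normalize_string(s):
        # One encoding-independent form: Unicode-normalize, then case-fold.
        return unicodedata.normalize("NFKC", s).casefold()

    def normalize_set(strings):
        # Sort so that two sets with the same members get the same key.
        return tuple(sorted(normalize_string(s) for s in strings))

    index = {}  # normalized set -> set id

    def register(set_id, strings):
        index[normalize_set(strings)] = set_id

    def find_exact(strings):
        return index.get(normalize_set(strings))  # None if no exact match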

If you want to find near matches then you'll probably want to do some sort of similarity hashing. The Wikipedia article on Locality Sensitive Hashing describes a number of techniques.

The basic idea behind a number of these techniques is to compute a handful of very lossy hashes on each string, h[0] through h[n]. To look up a new string set, you'd compute its hashes and look each of these up. Anything that gets at least one match is "similar", and the more matches, the more similar it is (and you can choose what threshold to cut things off at).
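As one concrete (assumed) instance of this idea, here is a MinHash-style sketch over the words of each string, with a small fixed number of hashes and a dict from (hash position, hash value) to the strings that produced it:

    import hashlib

    NUM_HASHES = 8  # the "handful" of lossy hashes h[0]..h[n]

    def _hash(word, seed):
        data = ("%d:%s" % (seed, word)).encode("utf-8")
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

    def signature(text):
        # h[i] = the minimum of hash_i over the words of the string.
        words = text.lower().split()
        return tuple(min((_hash(w, seed) for w in words), default=0)
                     for seed in range(NUM_HASHES))

    index = {}  # (position, hash value) -> strings that produced it

    def add_string(s):
        for pos, h in enumerate(signature(s)):
            index.setdefault((pos, h), set()).add(s)

    def similar_candidates(s):
        # Count how many of the hashes each stored string shares with s.
        counts = {}
        for pos, h in enumerate(signature(s)):
            for candidate in index.get((pos, h), ()):
                counts[candidate] = counts.get(candidate, 0) + 1
        return counts  # more shared hashes means more similar; threshold as you like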

Laurence Gonsalves
+1  A: 

The short answer is to just guess at a hash parameter that matches your idea of "similar".

Something simple, like the sum of all the letters (A) and the sum of the differences between adjacent letters (B), might work. For each new string, use its A and B values to quickly look up a much smaller set of similar strings, and then do a more careful comparison between these.
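A quick sketch of that heuristic, reading "differences between adjacent letters" as absolute differences and picking a bucket width of 50 arbitrarily (both my interpretation, not the answer's):

    def ab_hash(s):
        codes = [ord(c) for c in s.lower()]
        a = sum(codes)                                         # sum of letters
        b = sum(abs(y - x) for x, y in zip(codes, codes[1:]))  # adjacent differences
        return a, b

    WIDTH = 50
    buckets = {}  # (A // WIDTH, B // WIDTH) -> strings

    def add(s):
        a, b = ab_hash(s)
        buckets.setdefault((a // WIDTH, b // WIDTH), []).append(s)

    def candidates(s):
        # Check neighbouring buckets too, since similar strings can straddle a boundary.
        a, b = ab_hash(s)
        found = []
        for da in (-1, 0, 1):
            for db in (-1, 0, 1):
                found.extend(buckets.get((a // WIDTH + da, b // WIDTH + db), []))
        return found  # the smaller candidate set to compare more carefully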

This probably isn't the purest solution, but practically, lots of problems are solved this way. Beyond this, I think there is currently quite a bit of work on similar problems in genetics (i.e., finding similar gene sequences within huge databases), but I don't think there's an accepted generic solution to this problem.

tom10
A: 

This might be overkill, but you might want to try NLTK (Natural Language Toolkit), which is Python-based.

One feature that might be useful is analyzing sentence structure. Of course, that might lead to some strings being marked as duplicates because they have the same grammatical structure but different words and meaning.

You might also be able to use probability and classification features.
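A hedged sketch of the sentence-structure idea, using NLTK's tokenizer and part-of-speech tagger; treating two strings as structural matches when their tag sequences are identical is an illustration, not something the answer prescribes:

    import nltk

    # One-time model downloads for the tokenizer and tagger.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def pos_signature(sentence):
        tokens = nltk.word_tokenize(sentence)
        return tuple(tag for _, tag in nltk.pos_tag(tokens))

    # Same grammatical structure, different words and meaning:
    print(pos_signature("The cat sat on the mat"))
    print(pos_signature("A dog slept in the yard"))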

Van Gale
A: 

You could get crazy and try latent semantic analysis/mapping and singular value decomposition (see latent semantic mapping), together with SVDLIBC, which is pretty easy to get going with.
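To give a sense of the idea, here is a toy LSA sketch using NumPy's SVD in place of SVDLIBC; the corpus, the rank k=2, and cosine similarity are all illustrative choices:

    import numpy as np

    docs = ["the cat sat on the mat",
            "a cat sat upon the mat",
            "stock prices fell sharply today"]

    vocab = sorted({w for d in docs for w in d.split()})
    # Term-document count matrix: rows are terms, columns are documents.
    A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # each row: a document in LSA space

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(doc_vectors[0], doc_vectors[1]))  # near-duplicates score high
    print(cosine(doc_vectors[0], doc_vectors[2]))  # unrelated documents score low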

blackkettle