Hey All

This is my first post; I've been a lurker for a long time, so I'll try my best to explain myself here.

I have been using a longest-common-substring measure along with a basic word match and a substring match (regexp) for clustering similar stories on the net. The problem is that the approach is O(n^2) in the number of titles (I compare each title to all the others). I've only done very basic optimizations, such as storing and skipping titles that have already been matched.

What I want is some kind of preprocessing of the text so that on each iteration I reduce the number of posts to compare against. Any further optimizations are also welcome.

Here are the functions I use. The main function (not shown; a rough sketch of it follows the functions below) first calls word_match; if more than 70% of the words match, it goes on to call substring_match and LCSubstr_len. The code is in Python, but I can use C as well.

import re

def substring_match(a, b):
    # Treats each title as a regex pattern and checks whether either one
    # matches at the start of the other.
    try:
        return bool(re.match(a, b) or re.match(b, a))
    except Exception:
        # e.g. a title that is not a valid regular expression
        return False

def LCSubstr_len(S, T):
    # DP table over prefixes of S and T.
    m = len(S); n = len(T)
    L = [[0] * (n+1) for i in xrange(m+1)]
    lcs = 0
    for i in xrange(m):
        for j in xrange(n):
            if S[i] == T[j]:
                L[i+1][j+1] = L[i][j] + 1
                lcs = max(lcs, L[i+1][j+1])
            else:
                # Note: carrying the running maximum over on a mismatch makes
                # this the longest common *subsequence* length, not substring;
                # for a true substring match this cell would stay 0.
                L[i+1][j+1] = max(L[i+1][j], L[i][j+1])
    # Normalize by the average of the two lengths.
    return lcs / ((m + n) / 2.0)

def word_match(str1, str2):
    matched = 0
    try:
        str1, str2 = str(str1), str(str2)
    except Exception:
        return 0.0
    words1 = str1.split(None)
    words2 = str2.split(None)
    for i in words1:
        for j in words2:
            if i.strip() == j.strip():
                matched += 1
    len1 = len(words1)
    len2 = len(words2)
    # Average the lengths as a float; (len1+len2)/2 would truncate in Python 2.
    perc_match = float(matched) / ((len1 + len2) / 2.0)
    return perc_match
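
For context, here is a rough sketch of the O(n^2) driver loop described above. It is not my actual main function; cluster_similar and the 0.8 similarity threshold are made up for illustration, and only the 0.7 word-match threshold comes from the description.

def cluster_similar(titles, word_threshold=0.7, sim_threshold=0.8):
    # Naive pairwise pass: compare every title against every later title,
    # skipping titles that have already been placed in a cluster.
    clusters = []
    seen = set()
    for i in xrange(len(titles)):
        if i in seen:
            continue
        group = [i]
        for j in xrange(i + 1, len(titles)):
            if j in seen:
                continue
            if word_match(titles[i], titles[j]) > word_threshold and (
                    substring_match(titles[i], titles[j])
                    or LCSubstr_len(titles[i], titles[j]) > sim_threshold):
                group.append(j)
                seen.add(j)
        clusters.append(group)
    return clusters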
+4  A: 

Use an inverted index: for each word, store a list of pairs (docId, numOccurrences). Then, to find all strings that might be similar to a given string, go through its words and look up the strings containing each word in the inverted index. This way you'll get a table "(docId, wordMatchScore)" that automatically contains only entries where wordMatchScore is non-zero.

There are a huge number of possible optimizations (your code itself is also far from optimal), but if we're talking about reducing the number of string pairs to compare, this is the one.
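
A minimal sketch of that index in plain Python dictionaries (build_index and candidates are illustrative names, not part of the answer):

from collections import defaultdict

def build_index(titles):
    # word -> {docId: numOccurrences}
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, title in enumerate(titles):
        for word in title.split():
            index[word][doc_id] += 1
    return index

def candidates(title, index):
    # Accumulate a score only for documents that share at least one
    # word with the query title.
    scores = defaultdict(int)
    for word in title.split():
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    return scores  # {docId: wordMatchScore}, only non-zero entries

The score here is just a raw count of shared-word occurrences; the exact word_match percentage (and the more expensive substring_match / LCSubstr_len comparisons) only needs to be computed for the shortlisted candidates.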

jkff
Thanks for the reply... can you also tell me what I should use to create the inverted index: Lucene (PyLucene) or plain Python dictionaries? My data size could grow to a maximum of 500k posts. BTW, awesome advice, thanks. I will rewrite word_match and get rid of substring_match.
Rafi
No, you don't need lucene for this (until your data actually increases to 500k posts, at which point my advice won't help at all). Just use simple dictionaries.
jkff
+3  A: 

Speeding up word_match is easy with sets:

def word_match(str1, str2):
    # .split() splits on all whitespace, so you don't need .strip() afterwards
    words1 = set(str1.split())
    words2 = set(str2.split())
    common_words = words1 & words2
    return 2.0 * len(common_words) / (len(words1) + len(words2))

Note that by this measure 'A A A' and 'A' are 100% in common, since duplicates collapse in the sets ...
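
If duplicate words should still count, a multiset version using collections.Counter is one option; this is not part of the original answer, just a sketch:

from collections import Counter

def word_match_counted(str1, str2):
    # Count each word, so 'A A A' vs 'A' no longer scores 100%.
    words1 = Counter(str1.split())
    words2 = Counter(str2.split())
    common = words1 & words2  # multiset intersection keeps the minimum counts
    total = sum(words1.values()) + sum(words2.values())
    return 2.0 * sum(common.values()) / total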

THC4k