Hey All

This is my first post; I've been a lurker for a long time, so I'll try my best to explain myself here.

I have been using a longest-common-substring measure along with a basic word match and a substring match (regexp) for clustering similar stories on the net. The problem is that the approach is O(n^2) in the number of titles (I compare each title to all the others). I've only done very basic optimizations, such as storing and skipping titles that have already been matched.

What I want is some kind of preprocessing of the text so that on each iteration I reduce the number of posts to compare against. Any further optimizations are also welcome.

Here are the functions I use. The main function (not shown; a rough sketch of it follows the functions below) first calls word_match; if more than 70% of the words match, it goes on to call substring_match and LCSubstr_len. The code is in Python, but I can use C as well.

import re

def substring_match(a, b):
    # Treats each title as a regex pattern and checks whether either one
    # matches at the start of the other.
    try:
        return bool(re.match(a, b) or re.match(b, a))
    except Exception:
        # e.g. a title that is not a valid regular expression
        return False

def LCSubstr_len(S, T):
    # DP table over prefixes of S and T.
    m = len(S); n = len(T)
    L = [[0] * (n+1) for i in xrange(m+1)]
    lcs = 0
    for i in xrange(m):
        for j in xrange(n):
            if S[i] == T[j]:
                L[i+1][j+1] = L[i][j] + 1
                lcs = max(lcs, L[i+1][j+1])
            else:
                # Note: carrying the running maximum over on a mismatch makes
                # this the longest common *subsequence* length, not substring;
                # for a true substring match this cell would stay 0.
                L[i+1][j+1] = max(L[i+1][j], L[i][j+1])
    # Normalize by the average of the two lengths.
    return lcs / ((m + n) / 2.0)

def word_match(str1, str2):
    matched = 0
    try:
        str1, str2 = str(str1), str(str2)
    except Exception:
        return 0.0
    words1 = str1.split(None)
    words2 = str2.split(None)
    for i in words1:
        for j in words2:
            if i.strip() == j.strip():
                matched += 1
    len1 = len(words1)
    len2 = len(words2)
    # Average the lengths as a float; (len1+len2)/2 would truncate in Python 2.
    perc_match = float(matched) / ((len1 + len2) / 2.0)
    return perc_match
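
For context, here is a rough sketch of the O(n^2) driver loop described above. It is not my actual main function; cluster_similar and the 0.8 similarity threshold are made up for illustration, and only the 0.7 word-match threshold comes from the description.

def cluster_similar(titles, word_threshold=0.7, sim_threshold=0.8):
    # Naive pairwise pass: compare every title against every later title,
    # skipping titles that have already been placed in a cluster.
    clusters = []
    seen = set()
    for i in xrange(len(titles)):
        if i in seen:
            continue
        group = [i]
        for j in xrange(i + 1, len(titles)):
            if j in seen:
                continue
            if word_match(titles[i], titles[j]) > word_threshold and (
                    substring_match(titles[i], titles[j])
                    or LCSubstr_len(titles[i], titles[j]) > sim_threshold):
                group.append(j)
                seen.add(j)
        clusters.append(group)
    return clusters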
+4  A: 

Use an inverted index: for each word, store a list of pairs (docId, numOccurrences). Then, to find all strings that might be similar to a given string, go through its words and look up the strings containing each word in the inverted index. This way you'll get a table "(docId, wordMatchScore)" that automatically contains only entries where wordMatchScore is non-zero.

There are a huge number of possible optimizations (your code itself is also far from optimal), but if we're talking about reducing the number of string pairs to compare, this is the one.
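
A minimal sketch of that index in plain Python dictionaries (build_index and candidates are illustrative names, not part of the answer):

from collections import defaultdict

def build_index(titles):
    # word -> {docId: numOccurrences}
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, title in enumerate(titles):
        for word in title.split():
            index[word][doc_id] += 1
    return index

def candidates(title, index):
    # Accumulate a score only for documents that share at least one
    # word with the query title.
    scores = defaultdict(int)
    for word in title.split():
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    return scores  # {docId: wordMatchScore}, only non-zero entries

The score here is just a raw count of shared-word occurrences; the exact word_match percentage (and the more expensive substring_match / LCSubstr_len comparisons) only needs to be computed for the shortlisted candidates.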

jkff
Thanks for the reply... can you also tell me what I should use to create the inverted index: Lucene (PyLucene) or plain Python dictionaries? My data size could grow to a maximum of 500k posts. BTW, awesome advice, thanks. I will rewrite word_match and get rid of substring_match.
Rafi
No, you don't need lucene for this (until your data actually increases to 500k posts, at which point my advice won't help at all). Just use simple dictionaries.
jkff
+3  A: 

Speeding up word_match is easy with sets:

def word_match(str1, str2):
    # .split() splits on all whitespace, so you don't need .strip() afterwards
    words1 = set(str1.split())
    words2 = set(str2.split())
    common_words = words1 & words2
    return 2.0 * len(common_words) / (len(words1) + len(words2))

Note that by this measure 'A A A' and 'A' are 100% in common, since duplicates collapse in the sets ...
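
If duplicate words should still count, a multiset version using collections.Counter is one option; this is not part of the original answer, just a sketch:

from collections import Counter

def word_match_counted(str1, str2):
    # Count each word, so 'A A A' vs 'A' no longer scores 100%.
    words1 = Counter(str1.split())
    words2 = Counter(str2.split())
    common = words1 & words2  # multiset intersection keeps the minimum counts
    total = sum(words1.values()) + sum(words2.values())
    return 2.0 * sum(common.values()) / total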

THC4k