Stack Overflow implemented its "Related Questions" feature by taking the title of the current question and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.

I want to do something similar in my Django site. What is the best way to filter a string (the question title in this case) against a long list of words in Python? Any libraries that would enable me to do that efficiently?
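
For the search half, I imagine the Django side looking roughly like this (a sketch only; Question is a hypothetical model with a title field, and a real fulltext backend would replace the naive icontains filter):

from django.db.models import Q

from myapp.models import Question  # hypothetical model with a title field

def related_questions(keywords, limit=10):
    if not keywords:
        return Question.objects.none()
    # OR together one clause per keyword that survived the common-word filter
    query = Q()
    for word in keywords:
        query |= Q(title__icontains=word)
    return Question.objects.filter(query)[:limit]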

+2  A: 

I don't know which method is used by SO, but:

I suppose a fast (and very simplistic) way of doing this is to drop down to C and check the words one by one, maybe with the KMP algorithm.

Another (not so simple) way is to keep a trie of those 10,000 words and search the text using it. This would be super fast, but fairly hard to implement. If you are interested, I have a dummy implementation in C++.

EDIT

Looking back at it, I see I've only used fstream, so it could easily be modified for C and then integrated with Python (a rough ctypes sketch is at the end of this answer). This is the source:

#include <fstream>
#include <cstring>  // strlen
using namespace std;

ifstream in("trie.in");
ofstream out("trie.out");

struct Trie
{
    short nr;    // how many stored words end exactly at this node
    short pref;  // how many stored words pass through this node
    Trie *children[26], *father;
    Trie()
    {
        nr = pref = 0;
        for(int i = 0; i < 26; i++)
            children[i] = NULL;
        father = NULL;
    }
};

Trie t, *it, *it2;
int op, l;
char s[22], *p;

int main()
{
    while(in >> op >> s)
    {
        p = s;
        it = &t;
        l = 0;
        // walk the path for s character by character
        while(p[0] != '\0')
        {
            if(it->children[p[0] - 'a'] == NULL && op == 2)
                { op = 9; out << "0\n"; break; }  // word absent, count is 0
            if(it->children[p[0] - 'a'] == NULL && op == 3)
                break;  // the common prefix cannot be extended any further
            if(it->children[p[0] - 'a'] == NULL)
            {
                it->children[p[0] - 'a'] = new Trie();
                it->children[p[0] - 'a']->father = it;
            }
            it = it->children[p[0] - 'a'];
            if(op == 0)
                ++it->pref;
            else if(op == 1 && it->pref > 0)
                --it->pref;
            else if(op == 3 && it->pref > 0)
                l = p - s + 1;  // a prefix of this length is shared with some word

            p++;
        }
        if(op == 0)
            it->nr++;
        else if(op == 1 && it->nr > 0)
        {
            it->nr--;
            // prune nodes that no longer lie on any stored word's path
            l = strlen(s) - 1;
            while(it->pref == 0 && it != &t && l >= 0)
            {
                it2 = it->father;
                it2->children[s[l--] - 'a'] = NULL;
                delete it;
                it = it2;
            }
        }
        else if(op == 2)
            out << it->nr << '\n';
        else if(op == 3)
            out << l << '\n';
    }
    return 0;
}

It takes input from trie.in, formatted like this:

0 lat
0 mare
0 lac
2 la
0 mare
1 lat
0 ma
0 lung
3 latitudine
0 mari
2 mare
0 lat
0 mic
3 latime
2 lac
3 mire

And produces output like this:

0
2
2
3
1
2

0 w - add the word w to the list (duplicates allowed)

1 w - remove one occurrence of the word w from the list

2 w - print how many copies of w are in the list

3 w - print the length of the longest common prefix of w with any word in the list

Oh, and sorry for the poor formatting, this was done for training.
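
For what it's worth, calling a C build of this from Python could look roughly like the following sketch. It assumes the code above were reworked into a shared library (e.g. gcc -shared -fPIC trie.c -o libtrie.so) exposing hypothetical trie_add and trie_count functions:

import ctypes

# Hypothetical API: void trie_add(const char *w); int trie_count(const char *w)
lib = ctypes.CDLL("./libtrie.so")
lib.trie_add.argtypes = [ctypes.c_char_p]
lib.trie_count.argtypes = [ctypes.c_char_p]
lib.trie_count.restype = ctypes.c_int

for word in (b"lat", b"mare", b"lat"):
    lib.trie_add(word)
print(lib.trie_count(b"lat"))  # 2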

Gabi Purcaru
Please share your trie implementation, I'm definitely interested.
How do I use your C++ implementation from a Python program?
Continuation
+2  A: 

I think a much simpler solution that is still reasonably fast is to use SQLite and regular expressions.

Put the long list of words in an SQLite table and build a b-tree index. This gives you log(n)-time existence queries. Split the smaller string with a regular expression and loop over its words, running an existence query for each one.

You can stem the words first with the Porter stemmer from NLTK.
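
A minimal sketch of that approach, assuming an in-memory database and a toy word list (sqlite3 and re are in the standard library):

import re
import sqlite3

common_words = ["the", "and", "when", "to", "use", "for"]  # stand-in for the real 10,000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")  # PRIMARY KEY gives the b-tree index
conn.executemany("INSERT INTO words VALUES (?)", ((w,) for w in common_words))

def keywords(title):
    # split the title into lowercase words, keep those absent from the table
    result = []
    for word in re.findall(r"[a-z']+", title.lower()):
        row = conn.execute("SELECT 1 FROM words WHERE word = ?", (word,)).fetchone()
        if row is None:
            result.append(word)
    return result

print(keywords("When to use Python for web applications"))
# ['python', 'web', 'applications']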

Kevin
why not `WHERE word='word1' OR word='word2' ...` and do it in one query?
aaronasterling
a disjunctive query will tell you if any of the words are in the table but not which ones, which I think is what the question is about.
Kevin
oh yeah, obvious now.
aaronasterling
+6  A: 

You could do this very simply using the set and string functionality in Python and see how it performs (premature optimisation being the root of all evil!):

common_words = frozenset(("if", "but", "and", "the", "when", "use", "to", "for"))
title = "When to use Python for web applications"
title_words = set(title.lower().split())         # normalise case, split on whitespace
keywords = title_words.difference(common_words)  # set difference drops the common words
print(keywords)  # e.g. {'python', 'web', 'applications'} (set order varies)
Gareth Williams
+2  A: 

While I hate to discourage the use of something cool like a trie, have you thought about doing it in straight Python? I wrote a simple benchmark using the Corncob wordlist and performance wasn't that bad.

import time

with open('corncob_lowercase.txt') as f:
    starttime = time.time()
    words = f.read().split('\n')
    endtime = time.time()
    filetime = endtime - starttime
    print "file opened in {0} seconds".format(filetime)
    # worst case: prefix every word so that none of the probes are in the list
    nonwords = ['234' + word for word in words]
    totaltime = 0
    for word in nonwords:
        starttime = time.time()
        word in words               # linear-scan membership test; result discarded
        endtime = time.time()
        totaltime += endtime - starttime
    wordcount = len(words)
    avgtime = totaltime / wordcount
    print "average time for word: {0}".format(avgtime)
    print "with {0} words".format(wordcount)
    runningtimes = (filetime + i * avgtime for i in xrange(10))
    print "running times: {0}".format(list(runningtimes))

Note that I'm testing the worst case, where the word isn't in the file. I'm also including the time to load and process the file; if you were to memcache it, that would disappear. One further thing to note is that my machine is basically crap. C is fast, but most of the code involved in searching a list is written in C anyway. Finally, this test covers pretty much every word in the English language. If you just want 10,000, I think that's cake.

file opened in 0.0135519504547 seconds
average time for word: 0.00249605141253
with 58113 words
running times: [0.013551950454711914, 0.016048001867237236, 0.018544053279762558,
0.021040104692287877, 0.023536156104813199, 0.026032207517338521, 0.028528258929863839,
0.031024310342389162, 0.033520361754914484, 0.036016413167439809]
aaronasterling
+1  A: 

If some false positives are OK (a Bloom filter never gives false negatives), search for Bloom filter on Wikipedia.

If not, look at CDB (yum install tinycdb on Fedora; no Python API at the moment).
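
A toy sketch of the Bloom filter idea (the hashing scheme and sizes here are illustrative; a real filter would size the bit array and hash count from the word count and target false-positive rate):

import hashlib

class BloomFilter(object):
    # Toy Bloom filter: k hash positions per word over an m-bit array.
    # Membership tests can give false positives, never false negatives.
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, word):
        # derive k positions by salting md5 with the hash index
        for i in range(self.k):
            digest = hashlib.md5(("%d:%s" % (i, word)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

bf = BloomFilter()
for w in ("the", "and", "when"):
    bf.add(w)
print("the" in bf)     # True
print("python" in bf)  # almost certainly False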

James Antill