views: 73
answers: 4

Hi there

Suppose I have a database containing 500,000 records, each representing, say, an animal. What would be the best approach for parsing 140-character tweets to identify matching records by animal name? For instance, in this string...

"I went down to the woods to day and couldn't believe my eyes: I saw a giant polar bear having a picnic with a red squirrel."

... I would like to flag up the phrases "giant polar bear" and "red squirrel", as they appear in my database.

This strikes me as a problem that has probably been solved many times, but from where I'm sitting it looks prohibitively intensive: iterating over every DB record and checking for a match in the string is surely a crazy way to do it.

Can anyone with a comp sci degree put me out of my misery? I'm working in C# if that makes any difference. Cheers!

+2  A: 

Assuming the database is fairly static, use a Bloom filter. It is a degenerate form of hash table that only stores bits indicating the presence of a value, without storing the value itself. It is probabilistic, since hashes may collide, so each hit would require a full lookup to confirm. But a 1 MB Bloom filter holding 500,000 entries can have a false-positive rate as low as 0.03%.

Some math: getting a rate this low requires up to 23 hash codes, each with 23 bits of entropy, for a total of 529 bits. Bob Jenkins's 64-bit hash function generates 192 bits of entropy in a single pass (if you use the unreported variables in hash(), which Bob cites as probably being OK as a "mediocre" hash), thus requiring at most three passes. Because of the way Bloom filters work, you don't need all the entropy on every query, since most lookups will report a miss well before reaching the 23rd hash code.
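
A rough sketch of what this could look like in C# (the class and method names are just illustrative, and the double-hashing below stands in for a stronger hash such as Bob Jenkins's):

using System;
using System.Collections;
using System.Collections.Generic;

// Minimal Bloom filter sketch. The (h1 + k*h2) double-hashing trick is a
// stand-in for a proper hash function; all names here are illustrative.
class BloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomFilter(int sizeInBits, int hashCount)
    {
        bits = new BitArray(sizeInBits);
        this.hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int i in Indexes(item))
            bits[i] = true;
    }

    // false = definitely not in the set; true = probably in the set,
    // so confirm the hit with a real database lookup.
    public bool MightContain(string item)
    {
        foreach (int i in Indexes(item))
            if (!bits[i]) return false;
        return true;
    }

    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = item.ToUpperInvariant().GetHashCode(); // crude second hash, good enough for a sketch
        for (int k = 0; k < hashCount; k++)
            yield return Math.Abs((h1 + k * h2) % bits.Length);
    }
}

You would load the 500,000 names into a filter like this once at start-up (e.g. filter.Add("red squirrel")); per tweet you then only hit the database for the handful of candidates that MightContain lets through.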

EDIT: You will obviously have to parse words from the text. Finding every instance of /\b\w+\b/ will probably do for a first version.

To match phrases, you will have to test every n-word subsequence (a.k.a. n-gram), where n runs from 2 up to the word count of the longest phrase in your dictionary. You can make this much cheaper by adding every word that appears in any phrase to a separate Bloom filter, and only testing n-grams whose every word passes this second filter.
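
Here is a sketch of the tweet-side loop, assuming filters like the one above (wordMightMatch for the single-word filter, phraseMightMatch for the full-phrase one; maxPhraseLength is the word count of your longest dictionary entry — all of these names are placeholders):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Tokenise the tweet, then test every n-gram against the filters described above.
// The two delegates are placeholders for whatever lookup structures you build.
static IEnumerable<string> FindCandidatePhrases(
    string tweet, int maxPhraseLength,
    Func<string, bool> wordMightMatch,    // single-word Bloom filter
    Func<string, bool> phraseMightMatch)  // full-phrase Bloom filter
{
    string[] words = Regex.Matches(tweet, @"\b\w+\b")
                          .Cast<Match>()
                          .Select(m => m.Value.ToLowerInvariant())
                          .ToArray();

    for (int start = 0; start < words.Length; start++)
    {
        for (int n = 1; n <= maxPhraseLength && start + n <= words.Length; n++)
        {
            // Every word in the n-gram must pass the word filter before we
            // bother testing the longer phrase itself.
            if (!wordMightMatch(words[start + n - 1])) break;

            string candidate = string.Join(" ", words, start, n);
            if (phraseMightMatch(candidate))
                yield return candidate; // hits still need a real DB lookup to confirm
        }
    }
}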

Marcelo Cantos
Awesome answer, this seems promising. I'll see what else I can find out about bloom filters now I have a search term to use! Thanks.
centralscru
In this scenario, what string would I be testing against the bloom filter? Wouldn't I have to create a list of candidate words from the tweet in order to have something to test against? I've edited my original question to show that it could be phrases in the database, not single words.
centralscru
A: 

Why reinvent the wheel? Use a free-text indexing tool to handle the heavy lifting. Lucene.Net comes to mind.

Tom Cabanski
Lucene is designed for _ad hoc_ queries against a large document corpus. This is a simple dictionary lookup of words in a single small document (a tweet).
Marcelo Cantos
Also the tweets will be coming in regularly and will need parsing pretty fast, so they wouldn't exist in any pre-built index.
centralscru
A: 

What's wrong with Regex? =) That will do for small text searches.

string input = @"I went down to the woods to day and couldn't believe my eyes: I saw a bear having a picnic with a squirrel. I am a human though!";
Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b");
foreach (Match s in animalFilter.Matches(input))
{
    textBox1.Text += s.Value + Environment.NewLine;
}

It gives output:

bear
squirrel
human

Some more:

string input = @"I went down to the woods to day and couldn't believe my eyes: I saw a bear having a picnic with a squirrel. I am a human though!";
Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b");

Dictionary<string, int> animals = new Dictionary<string, int>();

foreach (Match s in animalFilter.Matches(input))
{
    int ctr = 1;
    if (animals.ContainsKey(s.Value))
    {
        ctr = animals[s.Value] + 1;
    }
    animals[s.Value] = ctr;
}
foreach (KeyValuePair<string,int> k in animals)
{
    textBox1.Text += k.Key + " occurred " + k.Value + " times" + Environment.NewLine;
}

Results:

bear occurred 1 times
squirrel occurred 1 times
human occurred 1 times

Nayan
Thanks. This is the kind of thing I initially had in mind, but as I have 500k terms to search for I have a feeling it's not going to scale very well. Do you think otherwise?
centralscru
Elaborating on this, I found this interesting link: http://www.kavoir.com/2009/07/instantly-boost-sql-query-efficiency-of-regexp-or-rlike-by-2000.html
Pin
Theoretically, the regex compiler could create a state machine which is in effect a trie of all the alternatives. It won't be as good at discriminating, but it would allow 'bark' to stop at the second character. I don't know what optimisations .NET's regex engine makes.
Pete Kirkham
Nice link, Pin!
Nayan
@user136416: I think Regex can handle it well. Try the pre-compiled option for faster results.
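
For reference, pre-compiling just means passing RegexOptions.Compiled when the Regex is constructed, e.g.:

// Compiles the pattern to IL once up front; pays off when the same Regex is reused across many tweets.
Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b", RegexOptions.Compiled);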
Nayan
+2  A: 

Have you tried building a trie for your dictionary? If you split the tweet into words and match each position against the trie, you get roughly linear complexity in the length of the tweet.
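
A rough sketch of how that could look in C# (one trie node per word, so multi-word phrases like "giant polar bear" become paths through the trie; all names here are just illustrative):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One node per word; a node is terminal if the path from the root to it
// spells out a complete dictionary entry.
class WordTrieNode
{
    public Dictionary<string, WordTrieNode> Children = new Dictionary<string, WordTrieNode>();
    public bool IsTerminal;
}

class AnimalMatcher
{
    private readonly WordTrieNode root = new WordTrieNode();

    public void Add(string phrase)
    {
        WordTrieNode node = root;
        foreach (string word in phrase.ToLowerInvariant().Split(' '))
        {
            WordTrieNode child;
            if (!node.Children.TryGetValue(word, out child))
            {
                child = new WordTrieNode();
                node.Children[word] = child;
            }
            node = child;
        }
        node.IsTerminal = true;
    }

    // From each word of the tweet, walk the trie as far as it goes and report
    // every complete entry passed along the way.
    public IEnumerable<string> Matches(string tweet)
    {
        List<string> words = new List<string>();
        foreach (Match m in Regex.Matches(tweet, @"\b\w+\b"))
            words.Add(m.Value.ToLowerInvariant());

        for (int start = 0; start < words.Count; start++)
        {
            WordTrieNode node = root;
            for (int i = start; i < words.Count; i++)
            {
                WordTrieNode next;
                if (!node.Children.TryGetValue(words[i], out next))
                    break;
                node = next;
                if (node.IsTerminal)
                    yield return string.Join(" ", words.GetRange(start, i - start + 1).ToArray());
            }
        }
    }
}

Each tweet then costs roughly (number of words) x (word count of the longest phrase) dictionary lookups, independent of the 500,000 rows in the database.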

monn