views: 73
answers: 4

Hi there

Suppose I have a database containing 500,000 records, each representing, say, an animal. What would be the best approach for parsing 140-character tweets to identify matching records by animal name? For instance, in this string...

"I went down to the woods to day and couldn't believe my eyes: I saw a giant polar bear having a picnic with a red squirrel."

... I would like to flag up the phrases "giant polar bear" and "red squirrel", as they appear in my database.

This strikes me as a problem that has probably been solved many times, but from where I'm sitting it looks prohibitively intensive: iterating over every DB record and checking for a match in the string is surely a crazy way to do it.

Can anyone with a comp sci degree put me out of my misery? I'm working in C# if that makes any difference. Cheers!

+2  A: 

Assuming the database is fairly static, use a Bloom filter. It is a degenerate form of hash table that only stores bits indicating the presence of a value, without storing the value itself. It is probabilistic, since hashes may collide, so each hit would require a full lookup to confirm. But a 1 MB Bloom filter holding 500,000 entries can have a false-positive rate as low as 0.03%.

Some math: getting a rate this low requires up to 23 hash codes, each with 23 bits of entropy, for a total of 529 bits. Bob Jenkins's 64-bit hash function generates 192 bits of entropy in a single pass (if you use the unreported variables in hash(), which Bob cites as probably being OK as a "mediocre" hash), thus requiring at most three passes. Because of the way Bloom filters work, you don't need all the entropy on every query, since most lookups will report a miss well before reaching the 23rd hash code.
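
A rough sketch of what this could look like in C# (the class and method names are just illustrative, and the double-hashing below stands in for a stronger hash such as Bob Jenkins's):

using System;
using System.Collections;
using System.Collections.Generic;

// Minimal Bloom filter sketch. The (h1 + k*h2) double-hashing trick is a
// stand-in for a proper hash function; all names here are illustrative.
class BloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomFilter(int sizeInBits, int hashCount)
    {
        bits = new BitArray(sizeInBits);
        this.hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int i in Indexes(item))
            bits[i] = true;
    }

    // false = definitely not in the set; true = probably in the set,
    // so confirm the hit with a real database lookup.
    public bool MightContain(string item)
    {
        foreach (int i in Indexes(item))
            if (!bits[i]) return false;
        return true;
    }

    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = item.ToUpperInvariant().GetHashCode(); // crude second hash, good enough for a sketch
        for (int k = 0; k < hashCount; k++)
            yield return Math.Abs((h1 + k * h2) % bits.Length);
    }
}

You would load the 500,000 names into a filter like this once at start-up (e.g. filter.Add("red squirrel")); per tweet you then only hit the database for the handful of candidates that MightContain lets through.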

EDIT: You will obviously have to parse words from the text. Finding every instance of /\b\w+\b/ will probably do for a first version.

To match phrases, you will have to test every n-word subsequence (a.k.a. n-gram), where n runs from 2 up to the word count of the longest phrase in your dictionary. You can make this much cheaper by adding every word that appears in any phrase to a separate Bloom filter, and only testing n-grams whose every word passes this second filter.
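
Here is a sketch of the tweet-side loop, assuming filters like the one above (wordMightMatch for the single-word filter, phraseMightMatch for the full-phrase one; maxPhraseLength is the word count of your longest dictionary entry — all of these names are placeholders):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Tokenise the tweet, then test every n-gram against the filters described above.
// The two delegates are placeholders for whatever lookup structures you build.
static IEnumerable<string> FindCandidatePhrases(
    string tweet, int maxPhraseLength,
    Func<string, bool> wordMightMatch,    // single-word Bloom filter
    Func<string, bool> phraseMightMatch)  // full-phrase Bloom filter
{
    string[] words = Regex.Matches(tweet, @"\b\w+\b")
                          .Cast<Match>()
                          .Select(m => m.Value.ToLowerInvariant())
                          .ToArray();

    for (int start = 0; start < words.Length; start++)
    {
        for (int n = 1; n <= maxPhraseLength && start + n <= words.Length; n++)
        {
            // Every word in the n-gram must pass the word filter before we
            // bother testing the longer phrase itself.
            if (!wordMightMatch(words[start + n - 1])) break;

            string candidate = string.Join(" ", words, start, n);
            if (phraseMightMatch(candidate))
                yield return candidate; // hits still need a real DB lookup to confirm
        }
    }
}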

Marcelo Cantos
Awesome answer, this seems promising. I'll see what else I can find out about bloom filters now I have a search term to use! Thanks.
centralscru
In this scenario, what string would I be testing against the bloom filter? Wouldn't I have to create a list of candidate words from the tweet in order to have something to test against? I've edited my original question to show that it could be phrases in the database, not single words.
centralscru
A: 

Why reinvent the wheel? Use a free-text indexing tool to handle the heavy lifting. Lucene.Net comes to mind.

Tom Cabanski
Lucene is designed for _ad hoc_ queries against a large document corpus. This is a simple dictionary lookup of words in a single small document (a tweet).
Marcelo Cantos
Also the tweets will be coming in regularly and will need parsing pretty fast, so they wouldn't exist in any pre-built index.
centralscru
A: 

What's wrong with Regex? =) That will do for small text searches.

string input = @"I went down to the woods to day and couldn't believe my eyes: I saw a bear having a picnic with a squirrel. I am a human though!";
Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b");
foreach (Match s in animalFilter.Matches(input))
{
    textBox1.Text += s.Value + Environment.NewLine;
}

It gives output:

bear
squirrel
human

Some more:

string input = @"I went down to the woods to day and couldn't believe my eyes: I saw a bear having a picnic with a squirrel. I am a human though!";
Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b");

Dictionary<string, int> animals = new Dictionary<string, int>();

foreach (Match s in animalFilter.Matches(input))
{
    int ctr = 1;
    if (animals.ContainsKey(s.Value))
    {
        ctr = animals[s.Value] + 1;
    }
    animals[s.Value] = ctr;
}
foreach (KeyValuePair<string,int> k in animals)
{
    textBox1.Text += k.Key + " occurred " + k.Value + " times" + Environment.NewLine;
}

Results:

bear occurred 1 times
squirrel occurred 1 times
human occurred 1 times

Nayan
Thanks. This is the kind of thing I initially had in mind, but as I have 500k terms to search for I have a feeling it's not going to scale very well. Do you think otherwise?
centralscru
Elaborating on this, I found this interesting link: http://www.kavoir.com/2009/07/instantly-boost-sql-query-efficiency-of-regexp-or-rlike-by-2000.html
Pin
Theoretically, the regex compiler could create a state machine which is in effect a trie of all the alternatives. It won't be as good at discriminating, but it would allow 'bark' to stop at the second character. I don't know what optimisations .NET's regex engine makes.
Pete Kirkham
Nice link, Pin!
Nayan
@user136416: I think Regex can handle it well. Try the pre-compiled option for faster results.
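
For reference, pre-compiling just means passing RegexOptions.Compiled when the Regex is constructed, e.g.:

// Compiles the pattern to IL once up front; pays off when the same Regex is reused across many tweets.
Regex animalFilter = new Regex(@"\b(bear|squirrel|tiger|human)\b", RegexOptions.Compiled);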
Nayan
+2  A: 

Have you tried building a trie for your dictionary? If you split the tweet into words and match each position against the trie, you get roughly linear complexity in the length of the tweet.
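
A rough sketch of how that could look in C# (one trie node per word, so multi-word phrases like "giant polar bear" become paths through the trie; all names here are just illustrative):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One node per word; a node is terminal if the path from the root to it
// spells out a complete dictionary entry.
class WordTrieNode
{
    public Dictionary<string, WordTrieNode> Children = new Dictionary<string, WordTrieNode>();
    public bool IsTerminal;
}

class AnimalMatcher
{
    private readonly WordTrieNode root = new WordTrieNode();

    public void Add(string phrase)
    {
        WordTrieNode node = root;
        foreach (string word in phrase.ToLowerInvariant().Split(' '))
        {
            WordTrieNode child;
            if (!node.Children.TryGetValue(word, out child))
            {
                child = new WordTrieNode();
                node.Children[word] = child;
            }
            node = child;
        }
        node.IsTerminal = true;
    }

    // From each word of the tweet, walk the trie as far as it goes and report
    // every complete entry passed along the way.
    public IEnumerable<string> Matches(string tweet)
    {
        List<string> words = new List<string>();
        foreach (Match m in Regex.Matches(tweet, @"\b\w+\b"))
            words.Add(m.Value.ToLowerInvariant());

        for (int start = 0; start < words.Count; start++)
        {
            WordTrieNode node = root;
            for (int i = start; i < words.Count; i++)
            {
                WordTrieNode next;
                if (!node.Children.TryGetValue(words[i], out next))
                    break;
                node = next;
                if (node.IsTerminal)
                    yield return string.Join(" ", words.GetRange(start, i - start + 1).ToArray());
            }
        }
    }
}

Each tweet then costs roughly (number of words) x (word count of the longest phrase) dictionary lookups, independent of the 500,000 rows in the database.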

monn