I have a large set of words (about 10,000), and I need to find out whether any of those words appear in a given block of text.

Is there a faster algorithm than doing a simple text search for each of the words in the block of text?

+12  A: 

Insert the 10,000 words into a hash table, then check each word in the block of text to see whether the table has an entry for it.

Whether it's faster I don't know; it's just another method (that would depend on how many words you are searching for).

A simple Perl example:

use strict;
use warnings;

my $word_block = "the guy went afk after being popped by a brownrabbit";

# Load the word list (here from __DATA__) into a hash for O(1) lookups.
my %hash;
while (<DATA>) { chomp; $hash{$_} = 1; }

# Split the text block on whitespace and test each word against the hash.
foreach my $word (split /\s+/, $word_block) {
    print "found word: $word\n" if exists $hash{$word};
}

__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
I was going to recommend KMP, but this is EXACTLY the solution that is required. +1 and should get the answer check.
samoz
Yeah this is about as good as you are going to get... O(N) time (assuming a good hash function of course)
Polaris878
@Polaris: Also assuming that the hash of words fits into memory (which, if it's 10k words, should be no problem. Just being pedantic)
llimllib
If the "given block of text" is small, and the 10,000 words are sorted, then it might (might) be faster not to bother with the hashtable, just binary chop (even in a newline-separated list). O(M log N) instead of O(M+N). But for general purposes, it has to be either this hashtable, or something fancy like a trie.
Steve Jessop
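
To illustrate the binary-chop alternative from the comment above, a minimal Perl sketch (the helper name and the short word list are made up for the example):

use strict;
use warnings;

# Binary search over a sorted word list: O(log N) string comparisons per lookup.
sub in_sorted_list {
    my ($word, $list) = @_;
    my ($lo, $hi) = (0, $#$list);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $word cmp $list->[$mid];
        return 1 if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return 0;
}

my @sorted = sort qw(afk brownrabbit lol popped);
print "found\n" if in_sorted_list("afk", \@sorted);
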
Why a hash and not just a linked list, for example?
Dervin Thunk
@onebyone I actually did the math, and O(M log N) is faster for roughly M < 1000, though this doesn't take any constant factors into account.
samoz
@Dervin: Because looking things up in a linked list is incredibly slow.
Steve Jessop
@Dervin, linked lists still have to be searched through. With a hash you just hash the key and voila you have access to the data with that key.
You are right, but I wasn't precise (I should probably open another question): my 10,000 "words" should have been "strings", because they are not always single words but two-, three-, or even four-word strings...
Enrico Detoma
+4  A: 

The answer heavily depends on the actual requirements.

  1. How large is the word list?
  2. How large is the text block?
  3. How many text blocks must be processed?
  4. How often must each text block be processed?
  5. Do the text blocks or the word list change? If so, how frequently?

Assuming relatively small text blocks compared to the word list, and that each text block is processed only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out whether the word list contains it.

If you have to process the text blocks multiple times, I suggest inverting the text blocks. Inverting a text block means creating, for each word, a list of all the text blocks that contain that word.
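
A minimal sketch of such an inverted index (the block IDs and contents are made up for the example):

use strict;
use warnings;

# Hypothetical text blocks, keyed by an ID.
my %blocks = (
    b1 => "the guy went afk",
    b2 => "popped by a brownrabbit",
);

# Invert: map each word to the list of IDs of the blocks containing it.
my %index;
for my $id (keys %blocks) {
    my %seen;
    for my $word (split /\s+/, $blocks{$id}) {
        push @{ $index{$word} }, $id unless $seen{$word}++;
    }
}

# Each word-list lookup is now a single hash access.
my $ids = $index{popped} // [];
print "blocks containing 'popped': @$ids\n";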

In still other situations it might be helpful to generate a bit vector for each text block, with one bit per word indicating whether the word is contained in the text block.
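
A possible sketch of the bit-vector idea using Perl's built-in vec() (the word list and text block are made up for the example):

use strict;
use warnings;

my @word_list = qw(afk lol brownrabbit popped);   # stand-in for the 10,000 words
my %word_pos;                                     # word -> bit position
$word_pos{ $word_list[$_] } = $_ for 0 .. $#word_list;

# One bit per word-list entry, set if the block contains that word.
sub block_to_bitvec {
    my ($block) = @_;
    my $bits = '';
    for my $word (split /\s+/, $block) {
        vec($bits, $word_pos{$word}, 1) = 1 if exists $word_pos{$word};
    }
    return $bits;
}

my $v = block_to_bitvec("the guy went afk after being popped");
print "block contains 'popped'\n" if vec($v, $word_pos{popped}, 1);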

Daniel Brückner
+7  A: 

Try out the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
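
For illustration, a minimal Perl sketch of Aho-Corasick (a toy version assuming exact, case-sensitive matching; the function names are made up for the example):

use strict;
use warnings;

# Build the automaton: a trie of the patterns plus failure links set in BFS order.
sub build_automaton {
    my @patterns = @_;
    my $root = { children => {}, out => [] };
    $root->{fail} = $root;
    for my $p (@patterns) {
        my $node = $root;
        for my $ch (split //, $p) {
            $node->{children}{$ch} //= { children => {}, out => [] };
            $node = $node->{children}{$ch};
        }
        push @{ $node->{out} }, $p;   # this pattern ends at this node
    }
    my @queue;
    for my $child (values %{ $root->{children} }) {
        $child->{fail} = $root;
        push @queue, $child;
    }
    while (my $node = shift @queue) {
        for my $ch (keys %{ $node->{children} }) {
            my $child = $node->{children}{$ch};
            my $f = $node->{fail};
            $f = $f->{fail} while $f != $root && !exists $f->{children}{$ch};
            $child->{fail} = exists $f->{children}{$ch} ? $f->{children}{$ch} : $root;
            # Inherit the matches that end at the failure target.
            push @{ $child->{out} }, @{ $child->{fail}{out} };
            push @queue, $child;
        }
    }
    return $root;
}

# Scan the text once; follow failure links on mismatch.
sub find_matches {
    my ($root, $text) = @_;
    my %found;
    my $node = $root;
    for my $ch (split //, $text) {
        $node = $node->{fail} while $node != $root && !exists $node->{children}{$ch};
        $node = $node->{children}{$ch} if exists $node->{children}{$ch};
        $found{$_} = 1 for @{ $node->{out} };
    }
    return sort keys %found;
}

my $root = build_automaton("afk", "brownrabbit", "popped by a brownrabbit");
print "$_\n" for find_matches($root, "the guy went afk after being popped by a brownrabbit");

Note how "brownrabbit" is still reported even though it also occurs inside the longer pattern; the failure links handle overlapping matches, which is why this fits multi-word strings well.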

Cuga
+1 because that should actually solve my real problem (I said "words" but I should have said "strings", because they can also be two-, three-, or four-word strings).
Enrico Detoma
Would Aho-Corasick be efficient enough for such a large set of strings? I found a Java implementation at http://hkn.eecs.berkeley.edu/~dyoo/java/index.html which I could probably use.
Enrico Detoma
I believe it is, based on all the articles I was able to find related to it online.
Cuga
If this solves the "real" problem, then accept Cuga's answer. Aho-Corasick is a classy beast. It's especially useful in your case because of the spaces in the strings in the search dictionary (the set of strings to search for). For example, with User105033's method (hashing), you can't just check each word; rather, you have to check each word /and/ each consecutive 2, 3, 4, ... words, etc. In contrast, with the state machine approach, this is done implicitly.
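
To make the point about multi-word entries concrete, here is a sketch of the consecutive-words check that plain hashing would require (the phrase list and text are made up for the example):

use strict;
use warnings;

# With a hash of phrases of up to 4 words, every run of 1..4
# consecutive words in the text must be tested.
my %phrases = map { $_ => 1 } ("afk", "popped by a brownrabbit");
my @tokens  = split /\s+/, "the guy went afk after being popped by a brownrabbit";

for my $i (0 .. $#tokens) {
    for my $len (1 .. 4) {
        last if $i + $len > @tokens;
        my $candidate = join ' ', @tokens[ $i .. $i + $len - 1 ];
        print "found: $candidate\n" if exists $phrases{$candidate};
    }
}
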
+2  A: 

Build up a trie of your words, and then use that to find which words are in the text.
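
A minimal sketch of a trie built from nested hashes (the word list and text are made up for the example):

use strict;
use warnings;

# Build the trie; a defined "end" key marks the end of a full word.
my %trie;
for my $word (qw(afk lol brownrabbit popped)) {
    my $node = \%trie;
    for my $ch (split //, $word) {
        $node = $node->{$ch} //= {};
    }
    $node->{end} = 1;
}

# Walk each word of the text through the trie.
for my $word (split /\s+/, "the guy went afk after being popped") {
    my $node = \%trie;
    for my $ch (split //, $word) {
        $node = $node->{$ch} or last;
    }
    print "found word: $word\n" if $node && $node->{end};
}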

FryGuy
+1  A: 
PierrOz
The (usual) state machine approach is called Aho-Corasick, as per Cuga's answer.
A: 

The Boyer-Moore string-search algorithm should work. Depending on the size/number of words in the block of text, you might want to use it the other way around and search the word list instead (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.
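
A sketch of the Horspool simplification of Boyer-Moore (bad-character shifts only; the function name is made up for the example):

use strict;
use warnings;

# Boyer-Moore-Horspool: slide the pattern along the text, using the text
# character under the window's last position to pick the shift distance.
sub bmh_index {
    my ($text, $pat) = @_;
    my ($n, $m) = (length $text, length $pat);
    return -1 if $m == 0 || $m > $n;
    # Rightmost occurrence of each char (except the last) decides the shift.
    my %shift;
    $shift{ substr $pat, $_, 1 } = $m - 1 - $_ for 0 .. $m - 2;
    my $i = 0;
    while ($i <= $n - $m) {
        return $i if substr($text, $i, $m) eq $pat;
        my $last = substr $text, $i + $m - 1, 1;
        $i += $shift{$last} // $m;   # char not in pattern: skip a full window
    }
    return -1;
}

my $pos = bmh_index("the guy went afk", "afk");
print "matched at offset $pos\n" if $pos >= 0;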

meade