I have a large set of words (about 10,000), and I need to find out whether any of those words appear in a given block of text.

Is there a faster algorithm than doing a simple text search for each of the words in the block of text?

+12  A: 

Insert the 10,000 words into a hash table, then check each word in the block of text to see whether the table has an entry for it.

Whether it's faster I don't know; it's just another method (that would depend on how many words you are searching for).

A simple Perl example:

use strict;
use warnings;

my $word_block = "the guy went afk after being popped by a brownrabbit";

# Load the word list (here from __DATA__) into a hash for O(1) lookups.
my %hash;
while (<DATA>) { chomp; $hash{$_} = 1; }

# Split the text block on whitespace and test each word against the hash.
foreach my $word (split /\s+/, $word_block) {
    print "found word: $word\n" if exists $hash{$word};
}

__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
I was going to recommend KMP, but this is EXACTLY the solution that is required. +1 and should get the answer check.
samoz
Yeah this is about as good as you are going to get... O(N) time (assuming a good hash function of course)
Polaris878
@Polaris: Also assuming that the hash of words fits into memory (which, if it's 10k words, should be no problem. Just being pedantic)
llimllib
If the "given block of text" is small, and the 10,000 words are sorted, then it might (might) be faster not to bother with the hashtable, just binary chop (even in a newline-separated list). O(M log N) instead of O(M+N). But for general purposes, it has to be either this hashtable, or something fancy like a trie.
Steve Jessop
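
To illustrate the binary-chop alternative from the comment above, a minimal Perl sketch (the helper name and the short word list are made up for the example):

use strict;
use warnings;

# Binary search over a sorted word list: O(log N) string comparisons per lookup.
sub in_sorted_list {
    my ($word, $list) = @_;
    my ($lo, $hi) = (0, $#$list);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $word cmp $list->[$mid];
        return 1 if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return 0;
}

my @sorted = sort qw(afk brownrabbit lol popped);
print "found\n" if in_sorted_list("afk", \@sorted);
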
Why a hash and not just a linked list, for example?
Dervin Thunk
@onebyone I actually did the math, and O(M log N) is faster for roughly M < 1000, though this doesn't take any constant factors into account.
samoz
@Dervin: Because looking things up in a linked list is incredibly slow.
Steve Jessop
@Dervin, linked lists still have to be searched through. With a hash you just hash the key and voila you have access to the data with that key.
You are right, but I wasn't precise (I should probably open another question): my 10,000 "words" should have been "strings", because they are not always single words but two-, three-, or even four-word strings...
Enrico Detoma
+4  A: 

The answer heavily depends on the actual requirements.

  1. How large is the word list?
  2. How large is the text block?
  3. How many text blocks must be processed?
  4. How often must each text block be processed?
  5. Do the text blocks or the word list change? If so, how frequently?

Assuming relatively small text blocks compared to the word list, and that each text block is processed only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out whether the word list contains it.

If you have to process the text blocks multiple times, I suggest inverting the text blocks. Inverting a text block means creating, for each word, a list of all the text blocks that contain that word.
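
A minimal sketch of such an inverted index (the block IDs and contents are made up for the example):

use strict;
use warnings;

# Hypothetical text blocks, keyed by an ID.
my %blocks = (
    b1 => "the guy went afk",
    b2 => "popped by a brownrabbit",
);

# Invert: map each word to the list of IDs of the blocks containing it.
my %index;
for my $id (keys %blocks) {
    my %seen;
    for my $word (split /\s+/, $blocks{$id}) {
        push @{ $index{$word} }, $id unless $seen{$word}++;
    }
}

# Each word-list lookup is now a single hash access.
my $ids = $index{popped} // [];
print "blocks containing 'popped': @$ids\n";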

In still other situations it might be helpful to generate a bit vector for each text block, with one bit per word indicating whether the word is contained in the text block.
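
A possible sketch of the bit-vector idea using Perl's built-in vec() (the word list and text block are made up for the example):

use strict;
use warnings;

my @word_list = qw(afk lol brownrabbit popped);   # stand-in for the 10,000 words
my %word_pos;                                     # word -> bit position
$word_pos{ $word_list[$_] } = $_ for 0 .. $#word_list;

# One bit per word-list entry, set if the block contains that word.
sub block_to_bitvec {
    my ($block) = @_;
    my $bits = '';
    for my $word (split /\s+/, $block) {
        vec($bits, $word_pos{$word}, 1) = 1 if exists $word_pos{$word};
    }
    return $bits;
}

my $v = block_to_bitvec("the guy went afk after being popped");
print "block contains 'popped'\n" if vec($v, $word_pos{popped}, 1);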

Daniel Brückner
+7  A: 

Try out the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
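
For illustration, a minimal Perl sketch of Aho-Corasick (a toy version assuming exact, case-sensitive matching; the function names are made up for the example):

use strict;
use warnings;

# Build the automaton: a trie of the patterns plus failure links set in BFS order.
sub build_automaton {
    my @patterns = @_;
    my $root = { children => {}, out => [] };
    $root->{fail} = $root;
    for my $p (@patterns) {
        my $node = $root;
        for my $ch (split //, $p) {
            $node->{children}{$ch} //= { children => {}, out => [] };
            $node = $node->{children}{$ch};
        }
        push @{ $node->{out} }, $p;   # this pattern ends at this node
    }
    my @queue;
    for my $child (values %{ $root->{children} }) {
        $child->{fail} = $root;
        push @queue, $child;
    }
    while (my $node = shift @queue) {
        for my $ch (keys %{ $node->{children} }) {
            my $child = $node->{children}{$ch};
            my $f = $node->{fail};
            $f = $f->{fail} while $f != $root && !exists $f->{children}{$ch};
            $child->{fail} = exists $f->{children}{$ch} ? $f->{children}{$ch} : $root;
            # Inherit the matches that end at the failure target.
            push @{ $child->{out} }, @{ $child->{fail}{out} };
            push @queue, $child;
        }
    }
    return $root;
}

# Scan the text once; follow failure links on mismatch.
sub find_matches {
    my ($root, $text) = @_;
    my %found;
    my $node = $root;
    for my $ch (split //, $text) {
        $node = $node->{fail} while $node != $root && !exists $node->{children}{$ch};
        $node = $node->{children}{$ch} if exists $node->{children}{$ch};
        $found{$_} = 1 for @{ $node->{out} };
    }
    return sort keys %found;
}

my $root = build_automaton("afk", "brownrabbit", "popped by a brownrabbit");
print "$_\n" for find_matches($root, "the guy went afk after being popped by a brownrabbit");

Note how "brownrabbit" is still reported even though it also occurs inside the longer pattern; the failure links handle overlapping matches, which is why this fits multi-word strings well.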

Cuga
+1 because that should actually solve my real problem (I said "words" but I should have said "strings", because they can also be two-, three-, or four-word strings).
Enrico Detoma
Would Aho-Corasick be efficient enough for such a large set of strings? I found a Java implementation at http://hkn.eecs.berkeley.edu/~dyoo/java/index.html which I could probably use.
Enrico Detoma
I believe it is, based on all the articles I was able to find related to it online.
Cuga
If this solves the "real" problem, then accept Cuga's answer. Aho-Corasick is a classy beast. It's especially useful in your case because of the spaces in the strings in the search dictionary (the set of strings to search for). For example, with User105033's method (hashing), you can't just check each word; rather, you have to check each word /and/ each consecutive 2, 3, 4, ... words, etc. In contrast, with the state machine approach, this is done implicitly.
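
To make the point about multi-word entries concrete, here is a sketch of the consecutive-words check that plain hashing would require (the phrase list and text are made up for the example):

use strict;
use warnings;

# With a hash of phrases of up to 4 words, every run of 1..4
# consecutive words in the text must be tested.
my %phrases = map { $_ => 1 } ("afk", "popped by a brownrabbit");
my @tokens  = split /\s+/, "the guy went afk after being popped by a brownrabbit";

for my $i (0 .. $#tokens) {
    for my $len (1 .. 4) {
        last if $i + $len > @tokens;
        my $candidate = join ' ', @tokens[ $i .. $i + $len - 1 ];
        print "found: $candidate\n" if exists $phrases{$candidate};
    }
}
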
+2  A: 

Build up a trie of your words, and then use that to find which words are in the text.
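
A minimal sketch of a trie built from nested hashes (the word list and text are made up for the example):

use strict;
use warnings;

# Build the trie; a defined "end" key marks the end of a full word.
my %trie;
for my $word (qw(afk lol brownrabbit popped)) {
    my $node = \%trie;
    for my $ch (split //, $word) {
        $node = $node->{$ch} //= {};
    }
    $node->{end} = 1;
}

# Walk each word of the text through the trie.
for my $word (split /\s+/, "the guy went afk after being popped") {
    my $node = \%trie;
    for my $ch (split //, $word) {
        $node = $node->{$ch} or last;
    }
    print "found word: $word\n" if $node && $node->{end};
}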

FryGuy
+1  A: 
PierrOz
The (usual) state machine approach is called Aho-Corasick, as per Cuga's answer.
A: 

The Boyer-Moore string-search algorithm should work. Depending on the size/number of words in the block of text, you might want to use it the other way around and search the word list instead (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.
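
A sketch of the Horspool simplification of Boyer-Moore (bad-character shifts only; the function name is made up for the example):

use strict;
use warnings;

# Boyer-Moore-Horspool: slide the pattern along the text, using the text
# character under the window's last position to pick the shift distance.
sub bmh_index {
    my ($text, $pat) = @_;
    my ($n, $m) = (length $text, length $pat);
    return -1 if $m == 0 || $m > $n;
    # Rightmost occurrence of each char (except the last) decides the shift.
    my %shift;
    $shift{ substr $pat, $_, 1 } = $m - 1 - $_ for 0 .. $m - 2;
    my $i = 0;
    while ($i <= $n - $m) {
        return $i if substr($text, $i, $m) eq $pat;
        my $last = substr $text, $i + $m - 1, 1;
        $i += $shift{$last} // $m;   # char not in pattern: skip a full window
    }
    return -1;
}

my $pos = bmh_index("the guy went afk", "afk");
print "matched at offset $pos\n" if $pos >= 0;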

meade