I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. The terms to search for will be contained in a list that can have 1000+ entries, and each search term may be one or more words.

Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.

I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. The language is C++ and I can use Boost.

An example of search terms could be a list of Fortune 500 company names.

Ideas?

A: 

Are the search terms single words that you are looking for, or can they be full sentences too?

If it's only words, then I would suggest building a red-black tree from all the words, and then searching for each word in the tree.

If it could be sentences, then it could get a lot more complex...

gillyb
No need to build your own red-black tree. A `std::multimap` would be sufficient.
Brendan Long
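
(A minimal sketch of the word-lookup idea above, assuming the search terms are single words. It uses `std::set`, which is typically implemented as a red-black tree, rather than a hand-built tree or a `std::multimap`; the term list and text below are made up.)

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>

int main() {
    // Hypothetical search terms; a real list would hold 1000+ entries.
    // std::set is usually a red-black tree, so each lookup is O(log n)
    // in the number of terms.
    std::set<std::string> terms = {"Apple", "Boeing", "Chevron"};

    std::string text = "Boeing announced a new contract with Chevron";
    std::istringstream in(text);
    std::string word;
    int pos = 0;
    while (in >> word) {
        if (terms.count(word) > 0)
            std::cout << word << " found at word " << pos << '\n';
        ++pos;
    }
}
```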
+1  A: 

So you have lots of search terms and want to see if any of them are in the document?

Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use the result as a regular expression, provided the regex engine will take /ant|ape/ and properly short-circuit on the shared "a" rather than re-testing it for "ape" after "ant" fails. If not, you could "precompile" the regex and "squish" the alternatives down to their minimum overlap, i.e. /a(nt|pe)/ in the above case, and so on recursively for each letter.
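
(A rough sketch of the alternation idea, using `std::regex`; Boost.Regex would look essentially the same. The term list is made up, and longer terms are placed first so that, under ECMAScript's ordered alternation, "antelope" isn't shadowed by its prefix "ant".)

```cpp
#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    // Hypothetical search terms; real terms would also need regex-escaping.
    std::vector<std::string> terms = {"ant", "ape", "antelope"};

    // Longer terms first so "antelope" is tried before its prefix "ant".
    std::sort(terms.begin(), terms.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });

    // Join the terms into a single alternation, e.g. "antelope|ant|ape".
    std::string pattern;
    for (const std::string& t : terms) {
        if (!pattern.empty()) pattern += '|';
        pattern += t;
    }

    std::regex re(pattern);
    std::string text = "the ant saw an antelope and an ape";
    for (auto it = std::sregex_iterator(text.begin(), text.end(), re);
         it != std::sregex_iterator(); ++it)
        std::cout << it->str() << " at position " << it->position() << '\n';
}
```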

However, doing the above is pretty much equivalent to putting all your search strings in a 26-ary tree (26 for the letters; more if you also need digits). Push your strings onto the tree, using one level of depth per character.

You can do this with your search terms to get a hyper-fast "does this word match anything in my list of search terms?" check, which pays off when the number of search terms is large.
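
(A minimal sketch of such a tree, i.e. a trie, restricted to lowercase letters and with made-up terms; it walks the trie from every text position and reports each place where some term starts.)

```cpp
#include <array>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// One node per character level; restricted to the 26 lowercase letters.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children;
    bool is_term = false;  // a search term ends at this node
};

void insert(TrieNode& root, const std::string& term) {
    TrieNode* node = &root;
    for (char c : term) {
        auto& child = node->children[c - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_term = true;
}

// Walk the trie from every starting position in the text and report matches.
void find_matches(const TrieNode& root, const std::string& text) {
    for (std::size_t start = 0; start < text.size(); ++start) {
        const TrieNode* node = &root;
        for (std::size_t i = start; i < text.size(); ++i) {
            char c = text[i];
            if (c < 'a' || c > 'z') break;       // non-letter ends the walk
            node = node->children[c - 'a'].get();
            if (!node) break;                    // no term continues here
            if (node->is_term)
                std::cout << text.substr(start, i - start + 1)
                          << " at position " << start << '\n';
        }
    }
}

int main() {
    TrieNode root;
    std::vector<std::string> terms = {"ant", "ape", "antelope"};  // made up
    for (const std::string& term : terms) insert(root, term);
    find_matches(root, "the ant saw an antelope and an ape");
}
```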

You could theoretically do the reverse as well -- pack your document into the tree and then use the search terms on it -- if your document is static and the search terms change a lot.

Depends on how much optimization you need...

eruciform
+2  A: 

Assuming that the large body of text is static English text and that you need to match whole words, you can try the following (you should really clarify in your question what exactly counts as a 'match', what kind of text you are looking at, etc.).

First, preprocess the whole document into a trie or a DAWG.

A trie/DAWG has the following property:

Given a trie/dawg and a search term of length K, you can in O(K) time lookup the data associated with the word (or tell if there is no match).

Using a DAWG could save you more space as compared to a trie. Tries exploit the fact that many words will have a common prefix and DAWGs exploit the common prefix as well as the common suffix property.

In the trie, also maintain the list of positions at which each word occurs. For example, if the text is

That is that and so it is.

The node for the last `t` in `that` will have the list {1,3} associated, and the node for the `s` in `is` will have the list {2,7}.

Now when you get a single word search term, you can walk the trie and get the list of matches for that word easily.

If you get a multiple word search term, you can do the following.

Walk the trie with the first word in the search term. Get the list of matches and insert them into a hashtable H1.

Now walk the trie with the second word in the search term. Get the list of matches. For each match position x, check if x-1 exists in the hashtable H1. If so, add x to a new hashtable H2.

Walk the trie with the third word and get its list of matches. For each match position y, check if y-1 exists in H2; if so, add y to a new hashtable H3.

Continue in the same way for the remaining words.

At the end you get a list of matches for the search phrase, which gives the positions of the last word of the phrase.

You could potentially optimize the phrase-matching step by keeping the position lists sorted and doing binary searches: e.g. for each key k in H2, binary-search for k+1 in the sorted position list of the third search word and add k+1 to H3 if you find it, and so on.
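
(A minimal sketch of the phrase-matching step. It stands in a `std::unordered_map` from word to position list for the trie described above, since the trie only changes how each per-word lookup is done, and it assumes lowercase, punctuation-free text so the tokenization stays trivial.)

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Index: word -> 1-based positions of that word in the document, standing in
// for the position lists stored at the trie nodes.
using WordIndex = std::unordered_map<std::string, std::vector<int>>;

WordIndex build_index(const std::string& text) {
    WordIndex index;
    std::istringstream in(text);
    std::string word;
    int pos = 0;
    while (in >> word) index[word].push_back(++pos);
    return index;
}

// Returns the positions of the last word of `phrase` where the whole phrase matches.
std::vector<int> find_phrase(const WordIndex& index,
                             const std::vector<std::string>& phrase) {
    if (phrase.empty()) return {};
    auto it = index.find(phrase[0]);
    if (it == index.end()) return {};
    std::unordered_set<int> current(it->second.begin(), it->second.end());

    for (std::size_t i = 1; i < phrase.size(); ++i) {
        auto next_it = index.find(phrase[i]);
        if (next_it == index.end()) return {};
        std::unordered_set<int> next;
        for (int x : next_it->second)
            if (current.count(x - 1))   // previous word occurs right before x
                next.insert(x);
        current = std::move(next);
    }
    return std::vector<int>(current.begin(), current.end());
}

int main() {
    WordIndex index = build_index("that is that and so it is");
    for (int pos : find_phrase(index, {"that", "and"}))
        std::cout << "phrase ends at word " << pos << '\n';   // prints 4
}
```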

Moron
+7  A: 

Don't reinvent the wheel

This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.

The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.

In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.

And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.

A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.

belisarius
Thanks for the response. I asked the question to avoid reinventing something I felt was probably well researched. I'll take a look at agrep src.
Dwight Kelly
@Dwight Kelly Read this first! http://www.tgries.de/agrep/doc/agrep2ps.zip
belisarius
@Dwight: If the 'large body of text' is static, I would suggest creating a trie/DAWG, and I would think this would beat agrep's algorithms, which are more suitable for cases where the search terms are fixed but the text to search varies. In any case, since you accepted this answer, I suppose your large body of text is not really static. I suggest you modify your question to add that information.
Moron
@Moron The OP says that "Terms to search for will be contained in a list and can have 1000+ possibilities." So I assumed that the text is not static, because if it were, and you have 1500 terms to search, you could just build a tree in advance and perform binary searches.
belisarius
@Belisarius: Just trying to make it clear for future readers :-) From Dwight's comment in response to Georg on this question, it seems like the text ("document") is static, so there is scope for confusion.
Moron
@Moron Maybe I misunderstood the conditions. I think the best I can do now is to agree that this answer makes sense only for a "dynamic" document.
belisarius
@beli: Not your fault: The question isn't very clear. Since OP accepted this answer, I presume you interpreted the question correctly. I only wanted OP to edit the question to clarify it. In case someone with a static document comes upon this question, our comments here and the edit to question (if made) will hopefully help...
Moron
The text to be searched will not be static. The match list could be preprocessed if necessary.
Dwight Kelly
A: 

An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length O(N), this can be built in O(N).

Then all k occurrences of a search string of length m can be found optimally in O(m + k) time.

Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.

This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.
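
(A rough sketch of the idea using a suffix array rather than a suffix tree. The construction below is a naive comparison sort, and each query is answered with binary search in O(m log N) rather than the O(m + k) a suffix tree gives, but it shows the structure; the example text is made up.)

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Build a suffix array by sorting all suffix start positions.
// Naive O(N^2 log N) construction; real implementations use O(N log N) or O(N).
std::vector<int> build_suffix_array(const std::string& text) {
    std::vector<int> sa(text.size());
    for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = static_cast<int>(i);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Find all occurrences of `pattern` via binary search over the suffix array.
std::vector<int> find_all(const std::string& text,
                          const std::vector<int>& sa,
                          const std::string& pattern) {
    auto lower = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&](int pos, const std::string& p) {
            return text.compare(pos, p.size(), p) < 0;   // suffix prefix < pattern
        });
    auto upper = std::upper_bound(sa.begin(), sa.end(), pattern,
        [&](const std::string& p, int pos) {
            return text.compare(pos, p.size(), p) > 0;   // suffix prefix > pattern
        });
    return std::vector<int>(lower, upper);
}

int main() {
    std::string text = "that is that and so it is";
    std::vector<int> sa = build_suffix_array(text);
    for (int pos : find_all(text, sa, "that"))
        std::cout << "match at " << pos << '\n';   // positions 0 and 8
}
```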

See also

  • Wikipedia/Suffix tree
  • Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
polygenelubricants
Do you mean a text of length `N`?
JAB
@JAB: yes, I could've just said length `N` instead of `O(N)`, but `N` is `O(N)`, so it's still correct =)
polygenelubricants