We have 5 MB of typical text (just plain words). We have 1000 words/phrases to use as terms to search for in this text.

What's the most efficient way to do this in .NET (ideally C#)?

Our ideas include regexes (a single one, or lots of them) and even String.Contains.

The input is a 2 MB to 5 MB text string - all text. Multiple hits are good: for each term (of the 1000) that matches, we do want to know about it. Performance means total time to execute; we don't care about memory footprint. The current algorithm takes 60+ seconds using naive String.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term word must hit; no stemming).

We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (we don't need position or frequency just yet). We get a new 2-5 MB string every 10 seconds, so we can't assume we can index the input. The 1000 terms are dynamic, although they change at a rate of about one change an hour.

+3  A: 

Knuth-Morris-Pratt algorithm.
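For reference, a minimal single-pattern KMP search in C# might look like the sketch below (the method name is mine); note that it handles one pattern at a time, which is the point raised in the comment below.

// Sketch of Knuth-Morris-Pratt for a single pattern. It scans the text once without
// backtracking, but for 1000 terms it would still have to be called once per term.
static bool KmpContains(string text, string pattern)
{
    if (string.IsNullOrEmpty(pattern)) return true;

    // Failure table: fail[i] is the length of the longest proper prefix of
    // pattern[0..i] that is also a suffix of it.
    int[] fail = new int[pattern.Length];
    for (int i = 1, k = 0; i < pattern.Length; i++)
    {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) k++;
        fail[i] = k;
    }

    // Scan the text, falling back via the failure table on a mismatch.
    for (int i = 0, k = 0; i < text.Length; i++)
    {
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) k++;
        if (k == pattern.Length) return true; // full match found
    }
    return false;
}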

plinth
Thanks for the link, I'll read up. On a quick read, though, won't this mean we'll have to run KMP 1000 times, i.e. once for each word? Ideally we'd like (term1, term2, ..., termN -> text to search) = matches. Like a regex, but it feels odd building a single regex expression with 1000 terms in it?
A: 

A modified suffix tree would be very fast, though it would take up a lot of memory and I don't know how fast it would be to build. After that, however, every search would take O(1).

Vilx-
Thanks for that. Unfortunately (in the question comments now, sorry) our 'input text' changes frequently, so building an index would take a while, and we never search the same text body again (with those terms), so the index wouldn't be reused.
The O(1) is wrong. Search actually takes O(m) if m is the length of a given pattern. Only direct indexing (i.e. q-gram index) can give a constant retrieval time.
Konrad Rudolph
Yes, O(m), if m is pattern length. But O(1) if m is the number of strings that have to be checked for a match.
Vilx-
A: 

Are you trying to get a list of matched words, or are you trying to highlight them in the text by getting the start and length of each match? If all you're trying to do is find out whether the words exist, then you could use subset theory to perform this fairly efficiently.

However, I expect you're trying to find each match's start position in the text... in which case this approach wouldn't work.

The most efficient approach I can think of is to dynamically build a match pattern from the list and then use regex. It's far easier to maintain a list of 1000 items than it is to maintain a regex pattern based on those same 1000 items.

It is my understanding that Regex uses the same KMP algorithm suggested above to efficiently process large amounts of data - so unless you really need to dig through and understand the minutiae of how it works (which might be beneficial for personal growth), then perhaps regex would be OK.

There's quite an interesting paper on search algorithms for many patterns in large files here: http://webglimpse.net/pubs/TR94-17.pdf
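To make the 'build a match pattern from the list' idea concrete, here is a rough sketch (identifiers are mine; it assumes whole-word, case-insensitive matching, escapes the terms in case they contain regex metacharacters, and needs System.Linq and System.Text.RegularExpressions):

static HashSet<string> FindMatchingTerms(string bigTextToSearchForTerms, IEnumerable<string> allTerms)
{
    // Escape each term and join them into one alternation, anchored on word
    // boundaries so 'cat' does not hit 'category' or 'cats'.
    string pattern = @"\b(?:" + string.Join("|", allTerms.Select(Regex.Escape).ToArray()) + @")\b";
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);

    // Collect the distinct words that actually matched (these are the matched text,
    // which may differ in case from the terms as written).
    var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (Match m in regex.Matches(bigTextToSearchForTerms))
        found.Add(m.Value);
    return found;
}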

BenAlabaster
Thanks for that. Comments added to the question to clarify a bit (sorry, new here). Actually we don't need the position, just the subset - can you expand on 'then you could use subset theory to fairly efficiently perform this' please? Do you mean just use a single regex statement once, or 1000 regexes?
@David: Do you need to find out *which* items from your search terms exist, or do you need to know if they all exist as a set? i.e. "Yes, all of my search terms exist in my text" or just "this text contains items from my search list"?
BenAlabaster
Yes, we need "this text contains items from my search list, and here they all are".
A: 

Here's another idea: Make a class something like this:

class Word
{
    public string Text;                           // the word itself
    public List<int> Positions = new List<int>(); // word offsets where it occurs
}

For every unique word in your text you create an instance of this class. The Positions list will store the positions (counted in words, not characters) from the start of the text where this word was found.

Then make another two lists which will serve as indexes. One will store all these instances sorted by their text, the other by their position in the text. In essence, the text index would probably be a SortedDictionary, while the position index would be a simple List<Word>.

Then to search for a phrase, you split that phrase into words. Look up the first word in the dictionary (that's O(log(n))). From there you know which words can follow it in the text (you have their positions from the Positions list). Look at those words (use the position index to find them in O(1)) and go on until you've found one or more full matches.
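A rough sketch of building those two indexes, reusing the Word class above (whitespace tokenisation and the method name are my own simplifications):

static SortedDictionary<string, Word> BuildIndexes(string text, out List<Word> byPosition)
{
    var byText = new SortedDictionary<string, Word>(StringComparer.OrdinalIgnoreCase); // text index
    byPosition = new List<Word>();                                                     // position index

    string[] tokens = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < tokens.Length; i++)
    {
        Word w;
        if (!byText.TryGetValue(tokens[i], out w))
        {
            w = new Word { Text = tokens[i] };
            byText.Add(tokens[i], w);
        }
        w.Positions.Add(i);  // record the word offset where this word occurs
        byPosition.Add(w);   // the i-th entry is the word at word offset i
    }
    return byText;
}

// A single-word term is then one dictionary lookup; a multi-word phrase follows the
// first word's Positions into byPosition to check the words that come after it.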

Vilx-
'For every unique word in your text you create an instance of this class' - I'm not sure this will reduce the overall search time, as in, won't it take a while to populate/build?
I don't know. You could make the whole thing more efficient by pre-allocating a large number of these instances and then reusing them, so that later there isn't much allocation/deallocation of memory.
Vilx-
Also, if this is so performance critical, why don't you try to do this with C/C++, or at least unsafe C#. Pointer magic could help to speed things up here.
Vilx-
+1  A: 

Hi,

Have you considered the following:

  1. Do you care about substrings? Let's say I am looking for the word "cat", nothing more and nothing less. Now consider the Knuth-Morris-Pratt algorithm, or String.Contains, looking for "cat" in "concatenate". Both of these will return true (or an index). Is this OK?

  2. You will also have to look into the idea of the stemmed or "finite" state of the word. Let's look for "diary" again; the test sentence is "there are many kinds of diaries". To you and me the word "diaries" is there, but does it count? If so, we will need to preprocess the sentence, converting the words to their finite state (diaries -> diary), so the sentence becomes "there are many kinds of diary". Now we can say that "diary" is in the sentence (please look at the Porter stemmer algorithm).

  3. Also, when it comes to processing text (aka Natural Language Processing) you can remove some words as noise. Take for example "a, have, you, I, me, some, to" <- these could be considered useless words, and can be removed before any processing takes place. For example:

"I have written some C# today": if I have 10,000 keywords to look for, I would have to scan the entire sentence 10,000 x the number of words in the sentence. Removing the noise beforehand will shorten the processing time.

"written C# today" <- noise removed; now there is a lot less to look through. (A rough sketch of this idea follows below.)

A great article on NLP can be found here: Sentence comparing.

HTH

Bones

dbones
Yes, we want to avoid substrings, i.e. the term 'cat' shouldn't match 'category', so we want to deal with whole-word terms. We do our own stemming, so that's outside this search algorithm (but good point). Taking out the noise is harder, as it moves the performance cost into a 'filter' step, and we only process each text once.
A: 

Is this a bottleneck? How long does it take? 5 MiB isn't actually a lot of data to search in. Regular expressions might do just fine, especially if you encode all the search strings into one pattern using alternations. This basically amortizes the overall cost of the search to O(n + m), where n is the length of your text and m is the combined length of all patterns. Note that this is very good performance.

An alternative that's well suited to many patterns is the Wu-Manber algorithm. I've already posted a very simplistic C++ implementation of the algorithm.

Konrad Rudolph
Thanks. We're trying with all the m terms as a single pattern, though we're wondering whether, past a certain pattern size, it will degrade (or regex will reject it, at least on .NET).
The latter could actually be true; I don't know the internals well enough to answer that question, sorry. I doubt it though. The engine seems to be very good. Performance degradation in this simple case shouldn't happen, since the expression constructed is really straightforward. Again, caveat emptor.
Konrad Rudolph
Also, if you have a fixed pattern (like the heavily alternating regex with 1000 terms), make sure you declare the Regex variable outside of any loops (and preferably compile it as well).
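For example (a sketch; the class and member names are mine):

class TermMatcher
{
    private readonly Regex _termRegex;

    // Construct once when the term list changes (roughly once an hour), not once per
    // incoming 2-5 MB document. RegexOptions.Compiled pays a one-off compilation cost
    // in exchange for faster matching on every subsequent text.
    public TermMatcher(string pattern)
    {
        _termRegex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    public bool IsMatch(string text)
    {
        return _termRegex.IsMatch(text);
    }
}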
torial
+1  A: 

A naive String.Contains with 762 words (the final page) of War and Peace (3.13 MB) runs in about 10 seconds for me. Switching to 1000 GUIDs runs in about 5.5 seconds.

Regex.IsMatch found the 762 words (many of which were probably in earlier pages as well) in about 0.5 seconds, and ruled out the GUIDs in 2.5 seconds.

I'd suggest your problem lies elsewhere...Or you just need some decent hardware.
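For what it's worth, a rough harness for timings like these is easy to put together with Stopwatch (file names are placeholders; this variant adds the \b word boundaries discussed in the comments below, and needs System.Diagnostics, System.IO, System.Linq and System.Text.RegularExpressions):

class RoughBenchmark
{
    static void Main()
    {
        string text = File.ReadAllText("war-and-peace.txt"); // placeholder input
        string[] terms = File.ReadAllLines("terms.txt");     // placeholder term list

        var sw = Stopwatch.StartNew();
        int containsHits = terms.Count(t => text.Contains(t));
        Console.WriteLine("String.Contains: {0} hits in {1} ms", containsHits, sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        int regexHits = terms.Count(t =>
            Regex.IsMatch(text, @"\b" + Regex.Escape(t) + @"\b", RegexOptions.IgnoreCase));
        Console.WriteLine("Regex.IsMatch:   {0} hits in {1} ms", regexHits, sw.ElapsedMilliseconds);
    }
}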

Mark Brackett
Yep, our original implementation was borked, so the times are way off. You're correct that IsMatch works pretty fast, but if you use it across multiple terms you also need to know which ones matched; we used regex.Matches.Count and it was slow, and found it faster to run Regex.IsMatch() 1000 times.
More details here: http://stackoverflow.com/questions/382263/c-codealgorithm-to-search-text-for-terms#382451
Also, your regex was probably a bit quick in that you need to look for word boundaries, i.e. \bterm\b (to avoid finding 'cat' in 'category'). If you do that you'll find the regex goes up to about 6 seconds (like the GUIDs, I guess).
A: 

OK, current rework shows this as fastest (pseudocode):

foreach (var term in allTerms)
{
    string pattern = term.ToWord(); // helper that wraps the escaped term in \b word boundaries
    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (regex.IsMatch(bigTextToSearchForTerms))
    {
        result.Add(term);
    }
}

What was surprising (to me at least!) is that running the regex 1000 times was faster than a single regex with 1000 alternatives, i.e. "\bterm1\b|\bterm2\b|...|\btermN\b", and then trying to use regex.Matches.Count.

A: 

How does this perform in comparison? It uses LINQ, so it may be a little slower, not sure...

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase));

This uses a classic predicate to implement the Find method, so it should be quicker than LINQ:

static bool Match(string checkItem)
{
  return Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase);
}

static void Main(string[] args)
{
  List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
  List<String> matches = allTerms.Find(Match);
}

Or this uses the lambda syntax to implement the classic predicate, which again should be faster than the LINQ, but is more readable than the previous syntax:

List<String> allTerms = new List<String> { "string1", "string2", "string3", "string4" };
List<String> matches = allTerms.Find(checkItem => Regex.IsMatch(bigTextToSearchForTerms, checkItem, RegexOptions.IgnoreCase));

I haven't tested any of them for performance, but they all implement your idea of iterating through the search list using the regex. They're just different ways of implementing it.

BenAlabaster
Yes, thanks. I think this: List<String> matches = allTerms.Where(item => Regex.IsMatch(bigTextToSearchForTerms, item, RegexOptions.IgnoreCase)); ...would need an IEnumerable<string> though? I get the general idea though - thanks.
Also, the third example's 'Find' needs to be a 'FindAll'. I did try these as an informal test and the loop still edges it, i.e. 642 words from the 3 MB War and Peace = 241 ms for the loop, 283 ms for allTerms.FindAll - pretty close...
+1  A: 

Why reinvent the wheel? Why not just leverage something like Lucene.NET?
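A rough sketch of what that might look like, assuming Lucene.NET 3.0.x (exact class and method names vary between versions, and this indexes each incoming text, which has its own cost):

// Sketch only - assumes Lucene.NET 3.0.x; API details differ between versions.
static List<string> LuceneMatches(string bigTextToSearchForTerms, IEnumerable<string> allTerms)
{
    var dir = new RAMDirectory();
    var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

    // Index the incoming 2-5 MB text as a single document.
    using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        doc.Add(new Field("body", bigTextToSearchForTerms, Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
        writer.Commit();
    }

    // Each term becomes a cheap TermQuery against the index.
    var matches = new List<string>();
    using (var searcher = new IndexSearcher(dir, true))
    {
        foreach (var term in allTerms)
        {
            var query = new TermQuery(new Term("body", term.ToLowerInvariant()));
            if (searcher.Search(query, 1).TotalHits > 0)
                matches.Add(term);
        }
    }
    return matches;
}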

HTH, Kent

Kent Boogaart
Yep, that's a good suggestion. We use Lucene elsewhere, so we could do that. I guess it's a balance between the size of the input source and the number of terms to find. The multiple-regex approach works well up to a certain size, and then something like Lucene becomes worth it.