Hi,

I'm using C# to continuously search for multiple string "keywords" within large strings, which are >= 4 KB. This code loops constantly, and sleeps aren't cutting down CPU usage enough while maintaining a reasonable speed. The bottleneck is the keyword-matching method.

I've found a few possibilities, and all of them give similar efficiency.

1) Aho-Corasick (http://tomasp.net/articles/ahocorasick.aspx). I don't have enough keywords for this to be the most efficient algorithm.

2) Regex, using an instance-level, compiled regex. Provides more functionality than I require, and not quite enough efficiency.

3) String.IndexOf. I would need to do a "smart" version of this for it to provide enough efficiency; looping through each keyword and calling IndexOf doesn't cut it.

Does anyone know of any algorithms or methods that I can use to attain my goal?

+3  A: 

I haven't tried it, but have you looked at Rabin-Karp? Apparently it has a bad worst-case complexity, but is usually quite good.

What do your keywords look like? In particular, are they always delimited by spaces (or something similar)? If so, you could basically look through the string once looking for "words" and then either create a map from a word to the list of indexes of that word, or perhaps only do so for keywords you're interested in.

If you could give more details of the exact situation (such as the keywords, delimiters and what you need the result of your search to be) that would help.

Jon Skeet
I've been trying to utilize Rabin-Karp. The problem is that the implementations all use a static pattern length to speed up the algorithm. I can't do that, and when I implement it without a constant pattern length, computation time grows drastically.
Jon
Oh: the text I am searching is always of length 12286. My patterns are much shorter, anywhere from 10 to ~50 characters, and are simply words converted into a hex string (e.g. BitConverter.ToString(ENCODING.GetBytes("no recoil"))). All I need to know is whether any of my patterns occur in the text.
Jon
And are there always spaces before and after the words? If so, can you just iterate over the words in the text, and use a normal HashSet<string> to detect whether each word is or isn't a keyword?
Jon Skeet
No. The text is in the form XX-XX-XX-XX-XX, where each XX is the hexadecimal representation of a byte from a buffer of memory. In fact, I could solve this problem without dealing with strings at all, and instead search byte arrays for byte patterns. The only reason I am converting my data from byte[]s to strings (which take up more memory and have other performance costs) is that I believed there were more string-searching algorithms than byte-searching algorithms... I also hoped that Regex would meet my performance requirements...
Jon
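
A common workaround for the fixed-pattern-length assumption in Rabin-Karp is to group the patterns by length and run one rolling-hash pass per distinct length; with patterns of 10 to ~50 characters that is a few dozen passes at most. Here is a minimal sketch of that idea (the class and method names are illustrative, and each hash hit is verified with a real comparison to rule out collisions):

    using System.Collections.Generic;
    using System.Linq;

    static class MultiRabinKarp
    {
        // True if any pattern occurs in text. Patterns are grouped by length,
        // so each distinct length costs one rolling-hash pass over the text.
        public static bool ContainsAny(string text, IEnumerable<string> patterns)
        {
            foreach (var group in patterns.GroupBy(p => p.Length))
            {
                int m = group.Key;
                if (m == 0 || m > text.Length) continue;

                var hashes = new HashSet<int>(group.Select(p => Hash(p, m)));
                var exact = new HashSet<string>(group);

                const int B = 31;
                int pow = 1;                          // B^(m-1), coefficient of the outgoing char
                for (int k = 1; k < m; k++) pow = unchecked(pow * B);

                int h = Hash(text, m);                // hash of the first window
                for (int i = 0; ; i++)
                {
                    // Verify with a real comparison to rule out hash collisions.
                    if (hashes.Contains(h) && exact.Contains(text.Substring(i, m)))
                        return true;
                    if (i + m >= text.Length) break;
                    h = unchecked((h - text[i] * pow) * B + text[i + m]); // roll the window
                }
            }
            return false;
        }

        static int Hash(string s, int len)
        {
            int h = 0;
            for (int i = 0; i < len; i++) h = unchecked(h * 31 + s[i]);
            return h;
        }
    }

The same loop works on byte[] buffers with no structural change (swap string indexing for array indexing and Substring for a short comparison loop), which would avoid the hex-string conversion discussed above.
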
+2  A: 

Are you always looking for the same keywords? Try Boyer-Moore. It requires some pre-processing of the keywords, but gains speed afterwards.

ammoQ
The problem is that I can't figure out how to make a Boyer-Moore implementation that works with multiple patterns.
Jon
Simple answer: you can't. But searching for each individual keyword is much faster. Whether it pays off depends on the number of keywords vs. the average length of the keywords.
ammoQ
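
For reference, a minimal sketch of Boyer-Moore-Horspool, the simplified variant that uses only the bad-character shift table, run once per keyword with the table precomputed up front. The identifiers are illustrative; the & 0xFF fold in the 256-entry table can only shorten a shift (never skip a match), and is harmless anyway for the ASCII hex strings described above:

    static class Horspool
    {
        // Precompute once per keyword: how far the window may jump when the
        // last character of the current window is c.
        public static int[] BuildShift(string pattern)
        {
            var shift = new int[256];
            for (int c = 0; c < 256; c++) shift[c] = pattern.Length;
            for (int i = 0; i < pattern.Length - 1; i++)
                shift[pattern[i] & 0xFF] = pattern.Length - 1 - i;
            return shift;
        }

        // First index of pattern in text, or -1 if absent.
        public static int IndexOf(string text, string pattern, int[] shift)
        {
            int m = pattern.Length;
            int i = 0;
            while (i + m <= text.Length)
            {
                int j = m - 1;
                while (j >= 0 && text[i + j] == pattern[j]) j--;
                if (j < 0) return i;                   // all m characters matched
                i += shift[text[i + m - 1] & 0xFF];    // jump based on the window's last char
            }
            return -1;
        }
    }

Usage, with the tables cached across the continuous search loop (keywords being the pattern list; ToDictionary and Any need System.Linq):

    // Precompute the tables once; reuse them for every text scanned.
    var tables = keywords.ToDictionary(k => k, Horspool.BuildShift);
    bool hit = keywords.Any(k => Horspool.IndexOf(text, k, tables[k]) >= 0);

Because the jump size grows with the keyword length, this pays off most for longer keywords, which is exactly the trade-off described above.
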
+2  A: 

I developed an efficient use of IndexOf for this question:

A better way to replace many strings - obfuscation in C#

It uses a list of the keywords and their next position in the string. That way you only need to call IndexOf once for each keyword, and then once for each match you find. It's especially efficient when replacing keywords in a large string, as you can process the string from start to end instead of processing the entire string once per keyword. I don't know why you are looking for the keywords or what you do with them, but perhaps it's useful in your situation.
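
A rough sketch of that bookkeeping, in modern C# (identifiers are illustrative; Guffa's actual code is in the linked question):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class KeywordScanner
    {
        // Enumerate every (index, keyword) match in order, calling IndexOf once
        // per keyword up front and then once per match consumed.
        public static IEnumerable<(int Index, string Keyword)> FindAll(string text, string[] keywords)
        {
            // Seed: one IndexOf per keyword to find its first occurrence.
            var next = keywords
                .Select(k => (Pos: text.IndexOf(k, StringComparison.Ordinal), Key: k))
                .Where(t => t.Pos >= 0)
                .ToList();

            while (next.Count > 0)
            {
                // Pick the earliest pending match.
                int best = 0;
                for (int i = 1; i < next.Count; i++)
                    if (next[i].Pos < next[best].Pos) best = i;

                var (pos, key) = next[best];
                yield return (pos, key);

                // Re-query only the keyword whose match we just consumed.
                int again = text.IndexOf(key, pos + 1, StringComparison.Ordinal);
                if (again >= 0) next[best] = (again, key);
                else next.RemoveAt(best);
            }
        }
    }
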

Guffa
A: 

Run your code that is searching for the keywords through a foreach loop; it will speed things right up. :)

Tanner
+2  A: 

Actually, I had to solve this before; it was kinda fun. I had 20k HTML pages, each with a title, and wanted every other occurrence of a title to link to the page with that title. Sounds very similar to what you're trying to accomplish.

The approach:

  1. Processed the text of a file into a linked list of { Word, Whitespace } pairs, where a Word was a contiguous alphanumeric sequence (plus a few special characters) and the Whitespace was everything leading up to the next word (a sketch follows this list).
  2. Repeated the process from step 1 for each 'title' of the pages I wanted to link to.
  3. Added each word from the linked list in step 1 to a binary-sorted list.
  4. Now you just walk the first word of each title linked-list from step 2 and seek into the binary-sorted list from step 3. You may find several hits, or even soft hits when a word is plural, so you might have several starting nodes from the binary list to test.
  5. Once you process the document into the form described in step 1, it's actually very easy to modify by inserting new nodes and/or modifying the Whitespace values. Once complete, you simply walk the entire list and dump it all to a stream.
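
A minimal sketch of step 1, assuming "word" means a run of letters or digits (the exact character set is a detail of the original implementation):

    using System.Collections.Generic;

    sealed class Token
    {
        public string Word = "";        // contiguous run of letters/digits
        public string Whitespace = "";  // everything up to the next word
    }

    static class Tokenizer
    {
        // Split the text into { Word, Whitespace } pairs; concatenating
        // Word + Whitespace for every node reproduces the input verbatim.
        public static LinkedList<Token> Tokenize(string text)
        {
            var list = new LinkedList<Token>();
            int i = 0;
            while (i < text.Length)
            {
                var t = new Token();
                int start = i;
                while (i < text.Length && char.IsLetterOrDigit(text[i])) i++;
                t.Word = text.Substring(start, i - start);
                start = i;
                while (i < text.Length && !char.IsLetterOrDigit(text[i])) i++;
                t.Whitespace = text.Substring(start, i - start);
                list.AddLast(t);
            }
            return list;
        }
    }

Re-emitting the modified document is then a single walk of the list, writing each Word followed by its Whitespace.
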

It sounds more complicated than it is; it took about two days to get it working well.

However you solve it, have fun with it :)

csharptest.net
A: 

I just posted this on a similar thread, but it is probably more relevant here.

I'm doing a similar search, basically looking for keywords of around 10-50 bytes in length within texts of approximately 45k bytes. I search about 1,900 keywords over nine million texts, so getting this as fast as possible is a similarly high priority.

So the fastest method I've found using .NET 4 is a parallel Regex IsMatch.

Here's an example that counts how many of the keywords occur:

needles.AsParallel().Sum(l => Regex.IsMatch(haystack, Regex.Escape(l)) ? 1 : 0);

This works for my scenario (above); it is 55% faster than parallel ordinal IndexOf comparisons in my tests, at least for the data sizes I am using. I also imagine the speed improvement only shows up on multi-core machines.
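
One caveat with the static Regex.IsMatch overload: it resolves string patterns through a small internal cache (Regex.CacheSize defaults to 15), which 1,900 distinct needles will thrash. If the keyword set is fixed, precompiling each escaped needle once may help; a sketch of that variation, reusing the needles and haystack from the snippet above:

    using System.Linq;
    using System.Text.RegularExpressions;

    // Compile each escaped needle once up front (assumes a fixed keyword set).
    Regex[] compiled = needles
        .Select(n => new Regex(Regex.Escape(n), RegexOptions.Compiled))
        .ToArray();

    // Reuse the compiled instances for every haystack scanned afterwards.
    int hits = compiled.AsParallel().Sum(r => r.IsMatch(haystack) ? 1 : 0);
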

I'd be interested if anyone can find a faster method.

gary
Read the article the OP posted (http://tomasp.net/articles/ahocorasick.aspx): regexes have the worst performance for this purpose. Parallelizing can improve performance on multi-core PCs, but it doesn't address the actual problem. Aho-Corasick can be parallelized too, and would be even faster.
Philip Daubmeier
Thanks for pointing out the link. I had a go at making the FindAll function of this excellent library parallel, but it did not work; I think the tree structure needs to be traversed sequentially. I realize there are other options for doing a parallel search over this data (multiple source searches at once, for example). Having said that, even without using AsParallel, for my scenario of a non-changing keyword set (needles) it is much faster. For 1,900 keyword searches over 45k of data, run 100 times: REGEX: 5.137 sec; REGEX PLINQ: 1.73 sec; AHO-CORASICK: 0.826 sec.
gary