Hi there,

I am currently iterating over somewhere between 7,000 and 10,000 text definitions, each between 0 and 5,000 characters long, and I want to check whether a particular string exists in any of them. I want to do this for somewhere in the region of 5,000 different search strings.

In most cases I just want to know whether there is an exact case-insensitive match; however, sometimes a regex is required to be more specific. I was wondering whether it would be quicker to use another "search" technique when the regex isn't required.

A slimmed-down version of the code looks something like this.

foreach (string find in stringsiWantToFind)
{
    Regex rx = new Regex(find, RegexOptions.IgnoreCase);
    foreach (string s in listOfText)
        if (rx.IsMatch(s))
            find.FoundIn(s); // FoundIn is a placeholder that records the match
}
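For the non-regex cases, the kind of alternative "search" technique I had in mind is a plain case-insensitive IndexOf. A rough sketch, using the same placeholders as above:

foreach (string find in stringsiWantToFind)
{
    foreach (string s in listOfText)
        if (s.IndexOf(find, StringComparison.OrdinalIgnoreCase) >= 0)
            find.FoundIn(s); // same placeholder as in the regex version
}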

I've read around a bit to see whether I'm missing anything obvious. There are a number of suggestions for using compiled regexes; however, I can't see how that helps given the "dynamic" nature of the regex.
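For reference, making one of these compiled is just a flag change. Since each pattern is run against every one of the texts, the up-front compilation cost might still be amortised, though I'd have to measure it:

// Each pattern is still built dynamically, but it is compiled once
// and then reused for thousands of IsMatch calls.
Regex rx = new Regex(find, RegexOptions.IgnoreCase | RegexOptions.Compiled);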

I also read an interesting article on CodeProject, so I'm about to try the "FastIndexOf" approach to see how it compares in performance.

I just wondered whether anybody had any advice for this kind of problem, and how performance could potentially be improved.

Thanks

A: 

I would look into a file indexing service like MS Indexing Service or Google Desktop Search. Those APIs will allow you to search the indexes of your files rather than the files themselves and are extremely fast.

Steve Danner
The list of texts is built from a number of different sources; for example, in some cases it is obtained from a file and in others from a field in a database.
MrEdmundo
Could you use the indexing service in conjunction with a full-text SQL query then?
Steve Danner
I guess I could, yeah. The code currently runs in about 15 minutes, so in principle I wasn't looking to change its structure. I was mainly looking for a small change to the existing structure that could shave a few minutes off.
MrEdmundo
A: 

One trick that came to my mind was:

Concatenate the strings into one big one and have the regex work at a global level. That would yield a "string found xx times" result using one regex run instead of looping over your list.
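A sketch of what I mean, using the names from your example (note that a pattern could in principle match across the separator):

// Join all texts into one blob and run each pattern once,
// counting global occurrences instead of looping per text.
string blob = string.Join("\n", listOfText);

foreach (string find in stringsiWantToFind)
{
    Regex rx = new Regex(find, RegexOptions.IgnoreCase);
    int count = rx.Matches(blob).Count; // "string found xx times"
}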

Hope this helps,

Marvin Smit
I probably should have been more specific in my example (I will change it). I need to know which "source" the text was found in.
MrEdmundo
+1  A: 

Something like this? Build one regular expression that contains all the strings you want to match, then loop over the files with that regex. The Regex constructor arguments are probably not quite right; my knowledge of .NET regex patterns is not the best. Also, I've left out a few using blocks to keep it readable here. You could make the Regex compiled if that improves things.

Regex rx = new Regex("string1|string2|string3|string5|string-etc", RegexOptions.IgnoreCase);

foreach (string fileName in fileNames)
{
  var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read);
  var sr = new StreamReader(fs);

  string readFile = sr.ReadToEnd();
  MatchCollection matches = rx.Matches(readFile);

  foreach (Match match in matches)
  {
    // record which string matched (match.Value) and in which file
  }
}
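One caveat: if your search strings are literals rather than patterns, escape them when building the alternation, otherwise any regex metacharacters they contain will misfire. Something like:

// Build the alternation from the literal strings, escaping any
// regex metacharacters they contain (Select needs System.Linq).
string pattern = string.Join("|", stringsiWantToFind.Select(Regex.Escape));
Regex rx = new Regex(pattern, RegexOptions.IgnoreCase);

match.Value then tells you which of the strings was found.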
runrunraygun
Thank you. I was wondering whether that was a good idea, so I'll give it a go.
MrEdmundo
Just done a quick test and it would seem there is no performance improvement with the above method.
MrEdmundo
The example appears to have already loaded the texts into memory before we get to the example code; my version is loading them from disk in the loop. Have you taken this into account?
runrunraygun
I can't see how it makes any difference when the texts are loaded, as it is the regex that is doing all the work and taking all the time.
MrEdmundo
Sorry, I found the original question a little confusing; I thought you were searching 7,000 to 10,000 files for your text. My mistake.
runrunraygun