Hi there,

I am currently iterating over somewhere between 7,000 and 10,000 text definitions, each between 0 and 5,000 characters long, and I want to check whether a particular string exists in any of them. I want to do this for somewhere in the region of 5,000 different search strings.

In most cases I just want to know whether there is an exact case-insensitive match; however, sometimes a regex is required to be more specific. I was wondering whether it would be quicker to use another "search" technique when the regex isn't required.

A slimmed-down version of the code looks something like this.

foreach (string find in stringsiWantToFind)
{
    Regex rx = new Regex(find, RegexOptions.IgnoreCase);
    foreach (string s in listOfText)
        if (rx.IsMatch(s))
            find.FoundIn(s); // FoundIn is a placeholder that records the match
}
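For the non-regex cases, the kind of alternative "search" technique I had in mind is a plain case-insensitive IndexOf. A rough sketch, using the same placeholders as above:

foreach (string find in stringsiWantToFind)
{
    foreach (string s in listOfText)
        if (s.IndexOf(find, StringComparison.OrdinalIgnoreCase) >= 0)
            find.FoundIn(s); // same placeholder as in the regex version
}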

I've read around a bit to see whether I'm missing anything obvious. There are a number of suggestions for using compiled regexes; however, I can't see how that helps given the "dynamic" nature of the regex.
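For reference, making one of these compiled is just a flag change. Since each pattern is run against every one of the texts, the up-front compilation cost might still be amortised, though I'd have to measure it:

// Each pattern is still built dynamically, but it is compiled once
// and then reused for thousands of IsMatch calls.
Regex rx = new Regex(find, RegexOptions.IgnoreCase | RegexOptions.Compiled);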

I also read an interesting article on CodeProject, so I'm about to try the "FastIndexOf" approach to see how it compares in performance.

I just wondered whether anybody had any advice for this kind of problem, and how performance could potentially be improved.

Thanks

A: 

I would look into a file indexing service like MS Indexing Service or Google Desktop Search. Those APIs will allow you to search the indexes of your files rather than the files themselves and are extremely fast.

Steve Danner
The list of texts is built from a number of different sources; for example, in some cases it is obtained from a file and in others from a field in a database.
MrEdmundo
Could you use the indexing service in conjunction with a full-text SQL query then?
Steve Danner
I guess I could, yeah. The code currently runs in about 15 minutes, so in principle I wasn't looking to change its structure. I was mainly looking for a small change to the existing structure that could shave a few minutes off.
MrEdmundo
A: 

One trick that came to my mind was:

Concatenate the strings into one big one and have the regex work at a global level. That would yield a "string found xx times" result using one regex run instead of looping over your list.
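A sketch of what I mean, using the names from your example (note that a pattern could in principle match across the separator):

// Join all texts into one blob and run each pattern once,
// counting global occurrences instead of looping per text.
string blob = string.Join("\n", listOfText);

foreach (string find in stringsiWantToFind)
{
    Regex rx = new Regex(find, RegexOptions.IgnoreCase);
    int count = rx.Matches(blob).Count; // "string found xx times"
}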

Hope this helps,

Marvin Smit
I probably should have been more specific in my example (I will change it). I need to know which "source" the text was found in.
MrEdmundo
+1  A: 

Something like this? Build one regular expression that contains all the strings you want to match, then loop over the files with that regex. The Regex constructor arguments are probably not quite right; my knowledge of .NET regex patterns is not the best. Also, I've left out a few using blocks to keep it readable here. You could make the Regex compiled if that improves things.

Regex rx = new Regex("string1|string2|string3|string5|string-etc", RegexOptions.IgnoreCase);

foreach (string fileName in fileNames)
{
  var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read);
  var sr = new StreamReader(fs);

  string readFile = sr.ReadToEnd();
  MatchCollection matches = rx.Matches(readFile);

  foreach (Match match in matches)
  {
    // record which string matched (match.Value) and in which file
  }
}
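One caveat: if your search strings are literals rather than patterns, escape them when building the alternation, otherwise any regex metacharacters they contain will misfire. Something like:

// Build the alternation from the literal strings, escaping any
// regex metacharacters they contain (Select needs System.Linq).
string pattern = string.Join("|", stringsiWantToFind.Select(Regex.Escape));
Regex rx = new Regex(pattern, RegexOptions.IgnoreCase);

match.Value then tells you which of the strings was found.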
runrunraygun
Thank you. I was wondering whether that was a good idea, so I'll give it a go.
MrEdmundo
Just done a quick test and it would seem there is no performance improvement with the above method.
MrEdmundo
The example appears to have already loaded the texts into memory before we get to the example code; my version is loading them from disk in the loop. Have you taken this into account?
runrunraygun
I can't see how it makes any difference when the texts are loaded, as it is the regex that is doing all the work and taking all the time.
MrEdmundo
Sorry, I found the original question a little confusing; I thought you were searching 7,000 to 10,000 files for your text. My mistake.
runrunraygun