I am writing a program that needs to search a LARGE text document for a large collection of words. The words are all file names with underscores in them (e.g., this_file_name). I know how to open and iterate through a text document, but I'm curious whether I should use Regex to search for these names, and if so, what kind of regex pattern I should use. I've tried

Regex r = new Regex("?this\_file\_name");

but I get an invalid argument error every time.

+2  A: 

It would be helpful to see a sample of the source text, but maybe this helps:

var doc = @"asdfsdafjkj;lkjsadf asddf jsadf asdfj;lksdajf
sdafjkl;sjdfaas  sadfj;lksadf sadf jsdaf jf sda sdaf asdf sad
jasfd sdf sadf sadf sdajlk;asdf
this_file_name asdfsadf asdf asdf asdf 
asdf sadf asdfj asdf sdaf sadfsadf
sadf asdf this_file_name asdf asdf ";

var reg = new Regex("this_file_name", RegexOptions.IgnoreCase | RegexOptions.Multiline);
var matches = reg.Matches(doc);
bendewey
The Multiline modifier is not needed.
Alan Moore
@Alan M, why not?
bendewey
As Alan pointed out, the `RegexOptions.Multiline` is not needed. Read its documentation. It only makes a difference if you’re using `^` and/or `$`.
Timwi
A: 

If I understand your problem correctly, I think a regular expression is the wrong tool for the job. I'll assume your file names are separated with some kind of delimiter (like commas or new lines).

If this is the case, use String.Split to put all the file names into an array, sort the array alphabetically, then perform a binary search against the sorted array for each item in the "collection" you mentioned. The sort costs O(n log n) once and each lookup is then O(log n), so I'm fairly sure this is an efficient way to perform the task.
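The split, sort, and binary search steps could be sketched roughly as follows. This is only a minimal sketch: the class name `SortedNameLookup`, the method `FindPresent`, and the delimiter set are assumptions made for illustration, and the comparison is ordinal (case-sensitive), which may not match your data.

```csharp
using System;
using System.Collections.Generic;

static class SortedNameLookup
{
    // Splits the text on the assumed delimiters, sorts the tokens once,
    // then binary-searches the sorted array for each wanted name.
    public static List<string> FindPresent(string text, IEnumerable<string> wanted)
    {
        // Assumption: names are separated by commas, whitespace, or newlines.
        char[] delimiters = { ',', ' ', '\t', '\r', '\n' };
        string[] tokens = text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Ordinal comparison: culture-independent and case-sensitive.
        // The same comparer must be used for sorting and for searching.
        Array.Sort(tokens, StringComparer.Ordinal);

        var found = new List<string>();
        foreach (string name in wanted)
        {
            // BinarySearch returns a non-negative index when the value is present.
            if (Array.BinarySearch(tokens, name, StringComparer.Ordinal) >= 0)
                found.Add(name);
        }
        return found;
    }

    static void Main()
    {
        string text = "alpha_file,this_file_name\nother_name zeta_file";
        var wanted = new[] { "this_file_name", "missing_file" };
        foreach (string name in FindPresent(text, wanted))
            Console.WriteLine(name);
    }
}
```

If the list of names to look for is much smaller than the document, it may be cheaper to sort (or hash) that list instead and walk the document's tokens once.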

When you say "LARGE" text files, think about their size relative to the machines this program will be running on. A 1 MB text file may seem large, but it will easily fit into the memory of a machine with 2 GB RAM. If the file is much larger than the memory available on your client machines, read it in chunks rather than loading it all at once. This is called buffering.
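For a file too large to load whole, one possible approach is to read it line by line and check each token against a set of wanted names. Note that this sketch swaps the sorted array for a `HashSet`, since the full token list is never held in memory. The names `StreamingNameSearch` and `FindPresent`, the delimiter set, and the command-line path are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StreamingNameSearch
{
    // Reads the input one line at a time, so only the current line and the
    // sets of names are held in memory.
    public static HashSet<string> FindPresent(TextReader reader, IEnumerable<string> wanted)
    {
        var remaining = new HashSet<string>(wanted, StringComparer.Ordinal);
        var found = new HashSet<string>(StringComparer.Ordinal);
        // Assumption: names within a line are separated by commas, spaces, or tabs.
        char[] delimiters = { ',', ' ', '\t' };

        string line;
        // Stop early once every wanted name has been seen.
        while (remaining.Count > 0 && (line = reader.ReadLine()) != null)
        {
            foreach (string token in line.Split(delimiters, StringSplitOptions.RemoveEmptyEntries))
            {
                // Remove returns true only when the token was still wanted.
                if (remaining.Remove(token))
                    found.Add(token);
            }
        }
        return found;
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: pass the path of the text file to search");
            return;
        }
        // StreamReader buffers the underlying file reads internally.
        using (var reader = new StreamReader(args[0]))
        {
            foreach (string name in FindPresent(reader, new[] { "this_file_name" }))
                Console.WriteLine(name);
        }
    }
}
```

Taking a `TextReader` rather than a path keeps the method easy to try out with a `StringReader` before pointing it at a real file.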

James Jones
+1  A: 

Perhaps break your document into tokens first by splitting on spaces or non-word characters?

After that, I think a regex that might work for you would look something like this:

Regex r = new Regex(@"(\w+)"); // \w already matches the underscore, so [\w_] is redundant
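To show how such a token pattern might be combined with the list of names, here is a rough sketch that extracts every `\w+` token and checks it against a set. The class `RegexTokenSearch` and method `FindPresent` are made-up names, and the sketch assumes the file names contain only letters, digits, and underscores (a name with a dot or extension would be split into several tokens).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexTokenSearch
{
    // \w matches letters, digits, and the underscore, so \w+ captures
    // a name like this_file_name as a single token.
    private static readonly Regex Token = new Regex(@"\w+");

    public static HashSet<string> FindPresent(string text, IEnumerable<string> wanted)
    {
        var wantedSet = new HashSet<string>(wanted, StringComparer.Ordinal);
        var found = new HashSet<string>(StringComparer.Ordinal);

        foreach (Match m in Token.Matches(text))
        {
            // Keep only the tokens that are in the wanted collection.
            if (wantedSet.Contains(m.Value))
                found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        string text = "asdf sadf; this_file_name asdf, other_file_name asdf";
        foreach (string name in FindPresent(text, new[] { "this_file_name", "missing_name" }))
            Console.WriteLine(name);
    }
}
```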

Scott Hoffman