What's the best way to look for a pattern in a (potentially) very large text?

I could use Regex, but it only accepts a string as an argument. Is there a way to use it with a TextReader or some kind of stream instead?

+2  A: 

No. A regular expression may need to backtrack, and since a stream can only be read forward, the engine would have to keep the entire stream in memory anyway. Even if your regular expression never backtracks, the engine isn't built for this.

Besides, regular expressions aren't very fast anyway. You should look for a pattern matching method that is designed for reading streams.

Guffa
Thanks for your answer. What tool should I use then?
"regular expressions aren't very fast anyway"? Of course that depends on the specific task, but compared with IndexOf() and Substring(), aren't they MUCH faster? Or am I wrong (I tested with classic ASP, a long time ago)?
Rubens Farias
IndexOf and Substring should be much faster than any Regex invocation. Backtracking isn't necessarily a problem; that just means that "seen" data needs to be cached until a match is found - but since matches are presumably small, this is just fine. (The stream is large not because an individual match is, but because it's an unbounded sequence of matches).
Eamon Nerbonne
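For a fixed (literal) pattern, one forward-only option is the Knuth-Morris-Pratt algorithm, which never needs to step back in the input, so nothing has to be buffered. Nobody in this thread proposed it; the sketch below is only an illustration, and the class and method names (StreamingLiteralSearch, Find) are made up for the example. It handles literal strings only, not regular expressions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamingLiteralSearch
{
    // Hypothetical helper: yields the character offset of every occurrence
    // of `needle` (overlapping ones included), reading one char at a time.
    public static IEnumerable<long> Find(TextReader reader, string needle)
    {
        if (string.IsNullOrEmpty(needle)) yield break;

        // failure[i] = length of the longest proper prefix of needle[0..i]
        // that is also a suffix of needle[0..i].
        var failure = new int[needle.Length];
        for (int i = 1, k = 0; i < needle.Length; i++)
        {
            while (k > 0 && needle[i] != needle[k]) k = failure[k - 1];
            if (needle[i] == needle[k]) k++;
            failure[i] = k;
        }

        long position = 0;  // offset of the character being examined
        int matched = 0;    // number of needle characters matched so far
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            // On a mismatch, fall back within the needle instead of rewinding the stream.
            while (matched > 0 && c != needle[matched]) matched = failure[matched - 1];
            if (c == needle[matched]) matched++;
            if (matched == needle.Length)
            {
                yield return position - needle.Length + 1;
                matched = failure[matched - 1];
            }
            position++;
        }
    }
}
```

Usage would look something like `foreach (long offset in StreamingLiteralSearch.Find(new StreamReader(path), "needle")) { ... }`, where `path` is whatever file you are scanning. Memory use is proportional to the pattern length, not the stream length.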
+1  A: 

Since your patterns are relatively simple (as indicated in your edit), you should be able to use regular expressions and just read the stream line-by-line. Here is an example that finds words. (Maybe, depending on how you are defining "words." :-) )

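using System;
using System.IO;
using System.Text.RegularExpressions;

var pattern = new Regex(@"\b\w+\b");

using (var reader = new StreamReader(@"..\..\TextFile1.txt"))
{
    string line;
    // ReadLine() returns null at the end of the stream. Peek() is less
    // reliable here: it also returns -1 if the stream does not support seeking.
    while ((line = reader.ReadLine()) != null)
    {
        // Walk through every match on the current line.
        Match match = pattern.Match(line);
        while (match.Success)
        {
            Console.WriteLine(match.Value);
            match = match.NextMatch();
        }
    }
}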

If you are looking for something that involves newlines, then you will have to be a little creative. ReadLine() strips the line terminator, so add it back to the string being searched. Or, if multiple newlines are significant, build the search string in memory with multiple ReadLine() calls until a non-empty line is found. Then process that and move on in the stream.

Dave
Thanks Dave for the idea. This would work well if the lines are small. However, it may be that the file has only one gigantic line. But the chunk-by-chunk approach may be what I am looking for.
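A chunk-by-chunk variant could look roughly like the sketch below. It is not from any of the answers; the names (ChunkedRegexSearch, Find, maxMatchLength, chunkSize) are invented for the example. It assumes that no match is longer than maxMatchLength characters and that the pattern does not rely on lookbehind, since text before the carried-over tail is discarded. Any match that ends too close to the end of the current window is deferred until the next chunk has been read, so a match split across two chunks is still found whole.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class ChunkedRegexSearch
{
    // Hypothetical helper: yields (offset in the stream, matched text).
    // Assumption: no match is longer than maxMatchLength characters.
    public static IEnumerable<(long Offset, string Value)> Find(
        TextReader reader, Regex pattern, int maxMatchLength, int chunkSize = 4096)
    {
        var buffer = new char[chunkSize];
        string carry = "";     // unprocessed tail of the previous window
        long carryOffset = 0;  // stream offset of carry[0]

        while (true)
        {
            // ReadBlock keeps reading until the buffer is full or the stream ends.
            int read = reader.ReadBlock(buffer, 0, buffer.Length);
            bool atEnd = read < buffer.Length;
            string window = carry + new string(buffer, 0, read);

            // A match ending after safeLimit could still change once more text arrives.
            int safeLimit = atEnd ? window.Length
                                  : Math.Max(0, window.Length - maxMatchLength);
            int keepFrom = safeLimit;

            for (Match m = pattern.Match(window); m.Success; m = m.NextMatch())
            {
                if (m.Index + m.Length > safeLimit)
                {
                    keepFrom = m.Index;  // defer this match to the next window
                    break;
                }
                yield return (carryOffset + m.Index, m.Value);
            }

            if (atEnd) yield break;

            carry = window.Substring(keepFrom);
            carryOffset += keepFrom;
        }
    }
}
```

With a file that is one gigantic line, memory use stays around chunkSize plus maxMatchLength characters as long as the length assumption holds. If a match turns out to be longer than maxMatchLength, the window simply keeps growing until the match can be reported.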