Given an input string in C#, how can I apply a series of Regex/LINQ operations so that one regex matches certain pieces of the string, and a second regex is then applied to all the pieces that the first regex did not match?

In other words, for the input string:

<!-- Lorem ipsum dolor sit amet, consectetur adipiscing elit -->  
<!-- The quick brown fox jumps over the lazy dog -->
Lorem ipsum dolor sit amet, consectetur adipiscing elit 
The quick brown fox jumps over the lazy  dog
<!-- Lorem ipsum dolor sit amet, consectetur adipiscing elit -->  
<!-- The quick brown fox jumps over the lazy dog -->
Lorem ipsum dolor sit amet, consectetur adipiscing elit
The quick brown fox jumps over the lazy dog

I want to use Regex1 to match the lines wrapped in <!-- --> and perform some operation on them without parsing them any further. Then I want Regex2 to match things in the pieces of the string that Regex1 did not match, for example to find all occurrences of the words "fox" and "dog" in those lines and perform some operation on those words.

What is the best way to combine Regex/LINQ operations in a situation like this?

+1  A: 

You're in luck since .NET supports variable-length lookbehind.

Therefore, you can use two regexes in sequence.

First, use

^<!--(.*)-->\s*$

to find all comment lines. Capture group 1 ($1 in a replacement string) will contain whatever is between the delimiters. For example:

Regex paragraphs = new Regex(@"^<!--(.*)-->\s*$", RegexOptions.Multiline);
Match matchResults = paragraphs.Match(subjectString);
while (matchResults.Success) {
    // matched text: matchResults.Value
    // text between the delimiters: matchResults.Groups[1].Value
    // match start: matchResults.Index
    // match length: matchResults.Length
    matchResults = matchResults.NextMatch();
}

Second, to find and manipulate "dog" and "fox" in the other lines, you can use

(?<!^<!--.*)(dog|fox)

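This regex means "match dog or fox unless the line starts with <!--". It relies on RegexOptions.Multiline so that ^ matches at the start of every line rather than only at the start of the string. So if you want to replace them with, say, "cat", use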

resultString = Regex.Replace(subjectString, "(?<!^<!--.*)(dog|fox)", "cat", RegexOptions.Multiline);
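
As a quick check of the lookbehind behaviour described above, here is a minimal self-contained sketch. The class name LookbehindDemo and the helper ReplaceOutsideComments are made up for illustration, and the input is a shortened version of the sample from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Hypothetical helper: replaces "dog" or "fox" with "cat", but only on
    // lines that do not start with "<!--". Multiline makes ^ match at the
    // start of each line; "." never crosses a newline, so the lookbehind
    // only inspects the current line.
    public static string ReplaceOutsideComments(string input)
    {
        return Regex.Replace(input, @"(?<!^<!--.*)(dog|fox)", "cat",
                             RegexOptions.Multiline);
    }

    static void Main()
    {
        string input =
            "<!-- The quick brown fox jumps over the lazy dog -->\n" +
            "The quick brown fox jumps over the lazy dog\n";

        // The comment line should come back unchanged; the plain line
        // should have both words replaced.
        Console.Write(ReplaceOutsideComments(input));
    }
}
```

Note that this only skips lines that begin with <!--; a comment that starts in the middle of a line would not be protected by this pattern.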
Tim Pietzcker
Correct me if I'm wrong, but here you are running two passes in sequence over the entire input string. What I would like to do is, inside the first while loop, run the second regex on the text between the current match and the previous one, since that substring is exactly the unmatched part. It looks like I need to store the match indexes, take the substring between them, and execute the second regex on that substring? Is this more efficient than running the two regexes in sequence as you do? Or is it better to have two groups in the first regex (matching and non-matching) and use the non-matching group as input for the second regex?
Perica Zivkovic
The first code snippet iterates over the matches, each one containing one comment line. The second one does a replace operation on the entire input string at once. Of course, in your example you wouldn't need a regex for the first task anyway: just find the lines that start and end with your delimiters. Whether this matters performance-wise depends on what exactly you want to do with the results.
Tim Pietzcker
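
The single-pass idea raised in the comments above (remember where the previous comment match ended and run the second regex only on the text in between) could be sketched roughly as follows. This is an illustration under assumptions, not a measured or recommended solution: the class name SinglePassDemo, the method Process, and the two placeholder operations (upper-casing comment lines, replacing the words with "cat") are all invented for the example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class SinglePassDemo
{
    // Same comment-line pattern as in the answer.
    static readonly Regex CommentLine =
        new Regex(@"^<!--(.*)-->\s*$", RegexOptions.Multiline);

    // Second regex, applied only to text outside comment lines.
    static readonly Regex Words = new Regex("dog|fox");

    public static string Process(string input)
    {
        var output = new StringBuilder();
        int previousEnd = 0; // end of the previous comment match

        foreach (Match comment in CommentLine.Matches(input))
        {
            // Text between the previous comment line and this one.
            string gap = input.Substring(previousEnd, comment.Index - previousEnd);
            output.Append(Words.Replace(gap, "cat"));

            // Placeholder operation on the comment line as a whole.
            output.Append(comment.Value.ToUpperInvariant());

            previousEnd = comment.Index + comment.Length;
        }

        // Whatever follows the last comment line.
        output.Append(Words.Replace(input.Substring(previousEnd), "cat"));
        return output.ToString();
    }

    static void Main()
    {
        string input =
            "<!-- the lazy dog -->\n" +
            "The quick brown fox jumps over the lazy dog\n";
        Console.Write(Process(input));
    }
}
```

Whether this is faster than two sequential passes is not established here; it mainly avoids the lookbehind and keeps the two operations separate, at the cost of manual index bookkeeping.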