Okay, so I'm working with HTML and I want to match everything between two comments generated by a CMS, including line breaks.

Example:

<!-- Start Magic -->
<h2>My title</h2>
<p>Here's some content</p>
<p>And hey look, a linebreak!
And here's another for good measure!
</p>
<!-- End Magic -->

And here's the Regex I'm using to extract the guts:

Regex.Match(magic, @"<!-- Start Magic -->(?<guts>[\s\S]*?)<!-- End Magic -->");

Now I should note that this actually works fine. I just wondered whether using [\s\S]*? is the best way of matching everything (including line breaks) in a non-greedy fashion.
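For reference, here is a minimal sketch of how the `guts` group could be read back out of the match. The `MagicExtractor` class and `TryExtractGuts` method are names made up for this illustration; only `Regex.Match`, `Match.Success` and `Match.Groups` are part of .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MagicExtractor
{
    // Hypothetical helper (not from the question): copies the text between
    // the two CMS comments into 'guts' and reports whether both were found.
    public static bool TryExtractGuts(string html, out string guts)
    {
        // [\s\S] matches any character, line breaks included; *? keeps it lazy.
        Match m = Regex.Match(
            html,
            @"<!-- Start Magic -->(?<guts>[\s\S]*?)<!-- End Magic -->");

        guts = m.Success ? m.Groups["guts"].Value : string.Empty;
        return m.Success;
    }
}
```

The captured text keeps the line breaks that sit directly after the start comment and before the end comment, so a caller may want to `Trim()` the result.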

A: 

If you want to match everything in a non-greedy fashion,

@"<!-- Start Magic -->(.*?)<!-- End Magic -->"

should work. Didn't test it though.

Daniel S
The problem with that is that it will stop at a newline, right?
Franz
Yes Franz, you are right. `.` matches everything but line breaks. Sorry Daniel :(
Iain Fraser
True! I forgot that. In that case, `[\s\S]` is correct.
Daniel S
It's okay, it's great to be able to learn new stuff everyday here.
Daniel S
Yes, I'm getting quite addicted to this place lately. I'm just starting out with C# and I have to say, being on here has really accelerated my learning! Also, the accepted answer uses your `.*?` suggestion, just adding the option that lets the dot match line breaks, so there you go :)
Iain Fraser
Same here. And I even learn through not accepted answers ;)
Franz
+1  A: 

I believe `[\s\S]` is equivalent to `.` if you use the ignore-whitespace modifier, if that is possible in C#.

Franz
Just to be complete: That modifier is `m`.
Franz
Hmmm... I'll look into that. I've always just used `\s\S` in the past and today (for whatever reason) I thought to myself, "Am I writing WTF code, is there a better way?". So I figured I'd try to find out :)
Iain Fraser
I quite like your solution, to be honest. It's a pretty cool way to get around forgetting the modifier all the time ;)
Franz
Hey Franz, thanks for your answer and your comments, it was an interesting discussion :). I gave the accept to David though because he provided the way of introducing the `m` modifier in C#. Although if you're in C#, the `[\s\S]*?` results in shorter code unless you're using more than 4 of them.
Iain Fraser
Lol, that's no problem. I was quite interested in seeing the outcome in C#, too.
Franz
Um, the ignore-whitespace modifier is `x`, but it has no effect on the dot. It's the `s` modifier that lets the dot match newlines.
Alan Moore
+4  A: 

There is another method using the RegexOptions shown below:

Regex.Match(magic, @"<!-- Start Magic -->.*?<!-- End Magic -->", RegexOptions.Singleline);

With RegexOptions.Singleline you are telling the C# regex engine to change the meaning of the dot so that it matches every character (instead of the default, which is every character except \n).

This doesn't address "the best way" of doing this since that is rather subjective, including considerations like performance and readability.
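To make the difference concrete, here is a small sketch that runs the same `.*?` pattern with and without the option. The `SinglelineDemo` class and its method names are invented for this example; `Regex.IsMatch` and `RegexOptions.Singleline` are the real .NET members.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern as in the answer, with the named group from the question.
    private const string Pattern =
        @"<!-- Start Magic -->(?<guts>.*?)<!-- End Magic -->";

    // Default options: '.' does not match '\n', so content that spans
    // several lines cannot be bridged by '.*?'.
    public static bool MatchesByDefault(string html) =>
        Regex.IsMatch(html, Pattern);

    // RegexOptions.Singleline: '.' matches every character, '\n' included.
    public static bool MatchesWithSingleline(string html) =>
        Regex.IsMatch(html, Pattern, RegexOptions.Singleline);
}
```

For input where a line break sits between the two comments, the first method should report no match and the second should report a match. For content on a single line, both behave the same.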

David Hall
Ok, so this is the C# equivalent of the m modifier?
Franz
Yep - gosh, I miss Perl. That was where I first met the concept of regular expressions years ago.
David Hall
Hey, would you look at that :). Nice work David
Iain Fraser
Oops, there actually is a tiny error in this one. The `.*` pattern should probably be made non-greedy by appending a question mark, just in case (although it probably does not matter in this application - theoretically, it would be better practice, and possibly faster, too).
Franz
I agree - it would probably be faster, and it would almost certainly avoid a bug if you have two instances of the end magic - I'll amend the code. Thanks.
David Hall
Actually, it's the C# equivalent of the `s` modifier. That is, `s` is what Perl uses, and most other flavors copy Perl. But some of them call it "DOTALL" mode, a much better name than "singleline" IMO. And just to keep us on our toes, Ruby uses the `m` modifier for that mode and calls it "multiline".
Alan Moore
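Two points from the comments above can be sketched together: the Perl-style `s` modifier can also be written inline as `(?s)` in a .NET pattern, and the lazy `.*?` matters once the input holds more than one end comment. The `InlineDotAllDemo` class and its method names are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineDotAllDemo
{
    // (?s) at the start of the pattern switches on the same mode as
    // RegexOptions.Singleline, so no options argument is needed.

    // Lazy: stops at the first "<!-- End Magic -->" after the start comment.
    public static string Lazy(string html) =>
        Regex.Match(html, @"(?s)<!-- Start Magic -->(.*?)<!-- End Magic -->")
             .Groups[1].Value;

    // Greedy: runs on to the last "<!-- End Magic -->" in the input.
    public static string Greedy(string html) =>
        Regex.Match(html, @"(?s)<!-- Start Magic -->(.*)<!-- End Magic -->")
             .Groups[1].Value;
}
```

With two magic blocks in one string, `Lazy` should return only the contents of the first block, while `Greedy` should swallow everything up to the final end comment, including the first end comment and the second start comment.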