I have the following regex and would like it to match both of the following lines. It appears to stop at the first end tag it finds rather than the last. How can it be modified to find the last one instead of the first?

 Regex: &lt;div(?<Attr>.*?)&gt;(?<Content>.*?)&lt;/div&gt;

    Currently matches: &lt;div class="test"&gt;Test Div&lt;/div&gt;

    Needs to match: &lt;div class="test"&gt;Test Div&lt;div&gt;Another Test&lt;/div&gt;&lt;/div&gt;
A: 

You’re using the non-greedy quantifier *?, which matches as few repetitions as possible. If you want to match as much as possible, use the greedy version without the ?.
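To illustrate the difference, here is a minimal Python sketch. It assumes the entities in the question stand for literal angle brackets, and it uses Python's `(?P<name>...)` spelling for named groups instead of the `(?<name>...)` form in the question. Only the Content group is made greedy; the variable names are just for illustration.

```python
import re

# Sample input, assuming the question's entities represent literal < and >.
html = '<div class="test">Test Div<div>Another Test</div></div>'

# Lazy Content group: stops at the first closing tag it can reach.
lazy = re.compile(r'<div(?P<Attr>.*?)>(?P<Content>.*?)</div>')

# Greedy Content group: runs to the end, then backtracks to the last closing tag.
greedy = re.compile(r'<div(?P<Attr>.*?)>(?P<Content>.*)</div>')

lazy_match = lazy.search(html)
greedy_match = greedy.search(html)

print(lazy_match.group(0))
print(greedy_match.group('Attr'))
print(greedy_match.group('Content'))
```

Attr is left lazy on purpose: if it were greedy too, it would swallow everything up to a later `>` and the groups would come out wrong. Note also that a greedy Content group will run across sibling divs on the same line, which is one reason a parser is the safer choice.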

But in general, regular expressions are not suitable for non-regular languages like HTML. You would be better off using an HTML parser.

Gumbo
A: 

Regex is typically greedy, meaning it will try to find the last match. For what you need, you can tell it to match /div> twice, or just include the unique ;</div> before that.

CodeJoust
+2  A: 

Not really an answer, but an observation based on experience. In general, regex-based approaches to pattern-matching HTML will give you endless grief and ultimately cannot work properly since HTML is not a regular language. Instead, I would recommend looking at DOM-based mechanisms. I've used, with considerably improved success, both jQuery and phpQuery to deal with hunting for stuff in HTML documents.
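As a rough sketch of the parser-based idea, here is a version using Python's standard-library `html.parser` as a stand-in for the jQuery and phpQuery tools mentioned above (it is not what I actually used). The `OuterDivExtractor` class and its attribute names are hypothetical, and the handling is simplified: it tracks nesting depth for div tags only and rebuilds the inner markup from the parser events.

```python
from html.parser import HTMLParser


class OuterDivExtractor(HTMLParser):
    """Collects the attributes and inner markup of each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <div> nesting depth
        self.divs = []      # list of (attrs_dict, inner_markup)
        self._attrs = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1
            if self.depth == 1:
                # Entering a top-level div: start a fresh capture.
                self._attrs = dict(attrs)
                self._parts = []
                return
        if self.depth >= 1:
            # Nested tag: keep its original source text.
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Leaving the top-level div: store what was captured.
                self.divs.append((self._attrs, ''.join(self._parts)))
                return
        if self.depth >= 1:
            self._parts.append(f'</{tag}>')

    def handle_data(self, data):
        if self.depth >= 1:
            self._parts.append(data)


parser = OuterDivExtractor()
parser.feed('<div class="test">Test Div<div>Another Test</div></div><div>Sibling</div>')
parser.close()
print(parser.divs)
```

Because the depth counter pairs each opening div with its own closing tag, the nested div stays inside the first result and the sibling div is reported separately, which a single greedy or lazy regex cannot do reliably.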

Scott Evernden
+1 - absolutely. For .NET try http://www.codeplex.com/htmlagilitypack
TrueWill