I have the following regular expression:

(?:<(?<tag>\w*)>(?<text>.*)</\k<tag>>)

I want it to grab the text within the first HTML element.

For example:

<p>This should capture</p>This shouldn't

Works, but ...

<p>This should capture</p><p>This shouldn't</p>

Doesn't work. As you'd expect, it returns:

This should capture</p><p>This shouldn't

I'm racking my brains here. How can I just have it select the FIRST inner text?

(I'm trying to be tag-agnostic, so <strong>This should match</strong> is equally appropriate, etc.)

+3  A: 

You should use the HTML Agility Pack.

For example:

doc.DocumentNode.Descendants("p").First().InnerText
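
For readers without .NET at hand, the same parser-based idea can be sketched in Python with the standard library's html.parser. This is not the HTML Agility Pack API; the class FirstElementText and the helper first_inner_text are hypothetical names made up for illustration. The sketch assumes every start tag has a matching end tag, so void elements such as a bare <br> would throw off the depth counter.

```python
from html.parser import HTMLParser


class FirstElementText(HTMLParser):
    """Collects the text inside the first element encountered (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the first element
        self.done = False   # True once the first element has been closed
        self.parts = []     # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if not self.done and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Only keep text that sits inside the first element.
        if not self.done and self.depth > 0:
            self.parts.append(data)


def first_inner_text(html):
    parser = FirstElementText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(first_inner_text("<p>This should capture</p><p>This shouldn't</p>"))
```

Because the parser tracks nesting depth rather than matching tag names textually, it is tag-agnostic and also copes with an element nested inside another element of the same name.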
SLaks
I actually am looking at using the HtmlAgilityPack for another section of the project so this is "on the radar". I might just use it in the longer term.
Program.X
+1  A: 

To make the * quantifier non-greedy (lazy), add a ? after it, so that it matches as few characters as possible:

(?:<(?<tag>\w*)>(?<text>.*?)</\k<tag>>)
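
A quick way to see the difference is to run both patterns over the failing input from the question. The sketch below uses Python rather than .NET, so the named-group syntax is an assumption of the translation: Python writes (?P<tag>...) and (?P=tag) where .NET writes (?<tag>...) and \k<tag>. The greedy and lazy behaviour itself is the same in both engines.

```python
import re

# Python spelling of the two patterns: (?P<name>...) defines a named group
# and (?P=name) is the backreference (.NET uses (?<name>...) and \k<name>).
GREEDY = re.compile(r"<(?P<tag>\w*)>(?P<text>.*)</(?P=tag)>")
LAZY = re.compile(r"<(?P<tag>\w*)>(?P<text>.*?)</(?P=tag)>")

html = "<p>This should capture</p><p>This shouldn't</p>"

# Greedy .* runs to the end of the input and backtracks to the LAST </p>.
greedy_text = GREEDY.search(html).group("text")

# Lazy .*? expands one character at a time and stops at the FIRST </p>.
lazy_text = LAZY.search(html).group("text")

print(greedy_text)  # This should capture</p><p>This shouldn't
print(lazy_text)    # This should capture
```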
HoLyVieR
Thanks. I'm going to go for that only because it is very simple work I am doing and I am coping with failure elegantly. Then again @BlueRaja has just blown a hole in my theory. Sorry.
Program.X
+2  A: 

Stop. Just stop. If you are parsing HTML, use an HTML parser (or XML if you're dealing with valid XHTML). See this answer for more info.
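
To illustrate the kind of input that trips up a regex here, the sketch below runs a lazy version of the question's pattern over two small inputs. It is written in Python, so the group syntax ((?P<tag>...) and (?P=tag)) is a translation of the .NET original, and the sample strings are made up for the demonstration.

```python
import re

# Lazy variant of the question's pattern, in Python syntax.
LAZY = re.compile(r"<(?P<tag>\w*)>(?P<text>.*?)</(?P=tag)>")

# Case 1: an element nested inside another element of the same name.
# The regex cannot count nesting, so it stops at the first </div>.
nested = "<div><div>inner</div>outer</div>"
captured = LAZY.search(nested).group("text")
print(captured)  # <div>inner

# Case 2: a start tag that carries an attribute.
# \w* cannot cross the space before "class", so nothing matches at all.
with_attribute = '<p class="x">text</p>'
no_match = LAZY.search(with_attribute)
print(no_match)  # None
```

Both cases are ordinary HTML, and an HTML parser handles them without special treatment.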

Hank Gay