ansaurus

Question

Regular Expression (C# flavor) to fetch first <p></p> after heading tag

Answer 1

+2 A:

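// Requires: using System.Text.RegularExpressions;
// Singleline lets '.' match newlines, so the pattern also works on multi-line HTML.
Match m = Regex.Match(strHTMLSource, "^.*?</h[123]>.*?<p>(.*?)</p>",
    RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

// Group 1 holds the paragraph body; fall back to an empty string when nothing matches.
string para = m.Success ? m.Groups[1].Value.Trim() : string.Empty;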

LukeH 2009-05-06 22:38:09

Thanks to both of you! You're both basically right; I'm marking this one "best" solely because it's a little more complete. I still need to make some minor tweaks to the code, since it often pulls up the empty string for random linkbacks. But thanks!

2009-05-06 23:03:40

Answer 2

A:

This regex finds the paragraph that immediately follows each h1, h2, or h3 (with only whitespace in between). If you want only the very first such paragraph on the page, just keep the first match.

(?<=</h[1-3]>\s*?<p>)([\s\S]*?)(?=</p>)

You will probably need to adjust the matches for the <p> tags to account for attributes.
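
One possible way to make that adjustment is sketched below. It is not the answerer's code: the class name FirstParagraphs and the method Find are made up for illustration, and the `<p\b[^>]*>` fragment is just one assumed way of tolerating attributes on the opening tag. It relies on .NET allowing variable-length lookbehind.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FirstParagraphs
{
    // Same idea as the pattern above, but "<p\b[^>]*>" also accepts attributes
    // on the opening <p> tag (hypothetical tweak, not from the original answer).
    private static readonly Regex AfterHeading = new Regex(
        @"(?<=</h[1-3]>\s*<p\b[^>]*>)[\s\S]*?(?=</p>)",
        RegexOptions.IgnoreCase);

    // Returns the trimmed body of each paragraph that directly follows a heading.
    public static List<string> Find(string html)
    {
        var results = new List<string>();
        foreach (Match m in AfterHeading.Matches(html))
        {
            results.Add(m.Value.Trim());
        }
        return results;
    }
}
```

Taking the first element of the returned list (if any) gives the very first paragraph on the page. Attribute values that themselves contain a ">" character would still trip this pattern up.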

Alan McBee 2009-05-06 22:47:54

Answer 3

+1 A:

Personally I would use XPath queries to do what you're trying to achieve, much easier imo than fiddling with regexes.
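
A minimal sketch of what that could look like with System.Xml follows. It makes several assumptions: the markup is well-formed XML (XHTML) with no default namespace, element names are lowercase (XPath is case-sensitive), and the paragraph is a sibling of the heading. The class name XPathFirstParagraph and the method Find are hypothetical. Tag-soup HTML would need a real HTML parser in front of the query.

```csharp
using System;
using System.Xml;

public static class XPathFirstParagraph
{
    // Returns the text of the first <p> sibling that follows the first
    // h1/h2/h3 in document order, or an empty string if there is none.
    // Assumes well-formed XML without a default namespace.
    public static string Find(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        XmlNode p = doc.SelectSingleNode(
            "(//h1 | //h2 | //h3)[1]/following-sibling::p[1]");
        return p == null ? string.Empty : p.InnerText.Trim();
    }
}
```

Unlike the regex approaches, this query skips over non-paragraph siblings (a div, say) that sit between the heading and the paragraph, and InnerText drops any inline markup inside the paragraph.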

Blake Pettersson 2009-05-06 23:04:46

Answer 4

A:

There are a lot of cases that a regular expression won't handle properly. For instance:

<p>foo<p>bar</p>baz</p>

<p>This paragraph is valid <!-- <p>This one isn't</p> --> </p>

A regular expression that captures the text between the <p> and </p> will capture (respectively):

foo<p>bar

This paragraph is valid <!-- <p>This one isn't
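
The two captures above can be reproduced with a short sketch. The pattern `<p>(.*?)</p>` is an assumed stand-in for "a regular expression that captures the text between the tags", and the class name NaiveParagraphCapture is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveParagraphCapture
{
    // Captures everything between the first <p> and the first </p> after it,
    // with no awareness of nesting or comments.
    public static string Capture(string html)
    {
        Match m = Regex.Match(html, "<p>(.*?)</p>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Calling Capture on each of the two sample inputs returns the truncated strings shown above, because the lazy quantifier stops at the first closing tag it meets, whether or not that tag belongs to the opening one.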

If I had to process HTML found in the wild, I'd use MSHTML to parse the HTML, and then search through the DOM to find the objects.

Using MSHTML is not anywhere near as lightweight as using a regular expression, to be sure. But MSHTML is designed to make sense out of the sloppiest of web pages. I'd much rather use all of the knowledge of messy real-world use cases that it's designed to handle than discover them all for myself.

See the answer to this question for a bit of sample code.

Robert Rossney 2009-05-07 00:32:41
