I'm nearly done with a trackback system for my website, but have one last niggling regular expression I just can't get right.

What I'm after is an excerpt of the referring page, where I'm defining the most relevant excerpt as:

The first paragraph (marked by <p></p> tags) that follows either an <h1></h1>, <h2></h2> or <h3></h3> in the HTML Source of the page.

For instance, I can successfully fetch the <title></title> tag for the HTML as follows:

Regex reTITLE = new Regex( @"(?<=<title.*>)([\s\S]*)(?=</title>)",
    RegexOptions.IgnoreCase );

Match match = reTITLE.Match( strHTMLSource );
if (match.Success)
{
    strReferringPageTitle = match.Value.Trim( );
}

My question -- what Regular Expression can I use to fetch the string described in the first part of my post?

PS: I love StackOverflow and this community -- great job, Joel & Co.!

+2  A: 
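// RegexOptions.Singleline lets '.' match newlines, so the pattern can span
// multi-line HTML. Without it, '.' stops at the first line break and the
// match fails whenever the heading and paragraph are not on the first line.
Match m = Regex.Match(strHTMLSource, "^.*?</h[123]>.*?<p>(.*?)</p>",
    RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

string para = m.Success ? m.Groups[1].Value.Trim() : string.Empty;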
LukeH
Thanks to both of you! You're both basically right; I'm marking this one "best" solely because it's a little more complete. I still need to make some minor tweaks to the code; it's often pulling up the empty string for random linkbacks. But thanks!
A: 

This regex finds every paragraph that directly follows an h1, h2, or h3 (only whitespace may sit between the closing heading tag and the <p>). If you want only the very first such paragraph on the page, just keep the first match.

(?<=</h[1-3]>\s*?<p>)([\s\S]*?)(?=</p>)

You will probably need to adjust the matches for the <p> tags to account for attributes.
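
One way to make that adjustment, as a rough sketch rather than a definitive fix: replace the literal <p> with <p\b[^>]*> so that attributes are tolerated. The class name, method name, and sample HTML below are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class AttributeTolerantExcerpt
{
    // <p\b[^>]*> accepts <p>, <p class="x">, and so on, but not <pre>.
    // .NET allows variable-length lookbehind, so \s* and [^>]* are fine here.
    static readonly Regex FirstParagraph = new Regex(
        @"(?<=</h[1-3]>\s*<p\b[^>]*>)([\s\S]*?)(?=</p>)",
        RegexOptions.IgnoreCase);

    // Returns the inner HTML of the first paragraph that directly follows
    // an h1, h2, or h3, or the empty string if there is none.
    public static string Extract(string html)
    {
        Match m = FirstParagraph.Match(html);
        return m.Success ? m.Value.Trim() : string.Empty;
    }

    static void Main()
    {
        // Hypothetical sample input.
        string html = "<h2>News</h2>\n<p class=\"lead\">Hello <b>world</b></p><p>Later</p>";
        Console.WriteLine(Extract(html)); // prints: Hello <b>world</b>
    }
}
```

Note that the captured text still contains any inline markup (such as the <b> tags above), so it may need stripping before display.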

Alan McBee
+1  A: 

Personally I would use XPath queries to do what you're trying to achieve, much easier imo than fiddling with regexes.
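
For what it's worth, here is a minimal sketch of what such a query might look like using System.Xml.Linq. It assumes the page is well-formed XML with no default namespace and lowercase tag names, which pages found in the wild often are not; for those you would need a tolerant HTML parser first. The class name, method name, and sample markup are invented for the example.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

class XPathExcerptSketch
{
    // Assumption: xhtml parses as XML and its elements are in no namespace.
    // XPath name tests are case-sensitive, so <H2> would not match //h2.
    public static string FirstParagraphAfterHeading(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // Take the first h1/h2/h3 in document order, then the first <p>
        // that appears anywhere after it (not necessarily a sibling).
        XElement p = doc.XPathSelectElement(
            "(//h1 | //h2 | //h3)[1]/following::p[1]");

        return p != null ? p.Value.Trim() : string.Empty;
    }

    static void Main()
    {
        string xhtml =
            "<html><body><p>Intro</p><h2>Title</h2>" +
            "<div><p>First after heading</p></div><p>Second</p></body></html>";
        Console.WriteLine(FirstParagraphAfterHeading(xhtml)); // prints: First after heading
    }
}
```

Unlike the regexes, this picks up a paragraph that is nested inside another element after the heading, and Value returns the text with inline tags already removed.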

Blake Pettersson
A: 

There are a lot of use cases that a regular expression won't work properly for. For instance:

<p>foo<p>bar</p>baz</p>

<p>This paragraph is valid <!-- <p>This one isn't</p> --> </p>

A regular expression that captures the text between the <p> and </p> will capture (respectively):

foo<p>bar

This paragraph is valid <!-- <p>This one isn't
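
The two captures above can be reproduced with a short sketch. The pattern here is a plain lazy <p>(.*?)</p>, which is my stand-in for "a regular expression that captures the text between the <p> and </p>"; the class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

class NaiveParagraphCapture
{
    // Lazily captures everything between the first <p> and the next </p>.
    // The regex has no notion of nesting or comments, so it stops at the
    // first </p> it sees, wherever that happens to be.
    public static string Capture(string html)
    {
        Match m = Regex.Match(html, @"<p>(.*?)</p>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    static void Main()
    {
        // prints: foo<p>bar
        Console.WriteLine(Capture("<p>foo<p>bar</p>baz</p>"));

        // prints: This paragraph is valid <!-- <p>This one isn't
        Console.WriteLine(Capture(
            "<p>This paragraph is valid <!-- <p>This one isn't</p> --> </p>"));
    }
}
```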

If I had to process HTML found in the wild, I'd use MSHTML to parse the HTML, and then search through the DOM to find the objects.

Using MSHTML is not anywhere near as lightweight as using a regular expression, to be sure. But MSHTML is designed to make sense out of the sloppiest of web pages. I'd much rather use all of the knowledge of messy real-world use cases that it's designed to handle than discover them all for myself.

See the answer to this question for a bit of sample code.

Robert Rossney