ansaurus

Question

Answer 1

+7 A:

Don't use a regular expression for this - use something like the Html Agility Pack.

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry). It is a .NET code library that lets you parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).

Then you can extract the body with an XPath expression such as //body.
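As a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced (the class and method names AgilityBodyExtractor and ExtractBody are made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

public static class AgilityBodyExtractor
{
    // Hypothetical helper: returns the inner HTML of the first <body> element,
    // or null when the XPath query finds no such node.
    public static string ExtractBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        return body?.InnerHtml;
    }
}
```

Because the parser tolerates malformed markup, this should cope with attributes on the tag, mixed case and sloppy nesting without any pattern tuning.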

Andrew Hare 2009-06-11 17:33:56

I agree. I've used this and must say it's fast, neat and clean.

Saif Khan 2009-06-11 17:46:45

Answer 2

A:

This should get you pretty close:

(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)
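A minimal sketch of using this pattern from C#, with the attribute group marked optional so that a bare <body> tag also matches. The names BodyExtractor and ExtractBody are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // (?is) turns on IgnoreCase and Singleline inline.
    // Group 1 lazily captures everything up to the first </body>, </html>,
    // or the end of the input, whichever comes first.
    private static readonly Regex BodyPattern = new Regex(
        @"(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)");

    // Returns the captured body text, or null when no <body> tag is found.
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

The trailing alternatives mean that a document with a missing closing tag still yields a capture instead of failing to match.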

Jeremy Stein 2009-06-11 19:55:26

Answer 3

A:

How about something like this?

It captures everything between <body></body> tags (case insensitive due to RegexOptions.IgnoreCase) into a group named theBody.

RegexOptions.Singleline makes the dot match newline characters too, so the pattern can span multiline HTML as if it were a single line.

If the HTML does not contain <body></body> tags, the Success property of the match will be false.

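        // Requires: using System.Text.RegularExpressions;

        // Populate the html string here (sample value so the snippet compiles)
        string html = "<html><body>\nHello\n</body></html>";

        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        // [^>]* lets the opening tag carry attributes, e.g. <body class="x">
        Regex regx = new Regex( @"<body\b[^>]*>(?<theBody>.*)</body>", options );

        Match match = regx.Match( html );

        if ( match.Success ) {
            // The named group holds everything between the tags
            string theBody = match.Groups["theBody"].Value;
        }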
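To make the failure case explicit, the same idea can be wrapped in a small helper. This is only a sketch: BodyMatcher and TryGetBody are hypothetical names, and the opening-tag pattern here is assumed to tolerate attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyMatcher
{
    // Same options and group name as the snippet above.
    private static readonly Regex Regx = new Regex(
        @"<body\b[^>]*>(?<theBody>.*)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns true and sets body when <body>...</body> is present;
    // otherwise returns false and sets body to null.
    public static bool TryGetBody(string html, out string body)
    {
        Match match = Regx.Match(html);
        body = match.Success ? match.Groups["theBody"].Value : null;
        return match.Success;
    }
}
```

A caller can then branch on the return value, for example `if (BodyMatcher.TryGetBody(html, out string body)) { ... }`, instead of inspecting the Match object directly.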

Darryl 2009-06-17 15:04:04


Regex Extract html Body
