Hi

I need to get all the content inside the body tag of an HTML file using C#. Are there any good, effective ways of doing this?

+3  A: 

Check out the HTML Agility Pack, which handles all sorts of HTML manipulation.

It gives you an interface somewhat similar to the XmlDocument XML handling interface:

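 using HtmlAgilityPack;

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");

 // "//body" also matches documents that lack an explicit <html> root element.
 HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("//body");

 // SelectSingleNode returns null when nothing matches.
 if (bodyNode != null)
 {
     // InnerHtml is the markup between <body> and </body>.
     string bodyContent = bodyNode.InnerHtml;
 }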
marc_s
+2  A: 

You may take a look at SgmlReader and HTML Agility Pack.

Darin Dimitrov
That URL to SgmlReader leads to a very old version that hasn't been touched in years. SgmlReader is maintained by MindTouch these days. I would recommend SgmlReader over HtmlAgilityPack because of its lower-level approach and active maintenance. http://developer.mindtouch.com/en/docs/SgmlReader
asbjornu
If your HTML isn't well-formed XHTML, I think you'll find that SgmlReader (and yes, use the MindTouch version mentioned in the comment above) is your best bet.
nrkn
@asbjornu - Looking through the conversion examples on the MindTouch site, I can't find a single one where SgmlReader produces a DOM that matches what browsers do. I don't know whether HTML Agility Pack is any better, but I wasn't impressed.
Alohci
@Alohci I agree that SgmlReader isn't up to par with browser parsers, but there aren't many alternatives native to C# that do it better. HtmlAgilityPack surely doesn't.
asbjornu
A: 

It's easy enough to pull the page code into a string, search for the first occurrence of "<body" and of "</body", and do a little index math to extract the content between them.
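
A minimal sketch of that string-search approach, assuming the whole page is already in a string. The class and method names (BodyExtractor, ExtractBody) are made up for illustration. It is only a heuristic: a "<body" inside a comment or script, or a ">" inside an attribute value of the body tag, will throw it off.

```csharp
using System;

static class BodyExtractor
{
    // Returns the text between the opening <body ...> tag and the last </body,
    // or null if either tag cannot be found.
    public static string ExtractBody(string html)
    {
        // Locate the opening tag, ignoring case (<BODY> is valid HTML).
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;

        // Skip past any attributes to the end of the opening tag.
        int start = html.IndexOf('>', open);
        if (start < 0) return null;
        start++;

        // Use the last closing tag so stray "</body" text earlier on is ignored.
        int end = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (end < start) return null;

        return html.Substring(start, end - start);
    }
}
```

Typical use would be something like ExtractBody(File.ReadAllText("file.htm")). For anything beyond a quick one-off, a real parser is the safer choice.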

Dutchie432
A: 

If it happens to be XHTML, then you could use XPath.
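
A rough sketch of the XPath route using the built-in System.Xml classes, assuming the input really is well-formed XHTML in the standard XHTML namespace. The names XhtmlBody and ExtractBody are hypothetical. The DOCTYPE is ignored here, so named entities such as &nbsp; that rely on the DTD will not resolve and will cause a parse error.

```csharp
using System.IO;
using System.Xml;

static class XhtmlBody
{
    // Returns the inner XML of the <body> element, or null if there is none.
    public static string ExtractBody(string xhtml)
    {
        // Ignore the DOCTYPE so the parser does not try to process the XHTML DTD.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };

        var doc = new XmlDocument();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            doc.Load(reader);
        }

        // XHTML elements live in a namespace, so the XPath needs a prefix for it.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        XmlNode body = doc.SelectSingleNode("/x:html/x:body", ns);
        return body?.InnerXml;
    }
}
```

Note that InnerXml may re-declare the XHTML namespace on the child elements it serializes, so the returned markup is not necessarily byte-for-byte what was in the file.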

Bryan
A: 

Use the XML methods with XPath if you only want a specific node. For more advanced HTML manipulation, use the HTML Agility Pack.

Tomas Voracek