What's the best way to parse fragments of HTML in C#?

For context, I've inherited an application that uses a great many composite controls, which is fine, but a good number of them are rendered using a long sequence of literal controls, which is fairly terrifying. I'm trying to get the application under unit tests, and I want tests for these controls that will find out whether they're generating well-formed HTML and, in a dream solution, validate that HTML.

+2  A: 

Have a look at the HTML Agility Pack. It's very compatible with the .NET XmlDocument class, but it is much more forgiving about HTML that's not clean/valid XHTML.
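
As a rough sketch of how this could look in a test helper: the snippet below loads a fragment with the HTML Agility Pack and collects whatever the parser recorded in its ParseErrors collection, so a test can fail on them instead of silently accepting the repaired tree. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HapFragmentChecker, GetParseErrors) are made up for illustration. Which problems the parser actually reports depends on the library version, so check it against your own markup before relying on it.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of the library.
public static class HapFragmentChecker
{
    // Parses the fragment leniently and returns one message per recorded parse error.
    public static List<string> GetParseErrors(string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        // ParseErrors holds the problems the parser noticed while repairing the input.
        return doc.ParseErrors
                  .Select(e => $"{e.Code} at line {e.Line}: {e.Reason}")
                  .ToList();
    }
}
```

A unit test could then assert that GetParseErrors(renderedHtml) comes back empty, and print the messages when it does not.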

James Curran
That library seems a little too good - I'm testing the code, so it's a good thing if tags left open blow the parser up.
Dan Monego
+1  A: 

I've used an SGMLReader to produce a valid XML document from HTML, and then extracted what I needed with XPath or transformed it to another format with XSLT.
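
A minimal sketch of that pipeline, assuming the open-source SgmlReader library (namespace Sgml) is referenced; the wrapper class SgmlToXml and its Load method are hypothetical names. Note that this approach repairs the markup on the way in, so it suits extracting data more than detecting malformed output.

```csharp
using System.IO;
using System.Xml;
using Sgml; // assumption: the SgmlReader library is referenced

// Hypothetical wrapper, not part of the library.
public static class SgmlToXml
{
    // Reads loose HTML through SgmlReader and returns it as an XmlDocument.
    public static XmlDocument Load(string html)
    {
        var sgml = new SgmlReader();
        sgml.DocType = "HTML";                  // use the built-in HTML DTD
        sgml.CaseFolding = CaseFolding.ToLower; // normalise tag names to lower case
        sgml.InputStream = new StringReader(html);

        var doc = new XmlDocument();
        doc.Load(sgml); // SgmlReader is an XmlReader, so XmlDocument can consume it
        return doc;
    }
}
```

Once loaded, the usual XmlDocument members apply, for example SgmlToXml.Load(html).SelectNodes("//a") to find the links.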

Mark Lindell
+1  A: 

If the HTML is XHTML compliant, you can use the built-in System.Xml namespace.
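
For the unit-testing case in the question, a strict XML parser has the advantage that unclosed tags throw. The sketch below wraps a rendered fragment in a synthetic root element (fragments often have several top-level elements) and loads it with XmlDocument. FragmentChecker and IsWellFormed are illustrative names. One caveat: named HTML entities such as &nbsp; are not defined without a DTD, so they are reported as errors too.

```csharp
using System.Xml;

// Hypothetical helper for unit tests.
public static class FragmentChecker
{
    // Returns true when the fragment parses as well-formed XML;
    // otherwise returns false and sets error to the parser's message.
    public static bool IsWellFormed(string fragment, out string error)
    {
        // Synthetic root so that several sibling elements still form one document.
        string wrapped = "<root>" + fragment + "</root>";
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml(wrapped);
            error = "";
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

In a test, render the control to a string, pass it to IsWellFormed, and include the error text in the assertion message so the failing position is visible.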

Kieron
A: 

You can also look into HTML Tidy for HTML parsing/cleanup. I don't think they have specific .NET libraries, but you might be able to run the binary from the command line, or run the Java libraries through IKVM.

Chris Marasti-Georg