I'm looking for a .NET library that can generate a clean XML tree, ideally a System.Xml.XmlDocument, from invalid HTML code. That is, it should make the kind of best-effort guesses, repairs, and substitutions browsers make when confronted with this situation, and generate a pretend XmlDocument. The library should also be well-maintained. :)

I realize this is a lot (too much?) to ask, and I would appreciate any useful leads. There seem to be a fair number of implementations of this for Java, but I would rather not generate my own bindings. So far for .NET I have found http://www.majestic12.co.uk/projects/html_parser.php, http://users.rcn.com/creitzel/tidy.html#dotnet, and http://sourceforge.net/projects/tidyfornet.

I have not yet built or tested any of these, but judging from the (sparse) docs and rare updates, they do not seem to offer what I'm looking for. So what recommendations do you have, either among these choices or from your past experience?

+6  A: 

The HTML Agility Pack is highly rated. It will certainly do the parsing / best guess etc.

The model is intentionally similar to XmlDocument, including SelectNodes etc. for querying.

If you need XHTML output, there is an OptionOutputAsXml flag; I assume that setting this to true and calling Save results in XHTML.
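As a rough sketch of how that could look (untested here, and assuming the HtmlAgilityPack assembly is referenced): load the broken HTML, set OptionOutputAsXml, Save to a StringWriter, and feed the result to XmlDocument.LoadXml. The class name HtmlToXmlSketch and the helper ToXmlDocument are made up for illustration; whether the saved output always loads cleanly into an XmlDocument is an assumption, not something the library docs promise.

```csharp
using System;
using System.IO;
using System.Xml;
using HtmlAgilityPack; // assumption: the HTML Agility Pack assembly is referenced

static class HtmlToXmlSketch
{
    // Hypothetical helper: converts possibly invalid HTML into an XmlDocument
    // by round-tripping through the Agility Pack's XML output mode.
    public static XmlDocument ToXmlDocument(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.OptionOutputAsXml = true; // ask Save to emit XML instead of HTML
        htmlDoc.LoadHtml(html);           // tolerant parse of the input markup

        using (var writer = new StringWriter())
        {
            htmlDoc.Save(writer);         // serialize the repaired tree

            var xmlDoc = new XmlDocument();
            // Assumption: the serialized output is well-formed XML.
            // LoadXml throws XmlException if it is not.
            xmlDoc.LoadXml(writer.ToString());
            return xmlDoc;
        }
    }

    static void Main()
    {
        // Deliberately sloppy input: unclosed <p> and <b>, bare <br>.
        XmlDocument xml = ToXmlDocument("<p>Unclosed paragraph<br><b>bold");
        Console.WriteLine(xml.OuterXml);
    }
}
```

If you only need to query the document, you can skip the XmlDocument step and call SelectNodes on htmlDoc.DocumentNode directly; note that it returns null rather than an empty list when nothing matches.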

Marc Gravell
Thank you! So far it looks very solid, though I had to make a couple of tweaks to compile it and there are no real docs.
Matthew Flaschen
I've completed the parsing code, and I still think it's an excellent library. Thank you for the tip. One slightly odd thing is that it doesn't seem to have an option to automatically expand entities (e.g. &nbsp;). You have to manually call DeEntitize. Luckily, I only needed this for one node.
Matthew Flaschen