So far, the best way I have found to manipulate the DOM of a string that contains HTML is:

WebBrowser webControl = new WebBrowser();
webControl.DocumentText = html;
HtmlDocument doc = webControl.Document;

There are two problems:

  1. It requires a WebBrowser object!
  2. It can't be used from multiple threads; I need something that works on a thread other than the main thread.

Any ideas?

+1  A: 

Depending on what you are trying to do (maybe you can give us more details?) and depending on whether or not the HTML is well-formed, you could convert this to an XmlDocument:

System.Xml.XmlDocument x = new System.Xml.XmlDocument();
x.LoadXml(html); // as long as html is well-formed, i.e. XHTML

Then you could manipulate it easily, without the WebBrowser instance. As for threads, I don't know enough about the implementation of XmlDocument to know the answer to that part.
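As a minimal sketch of that kind of manipulation, assuming the input is already well-formed XHTML: the class and method names below (XhtmlDemo, AddTargetToLinks) are made up for illustration. Each call builds its own XmlDocument, which matters because instance members of XmlDocument are not guaranteed to be thread safe.

```csharp
using System.Xml;

public static class XhtmlDemo
{
    // Hypothetical helper: adds target="_blank" to every <a href> element.
    public static string AddTargetToLinks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // SelectNodes returns an empty list when nothing matches.
        foreach (XmlElement link in doc.SelectNodes("//a[@href]"))
        {
            link.SetAttribute("target", "_blank");
        }

        return doc.OuterXml;
    }
}
```

Because nothing is shared between calls, a helper like this could be invoked from a worker thread as long as each thread passes in its own string.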


If the document isn't in proper form, you could use NTidy (.NET wrapper for HTML Tidy) to get it in shape first; I had to do this very thing for a project once and it really wasn't too bad.

Jason Bunting
The document might not be well-formed, which is why XmlDocument might not work, but I appreciate the alternative.
Daok
+8  A: 

I searched Google for HTML parsing and found the Html Agility Pack. I don't know yet whether it does what I need, but I am downloading it right now to give it a try.
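For anyone who wants to try it, here is a rough sketch of parsing an HTML string with the Html Agility Pack, with no WebBrowser involved. It assumes the HtmlAgilityPack library is referenced in the project; the class and method names (AgilityPackDemo, ExtractLinks) are invented for this example.

```csharp
using HtmlAgilityPack; // assumes the HtmlAgilityPack library is referenced

public static class AgilityPackDemo
{
    // Hypothetical helper: returns the href value of every link in the markup.
    public static string[] ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates markup that is not well-formed

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new string[0];
        }

        var links = new string[anchors.Count];
        for (int i = 0; i < anchors.Count; i++)
        {
            links[i] = anchors[i].GetAttributeValue("href", "");
        }
        return links;
    }
}
```

Each call creates its own HtmlDocument, so nothing is shared between callers; that is the pattern to aim for when the parsing happens off the main thread.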

Daok
Html Agility Pack is awesome
Mark Cidade
Ditto - I was actually about to recommend using HTML Tidy to get the document into good shape and then turn it into an XmlDocument, but perhaps you can skip that with the HTML Agility Pack. Good stuff.
Jason Bunting
The Agility Pack works fine with HTML and with threads! I got my answer! Thx all!!!
Daok
Yeah +1 for the HtmlAgilityPack. Stand on the shoulders of giants!
Stewart Johnson
+1  A: 

Jason Bunting already posted this, but using a .NET wrapper around HTML Tidy and loading the result into an XmlDocument really does work.

I have used this .NET wrapper before:

http://www.codeproject.com/KB/cs/ZetaHtmlTidy.aspx

And implemented it somewhat like this:

// Deliberately malformed input.
string input = "<p>crappy html<br <img src=foo></div>";
HtmlTidy tidy = new HtmlTidy();
// Have Tidy emit XHTML so that XmlDocument can parse the result.
string output = tidy.CleanHtml(input, HtmlTidyOptions.ConvertToXhtml);
XmlDocument doc = new XmlDocument();
doc.LoadXml(output);

Sorry if considered a repost :)

Martin Kool