Is there a .NET class for reading and manipulating HTML other than System.Windows.Forms.HtmlDocument?

If not, are there any open source libraries for this?

+1  A: 

I would do something like this if it is XHTML-compliant:

System.Xml.XmlDocument xDoc = new System.Xml.XmlDocument();
xDoc.LoadXml(html);

And edit it that way. If it needs some cleaning up (conversion to XHTML), you can use HtmlTidy or NTidy. For example, with an HtmlTidy wrapper:

string input = "<p>broken html<br <img src=test></div>";
HtmlTidy tidy = new HtmlTidy();
string output = tidy.CleanHtml(input, HtmlTidyOptions.ConvertToXhtml);
XmlDocument doc = new XmlDocument();
doc.LoadXml(output);

StackOverFlow Reference

EDIT: The input above will be converted to XHTML.
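
Once the markup is well-formed, editing it is ordinary XML DOM work. A minimal sketch, assuming the input is already valid XHTML; the class name XhtmlEditExample, the method MarkParagraphs, and the "note" class value are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Xml;

static class XhtmlEditExample
{
    // Adds class="note" to every <p> element and returns the serialized markup.
    // Hypothetical helper, not part of any library.
    public static string MarkParagraphs(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed

        // Copy the matches into a list first so we are not editing the
        // document while walking a live node list.
        var paragraphs = doc.GetElementsByTagName("p").Cast<XmlElement>().ToList();
        foreach (XmlElement p in paragraphs)
        {
            p.SetAttribute("class", "note");
        }

        return doc.OuterXml;
    }

    static void Main()
    {
        Console.WriteLine(MarkParagraphs("<div><p>first</p><p>second<br /></p></div>"));
    }
}
```

Note that LoadXml rejects anything that is not well-formed, which is why the tidy step matters for real-world HTML.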

cgreeno
Surely that only works with XHTML: not with HTML.
ChrisW
Why is this downvoted? Is it not a valid option?
cgreeno
I'd imagine it was down voted because the question had nothing to do with XML.
hmcclungiii
YES but the question asks for other OPTIONS on how to manipulate HTML! XHTML is just a reformulation of HTML in XML.
cgreeno
I don't think it deserves a downvote, so I voted it up.
Cyril Gupta
Then he'll fall into the trap of XML validation among many other things, that I'd guess by his wording would be way more than he is bargaining for. Instead of manipulating straight HTML, you would suggest he "reformulate" it? Sorry, I just don't agree, and I think your CAPS are a bit rude.
hmcclungiii
Reformulating it? XHTML is valid HTML as well, so by turning HTML into XHTML you would not only be manipulating the required data but also outputting something better. You may not agree, but it is a valid option.
cgreeno
Oh, I didn't down vote it. Without knowing exactly what his purposes are, I would say that XHTML is overkill, to put it more simply.
hmcclungiii
A: 

Why don't you like System.Windows.Forms.HtmlDocument and Microsoft.mshtml?

abatishchev
Because it requires a reference to System.Windows.Forms, which isn't really appropriate for a class library or for ASP.NET.
mdresser
+1  A: 

You could use the MSHTML library. It is COM/ActiveX, but if you are using Visual Studio, adding a reference to it will generate a managed wrapper for you automatically.

hmcclungiii
Is the (unmanaged) MSHTML library the same thing as the (managed) System.Windows.Forms.HtmlDocument?
ChrisW
I assumed that HtmlDocument is a managed wrapper around the unmanaged MSHTML ... you're saying this isn't so?
ChrisW
+1  A: 

You can always use the LiteralControl:

PlaceHolder.Controls.Add(new LiteralControl("<div>some html</div>"));
naspinski
+3  A: 

It seems that the best option for parsing HTML in .NET apps is the Html Agility Pack library, found on CodePlex. It provides full DOM access to the HTML and is very straightforward to use.
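
To give a feel for it, here is a rough sketch, assuming the HtmlAgilityPack assembly is referenced; the class name AgilityPackExample and the method AddNoFollow are hypothetical and only there for illustration. It loads HTML that need not be well-formed, finds every link with an XPath query, and sets an attribute on each:

```csharp
using System;
using HtmlAgilityPack;

static class AgilityPackExample
{
    // Adds rel="nofollow" to every <a> that has an href and returns the
    // resulting markup. Hypothetical helper, not part of the library.
    public static string AddNoFollow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates markup that is not well-formed

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                link.SetAttributeValue("rel", "nofollow");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        Console.WriteLine(AddNoFollow("<p>See <a href=\"http://example.com\">this</a><br>"));
    }
}
```

Unlike the XmlDocument route, this needs no XHTML conversion step first and no reference to System.Windows.Forms.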

mdresser