I'm looking for a library or method to parse an HTML file, with more HTML-specific features than generic XML parsing libraries offer.

A: 

You could use an HTML DTD together with a generic XML parsing library.

Corin
Can you clarify this?
Luke
Very few real-world HTML pages will survive an XML parsing library.
Frank Krueger
A: 

The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
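
To make the "not well-formed XML" point concrete, here is a minimal sketch using only System.Xml.Linq. The class and method names (XmlStrictnessDemo, IsWellFormedXml) are made up for illustration; the idea is just that a strict XML parser rejects markup that browsers accept without complaint.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class XmlStrictnessDemo
{
    // Returns true only if the markup parses as well-formed XML.
    // Typical HTML (unclosed <br>, undeclared entities such as &nbsp;)
    // makes XDocument.Parse throw an XmlException.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling IsWellFormedXml("<p>one<br>two</p>") returns false because the br element is never closed, while the XHTML-style "<p>one<br/>two</p>" returns true.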

Mark Ingram
Isn't parsing well-formed HTML, as specified by the W3C, as exact a science as parsing XHTML?
J. Pablo Fernández
It should be, but people don't do it.
DMan
+36  A: 

I used the HTMLAgilityPack on a project for a previous employer and it was pretty effective. It wasn't foolproof, but it did handle most of the malformed tags, etc. that you find on the web these days.
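
For anyone who hasn't seen it in use, a rough sketch of pulling links out of a page with the Html Agility Pack might look like the following. It assumes the HtmlAgilityPack assembly is referenced; the LinkExtractor class and GetLinks method are hypothetical names for this example, not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Collects the href value of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed or malformed tags

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The null check matters in practice: forgetting it is a common source of NullReferenceException on pages that contain no matching nodes.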

Sarcastic
Very handy library, thanks... And much easier for me to figure out than the mshtml.
Alex Baranosky
Is this still the best option, almost two years on from when you answered the question? I'll check it out all the same though.
Drew Noakes
I'm no longer using this project on a day-to-day basis, but it looks well maintained, with new features such as LINQ to Objects in beta, and under active development. Definitely still worth evaluating.
Sarcastic
+15  A: 

You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.

Another alternative would be to use the built-in mshtml engine:

using mshtml;
...
// write() expects an object array, so wrap the HTML string in one
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
// IHTMLDocument2 is the interface that exposes write()
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);

This allows you to use JavaScript-like functions such as getElementById().

Erlend
This is a really good solution.
Frank Krueger
Call me crazy, but I am having trouble figuring out how to use mshtml. Do you have any good links?
Alex Baranosky
@Alex You need to add a reference to Microsoft.mshtml. You can find a bit more info here: http://msdn.microsoft.com/en-us/library/aa290341(VS.71).aspx
Wilfred Knievel
+4  A: 

I'm not sure about "best" but I'd start here:

Html Agility Pack

This will probably give you what you need.

Murph
+3  A: 

I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:

SgmlReader
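
A minimal sketch of how SgmlReader is typically wired up, assuming the SgmlReader assembly (namespace Sgml) is referenced. The SgmlExample class and ParseHtml method are hypothetical names for this example. Because SgmlReader is an XmlReader, the result can be loaded straight into a regular XmlDocument and queried with XPath.

```csharp
using System.IO;
using System.Xml;
using Sgml;

public static class SgmlExample
{
    // Reads loosely formed HTML and returns it as an XmlDocument.
    public static XmlDocument ParseHtml(string html)
    {
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";                    // use the built-in HTML DTD
            reader.CaseFolding = CaseFolding.ToLower;   // normalize tag names to lower case
            reader.InputStream = new StringReader(html);

            var doc = new XmlDocument();
            doc.Load(reader); // SgmlReader derives from XmlReader
            return doc;
        }
    }
}
```

After that, something like ParseHtml(html).SelectNodes("//a") works as it would on any other XML document.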

Frank Krueger
+4  A: 

You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop). Use the System.Windows.Forms.WebBrowser control. From there, you can call things like "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (IMO a lesser evil than interop) to do it:

var wb = new WebBrowser();

... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:

// WebBrowser exposes the loaded page directly through its Document property
var doc = wb.Document;
var elem = doc.GetElementById(elementId);
// DomElement is the underlying unmanaged DOM object
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);

You can do similar reflection tricks to submit forms, etc.

Enjoy.

Alan
+1  A: 

I've linked code here that will give you "LINQ to HTML" functionality:

http://stackoverflow.com/questions/100358/looking-for-c-html-parser/624410#624410

Frank Schwieterman
A: 

Use WatiN if you need to see the impact of JavaScript on the page (and you're prepared to start a browser).

Ruben Bartelink
A: 

Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.

Mikos
A: 

http://www.codeplex.com/htmlagilitypack

Njy
A: 

Data Extracting SDK

sashaeve
A: 

Try this script.

http://www.biterscripting.com/SS_URLs.html

When I run it with this URL,

script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")

it shows me all the links on the page for this thread:

http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.

You can modify that script to check for images, variables, whatever.

P M
A: 

.outerHTML strips away the quotes, which is very bad.

jan