I'm looking for a library or method to parse an HTML file, with more HTML-specific features than generic XML parsing libraries offer.
The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
I used the HTMLAgilityPack on a project for a previous employer and it was pretty effective. It wasn't foolproof, but it did handle most of the malformed tags, etc. that you find on the web these days.
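For what it's worth, here is a minimal sketch of what that can look like, assuming the HtmlAgilityPack NuGet package is referenced. The class name LinkExtractor and the method ExtractLinks are made up for illustration; the sample pulls every link out of markup that a strict XML parser would reject.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

// Hypothetical helper, not part of the library.
public static class LinkExtractor
{
    // Returns the href of every <a> element that has one, even in sloppy markup.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed tags and similar problems

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode a in anchors)
            links.Add(a.GetAttributeValue("href", string.Empty));

        return links;
    }
}
```

Calling LinkExtractor.ExtractLinks on a page's source should give you the raw href values; resolving relative URLs is left out of this sketch.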
You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
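As a rough sketch of the second half of that approach (the XML parsing, not the Tidy call itself), here is how the converted output could be queried with LINQ to XML. This assumes the XHTML is well-formed, uses the standard XHTML namespace on its root element, and contains no named entities such as &nbsp; that would need a DTD to resolve. XhtmlLinks and ExtractLinks are illustrative names only.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for querying XHTML that has already been tidied.
public static class XhtmlLinks
{
    // Elements in an XHTML document live in this namespace.
    private static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns the href of every <a> element that has one, in document order.
    public static string[] ExtractLinks(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml); // throws XmlException if not well-formed

        return doc.Descendants(Xhtml + "a")
                  .Select(a => (string)a.Attribute("href")) // null when href is absent
                  .Where(href => href != null)
                  .ToArray();
    }
}
```

The namespace matters here: querying for a plain "a" without it would find nothing in a namespaced XHTML document.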
Another alternative would be to use the built-in engine, mshtml:
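using mshtml; // requires a reference to the Microsoft.mshtml interop assembly
...
// IHTMLDocument2.write takes an object array; html is the markup string to parse.
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);
doc2.close(); // close the output stream once the markup has been written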
This allows you to use JavaScript-like functions such as getElementById().
I'm not sure about "best" but I'd start here:
This will probably give you what you need.
I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:
You can do a lot without going nuts on third-party products or mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can call "GetElementById" on an HtmlDocument or "GetElementsByTagName" on an HtmlElement. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (IMO a lesser evil than interop) to do it:
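Create the browser with var wb = new WebBrowser(); and tell it to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:
// WebBrowser exposes the loaded page through its Document property.
var doc = wb.Document;
var elem = doc.GetElementById(elementId);
// DomElement is the underlying unmanaged DOM object for the element.
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);
// HtmlElement.InvokeMember("click") is a built-in alternative to the reflection above.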
You can do similar reflection tricks to submit forms, etc.
Enjoy.
I've linked code here that will give you "LINQ to HTML" functionality:
http://stackoverflow.com/questions/100358/looking-for-c-html-parser/624410#624410
Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]
Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.
Try this script.
http://www.biterscripting.com/SS_URLs.html
When I use it with this url,
script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
It shows me all the links on the page for this thread.
http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.
You can modify that script to check for images, variables, whatever.