ansaurus

Question

What is the best way to parse HTML in C#?

Answer 1

A:

You could use an HTML DTD and the generic XML parsing libraries.

Corin 2008-09-11 09:39:18

Can you clarify this?

Luke 2008-09-11 09:44:45

Very few real-world HTML pages will survive an XML parsing library.
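A minimal sketch of why that is, using only System.Xml.Linq from the base class library. The class and method names (XmlStrictness, TryParseAsXml) and the sample markup are made up for illustration. The point is that an unclosed <br>, which every browser accepts, makes a strict XML parser throw.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlStrictness
{
    // Returns true only if the markup is well-formed XML.
    public static bool TryParseAsXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unclosed or mismatched tags end up here.
            return false;
        }
    }
}
```

With this helper, TryParseAsXml("<p>Hello<br></p>") returns false, while TryParseAsXml("<p>Hello<br/></p>") returns true.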

Frank Krueger 2008-09-11 11:07:42

Answer 2

A:

The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, then things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.

Mark Ingram 2008-09-11 09:47:26

Isn't parsing well-formed HTML, as specified by the W3C, as exact a science as parsing XHTML?

J. Pablo Fernández 2009-12-08 12:56:54

It should be, but people don't do it.

DMan 2010-02-16 03:54:18

Answer 3

+36 A:

I used the HTMLAgilityPack on a project for a previous employer and it was pretty effective. It wasn't foolproof, but it did handle most of the malformed tags, etc. that you find on the web these days.
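For readers who have not seen the library, here is a rough sketch of typical Html Agility Pack usage, assuming the HtmlAgilityPack assembly is referenced. The class and method names (LinkExtractor, GetLinks) are hypothetical and this is not code from the project described above; it only shows the load-then-query pattern.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, not part of the .NET base class library

public static class LinkExtractor
{
    // Collects the href value of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates unclosed or badly nested tags

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode a in anchors)
        {
            links.Add(a.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The null check matters: forgetting it is a common source of NullReferenceException when a page has no matching nodes.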

Sarcastic 2008-09-11 10:17:08

Very handy library, thanks... And much easier for me to figure out than the mshtml.

Alex Baranosky 2009-01-09 09:22:33

Is this still the best option, almost two years on from when you answered the question? I'll check it out all the same though.

Drew Noakes 2010-08-07 12:54:39

I'm no longer using this project on a day-to-day basis, but it looks well maintained, with new features such as LINQ to Objects in beta, and under active development. Definitely still worth evaluating.

Sarcastic 2010-08-10 13:46:06

Answer 4

+15 A:

You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
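A sketch of the second half of that pipeline, assuming the tidy step has already produced well-formed XHTML in a string and that the output declares the standard XHTML namespace. The TidyNet call itself is left out, and the names XhtmlLinks and GetLinks are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlLinks
{
    // XHTML elements live in this namespace, so queries must qualify element names with it.
    private static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns the href of every <a> element in an already well-formed XHTML document.
    public static List<string> GetLinks(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants(Xhtml + "a")
                  .Select(a => (string)a.Attribute("href")) // null when the attribute is missing
                  .Where(href => href != null)
                  .ToList();
    }
}
```

If the converted document carries no namespace declaration, query for the plain name "a" instead of Xhtml + "a".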

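Another alternative would be to use the built-in engine, mshtml:

using mshtml; // needs a reference to the Microsoft.mshtml interop assembly
...
// write() takes an array of markup fragments; html holds the page source
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);

This allows you to use JavaScript-style DOM functions such as getElementById().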

Erlend 2008-09-11 10:35:12

This is a really good solution.

Frank Krueger 2008-09-11 11:06:37

Call me crazy, but I am having trouble figuring out how to use mshtml. Do you have any good links?

Alex Baranosky 2009-01-09 05:52:04

@Alex you need to add a reference to Microsoft.mshtml. You can find a bit more info here: http://msdn.microsoft.com/en-us/library/aa290341(VS.71).aspx

Wilfred Knievel 2010-01-12 23:17:11

Answer 5

+4 A:

I'm not sure about "best" but I'd start here:

Html Agility Pack

This will probably give you what you need.

Murph 2008-09-11 10:53:06

Answer 6

+3 A:

I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:

SgmlReader
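A short sketch of how SgmlReader is commonly wired up, assuming the SgmlReader assembly (namespace Sgml) is referenced. The wrapper names SgmlHtml and Load are hypothetical, and the property names follow the library's widely circulated sample, so check them against the version you install.

```csharp
using System.IO;
using System.Xml;
using Sgml; // SgmlReader, a third-party library

public static class SgmlHtml
{
    // Reads loosely formed HTML through SgmlReader and returns it as an XmlDocument.
    public static XmlDocument Load(string html)
    {
        using (var sgml = new SgmlReader())
        {
            sgml.DocType = "HTML";                  // use the HTML DTD shipped with the library
            sgml.CaseFolding = CaseFolding.ToLower; // normalise tag names to lower case
            sgml.InputStream = new StringReader(html);

            var doc = new XmlDocument();
            doc.Load(sgml); // SgmlReader is an XmlReader, so standard XML APIs accept it
            return doc;
        }
    }
}
```

Once loaded, the usual XmlDocument calls such as SelectNodes("//a") work on the result.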

Frank Krueger 2008-09-11 11:12:13

Answer 7

+4 A:

You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can call "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (imo a lesser evil than interop) to do it:

var wb = new WebBrowser();

... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:

var doc = wb.Document;
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);

You can do similar reflection stuff to submit forms, etc.

Enjoy.

Alan 2008-09-11 14:08:20

Answer 8

+1 A:

I've linked code here that will give you "LINQ to HTML" functionality:

http://stackoverflow.com/questions/100358/looking-for-c-html-parser/624410#624410

Frank Schwieterman 2009-03-08 22:14:29

Answer 9

A:

Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]

Ruben Bartelink 2009-11-12 14:53:50

Answer 10

A:

Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.

Mikos 2010-01-03 09:04:29

Answer 11

A:

http://www.codeplex.com/htmlagilitypack

Njy 2010-01-14 17:58:26

Answer 12

A:

Data Extracting SDK

sashaeve 2010-03-09 14:45:22

Answer 13

A:

Try this script.

http://www.biterscripting.com/SS_URLs.html

When I run it against this URL,

script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")

It shows me all the links on the page for this thread.

http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.

You can modify that script to check for images, variables, whatever.

P M 2010-03-22 20:29:03

Answer 14

A:

.outerHTML strips away the quotes, which is very bad.

jan 2010-07-29 02:17:02
