tags:
views: 323
answers: 4

Hello everybody, this is my question: what is the best way to extract certain information from an HTML page? What I currently do is the following:

  1. Download the page using WebClient

  2. Convert the received data to a string using UTF8Encoding

  3. Convert the string to XML

  4. Using the XML-related classes from the .NET Framework, extract the desired data

This is what I currently do, in summarized form. Is anyone aware of another method? Something that might be faster or easier?
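For reference, the four steps above can be sketched roughly like this (the URL and XPath are placeholders, and step 3 only succeeds when the page is well-formed XHTML):

```csharp
using System;
using System.Net;
using System.Text;
using System.Xml;

class PageScraper
{
    // Steps 3-4: parse well-formed XHTML and pull out link targets.
    // Kept separate from the download so it can be exercised without a network.
    public static XmlNodeList ExtractLinks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException on malformed markup

        // XPath query; local-name() sidesteps the XHTML default namespace.
        return doc.SelectNodes("//*[local-name()='a']/@href");
    }

    static void Main()
    {
        // Steps 1-2: download the raw bytes and decode them as UTF-8.
        byte[] raw = new WebClient().DownloadData("http://example.com/page.xhtml");
        string xhtml = new UTF8Encoding().GetString(raw);

        foreach (XmlNode href in ExtractLinks(xhtml))
            Console.WriteLine(href.Value);
    }
}
```

As the answers below point out, the fragile part is `LoadXml`: any unclosed tag in the source page makes it throw.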

Best Regards, Kiril

PS: I have heard about a testing framework called Watin that allows you to do something similar, but I haven't researched it much.

A: 

Unless you are working with perfectly formed XHTML, regular expressions will be more suitable for parsing the HTML.

Watin allows you to script button clicks, script calls, etc. on a web page through IE (I am not sure whether it can drive other browsers). I don't think this will accomplish what you are looking for.

alexmac
Regular expressions do not work well against malformed HTML either.
Rex M
No, but I suspect many third-party libs use them in conjunction with standard string manipulation to process the HTML, and either way the flexibility of regex is superior to that offered by XML queries.
alexmac
Yeah, they do - I use them all the time. They only work badly on human-written HTML that does not come from a templating system - a vanishingly small proportion of structured data. This should not have been voted down; +1 from me.
Simon Gibbs
@Simon unfortunately, plenty of templating systems use templates written with malformed HTML. Regular expressions are fine for extracting fairly small sets of data, but the time required to write highly complex, large data extractions is far greater than XML-izing the markup and using XPATH.
Rex M
So how is "XML-izing" the markup going to assist with malformed HTML? However, I agree the HTML Agility Pack is probably what he/she is looking for - which probably uses regular expressions at some point ;)
alexmac
@alexmac malformed HTML can still be parsed into a DOM, and a DOM can always be represented by well-formed XML. The library handles that part, essentially making malformed HTML well-formed XHTML.
Rex M
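To illustrate the round-trip described above, the HTML Agility Pack can load malformed markup into a DOM and serialize it back out as well-formed XML. A sketch, assuming the HtmlAgilityPack assembly is referenced (the input fragment is illustrative):

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

class FixupDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Unclosed <li> tags: legal HTML, but not well-formed XML as-is.
        doc.LoadHtml("<ul><li>one<li>two</ul>");

        // Ask the library to serialize the repaired DOM as well-formed XML.
        doc.OptionOutputAsXml = true;
        var writer = new StringWriter();
        doc.Save(writer);

        Console.WriteLine(writer.ToString()); // the <li> elements come back closed
    }
}
```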
+6  A: 

It sounds like you've figured out how to fetch the page data (that's the simplest part).

For the rest, the best managed library I've used for this type of task is the HTML Agility Pack. It's open source and very mature, written entirely in .NET. It handles malformed HTML and can do what you need in two different ways:

  • Natively supports XPATH and XML-like querying against the HTML DOM. It is designed to mimic .NET's XML library, so anything you can do against XML with .NET, you can do against HTML with this.

  • Supports producing valid XML from the HTML, so you can use any XML tools.
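The first mode looks roughly like this (a sketch, assuming the HtmlAgilityPack assembly is referenced; the markup and XPath are illustrative):

```csharp
using System;
using HtmlAgilityPack;

class AgilityDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Messy HTML loads without complaint -- no exception on unclosed tags.
        doc.LoadHtml("<html><body><h1>Title<a href='/next'>next</a></body></html>");

        // SelectNodes mirrors the XmlNode API, but note it returns null
        // (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (HtmlNode link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}
```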

Rex M
+2  A: 

For your parsing needs I recommend the HTML Agility Pack.

For actually retrieving the HTML, use the WebRequest class.
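A minimal WebRequest fetch might look like this (a sketch; the URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;

class Fetch
{
    static void Main()
    {
        WebRequest request = WebRequest.Create("http://example.com/");

        // Dispose the response and reader even if reading throws.
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd(); // hand this string to the HTML parser
            Console.WriteLine(html.Length);
        }
    }
}
```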

Kirschstein
A: 

This could be simplified slightly by using the WebClient.DownloadString method, I believe.
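That collapses the download-then-decode steps into a single call (a sketch; the URL is a placeholder):

```csharp
using System;
using System.Net;

class OneLiner
{
    static void Main()
    {
        // Download and decode in one call -- no manual UTF8Encoding step.
        string html = new WebClient().DownloadString("http://example.com/");
        Console.WriteLine(html.Length);
    }
}
```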

See other answers for details on the parsing, as I haven't tried the HTML Agility Pack.

samjudson
That wouldn't solve the parsing problem, though.
John Saunders
No, it won't, but I considered the other answers covering the HTML Agility Pack to address that aspect well enough.
samjudson