tags:
views: 323
answers: 4

Hello everybody, this is my question: what is the best way to extract certain information from an HTML page? What I currently do is the following:

  1. Download the page using WebClient

  2. Convert the received data to a string using UTF8Encoding

  3. Convert the string to XML

  4. Using the XML-related classes from the .NET Framework, extract the desired data

This is what I currently do, in summarized form. Is anyone aware of another method? Something that might be faster or easier?
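For reference, the four steps above can be sketched roughly like this (the URL and XPath are placeholders, and step 3 only succeeds when the page is well-formed XHTML):

```csharp
using System;
using System.Net;
using System.Text;
using System.Xml;

class PageScraper
{
    // Steps 3-4: parse well-formed XHTML and pull out link targets.
    // Kept separate from the download so it can be exercised without a network.
    public static XmlNodeList ExtractLinks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException on malformed markup

        // XPath query; local-name() sidesteps the XHTML default namespace.
        return doc.SelectNodes("//*[local-name()='a']/@href");
    }

    static void Main()
    {
        // Steps 1-2: download the raw bytes and decode them as UTF-8.
        byte[] raw = new WebClient().DownloadData("http://example.com/page.xhtml");
        string xhtml = new UTF8Encoding().GetString(raw);

        foreach (XmlNode href in ExtractLinks(xhtml))
            Console.WriteLine(href.Value);
    }
}
```

As the answers below point out, the fragile part is `LoadXml`: any unclosed tag in the source page makes it throw.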

Best Regards, Kiril

PS: I have heard about a testing framework called Watin that allows you to do something similar, but I haven't researched it much.

A: 

Unless you are working with perfectly formed XHTML, regular expressions will be more suitable for parsing the HTML.

Watin allows you to script button clicks, script calls, etc. on a web page through IE (I am not sure whether it can drive other browsers). I don't think this will accomplish what you are looking for.

alexmac
Regular expressions do not work well against malformed HTML either.
Rex M
No, but I suspect many third-party libs use them in conjunction with standard string manipulation to process the HTML, and either way the flexibility of regex is superior to that offered by XML queries.
alexmac
Yeah, they do - I use them all the time. They only work badly on human-written HTML that does not come from a templating system - a vanishingly small proportion of structured data. This should not have been voted down; +1 from me.
Simon Gibbs
@Simon unfortunately, plenty of templating systems use templates written with malformed HTML. Regular expressions are fine for extracting fairly small sets of data, but the time required to write highly complex, large data extractions is far greater than XML-izing the markup and using XPATH.
Rex M
So how is "XML-izing" the markup going to assist with malformed HTML? However, I agree the HTML Agility Pack is probably what he/she is looking for - which probably uses regular expressions at some point ;)
alexmac
@alexmac malformed HTML can still be parsed into a DOM, and a DOM can always be represented by well-formed XML. The library handles that part, essentially making malformed HTML well-formed XHTML.
Rex M
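To illustrate the round-trip described above, the HTML Agility Pack can load malformed markup into a DOM and serialize it back out as well-formed XML. A sketch, assuming the HtmlAgilityPack assembly is referenced (the input fragment is illustrative):

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

class FixupDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Unclosed <li> tags: legal HTML, but not well-formed XML as-is.
        doc.LoadHtml("<ul><li>one<li>two</ul>");

        // Ask the library to serialize the repaired DOM as well-formed XML.
        doc.OptionOutputAsXml = true;
        var writer = new StringWriter();
        doc.Save(writer);

        Console.WriteLine(writer.ToString()); // the <li> elements come back closed
    }
}
```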
+6  A: 

It sounds like you've figured out how to fetch the page data (that's the simplest part).

For the rest, the best managed library I've used for this type of task is the HTML Agility Pack. It's open source and very mature, written entirely in .NET. It handles malformed HTML and can do what you need in two different ways:

  • Natively supports XPATH and XML-like querying against the HTML DOM. It is designed to mimic .NET's XML library, so anything you can do against XML with .NET, you can do against HTML with this.

  • Supports producing valid XML from the HTML, so you can use any XML tools.
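The first mode looks roughly like this (a sketch, assuming the HtmlAgilityPack assembly is referenced; the markup and XPath are illustrative):

```csharp
using System;
using HtmlAgilityPack;

class AgilityDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Messy HTML loads without complaint -- no exception on unclosed tags.
        doc.LoadHtml("<html><body><h1>Title<a href='/next'>next</a></body></html>");

        // SelectNodes mirrors the XmlNode API, but note it returns null
        // (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (HtmlNode link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}
```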

Rex M
+2  A: 

For your parsing needs I recommend the HTML Agility Pack.

For actually retrieving the HTML, use the WebRequest class.
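A minimal WebRequest fetch might look like this (a sketch; the URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;

class Fetch
{
    static void Main()
    {
        WebRequest request = WebRequest.Create("http://example.com/");

        // Dispose the response and reader even if reading throws.
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd(); // hand this string to the HTML parser
            Console.WriteLine(html.Length);
        }
    }
}
```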

Kirschstein
A: 

This could be simplified slightly by using the WebClient.DownloadString method, I believe.
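That collapses the download-then-decode steps into a single call (a sketch; the URL is a placeholder):

```csharp
using System;
using System.Net;

class OneLiner
{
    static void Main()
    {
        // Download and decode in one call -- no manual UTF8Encoding step.
        string html = new WebClient().DownloadString("http://example.com/");
        Console.WriteLine(html.Length);
    }
}
```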

See other answers for details on the parsing, as I haven't tried the HTML Agility Pack.

samjudson
That wouldn't solve the parsing problem, though.
John Saunders
No, it won't, but I considered the other answers covering the HTML Agility Pack to address that aspect well enough.
samjudson