views: 358

answers: 5

Hello everybody,

I need some advice for a project I am about to begin.

In a few words, my application has to go to a certain soccer website, download the HTML, and extract the necessary data.

This is what I have done so far (a rough sketch of the code is below the list):

:: 1) Go to a certain soccer website (ex. http://www.livescore.com/default.dll?page=england) and download the HTML using WebClient.

:: 2) Using SgmlReader, convert the HTML to XML.

:: 3) Using XmlDocument, retrieve the data I am looking for. Usually this involves:

::::::: 3.1) Retrieving nodes using GetElementsByTagName() (ex. GetElementsByTagName("tr"))

::::::: 3.2) Looping through the list of nodes returned by the GetElementsByTagName() method
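
Here is a rough sketch of that pipeline (the URL is just the example above; everything else is simplified for illustration):

    using System;
    using System.IO;
    using System.Net;
    using System.Xml;

    class Scraper
    {
        static void Main()
        {
            // 1) Download the HTML with WebClient
            string html;
            using (var client = new WebClient())
            {
                html = client.DownloadString("http://www.livescore.com/default.dll?page=england");
            }

            // 2) Convert the HTML to XML with SgmlReader
            var sgml = new Sgml.SgmlReader { DocType = "HTML" };
            sgml.InputStream = new StringReader(html);

            var doc = new XmlDocument();
            doc.Load(sgml);

            // 3) Grab the nodes I need and loop through them
            foreach (XmlNode row in doc.GetElementsByTagName("tr"))
            {
                Console.WriteLine(row.InnerText);
            }
        }
    }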

Is there a better way to do what I am trying to do?

I was thinking of LINQ to XML. Do you think this will improve performance?

Any suggestions or comments would be greatly appreciated!

+5  A: 

Just use HTML Agility Pack! http://www.codeplex.com/htmlagilitypack

That way you can query the document using XPath to get the nodes you need. You can even use the Firefox plugin Firebug to help you build your XPath queries.
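
For example, a minimal sketch (the XPath and the way the results are printed are only placeholders; adapt them to the page you are scraping):

    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main()
        {
            // HtmlWeb downloads and parses the page in one step
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.livescore.com/default.dll?page=england");

            // Query the parsed document with XPath; SelectNodes returns null when nothing matches
            var rows = doc.DocumentNode.SelectNodes("//tr");
            if (rows != null)
            {
                foreach (HtmlNode row in rows)
                {
                    Console.WriteLine(row.InnerText.Trim());
                }
            }
        }
    }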

AlbertEin
I think I will use HtmlAgilityPack, but all I found were three very basic examples along with a poor API reference. Are you aware of any richer documentation?
Kiril
You'll need to read the XPath documentation; what HtmlAgilityPack does is add XPath query support to HTML.
AlbertEin
A: 

Use a service such as Mozenda (mozenda.com), which has most everything done for you. You can also use a free service such as Dapper. I believe you can export data in different formats, although I don't know if you can grab the data in real time; you may have a delay.

If you don't want to program everything in-house, using a third-party solution can save you time and money.

Kekoa
A: 

Once you've converted the data to XML, you can use XSLT to transform it into a simpler XML shape, one that is better suited to your purposes. From there you can use LINQ to XML to get the data you need out of it. The benefit of this approach is that it decouples the website from the data gathering, so when the website changes its format you can simply change the XSLT to match and nothing else has to be touched.
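
A rough sketch of what that could look like (the stylesheet name, file names, and element names below are made up for illustration):

    using System;
    using System.IO;
    using System.Linq;
    using System.Xml.Linq;
    using System.Xml.Xsl;

    class Program
    {
        static void Main()
        {
            // Transform the site-specific XML into a simpler, stable shape with XSLT
            var xslt = new XslCompiledTransform();
            xslt.Load("livescore.xslt");                      // hypothetical stylesheet

            var simplified = new StringWriter();
            xslt.Transform("page.xml", null, simplified);     // "page.xml" is the XML produced from the HTML

            // Query the simplified XML with LINQ to XML
            XDocument doc = XDocument.Parse(simplified.ToString());
            var matches = from m in doc.Descendants("match")  // hypothetical element names
                          select new
                          {
                              Home = (string)m.Element("home"),
                              Away = (string)m.Element("away"),
                              Score = (string)m.Element("score")
                          };

            foreach (var match in matches)
            {
                Console.WriteLine("{0} {1} {2}", match.Home, match.Score, match.Away);
            }
        }
    }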

Lee
A: 

Thanks for that link to mozenda.com. That is great!

A: 

Try ScrapePro (scrapepro.com): it supports several actions, filters, converters, and data sources to scrape any website easily. It has an API, too.

csharpp