I want to build a small application that visits some websites and reads specific data from them. I want the application to be as generic as I can make it, so that it works on any website I choose later. Is there a good way to parse websites using .NET?

A: 

Assuming that the data you're trying to get isn't available through RSS or other similar means, it sounds like you are going to be looking for a page scraper. Googling for ".NET HTML scraper" returns a wealth of information on the topic.

rmz
+1  A: 

While not really intended for your scenario, have a look at WatiN. I'm assuming what you want to do is screen scraping. WatiN is pretty easy to get going and will save you from writing much of the parsing code on your own.

JasonS
A: 

Reading websites made for humans is always a problem. I have been doing this for some years now and it has never been easy. It's always a good idea to look for the data you need in some other form, like XML or RSS.

Always keep in mind that the website layout may change, and then you have to adapt your application. But the main problem is finding the information you need on the page. We always need a lot of string methods like contains, left, mid and indexOf.
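
As a rough illustration of that string-method approach, here is a minimal sketch in Python (in .NET the corresponding calls would be `String.IndexOf` and `String.Substring`). The helper name `extract_between` and the sample markup are invented for this example and do not come from any real site.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None when either marker is missing, which is what typically
    happens after the site changes its layout.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Hypothetical page fragment, made up for illustration only.
sample_html = '<div id="price"><span class="amount"> 19.99 </span></div>'
price = extract_between(sample_html, '<span class="amount">', '</span>')
```

The markers are tied to the exact markup of the page, so a small layout change makes the function return None. That is the brittleness described above.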

HTML resembles XML, so an XML parser might work, but only on pages that are well-formed. Browsers are much more tolerant and will accept and display many pages that are not well-formed.
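
To show the difference between a strict XML parser and a tolerant HTML parser, here is a small sketch using only the Python standard library. The sample markup, the `TextCollector` class and the helper names are assumptions made up for this example. The same effect should be expected from a strict XML parser in .NET.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Tolerant parser that collects the visible text chunks of a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.chunks.append(data.strip())


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def collect_text(markup):
    """Extract text without requiring the markup to be well-formed."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


# Made-up markup of the kind browsers display without complaint:
# an unclosed <br> and an unclosed <p>.
sloppy_html = "<html><body><p>First line<br>Second line<p>Unclosed paragraph</body></html>"
```

Here `parses_as_xml(sloppy_html)` fails while `collect_text(sloppy_html)` still returns the text, which is why a dedicated HTML parser is usually the safer choice for real-world pages.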

Holli
A: 

If the site doesn't provide an API, you are going to end up "scraping" the website. While this is certainly possible, it's extremely brittle. If the website changes, the logic you use to "find" the info you want will break. Having a generic app that will find specific information is a great concept, but hard to implement reliably.

Todd R
+1  A: 

I have been told that a good toolkit for doing screen scraping in .NET is HTML Agility Pack.

You should be aware that it's always very hard to make a screen scraping solution stable, or general-purpose as you describe.

Also see other answers to this question: http://stackoverflow.com/questions/646484/screen-scraping-regular-expressions-or-xquery-expressions

Bill Karwin
+1  A: 

Using wget for Windows together with regular expressions should do what you want. You use wget to download the page source, then use regular expressions to filter out the markup, or to extract exactly what you are looking for if it follows a pattern.

For reference, the wget home page is http://www.gnu.org/software/wget/
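
A rough sketch of the same download-then-filter idea, written in Python instead of wget so that it fits in one file. `fetch_page` plays the role of wget. The link pattern and the sample source are assumptions chosen for illustration, and a real page would need its own pattern.

```python
import re
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download a page and return its source as text (the job wget does)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical pattern: capture the href value and the link text of each
# anchor tag whose href is written with double quotes.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(page_source):
    """Return a list of (href, link text) tuples found in the source."""
    return LINK_PATTERN.findall(page_source)


# Made-up source, used here so the example runs without network access.
sample_source = '<p><a href="/news">News</a> and <a class="x" href="/about">About us</a></p>'
links = extract_links(sample_source)
```

For a live page you would call `extract_links(fetch_page(url))` with a URL of your choice. The pattern only works as long as the page keeps the structure it was written for.
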
Amr ElGarhy