I want to build a small application that visits some websites and reads specific data from them. I want the application to be as generic as I can make it, so that it works on any website I choose later. Is there a good way to parse websites using .NET?
views: 336
answers: 6
What's the best way to get data from other websites programmatically? (Amazon offers, for example)
Assuming that the data you're trying to get isn't available through RSS or other similar means, it sounds like you are going to be looking for a page scraper. Googling for ".NET HTML scraper" returns a wealth of information on the topic.
Reading websites made for humans is always a problem. I have been doing this for some years now, and it has never been easy. It's always a good idea to look for the data you need in some other form, like XML or RSS.
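To illustrate why a feed is so much easier than a human-oriented page, here is a minimal sketch that reads items from an RSS 2.0 document. It is written in Python with only the standard library (the same idea applies with any XML reader in .NET). The feed content, the `SAMPLE_RSS` name, and the `read_items` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real one would be downloaded from the site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example offers</title>
    <item><title>Offer A</title><link>http://example.com/a</link></item>
    <item><title>Offer B</title><link>http://example.com/b</link></item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


offers = read_items(SAMPLE_RSS)
```

Because the feed has a fixed, documented structure, the extraction does not depend on how the site's pages happen to be laid out.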
Always keep in mind that the website layout may change, and then you will have to adapt your application. But the main problem is finding the information you need on the page. We always need a lot of string methods like contains, left, mid, and indexOf.
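As a rough sketch of that string-method approach, the helper below finds the text between two known markers. It is Python rather than .NET, and the `text_between` name, the sample markup, and the markers are all assumptions for illustration.

```python
def text_between(page, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None when either marker is missing, which is what happens
    as soon as the site changes its layout.
    """
    start = page.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return None
    return page[start:end].strip()


# Hypothetical fragment of a product page.
SAMPLE_PAGE = '<div class="item"><span class="price"> $19.99 </span></div>'
price = text_between(SAMPLE_PAGE, '<span class="price">', '</span>')

# The same lookup against a redesigned page finds nothing.
after_redesign = text_between('<b class="cost">$19.99</b>',
                              '<span class="price">', '</span>')
```

Returning None instead of raising makes it easy to detect a layout change and log it, rather than silently storing wrong data.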
HTML resembles XML, so an XML parser might work on pages that happen to be well-formed (XHTML, for example). But browsers are much more tolerant and will accept and display many pages that are not well-formed, and a strict XML parser will reject those.
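The difference is easy to demonstrate. The sketch below, in Python with only the standard library, feeds the same sloppy markup to a strict XML parser and to a tolerant HTML parser. The sample markup and the `parses_as_xml` and `TextCollector` names are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical real-world markup: <br> and <p> are never closed.
SLOPPY_HTML = "<html><body><p>First line<br>Second line</body></html>"


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Tolerant parser that just collects the non-empty text chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


collector = TextCollector()
collector.feed(SLOPPY_HTML)
collector.close()
```

Here `parses_as_xml(SLOPPY_HTML)` is False, while the tolerant parser still recovers both text chunks. That is the reason dedicated HTML parsers are usually preferred over XML parsers for scraping.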
If the site doesn't provide an API, you are going to end up "scraping" the website. While this is certainly possible, it's extremely brittle. If the website changes, the logic you use to "find" the info you want will break. Having a generic app that will find specific information is a great concept, but hard to implement reliably.
I have been told that a good toolkit for doing screen scraping in .NET is HTML Agility Pack.
You should be aware that it's always very hard to make a screen scraping solution stable, or general-purpose as you describe.
Also see other answers to this question: http://stackoverflow.com/questions/646484/screen-scraping-regular-expressions-or-xquery-expressions
Using wget for Windows and regular expressions should do what you want. Use wget to fetch the page source, then use regular expressions to filter out everything else or to pull out exactly what you are looking for, provided it follows a pattern.
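A minimal sketch of the second step, in Python: the string below stands in for a page that wget has already saved to disk. The markup, the pattern, and the `extract_offers` name are assumptions, and the pattern only works while the page keeps exactly this shape.

```python
import re

# Hypothetical page source, standing in for a file saved by wget.
SAMPLE_SOURCE = """
<ul>
  <li class="offer"><a href="/item/1">Widget</a> <span class="price">$4.50</span></li>
  <li class="offer"><a href="/item/2">Gadget</a> <span class="price">$12.00</span></li>
</ul>
"""

# One match per offer: link, name, and price as named groups.
OFFER_PATTERN = re.compile(
    r'<a href="(?P<link>[^"]+)">(?P<name>[^<]+)</a>\s*'
    r'<span class="price">\$(?P<price>\d+\.\d{2})</span>'
)


def extract_offers(source):
    """Return (name, link, price) tuples for every offer matching the pattern."""
    return [(m.group("name"), m.group("link"), float(m.group("price")))
            for m in OFFER_PATTERN.finditer(source)]


offers = extract_offers(SAMPLE_SOURCE)
```

If the site adds an attribute or reorders the tags, the pattern silently matches nothing, so it is worth checking for an empty result.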