I am looking for recommendations for a screen scraper. I need to extract "Contact Us" information from certain websites.

Any ideas where I can get a good (preferably free) screen scraper?

+1  A: 

Write your own; it isn't hard. If you aren't familiar with programming, or you have a choice of programming languages, use Python: its library support for scraping is great.

As for how to attack the problem, there are two popular techniques. Regular expressions work best for ad-hoc screen scraping. If your target websites are well structured (read: not ad-hoc), then use a framework that lets you work with the DOM.
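To make the contrast concrete, here is a minimal sketch of both techniques applied to the same page, using only the Python standard library. The sample HTML, the function names, and the email pattern are all assumptions made up for illustration. Note that `html.parser` is an event-driven tag parser rather than a full DOM, so it only stands in for the structure-based approach.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <div id="contact">
    <p>Email: <a href="mailto:info@example.com">info@example.com</a></p>
    <p>Phone: 555-0100</p>
  </div>
</body></html>
"""

# Technique 1: regular expression over the raw text.
# The pattern is deliberately simple; it is not a complete email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_by_regex(html):
    """Return the unique email-like strings found anywhere in the text."""
    return sorted(set(EMAIL_RE.findall(html)))


# Technique 2: follow the document structure and read mailto: links.
class MailtoCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("mailto:"):
            self.emails.append(href[len("mailto:"):])


def emails_by_structure(html):
    """Return the addresses of all mailto: links, in document order."""
    collector = MailtoCollector()
    collector.feed(html)
    return collector.emails


print(emails_by_regex(SAMPLE_HTML))
print(emails_by_structure(SAMPLE_HTML))
```

The regex version does not care where the address appears, which helps on messy pages but can also pick up strings you did not want. The structure-based version only sees addresses that are marked up as links.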

Navigation and Extraction

These are the two phases of writing a spider. Your spider needs to navigate a website to visit different pages, and it needs to extract the information of interest. Both phases can be driven by either the DOM or regular expressions.

P.S. Since your name indicates .NET, I should mention that I have written scrapers in C#. It's a doddle.
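The two phases can be sketched roughly as follows, again with only the Python standard library. Everything here is an assumption for illustration: the sample pages, the rule "a link mentioning 'contact' is a contact page", and the US-style phone pattern. Downloading the pages is left out so the sketch runs offline.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_contact_urls(html, base_url):
    """Navigation: absolute URLs of links that look like contact pages."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        urljoin(base_url, href)
        for href, text in collector.links
        if "contact" in (href + " " + text).lower()
    ]


# Extraction: a simple US-style phone pattern (an assumption, not universal).
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")


def extract_phones(html):
    """Extraction: phone-like strings found in the page text."""
    return PHONE_RE.findall(html)


# Hypothetical pages standing in for downloaded HTML.
HOME_PAGE = '<a href="/about">About</a> <a href="/contact-us.html">Contact Us</a>'
CONTACT_PAGE = "<p>Call us at (555) 010-0199 or 555-010-0123.</p>"

print(find_contact_urls(HOME_PAGE, "http://example.com/"))
print(extract_phones(CONTACT_PAGE))
```

In a real spider you would fetch each URL returned by the navigation step and feed the response body to the extraction step.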

Hassan Syed
Hassan, I am new to C# and that is my preferred language. Are there any sites you can recommend that show how to write my own screen scraper?
LearningCSharp
If the target website's owner decides to modify their web page, that may mess up my screen scraper, correct?
LearningCSharp
Yes, it might mess up your scraper. The two methods I mentioned each have their strengths and weaknesses in this regard. The DOM method keeps working when the website changes, as long as the implementer cares about the semantic compatibility of his website. If he doesn't, then a regular expression will often be better. If I were scraping Amazon I would use the DOM method, but for smaller websites, especially those that try to protect their content by deliberately obfuscating the structure, I would use regular expressions. All of the scrapers I have written use regular expressions.
Hassan Syed
I suggest you read up on the two Wikipedia articles to see how and why what I have said is true or untrue. After that, just google "C# screen scraping DOM" and "C# screen scraping regular expressions".
Hassan Syed
A: 

Screen scraping is nicely done by Automation Anywhere. I think it can also extract data from the web automatically, following a set pattern. I found a demo showing how to screen scrape. Check it out!

Enjoy!

Bob