I would like to know what the best open-source library is for crawling and analyzing websites. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads.
views: 539
answers: 4

I do a lot of scraping, using the excellent Python packages urllib2, mechanize and BeautifulSoup.
I also suggest looking at lxml and Scrapy, though I don't use them currently (I'm still planning to try out Scrapy).
The Perl language also has great facilities for scraping.
PHP/cURL is a very powerful combination, especially if you want to use the results directly in a web page...
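To make the urllib2 + BeautifulSoup approach concrete, here is a minimal sketch of extracting property ads. The HTML snippet and its class names (`listing`, `title`, `price`) are hypothetical stand-ins for whatever markup the real agency sites use; it uses the Python 3 package name `bs4` and parses a static string, so no network fetch is needed to try it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a property agency's listing page;
# real sites will have different markup and class names.
html = """
<div class="listing">
  <h2 class="title">2-bed flat, city centre</h2>
  <span class="price">£950/month</span>
</div>
<div class="listing">
  <h2 class="title">3-bed house with garden</h2>
  <span class="price">£1,400/month</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull each ad's title and price into plain dicts, ready to aggregate.
ads = [
    {
        "title": div.find("h2", class_="title").get_text(strip=True),
        "price": div.find("span", class_="price").get_text(strip=True),
    }
    for div in soup.find_all("div", class_="listing")
]

for ad in ads:
    print(ad["title"], "-", ad["price"])
```

In a real crawler you would fetch each page first (e.g. with `urllib.request.urlopen`) and feed the response body to `BeautifulSoup` instead of a literal string.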
In common with Mr Morozov, I do quite a bit of scraping too, principally of job sites. I've never had to resort to mechanize, if that helps any. BeautifulSoup in combination with urllib2 has always been sufficient.
I have used lxml, which is great. However, I believe it may not have been available on Google Apps a few months ago when I tried it, if you need that.
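For comparison with the BeautifulSoup approach, here is a minimal sketch of the same extraction done with lxml's XPath support. The markup and the `ad` class are hypothetical; on a real site you would load the page content and adjust the XPath expressions to its actual structure.

```python
from lxml import html

# Hypothetical listing markup; a real page would be fetched over HTTP
# and would need different XPath expressions.
page = html.fromstring("""
<ul id="results">
  <li class="ad"><a href="/prop/1">Studio near station</a><em>£700/month</em></li>
  <li class="ad"><a href="/prop/2">4-bed detached</a><em>£2,100/month</em></li>
</ul>
""")

# One XPath pass per field: link text, href, and price.
ads = [
    {
        "title": li.xpath("./a/text()")[0],
        "url": li.xpath("./a/@href")[0],
        "price": li.xpath("./em/text()")[0],
    }
    for li in page.xpath('//li[@class="ad"]')
]

print(ads)
```

lxml tends to be faster than pure-Python parsers, and XPath is handy once the selectors get more complex than simple class lookups.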
My thanks are due to Mr Morozov for mentioning Scrapy. I hadn't heard of it.