views: 539

answers: 4

I would like to know what is the best open-source library for crawling and analyzing websites. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads.

+7  A: 

I do a lot of scraping, using the excellent Python packages urllib2, mechanize and BeautifulSoup.

I also suggest looking at lxml and Scrapy, though I don't use them currently (I'm still planning to try out Scrapy).

The Perl language also has great facilities for scraping.
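To make the urllib2 + BeautifulSoup workflow concrete for the property-ads use case, here is a minimal sketch. It assumes the third-party bs4 package; in Python 3, urllib2 became urllib.request. The listing markup, URL, and CSS class names are hypothetical stand-ins for whatever the target sites actually use:

```python
from bs4 import BeautifulSoup

# In a real crawl you would fetch the page first, e.g.:
#   import urllib.request          # urllib2 in Python 2
#   html = urllib.request.urlopen("https://example.com/listings").read()
# Here a small inline document stands in for a fetched property page.
html = """
<html><body>
  <div class="ad"><h2>2-bed flat</h2><span class="price">950 pcm</span></div>
  <div class="ad"><h2>Studio</h2><span class="price">700 pcm</span></div>
</body></html>
"""

# Parse the page and pull out one dict per property ad.
soup = BeautifulSoup(html, "html.parser")
ads = [
    {
        "title": ad.h2.get_text(strip=True),
        "price": ad.find("span", class_="price").get_text(strip=True),
    }
    for ad in soup.find_all("div", class_="ad")
]
print(ads)
```

The same loop would run over every agency site you crawl, with the selectors adapted per site, and the resulting dicts fed into your own aggregation store.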

Eugene Morozov
+1  A: 

PHP/cURL is a very powerful combination, especially if you want to use the results directly in a web page...

Tupak Goliam
+1  A: 

In common with Mr Morozov, I do quite a bit of scraping too, principally of job sites. I've never had to resort to mechanize, if that helps any. BeautifulSoup in combination with urllib2 has always been sufficient.

I have used lxml, which is great. However, it may not have been available on Google App Engine a few months ago when I tried it, if you need that.

My thanks to Mr Morozov for mentioning Scrapy. I hadn't heard of it.

Bill Bell
A: 

Besides Scrapy, you should also look at Parselets.

Joseph Turian