I would like to know what the best open-source library is for crawling and analyzing websites. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads.
views: 539
answers: 4

I do a lot of scraping, using the excellent Python packages urllib2, mechanize and BeautifulSoup.
I also suggest looking at lxml and Scrapy, though I don't use them currently (I'm still planning to try out Scrapy).
The Perl language also has great facilities for scraping.
PHP/cURL is a very powerful combination, especially if you want to use the results directly in a web page...
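To make the urllib2 + BeautifulSoup approach concrete, here is a minimal sketch of extracting property ads. The HTML snippet and its class names (`listing`, `title`, `price`) are hypothetical stand-ins for whatever markup the real agency sites use; it uses the Python 3 package name `bs4` and parses a static string, so no network fetch is needed to try it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a property agency's listing page;
# real sites will have different markup and class names.
html = """
<div class="listing">
  <h2 class="title">2-bed flat, city centre</h2>
  <span class="price">£950/month</span>
</div>
<div class="listing">
  <h2 class="title">3-bed house with garden</h2>
  <span class="price">£1,400/month</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull each ad's title and price into plain dicts, ready to aggregate.
ads = [
    {
        "title": div.find("h2", class_="title").get_text(strip=True),
        "price": div.find("span", class_="price").get_text(strip=True),
    }
    for div in soup.find_all("div", class_="listing")
]

for ad in ads:
    print(ad["title"], "-", ad["price"])
```

In a real crawler you would fetch each page first (e.g. with `urllib.request.urlopen`) and feed the response body to `BeautifulSoup` instead of a literal string.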
In common with Mr Morozov, I do quite a bit of scraping too, principally of job sites. I've never had to resort to mechanize, if that helps any. BeautifulSoup in combination with urllib2 has always been sufficient.
I have used lxml, which is great. However, I believe it may not have been available on Google Apps a few months ago when I tried it, if you need that.
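For comparison with the BeautifulSoup approach, here is a minimal sketch of the same extraction done with lxml's XPath support. The markup and the `ad` class are hypothetical; on a real site you would load the page content and adjust the XPath expressions to its actual structure.

```python
from lxml import html

# Hypothetical listing markup; a real page would be fetched over HTTP
# and would need different XPath expressions.
page = html.fromstring("""
<ul id="results">
  <li class="ad"><a href="/prop/1">Studio near station</a><em>£700/month</em></li>
  <li class="ad"><a href="/prop/2">4-bed detached</a><em>£2,100/month</em></li>
</ul>
""")

# One XPath pass per field: link text, href, and price.
ads = [
    {
        "title": li.xpath("./a/text()")[0],
        "url": li.xpath("./a/@href")[0],
        "price": li.xpath("./em/text()")[0],
    }
    for li in page.xpath('//li[@class="ad"]')
]

print(ads)
```

lxml tends to be faster than pure-Python parsers, and XPath is handy once the selectors get more complex than simple class lookups.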
My thanks are due to Mr Morozov for mentioning Scrapy. I hadn't heard of it.