I have a relatively simple case. I basically want to store data about links between various websites, and don't want to limit the domains. I know I could write my own crawler using some http client library, but I feel that I would be doing some unnecessary work -- making sure pages are not checked more than once, working out how to read and use a robots.txt file, maybe even trying to make it concurrent and distributed, and I'm sure a lot of other things that I haven't yet thought of.
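To make the "unnecessary work" concrete, here is a minimal sketch of two of the bookkeeping tasks mentioned (not re-checking pages, and honoring robots.txt), using only Python's standard-library `urllib.robotparser`. The `CrawlFrontier` class and the user-agent string are made-up names for illustration, not part of any framework:

```python
from urllib import robotparser
from urllib.parse import urlparse

class CrawlFrontier:
    """Tracks already-seen URLs and per-domain robots.txt rules."""

    def __init__(self, user_agent="my-crawler"):
        self.user_agent = user_agent
        self.seen = set()
        self.robots = {}  # domain -> RobotFileParser

    def should_crawl(self, url):
        """True only the first time a URL is offered."""
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

    def allowed(self, url):
        """Check the URL against its domain's robots.txt (fetched once per domain)."""
        domain = urlparse(url).netloc
        if domain not in self.robots:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"http://{domain}/robots.txt")
            rp.read()  # network fetch of robots.txt
            self.robots[domain] = rp
        return self.robots[domain].can_fetch(self.user_agent, url)
```

Even this ignores politeness delays, retries, and persistence across restarts, which is exactly why a framework is attractive.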

So I wanted a framework for web crawling that takes care of these kinds of things, while allowing me to dictate what to do with the responses (in my case, just extracting the links and storing them). Most crawlers seem to assume you're indexing web pages for search, which is no good; I need something customizable.
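The "just extracting the links" part is small enough to sketch with the standard library alone; a possible version with `html.parser` (the `LinkExtractor` name is made up for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("http://example.com/page")
extractor.feed('See <a href="/about">about</a> and <a href="http://other.org/">other</a>.')
# extractor.links -> ['http://example.com/about', 'http://other.org/']
```

A framework's value is everything around this step (scheduling, fetching, throttling), not the extraction itself.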

I want to store the link information in a MongoDB database, so I need to be able to dictate how the links are stored in the framework. And although I've tagged the question as language-agnostic, this also means that I have to limit the choice to a framework in one of MongoDB's supported languages (Python, Ruby, Perl, PHP, Java and C++), which is a very wide net. I prefer dynamic languages, but I'm open to any suggestions.
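For the storage side, one possible shape for the data is a document per link, with an upsert so a (source, target) edge is only stored once. This is a hypothetical schema, and the database/collection names below are made up; `update_one` with `upsert=True` is standard pymongo:

```python
def link_doc(source_url, target_url):
    """One stored link as a MongoDB document (hypothetical schema)."""
    return {"source": source_url, "target": target_url}

# With pymongo (one of MongoDB's Python drivers), an upsert keeps each
# edge unique; the "crawl"/"links" names here are illustrative only:
#
#   from pymongo import MongoClient
#   links = MongoClient()["crawl"]["links"]
#   doc = link_doc("http://a.example/", "http://b.example/")
#   links.update_one(doc, {"$setOnInsert": doc}, upsert=True)
```

Whatever framework is chosen would just need a hook where this insert can run for every extracted link.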

I have been able to find Scrapy (which looks neat), and JSpider (which seems good, but perhaps a bit too "heavy duty", based on the 121 page user manual), but I wanted to see if there were other good options out there I'm missing.

+3  A: 

I suppose you have already searched Stack Overflow yourself, as there are quite a few similar questions tagged web-crawler. Having used none of the following extensively, I'll refrain from elaborating and just list a few I feel are worth reviewing for the task at hand:

  • Python
  • Ruby (never used these at all)
  • Perl
  • Java
    • Nutch: pretty mature project, well documented, dedicated extensibility, based on Apache Lucene, which is very mature and has a strong community; still there appear to be issues regarding advanced integration scenarios, see this question.
    • Heritrix: very mature project, well documented, dedicated extensibility, backbone of the Internet Archive; seems to address advanced integration scenarios better for some, again, see this question.

Well, good luck with the review ;)

Steffen Opel