I have a relatively simple case. I basically want to store data about links between various websites, and I don't want to limit the domains. I know I could write my own crawler using some HTTP client library, but I feel I'd be doing a lot of unnecessary work: making sure pages aren't checked more than once, working out how to read and respect a robots.txt file, maybe even trying to make it concurrent and distributed, and I'm sure there are plenty of other things I haven't thought of yet.
So I want a framework for web crawling that takes care of these kinds of things while letting me dictate what to do with the responses (in my case, just extracting the links and storing them). Most crawlers seem to assume you're indexing web pages for search, and that's no good; I need something customizable.
I want to store the link information in a MongoDB database, so the framework needs to let me dictate how the links are stored. And although I've tagged the question as language-agnostic, this also means the choice is limited to a framework in one of MongoDB's supported languages (Python, Ruby, Perl, PHP, Java, and C++), which is still a very wide net. I prefer dynamic languages, but I'm open to any suggestions.
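For reference, here's roughly what I have in mind for the storage side, sketched with pymongo (the database, collection, and field names are just placeholders):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
links = client["crawl"]["links"]  # placeholder database/collection names

# Unique compound index so each (source, target) pair is stored only once.
links.create_index([("source", 1), ("target", 1)], unique=True)

link = {"source": "http://example.com/page", "target": "http://example.org/other"}
# Upsert keeps repeated sightings of the same link idempotent.
links.update_one(link, {"$setOnInsert": link}, upsert=True)
```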
So far I have been able to find Scrapy (which looks neat) and JSpider (which seems good, but perhaps a bit too "heavy duty", judging by its 121-page user manual), but I wanted to see if there are other good options out there that I'm missing.
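To give an idea of how little custom logic I'm hoping to write, here's a rough, untested sketch of what a Scrapy spider for this might look like (Scrapy's scheduler already deduplicates requests, and robots.txt handling can be switched on with the ROBOTSTXT_OBEY setting; the seed URL is a placeholder):

```python
import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["http://example.com"]  # placeholder seed
    custom_settings = {"ROBOTSTXT_OBEY": True}  # respect robots.txt

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # Record the edge between the current page and the linked page.
            yield {"source": response.url, "target": response.urljoin(href)}
            # Follow the link; Scrapy's dupefilter skips already-seen URLs.
            yield response.follow(href, callback=self.parse)
```

The yielded dicts could presumably then be written to MongoDB from an item pipeline, which is the kind of customization point I'm looking for in other frameworks too.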