I am looking for a web crawler that is mature and easy to extend. I am interested in the following features, or at least the possibility of extending the crawler to provide them:
- in part, simply reading the feeds of several sites
- scraping the content of these sites
- if a site has an archive, crawling and indexing it as well
- the crawler should be able to explore part of the Web for me and decide which sites match the given criteria
- it should be able to notify me when it finds things that possibly match my interests
- the crawler should not overload servers with too many requests; it should crawl intelligently
- the crawler should be robust against malformed or misbehaving sites and servers
Each of the things above can be done individually without much effort, but I am interested in a solution that provides a customisable, extensible crawler. I have heard of Apache Nutch, but I am still very unsure about the project. Do you have any experience with it? Can you recommend alternatives?
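
To make the point about not overloading servers more concrete, this is roughly the behaviour I have in mind. It is only a minimal sketch using the Python standard library, not taken from Nutch or any other crawler. The class name `PoliteGate`, its methods, the user agent `mybot`, and the default delay of 2 seconds are all my own assumptions for illustration; the robots.txt content is passed in as text, so no fetching is shown.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, user_agent, min_delay=2.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay      # assumed minimum gap per host, in seconds
        self.clock = clock              # injectable so the logic can be tested
        self.robots = {}                # host -> parsed robots.txt rules
        self.next_slot = {}             # host -> earliest time of the next request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that was already downloaded for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True if robots.txt permits the URL (or no rules are known yet)."""
        parser = self.robots.get(urlsplit(url).netloc)
        return True if parser is None else parser.can_fetch(self.user_agent, url)

    def reserve(self, url):
        """Book the next request slot for the URL's host.

        Returns the number of seconds the caller should wait before fetching.
        """
        host = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self.next_slot.get(host, now))

        # Honour a Crawl-delay directive if it is stricter than our own minimum.
        delay = self.min_delay
        parser = self.robots.get(host)
        if parser is not None:
            crawl_delay = parser.crawl_delay(self.user_agent)
            if crawl_delay is not None:
                delay = max(delay, float(crawl_delay))

        self.next_slot[host] = start + delay
        return start - now
```

A fetch loop would call `allowed(url)` first, then `time.sleep(gate.reserve(url))` before each request. Requests to different hosts do not delay each other, which is the per-host throttling I would expect a mature crawler to provide out of the box.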