I am interested in crawling a lot of websites. The most important consideration is that the spider is able to reach as much of each site as possible. One key feature lacking from most spiders is the ability to execute JavaScript, which is required to crawl AJAX-powered sites. I also strongly prefer open source, since I will need to modify the code for my project.
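To make the JavaScript requirement concrete, here is a minimal sketch of what a JS-capable crawl might look like using Selenium driving a headless browser (the start URL, domain filter, and driver choice are just placeholder assumptions, not part of any particular crawler I have settled on):

```python
# Minimal sketch: crawl pages with a headless browser so that
# JavaScript/AJAX-rendered links are visible to the spider.
# Assumes Selenium and a matching geckodriver are installed;
# "https://example.com/" is a placeholder start URL.
from urllib.parse import urljoin

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")            # run the browser without a UI
driver = webdriver.Firefox(options=options)

seen, queue = set(), ["https://example.com/"]
while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    driver.get(url)                           # JavaScript runs here, so links
                                              # injected by AJAX are present
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href]"):
        link = urljoin(url, a.get_attribute("href"))
        if link.startswith("https://example.com/") and link not in seen:
            queue.append(link)

driver.quit()
```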
Currently I think that Solr, which is part of the Apache Lucene project, is a very good solution: http://lucene.apache.org/solr/features.html
Has anyone used Solr or Lucene? My biggest problem is that Solr cannot execute JavaScript; however, it has a rich feature set and scales well, both of which make it attractive.
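If I do go with Solr, my current understanding is that I would push the pages my crawler fetches into its HTTP update handler for indexing. A rough sketch of what that might look like (the Solr URL, core name "crawl", and field names are placeholder assumptions and would depend on the schema I set up):

```python
# Rough sketch: send crawled pages to Solr's JSON update handler.
# Assumes a Solr core named "crawl" already exists with id/title/content fields.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/crawl/update"

docs = [
    {"id": "https://example.com/", "title": "Example", "content": "page text..."},
]

resp = requests.post(
    SOLR_UPDATE,
    params={"commit": "true"},   # commit so the documents become searchable
    json=docs,                   # Solr accepts a JSON array of documents
    timeout=30,
)
resp.raise_for_status()
```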