I am interested in crawling a lot of websites. The most important consideration is that the spider can reach as much of each site as possible. One key feature that most spiders lack is the ability to execute JavaScript, which is required in order to crawl Ajax-powered sites. I really like open source, and I will need to modify the code for my project.

Currently I think that Solr, which is a part of Lucene, is a very good solution. http://lucene.apache.org/solr/features.html

Has anyone used Solr or Lucene? My biggest problem with Solr is that it cannot execute JavaScript. However, it has a rich feature set and is scalable, both of which make Solr attractive.

+3  A: 

Solr is not a crawler, but a search engine (searches over an index to return results).

That said, I really like Heritrix for its flexibility. Most crawlers won't execute JavaScript, as that doesn't make much sense even today, although some, such as Heritrix, will try to extract links from it. On top of that, Heritrix allows you to plug in your own classes to do whatever you wish with the crawled data.
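To illustrate the difference between extracting links from JavaScript and executing it, here is a minimal Python sketch of the extraction approach. It is not Heritrix's actual code; the function name `extract_links_from_js`, the regular expression, and the sample script are assumptions made up for this example. It scans the script text for quoted strings that look like URLs or paths and never runs the script.

```python
import re
from urllib.parse import urljoin

# Heuristic (an assumption for this sketch): treat any quoted string literal
# that starts with http(s)://, "/", "./" or "../" as a candidate link.
STRING_LITERAL = re.compile(r"""(["'])((?:https?://|/|\.{1,2}/)[^"'\s<>]+)\1""")


def extract_links_from_js(js_source, base_url):
    """Return absolute URLs guessed from string literals in js_source.

    The script is only scanned as text; it is never executed.
    """
    links = []
    for match in STRING_LITERAL.finditer(js_source):
        # Resolve relative paths against the page the script came from.
        url = urljoin(base_url, match.group(2))
        if url not in links:  # keep first-seen order, drop duplicates
            links.append(url)
    return links


# Hypothetical script body, as a crawler might find it in a page.
sample_js = '''
function load() {
    $.get("/api/items?page=2", render);
    window.location = 'http://example.com/about';
}
'''

if __name__ == "__main__":
    for link in extract_links_from_js(sample_js, "http://example.com/index.html"):
        print(link)
```

A heuristic like this misses any URL that the script assembles at run time (for example by concatenating strings), which is exactly the gap that executing the JavaScript would close.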

Vinko Vrsalovic
Heritrix is awesome and it has the features I'm looking for: ExtractorJS, ExtractorSWF, ExtractorCSS, ExtractorPDF and more! You couldn't be more wrong about JavaScript, though, because it is a vital component of a modern spider. Google and other major search engines evaluate JavaScript.
Rook
Do you really mean they execute all the JavaScript in the page? Extracting the links from PDF, JS and so on is one thing, but I wouldn't call that evaluating JS, PDF and so on.
Vinko Vrsalovic
A: 

Try HtmlUnit: http://htmlunit.sourceforge.net/

Ondra Žižka
+1  A: 

Solr is a search engine built on top of Lucene. It does not do any crawling. Take a look at Apache Nutch. Handling JavaScript might be a problem, as scripts are often intended to lead the crawler into a dead end.
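To make the crawler versus search engine distinction concrete, here is a toy Python sketch of the indexing and searching half only. It is not how Solr or Lucene are implemented; the names `build_index` and `search` and the sample pages are assumptions for illustration. The point is that the index expects page text to be handed to it, and fetching that text is the crawler's job.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages; in a real setup a crawler would have fetched these.
pages = {
    "http://example.com/a": "open source web crawler",
    "http://example.com/b": "open source search engine",
}

if __name__ == "__main__":
    index = build_index(pages)
    print(search(index, "open source"))
    print(search(index, "crawler"))
```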

fifigyuri
My bad, Lucene has a lot of subprojects.
Rook
A: 

Watir might be useful for you.

troelskn
Watir kicks ass. It doesn't quite fit my needs, but I'll have to keep it in mind.
Rook