views: 706 · answers: 4

Hi,

I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries for building a web crawler. Besides that, Nutch is of course a very robust package, but it seems a bit too advanced for my needs. I only need to crawl a handful of websites a week, each containing a couple of thousand pages.

Which open source Java library would you recommend considering:

  • speed
  • multithreading (or even distributed)
  • extending it with new functionality
  • active maintenance
  • and documentation?
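For a sense of what any of these libraries handles for you, the core bookkeeping of a crawl is a FIFO frontier of URLs plus a visited set so no page is fetched twice. Below is a minimal sketch of that loop; the "web" is a hypothetical in-memory link graph so the example runs without network access, and a real crawler (or library) would replace the lookup with a fetch-and-parse step, plus politeness delays, robots.txt handling, and worker threads:

```java
import java.util.*;

// Minimal sketch of the bookkeeping a crawler library manages for you:
// a FIFO frontier of URLs to visit plus a seen-set to avoid re-fetching.
// The link graph here is a stand-in for real HTTP fetching and link
// extraction, so the example is self-contained.
public class Frontier {
    public static List<String> crawl(Map<String, List<String>> linkGraph,
                                     String seed, int maxPages) {
        List<String> visitOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visitOrder.size() < maxPages) {
            String url = frontier.poll();   // a real crawler fetches here
            visitOrder.add(url);
            // enqueue each outgoing link exactly once
            for (String out : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(out)) {
                    frontier.add(out);
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "a"),
                "c", List.of("d"));
        System.out.println(crawl(graph, "a", 10)); // [a, b, c, d]
    }
}
```

A library earns its keep in everything around this loop: multithreaded fetching, per-host politeness, URL normalization, and persistence of the frontier.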
A: 

Try Solr. It has everything you need.

Bragboy
Thanks Bragaadeesh, I have had a look at Solr since it's part of the Lucene / Nutch family, but it seems to be a bit of overkill for my requirements.
DrDee
+1  A: 

After giving the link you provided a quick read, Java Web Crawler looks like the best fit for what you want.

David
I am thinking of giving Heritrix (http://crawler.archive.org/) a shot, but thank you for your answer.
DrDee
+1  A: 

Have a look at Niocchi.

flm
A: 

crawler4j is a simple Java crawler that can be configured in a few minutes and can easily handle a few million web pages.
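To give a feel for that setup, here is a rough configuration sketch in the style of crawler4j's 4.x API (the crawler4j jar is assumed to be on the classpath; method signatures have changed between versions, and the storage folder, seed URL, and thread count below are placeholder values):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// Sketch only: placeholder seed/domain and storage path, not a tested setup.
public class MyCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // stay within the site being crawled
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // intermediate crawl data
        config.setPolitenessDelay(1000);             // ms between requests per host
        config.setMaxDepthOfCrawling(3);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 4);        // 4 crawler threads
    }
}
```

The two overridden methods are where all per-page logic lives, which is what keeps the setup short.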

Yasser