views: 706 · answers: 4

Hi,

I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries for building a web crawler. Besides that, Nutch is of course a very robust package, but it seems a bit too advanced for my needs. I only need to crawl a handful of websites a week, each containing a couple of thousand pages.

Which open source Java library would you recommend considering:

  • speed
  • multithreading (or even distributed)
  • extending it with new functionality
  • active maintenance
  • and documentation?
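For a sense of what any of these libraries handles for you, the core bookkeeping of a crawl is a FIFO frontier of URLs plus a visited set so no page is fetched twice. Below is a minimal sketch of that loop; the "web" is a hypothetical in-memory link graph so the example runs without network access, and a real crawler (or library) would replace the lookup with a fetch-and-parse step, plus politeness delays, robots.txt handling, and worker threads:

```java
import java.util.*;

// Minimal sketch of the bookkeeping a crawler library manages for you:
// a FIFO frontier of URLs to visit plus a seen-set to avoid re-fetching.
// The link graph here is a stand-in for real HTTP fetching and link
// extraction, so the example is self-contained.
public class Frontier {
    public static List<String> crawl(Map<String, List<String>> linkGraph,
                                     String seed, int maxPages) {
        List<String> visitOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visitOrder.size() < maxPages) {
            String url = frontier.poll();   // a real crawler fetches here
            visitOrder.add(url);
            // enqueue each outgoing link exactly once
            for (String out : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(out)) {
                    frontier.add(out);
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "a"),
                "c", List.of("d"));
        System.out.println(crawl(graph, "a", 10)); // [a, b, c, d]
    }
}
```

A library earns its keep in everything around this loop: multithreaded fetching, per-host politeness, URL normalization, and persistence of the frontier.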
A: 

Try Solr. It has everything you need.

Bragboy
Thanks Bragaadeesh, I have had a look at Solr since it's part of the Lucene / Nutch family, but it seems to be a bit of overkill for my requirements.
DrDee
+1  A: 

After giving the link you provided a quick read, Java Web Crawler looks like the best fit for what you want.

David
I am thinking of giving Heritrix (http://crawler.archive.org/) a shot, but thank you for your answer.
DrDee
+1  A: 

Have a look at Niocchi.

flm
A: 

crawler4j is a simple Java crawler that can be configured in a few minutes and can easily handle a few million web pages.
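To give a feel for that setup, here is a rough configuration sketch in the style of crawler4j's 4.x API (the crawler4j jar is assumed to be on the classpath; method signatures have changed between versions, and the storage folder, seed URL, and thread count below are placeholder values):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// Sketch only: placeholder seed/domain and storage path, not a tested setup.
public class MyCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // stay within the site being crawled
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // intermediate crawl data
        config.setPolitenessDelay(1000);             // ms between requests per host
        config.setMaxDepthOfCrawling(3);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 4);        // 4 crawler threads
    }
}
```

The two overridden methods are where all per-page logic lives, which is what keeps the setup short.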

Yasser