Hi, we are in the starting phase of a project, and we are currently wondering which crawler is the best choice for us.

Our project:

Basically, we're going to set up Hadoop and crawl the web for images. We will then run our own indexing software on the images stored in HDFS, based on the Map/Reduce facility in Hadoop. We will not use any indexing other than our own.

Some particular questions:

  • Which crawler will handle crawling for images best?
  • Which crawler will best adapt to a distributed crawling system, in which we use many servers conducting crawling together?

Right now these look like the 3 best options:

  • Nutch: Known to scale. Doesn't look like the best option because it seems that it is tied closely to their text searching software.
  • Heritrix: Also scales. This one currently looks like the best option.
  • Scrapy: Has not been used on a large scale (not sure though). I don't know if it has the basic stuff like URL canonicalization. I would like to use this one because it is a Python framework (I like Python more than Java), but I don't know if they have implemented the advanced features of a web crawler.
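
To make clear what I mean by URL canonicalization, here is a minimal sketch using only the Python standard library. It is not taken from Nutch, Heritrix or Scrapy; the function name `canonicalize` and the specific rules (lowercase the scheme and host, drop default ports and fragments, sort query parameters) are my own assumptions about what a crawler would want. It also ignores userinfo and IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Default ports that can be dropped without changing the resource (assumed rule).
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of `url` so that duplicates compare equal.

    Hypothetical helper for illustration; real crawlers apply more rules.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname drops userinfo and port

    # Keep the port only when it differs from the scheme's default.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"

    # Sort query parameters so that ordering differences do not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))
```

A crawler would typically run every discovered link through something like this before checking whether the URL has already been fetched.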

Summary:

We need to get as many images as possible from the web. Which existing crawling framework is both scalable and efficient, and also the easiest to modify so that it fetches only images?
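
To be concrete about "only images": for each fetched page we would want to separate image URLs (to download and store) from ordinary links (to follow). Below is a rough sketch of that step using only the Python standard library. The names `ImageLinkParser` and `extract` are made up for illustration and are not part of any of the crawlers above; it only looks at `<img src>` and `<a href>` and ignores things like CSS backgrounds or `srcset`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects absolute image URLs and outgoing links from one HTML page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.images = []  # URLs to download and store
        self.links = []   # URLs to enqueue for further crawling

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def extract(html: str, base_url: str):
    """Return (image_urls, link_urls) found in `html`."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.images, parser.links
```

For example, `extract(page_html, page_url)` returns two lists; the first would feed the image store and the second the crawl frontier.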

Thanks!

A: 

http://lucene.apache.org/nutch/

I would think going with something with the broadest use and support (community support) would be the better approach.

Andrew Siemer

A: 

Nutch may be a good option because you want to end up on HDFS. It may be useful to look into the HBase integration that is currently in the works (NUTCH-650).

You may be able to get the data you need by skipping the index step at the end and looking at the segments themselves instead.

However, for flexibility another option may be Droids: http://incubator.apache.org/droids/. It's still in the incubator phase at Apache, but worth looking at.

You may get some ideas by looking at the SimpleRuntime example in org.apache.droids.examples. Replacing the Sysout handler with one that stores the images onto HDFS may give you what you want.
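
Roughly, the shape of such a handler could look like the sketch below. This is written in Python purely to illustrate the idea and is not the Droids API: the class name `ImageStoreHandler`, its `handle` method, and the choice of naming files by a SHA-1 hash of the URL are all assumptions. It writes to a local directory as a stand-in; a real version would write to HDFS instead.

```python
import hashlib
from pathlib import Path


class ImageStoreHandler:
    """Stores fetched content under a root directory instead of printing it.

    Hypothetical stand-in for a crawler handler; the local directory plays
    the role that HDFS would play in the real setup.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def handle(self, url: str, content: bytes) -> Path:
        # Hash the URL to get a fixed-length, filesystem-safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
        target = self.root / name
        target.write_bytes(content)
        return target
```

The crawler would call `handle(url, body)` once per fetched image, and the Map/Reduce jobs would then read the stored files.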

Ben Daniels