Hi - we're in the starting phase of a project and are currently trying to decide which crawler is the best choice for us.
Our project:
Basically, we're going to set up Hadoop and crawl the web for images. We will then run our own indexing software on the images stored in HDFS, using Hadoop's Map/Reduce facility. We won't use any indexing other than our own.
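To make the indexing part concrete, here is a rough sketch of the kind of Map/Reduce step we have in mind, written as a Hadoop Streaming mapper in Python. Everything in it is an assumption for illustration only: the job input is taken to be a text file listing one HDFS image path per line, the images are fetched with `hadoop fs -cat`, and `compute_signature()` is just a stand-in for our real feature extraction.

```python
#!/usr/bin/env python
"""Rough Hadoop Streaming mapper sketch, not a finished design.

Assumptions (ours, for illustration): the job input is a plain-text file
with one HDFS image path per line, and images are fetched via the
`hadoop fs -cat` CLI. The "index" emitted here is just a placeholder
(byte size plus an MD5 digest); our real image-indexing logic would
replace compute_signature().
"""
import hashlib
import subprocess
import sys


def read_hdfs_file(path):
    """Fetch the raw bytes of one HDFS file through the hadoop CLI."""
    return subprocess.check_output(["hadoop", "fs", "-cat", path])


def compute_signature(data):
    """Placeholder for our own image-indexing logic."""
    return len(data), hashlib.md5(data).hexdigest()


def main():
    # Hadoop Streaming feeds input records on stdin, one per line.
    for line in sys.stdin:
        image_path = line.strip()
        if not image_path:
            continue
        data = read_hdfs_file(image_path)
        size, digest = compute_signature(data)
        # Emit tab-separated key/value records for the reducer.
        sys.stdout.write("%s\t%d\t%s\n" % (digest, size, image_path))


if __name__ == "__main__":
    main()
```

We'd launch something like this with the hadoop-streaming jar (exact jar location and flags depend on the Hadoop installation), with a reducer that aggregates the emitted records into our index.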
Some particular questions:
- Which crawler will handle crawling for images best?
- Which crawler will best adapt to a distributed crawling setup, where many servers crawl cooperatively?
Right now these look like the three best options:
- Nutch: Known to scale. It doesn't look like the best option because it seems to be tied closely to its own text-search software.
- Heritrix: Also scales. This one currently looks like the best option.
- Scrapy: I'm not sure it has been used on a large scale. I also don't know whether it has the basic features like URL canonicalization. I would like to use this one because it is a Python framework (I prefer Python to Java), but I don't know whether it implements the more advanced features a web crawler needs; a rough sketch of how we'd use it for images follows right after this list.
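For context on the Scrapy option, below is a minimal sketch of the image-only spider we'd hope to write, using Scrapy's built-in ImagesPipeline. The start URL, allowed domain, and storage path are placeholders, and the whole thing is just an illustration of the approach, not something we've tried at scale.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class ImageSpider(scrapy.Spider):
    """Collect <img> URLs from each page and keep following links.

    The built-in ImagesPipeline (enabled in the settings below) downloads
    everything yielded under 'image_urls' and writes it to IMAGES_STORE.
    allowed_domains and start_urls are placeholders for this example.
    """
    name = "images"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Hand every image URL on the page to the images pipeline.
        image_urls = [response.urljoin(src)
                      for src in response.css("img::attr(src)").getall()]
        if image_urls:
            yield {"image_urls": image_urls}

        # Follow links so the crawl keeps going.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        # Placeholder path; in practice this would point at storage we
        # later copy into HDFS.
        "IMAGES_STORE": "/tmp/crawled-images",
    })
    process.crawl(ImageSpider)
    process.start()
```

As far as I understand, the ImagesPipeline needs Pillow installed, and URL deduplication comes from Scrapy's request fingerprinting (which canonicalizes URLs via w3lib), but that is exactly the kind of thing I'd like confirmation on.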
Summary:
We need to fetch as many images from the web as possible. Which existing crawling framework is both scalable and efficient, and at the same time the easiest to modify so that it fetches only images?
Thanks!