Hi, I want to choose between Nutch and Heritrix for building a crawling framework for specific web sites. This is not an internet-wide crawl. I am not building a search index; I am interested in scraping specific pages from those web sites.

Could somebody please detail the pros and cons of each? Thanks, Nayn

A: 

Your main task is to scrape specific pages from the web site.

Nutch: open-source web-search software, built on Lucene Java.

Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

So I think Heritrix is much better than Nutch for your project, since Nutch is geared towards building a search index, which you do not need.

Learning a framework or library is a valuable exercise, but it takes some time. Since your task is not a very complex one, it may be less painful to write a simple crawler from scratch in Java.

Upul
As I mentioned, I am not interested in creating an index of pages. I want to perform a directed crawl (i.e. specify which links, by regex, to follow at each depth) and cache the pages from the last level. Then I would run scraping on the cached pages to extract the data I am interested in. I don't need (or want) to crawl the complete web site.
Nayn
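The directed crawl described in the comment above could be sketched roughly as follows. This is only an illustration under several assumptions, not anything from Nutch, Heritrix, or Bixo: the class name `DirectedCrawler`, the `crawl` method, and the `fetcher` parameter are all hypothetical, links are assumed to be absolute URLs, and links are pulled out with a naive regex rather than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any crawler library.
class DirectedCrawler {
    // Naive href extractor (assumption: good enough for simple pages;
    // a real crawler would use an HTML parser and resolve relative URLs).
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * @param seed      start URL (depth 0)
     * @param linkRules linkRules.get(d) decides which links found on
     *                  depth-d pages are followed to depth d+1
     * @param fetcher   maps a URL to its HTML, or null if the fetch failed
     * @return the pages of the last level only, keyed by URL
     */
    static Map<String, String> crawl(String seed,
                                     List<Pattern> linkRules,
                                     Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        visited.add(seed);
        List<String> frontier = new ArrayList<>();
        frontier.add(seed);

        // One pass per depth: keep only links matching that depth's regex.
        for (Pattern rule : linkRules) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                String html = fetcher.apply(url);
                if (html == null) {
                    continue; // skip pages that could not be fetched
                }
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    String link = m.group(1);
                    if (rule.matcher(link).matches() && visited.add(link)) {
                        next.add(link);
                    }
                }
            }
            frontier = next;
        }

        // Cache only the pages reached at the last level.
        Map<String, String> cache = new LinkedHashMap<>();
        for (String url : frontier) {
            String html = fetcher.apply(url);
            if (html != null) {
                cache.put(url, html);
            }
        }
        return cache;
    }
}
```

Passing the fetcher in as a function keeps the crawl logic separate from the HTTP code, so it can be tried against an in-memory map of pages before pointing it at a real site. The scraping step would then run over the values of the returned map.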
I am accepting this answer since no one else came up with anything. However, I am moving away from both Nutch and Heritrix and using Bixo for my use case. Thanks.
Nayn