
Hi, I have seen how Nutch and Heritrix crawl. Both use generate/fetch/update cycles that start from a set of seed URLs and iterate over the URLs discovered in each fetch step.

The scoping/filtering logic works by applying regular expressions to the extracted URLs.

I want to do something more specific. Instead of extracting all URLs from a page, I'd rather fetch URLs selected by an XPath expression. My reasons:

- Not all URLs can be classified with a precise regular expression.
- I might miss URLs that fall outside a given regex.
- I might want to follow a 'Next Page' sequence as well.
- A specific crawl cycle might need different XPath-based filters at each depth.
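To make the idea concrete, here is a minimal sketch (outside Nutch/Heritrix, using Python's standard library) of what per-depth XPath link extraction could look like. The `DEPTH_XPATHS` rules, the CSS class names, and the page markup are all hypothetical; real HTML would also need a lenient parser rather than `xml.etree.ElementTree`, which only accepts well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-depth rules: at depth 0 follow category links,
# at depth 1 follow article links plus the 'Next Page' link.
# ElementTree supports only a limited XPath subset, but it covers
# attribute predicates like [@class='...'].
DEPTH_XPATHS = {
    0: [".//a[@class='category']"],
    1: [".//a[@class='article']", ".//a[@rel='next']"],
}

def extract_links(xhtml, depth):
    """Return hrefs selected by the XPath rules for this crawl depth."""
    root = ET.fromstring(xhtml)
    links = []
    for xpath in DEPTH_XPATHS.get(depth, []):
        for anchor in root.findall(xpath):
            href = anchor.get("href")
            if href:
                links.append(href)
    return links

page = """<html><body>
<a class="category" href="/cats">Cats</a>
<a class="article" href="/a1">Article</a>
<a rel="next" href="/page2">Next Page</a>
</body></html>"""

print(extract_links(page, 0))  # ['/cats']
print(extract_links(page, 1))  # ['/a1', '/page2']
```

The point of the depth key is that the frontier produced by one cycle is filtered by a different rule set in the next, which is exactly what a single regex-based URL filter cannot express.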

Has anybody done such a thing with Nutch or Heritrix?

Thanks, Nayn

A: 

I tried to build a POC with both of these. I needed the outlinks to start the next phase of the crawl with a different set of rules. With Heritrix, there is no way to retain the outlinks on the last hop, since they are all discarded. With Nutch, there is no way to plug in my own scraper that doesn't return outlinks and the other fields required by its internal data structures such as ParseData. Moreover, it is tightly coupled to Lucene and the related indexing system. Thanks, Nayn
