Hi, I have looked at the way Nutch and Heritrix crawl. They both use generate/fetch/update cycles that start with some seed URLs and iterate over the URLs discovered after each fetch step.
The scoping/filtering logic works on regular expressions applied to the extracted URLs.
I want to do something more specific. Instead of extracting all URLs from a page, I'd rather select the URLs to fetch based on an XPath expression. My reasons:
- Not all URLs can be captured with a precise regular expression.
- I might miss some URLs that fall outside a given regex.
- I might want to follow a 'Next Page' sequence as well.
- A specific crawl cycle might need a different XPath-based filter at each depth.
A rough sketch of what I have in mind is below.
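To make it concrete, here is a hypothetical sketch of the kind of extractor I mean, using the standard javax.xml.xpath API on an already-parsed DOM. The class name XPathOutlinkExtractor and the example expressions are made up for illustration, not anything Nutch or Heritrix already provides; the idea would be to plug something like this in where the crawler normally extracts all outlinks, swapping the XPath expression per depth:

```java
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls outlinks out of an already-parsed DOM
// using whatever XPath expression the current crawl depth calls for.
public class XPathOutlinkExtractor {

    private final XPath xpath = XPathFactory.newInstance().newXPath();

    // pageDom: the parsed page (e.g. the DOM a parser plugin hands over)
    // expression: depth-specific selector, e.g. "//div[@id='results']//a/@href"
    // at depth 1, or "//a[contains(., 'Next Page')]/@href" to follow pagination
    public List<String> extractLinks(Node pageDom, String expression)
            throws XPathExpressionException {
        NodeList matches = (NodeList) xpath.evaluate(expression, pageDom, XPathConstants.NODESET);
        List<String> urls = new ArrayList<String>();
        for (int i = 0; i < matches.getLength(); i++) {
            // each match is an href attribute node; its value is the URL to enqueue
            urls.add(matches.item(i).getNodeValue().trim());
        }
        return urls;
    }
}
```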
Has anybody done such a thing with Nutch or Heritrix?
Thanks, Nayn