Hi, I have looked at the way Nutch and Heritrix crawl. They both use generate/fetch/update cycles that start with some seed URLs and iterate over the URLs discovered after each fetch step.
The scoping/filtering logic works on regular expressions applied to the extracted URLs.
I want to do something more specific. Instead of extracting all URLs from a page, I'd rather select the URLs to fetch based on an XPath expression. My reasons:
- Not all URLs can be captured with a precise regular expression.
- I might miss some URLs that fall outside a given regex.
- I might want to follow a 'Next Page' sequence as well.
- A specific crawl cycle might need a different XPath-based filter at each depth.
A rough sketch of what I have in mind is below.
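To make it concrete, here is a hypothetical sketch of the kind of extractor I mean, using the standard javax.xml.xpath API on an already-parsed DOM. The class name XPathOutlinkExtractor and the example expressions are made up for illustration, not anything Nutch or Heritrix already provides; the idea would be to plug something like this in where the crawler normally extracts all outlinks, swapping the XPath expression per depth:

```java
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls outlinks out of an already-parsed DOM
// using whatever XPath expression the current crawl depth calls for.
public class XPathOutlinkExtractor {

    private final XPath xpath = XPathFactory.newInstance().newXPath();

    // pageDom: the parsed page (e.g. the DOM a parser plugin hands over)
    // expression: depth-specific selector, e.g. "//div[@id='results']//a/@href"
    // at depth 1, or "//a[contains(., 'Next Page')]/@href" to follow pagination
    public List<String> extractLinks(Node pageDom, String expression)
            throws XPathExpressionException {
        NodeList matches = (NodeList) xpath.evaluate(expression, pageDom, XPathConstants.NODESET);
        List<String> urls = new ArrayList<String>();
        for (int i = 0; i < matches.getLength(); i++) {
            // each match is an href attribute node; its value is the URL to enqueue
            urls.add(matches.item(i).getNodeValue().trim());
        }
        return urls;
    }
}
```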
Has anybody done such a thing with Nutch or Heritrix?
Thanks, Nayn