Hi, I want to write my own HTML parser plugin for Nutch. I am doing focused crawling by generating outlinks that fall only under a specific XPath. In my use case, I want to extract different data from the HTML pages depending on the current depth of the crawl, so I need to know the current depth in the HtmlParser plugin for each piece of content I parse.

Is this possible with Nutch? I see that CrawlDatum does not carry crawl-depth information. I was thinking of keeping a map of this information in another data structure. Does anybody have a better idea?

Thanks

A: 

With Nutch, "depth" represents the number of generate/fetch/update cycles run successively. For example, if you are at depth 4, you are in the fourth cycle. When you say that you want to go no further than depth 10, you mean that you want to stop after 10 cycles.

Within each cycle, the number of previous cycles run before it (the "depth") is unknown. That information is not kept.

If you have your own version of Crawl.java, you could keep track of the current "depth" and pass that information to your HTML parser plugin.
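
For illustration, here is a minimal sketch of such a driver loop, with the actual generate/fetch/update calls elided. The class name and the crawl.depth property are examples chosen here, not Nutch API:

import org.apache.hadoop.conf.Configuration;

public class CrawlLoopSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    int maxDepth = 10;  // stop after 10 generate/fetch/update cycles
    for (int i = 0; i < maxDepth; i++) {
      // Record the current cycle number so every component created
      // from this conf (fetcher, parser plugins, ...) can read it.
      conf.setInt("crawl.depth", i + 1);
      // ... run this cycle's generate/fetch/update steps with conf ...
    }
  }
}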

Pascal Dimassimo
That's precisely what I am doing: writing my own version of Crawl.java. But the depth information has to propagate through ParseSegment.parse, which runs the job on the Hadoop cluster with only the Content directory as input. I don't want to change ParseSegment since it is internal to Nutch. Is there any other way out of this?
Nayn
I was thinking of writing this depth information to an HDFS file and reading it back in each job at the plugin level, but that would add extra I/O overhead.
Nayn
OK, now I understand. My HTML parser is not a real Nutch plugin. It is just a Java class that is called from my Crawler.java, so I can pass it all the info I need. Maybe you could do something similar?
Pascal Dimassimo
That would mean breaking the MapReduce programming model and making the whole system slow and unscalable.
Nayn
absolutely... :) but my application doesn't need it...
Pascal Dimassimo
A: 

Crawl.java has a NutchConfiguration object, which is passed to all the components when they are initialized. I set the property for the crawl depth before creating the new Fetcher:

conf.setInt("crawl.depth", i+1);
new Fetcher(conf).fetch(segs[0], threads,
          org.apache.nutch.fetcher.Fetcher.isParsing(conf));  // fetch it
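
Because the Configuration used to create the Fetcher is serialized into the submitted Hadoop job, the parse tasks running on the cluster see the property as well, so nothing has to be passed around outside the normal MapReduce job setup.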

The HtmlParser plugin can access it as below:

LOG.info("Current depth: " + getConf().getInt("crawl.depth", -1));
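
For completeness, a minimal skeleton of the plugin side, assuming the plugin follows Hadoop's Configurable contract (which Nutch plugins do via their setConf hook). DepthAwareParser and parseContent are hypothetical names used only for illustration:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

// Hypothetical skeleton; a real Nutch parser plugin also implements
// the Parser interface (getParse, etc.), omitted here for brevity.
public class DepthAwareParser implements Configurable {
  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  public void parseContent() {
    int depth = getConf().getInt("crawl.depth", -1);  // -1 if never set
    // select XPath rules / extraction logic based on depth here
  }
}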

This doesn't force me to break the MapReduce model. Thanks.

Nayn
+1 That makes sense.
Pascal Dimassimo