Hi, I want to write my own HTML parser plugin for Nutch. I am doing focused crawling by generating outlinks that fall only under a specific XPath. In my use case, I want to extract different data from the HTML pages depending on the current depth of the crawl, so I need to know the current depth in the HtmlParser plugin for each piece of content I parse.

Is this possible with Nutch? I see that CrawlDatum does not carry crawl-depth information. I was thinking of keeping a map of this information in another data structure. Does anybody have a better idea?

Thanks

A: 

With Nutch, "depth" represents the number of generate/fetch/update cycles run successively. For example, if you are at depth 4, you are in the fourth cycle. When you say that you want to go no further than depth 10, you mean that you want to stop after 10 cycles.

Within each cycle, the number of previous cycles run before it (the "depth") is unknown. That information is not kept.

If you have your own version of Crawl.java, you could keep track of the current "depth" and pass that information to your HTML parser plugin.
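
For illustration, here is a minimal sketch of such a driver loop, with the actual generate/fetch/update calls elided. The class name and the crawl.depth property are examples chosen here, not Nutch API:

import org.apache.hadoop.conf.Configuration;

public class CrawlLoopSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    int maxDepth = 10;  // stop after 10 generate/fetch/update cycles
    for (int i = 0; i < maxDepth; i++) {
      // Record the current cycle number so every component created
      // from this conf (fetcher, parser plugins, ...) can read it.
      conf.setInt("crawl.depth", i + 1);
      // ... run this cycle's generate/fetch/update steps with conf ...
    }
  }
}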

Pascal Dimassimo
That's precisely what I am doing: writing my own version of Crawl.java. But the depth information has to propagate through ParseSegment.parse, which runs the job on the Hadoop cluster with only the Content directory as input. I don't want to change ParseSegment since it is internal to Nutch. Is there any other way out of this?
Nayn
I was thinking of writing this depth information to an HDFS file and reading it back in each job at the plugin level, but that would add extra I/O overhead.
Nayn
OK, now I understand. My HTML parser is not a real Nutch plugin. It is just a Java class that is called from my Crawler.java, so I can pass it all the info I need. Maybe you could do something similar?
Pascal Dimassimo
That would mean breaking the MapReduce programming model and making the whole system slow and unscalable.
Nayn
absolutely... :) but my application doesn't need it...
Pascal Dimassimo
A: 

Crawl.java has a NutchConfiguration object, which is passed to all the components when they are initialized. I set the property for the crawl depth before creating the new Fetcher:

conf.setInt("crawl.depth", i+1);
new Fetcher(conf).fetch(segs[0], threads,
          org.apache.nutch.fetcher.Fetcher.isParsing(conf));  // fetch it
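
Because the Configuration used to create the Fetcher is serialized into the submitted Hadoop job, the parse tasks running on the cluster see the property as well, so nothing has to be passed around outside the normal MapReduce job setup.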

The HtmlParser plugin can access it as below:

LOG.info("Current depth: " + getConf().getInt("crawl.depth", -1));
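
For completeness, a minimal skeleton of the plugin side, assuming the plugin follows Hadoop's Configurable contract (which Nutch plugins do via their setConf hook). DepthAwareParser and parseContent are hypothetical names used only for illustration:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

// Hypothetical skeleton; a real Nutch parser plugin also implements
// the Parser interface (getParse, etc.), omitted here for brevity.
public class DepthAwareParser implements Configurable {
  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  public void parseContent() {
    int depth = getConf().getInt("crawl.depth", -1);  // -1 if never set
    // select XPath rules / extraction logic based on depth here
  }
}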

This doesn't force me to break the MapReduce model. Thanks.

Nayn
+1 That makes sense.
Pascal Dimassimo