I am currently trying to write a custom plugin for Nutch 1.0. The plugin is supposed to parse HTML data and extract the relevant information from documents. I have a basic plugin working: it extends the HtmlParserResult object and is executed each time I run a parse.

My problems are twofold at the moment:

  1. I do not understand the workflow/pipeline of Nutch parsing well enough, and I cannot find information about it on the Nutch site.

  2. I do not understand how the DOM parsing is done. I see that Nutch has a set of DOM objects and that the HtmlParser plugin does some DOM parsing, but I have not figured out how this is best done.