Parsing html data with nutch 1.0 and a custom plugin | ansaurus

I am currently trying to write a custom plugin for Nutch 1.0. The plugin is supposed to parse HTML data and extract the relevant information from documents. I have a basic plugin working: it extends the HtmlParserResult object and is executed each time I run a parse.

I have two problems at the moment:

1. I do not understand the workflow/pipeline of Nutch parsing well enough, and I cannot find information about it on the Nutch site.
2. I do not understand how the DOM parsing is done. I can see that Nutch has a set of DOM objects and that the HtmlParser plugin does some DOM parsing, but I have not figured out how this is best done.
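
To make the second point concrete, here is a minimal sketch of the kind of DOM walk I have in mind. It assumes (I have not confirmed this against the Nutch 1.0 source) that a parse plugin is handed the page as a standard org.w3c.dom.DocumentFragment. The class name DomWalkSketch and the helpers parseFragment and collectText are hypothetical names of my own, and the code uses only the JDK, not any Nutch class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: not part of Nutch, only standard W3C DOM calls.
class DomWalkSketch {

    // Stand-in for whatever the HTML parser would hand to a plugin:
    // builds a DocumentFragment from well-formed markup with the JDK parser.
    static DocumentFragment parseFragment(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            DocumentFragment fragment = doc.createDocumentFragment();
            // appendChild moves the root element out of the document
            // and into the fragment.
            fragment.appendChild(doc.getDocumentElement());
            return fragment;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Collects the trimmed text of every element named tagName below root.
    static List<String> collectText(Node root, String tagName) {
        List<String> out = new ArrayList<>();
        walk(root, tagName, out);
        return out;
    }

    private static void walk(Node node, String tagName, List<String> out) {
        // Compare case-insensitively, since HTML parsers may report
        // element names in upper case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            out.add(node.getTextContent().trim());
            return; // do not descend into an element that already matched
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), tagName, out);
        }
    }

    public static void main(String[] args) {
        DocumentFragment fragment = parseFragment(
            "<div><h1>Title</h1><p>one</p><p>two <b>bold</b></p></div>");
        System.out.println(collectText(fragment, "p"));
    }
}
```

Inside a real plugin the fragment would come from Nutch rather than from parseFragment. What I am unsure about is whether a recursive walk like this is the recommended approach or whether Nutch's own DOM utilities are meant to be used instead.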
