I would like to start parsing large numbers of raw HTML pages into semantic data structures.

I'm interested in the community's opinion on the tools available for such a task, particularly useful libraries in any language.

So far, I'm planning to use Hadoop to manage much of the processing, but I'm curious about alternatives.
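
To make the task concrete, here is a minimal sketch of the per-page step, using only Python's standard-library `html.parser`. It assumes that "semantic data structure" means something as simple as a page title plus its outgoing links; the names `PageExtractor` and `extract_page` and the output schema are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the page title and link targets (hypothetical schema)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; keep only real hrefs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract_page(html):
    """Turn one raw HTML string into a small dict (assumed output format)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}
```

Each page is handled independently here, so a function like this could in principle serve as the map step of a Hadoop job or be run from a plain worker pool; which of those fits better is exactly what I'm unsure about.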