I'm looking for information extraction libraries that can handle semi-structured data where some of the information may be hidden or incomplete. I want to train some classifiers to pull out content based on the structure.

I'm working on building a tool where I can select text in the browser, and it will generate (via some web service call) a classifier that can be used on other documents to pull out text.

I'm primarily looking at how the structure of the document can be used to indicate what the content is.

A:

It sounds like you're looking for some kind of HTML parser generator. There was a web service (whose name I can't recall) that would let you select areas on a page and generate XPath parsing rules, but I'm not sure how well it worked, or even whether it still exists.

Generally, if you can write code, it's easiest to just write a parser yourself. I recommend BeautifulSoup or lxml.
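For illustration, a minimal hand-written extraction with lxml might look like the sketch below; the URL and the XPath expression are placeholders, not anything from a real site:

    from lxml import html

    # Parse a page and pull out content with a hand-written XPath rule.
    # Both the URL and the expression are illustrative placeholders.
    tree = html.parse("http://example.com/article")
    headline = tree.xpath("//div[@class='post']/h1/text()")
    print(headline)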

Jacob
Well, writing one parser is fairly straightforward; writing a thousand parsers and maintaining them is another matter.
MathGladiator
Yes, 1000 parsers would suck. So then I'd recommend having the browser tool generate an XPath extraction expression for each website, plus a generic parser engine that uses those expressions to extract the content (see the sketch below). But you'll still have a maintenance problem, since websites will update their structure without informing you.
Jacob
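A minimal sketch of the XPath-driven engine Jacob describes, assuming the per-site rules live in a simple lookup table (the hostnames, expressions, and the extract helper are all hypothetical):

    from urllib.parse import urlparse
    from lxml import html

    # Hypothetical per-site rules: hostname -> XPath expression.
    # In the scheme above, the browser tool would generate and store
    # one of these entries for each website.
    RULES = {
        "example.com": "//div[@id='content']//p/text()",
        "news.example.org": "//article//h2/text()",
    }

    def extract(url):
        """Apply the XPath rule registered for this URL's host."""
        rule = RULES.get(urlparse(url).netloc)
        if rule is None:
            raise KeyError("no extraction rule for this site")
        return html.parse(url).xpath(rule)

    # e.g. extract("http://example.com/some-article")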