I'm looking for information extraction libraries where I can have semi structured information that may have either hidden or incomplete data. I want to train some classifiers to pull out content based on the structure.
I'm working on building a tool where I can select text in the browser, and it will generate (via some web service call) a classifier that can be used on other documents to pull out text.
I'm primarily looking at how the structure of the document can be used to indicate what the content is.