I've got a set of documents which have a semi-regular format. Rows are typically separated by newline characters, and the main components of each row are separated by spaces. Some examples are a set of furniture assembly instructions, a set of tables of contents, a set of recipes, and a set of bank statements.

The problem is that each specimen in each set differs from its peers in ways that make regex parsing infeasible: the quantity of an item may come before or after the item name, the same item may have different names across specimens, expository text or notes may appear between rows, and so on.

I've used classifiers (neural nets, Bayesian classifiers, genetic algorithms, and genetic programming) to deal with whole documents or data sets, but not to extract items from documents and classify them within a context. Can this be done? Is there a more feasible approach?

+2  A: 

If your data has structure, arguably you can use a grammar to describe some of that structure. (Classically, you use a grammar to recognize what it can, which is often too much, and then use extra-grammatical checks to prune away what the grammar cannot eliminate.)

If you use a parser that can carry multiple potential parses in parallel, eliminating each parse as it becomes infeasible, you can handle the different orderings straightforwardly. (A GLR parser can do this nicely.)

Imagine you have NUMBERS describing amounts, NOUNS describing various objects, and VERBS for actions. Then a grammar that can accept varying orders of items might be:

 G = SENTENCE '.' ;
 SENTENCE = VERB NOUN NUMBER ;
 SENTENCE = NOUN VERB NUMBER ;
 SENTENCE = VERB NUMBER NOUN ;
 VERB = 'ORDER' | 'ORDERED' | 'SAW' ;
 NUMBER = '1' | '2' | '10' ;
 NOUN = 'JOE' | 'TABLE' | 'SAW' ;

This sample is extremely simple, but it will handle:

 JOE ORDERED 10.
 JOE SAW 1.
 ORDER 2 SAW.
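
As a sanity check, here is a minimal, library-free Python sketch (my own illustration, not from the answer; the token classes and production list simply mirror the toy grammar above). Trying every SENTENCE production against the input mimics, in miniature, how a GLR parser carries several live parses and drops the infeasible ones:

 # Token classes from the toy grammar above.
 VERB = {'ORDER', 'ORDERED', 'SAW'}
 NOUN = {'JOE', 'TABLE', 'SAW'}
 NUMBER = {'1', '2', '10'}

 # Each SENTENCE production is an ordered list of token classes.
 PRODUCTIONS = [
     [VERB, NOUN, NUMBER],   # SENTENCE = VERB NOUN NUMBER
     [NOUN, VERB, NUMBER],   # SENTENCE = NOUN VERB NUMBER
     [VERB, NUMBER, NOUN],   # SENTENCE = VERB NUMBER NOUN
 ]

 def parses(sentence):
     """Return every production that matches the tokenized sentence."""
     tokens = sentence.rstrip('.').split()
     return [prod for prod in PRODUCTIONS
             if len(tokens) == len(prod)
             and all(tok in cls for tok, cls in zip(tokens, prod))]

 for s in ('JOE ORDERED 10.', 'JOE SAW 1.', 'ORDER 2 SAW.'):
     print(s, '->', len(parses(s)), 'parse(s)')  # each prints 1 parse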

It will also accept:

 SAW SAW 10.

You can eliminate this by adding an external constraint that actors must be people.
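
Continuing the sketch above, one way to bolt that constraint on is a post-parse filter. (PEOPLE here is a hypothetical lexicon of known actors, not anything from the answer.)

 # Extra-grammatical check layered on the parses() sketch above:
 # prune any parse whose actor slot (the leading NOUN in a
 # NOUN VERB NUMBER parse) is not a known person.
 PEOPLE = {'JOE'}

 def person_checked_parses(sentence):
     tokens = sentence.rstrip('.').split()
     kept = []
     for prod in parses(sentence):
         if prod[0] is NOUN and tokens[0] not in PEOPLE:
             continue  # actor is not a person: prune this parse
         kept.append(prod)
     return kept

 print(len(parses('SAW SAW 10.')))                 # 2 parses survive the grammar
 print(len(person_checked_parses('SAW SAW 10.')))  # 1 after the person check

Note that this particular check only prunes the NOUN VERB NUMBER reading; the imperative VERB-first reading of "SAW SAW 10." survives it, so a real system would layer further checks, for instance on which verbs can head a command.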

Ira Baxter