I need to parse and process a big set of semi-structured text (basically legal documents: law texts, addendums to them, treaties, judges' decisions, ...). The most fundamental thing I'm trying to do is extract information on how the subparts are structured: chapters, articles, subheadings, ... plus some metadata. My question is whether anyone can point me to starting points for this type of text processing. I'm sure there has been a lot of research into this, but what I find is mostly about either parsing something with a strict grammar (like code) or completely free-form text (like Google tries to do on web pages). I think if I got hold of the right keywords, I would have more success in Google and my journal databases. Thanks.
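For concreteness, the most naive first pass I can picture is something like the sketch below (purely illustrative, and nowhere near robust enough for real documents); I'm hoping the literature has more principled approaches than this:

    # Purely illustrative: tag lines that look like structural headings.
    # Real legal texts would need much richer patterns and error handling.
    import re

    HEADING = re.compile(r"^\s*(Chapter|Article|Section)\s+([IVXLCDM\d]+)",
                         re.MULTILINE)

    text = """Chapter I
    General Provisions

    Article 1
    This law regulates ..."""

    for match in HEADING.finditer(text):
        print(match.group(1), match.group(2))  # "Chapter I", then "Article 1"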

A: 

I've never done this before, but if I were going to, I'd definitely look into ANTLR. It's a pretty popular project and could very well have a port in your language of choice.

Will
+1  A: 

The Natural Language Toolkit (NLTK) may be an interesting starting point and has plenty of resources on all areas of natural language processing. It is probably more linguistically focused than you need, though.
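For a quick taste, here is about the simplest thing it does (a minimal sketch; it assumes you have fetched the "punkt" sentence model once with nltk.download('punkt')):

    import nltk

    text = ("The provisions of this chapter apply to all contracts. "
            "Exceptions are listed in Article 12.")

    # Sentence splitting as a first preprocessing step on legal prose.
    for sentence in nltk.sent_tokenize(text):
        print(sentence)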

The other option is to go for some parser-generator library (normally used for code) that is not so strict, i.e. one that allows you to ignore big chunks of text if needed. In Python I would recommend pyparsing. In another answer I showed a simple example of what it can do when you want to ignore arbitrary chunks of text.
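Roughly, the idea looks like this (a sketch only, not tested against real statutes; the heading pattern is made up for the example):

    from pyparsing import Literal, SkipTo, StringEnd, Word, nums

    # Illustrative grammar: an article heading, then skip everything
    # up to the next heading (or the end of the input).
    heading = Literal("Article") + Word(nums)("number")
    fragment = heading + SkipTo(heading | StringEnd())("body")

    sample = """Article 1
    All persons are equal before the law.

    Article 2
    Everyone is entitled to a fair hearing."""

    for match in fragment.searchString(sample):
        print("Article", match.number, "->", match.body.strip()[:40])

In practice you would want to anchor headings to the start of a line (pyparsing has LineStart for this), otherwise an in-text cross-reference like "see Article 1" gets mistaken for a heading.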

David Raznick