I am parsing unstructured documents into a structured representation (XML) using a template to describe the intended result. A simple typical problem might be a list of strings:
"Chapter 1"
"Section background"
"this is something"
"this is another"
"Section methods"
"take some xxx"
"do yyy"
"and some..."
"Chapter apparatus"
"we created..."
which I wish to transform to:
<div role="CHAPTER" title="1">
<div role="SECTION" title="background">
<p>this is something</p>
<p>this is another</p>
</div>
<div role="SECTION" title="methods">
<p>take some xxx</p>
<p>do yyy</p>
<p>and some...</p>
</div>
</div>
<div role="CHAPTER" title="apparatus">
<div role="SECTION" title="???">
<p>we created...</p>
</div>
</div>
The labels CHAPTER and SECTION are not present in the strings; they are generated by heuristic regexes (e.g. "[Cc]hap(ter)?(\s\d+\.)?.*") that are applied to all strings.
The intended result is described by a "template" which currently looks something like:
<template count="0," role="CHAPTER">
<regex>[Cc]hap(ter)?(\s+.*)</regex>
<template count="0," role="SECTION">
<regex>[Ss]ec(tion)?(\s+.*)</regex>
<template count="0," role="p">
<regex>.*</regex>
</template>
</template>
</template>
(In some cases counts can be ranges, e.g. 2,4).
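To make the intended behaviour concrete, here is a minimal Java sketch of how such a template could drive the nesting with a stack. It is only an illustration, not my existing code: the class and method names (TemplateNester, toXml, openDiv, escape) are hypothetical, the template is hard-coded as two arrays rather than read from the XML above, and the count attributes are not enforced. It assumes that a line matching no heading regex is a paragraph, and that a skipped level is filled with an implied div titled "???", as in the expected output.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateNester {
    // Hypothetical flattened form of the template: roles ordered outermost first.
    static final String[] ROLES = { "CHAPTER", "SECTION" };
    static final Pattern[] PATTERNS = {
        Pattern.compile("[Cc]hap(ter)?(\\s+.*)"),
        Pattern.compile("[Ss]ec(tion)?(\\s+.*)")
    };

    static String toXml(List<String> lines) {
        StringBuilder out = new StringBuilder();
        // Levels of the currently open divs, innermost on top.
        // Always contiguous (0, 1, ...), so its size is the next level to open.
        Deque<Integer> open = new ArrayDeque<>();
        for (String line : lines) {
            int level = ROLES.length; // default: a paragraph, below every div
            String title = null;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(line);
                if (m.matches()) {
                    level = i;
                    title = m.group(2).trim(); // group 2 holds the text after the keyword
                    break;
                }
            }
            // Close every open div at the same or a deeper level.
            while (!open.isEmpty() && open.peek() >= level) {
                open.pop();
                out.append("</div>\n");
            }
            // Open any ancestors the input skipped, with a placeholder title.
            for (int missing = open.size(); missing < level; missing++) {
                openDiv(out, open, missing, "???");
            }
            if (level < ROLES.length) {
                openDiv(out, open, level, title);
            } else {
                out.append("<p>").append(escape(line)).append("</p>\n");
            }
        }
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            open.pop();
            out.append("</div>\n");
        }
        return out.toString();
    }

    static void openDiv(StringBuilder out, Deque<Integer> open, int level, String title) {
        out.append("<div role=\"").append(ROLES[level]).append("\" title=\"")
           .append(escape(title)).append("\">\n");
        open.push(level);
    }

    // Minimal XML escaping for text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.print(toXml(List.of(
            "Chapter 1", "Section background", "this is something",
            "Section methods", "take some xxx",
            "Chapter apparatus", "we created...")));
    }
}
```

This handles strict nesting only. Flatter documents with many markup types, count ranges such as 2,4, and recovery from lines that match the wrong regex would all need more than a single stack, which is part of why I am asking about toolkits.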
I know this is a very hard problem (SGML attempted to tackle parts of it) and that real documents do not conform tidily to such templates, so I am prepared for partial parses and to lose some precision and recall.
For some years I have used my own code, which works for documents up to a few megabytes across a range of types. Performance is not an issue. I have different templates for different document types (theses, logfiles, Fortran output, etc.). Some documents have a nested structure (e.g. as above) while others are flatter but have many more types of markup.
I am now refactoring this and wonder:
- is there an open-source toolkit that addresses this problem? (preferably Java)
- if not, can I use an XSLT2 grouping strategy combined with regular expressions?
- or should I use an automaton? If so, should I use a toolkit or write my own?
EDIT: @naspinski and generally. It will always be possible to write specific scripting code to solve particular problems. I want a general solution, as I may be parsing many (even millions of) documents with considerable (but not infinite) variability in structure. I want the structure of the parsed documents to be expressed in XML, not script. I believe that it will be easier to add new solutions through declarative templates than through scripts.
EDIT: I am almost certain that my best approach now is to use ANTLR. It is a powerful tool which, from my initial explorations, can parse lines and groups of lines.