I am parsing unstructured documents into a structured representation (XML) using a template to describe the intended result. A simple typical problem might be a list of strings:

"Chapter 1"
"Section background"
"this is something"
"this is another"
"Section methods"
"take some xxx"
"do yyy"
"and some..."
"Chapter apparatus"
"we created..."

which I wish to transform to:

<div role="CHAPTER" title="1">
  <div role="SECTION" title="background">
    <p>this is something</p>
    <p>this is another</p>
  </div>
  <div role="SECTION" title="methods">
    <p>take some xxx</p>
    <p>do yyy</p>
    <p>and some...</p>
  </div>
</div>
<div role="CHAPTER" title="apparatus">
  <div role="SECTION" title="???">
    <p>we created...</p>
  </div>
</div>

The labels CHAPTER and SECTION are not present in the strings but are generated from heuristic regexes (e.g. "[Cc]hap(ter)?(\s\d+\.)?.*") and are applied to all strings.
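
To illustrate the labelling step, here is a minimal Python sketch. The CHAPTER regex is the one quoted above; the SECTION regex, the `LABEL_PATTERNS` table and the `label_line` helper are assumptions made up for this example, not part of my actual code.

```python
import re

# Hypothetical label table: (label, regex), tried in order.
# The CHAPTER pattern is the one quoted above; SECTION is an assumed analogue.
LABEL_PATTERNS = [
    ("CHAPTER", re.compile(r"[Cc]hap(ter)?(\s\d+\.)?.*")),
    ("SECTION", re.compile(r"[Ss]ec(tion)?(\s\d+\.)?.*")),
]

def label_line(line, default="p"):
    """Return the first label whose regex matches the whole line."""
    for label, pattern in LABEL_PATTERNS:
        if pattern.fullmatch(line):
            return label
    return default  # anything unrecognised is treated as a paragraph

lines = ["Chapter 1", "Section background", "this is something"]
labels = [(label_line(s), s) for s in lines]
```

Because the patterns are heuristic, a paragraph that happens to start with "Sec" (e.g. "Secondly, ...") would also be labelled SECTION, which is one source of the lost precision mentioned below.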

The intended result is described by a "template" which currently looks something like:

<template count="0," role="CHAPTER">
  <regex>[Cc]hap(ter)?(\s+.*)</regex>
  <template count="0," role="SECTION">
    <regex>[Ss]ec(tion)?(\s+.*)</regex>
    <template count="0," role="p">
      <regex>.*</regex>
    </template>
  </template>
</template>

(In some cases counts can be ranges, e.g. 2,4).
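
As a rough sketch of how such a template could be read into memory, the Python below loads it with the standard `xml.etree.ElementTree` module. The helper names (`parse_count`, `load_template`) and the dict layout are invented for this example. Two assumptions about the count syntax: "0," means "zero or more" (no upper bound), and a bare number such as "3" means exactly that many.

```python
import xml.etree.ElementTree as ET

TEMPLATE = r"""
<template count="0," role="CHAPTER">
  <regex>[Cc]hap(ter)?(\s+.*)</regex>
  <template count="0," role="SECTION">
    <regex>[Ss]ec(tion)?(\s+.*)</regex>
    <template count="0," role="p">
      <regex>.*</regex>
    </template>
  </template>
</template>
"""

def parse_count(text):
    """Turn '0,' into (0, None), '2,4' into (2, 4) and '3' into (3, 3)."""
    low, comma, high = text.partition(",")
    low = int(low)
    if not comma:
        return low, low          # assumed: a bare number is an exact count
    return low, (int(high) if high.strip() else None)  # None = unbounded

def load_template(elem):
    """Recursively convert a <template> element into a plain dict."""
    return {
        "role": elem.get("role"),
        "count": parse_count(elem.get("count", "0,")),
        "regex": elem.findtext("regex"),
        "children": [load_template(child) for child in elem.findall("template")],
    }

tree = load_template(ET.fromstring(TEMPLATE.strip()))
```

The resulting `tree` is plain data, so a generic matcher can walk it without knowing anything about chapters or sections.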

I know this is a very hard problem (SGML attempted to tackle parts of it) and that real documents do not conform tidily to such templates, so I am prepared for partial parses and to lose some precision and recall.

For some years I have used my own working code which works for documents up to a few megabytes over a range of types. Performance is not an issue. I have different templates for different document types (theses, logfiles, fortran output, etc.). Some documents have a nested structure (e.g. as above) while others are flatter but have many more types of markup.

I am now refactoring this and wonder:

  • is there an open-source toolkit that addresses this problem (preferably in Java)?
  • if not, can I use the XSLT 2.0 grouping strategy combined with regular expressions?
  • or should I use an automaton? If so, should I use a toolkit or write my own?

EDIT: @naspinski, and in general: it will always be possible to write specific scripting code to solve particular problems. I want a general solution, as I may be parsing many (even millions of) documents with considerable (but not infinite) variability in structure. I want the structure of the parsed documents to be expressed in XML, not in a script. I believe that it will be easier to add new solutions through (declarative) templates than through scripts.

EDIT: I am almost certain that my best approach now is to use ANTLR. It is a powerful tool which, from my initial explorations, can parse lines and groups of lines.

A: 

Going from unstructured to structured is going to require writing some type of parser on your part, which is simple enough. Scan for the first regular expression, extract the data, and emit an XML element for it. Then scan for the second regular expression, extract its data, and emit it within the first XML element you made. Then check each remaining line of input against the FIRST regular expression: if it does not match, add it to the second element you made; otherwise restart with a new upper-level element. Proceed to EOF, and save the resulting XML.
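
A minimal Python sketch of that loop, generalised to any number of nesting levels, might look like the following. The `LEVELS` table, the function names and the "???" placeholder for a missing parent are assumptions for illustration; the regexes are adapted from the template in the question.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Outermost level first. This table is the only document-specific part;
# the patterns are adapted from the template in the question.
LEVELS = [
    ("CHAPTER", re.compile(r"[Cc]hap(?:ter)?\s+(.*)")),
    ("SECTION", re.compile(r"[Ss]ec(?:tion)?\s+(.*)")),
]

def to_xml(lines):
    out = []
    depth = 0  # number of <div> elements currently open

    def open_div(title):
        nonlocal depth
        role = LEVELS[depth][0]
        out.append("  " * depth + f"<div role={quoteattr(role)} title={quoteattr(title)}>")
        depth += 1

    def close_down_to(level):
        nonlocal depth
        while depth > level:
            depth -= 1
            out.append("  " * depth + "</div>")

    def open_up_to(level):
        # Create any missing ancestors with a placeholder title.
        while depth < level:
            open_div("???")

    for line in lines:
        for level, (_, pattern) in enumerate(LEVELS):
            match = pattern.fullmatch(line)
            if match:
                close_down_to(level)   # end deeper and sibling elements
                open_up_to(level)
                open_div(match.group(1))
                break
        else:
            # No heading matched: the line is a paragraph at the deepest level.
            open_up_to(len(LEVELS))
            out.append("  " * depth + f"<p>{escape(line)}</p>")
    close_down_to(0)
    return "\n".join(out)

lines = [
    "Chapter 1", "Section background", "this is something", "this is another",
    "Section methods", "take some xxx", "do yyy", "and some...",
    "Chapter apparatus", "we created...",
]
result = to_xml(lines)
```

Since the levels live in a data table rather than in the control flow, the same loop could in principle be fed from an external template instead of being edited per document type.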

Walt Stoneburner
+1  A: 

This is the sort of job Perl was made for.

#! /opt/perl/bin/perl
use strict;
use warnings;
use 5.10.1;

{
  package My::Full;
  use Moose;
  use MooseX::Method::Signatures;

  has 'chapters' => (
    'is' => 'rw',
    'isa' => 'ArrayRef[My::Chapter]',
    'default' => sub{[]}
  );

  method add_chapter( Str $name ){
    my $chapter = My::Chapter->new( name => "$name" );
    push @{$self->chapters}, $chapter;
    return $chapter;
  }

  method latest(){
    return $self->add_chapter('') unless @{$self->chapters};
    return $self->chapters->[-1];
  }

  method add_section( Str $name ){
    my $latest_chapter = $self->latest;
    $latest_chapter->add_section("$name");
  }

  method add_line( Str $line ){
    $self->latest->add_line( "$line" );
  }

  method xml(){
    my $out = '';
    for my $chapter ( @{ $self->chapters } ){
      $out .= $chapter->xml;
    }
    return $out;
  }
}
{
  package My::Chapter;
  use Moose;
  use MooseX::Method::Signatures;

  has 'name' => (
    'is' => 'rw',
    'isa' => 'Str',
    'required' => 1
  );

  has 'sections' => (
    'is' => 'rw',
    'isa' => 'ArrayRef[My::Section]',
    'default' => sub{[]}
  );

  method latest(){
    return $self->add_section('') unless @{$self->sections};
    return $self->sections->[-1];
  }

  method add_section( Str $name ){
    my $section = My::Section->new(name => "$name");
    push @{$self->sections}, $section;
    return $section;
  }

  method add_line( Str $line ){
    $self->latest->add_line( "$line" );
  }

  method xml(){
    my $name = $self->name;
    $name = '???' unless length $name;

    my $out = qq'<div role="CHAPTER" title="$name">\n';
    for my $section ( @{ $self->sections } ){
      $out .= $section->xml;
    }
    return $out."</div>\n";
  }
}
{
  package My::Section;
  use Moose;
  use MooseX::Method::Signatures;

  has 'name' => (
    'is' => 'rw',
    'isa' => 'Str',
    'required' => 1
  );

  has 'lines' => (
    'is' => 'rw',
    'isa' => 'ArrayRef[Str]',
    'default' => sub{[]}
  );

  method add_line( Str $line ){
    push @{$self->lines}, "$line"
  }

  method xml(){
    my $name = $self->name;
    $name = '???' unless length $name;

    my $out = qq'  <div role="SECTION" title="$name">\n';
    for my $line ( @{ $self->lines } ){
      $out .= "    <p>$line</p>\n";
    }
    return $out."  </div>\n";
  }
}

The main loop:

my $full = My::Full->new;

while( my $line = <> ){
  chomp $line;

  given( $line ){
    when( /^chap(?:ter)?\s++(.+)/i ){
      $full->add_chapter($1);
    }
    when( /^sec(?:tion)?\s++(.+)/i ){
      $full->add_section($1);
    }
    default{
      $full->add_line($line);
    }
  }
}

say $full->xml

Sample output:

<div role="CHAPTER" title="check">
  <div role="SECTION" title="check">
    <p>this is something</p>
    <p>this is another</p>
  </div>
  <div role="SECTION" title="check">
    <p>take some xxx</p>
    <p>do yyy</p>
    <p>and some...</p>
  </div>
</div>
<div role="CHAPTER" title="check">
  <div role="SECTION" title="???">
    <p>we created...</p>
  </div>
</div>
Brad Gilbert
Thanks. This is a script specific to one particular type of document and contains document-specific functions (add_chapter). I am looking for a solution where the code does not need recompiling and the document is described in an external, declarative manner.
peter.murray.rust
A: 

I am fairly sure that the answer I am looking for is in ANTLR (http://www.antlr.org/). This allows me to write expressions of the form:

document : (chapter)+;
chapter : 'Chapter ' DIGIT NEWLINE line+;

and so on. It also allows embedding of code into these expressions.
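
To show what those two rules amount to, here is a hand-written recursive-descent equivalent in Python. This is not ANTLR and not generated by it, just a sketch of the same grammar. The `line` rule is not defined above, so I assume it means "any line that is not a chapter heading"; the function names and the returned dict layout are made up for the example.

```python
import re

# 'Chapter ' DIGIT, matched against one whole line (NEWLINE is implied
# by splitting the input into lines).
CHAPTER_HEAD = re.compile(r"Chapter (\d)")

def parse_chapter(lines, pos):
    """chapter : 'Chapter ' DIGIT NEWLINE line+ ;"""
    match = CHAPTER_HEAD.fullmatch(lines[pos]) if pos < len(lines) else None
    if match is None:
        raise SyntaxError(f"expected a chapter heading at line {pos + 1}")
    pos += 1
    body = []
    # line+ : assumed to be any line that is not itself a chapter heading
    while pos < len(lines) and not CHAPTER_HEAD.fullmatch(lines[pos]):
        body.append(lines[pos])
        pos += 1
    if not body:
        raise SyntaxError(f"chapter {match.group(1)} has no lines")
    return {"chapter": match.group(1), "lines": body}, pos

def parse_document(text):
    """document : (chapter)+ ;"""
    lines = text.splitlines()
    chapters = []
    pos = 0
    while True:
        chapter, pos = parse_chapter(lines, pos)
        chapters.append(chapter)
        if pos >= len(lines):
            return chapters

doc = parse_document("Chapter 1\nfoo\nbar\nChapter 2\nbaz")
```

Each function corresponds to one grammar rule, which is the structure a parser generator would produce from the grammar automatically.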

peter.murray.rust