ansaurus

Question

Answer 1

A:

Going from unstructured to structured text requires writing some kind of parser, which is simple enough here. Scan for the first regular expression, extract the data, and emit an XML element for it. Then scan for the second regular expression, extract its data, and emit it inside the first XML element you made. For all remaining input, check whether it matches the FIRST regular expression: if not, add it to the second element you made; otherwise restart with a new upper-level element. Proceed to EOF, and save the resulting XML.

Walt Stoneburner 2009-08-31 15:38:01
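
The steps in the answer above can be sketched in Perl, the language used elsewhere in this thread. This is a minimal sketch under stated assumptions, not the answerer's own code: the two patterns ($OUTER_RE, $INNER_RE), the element names (document, chapter, section, p), and the helpers to_xml and xml_escape are all hypothetical and chosen only for illustration. It uses core Perl only.

```perl
use strict;
use warnings;

# Assumed patterns: the answer does not say what the two expressions are.
my $OUTER_RE = qr/^chapter\s+(.+)/i;    # opens a new upper-level element
my $INNER_RE = qr/^section\s+(.+)/i;    # opens an element nested inside it

# Escape the characters that are special in XML text and attribute values.
sub xml_escape {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Convert a list of (already chomped) input lines to an XML string.
sub to_xml {
    my @lines = @_;
    my $out   = '';
    my ( $in_outer, $in_inner ) = ( 0, 0 );

    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip blank lines
        if ( $line =~ $OUTER_RE ) {
            # First expression matched: close what is open, start a new chapter.
            my $title = xml_escape($1);
            $out .= "  </section>\n" if $in_inner;
            $out .= "</chapter>\n"   if $in_outer;
            $out .= qq{<chapter title="$title">\n};
            ( $in_outer, $in_inner ) = ( 1, 0 );
        }
        elsif ( $line =~ $INNER_RE ) {
            # Second expression matched: nest a section in the current chapter.
            my $title = xml_escape($1);
            $out .= "  </section>\n"         if $in_inner;
            $out .= qq{<chapter title="">\n} unless $in_outer;
            $out .= qq{  <section title="$title">\n};
            ( $in_outer, $in_inner ) = ( 1, 1 );
        }
        else {
            # Plain text: open untitled containers if none are open yet.
            $out .= qq{<chapter title="">\n}   unless $in_outer;
            $out .= qq{  <section title="">\n} unless $in_inner;
            ( $in_outer, $in_inner ) = ( 1, 1 );
            $out .= '    <p>' . xml_escape($line) . "</p>\n";
        }
    }
    $out .= "  </section>\n" if $in_inner;
    $out .= "</chapter>\n"   if $in_outer;
    return "<document>\n$out</document>\n";    # single root keeps the XML well-formed
}

print to_xml( 'Chapter One', 'Section A', 'first line', 'Chapter Two', 'loose line' );
```

Wrapping everything in one root element and escaping the text are additions the answer does not mention; without them the output is not guaranteed to be well-formed XML.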

Answer 2

+1 A:

This is the sort of job Perl was made for.

#! /opt/perl/bin/perl
use strict;
use warnings;
use 5.10.1;

{
  package My::Full;
  use Moose;
  use MooseX::Method::Signatures;

  has 'chapters' => (
    'is' => 'rw',
    'isa' => 'ArrayRef[My::Chapter]',
    'default' => sub{[]}
  );

  method add_chapter( Str $name ){
    my $chapter = My::Chapter->new( name => "$name" );
    push @{$self->chapters}, $chapter;
    return $chapter;
  }

  method latest(){
    return $self->add_chapter('') unless @{$self->chapters};
    return $self->chapters->[-1];
  }

  method add_section( Str $name ){
    my $latest_chapter = $self->latest;
    $latest_chapter->add_section("$name");
  }

  method add_line( Str $line ){
    $self->latest->add_line( "$line" );
  }

  method xml(){
    my $out = '';
    for my $chapter ( @{ $self->chapters } ){
      $out .= $chapter->xml;
    }
    return $out;
  }
}
{
  package My::Chapter;
  use Moose;
  use MooseX::Method::Signatures;

  has 'name' => (
    'is' => 'rw',
    'isa' => 'Str',
    'required' => 1
  );

  has 'sections' => (
    'is' => 'rw',
    'isa' => 'ArrayRef[My::Section]',
    'default' => sub{[]}
  );

  method latest(){
    return $self->add_section('') unless @{$self->sections};
    return $self->sections->[-1];
  }

  method add_section( Str $name ){
    my $section = My::Section->new(name => "$name");
    push @{$self->sections}, $section;
    return $section;
  }

  method add_line( Str $line ){
    $self->latest->add_line( "$line" );
  }

  method xml(){
    my $name = $self->name;
    $name = '???' unless length $name;

    my $out = qq'<div role="CHAPTER" title="$name">\n';
    for my $section ( @{ $self->sections } ){
      $out .= $section->xml;
    }
    return $out."</div>\n";
  }
}
{
  package My::Section;
  use Moose;
  use MooseX::Method::Signatures;

  has 'name' => (
    'is' => 'rw',
    'isa' => 'Str',
    'required' => 1
  );

  has 'lines' => (
    'is' => 'rw',
    'isa' => 'ArrayRef[Str]',
    'default' => sub{[]}
  );

  method add_line( Str $line ){
    push @{$self->lines}, "$line"
  }

  method xml(){
    my $name = $self->name;
    $name = '???' unless length $name;

    my $out = qq'  <div role="SECTION" title="$name">\n';
    for my $line ( @{ $self->lines } ){
      $out .= "    <p>$line</p>\n";
    }
    return $out."  </div>\n";
  }
}

The main loop:

my $full = My::Full->new;

while( my $line = <> ){
  chomp $line;

  given( $line ){
    when( /^chap(?:ter)?\s++(.+)/i ){
      $full->add_chapter($1);
    }
    when( /^sec(?:tion)?\s++(.+)/i ){
      $full->add_section($1);
    }
    default{
      $full->add_line($line);
    }
  }
}

print $full->xml;

Sample output:

<div role="CHAPTER" title="check">
  <div role="SECTION" title="check">
    <p>this is something</p>
    <p>this is another</p>
  </div>
  <div role="SECTION" title="check">
    <p>take some xxx</p>
    <p>do yyy</p>
    <p>and some...</p>
  </div>
</div>
<div role="CHAPTER" title="check">
  <div role="SECTION" title="???">
    <p>we created...</p>
  </div>
</div>

Brad Gilbert 2009-08-31 19:18:18

Thanks. This script is specific to one particular type of document and contains document-specific functions (such as add_chapter). I am looking for a solution where the code does not need recompiling and the document structure is described externally, in a declarative manner.

peter.murray.rust 2009-08-31 20:17:02
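
One way to read the request in the comment above is a generic engine driven by a rule table that lives outside the code. The following is a rough sketch of that idea in core Perl, not something proposed in the thread: the rule format (one "element-name regex" pair per line, outermost first), the names load_rules and convert, and the document and p element names are all assumptions. The rules sit in a heredoc here so the example is self-contained; in practice the same text would be read from a file.

```perl
use strict;
use warnings;

# Hypothetical rule format: "element-name <whitespace> regex", outermost first.
my $RULES_TEXT = <<'END_RULES';
chapter  ^chap(?:ter)?\s+(.+)
section  ^sec(?:tion)?\s+(.+)
END_RULES

# Escape the characters that are special in XML text and attribute values.
sub xml_escape {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Parse the rule text into a list of { name, re, depth } hashes.
sub load_rules {
    my ($text) = @_;
    my @rules;
    for my $row ( split /\n/, $text ) {
        next unless $row =~ /^([A-Za-z_]\w*)\s+(\S.*)$/;
        my ( $name, $pattern ) = ( $1, $2 );
        push @rules, { name => $name, re => qr/$pattern/i, depth => scalar @rules };
    }
    return @rules;
}

# Convert chomped input lines to XML using only the rule table.
sub convert {
    my ( $rules, @lines ) = @_;
    my @open;    # stack of currently open rules, outermost first
    my $out = '';

    # Close every open element at the given depth or deeper.
    my $close_to = sub {
        my ($depth) = @_;
        while ( @open && $open[-1]{depth} >= $depth ) {
            my $rule = pop @open;
            $out .= ( '  ' x $rule->{depth} ) . "</$rule->{name}>\n";
        }
    };

  LINE:
    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip blank lines
        for my $rule (@$rules) {
            next unless $line =~ $rule->{re};
            my $title = xml_escape( defined $1 ? $1 : '' );
            $close_to->( $rule->{depth} );
            $out .= ( '  ' x $rule->{depth} ) . qq{<$rule->{name} title="$title">\n};
            push @open, $rule;
            next LINE;
        }
        # No rule matched: the line is body text of the innermost open element.
        $out .= ( '  ' x scalar @open ) . '<p>' . xml_escape($line) . "</p>\n";
    }
    $close_to->(0);
    return "<document>\n$out</document>\n";
}

my @rules = load_rules($RULES_TEXT);
print convert( \@rules, 'Chapter 1', 'Sec intro', 'some text', 'Chapter 2', 'more text' );
```

Adding a third nesting level in this sketch means adding one line to the rule text; the Perl code itself stays unchanged.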

Answer 3

A:

I am fairly sure that the answer I am looking for is ANTLR (http://www.antlr.org/). It lets me write grammar rules of the form:

document : (chapter)+;
chapter : 'Chapter ' DIGIT NEWLINE line+;

and so on. It also allows code to be embedded in these rules.

peter.murray.rust 2009-10-31 10:53:47
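
To make concrete what the two grammar rules above accept, here is a hand-written recognizer for them in core Perl. It is only an illustration of the rules' meaning, not ANTLR output and not ANTLR's API. The thread does not define DIGIT or line, so this sketch assumes DIGIT is a single decimal digit and line is any line that does not itself start a chapter; parse_chapter and parse_document are hypothetical names.

```perl
use strict;
use warnings;

# chapter : 'Chapter ' DIGIT NEWLINE line+ ;
# Takes an array ref of chomped lines and consumes one chapter from the front.
sub parse_chapter {
    my ($lines) = @_;
    return unless @$lines && $lines->[0] =~ /^Chapter (\d)$/;
    my $number = $1;
    shift @$lines;

    # Assumption: a `line` is anything that is not a chapter heading.
    my @body;
    push @body, shift @$lines
        while @$lines && $lines->[0] !~ /^Chapter \d$/;
    die "chapter $number has no lines\n" unless @body;    # line+ needs at least one
    return { number => $number, lines => \@body };
}

# document : (chapter)+ ;
sub parse_document {
    my @lines = @_;
    my @chapters;
    while ( my $chapter = parse_chapter( \@lines ) ) {
        push @chapters, $chapter;
    }
    die "expected 'Chapter <digit>' at: $lines[0]\n" if @lines;
    die "document needs at least one chapter\n" unless @chapters;
    return \@chapters;
}

my $doc = parse_document( 'Chapter 1', 'first line', 'second line', 'Chapter 2', 'third line' );
printf "Chapter %s: %d line(s)\n", $_->{number}, scalar @{ $_->{lines} } for @$doc;
```

A grammar tool generates this kind of code from the declarative rules, which is the property asked for earlier in the thread: the document description changes, the engine does not.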


Parsing unstructured documents into XML
