views:

97

answers:

6

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).

I have to deal with files where multiple formats exist to hierarchically break down the data into meaningful sections. What I'd like to be able to do is parse the file data into a suitable data structure.

Here's an example (commentary on RHS):

                                       # | Format | Level | Comment
                                       # +--------+-------+---------
**DEVICE 109523.69142                  #        1       1   file-specific
  .981    561A                         #        2       1
10/MAY/2010    24.15.30,13.45.03       #        3       2   group of records
05:03:01   AB23X  15.67   101325.72    #        4       3   part of single record
*           14  31.30474 13        0   #        5       3   part of single record
05:03:15   CR22X  16.72   101325.42    #        4       3   new record
*           14  29.16264 11        0   #        5       3
06:23:51   AW41X  15.67    101323.9    #        4       3
*           14  31.26493219        0   #        5       3
11/MAY/2010    24.07.13,13.44.63       #        3       2   group of new records
15:57:14   AB23X  15.67   101327.23    #        4       3   part of single record
*           14  31.30474 13        0   #        5       3   part of single record
15:59:59   CR22X  16.72   101331.88    #        4       3   new record
*           14  29.16264 11        0   #        5

The logic I have at the moment is fragile:

  • I know, for instance, that a Format 2 always comes after a Format 1, and that they only span 2 lines.
  • I also know that Formats 4 and 5 always come in pairs as they correspond to a single record. The number of records may be variable
  • I'm using regular expressions to infer the format of each line. However, this is risky and does not lend to flexibility in the future (when someone decides to change the format of the output).

The big question here is about what strategies I can employ to determine which format needs to be used for which line. I'd be interested to know if others have faced similar situations and what they've done to address it.

A: 

I would keep an additional state in one or more variables and update it per row. Then you e. g. know if the last line was level 1, or if the last row was format 4 (and you can expect format 5), thus giving more security to your processing.

Frank
I'm doing that at the moment actually. What I'm more interested in is alternative ways to do this.
Zaid
+1  A: 

Depending what you want to do with this, it might be a good place to actually write a formal grammar, using Parse::RecDescent, for instance. This will allow you to feed the entire file to the parser, and get a datastructure out of it.

zigdon
+1 : This is *really* interesting! Well worth looking into. I'd love it if you could post an example (doesn't have to be for the example in the question).
Zaid
@Zaid: A line-separated grammar for Parse::RecDescent is not the easiest thing to write. I had to dive down deep into the code to get some `(s)` and `(s?)` type rules even close to working for a COBOL scanner I was writing.
Axeman
If you can require Perl 5.10, you should also look at [Regexp::Grammars](http://search.cpan.org/perldoc?Regexp::Grammars), which is kind of the successor to Parse::RecDescent.
cjm
+1  A: 

This sounds like the sort of thing a state machine is good at. One way to do a state machine in Perl is as an object, where each state is a method. The object gives you a place to store the structure you're building, and any intermediate state you need (like the filehandle you're reading from).

my $state = 'expect_fmt1';
while (defined $state) {
  $state = $object->$state();
}
...
sub expect_fmt1 {
  my $self = shift;
  # read format 1, parse it, store it in object
  return 'expect_fmt2';
}

Some thoughts on handling the cases where you have to look at the line before deciding what to do with it:

If the file is small enough, you could slurp it into an arrayref in the object. That makes it easy for a state to examine a line without removing it.

If the file is too big for easy slurping, you can have a method for reading the next line along with a cache in your object that allows you to put it back:

my get_line {
  my $self = shift;
  my $cache = $self->{line_cache};
  return shift @$cache if @$cache;
  return $self->{filehandle}->getline;
}
my unget_line { my $self = shift; unshift @{ $self->{line_cache} }, @_ }

Or, you could split the states that involve this decision into two states. The first state reads the line, stores it in $self->{current_line}, decides what format it is, and returns the state that parses & stores that format (which gets the line to parse from $self->{current_line}).

cjm
A: 

What I used to do in this case--if possible--is have a unique regex for each line. If format #2 follows 1 line of format #1, then you can apply regex #2 right after 1. But for the line following the first #2, you want to try either #2 or #3.

You could also have an alternation which combines #2 and #3:

my ( $cap2_1, $cap2_2, $cap3_1, $cap3_2 ) = $line =~ /$regex2|regex3/;

If #4 immediate follows 3, you'll want to apply regex #4 after #3, and regex #5. After that, because it can be either #3 or #4, you might want to repeat either the multiple match or the alternation with #3/#4.

while ( <> ) {
    given ( $state ) { 
         when ( 1 ) { my ( $device_num )  = m/$regex1/; $state++; }
         when ( 2 ) { my ( $cap1, $cap2 ) = m/$regex2/; $state++; }
         when ( 3 ) { 
             my ( $cap1, $cap2, $date, $nums ) = m/$regex2|$regex3/;
             $state += $cap1 ? 1 : 2;
         }
    }
}

That kind of gives you the gist of what you might want to do. Or see FSA::Rules for a state managing module.

Axeman
+4  A: 

Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.

From this, you can infer that @expect is a list of names of formats we expect to find on the current line.

For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:

This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.

This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is

fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
             (?{ $device->{attr1} = $1;
                 @expect = qw< fmt2 >;
               })
          /x,

After processing the input from your question, $device contains

{
  'attr1' => '109523.69142',
  'attr2' => '.981',
  'attr3' => '561A',
  'groups' => [
    {
      'date' => '10/MAY/2010',
      'nnn' => [ '24.15.30', '13.45.03' ],
      'records' => [
        [ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474',  '13', '0' ],
        [ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264',  '11', '0' ],
        [ '06:23:51', 'AW41X', '15.67', '101323.9',  '14', '31.264932', '19', '0' ],
      ],
    },
    {
      'date' => '11/MAY/2010',
      'nnn' => [ '24.07.13', '13.44.63' ],
      'records' => [
        [ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
        [ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
      ],
    }
  ],
}

I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:

Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.


For completeness, a working program demonstrating the result is below.

#! /usr/bin/perl

use warnings;
use strict;
use feature ':5.10';
use re 'eval';

*ARGV = *DATA;

my $device;
my $record;
my @expect = qw/ fmt1 /;
my %pattern;
%pattern = (
  fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
               (?{ $device->{attr1} = $1;
                   @expect = qw< fmt2 >;
                 })
            /x,

  fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
               (?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
                   @expect = qw< fmt3 >;
                 })
            /x,

  # e.g., 10/MAY/2010    24.15.30,13.45.03
  fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
               (?{ my($date,$nnns) = ($1,$2);
                   push @{ $device->{groups} } =>
                     { nnn  => [ split m|,| => $nnns ],
                       date => $date };
                   @expect = qw< fmt4 >;
                 })
            /x,

  # e.g., 05:03:01   AB23X  15.67   101325.72
  fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
               (\S+) \s+ (\S+) \s+ (\S+)
               \s*$
               (?{ push @{ $device->{groups}[-1]{records} } =>
                        [ $1, $2, $3, $4 ];
                   @expect = qw< fmt4 fmt5 >;
                 })
            /x,

  # e.g., *           14  31.30474 13        0
  fmt5 => qr/^\* \s+ (\d+) \s+
              # tricky: possibly no whitespace after 9-char float
              ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
              (\d+) \s+ (\d+)
              \s*$
              (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                        $1, $2, $3, $4;
                  @expect = qw< fmt4 fmt3 fmt2 >;
                })
            /x,
);

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;

__DATA__
**DEVICE 109523.69142
  .981    561A
10/MAY/2010    24.15.30,13.45.03
05:03:01   AB23X  15.67   101325.72
*           14  31.30474 13        0
05:03:15   CR22X  16.72   101325.42
*           14  29.16264 11        0
06:23:51   AW41X  15.67    101323.9
*           14  31.26493219        0
11/MAY/2010    24.07.13,13.44.63
15:57:14   AB23X  15.67   101327.23
*           14  31.30474 13        0
15:59:59   CR22X  16.72   101331.88
*           14  29.16264 11        0
Greg Bacon
+1 for a thorough example. Is it necessary to array-reference the dereferenced map transformation (`@{[ map $pattern{$_}, @expect]}`)? Isn't the `map { $pattern{$_} } @expect` itself sufficient?
Zaid
@Zaid Using an unadorned `map` causes the program to die complaining that `Argument "**DEVICE 109523.69142\n" isn't numeric in smart match …`
Greg Bacon
+2  A: 

This is a good question. Two suggestions occur to me.

(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used its several times and have been happy with the results in most cases.

(2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.

First an observation about your data: if a set of formats always occurs as a pair, it effectively represent a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.

And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.

In unbreak_records.pl.

Omitting the she-bang and use strict/warnings.

while (<>){
    chomp;
    print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";

In add_record_types.pl

while (<>){
    next unless /\S/;
    my $rt = /^\*/ ?   1 :
             /^..\// ? 2 : 3;
    print $rt, ' ', $_;
}

At the command line.

./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat

Output:

1 **DEVICE 109523.69142   .981    561A
2 10/MAY/2010    24.15.30,13.45.03
3 05:03:01   AB23X  15.67   101325.72 *           14  31.30474 13        0
3 05:03:15   CR22X  16.72   101325.42 *           14  29.16264 11        0
3 06:23:51   AW41X  15.67    101323.9 *           14  31.26493219        0
2 11/MAY/2010    24.07.13,13.44.63
3 15:57:14   AB23X  15.67   101327.23 *           14  31.30474 13        0
3 15:59:59   CR22X  16.72   101331.88 *           14  29.16264 11        0

The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.

FM