ansaurus

Question

Strategies to handle a file with multiple fixed formats

Answer 1

A:

I would keep an additional state in one or more variables and update it per row. Then you e. g. know if the last line was level 1, or if the last row was format 4 (and you can expect format 5), thus giving more security to your processing.

Frank 2010-09-03 19:37:55

I'm doing that at the moment actually. What I'm more interested in is alternative ways to do this.

Zaid 2010-09-03 19:43:20

Answer 2

+1 A:

Depending what you want to do with this, it might be a good place to actually write a formal grammar, using Parse::RecDescent, for instance. This will allow you to feed the entire file to the parser, and get a datastructure out of it.

zigdon 2010-09-03 19:43:36

+1 : This is *really* interesting! Well worth looking into. I'd love it if you could post an example (doesn't have to be for the example in the question).

Zaid 2010-09-03 19:49:14

@Zaid: A line-separated grammar for Parse::RecDescent is not the easiest thing to write. I had to dive down deep into the code to get some `(s)` and `(s?)` type rules even close to working for a COBOL scanner I was writing.

Axeman 2010-09-03 20:00:23

If you can require Perl 5.10, you should also look at [Regexp::Grammars](http://search.cpan.org/perldoc?Regexp::Grammars), which is kind of the successor to Parse::RecDescent.

cjm 2010-09-03 20:00:25

Answer 3

+1 A:

This sounds like the sort of thing a state machine is good at. One way to do a state machine in Perl is as an object, where each state is a method. The object gives you a place to store the structure you're building, and any intermediate state you need (like the filehandle you're reading from).

my $state = 'expect_fmt1';
while (defined $state) {
  $state = $object->$state();
}
...
sub expect_fmt1 {
  my $self = shift;
  # read format 1, parse it, store it in object
  return 'expect_fmt2';
}

Some thoughts on handling the cases where you have to look at the line before deciding what to do with it:

If the file is small enough, you could slurp it into an arrayref in the object. That makes it easy for a state to examine a line without removing it.

If the file is too big for easy slurping, you can have a method for reading the next line along with a cache in your object that allows you to put it back:

my get_line {
  my $self = shift;
  my $cache = $self->{line_cache};
  return shift @$cache if @$cache;
  return $self->{filehandle}->getline;
}
my unget_line { my $self = shift; unshift @{ $self->{line_cache} }, @_ }

Or, you could split the states that involve this decision into two states. The first state reads the line, stores it in $self->{current_line}, decides what format it is, and returns the state that parses & stores that format (which gets the line to parse from $self->{current_line}).

cjm 2010-09-03 19:44:31

Answer 4

A:

What I used to do in this case--if possible--is have a unique regex for each line. If format #2 follows 1 line of format #1, then you can apply regex #2 right after 1. But for the line following the first #2, you want to try either #2 or #3.

You could also have an alternation which combines #2 and #3:

my ( $cap2_1, $cap2_2, $cap3_1, $cap3_2 ) = $line =~ /$regex2|regex3/;

If #4 immediate follows 3, you'll want to apply regex #4 after #3, and regex #5. After that, because it can be either #3 or #4, you might want to repeat either the multiple match or the alternation with #3/#4.

while ( <> ) {
    given ( $state ) { 
         when ( 1 ) { my ( $device_num )  = m/$regex1/; $state++; }
         when ( 2 ) { my ( $cap1, $cap2 ) = m/$regex2/; $state++; }
         when ( 3 ) { 
             my ( $cap1, $cap2, $date, $nums ) = m/$regex2|$regex3/;
             $state += $cap1 ? 1 : 2;
         }
    }
}

That kind of gives you the gist of what you might want to do. Or see FSA::Rules for a state managing module.

Axeman 2010-09-03 19:46:37

Answer 5

+4 A:

Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.

From this, you can infer that @expect is a list of names of formats we expect to find on the current line.

For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:

This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.

This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is

fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
             (?{ $device->{attr1} = $1;
                 @expect = qw< fmt2 >;
               })
          /x,

After processing the input from your question, $device contains

{
  'attr1' => '109523.69142',
  'attr2' => '.981',
  'attr3' => '561A',
  'groups' => [
    {
      'date' => '10/MAY/2010',
      'nnn' => [ '24.15.30', '13.45.03' ],
      'records' => [
        [ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474',  '13', '0' ],
        [ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264',  '11', '0' ],
        [ '06:23:51', 'AW41X', '15.67', '101323.9',  '14', '31.264932', '19', '0' ],
      ],
    },
    {
      'date' => '11/MAY/2010',
      'nnn' => [ '24.07.13', '13.44.63' ],
      'records' => [
        [ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
        [ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
      ],
    }
  ],
}

I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:

Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.

For completeness, a working program demonstrating the result is below.

#! /usr/bin/perl

use warnings;
use strict;
use feature ':5.10';
use re 'eval';

*ARGV = *DATA;

my $device;
my $record;
my @expect = qw/ fmt1 /;
my %pattern;
%pattern = (
  fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
               (?{ $device->{attr1} = $1;
                   @expect = qw< fmt2 >;
                 })
            /x,

  fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
               (?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
                   @expect = qw< fmt3 >;
                 })
            /x,

  # e.g., 10/MAY/2010    24.15.30,13.45.03
  fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
               (?{ my($date,$nnns) = ($1,$2);
                   push @{ $device->{groups} } =>
                     { nnn  => [ split m|,| => $nnns ],
                       date => $date };
                   @expect = qw< fmt4 >;
                 })
            /x,

  # e.g., 05:03:01   AB23X  15.67   101325.72
  fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
               (\S+) \s+ (\S+) \s+ (\S+)
               \s*$
               (?{ push @{ $device->{groups}[-1]{records} } =>
                        [ $1, $2, $3, $4 ];
                   @expect = qw< fmt4 fmt5 >;
                 })
            /x,

  # e.g., *           14  31.30474 13        0
  fmt5 => qr/^\* \s+ (\d+) \s+
              # tricky: possibly no whitespace after 9-char float
              ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
              (\d+) \s+ (\d+)
              \s*$
              (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                        $1, $2, $3, $4;
                  @expect = qw< fmt4 fmt3 fmt2 >;
                })
            /x,
);

while (<>) {
  given($_) {
    when (@{[ map $pattern{$_}, @expect]}) {}
    default {
      die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
    }
  }
}

use Data::Dumper;
$Data::Dumper::Terse = $Data::Dumper::Indent = 1;
print Dumper $device;

__DATA__
**DEVICE 109523.69142
  .981    561A
10/MAY/2010    24.15.30,13.45.03
05:03:01   AB23X  15.67   101325.72
*           14  31.30474 13        0
05:03:15   CR22X  16.72   101325.42
*           14  29.16264 11        0
06:23:51   AW41X  15.67    101323.9
*           14  31.26493219        0
11/MAY/2010    24.07.13,13.44.63
15:57:14   AB23X  15.67   101327.23
*           14  31.30474 13        0
15:59:59   CR22X  16.72   101331.88
*           14  29.16264 11        0

Greg Bacon 2010-09-03 21:27:17

+1 for a thorough example. Is it necessary to array-reference the dereferenced map transformation (`@{[ map $pattern{$_}, @expect]}`)? Isn't the `map { $pattern{$_} } @expect` itself sufficient?

Zaid 2010-09-04 12:57:10

@Zaid Using an unadorned `map` causes the program to die complaining that `Argument "**DEVICE 109523.69142\n" isn't numeric in smart match …`

Greg Bacon 2010-09-06 12:44:53

Answer 6

+2 A:

This is a good question. Two suggestions occur to me.

(1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used its several times and have been happy with the results in most cases.

(2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.

First an observation about your data: if a set of formats always occurs as a pair, it effectively represent a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.

And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.

In unbreak_records.pl.

Omitting the she-bang and use strict/warnings.

while (<>){
    chomp;
    print /^\*?\s/ ? ' ' : "\n", $_;
}
print "\n";

In add_record_types.pl

while (<>){
    next unless /\S/;
    my $rt = /^\*/ ?   1 :
             /^..\// ? 2 : 3;
    print $rt, ' ', $_;
}

At the command line.

./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat

Output:

1 **DEVICE 109523.69142   .981    561A
2 10/MAY/2010    24.15.30,13.45.03
3 05:03:01   AB23X  15.67   101325.72 *           14  31.30474 13        0
3 05:03:15   CR22X  16.72   101325.42 *           14  29.16264 11        0
3 06:23:51   AW41X  15.67    101323.9 *           14  31.26493219        0
2 11/MAY/2010    24.07.13,13.44.63
3 15:57:14   AB23X  15.67   101327.23 *           14  31.30474 13        0
3 15:59:59   CR22X  16.72   101331.88 *           14  29.16264 11        0

The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.

FM 2010-09-04 14:14:35

ansaurus

tags:

views:

answers:

Strategies to handle a file with multiple fixed formats

In unbreak_records.pl.

In add_record_types.pl

At the command line.

Output:

related questions