ansaurus

Question

How can I extract columns from a fixed-width format in Perl?

Answer 1

A:

You can split the line. It appears that your delimiter is just whitespace, so you can do something along the lines of:

@line = split(" ", $line);

This splits on any run of whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

split can also take a regular expression rather than a string as the delimiter:

@line = split(/\s+/, $line);

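This does nearly the same thing. The one difference is that split(" ", ...) skips leading whitespace, whereas /\s+/ produces an empty first field when the line starts with whitespace.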
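
The difference between the two delimiters is easy to check. This is a minimal sketch; the sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample line with leading whitespace
my $line = "  alpha beta  gamma";

# A single-space string is special-cased: leading whitespace is skipped
my @awk_mode = split ' ', $line;

# A /\s+/ pattern produces an empty first field for the leading whitespace
my @regex_mode = split /\s+/, $line;

print scalar(@awk_mode), " fields: @awk_mode\n";
print scalar(@regex_mode), " fields: @regex_mode\n";
```

Neither form can tell a blank column from ordinary padding, which matters for the data in this question.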

FModa3 2009-09-29 20:08:40

I think he is talking about fixed width encoding.

Byron Whitlock 2009-09-29 20:11:12

Used this method - works great, output:

Time: 1253592000 Livereporter: Span: Bcreporter:
Time: 1253678400 Livereporter: 86400 Span: 6183.000000 Bcreporter:
Time: 1253764800 Livereporter: 86400 Span: 4486.000000 Bcreporter:
Time: 1253851200 Livereporter: 36.000000 Span: 86400 Bcreporter: 10669.000000
Time: 1253937600 Livereporter: 0.000000 Span: 86400 Bcreporter: 9126.000000
Time: 1254024000 Livereporter: 0.000000 Span: 86400 Bcreporter: 2930.000000
Time: 1254110400 Livereporter: 0.000000 Span: 86400 Bcreporter: 2895.000000
Time: 1254196800 Livereporter: 0.000000 Span: 8828.000000

Greg 2009-09-29 20:36:33

You can't split on whitespace because some fields are empty. You lose the column order when you do this.

brian d foy 2009-09-29 23:29:42
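
The column shift described in the comment above can be reproduced with one row of the sample data. This is only a sketch: the row is rebuilt with sprintf so the spacing is exact, and the widths 10, 10, 12 and 28 are assumed from the unpack answer in this thread.

```perl
use strict;
use warnings;

# One row in the sample layout where the second column is blank
my $line = sprintf '%-10s%10s%12s%28s', '1253678400', '', '86400', '6183.000000';

# Whitespace splitting collapses the blank column: only three fields come back
my @by_split = split ' ', $line;

# A fixed-width template keeps four fields; the blank one is an empty string
my @by_unpack = unpack 'A10 A10 A12 A28', $line;

print scalar(@by_split), " fields from split\n";
print scalar(@by_unpack), " fields from unpack\n";
print "second column is empty\n" if $by_unpack[1] eq '';
```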

I'm going to test out the unpack solution in a few minutes. - Thanks!

Greg 2009-09-30 12:37:04

Answer 2

A:

Fixed width delimiting can be done like this:

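my %header;
$header{field1} = 0;    # char position of first char in field
$header{field2} = 12;
$header{field3} = 15;

while (my $line = <IN>) {    # IN is assumed to be an already-opened filehandle
    chomp $line;             # chomp edits $line in place; its return value is a count, not the text
    # substr takes (string, offset, length), so the length of field2 is the gap to field3
    print substr($line, $header{field2}, $header{field3} - $header{field2}), "\n";    # value of field2
}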

My Perl is very rusty, so I am sure there are syntax errors in there, but that is the gist of it.
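
If the layout is known only as a list of start positions, those positions can be turned into an unpack template instead of calling substr once per field. A rough sketch, reusing the offsets 0, 12 and 15 from above; the sample line, the variable names, and the assumption that the last field runs to the end of the line are made up for illustration:

```perl
use strict;
use warnings;

# Start offsets from the answer above; the sample line is hypothetical
my @starts = (0, 12, 15);

# Each width is the distance to the next start; the last field takes the rest
my @widths   = map { $starts[$_ + 1] - $starts[$_] } 0 .. $#starts - 1;
my $template = join ' ', (map { 'A' . $_ } @widths), 'A*';

my $line   = 'first field ab the remainder';
my @fields = unpack $template, $line;

print "$template\n";
print "[$_]\n" for @fields;
```

Building the template once outside the read loop keeps the per-line work down to a single unpack call.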

Byron Whitlock 2009-09-29 20:10:33

Why are you chomping like that? And what do you think that prints? See the documentation for chomp for a clue. Not to be too mean about it, but if you're guessing and showing something you've never tried or even run, you should wait for a more experienced person to answer.

brian d foy 2009-09-29 23:28:54

Answer 3

A:

If all fields have the same fixed width and are formatted with spaces, you can use the following split:

@array = split / {1,N}/, $line;

where N is the width of the field, written as a literal number in the pattern. This will yield a space for each empty field.
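
For the equal-width case, a different technique than the split above is a repeated unpack group, which keeps blank fields in position. A minimal sketch, assuming a made-up width of 5 and a made-up three-field line:

```perl
use strict;
use warnings;

my $width = 5;    # hypothetical field width

# Three 5-character fields; the middle one is blank
my $line = sprintf '%-5s%-5s%-5s', 'ab', '', 'cd';

# "(A5)*" repeats the 5-character field until the string is used up
my @fields = unpack "(A$width)*", $line;

print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;
```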

Pavel Shved 2009-09-29 20:18:29

I don't think that does what you think it does. There are two major errors in that one line: one in logic and one in syntax.

brian d foy 2009-09-29 23:27:13

@brian d foy: thank you, fixed. Sorry for a low-quality answer. Anyway, `unpack` solution is way better.

Pavel Shved 2009-09-30 03:50:24

Answer 4

+11 A:

This example illustrates how to parse the line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack, if you use upper-case template letters (A10 etc.), trailing whitespace will be stripped for you. Note: as brian d foy points out, the split approach does not work well when fields are missing (for example, in the second line of data), because the field position information is lost; unpack is the way to go here, unless we are misunderstanding your data.

use strict;
use warnings;

while (my $line = <DATA>){
    chomp $line;
    my @fields_whitespace = split m'\s+', $line;
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
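
The difference between the lower-case and upper-case template letters can be seen on a single row. This is a sketch rather than part of the original answer: the row is rebuilt with sprintf to match the layout above, and because the upper-case form strips trailing padding only, the right-aligned columns still need their leading spaces removed.

```perl
use strict;
use warnings;

# Same layout as the sample data: widths 10, 10, 12 and 28
my $line = sprintf '%-10s%10s%12s%28s',
    '1253851200', '36.000000', '86400', '10669.000000';

my @raw     = unpack 'a10 a10 a12 a28', $line;    # keeps all padding
my @trimmed = unpack 'A10 A10 A12 A28', $line;    # strips trailing padding only

# These columns are right-aligned, so remove leading spaces explicitly
s/^\s+// for @trimmed;

print "[$_]\n" for @raw;
print "[$_]\n" for @trimmed;
```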

FM 2009-09-29 20:31:07

+1 for unpack, given the layout of the sample data

Hobo 2009-09-29 21:09:43

Everyone forgets that Perl has pack, but it's really handy and I should use it more myself. I was just editing that chapter for the next edition of Effective Perl Programming. :)

brian d foy 2009-09-29 23:26:07

`split m'\s+'` would highlight better.

Brad Gilbert 2009-09-29 23:56:03

According to the perldoc - "The string is broken into chunks described by the TEMPLATE." These chunks are inserted into the @fields_fixed array, correct?

Greg 2009-09-30 13:53:37

while ($line = <TEMPFILE>){
    if ($x < 2) {
        $x++;
    } else {
        chomp $line;
        @fields_whitespace = split m'\s+', $line;
        @fields_fixed = unpack('a10 a10 a12 a28', $line);
        print @fields_fixed, "\n";
        $x++;
    }
}

That's what I have - I cannot access @fields_fixed outside the block - am I missing some basic idea of programming that I should remember? I know and understand scope, but am confused in this case.

Greg 2009-09-30 14:19:23
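
A common way to keep parsed rows available after the loop is to declare a container before the loop and push each row onto it. The following is a minimal sketch, not a diagnosis of the code above: it assumes in-memory stand-in lines instead of TEMPFILE, and the names @lines, @rows and $seen are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for the file contents: two header lines, then data (made up)
my @lines = (
    "header one\n",
    "header two\n",
    sprintf("%-10s%10s%12s%28s\n", '1253678400', '', '86400', '6183.000000'),
    sprintf("%-10s%10s%12s%28s\n", '1254196800', '0.000000', '', '8828.000000'),
);

my @rows;        # declared outside the loop, so it is still visible afterwards
my $seen = 0;

for my $line (@lines) {
    next if $seen++ < 2;    # skip the first two lines
    chomp $line;
    # Store a reference to a fresh array for each row
    push @rows, [ unpack 'A10 A10 A12 A28', $line ];
}

print scalar(@rows), " rows kept\n";
print "first timestamp: $rows[0][0]\n";
```

Assigning to a single array inside the loop would leave only the last row behind, whereas pushing array references keeps every row.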

Answer 5

A:

I'm unsure of the column names and formatting, but you should be able to adjust this recipe to your liking using Text::FixedWidth:

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

Mike Wade 2009-09-30 13:35:08

Answer 6

+1 A:

Use my module DataExtract::FixedWidth. It is the most full-featured and well-tested module for working with fixed-width columns in Perl. If this isn't fast enough, you can pass in an unpack_string and eliminate the need for heuristic detection of boundaries.

#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

Alternatively, if you want header info:

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

Evan Carroll 2010-07-15 03:40:21

I've used this module in the past and the column detection is slick.

Demosthenex 2010-07-15 08:04:34
