I'm writing a Perl script to run through and grab various data elements such as:

1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000 
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

I can grab each line of this text file no problem.

I have working regex to grab each of those fields. Once I have the line in a variable, i.e. $line - how can I grab each of those fields and place them into their own variables even though they have different delimiters?

A: 

You can split the line. It appears that your delimiter is just whitespace? You can do something on the order of:

@line = split(" ", $line);

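This splits on any run of whitespace and ignores leading whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

split can also take a regular expression rather than a string as the delimiter:

@line = split(/\s+/, $line);

This does nearly the same thing; the difference is that with /\s+/ leading whitespace on the line produces an empty first field.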
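
A small sketch of how the two forms behave, using two lines modelled on the question's data. The variable names are made up for illustration, and the spacing of the sample strings is an assumption; only the presence or absence of the second column matters here.

```perl
use strict;
use warnings;

# Hypothetical sample lines, modelled on the data in the question.
my $full    = "1253851200  36.000000      86400                10669.000000";
my $missing = "1253678400                 86400                 6183.000000";

# Every column populated: four fields come back.
my @full = split ' ', $full;

# Second column empty: the remaining values shift left, so 86400
# ends up in $missing[1] and the column position is lost.
my @missing = split ' ', $missing;

# The two forms differ only in how they treat leading whitespace.
my @special = split ' ',   "  a b";   # leading whitespace is skipped
my @regex   = split /\s+/, "  a b";   # empty string comes back as the first field

print scalar(@full),    " fields: @full\n";
print scalar(@missing), " fields: @missing\n";
print scalar(@special), " vs ", scalar(@regex), " fields\n";
```

So splitting on whitespace is fine when every column is always populated, but it cannot tell you which column a value came from once one is blank.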

FModa3
I think he is talking about fixed width encoding.
Byron Whitlock
Used this method - works great, output:
Time: 1253592000 Livereporter: Span: Bcreporter:
Time: 1253678400 Livereporter: 86400 Span: 6183.000000 Bcreporter:
Time: 1253764800 Livereporter: 86400 Span: 4486.000000 Bcreporter:
Time: 1253851200 Livereporter: 36.000000 Span: 86400 Bcreporter: 10669.000000
Time: 1253937600 Livereporter: 0.000000 Span: 86400 Bcreporter: 9126.000000
Time: 1254024000 Livereporter: 0.000000 Span: 86400 Bcreporter: 2930.000000
Time: 1254110400 Livereporter: 0.000000 Span: 86400 Bcreporter: 2895.000000
Time: 1254196800 Livereporter: 0.000000 Span: 8828.000000
Greg
You can't split on whitespace because some fields are empty. You lose the column order when you do this.
brian d foy
I'm going to test out the unpack solution in a few minutes. - Thanks!
Greg
A: 

Fixed width delimiting can be done like this:

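my %header;
$header{field1} = 0;   # char position of first char in field
$header{field2} = 12;
$header{field3} = 15;

while (my $line = <IN>) {
    chomp $line;   # chomp modifies $line in place; its return value is not the string
    # substr takes (string, offset, length), so the length is the distance to the next field
    print substr($line, $header{field2}, $header{field3} - $header{field2}), "\n";   # value of field2
}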

My Perl is very rusty, so there may be errors there, but that is the gist of it.

Byron Whitlock
Why are you chomping like that? And what do you think that prints? See the documentation for chomp for a clue. Not to be too mean about it, but if you're guessing and showing something you've never tried or even run, you should wait for a more experienced person to answer.
brian d foy
A: 

If all fields have the same fixed width and are formatted with spaces, you can use the following split:

@array = split / {1,N}/, $line;

where N is the width of the field. This will yield a space for each empty field.

Pavel Shved
I don't think that does what you think it does. There are two major errors in that one line: one in logic and one in syntax.
brian d foy
@brian d foy: thank you, fixed. Sorry for the low-quality answer. Anyway, the `unpack` solution is way better.
Pavel Shved
+11  A: 

This example illustrates how to parse the line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack, if you use upper-case (A10 etc.), trailing whitespace will be stripped for you; the lower-case a used below keeps the padding. Note: as brian d foy points out, the split approach does not work well when fields are missing (for example, the second line of data), because the field position information is lost; unpack is the way to go here, unless we are misunderstanding your data.

use strict;
use warnings;

while (my $line = <DATA>){
    chomp $line;
    my @fields_whitespace = split m'\s+', $line;
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
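
To make the a versus A difference concrete, here is a minimal sketch on one line that has an empty second column. It assumes the same 10/10/12/28 column widths as the template above; the line is rebuilt with sprintf so the padding is unambiguous, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# One sample line with an empty second column, rebuilt to the
# assumed 10/10/12/28 layout so the spacing is exact.
my $line = sprintf '%-10s%10s%12s%28s',
    '1253678400', '', '86400', '6183.000000';

# 'a' returns each chunk exactly as it appears, padding included.
my @raw = unpack 'a10 a10 a12 a28', $line;

# 'A' strips trailing spaces, so the blank column becomes an empty
# string. Leading padding on right-aligned values is still present.
my @stripped = unpack 'A10 A10 A12 A28', $line;

# Trim the leading padding too, working on copies of each field.
my @clean = map { my $f = $_; $f =~ s/^\s+//; $f } @stripped;

print join('|', @clean), "\n";
```

The empty column stays in position as an empty string, which is what the split approach cannot give you.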
FM
+1 for unpack, given the layout of the sample data
Hobo
Everyone forgets that Perl has pack, but it's really handy and I should use it more myself. I was just editing that chapter for the next edition of Effective Perl Programming. :)
brian d foy
`split m'\s+'` would highlight better.
Brad Gilbert
According to the perldoc - "The string is broken into chunks described by the TEMPLATE." These chunks are inserted into the @fields_fixed array, correct?
Greg
while ($line = <TEMPFILE>){
    if ($x < 2) {
        $x++;
    } else {
        chomp $line;
        @fields_whitespace = split m'\s+', $line;
        @fields_fixed = unpack('a10 a10 a12 a28', $line);
        print @fields_fixed, "\n";
        $x++;
    }
}
That's what I have - I cannot access @fields_fixed outside the block - am I missing some basic idea of programming that I should remember? I know and understand scope - but am confused in this case.
Greg
A: 

I'm unsure of the column names and formatting, but you should be able to adjust this recipe to your liking using Text::FixedWidth:

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
Mike Wade
+1  A: 

Use my module DataExtract::FixedWidth. It is the most full-featured and well-tested module for working with fixed-width columns in Perl. If this isn't fast enough, you can pass in an unpack_string and eliminate the need for heuristic detection of boundaries.

#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

    --alternatively if you want header info--

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
Evan Carroll
I've used this module in the past and the column detection is slick.
Demosthenex