ansaurus

Question

How can I break apart fixed-width columns in Perl?

Answer 1

+4 A:

use strict;
use warnings;

# this puts each line in the array @lines
my @lines = <DATA>; # <DATA> is a special filehandle that treats
                    # everything after __END__ as if it was a file
                    # It's handy for testing things

# Iterate over the array of lines and for each iteration
# put that line into the variable $line
foreach my $line (@lines) {
   chomp $line;   # remove the trailing newline so it does not end up in $col4

   # Use split to 'split' each $line with the regular expression /\s+/
   # /\s+/ means match one or more whitespace characters.
   # The 4 limits the result to four fields: whitespace after the third
   # separator is no longer treated as a separator and stays in $col4
   my ($col1, $col2, $col3, $col4) = split(/\s+/, $line, 4);

   # here you can do whatever you need to with the data
   # in the columns. I just print them out
   print "$col1, $col2, $col3, $col4 \n";
}


__END__
darren.local           1987    A      Sentece1
darren.local           1996    C      Sentece2
darren.local           1991    E      Sentece3
darren.local           1954    G      Sentece4
darren.local           1998    H      Sentece5

Nifle 2009-12-12 16:23:26

If you are going to use split in this case, use the third argument to limit the number of elements it returns. If that last column has significant whitespace, you'll lose part of the data.

brian d foy 2009-12-12 17:38:30

Good point, edited

Nifle 2009-12-12 18:34:31

Answer 2

A:

For each line of text something like this:

my ($domain, $year, $grade, @text) = split /\s+/, $line;

I use an array for the sentence since it's not clear whether the sentence at the end will have spaces or not. You can then join the @text array into a new string if necessary. If the sentences at the end are not going to have spaces, then you can turn @text into $text.
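As a rough sketch of the two options (the sample line and the variable names $joined and $rest are made up for illustration), the following compares rejoining @text with passing a limit to split. Rejoining collapses runs of whitespace inside the sentence, while the limit leaves the last field untouched:

```perl
use strict;
use warnings;

# Hypothetical sample line; the last column contains spaces.
my $line = "darren.local           1987    A      A sentence  with spaces";

# Approach from this answer: collect the remaining words, then rejoin them.
# Runs of whitespace inside the sentence collapse to single spaces.
my ($domain, $year, $grade, @text) = split /\s+/, $line;
my $joined = join ' ', @text;

# Alternative: a limit of 4 leaves the fourth field exactly as it was.
my (undef, undef, undef, $rest) = split /\s+/, $line, 4;

print "joined: '$joined'\n";
print "limit:  '$rest'\n";
```

Which one is preferable depends on whether the original spacing inside the sentence matters.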

Jeremy Wall 2009-12-12 16:23:51

If you are going to use split in this case, use the third argument to limit the number of elements it returns. If that last column has significant whitespace, you'll lose part of the data.

brian d foy 2009-12-12 17:39:03

Answer 3

+3 A:

Assuming that the text is put into a single variable $info, then you can split it into separate lines using the intrinsic perl split function:

my @lines = split("\n", $info);

where @lines is an array of your lines. The "\n" is the regex for a newline. You can loop through each line as follows:

foreach my $line (@lines) {
   # do something with $line....
}

You can then split each line on whitespace (regex \s+, where the \s is one whitespace character, and the + means 1 or more times):

my @fields = split(/\s+/, $line);

and you can then access each field directly via its array index: $fields[0], $fields[1] etc.

or, you can do:

my ($var1, $var2, $var3, $var4) = split(/\s+/, $line);

which will put the fields in each line into separate named variables.

Now - if you want to sort your lines by the character in the third column, you could do this:

my @lines = split("\n", $info); 
my @arr = ();    # declare new array

foreach (@lines) {
   my @fields = split(/\s+/, $_);
   push(@arr, \@fields);   # add @fields REFERENCE to @arr
}

Now you have an "array of arrays". This can easily be sorted as follows:

my @sorted = sort { $a->[2] cmp $b->[2] } @arr;   # cmp compares strings; <=> is for numbers

which will sort @arr by the 3rd element (index 2) of @fields.

Edit 2 To put lines with the same third column into their own variables, do this:

my %hash = ();             # declare new hash

foreach my $line (@arr) {  # loop through lines (each is an array reference)
  my @fields = @$line;     # dereference the field array

  my $el = $fields[2];     # get our key - the character in the third column
  my $text = join(" ", @fields);    # rebuild the line from its fields

  my $val;
  if (exists $hash{$el}) {          # check if key already in hash
     $val = $hash{$el} . "\n" . $text;   # append new line to the current value
  } else {
     $val = $text;
  }
  $hash{$el} = $val;       # put the new value (back) into the hash
}

Now you have a hash keyed with the third column characters, with the value for each key being the lines that contain that key. You can then loop through the hash and print out or otherwise use the hash values.
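A variation on the same idea, sketched under assumptions: instead of appending to one string per key, each key can hold an array reference of the original lines, which keeps the lines intact and makes them easy to count. The sample data and the name %by_letter are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input in the same layout as the question's data.
my $info = join "\n",
    "darren.local           1996    C      Sentence2",
    "darren.local           1987    A      Sentence1",
    "darren.local           1991    C      Sentence3";

my %by_letter;   # letter => array reference of original lines
foreach my $line (split /\n/, $info) {
    my @fields = split /\s+/, $line, 4;   # limit keeps the last column whole
    push @{ $by_letter{ $fields[2] } }, $line;
}

# Walk the groups in alphabetical order of their key.
foreach my $letter (sort keys %by_letter) {
    my $count = scalar @{ $by_letter{$letter} };
    print "$letter ($count):\n";
    print "  $_\n" for @{ $by_letter{$letter} };
}
```

Because the values are arrays, there is no need to test whether a key already exists; push creates the array on first use.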

Richard 2009-12-12 16:24:05

If you are going to use split in this case, use the third argument to limit the number of elements it returns. If that last column has significant whitespace, you'll lose part of the data.

brian d foy 2009-12-12 17:39:42

Thanks Richard -- each line needs to be grouped by the capitalized letters. Depending on the output of that query I could have as many as 20 lines or as little as 2 lines. Lines with "C" need to go into a variable, lines with "B" need to go into their own variable, etc. Will that work?

scraft3613 2009-12-12 17:40:21

using the sort function in my answer above, your array will be sorted alphanumerically. so "A"s will appear first, "B"s next and so on. If you want to put all "A" lines into a single variable, there is (like any programming problem) a number of possibilities. You could use a keyed hash/map, with the characters "A" etc as your key, with the value being either a) an array of lines or b) a single single to which you append subsequent lines as you find them. See <a href="http://www.cs.mcgill.ca/~abatko/computers/programming/perl/howto/hash/">here</a> for a tutorial on using hashes.

Richard 2009-12-12 17:57:11

my link got scrambled - it's http://www.cs.mcgill.ca/~abatko/computers/programming/perl/howto/hash/

Richard 2009-12-12 17:58:25

edit: I meant b) a single String to which you can append... (haven't yet worked out how to edit comments)

Richard 2009-12-12 18:00:25

@scraft: I have added an example on how to use a hash.

Richard 2009-12-12 18:20:25

@Richard: you don't edit comments (that sucks), but you can delete them and re-comment like I do. :)

brian d foy 2009-12-12 18:28:35

thank you richard, that was very helpful

scraft3613 2009-12-12 18:56:25

@scraft - my pleasure.

Richard 2009-12-12 19:15:10

Answer 4

+10 A:

I like using unpack for this sort of thing. It's fast, flexible, and reversible.

You just need to know the positions for each column, and unpack can automatically trim the extra whitespace from each column.

If you change something in one of the columns, it's easy to go back to the original format by repacking with the same format:

my $format = 'A23 A8 A7 A*';

my @grades;

while( <DATA> ) {
    # A* strips trailing whitespace, including the newline
    my( $machine, $year, $letter, $sentence ) =
        unpack( $format, $_ );

    # save the original line too, which might be useful later
    push @grades, [ $machine, $year, $letter, $sentence, $_ ];
    }

my @sorted = sort { $a->[2] cmp $b->[2] } @grades;

foreach my $tuple ( @sorted ) {
    print $tuple->[-1];
    }

# go the other way, especially if you changed things
foreach my $tuple ( @sorted ) {
    print pack( $format, @$tuple[0..3] ), "\n";
    }

__END__
darren.local           1987    A      Sentence1
darren.local           1996    C      Sentence2
darren.local           1991    E      Sentence3
darren.local           1954    G      Sentence4
darren.local           1998    H      Sentence5

Now, there's an additional consideration. It sounds like you might have this big chunk of multi-line text in a single variable. Handle this as you would a file by opening a filehandle on a reference to the scalar. The filehandle stuff takes care of the rest:

 my $lines = '...multiline string...';

 open my($fh), '<', \ $lines;

 while( <$fh> ) {
      ... same as before ...
      }
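
For completeness, a minimal self-contained sketch of reading from an in-memory string, assuming the same format as above. The contents of $lines and the @letters array are made up for illustration:

```perl
use strict;
use warnings;

my $format = 'A23 A8 A7 A*';

# Hypothetical multi-line string standing in for the real data.
my $lines = "darren.local           1996    C      Sentence2\n"
          . "darren.local           1987    A      Sentence1\n";

# Open a read-only filehandle on a reference to the scalar.
open my $fh, '<', \$lines or die "Cannot open in-memory file: $!";

my @letters;
while ( my $line = <$fh> ) {
    my ( $machine, $year, $letter, $sentence ) = unpack $format, $line;
    push @letters, $letter;
    print "$letter: $sentence\n";
}
close $fh;
```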

brian d foy 2009-12-12 17:22:24

A format of `'A23 A8 A7 A*'` would also work.

Brad Gilbert 2009-12-12 17:39:13

A nice example of readable Perl ... (even to a once-every-two-years-user)

ldigas 2009-12-12 17:42:13

I'm not sure which format you saw because I made a mistake in the first one I posted, but we ended up at the same format.

brian d foy 2009-12-12 17:42:56

-1: Yes, it's fast, but it's also a lot more to code than when using split, therefore more effort and more error-prone. Unless there really is a lot of data to extract, this looks like premature optimization to me.

Adrian Grigore 2009-12-12 18:44:16

Adrian, I'm sorry that you're upset that I didn't like your answer, but you'd be hard pressed to explain how a single call to unpack is a lot more code than a single call to split. unpack is much more flexible. I'm calling sour grapes here.

brian d foy 2009-12-12 18:46:48

Please, let's not turn this into something personal. Have a look at Nifle's answer, it's obvious that his approach takes much less code than yours, even after the edit he made to fix the bug you spotted.

Adrian Grigore 2009-12-12 18:52:26

Adrian: get a clue. The difference between Nifle's answer and mine has nothing to do with split or unpack. His answer is shorter because his answer doesn't do anything. It breaks apart the lines and just prints them again. I break apart the lines, sort them, print them, and also threw in an example of repacking them. Stop being an ass because you suggested a solution in which everyone wrote buggy code and I posted the only correct answer.

brian d foy 2009-12-12 18:58:15

+1 Good illustration of `unpack`, a frequently overlooked tool. Very minor detail: if perfect reversibility is needed, you want to use `a*` rather than `A*`. The latter will remove trailing whitespace, which might be undesirable (for example, if the sentences differ in length but the users of the data do not want jagged records on the reverse trip).
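
A small sketch of that difference, assuming a record whose last column carries three trailing spaces (the record itself and the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Build a hypothetical record with trailing spaces in the last column.
my $record = sprintf '%-23s%-8s%-7s%s',
    'darren.local', '1987', 'A', 'Sentence1   ';

# A* strips trailing whitespace from the last field; a* keeps it.
my @stripped = unpack 'A23 A8 A7 A*', $record;
my @kept     = unpack 'A23 A8 A7 a*', $record;

my $from_stripped = pack 'A23 A8 A7 A*', @stripped;
my $from_kept     = pack 'A23 A8 A7 a*', @kept;

print "A* round trip identical: ", ( $from_stripped eq $record ? "yes" : "no" ), "\n";
print "a* round trip identical: ", ( $from_kept     eq $record ? "yes" : "no" ), "\n";
```

With A* the repacked record comes out shorter than the original, because the trailing spaces were dropped during unpack.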

FM 2009-12-12 19:28:38

@FM: a* is probably better. It's the reason the format is in a variable: you change it one place when you want to do something different.

brian d foy 2009-12-12 19:40:47

Could I have it explained in stupid person terms what this line is doing?: my $format = 'A23 A8 A7 A*';

scraft3613 2009-12-12 21:06:46

@scraft3613: That's the pack format string. There's no magic in that one line. See the pack documentation, which I linked to in the answer.

brian d foy 2009-12-12 21:13:48
