ansaurus

Question

How can I extract the columns of data with Perl?

Answer 1

+6 A:

It's going to depend on whether those are fixed length fields, or if they are tab separated. The easiest (using split) is if they are tab separated.

my ($name1, $name2, $deptName, $position) = split("\t", $string);

If they're fixed length, and assuming they are all, say, 10 characters long, you can parse it like

my ($name1, $name2, $deptName, $position) = unpack("A10 A10 A10 A10", $string);
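To make the two cases concrete, here is a minimal sketch that runs both approaches side by side. The sample values and the 10-character field width are assumptions for illustration only; the asker's real layout may differ.

```perl
use strict;
use warnings;

# Hypothetical sample records; the field values are made up for illustration.
my $tab_line   = join "\t", 'JONH MILLER', 'ROBERT JIM', 'CS', 'ASST GENERAL MANAGER';
my $fixed_line = sprintf '%-10s%-10s%-10s%-10s', 'JONH', 'ROBERT', 'CS', 'MANAGER';

# Tab-separated: split on the tab character; single spaces inside a field survive.
my @by_tab = split /\t/, $tab_line;

# Fixed width: each "A10" takes 10 characters and strips trailing spaces.
my @by_width = unpack 'A10 A10 A10 A10', $fixed_line;

print join('|', @by_tab),   "\n";
print join('|', @by_width), "\n";
```

Either way you need to know the layout up front: the delimiter for split, or the column widths for unpack.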

Paul Tomblin 2010-08-23 17:42:27

They are not of fixed length.

Sunny 2010-08-23 17:50:05

@Sunny, then how are you going to determine where one field ends and the next begins, seeing as how some of the fields have spaces in them? Either you need to delimit them with a specific character like tab, or you need to put them in specific places. In the first case, you use split, in the second you use unpack.

Paul Tomblin 2010-08-23 17:59:18

Thanks, Paul. When I try to vote, it says "Vote Up requires 15 reputation".

Sunny 2010-08-23 18:57:22

@Sunny, well how about accepting an answer to your first question?

Paul Tomblin 2010-08-23 19:18:19

Answer 2

A:

To split on whitespace:

@string_parts = split /\s{2,}/, $string;

This will split $string into a list of substrings. The separator is the regex \s{2,}, which matches two or more consecutive whitespace characters. Whitespace here includes spaces, tabs, and newlines.

Edit: I see that one of the requirements is not to split on only one space, but to split on two or more. I modified the regex accordingly.
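A small sketch of the difference between the two patterns, assuming the input looks like the sample line from this thread (single spaces inside a field, runs of spaces between fields):

```perl
use strict;
use warnings;

# Sample line modeled on the data in this thread: fields are separated by
# runs of spaces, while single spaces occur inside a field.
my $line = 'JONH MILLER        ROBERT JIM     CS                 ASST GENERAL MANAGER';

my @one_or_more = split /\s+/,    $line;   # also splits on the single spaces inside names
my @two_or_more = split /\s{2,}/, $line;   # keeps "JONH MILLER" together

printf "%d fields with \\s+\n",    scalar @one_or_more;
printf "%d fields with \\s{2,}\n", scalar @two_or_more;
print "$_\n" for @two_or_more;
```

Note that this only works as long as neighbouring columns are always separated by at least two whitespace characters.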

Nathan Fellman 2010-08-23 17:54:19

This solution will split the string into "JONH", "MILLER", but that is a single name, so it should be JONH MILLER. That means the solution is not correct.

Nikhil Jain 2010-08-23 18:17:02

@Nikhil: Good point. But you could do something like `@string_parts = split /\s\s+|\t\s*/, $string` to split on multiple spaces, or one tab and possibly other space characters.

Platinum Azure 2010-08-23 18:25:07

@Platinum: That's true; I am doing exactly the same thing in my answer.

Nikhil Jain 2010-08-23 18:33:35

Answer 3

+2 A:

If your input data comes in as an array of strings (@strings), this

for my $s (@strings) {
   my $output = join ' ',
                map /^\s*(.+)\s*$/ ? $1 : (),
                unpack('A19 A15 x19 A*', $s);
   print "$output\n"
}

would extract and trim the information needed.

NAME1 | NAME2 | POSITION

and

JONH MILLER | ROBERT JIM | ASST GENERAL MANAGER

(The '|' characters were added by me to make the result clearer.)

Regards

rbo

rubber boots 2010-08-23 18:22:50

Unpack is a great tool for this, and we cover almost this same example in _Effective Perl Programming_. I'd like to have an entire pack chapter in the next book :)

brian d foy 2010-08-23 21:40:46

@brian, "The Book" looks promising. I'd love to have a chapter on advanced regular expressions (something like a contemporary version of japhy's Regex Arcana: http://japhy.perlmonk.org/articles/tpj/2004-summer.html). Furthermore, the first edition of the old "Advanced Perl Programming" (by Srinivasan) covered some very interesting advanced topics (Perl guts, embedding, hands-on XS, and eval) which were left out of the second edition (by Simon Cozens). Such (more technical) advanced topics aren't part of any current books I know of. (BTW: I ordered the 2nd edition of E.P.P. yesterday.)

rubber boots 2010-08-24 20:11:17

For Perl guts, get _Extending and Embedding Perl_. Some of the interesting parts of _Advanced Perl Programming, 1st Edition_ were the basis for _Mastering Perl_. For fancy regex stuff, _Mastering Regular Expressions_. _Mastering Perl_ has some fancy regex stuff too, as does _Effective Perl Programming_. Maybe you just need to read more books. Remember, though, that all this stuff is also in the docs, so you don't need to buy a book.

brian d foy 2010-08-24 21:09:01

Answer 4

A:

Consider using autosplit in a Perl one-liner from your command line:

$ perl -F'/\s{2,}/' -ane 'print qq/@F[0,1,3]\n/' file

The one-liner will split on two or more consecutive spaces and print the first, second and fourth fields, corresponding to NAME1, NAME2 and POSITION fields.

Of course, this will break if you have only a single space separating NAME1 and NAME2 entries, but more information is needed about your file in order to ascertain what the best course of action might be.

Zaid 2010-08-23 18:29:41

Any reason for the downvote?

Zaid 2010-08-24 06:27:04

Answer 5

+1 A:

Assuming the spacing between the fields is not fixed, split the string on two or more whitespace characters, so that a name like JONH MILLER is not broken into two parts.

#!/usr/bin/perl
use strict;
use warnings;
my $string = "NAME1              NAME2          DEPTNAME           POSITION
             JONH MILLER        ROBERT JIM     CS                 ASST GENERAL MANAGER ";
my @string_parts = split /\s\s+/, $string;
foreach my $test (@string_parts){  
      print"$test\n";
}

Nikhil Jain 2010-08-23 18:32:20

Answer 6

+1 A:

From the sample there, a single space belongs in the data, but 2 or more contiguous spaces do not. So you can easily split on 2 or more spaces. The only thing I'd add to this is the use of List::MoreUtils::mesh:

use List::MoreUtils qw<mesh>;
my @names   = map { chomp; $_ } split /\s{2,}/, <$file>;
my @records = map { chomp; +{ mesh( @names, @{[ split /\s{2,}/ ]} ) } } <$file>;
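If List::MoreUtils is not installed, roughly the same result can be sketched with core Perl only, using a hash slice to pair column names with values. The in-memory filehandle and the sample data below are assumptions so the snippet runs on its own; in practice $file would be a handle on the real input.

```perl
use strict;
use warnings;

# Sample input modeled on the data in this thread, read through an
# in-memory filehandle so the sketch is self-contained.
my $data = <<'END';
NAME1              NAME2          DEPTNAME           POSITION
JONH MILLER        ROBERT JIM     CS                 ASST GENERAL MANAGER
END
open my $file, '<', \$data or die "Cannot open in-memory file: $!";

# First line: column names.
chomp(my $header = <$file>);
my @names = split /\s{2,}/, $header;

# Remaining lines: one hash per record, keyed by column name.
my @records;
while (my $line = <$file>) {
    chomp $line;
    my %record;
    @record{@names} = split /\s{2,}/, $line;   # hash slice pairs names with values
    push @records, \%record;
}

print "$records[0]{NAME1} | $records[0]{POSITION}\n";
```

Each element of @records is then a hash reference, so a column can be looked up by name instead of by position.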

Axeman 2010-08-23 19:31:30
