ansaurus

Question

How can I split a line when some fields contain spaces?

Answer 1

+1 A:

It looks to me like the first four columns and the last five columns are always present, and the 5th and 6th columns (PROF2) are optional.

So split the line as you were attempting, pull the first four and last five elements off the resulting array, and whatever remains is your 5th and 6th columns.

If, however, either the PROF1 or the PROF2 entry can be missing, you're stuck: your file format is ambiguous.
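A minimal sketch of that approach, assuming whitespace-separated tokens where the first four are the date, the session and a two-word PROF1, and the last five are a two-token course, the section, the grade and the count. The `parse_line` name, the hash keys and the sample line are invented for illustration and are not taken from the original data.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line and peel the fixed columns off both ends.
sub parse_line {
    my ($line) = @_;
    my @tokens = split ' ', $line;
    return if @tokens < 9;    # not enough tokens for the fixed columns

    # Four tokens at the front: date, session, two-word PROF1.
    my ( $date, $session, @prof1 ) = splice @tokens, 0, 4;

    # Five tokens at the back: two-token course, section, grade, count.
    my ( $code, $num, $sec, $grade, $count ) = splice @tokens, -5;

    # Whatever is left in the middle is the optional PROF2.
    return {
        date    => $date,
        session => $session,
        prof1   => join( ' ', @prof1 ),
        prof2   => ( @tokens ? join( ' ', @tokens ) : undef ),
        course  => "$code $num",
        sec     => $sec,
        grade   => $grade,
        count   => $count,
    };
}

# Made-up sample line, for illustration only.
my $rec = parse_line('2010/10 1 JOHN SMITH JANE DOE DACSB 101 A 3.5 20');
print "$rec->{prof1} | $rec->{prof2} | $rec->{course}\n";
```

Because the middle is taken as a whole, a PROF2 with more than two words still ends up in one field; a PROF1 with more than two words would not be handled.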

Paul 2010-10-21 10:35:52

soulSurfer2010 2010-10-21 10:38:43

From your examples it seems PROF1 is always present, with two names, and PROF2 (if present) has two names. It seems unlikely that PROF2's surname is TANIA, to use the example in your first line :). For the course name, do you know what the letters will be? If it's always DACSB you can fix up the optional space with a bit of regexp or just some heuristic code when extracting the fields after the split, i.e. if you find just numbers when you look for the COURSE, then assume a space was present and prefix that with the field before (which will probably be "DACSB").
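That heuristic might look roughly like the sketch below. It assumes the course sits fourth from the end of the split fields (before section, grade and count, as in the other answers); `fix_course` is a hypothetical helper name and the sample tokens are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: re-join a course code that was split by an optional space.
sub fix_course {
    my @fields = @_;

    # If the course slot holds only digits, the alpha prefix must be the
    # token just before it, so merge the two into a single field.
    if ( @fields >= 5 && $fields[-4] =~ /^\d+$/ ) {
        splice @fields, -5, 2, "$fields[-5]$fields[-4]";
    }
    return @fields;
}

# Made-up sample tokens, for illustration only.
my @fixed = fix_course(qw(2010/10 1 JOHN SMITH DACSB 101 A 3.5 20));
print "$fixed[-4]\n";
```

After this step the course is always a single token, so the remaining fields can be counted from the end without worrying about the optional space.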

Paul 2010-10-21 10:45:36

I managed to solve this with the CAM::PDF module, which doesn't omit the \t. Exporting the PDF to txt turned each \t into \s, which caused the issue. Anyway, thanks.

soulSurfer2010 2010-10-21 10:47:48

Cool. Preserving tabs is definitely going to help here!
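With tabs preserved, each column can be taken directly from a split on the tab character, and names may contain any number of spaces. The sketch below assumes one tab between columns and an empty PROF2 column when there is no second professor; the sample line and variable names are made up, and the CAM::PDF extraction step itself is not shown.

```perl
use strict;
use warnings;

# Made-up tab-separated record; PROF2 is present but empty here.
my $line = "2010/10\t1\tJOHN VAN DYKE\t\tDACSB 101\tA\t3.5\t20\n";
chomp $line;

# A limit of -1 keeps trailing empty fields instead of dropping them.
my ( $date, $session, $prof1, $prof2, $course, $sec, $grade, $count )
    = split /\t/, $line, -1;

print "PROF1: $prof1\n";
print "PROF2: ", ( length $prof2 ? $prof2 : '(none)' ), "\n";
print "COURSE: $course\n";
```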

Paul 2010-10-21 10:53:33

Answer 2

+2 A:

I believe it is ambiguous:

If PROF1 can contain spaces, how do you know where it ends and where PROF2 begins? What if PROF2 also contains a space? Or three spaces?

You probably can't even tell yourself, and if you can it's because you can tell the difference between a first-name and a surname.

If you're on Linux/Unix, try running pdftotext on the PDF; it might give you better results.

Øyvind Skaar 2010-10-21 10:49:18

Answer 3

+2 A:

The problem is complex, but it's not unsolvable. It seems to me that course will always contain a space between the alpha code and the numeric code and that the prof names will also always contain a space. But then you're pretty much screwed if somebody has a two-part last name like "VAN DYKE".

A regex would describe this record:

my $record_exp
    = qr{ ^ \s*
          (\d{4}/\d{2}) # yyyy/mm date
          \s+
          (\d+)         # one or more digits
          \s+
          (\S+ \s \S+)  # non-space cluster, single space, non-space cluster
          \s+
          # same as the last, but possibly absent; the separating spaces are
          # included in the optional group so that the next rule starts
          # right after it.
          (?:(\S+ \s \S+)\s+)?
          # a cluster of alpha, single space, cluster of digits
          (\p{Alpha}+ \s \d+)
          \s+    # one or more spaces
          (\S+)  # one or more non-space characters
          \s+    # ditto
          (\S+)
          \s+
          (\S+)
        }x;

Which makes the loop a lot easier:

while ( <$input> ) { 
    my @fields = m{$record_exp};
    # ... list of semantic actions here...
}
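As one possible shape for those semantic actions, the sketch below maps the eight captures onto named hash keys and leaves `prof2` undefined when the optional group does not match. To keep it runnable on its own, the pattern is repeated in compact form; `parse_record`, the key names and the sample line are illustrative assumptions.

```perl
use strict;
use warnings;

# Compact copy of the record pattern, with one capture per column.
my $record_exp = qr{ ^ \s*
    (\d{4}/\d{2}) \s+          # date
    (\d+) \s+                  # session
    (\S+ \s \S+) \s+           # PROF1
    (?: (\S+ \s \S+) \s+ )?    # optional PROF2
    (\p{Alpha}+ \s \d+) \s+    # course
    (\S+) \s+ (\S+) \s+ (\S+)  # sec, grade, count
}x;

# Hypothetical helper: return a hash ref, or nothing if the line does not match.
sub parse_record {
    my ($line) = @_;
    my @fields = $line =~ $record_exp or return;
    my %rec;
    @rec{qw(date session prof1 prof2 course sec grade count)} = @fields;
    return \%rec;
}

# Made-up sample line, for illustration only.
my $rec = parse_record("2010/10 1 JOHN SMITH DACSB 101 A 3.5 20\n");
print defined $rec->{prof2} ? "two profs\n" : "one prof: $rec->{prof1}\n";
```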

But you could also store it into structures, knowing that the only variable part of the data is the profs:

use strict;
use warnings;
my @records;
<$input>; # bleed the first line
while ( <$input> ) { 
    my @fields         = split; # split on white-space
    my $record         = { date => shift @fields };
    $record->{session} = shift @fields;
    $record->{profs}   = [ join( ' ', splice( @fields, 0, 2 )) ];
    while ( @fields > 5 ) { 
        push @{ $record->{profs} }, join( ' ', splice( @fields, 0, 2 ));
    }
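    # splice in scalar context returns only the last element removed,
    # so join the two course tokens explicitly.
    $record->{course} = join( ' ', splice( @fields, 0, 2 ));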
    @$record{ qw<sec grade count> } = @fields;
    push @records, $record;
}

Axeman 2010-10-21 14:04:00

That's a great regex and it works, but I found out that sometimes the PROF1 and PROF2 first and last names can also contain spaces, so I have yet to successfully modify that regex.

soulSurfer2010 2010-10-22 05:50:15

Answer 4

+1 A:

There is nothing that says you must use only a single regex. You can go prune off bits of your line in chunks if that makes it easier to handle the weird parts.
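A rough sketch of pruning the line in chunks, assuming the columns named in the other answers (date, session, professors, course, section, grade, count). `parse_in_chunks` and the sample line are hypothetical. It copes with multi-word names such as "VAN DYKE", but it cannot tell where PROF1 ends and PROF2 begins.

```perl
use strict;
use warnings;

# Hypothetical helper: strip known pieces off each end, keep the middle.
sub parse_in_chunks {
    my ($line) = @_;
    my %rec;

    # 1. Peel the date and session off the front.
    $line =~ s{^\s*(\d{4}/\d{2})\s+(\d+)\s+}{} or return;
    @rec{qw(date session)} = ( $1, $2 );

    # 2. Peel the course (space optional), section, grade and count off the back.
    $line =~ s/\s+(\p{Alpha}+\s?\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$// or return;
    @rec{qw(course sec grade count)} = ( $1, $2, $3, $4 );

    # 3. Whatever is left is the professor block, spaces and all.
    $rec{profs} = $line;
    return \%rec;
}

# Made-up sample line, for illustration only.
my $rec = parse_in_chunks("2010/10 1 DICK VAN DYKE DACSB 101 A 3.5 20\n");
print "$rec->{profs}\n";
```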

Andy Lester 2010-10-21 14:24:48

Answer 5

+1 A:

I would probably still use split(), but then access the data thusly:

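my @values = split ' ', $string;   # ' ' also skips leading whitespace
my $date   = $values[0];
my $sess   = $values[1];
my $count  = $values[-1];
my $grade  = $values[-2];
my $sec    = $values[-3];
my $course = $values[-4];
# The slice stops at $#values-5, so $values[-5] ends up in neither @profs
# nor $course. If the course code is split in two, $values[-5] holds its
# alpha part; if the course is a single token, use $#values-4 instead.
my @profs  = @values[2..($#values-5)];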

With this construct you don't have to worry about how many profs you have. Even if you have none, the other values will all work fine (and you'll get an empty array for your profs).

CanSpice 2010-10-21 17:13:13

You can also shift/pop values: $date=shift @values; $sess=shift @values; $count=pop @values; etc. Once those are done, you are left with the stuff in the middle in @values.

Andy Lester 2010-10-22 19:32:36
