I have a text file that I extracted from a PDF file. It's arranged in a tabular format; this is part of it:

 DATE SESS PROF1 PROF2 COURSE SEC GRADE COUNT 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A 3 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A- 2 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B 4 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B+ 2 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B- 1 

 2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 WU 1 

 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 

 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 

 2007/09 1 FUENTES TANIA DACSB 06500 002 A 3 

 2007/09 1 FUENTES TANIA DACSB 06500 002 A- 8 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 B 5 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 B+ 3 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 B- 1 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 C 1 

 2007/09 1 FUENTES ALEXA DACSB 06500 002 C+ 1 

 2007/09 1 LIGGINS FREDER DACSB 06500 003 A 1

The first line holds the column names, and the remaining lines are the data. There are 8 columns that I want to extract. At first it seemed very easy: split each line I read with split(/\s+/, ...). But then I noticed that some lines contain additional space-separated values, for example:

2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1

As you can see, the data for a certain column is sometimes optional.

+1  A: 

Looks to me like the first four whitespace-separated fields (date, session, and the two names of PROF1) and the last five fields (the two parts of the course, then section, grade and count) are always present, and the 5th and 6th fields (PROF2) are optional.

So split the line as you were attempting, pull the first four and last five elements off the resulting array, and whatever remains is your optional PROF2.

If, however, either the PROF1 or the PROF2 entry can be missing, you're stuck: your file format is ambiguous.
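
The split-and-peel idea above could be sketched roughly as follows. The helper name parse_line and the hash keys are made up for illustration, and the sketch assumes every data line has at least nine whitespace-separated tokens (four fixed at the front, five fixed at the back):

```perl
use strict;
use warnings;

# Hypothetical helper: split one data line into named fields, assuming the
# first four and last five whitespace-separated tokens are always present.
sub parse_line {
    my ($line) = @_;
    my @tokens = split ' ', $line;    # ' ' also discards leading whitespace
    return if @tokens < 9;            # blank, header-less or malformed line

    # Front: date, session, and the two PROF1 name tokens.
    my ( $date, $sess, @prof1 ) = splice( @tokens, 0, 4 );

    # Back: course code, course number, section, grade, count.
    my ( $code, $number, $sec, $grade, $count ) = splice( @tokens, -5 );

    return {
        date   => $date,
        sess   => $sess,
        prof1  => join( ' ', @prof1 ),
        prof2  => join( ' ', @tokens ),    # empty string when there is no PROF2
        course => "$code $number",
        sec    => $sec,
        grade  => $grade,
        count  => $count,
    };
}

my $rec = parse_line(' 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 ');
print join( '|', @{$rec}{qw(date sess prof1 prof2 course sec grade count)} ), "\n";
# prints: 2007/09|1|NOOB ADRIENNE|JOSH ROGER|DBIOM 10000|125|C+|1
```

Note that the header line also has nine tokens, so it would need to be skipped before calling the helper.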

Paul
soulSurfer2010
From your examples it seems PROF1 is always present, with two names, and PROF2 (if present) also has two names. It seems unlikely that PROF2's surname is TANIA, to use the example in your first line :). As for the course name, do you know what the letters will be? If it's always DACSB you can fix up the optional space with a bit of regex, or just some heuristic code when extracting the fields after the split, i.e. if you find just numbers when you look for the COURSE, then assume a space was present and prefix it with the field before (which will probably be "DACSB").
Paul
I managed to solve this with the CAM::PDF module, which doesn't drop the \t characters. Exporting PDF --> txt turned each \t into \s, which caused the issue. Anyway, thanks.
soulSurfer2010
Cool. Preserving tabs is definitely going to help here!
Paul
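
Once the tabs survive, each column can be taken directly from a tab split. A minimal sketch, assuming (this is a guess about the extracted text, not something shown in the thread) that every column is separated by exactly one \t and that a missing PROF2 shows up as an empty field; the sample line is made up:

```perl
use strict;
use warnings;

my @columns = qw(DATE SESS PROF1 PROF2 COURSE SEC GRADE COUNT);

# Hypothetical tab-delimited line; the real extracted text may differ.
my $line = "2007/09\t1\tRODRIGUEZ TANIA\t\tDACSB 06500\t001\tA\t3\n";

chomp $line;

# Interior empty fields (such as a missing PROF2) are kept by split anyway;
# the -1 limit additionally keeps any trailing empty fields.
my @fields = split /\t/, $line, -1;

# Map column names onto the values with a hash slice.
my %record;
@record{@columns} = @fields;

print "$_ = '$record{$_}'\n" for @columns;
```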
+2  A: 

I believe it is ambiguous:

If PROF1 can contain spaces, how do you know where it ends and where PROF2 begins? What if PROF2 also contains a space? Or 3 spaces?

You probably can't even tell yourself, and if you can it's because you can tell the difference between a first-name and a surname.

If you're on Linux/Unix, try running pdftotext on the PDF; it might give you better results.
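
To make the ambiguity concrete, here is a small sketch that lists every way the name tokens of one line could be divided between PROF1 and PROF2. The helper name candidate_splits is invented for illustration, and it assumes only that PROF1 has at least one token:

```perl
use strict;
use warnings;

# Return every way to cut the name tokens into PROF1 (at least one token)
# and PROF2 (possibly empty), as [prof1, prof2] pairs.
sub candidate_splits {
    my @names = @_;
    my @splits;
    for my $n ( 1 .. scalar @names ) {
        push @splits, [
            join( ' ', @names[ 0 .. $n - 1 ] ),
            join( ' ', @names[ $n .. $#names ] ),
        ];
    }
    return @splits;
}

for my $split ( candidate_splits(qw(NOOB ADRIENNE JOSH ROGER)) ) {
    print "PROF1 = '$split->[0]', PROF2 = '$split->[1]'\n";
}
```

Four name tokens give four candidates, and nothing in the text itself says which one is right.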

Øyvind Skaar
+2  A: 

The problem is complex, but it's not unsolvable. It seems to me that course will always contain a space between the alpha code and the numeric code and that the prof names will also always contain a space. But then you're pretty much screwed if somebody has a two-part last name like "VAN DYKE".

A regex would describe this record:

my $record_exp
    = qr{ ^ \s*
          (\d{4}/\d{2}) # yyyy/mm date
          \s+
          (\d+)         # any number of digits
          \s+
          (\S+ \s \S+) # non-space cluster, single space, non-space cluster
          \s+
          # same as the last, possibly not there; separating spaces are included
          # in the conditional, because we have to make sure it will start
          # right at the next rule.
          (?:(\S+ \s \S+)\s+)?  
          # a cluster of alpha, single space, cluster of digits
          (\p{Alpha}+ \s \d+)   
          \s+    # any number of spaces           
          (\S+)  # any number of non-space
          \s+    # ditto..  
          (\S+)  
          \s+    
          (\S+)  
        }x;

Which makes the loop a lot easier:

while ( <$input> ) { 
    my @fields = m{$record_exp};
    # ... list of semantic actions here...
}

But you could also store it into structures, knowing that the only variable part of the data is the profs:

use strict;
use warnings;
my @records;
<$input>; # bleed the first line (the column headers)
while ( <$input> ) { 
    my @fields         = split; # split on white-space
    next unless @fields;        # skip blank lines
    my $record         = { date => shift @fields };
    $record->{session} = shift @fields;
    $record->{profs}   = [ join( ' ', splice( @fields, 0, 2 )) ];
    while ( @fields > 5 ) { 
        push @{ $record->{profs} }, join( ' ', splice( @fields, 0, 2 ));
    }
    # splice in scalar context returns only the last element removed,
    # so join both course tokens explicitly
    $record->{course} = join( ' ', splice( @fields, 0, 2 ));
    @$record{ qw<sec grade count> } = @fields;
    push @records, $record;
}
Axeman
That's a great regex and it works, but I found out that sometimes the PROF1 and PROF2 last and first names can also contain spaces, so I have yet to successfully modify that regex.
soulSurfer2010
+1  A: 

There is nothing that says you must use only a single regex. You can go prune off bits of your line in chunks if that makes it easier to handle the weird parts.
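
As a rough sketch of that idea: peel the fixed-format head off the front with one substitution, peel the fixed-format tail off the end with another, and treat whatever is left as the professor names. The sub name parse_in_chunks and the hash keys are made up here, and the sketch assumes the course is always letters, whitespace, then digits, and that at least one professor name is present:

```perl
use strict;
use warnings;

# Hypothetical helper: consume a line in chunks instead of with one big regex.
sub parse_in_chunks {
    my ($line) = @_;
    my %rec;

    # 1. Peel the head (yyyy/mm date and session number) off the front.
    $line =~ s{^\s*(\d{4}/\d{2})\s+(\d+)\s+}{} or return;
    @rec{qw(date sess)} = ( $1, $2 );

    # 2. Peel the tail (course, section, grade, count) off the end.
    $line =~ s{\s+(\p{Alpha}+\s+\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$}{} or return;
    @rec{qw(course sec grade count)} = ( $1, $2, $3, $4 );

    # 3. Whatever is left in the middle is the professor name(s), unsplit.
    $rec{profs} = $line;
    return \%rec;
}

my $rec = parse_in_chunks(' 2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1 ');
print "$rec->{profs} teaches $rec->{course}\n";
# prints: NOOB ADRIENNE JOSH ROGER teaches DBIOM 10000
```

This keeps multi-part names intact in the profs string, but it does not decide where PROF1 ends and PROF2 begins; that still needs extra knowledge about the names.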

Andy Lester
+1  A: 

I would probably still use split(), but then access the data thusly:

my @values = split ' ', $string;   # ' ' also skips leading whitespace
my $date = $values[0];
my $sess = $values[1];
my $count = $values[-1];
my $grade = $values[-2];
my $sec = $values[-3];
my $course = join ' ', @values[-5, -4];   # course code plus course number
my @profs = @values[2..($#values-5)];

With this construct you don't have to worry about how many profs you have. Even if you have none, the other values will all work fine (and you'll get an empty array for your profs).

CanSpice
You can also shift/pop values: $date=shift @values; $sess=shift @values; $count=pop @values; etc. Once you have those done, you are left with the stuff in the middle in @values.
Andy Lester