I have the following regex:

my $scores_compiled_regex  = qr{^0
                                  \s+
                                  (\p{Alpha}+\d*)
                                  \s+
                                  (\d+
                                  \s*
                                   \p{Alpha}*)
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}                              
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s{2,}
                                   (\d+)?
                                   \s+
                                   \d+ #$
                                   }xos;

It should match lines like these (from a plain txt file):

0            AAS  211    1   1       5       2   6                                                                         15

While the column names are:

0 INST, NAME             A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS

and it means: score A = 1, score A- = 1, no score for B+, score B = 5, etc. I'm trying to split each line into a list without ignoring the empty columns. It works, but both the matching and the splitting are very slow: more than 5 seconds, sometimes even more!

The first few lines of the file look like this:

0 PALMER, JAN            A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS   TOTAL
0            ECON 103   98      35 114   1  14  75           9      35               1          10       1                     

The Scores are anything that follows the A column to the right.

Any ideas? Thanks.

+4  A: 

If the format you must accept is really as loose as the format your regex currently does accept, you have a big problem: If one or more of the numeric fields is missing, and if there is more than one occurrence of 4 spaces in a row, then it's ambiguous which score corresponds to which column.

Perl's backtracking will resolve the ambiguity by returning the first match it finds (each greedy quantifier tries its longest option first), but (a) this isn't necessarily what you want and (b) the number of possibilities it may need to try grows exponentially with the number of numeric fields missing from the line, hence the slowness.

To illustrate, let's use a simpler regex:

/\A(\d+)?\s{2,}
   (\d+)?\s{2,}
   (\d+)?\s{2,}
   (\d+)?\z/xs;

And suppose the input is:

123    456    789

(There are four spaces between each number.) Now, should 456 be the second or the third field returned? Both are valid matches. In this case Perl's backtracking will make it the second field, but I doubt you really want to rely on Perl's backtracking to decide this.
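
A minimal sketch to check this yourself, using the simplified regex and input above (the variable names are only illustrative):

```perl
use strict;
use warnings;

# Four spaces between each number, as in the example above.
my $input = "123    456    789";

# In list context the match returns the four capture groups.
my @fields = $input =~ /\A(\d+)?\s{2,}
                          (\d+)?\s{2,}
                          (\d+)?\s{2,}
                          (\d+)?\z/xs;

# Optional groups that did not take part in the match come back as undef.
my @shown = map { defined $_ ? $_ : '(empty)' } @fields;
print join(' | ', @shown), "\n";
# prints: 123 | 456 | (empty) | 789
```

The regex engine had to try and discard several other ways of dividing up the spaces before settling on this one; with two dozen optional fields, that search is where the time goes.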

Suggestion: If at all possible, replace each \s{2,} with a fixed-size space-matching regex. If you only allow it to be variable-sized because the numbers are lined up in columns and may be 1 or 2 digits wide, then just use substr() to grab from known column offsets instead of a regex. (A regex with variable-width separators cannot parse fixed-width data efficiently or unambiguously.)
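
As a rough sketch of the substr() approach, here it is applied to the ECON 103 line from the question. The offsets are an assumption read off the header line (course name at offset 13, course number at 17, then 4-character grade columns starting at 22), and field() is a hypothetical helper, not a built-in:

```perl
use strict;
use warnings;

my $line = q{0            ECON 103   98      35 114   1  14  75           9      35               1          10       1};

# Hypothetical helper: cut one fixed-width field and trim it.
sub field {
    my ($str, $offset, $width) = @_;
    return '' if $offset >= length $str;   # column lies past the end of a short line
    my $chunk = substr($str, $offset, $width);
    $chunk =~ s/^\s+|\s+$//g;
    return $chunk;
}

my $course = field($line, 13, 4);
my $number = field($line, 17, 5);

# 24 grade columns (A through FS), 4 characters each, starting at offset 22.
my @grades = map { field($line, 22 + 4 * $_, 4) } 0 .. 23;

print "$course $number: A=$grades[0] A-=$grades[1] B+=$grades[2] B=$grades[3]\n";
# prints: ECON 103: A=98 A-= B+=35 B=114
```

Empty columns simply come back as empty strings, so nothing has to be guessed about which number belongs to which column.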

j_random_hacker
Not ambiguous: the data just looks like it's tabulated, similarly to how you'd output it were you to use printf("%-10s%-5s..",$a,$b);
mfontani
OK that's good news! *Definitely* drop the regex and use plain old `substr()` or `unpack()`.
j_random_hacker
Why the -1? Just interested.
j_random_hacker
+5  A: 

See my program:

use strict;
use warnings;

# Column details and sample line, from the post
my $header  = q{0 AOZSVIN, TAMSSZ B      A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS};
my $sample  = q{0            AAS  150   23  25  16  35  45  14   8  10   2   1   1   4                           4                     };
#               -+--------+-----+-----+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---..
# chars         1212345678912345612345612341234123412341234123412341234123412341234123412341234123412341234123412341234123412341234...
# num. chars:   2 9        6     6     4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   *
my $unpack  = q{A2A9       A6    A6    A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A4  A*};
$unpack =~ s/\s//g;

# Get column names from the "$header" variable above
my @column_names = unpack($unpack, $header);
s/\s+$// for @column_names; # get rid of trailing spaces
s/^\s+// for @column_names; # get rid of leading spaces

# Some sample data in same format, to try the script out
my @samples = (
  q{0            AAS  150   23  25  16  35  45  14   8  10   2   1   1   4                           4                     },
  q{0            AAS  353    2   3   5   2   6       1                   2                                                     },
  q{0            T304 480M   3  10   8   8   2   3   2   1                                               1               1    },
  q{0            BIOS 206    3  14   5  11   9   8   4   8   3   1   1   6                           7                      },
);

my @big_sample = (@samples) ;#x 200_000;

my @unpacked_data_as_arrayrefs;
my @unpacked_data_as_hashrefs;
my $begin = time;
for my $line ( @big_sample ) {
    my @data = unpack($unpack,$line);
    s/\s+$// for @data; # get rid of trailing spaces
    s/^\s+// for @data; # get rid of leading spaces
    push @unpacked_data_as_arrayrefs, [@data]; # stop here if this is all you need
    ## below converts the data in a hash, based on the column names given
    #my %as_hash;
    #for ( 0..$#column_names ) {
    #    $as_hash{ $column_names[$_] } = $data[$_];
    #}
    #push @unpacked_data_as_hashrefs, { %as_hash };
}
my $tot = time - $begin;
print "Done in $tot seconds\n";

# verify all data is as we expected
# uncomment the ones that test hashref, if the above hashref-building code is also uncommented.
{
    use Test::More;
    # first sample
    is($unpacked_data_as_arrayrefs[0]->[2],'AAS'); # AAS in the third column
    is($unpacked_data_as_arrayrefs[0]->[7],'35');  # 35 in the 8th column
    # fourth sample
    is($unpacked_data_as_arrayrefs[3]->[2],'BIOS');
    is($unpacked_data_as_arrayrefs[3]->[15],'6');
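    # the ECON 103 line from the question, which has a three-digit value
    my @econ = unpack($unpack, q{0            ECON 103   98      35 114   1  14  75           9      35               1          10       1});
    s/^\s+|\s+$//g for @econ;
    is($econ[7],'114');
    is($econ[10],'75');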
    done_testing();
}

It uses unpack to split the text into chunks based on the width (in characters) of each field in your string. See perlpacktut for more details on using unpack for this sort of string munging. unpack is probably the best tool for this format, as it is much faster than a regex (it parses 600_000 such strings in ~6 seconds on my machine).

Please let me know if you need to be walked through any parts of the program. I did not post it here as it's a bit on the longish side (better to have comments than not!). Please tell me if you'd rather I did.

mfontani
It works on your example, but when I try mine I don't get the desired result. Thanks.
soulSurfer2010
Hi, please take a look at the file sample I added to my original post. Thanks.
soulSurfer2010
Replace the header in my example with your header, and create the correct unpack string for your data based on the number of characters each column should have. The data you posted in your question wasn't properly aligned. If you can post a representative sample of your data (with the column header line), I can redo the example so it works on your data.
mfontani
Updated with the new data you put in your question
mfontani
Your solution works fine; it only breaks when I have numbers with 3 digits. I tried changing to A5, but that caused more problems.
soulSurfer2010
Do numbers in these sections overflow then? Or do they begin in a different column? If you provide sample data, things can be tested. Without data, we cannot ;)
mfontani
I've modified the example in my post; check your program with it. By the way, very nice work with unpack. I've been reading about it in the docs and am still looking for a good tutorial or guide.
soulSurfer2010
I have a couple examples in _Effective Perl Programming_ of using unpack for this sort of thing. I didn't turn up much else when I was writing that bit of the book.
brian d foy
This code aligns columns a bit differently than mine. I take the end of the grade field as the last non-whitespace character in its header. I don't know if that's why this one is falling over for the three digit problem.
brian d foy
Updated gist and inlined code with the new line example: as brian said, alignment was key: looks like the numbers' last digit should be on the header's last letter. The edited version seems to do the trick.
mfontani
+3  A: 

If columns can be empty, either (a) your data is ambiguous and you've got a bigger problem than a slow regular expression, or (b) your data is in a fixed-width format, like this:

NAME   A     A-
foo    123   456
bar          789
fubb   111     

If you do have fixed-width data, the appropriate parsing tool is substr (or unpack), not regular expressions.
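
A small sketch with unpack on the table above. The template 'A7 A6 A*' is an assumption taken from the header widths (NAME is 7 characters wide, A is 6, and A- is whatever remains):

```perl
use strict;
use warnings;

my @lines = (
    'foo    123   456',
    'bar          789',
    'fubb   111',
);

my @rows;
for my $line (@lines) {
    # "A" fields are space-padded text; unpack strips their trailing spaces.
    my @fields = unpack 'A7 A6 A*', $line;
    push @rows, \@fields;
    print join(',', map { "[$_]" } @fields), "\n";
}
```

The empty A column for "bar" comes back as an empty string in the second position, so 789 stays in the A- column where it belongs.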

FM
I guess it's what you called fixed width. I added the first few lines from the file to my original post.
soulSurfer2010
+3  A: 

Don't use regexes for this. It looks like a fixed-column format, so unpack will be much faster.

Here's a sample program showing the meat of the problem. You'll still have to figure out how to integrate it so you know when a new person record is starting and so on. I made it so the format for unpacking the values comes mostly from the headers so you don't have to spend so much time counting columns (but also so that it responds easily to changes in the column positions):

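use strict;
use warnings;

# The first line is the header; its layout drives the unpack format for the data lines.
chomp( my $header = <DATA> );
my( $num, $name, $rest ) = unpack "a2 a20 a*", $header;

# Split right after each column label, so every chunk is "leading spaces + label"
# and its length is the width of that column.
my @grades = split /(?<=\S)(?=\s)/, $rest;

my @grade_keys = map { /(\S+)/ } @grades;

my $format = 'a13 a4 a5 ' . join ' ', map { 'a' . length } @grades;

my %hash;
while( <DATA> ) {
    my( $key, $label, $number, @grades ) = unpack $format, $_;

    $$_ =~ s/\s//g foreach ( \$key, \$label, \$number );

    @{ $hash{$key}{$label}{$number} }{@grade_keys} =
         map { s/\s//g; $_ } @grades;
    }

use Data::Dumper;
print Dumper( \%hash );

# sample lines taken from the question
__DATA__
0 PALMER, JAN            A  A-  B+   B  B-  C+   C  C-  D+   D  D-   F  CR   P  PR   I  I*   W  WP  WF  AU  NR  FN  FS   TOTAL
0            ECON 103   98      35 114   1  14  75           9      35               1          10       1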

You say that you're having a problem because some columns have values with three digits. Unless that's misaligning the grid so the least significant digit doesn't align with the last non-whitespace character in its column, this code should work.

Here's the data structure I produced for "AOZSVIN, TAMSSZ B" (whose sample data is now hidden in your question edits), although you can arrange it any way you like:

$VAR1 = {
          '0' => {
                   'BIOS' => {
                               '206' => {
                                          'F' => '6',
                                          'AU' => '',
                                          'FS' => '',
                                          'B-' => '9',
                                          'D+' => '3',
                                          'CR' => '',
                                          'B+' => '5',
                                          'WP' => '7',
                                          'C+' => '8',
                                          'NR' => '',
                                          'C' => '4',
                                          'PR' => '',
                                          'A' => '3',
                                          'W' => '',
                                          'I*' => '',
                                          'A-' => '14',
                                          'P' => '',
                                          'WF' => '',
                                          'B' => '11',
                                          'FN' => '',
                                          'D' => '1',
                                          'D-' => '1',
                                          'I' => '',
                                          'C-' => '8'
                                        }
                             },
                   'AAS' => {
                              '353' => {
                                         'F' => '2',
                                         'AU' => '',
                                         'FS' => '',
                                         'B-' => '6',
                                         'D+' => '',
                                         'CR' => '',
                                         'B+' => '5',
                                         'WP' => '',
                                         'C+' => '',
                                         'NR' => '',
                                         'C' => '1',
                                         'PR' => '',
                                         'A' => '2',
                                         'W' => '',
                                         'I*' => '',
                                         'A-' => '3',
                                         'P' => '',
                                         'WF' => '',
                                         'B' => '2',
                                         'FN' => '',
                                         'D' => '',
                                         'D-' => '',
                                         'I' => '',
                                         'C-' => ''
                                       },
                              '150' => {
                                         'F' => '4',
                                         'AU' => '',
                                         'FS' => '',
                                         'B-' => '45',
                                         'D+' => '2',
                                         'CR' => '',
                                         'B+' => '16',
                                         'WP' => '4',
                                         'C+' => '14',
                                         'NR' => '',
                                         'C' => '8',
                                         'PR' => '',
                                         'A' => '23',
                                         'W' => '',
                                         'I*' => '',
                                         'A-' => '25',
                                         'P' => '',
                                         'WF' => '',
                                         'B' => '35',
                                         'FN' => '',
                                         'D' => '1',
                                         'D-' => '1',
                                         'I' => '',
                                         'C-' => '10'
                                       }
                            },
                   'T304' => {
                               '480M' => {
                                           'F' => '',
                                           'AU' => '',
                                           'FS' => '1',
                                           'B-' => '2',
                                           'D+' => '',
                                           'CR' => '',
                                           'B+' => '8',
                                           'WP' => '',
                                           'C+' => '3',
                                           'NR' => '',
                                           'C' => '2',
                                           'PR' => '',
                                           'A' => '3',
                                           'W' => '',
                                           'I*' => '',
                                           'A-' => '10',
                                           'P' => '',
                                           'WF' => '1',
                                           'B' => '8',
                                           'FN' => '',
                                           'D' => '',
                                           'D-' => '',
                                           'I' => '',
                                           'C-' => '1'
                                         }
                             }
                 }
        };

And for your new sample for "Palmer, Jan":

$VAR1 = {
          '0' => {
                   'ECON' => {
                               '103' => {
                                          'F' => '35',
                                          'AU' => '1',
                                          'FS' => '',
                                          'B-' => '1',
                                          'D+' => '',
                                          'CR' => '',
                                          'B+' => '35',
                                          'WP' => '10',
                                          'C+' => '14',
                                          'NR' => '',
                                          'C' => '75',
                                          'PR' => '',
                                          'A' => '98',
                                          'W' => '',
                                          'I*' => '',
                                          'A-' => '',
                                          'P' => '',
                                          'WF' => '',
                                          'B' => '114',
                                          'FN' => '',
                                          'TOTAL' => '',
                                          'D' => '9',
                                          'D-' => '',
                                          'I' => '1',
                                          'C-' => ''
                                        }
                             }
                 }
        };
brian d foy
Working on it. With the example given above it's still a no-go: some columns sometimes have 3-digit numbers, and that's what causes it to fail.
soulSurfer2010
If you showed your unpack code and more example cases, we might be able to help.
brian d foy
Unpack code by mfontani above, my sample case is on my post, just added it. Btw, love your book (yet to read it all though, need some time). thanks,
soulSurfer2010
Thanks! I definitely learned something from this reply!
soulSurfer2010
A: 

First break the line up into fixed-width chunks, spaces and all. Then clean the chunks up. Otherwise you're trying to do two things at once, which is error-prone.
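
The two steps might look something like this sketch. It assumes 4-character grade columns, as in the question's data, and adds an optional sanity check as a third step:

```perl
use strict;
use warnings;

# Five 4-character grade columns; the second one is empty.
my $grades = '  98      35 114   1';

# Step 1: cut into fixed-width chunks, spaces and all ("a" does not strip anything).
my @chunks = unpack '(a4)*', $grades;

# Step 2: clean each chunk up.
my @values = map { my $v = $_; $v =~ s/\s+//g; $v } @chunks;

# Step 3 (optional): every value should be empty or all digits.
my @bad = grep { $_ ne '' && $_ !~ /^\d+$/ } @values;
die "unexpected values: @bad\n" if @bad;

print join(',', @values), "\n";   # prints: 98,,35,114,1
```

Keeping the steps separate also makes misaligned input easy to spot: a chunk that fails the check in step 3 usually means the column widths are wrong.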

Jason