tags:

views:

67

answers:

3

I'm using Perl to perform some file cleansing, and am running into some performance issues. One of the major parts of my code involves standardizing name fields. I have several sections that look like this:

sub substitute_titles
{
    my ($inStr) = @_;
    ${$inStr} =~ s/ PHD./ PHD /;
    ${$inStr} =~ s/ P H D / PHD   /;
    ${$inStr} =~ s/ PROF./ PROF /;
    ${$inStr} =~ s/ P R O F / PROF    /;
    ${$inStr} =~ s/ DR./ DR /;
    ${$inStr} =~ s/ D.R./ DR  /;
    ${$inStr} =~ s/ HON./ HON /;
    ${$inStr} =~ s/ H O N / HON   /;
    ${$inStr} =~ s/ MR./ MR /;
    ${$inStr} =~ s/ MRS./ MRS /;
    ${$inStr} =~ s/ M R S / MRS   /;
    ${$inStr} =~ s/ MS./ MS /;
    ${$inStr} =~ s/ MISS./ MISS /;
}

I'm passing by reference to try and get at least a little speed, but I fear that running so many (literally hundreds) of specific string replaces on tens of thousands (likely hundreds of thousands eventually) of records is going to hurt the performance.

Is there a better way to implement this kind of logic than what I'm doing currently?

Thanks

Edit: Quick note, not all the replace functions are just removing periods and spaces. There are string deletions, soundex groups, etc.

+4  A: 

Here's a technique that should work pretty well if all of your search items are fixed strings:

my %title_replacements = (
  ' PHD.' => ' PHD ',
  ' P H D ' => ' PHD  ',
  # ...,
);

my $titles_to_replace = join '|',
  map quotemeta, 
  keys %title_replacements;

$titles_to_replace = qr/$titles_to_replace/;

sub substitute_titles {
  my ($in) = @_;
  $$in =~ s/($titles_to_replace)/$title_replacements{$1}/g;
}

If you're running on a perl older than 5.10.0 or 5.8.9, you should consider using Regexp::Trie or Regexp::Assemble to build the regex, but on current perls the regex compiler will automatically trie-optimize any large list of alternations like that, so I left out the unnecessary complication.

hobbs
+4  A: 

Rather than running each substitution separately, create a closure that can do the work for you in a more efficient way:

sub make_translator {
    my %table = @_;
    my $regex = join '|' => map {quotemeta} keys %table;
    $regex = qr/$regex/;

    return sub {s/($regex)/$table{$1}/g}
}

my $translator = make_translator
    ' PHD.'   => ' PHD ',
    ' P H D ' => ' PHD   ',
    ' PROF.'  => ' PROF ';   # ... the rest of the pairs

my @list_of_strings = qw/.../;

$translator->() for @list_of_strings;

It is fastest to not pass anything and use $_ aliased to the array value (which the for loop does for you).

Eric Strom
A: 

I would most likely make a sub that created my patterns for me. This way all I would have to do is pass in an array of the titles I want normalized. Example:

sub make_pattern {
    my $list_ref = shift;
    my %patterns;
    for my $title ( @{$list_ref} ) {
        my $result = uc $title;
        my $pattern = '/' . join( '\s*', (//, $title)) . '\.*/i';
        $patterns{$pattern} = $result;
    }
return \%patterns;
}

my @titles = qw (PHD MD DR PROF ) #... plus whatever other titles you have
my $conversion_hash = make_pattern(\@titles);

Then you the resulting hash in conjunction with a closure as listed in some of the other answers here. I have not had time to test my code yet, but it should work.

stocherilac