ansaurus

Question

How can I efficiently handle multiple Perl search/replace operations on the same string?

Answer 1

+3 A:

Hashes are not a good fit because they are unordered. I find that an array of arrays, where each inner array holds a compiled regex and a string to eval (actually a double eval), works best:

#!/usr/bin/perl

use strict;
use warnings;

my @replace = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my $s = "foo bar baz foo bar baz";

for my $replace (@replace) {
    $s =~ s/$replace->[0]/$replace->[1]/gee;
}

print "$s\n";

I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:

bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate refs subs
refs  10288/s   -- -91%
subs 111348/s 982%   --

Here is the code that produces those numbers:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my @subs = (
    sub { $_[0] =~ s/(bar)/<$1>/g },
    sub { $_[0] =~ s/foo/bar/g },
);

my @refs = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my %subs = (
    subs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $sub (@subs) {
            $sub->($s);
        }
        return $s;
    },
    refs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $ref (@refs) {
            $s =~ s/$ref->[0]/$ref->[1]/gee;
        }
        return $s;
    }
);

for my $sub (keys %subs) {
    print $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;

Chas. Owens 2009-05-09 16:47:57

j_random_hacker 2009-05-09 16:59:29

Due to a bug in the flag-evaluation portion of regexes, people found that each extra e added another level of eval. This turned out to be handy, so it was promoted to a feature. With a single /e, the replacement for the first pair is the string "<$1>" itself (quotes included), so that literal text is what you see in $s. The second e then evals "<$1>", producing the desired <bar> replacement.

Chas. Owens 2009-05-09 17:06:58

You can use Tie::DxHash to maintain insertion order: http://search.cpan.org/~kruscoe/Tie-DxHash-1.05/lib/Tie/DxHash.pm

Drew Stephens 2009-05-09 17:15:32

@dinomite Yes, but at the loss of performance with no real gain in readability. This isn't really a job for a hash (keys are not randomly accessed, there is no need for unique keys, the data is not unordered, etc). An array of coderefs seems to be the best solution.

Chas. Owens 2009-05-09 17:24:36

@Chas: Thanks, but I'm wondering why you could/would not just say qr/(bar)/ => '<$1>' and then use a single /e. (I'm aware of /ee, /eee etc... so far I haven't found cause to use them but I'm on the lookout :))

j_random_hacker 2009-05-09 17:43:50

@j_random_hacker because /e is evaluating $ref->[1] not the contents of $ref->[1]. The double quoted string nature of the replace is removed when you say /e.

Chas. Owens 2009-05-09 18:40:42

@Chas: I see, thanks. I guess I thought Perl would treat that $ref->[1] as an expression to be interpolated without needing any /e (i.e. in the same way that a plain mention of $foo would be interpolated without /e). Oh well, cryptic Perl parsing rules 1, j_random_hacker 0...

j_random_hacker 2009-05-09 20:53:51

@j_random_hacker $ref->[1] is interpolated when there is no /e, but when /e is in effect there is no interpolation step.

Chas. Owens 2009-05-09 21:01:44

@Chas: I think I've finally got it -- /e implies no interpolation (like single quotes). Thanks for your patience :)

j_random_hacker 2009-05-10 09:10:10
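The /e versus /ee behaviour discussed in the comments above can be checked with a minimal sketch. The variable names ($replacement, $plain, $one_e, $two_e) are made up for this illustration and are not from the original answer.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# The replacement is stored as data: Perl source for a double-quoted string.
my $replacement = '"<$1>"';

# No /e: $replacement is interpolated once; the $1 inside its value is not expanded.
(my $plain = "foo bar") =~ s/(bar)/$replacement/;

# One /e: the expression $replacement is evaluated, which yields the same literal text.
(my $one_e = "foo bar") =~ s/(bar)/$replacement/e;

# Two /e: the text produced by the first evaluation is itself evaluated as Perl code.
(my $two_e = "foo bar") =~ s/(bar)/$replacement/ee;

print "$plain\n";    # foo "<$1>"
print "$one_e\n";    # foo "<$1>"
print "$two_e\n";    # foo <bar>
```

Only the /ee form expands $1, which is why the array-of-arrays approach needs the double eval, and also why it should only be fed replacement strings you trust.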

Answer 2

+6 A:

Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for ($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
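As a small runnable sketch of the topicalizer idiom (the sample strings are invented for illustration): inside the loop $_ is an alias for $text, so the substitutions change $text directly, and the previous value of $_ is back in place once the block ends.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $text = "foo qux foo";

# Give $_ a value beforehand so we can see that the loop leaves it alone.
$_ = "outer value";

# Inside the block, $_ is an alias for $text, so each s/// edits $text in place.
for ($text) {
    s/foo/bar/g;
    s/qux/meh/g;
}

print "$text\n";    # bar meh bar
print "$_\n";       # outer value
```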

At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?

Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, count etc. all regexes.

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;

However I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    &{$repl}($_);
}

You could of course use a hash instead, with some more useful string as the key, and/or you could use multivalued elements (or hash values) that include comments or other information.
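One way the multivalued variant might look, as a sketch only: each element is a hash reference pairing a label with a coderef, and the loop records how many substitutions each step made. The keys name and code, the variables @steps and %count, and the sample input are all assumptions made for this example. It relies on s///g returning the number of substitutions made (or a false value when there were none).

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Each step pairs a human-readable label with the substitution to run.
# An array (not a hash) keeps the steps in the order they are listed.
my @steps = (
    { name => 'strip tags',          code => sub { $_[0] =~ s/<[^>]+>/ /g } },
    { name => 'collapse whitespace', code => sub { $_[0] =~ s/\s+/ /g } },
);

my $text = "<p>Hello   <b>world</b></p>";

my %count;
for my $step (@steps) {
    # The coderef returns the result of s///g: the substitution count, or false if none.
    my $n = $step->{code}->($text) || 0;
    $count{ $step->{name} } = $n;
    print "$step->{name}: $n substitution(s)\n";
}

print "[$text]\n";
```

Because the coderefs modify $_[0], which aliases the caller's variable, the string is changed in place exactly as in the plain array-of-subs version.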

j_random_hacker 2009-05-09 16:56:29

/e and /ee are safer than string eval

Chas. Owens 2009-05-09 17:02:08

@Chas: Definitely prettier in this case, but how are they safer?

j_random_hacker 2009-05-09 17:04:05

But I like the subroutine version.

Chas. Owens 2009-05-09 17:07:38

Hmm, I know /e is safer because it is more like eval {} than eval "", but /ee may not be safer, but I can't remember why.

Chas. Owens 2009-05-09 17:08:33

/e is just a string eval. /ee is the same thing, but you take the result of the first /e and do it again. There isn't a safety feature by adding or subtracting an /e.

brian d foy 2009-05-09 20:39:45

I really like John Siracusa's edit, suggesting using "for ($mystr) { ... }" as a way to "topicalise" -- neat!

j_random_hacker 2009-05-12 03:37:33

Answer 3

+3 A:

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.

Sinan Ünür 2009-05-09 17:09:37

Good point. I was answering the general question of how to run multiple regexes against a string in a maintainable way, but the specific question is about running a regex on HTML, which is a no-no. See http://stackoverflow.com/questions/701166http://stackoverflow.com/questions/701166 for why and http://stackoverflow.com/questions/773340http://stackoverflow.com/questions/773340 for examples on how to use HTML parsers.

Chas. Owens 2009-05-09 17:31:57

That is weird, it double pasted the links, let me try again: http://stackoverflow.com/questions/701166 for why.

Chas. Owens 2009-05-09 18:42:31

and http://stackoverflow.com/questions/773340 for examples of parsers in action.

Chas. Owens 2009-05-09 18:43:01

HTML::Parser is often too much work for the nastiness of some data sources. If you can do a bunch of quick substitutions to regularize the input, you can make things easier down the road. This isn't a question about parsing HTML, but cleaning up dirty data.

brian d foy 2009-05-09 20:41:52

HTML::Parser is indeed too much work in most cases. However, there are many libraries that solve many a complicated problem. I have dealt with incredibly badly formed HTML in very large files thanks to such modules. If we knew what information Jeff is trying to get out of these files, a better alternative than a massive block of substitutions with no underlying theme might present itself.

Sinan Ünür 2009-05-10 02:10:11
