So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:

$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The 
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...

As you can see, I'm dealing with nasty html and have to beat it into submission.

I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.

I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:

%rxcheck = (
    'time of day' => '\d+:\d+',
    'starts with capital letters then a capital word' => '^([A-Z]+\s)+[A-Z][a-z]',
    'ends with a single capital letter' => '\b[A-Z]\.',
);

And this is how I use it:

foreach my $key (keys %rxcheck) {
    if ($snippet =~ /$rxcheck{$key}/g) { blah blah }
}

The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with, and there is a $1 or $2 in the replacement.

%rxcheck2 = (
    '(\w) \"' => '$1\"',
);

The above is to do this:

$snippet =~ s/(\w) \"/$1\"/g;

But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:

if($snippet =~ /$key/$rxcheck2{ $key }/g){  }

And that doesn't work.

So 2 questions:

Easy: How do I handle a large number of regexes in an easily editable way, so I can change and add them without just cutting and pasting the line before?

Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?

Thanks for your help -

+3  A: 

Hashes are not good because they are unordered. I find that an array of arrays, where each inner array holds a compiled regex and a string to eval (actually it is a double eval), works best:

#!/usr/bin/perl

use strict;
use warnings;

my @replace = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my $s = "foo bar baz foo bar baz";

for my $replace (@replace) {
    $s =~ s/$replace->[0]/$replace->[1]/gee;
}

print "$s\n";
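To see why the replacement strings carry their own double quotes and why two levels of evaluation are needed, here is a minimal sketch that runs the same substitution with no /e, one /e, and /ee. The variable names are made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $pattern     = qr/(bar)/;
my $replacement = '"<$1>"';    # single-quoted: holds the literal text "<$1>"

# Hypothetical names; all three start from the same input.
my ($none, $one, $two) = ('foo bar') x 3;

$none =~ s/$pattern/$replacement/;      # $replacement is interpolated once, $1 inside it is not
$one  =~ s/$pattern/$replacement/e;     # evaluates the expression $replacement, giving its contents
$two  =~ s/$pattern/$replacement/ee;    # evaluates those contents again, as Perl code

print "$none\n$one\n$two\n";
```

The first two lines of output are both foo "<$1>" and only the third is foo <bar>, because only the second evaluation treats the stored text as a double-quoted string.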

I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:

bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate refs subs
refs  10288/s   -- -91%
subs 111348/s 982%   --

Here is the code that produces those numbers:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my @subs = (
    sub { $_[0] =~ s/(bar)/<$1>/g },
    sub { $_[0] =~ s/foo/bar/g },
);

my @refs = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my %subs = (
    subs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $sub (@subs) {
            $sub->($s);
        }
        return $s;
    },
    refs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $ref (@refs) {
            $s =~ s/$ref->[0]/$ref->[1]/gee;
        }
        return $s;
    }
);

for my $sub (keys %subs) {
    print $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;
Chas. Owens
j_random_hacker
Due to a bug in the flag evaluation portion of regexes people found that each extra e added another level of eval. This was found to be handy, so it got promoted to a feature. With /e the first replace becomes '<$1>', that is you see '<$1>' in $s. The second e then evals '<$1>' producing the desired '<bar>' replacement.
Chas. Owens
You can use Tie::DxHash to maintain insertion order: http://search.cpan.org/~kruscoe/Tie-DxHash-1.05/lib/Tie/DxHash.pm
Drew Stephens
@dinomite Yes, but at the loss of performance with no real gain in readability. This isn't really a job for a hash (keys are not randomly accessed, there is no need for unique keys, the data is not unordered, etc). An array of coderefs seems to be the best solution.
Chas. Owens
@Chas: Thanks, but I'm wondering why you could/would not just say qr/(bar)/ => '<$1>' and then use a single /e. (I'm aware of /ee, /eee etc... so far I haven't found cause to use them but I'm on the lookout :))
j_random_hacker
@j_random_hacker because /e is evaluating $ref->[1] not the contents of $ref->[1]. The double quoted string nature of the replace is removed when you say /e.
Chas. Owens
@Chas: I see, thanks. I guess I thought Perl would treat that $ref->[1] as an expression to be interpolated without needing any /e (i.e. in the same way that a plain mention of $foo would be interpolated without /e). Oh well, cryptic Perl parsing rules 1, j_random_hacker 0...
j_random_hacker
@j_random_hacker $ref->[1] is interpolated when there is no /e, but when /e is in effect there is no interpolation step.
Chas. Owens
@Chas: I think I've finally got it -- /e implies no interpolation (like single quotes). Thanks for your patience :)
j_random_hacker
+6  A: 

Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for ($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
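As a rough check of that claim, the sketch below runs two of the substitutions from the question inside such a loop and then prints both the cleaned string and the outer $_. The sample input and the "outer" value are invented for the example.

```perl
use strict;
use warnings;

my $text = "<p>Some   <b>nasty</b>\n html</p>";    # made-up sample input
$_ = "outer";                                      # a pre-existing value in $_

for ($text) {          # $_ is aliased to $text for the duration of the block
    s/<[^>]+>/ /g;     # tags become single spaces
    s/\s+/ /g;         # runs of whitespace collapse to one space
}

print "[$text]\n";     # the substitutions changed $text itself
print "[$_]\n";        # the outer value is back
```

Because the loop variable is an alias rather than a copy, the edits land directly in $text, and $_ holds "outer" again once the block ends.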

At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?

Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, count etc. all regexes.

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;
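On the question's point about modifiers: a qr// object keeps pattern flags such as /i with it, while /g is a property of the s/// operation and has to stay there. A small sketch, with a hypothetical $bold pattern and sample string:

```perl
use strict;
use warnings;

my $search = qr/(<[^>]+>)/;    # from above: captures one tag
my $bold   = qr/<b>/i;         # hypothetical: /i is stored inside the compiled pattern

my $str = 'x <B>y</B> <b>z</b>';

(my $first = $str) =~ s/$search/foo,$1,bar/;    # no /g, so only the first tag is touched
(my $all   = $str) =~ s/$bold/*/g;              # /g is given on s///, /i comes from $bold

print "$first\n$all\n";
```

So a table of rules can hold case-insensitivity inside each qr//, but whether a rule is global needs to be decided where the s/// is written.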

However I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    &{$repl}($_);
}

You could of course use a hash instead, with some more descriptive string as the key, and/or you could use multivalued elements (or hash values) that include comments or other information.
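One way the multivalued idea might look, as a sketch rather than a recommendation: an array of small hashes, each holding a comment, a compiled pattern and a replacement subroutine. The field names (note, search, replace) and the sample text are assumptions made up for this example. The replacement is a subroutine instead of a string, so $1 is read at match time and no string eval is involved.

```perl
use strict;
use warnings;

# Each rule: a comment, a compiled pattern, and a sub that returns the replacement.
# Field names are hypothetical. An array is used so the rules run in a fixed order.
my @rules = (
    { note => 'strip tags',                search => qr/<[^>]+>/, replace => sub { ' ' } },
    { note => 'collapse whitespace',       search => qr/\s+/,     replace => sub { ' ' } },
    { note => 'drop space before a quote', search => qr/(\w) "/,  replace => sub { "$1\"" } },
);

my $text = '<p>He said:   "hi " and left</p>';    # made-up sample input

for my $rule (@rules) {
    # A single /e calls the sub for every match; s///g returns the number of replacements.
    my $count = ($text =~ s/$rule->{search}/$rule->{replace}->()/ge);
    printf "%-28s %d\n", $rule->{note}, $count || 0;
}

print "[$text]\n";
```

The per-rule count is one example of the "report on" use mentioned under Problem #1; adding a rule means adding one line to @rules.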

j_random_hacker
/e and /ee are safer than string eval
Chas. Owens
@Chas: Definitely prettier in this case, but how are they safer?
j_random_hacker
But I like the subroutine version.
Chas. Owens
Hmm, I know /e is safer because it is more like eval {} than eval "", but /ee may not be safer, but I can't remember why.
Chas. Owens
/e is just a string eval. /ee is the same thing, but you take the result of the first /e and do it again. There isn't a safety feature by adding or subtracting an /e.
brian d foy
I really like John Siracusa's edit, suggesting using "for ($mystr) { ... }" as a way to "topicalise" -- neat!
j_random_hacker
+3  A: 

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.

Sinan Ünür
Good point, I was answering the general question of how to run multiple regexes against a string in a maintainable way, but the specific question is about running a regex on HTML, which is a no-no. See http://stackoverflow.com/questions/701166http://stackoverflow.com/questions/701166 for why and http://stackoverflow.com/questions/773340http://stackoverflow.com/questions/773340 for examples on how to use HTML parsers.
Chas. Owens
That is weird, it double pasted the links, let me try again: http://stackoverflow.com/questions/701166 for why.
Chas. Owens
and http://stackoverflow.com/questions/773340 for examples of parsers in action.
Chas. Owens
HTML::Parser is often too much work for the nastiness of some data sources. If you can do a bunch of quick substitutions to regularize the input, you can make things easier down the road. This isn't a question about parsing HTML, but cleaning up dirty data.
brian d foy
HTML::Parser is indeed too much work in most cases. However, there are many libraries that solve many a complicated problem. I have dealt with incredibly badly formed HTML in very large files thanks to such modules. If we knew what information Jeff is trying to get out of these files, a better alternative than a massive block of substitutions with no underlying theme might present itself.
Sinan Ünür