I have a file and a list of string pairs which I get from another file. I need to substitute the first string of each pair with the second one, and do this for every pair. Is there a more efficient or simpler way to do this (using Perl, grep, sed or other) than running a separate regexp substitution for each pair of values?

A: 

Build a hash of the pairs. Then split the target string into word tokens, and check each token against the keys in the hash. If it's present, replace it with the value of that key.
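A minimal sketch of that idea, assuming "word tokens" means runs of \w characters. The pairs and the name replace_tokens are made up for illustration; in practice the hash would be filled from the pairs file.

```perl
#! /usr/bin/perl

use warnings;
use strict;

# Hypothetical pairs; normally these would be read from the pairs file.
my %replace = (
  "foo" => "baz",
  "bar" => "quux",
);

# replace_tokens is an illustrative name. Splitting on a capturing group
# keeps the separators in the list, so spacing and punctuation are
# preserved when the tokens are joined back together.
sub replace_tokens {
  my ($line, $map) = @_;
  return join "" =>
    map { exists $map->{$_} ? $map->{$_} : $_ }
    split /(\W+)/, $line;
}

print replace_tokens("The foo is under the bar in the barn.\n", \%replace);
```

Because only whole tokens are looked up, "barn" is left alone here, which differs from a plain substring substitution.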

Amber
+6  A: 
#! /usr/bin/perl

use warnings;
use strict;

my %replace = (
  "foo" => "baz",
  "bar" => "quux",
);

my $to_replace = qr/@{["(" .
                       join("|" => map quotemeta($_), keys %replace) .
                       ")"]}/;

while (<DATA>) {
  s/$to_replace/$replace{$1}/g;
  print;
}

__DATA__
The food is under the bar in the barn.

The @{[...]} bit may look strange. It's a hack to interpolate generated content inside quote and quote-like operators. The result of the join goes inside the anonymous array-reference constructor [] and is immediately dereferenced thanks to @{}.

If all that seems too wonkish, it's the same as

my $search = join "|" => map quotemeta($_), keys %replace;
my $to_replace = qr/($search)/;

minus the temporary variable.
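The interpolation trick may be easier to see in an ordinary double-quoted string. A small sketch (the variable names are only for illustration):

```perl
use warnings;
use strict;

my @words = qw(foo bar);

# Any expression inside @{[ ... ]} is evaluated in list context and the
# resulting list is interpolated, joined with $" (a space by default).
my $count = "count: @{[ scalar @words ]}";
my $upper = "upper: @{[ map { uc } @words ]}";

print "$count\n";
print "$upper\n";
```

This prints "count: 2" and "upper: FOO BAR".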

Note the use of quotemeta—thanks Ivan!—which escapes the first string of each pair so the regular-expression engine will treat them as literal strings.

Output:

The bazd is under the quux in the quuxn.
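To illustrate the quotemeta note above, here is a small sketch with a hypothetical key that contains a regex metacharacter. Without escaping, the dot matches any character:

```perl
use warnings;
use strict;

my $key = "a.c";   # hypothetical key containing a metacharacter

my $raw     = qr/$key/;              # "." matches any character
my $escaped = quotemeta($key);       # becomes a\.c
my $literal = qr/$escaped/;          # matches only the literal text a.c

print "raw matches abc\n"     if "abc" =~ $raw;
print "literal matches abc\n" if "abc" =~ $literal;
print "literal matches a.c\n" if "a.c" =~ $literal;
```

Only the first and third lines are printed.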

Metaprogramming—that is, writing a program that writes another program—is also nice. The beginning looks familiar:

#! /usr/bin/perl

use warnings;
use strict;

use File::Compare;

die "Usage: $0 path ..\n" unless @ARGV >= 1;

# stub
my @pairs = (
  ["foo"     => "baz"],
  ["bar"     => "quux"],
  ['foo$bar' => 'potrzebie\\'],
);

Now we generate the program that does all the s/// replacements—but is quotemeta on the replacement side a good idea?

my $code =
  "sub { while (<>) { " .
  join(" " => map "s/" . quotemeta($_->[0]) .
                  "/"  . quotemeta($_->[1]) .
                  "/g;",
              @pairs) .
  "print; } }";
#print $code, "\n";

and compile it with eval:

my $replace = eval $code
  or die "$0: eval: $@\n";
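To inspect what gets generated, here is a sketch of a variant that edits a string argument instead of reading <>. The name $apply_rules and the string-argument form are my own assumptions for illustration, not part of the program above.

```perl
use warnings;
use strict;

# Same stub pairs as above.
my @pairs = (
  ["foo"     => "baz"],
  ["bar"     => "quux"],
  ['foo$bar' => 'potrzebie\\'],
);

# Build a sub that applies every s/// in order to its argument.
my $code =
  "sub { my (\$s) = \@_; " .
  join(" " => map "\$s =~ s/" . quotemeta($_->[0]) .
                  "/"         . quotemeta($_->[1]) .
                  "/g;",
              @pairs) .
  " return \$s; }";

my $apply_rules = eval $code
  or die "$0: eval: $@\n";

print $code, "\n";
print $apply_rules->('foo$bar'), "\n";
```

The last line prints baz$quux. The generated substitutions run one after another, so each rule sees the output of the earlier ones; with these stub pairs the foo rule has already rewritten the text before the foo$bar rule gets a chance to match. The single-alternation approach does not have this ordering effect.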

To do the replacements, we use Perl's ready-made in-place editing:

# set up in-place editing
$^I = ".bak";
my @save_argv = @ARGV;

$replace->();

Below is an extra nicety that restores backups that the File::Compare module judges to have been unnecessary:

# in-place editing is conservative: it creates backups
# regardless of whether it modifies the file
foreach my $new (@save_argv) {
  my $old = $new . $^I;
  if (compare($new, $old) == 0) {
    rename $old => $new
      or warn "$0: rename $old => $new: $!\n";
  }
}
Greg Bacon
You should also `quotemeta` the keys before putting them into the regexp.
Ivan Nevostruev
As I am completely new to Perl, can you explain what is done here? Specifically this line: @{["(" . join("|" => keys %replace) . ")"]}
Artium
@Artium I got carried away. See updated answer.
Greg Bacon
@Greg : That fat comma really hurts readability. Just a personal opinion...
Zaid
+2  A: 

There are two ways, both of which require you to compile a regex alternation from the keys of the table:

my %table = qw<The A the a quick slow lazy dynamic brown pink . !>;
my $alt 
    = join( '|'
          , map  { quotemeta } 
            sort { ( length $b <=> length $a ) || $a cmp $b } 
            keys %table
          )
    ;
my $keyword_regex = qr/($alt)/;
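The sort puts longer keys first, which matters when one key is a prefix of another: Perl tries alternatives left to right and takes the first one that matches. A small sketch with hypothetical overlapping keys (first_match is an illustrative helper name):

```perl
use warnings;
use strict;

# Build an alternation from the keys in the order given and return
# whichever alternative matches first.
sub first_match {
  my ($text, @ordered_keys) = @_;
  my $alt = join "|", map { quotemeta } @ordered_keys;
  return $text =~ /($alt)/ ? $1 : undef;
}

print first_match("barn", "bar", "barn"), "\n";   # shorter key listed first
print first_match("barn", "barn", "bar"), "\n";   # longer key listed first
```

This prints "bar" and then "barn", so without the length sort the longer key could never be chosen.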

Then you can use this regex in a substitution:

my $text 
    = <<'END_TEXT';
The quick brown fox jumped over the lazy dog.  The quick brown fox jumped over the lazy dog. 
The quick brown fox jumped over the lazy dog.  The quick brown fox jumped over the lazy dog.  
END_TEXT

$text =~ s/$keyword_regex/$table{ $1 }/ge; # <- 'e' means evaluate the replacement as Perl code

Or you can do it in a loop:

use English qw<@LAST_MATCH_START @LAST_MATCH_END>;
while ( $text =~ /$keyword_regex/g ) { 
    my $key = $1;
    my $rep = $table{ $key };
    # use the 4-arg form
    substr( $text, $LAST_MATCH_START[1]
          , $LAST_MATCH_END[1] - $LAST_MATCH_START[1], $rep 
          );
    # reset the position to start + new actual
    pos( $text ) = $LAST_MATCH_START[1] + length $rep;
}
Axeman
A: 

If eval is not a security concern:

eval $(awk 'BEGIN { printf "sed \047"} {printf "%s", "s/\\<" $1 "\\>/" $2 "/g;"} END{print "\047 substtemplate"}' substwords )

This constructs a long sed command consisting of multiple substitution commands, so it may exceed your maximum command-line length. It expects each line of the word pair file to consist of two words separated by whitespace. Substitutions will be made for whole words only (no clbuttic substitutions).

It may choke if the word pair file contains characters that are significant to sed.

You can do it this way if your sed insists on -e:

eval $(awk 'BEGIN { printf "sed"} {printf "%s", " -e \047s/\\<" $1 "\\>/" $2 "/g\047"} END{print " substtemplate"}' substwords)
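For comparison, a rough Perl sketch of the same whole-word idea that avoids eval and the command-line length limit. The helper names load_pairs and substitute_words are made up, the in-memory filehandle stands in for the substwords file, and \b assumes every key starts and ends with a word character.

```perl
use warnings;
use strict;

# Parse "old new" lines (the substwords format) into a hash.
sub load_pairs {
  my ($fh) = @_;
  my %map;
  while (my $line = <$fh>) {
    my ($old, $new) = split " ", $line;
    next unless defined $new;        # skip blank or malformed lines
    $map{$old} = $new;
  }
  return \%map;
}

# Replace whole words only, in a single pass over the text.
sub substitute_words {
  my ($text, $map) = @_;
  my $alt = join "|", map { quotemeta }
            sort { length $b <=> length $a } keys %$map;
  return $text if $alt eq "";        # nothing to replace
  $text =~ s/\b($alt)\b/$map->{$1}/g;
  return $text;
}

# Demo: an in-memory "file" in place of substwords.
my $pairs_text = "foo baz\nbar quux\n";
open my $fh, "<", \$pairs_text or die "open: $!";
my $map = load_pairs($fh);
close $fh;

print substitute_words("The foo is under the bar in the barn.\n", $map);
```

The demo prints "The baz is under the quux in the barn." Since quotemeta escapes the keys, characters that are significant in a regex do not cause trouble here.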
Dennis Williamson