I need some Perl regular expression help. The following snippet of code:

use strict; 
use warnings; 
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L"; 
my $word = "plus"; 
my @results = ();
1 while $str =~ s/(.{2}\b$word\b.{2})/push(@results,"$1\n")/e;
print @results;

Produces the following output:

A plus B
D plus E
2 plus F
H plus I
4 plus J
5 plus K

What I want to see is this, where a character already matched can appear in a new match in a different context:

A plus B
D plus E
E plus F
H plus I
I plus J
J plus K

How do I change the regular expression to get this result? Thanks --- Dan

A: 

Here's one way to do it:

use strict; 
use warnings; 
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L"; 
my $word = "plus"; 
my @results = ();
my $i = 0;
while (substr($str, $i) =~ /(.{2}\b$word\b.{2})/) {
    push @results, "$1\n";
    $i += $-[0] + 1;
}
print @results;

It's not terribly Perl-ish, but it works and it doesn't use too many obscure regular expression tricks. However, you might have to look up the function of the special variable @- in perlvar.
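
For readers who haven't met @- before, here is a minimal sketch of what it and its companion @+ hold after a successful match. The names $text, $start and $end are only for illustration:

```perl
use strict;
use warnings;

my $text = "A plus B";
$text =~ /plus/ or die "no match";

# $-[0] is the offset where the whole match starts;
# $+[0] is the offset just past its end.
my ($start, $end) = ($-[0], $+[0]);

print "start=$start end=$end\n";                    # start=2 end=6
print substr($text, $start, $end - $start), "\n";   # plus
```

In the loop above, $-[0] is relative to the substring that begins at $i, which is why it is added to $i. The extra 1 restarts the search one character past the start of the match just found, so the trailing operand is still available as the leading operand of the next match.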

Greg Hewgill
+6  A: 

General advice: Don't use s/// when you want m//. Be specific in what you match.
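
To see why s/// got in the way in the question's code, here is a minimal sketch (the shortened string is only for illustration). With /e the replacement is the return value of push, which is the new element count, so each match is overwritten with a number before the next search begins:

```perl
use strict;
use warnings;

my $str = "D plus E plus F";
my @results;

# push returns the new number of elements, and /e substitutes
# that number into $str in place of the matched text.
$str =~ s/(.{2}\bplus\b.{2})/push(@results, $1)/e;

print "$str\n";       # 1 plus F
print "@results\n";   # D plus E
```

That is where "2 plus F" in the question's output comes from: by then "D plus E" had already been replaced with 2.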

The answer is pos:

#!/usr/bin/perl -l

use strict;
use warnings;

my $str = 'In this example, ' . 'A plus B equals C, ' .
          'D plus E plus F equals G ' .
          'and H plus I plus J plus K equals L';

my $word = "plus";

my @results;

while ( $str =~ /([A-Z] $word [A-Z])/g ) {
    push @results, $1;
    pos($str) -= 1;
}

print "'$_'" for @results;

Output:

C:\Temp> b
'A plus B'
'D plus E'
'E plus F'
'H plus I'
'I plus J'
'J plus K'
Sinan Ünür
Ah, essentially the same answer but you cleaned up the regex too.
Michael Carman
`pos` just feels cleaner than `substr`.
Sinan Ünür
+1 for `pos`. Didn't know about that one.
Greg Hewgill
+3  A: 

You can use m//g instead of s/// and assign to the pos function to rewind the match location to just before the second term:

use strict;
use warnings;

my $str  = 'In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L';
my $word = 'plus';
my @results;

while ($str =~ /(.{2}\b$word\b(.{2}))/g) {
    push @results, "$1\n";
    pos($str) -= length $2;    # back up over the trailing context
}
print @results;
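
A minimal sketch of the rewind on its own, using a made-up five-character string so the offsets are easy to follow ($after_first and $second are illustrative names):

```perl
use strict;
use warnings;

my $s = "aXbXc";

# A scalar-context /g match remembers where it stopped.
$s =~ /\wX\w/g or die "no first match";   # matches "aXb"
my $after_first = pos($s);                # 3, the offset just past "aXb"

# pos() is assignable: step back one character so that "b"
# can also begin the next match.
pos($s) -= 1;

$s =~ /(\wX\w)/g or die "no second match";
my $second = $1;

print "$after_first\n";   # 3
print "$second\n";        # bXc
```

Without the rewind, the second match would start searching at offset 3 and find nothing, because "b" would already have been consumed.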
Michael Carman
A: 

You don't have to use a regex. Just split the string on whitespace, loop over the items, check each one for "plus", and take the words before and after it.

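use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L"; 
my @s = split /\s+/, $str;
# Start at 1 and stop one short of the end so that
# $s[$i-1] and $s[$i+1] always exist.
for my $i (1 .. $#s - 1) {
    if ( $s[$i] eq "plus" ) {
        print "$s[$i-1] plus $s[$i+1]\n";
    }
}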
ghostdog74
+1  A: 

Given the "Full Disclosure" comment (but assuming .{0,35}, not .{35}), I'd do

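use strict;
use warnings;
use List::Util qw/max min/;

# Sample input from the question
my $str  = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my @results;

my $context = 35;
while ( $str =~ /\b$word\b/g ) {
    # $-[0] and $+[0] are the start and end offsets of the match
    my $pre = substr( $str, max(0, $-[0] - $context), min( $-[0], $context ) );
    my $post = substr( $str, $+[0], $context );
    my $match = substr( $str, $-[0], $+[0] - $-[0] );
    $pre =~ s/.*\n//s;     # keep only what follows the last newline
    $post =~ s/\n.*//s;    # keep only what precedes the first newline
    push @results, "$pre$match$post";
}
print "$_\n" for @results;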

You'd skip the substitutions if you really meant (?s:.{0,35}).

ysth
+2  A: 

Another option is to use a lookahead:

use strict; 
use warnings; 
my $str = "In this example, A plus B equals C, D plus E "
        . "plus F equals G and H plus I plus J plus K equals L"; 
my $word = "plus"; 
my $chars = 2;
my @results = ();

push @results, $1 
  while $str =~ /(?=((.{0,$chars}?\b$word\b).{0,$chars}))\2/g;

print "'$_'\n" for @results;

Within the lookahead, capturing group 1 matches the word along with a variable number of leading and trailing context characters, up to whatever maximum you've set. When the lookahead finishes, the backreference \2 matches "for real" whatever was captured by group 2, which is the same as group 1 except that it stops at the end of the word. That sets pos where you want it, without requiring you to calculate how many characters you actually matched after the word.
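
A stripped-down sketch of the same lookahead idea, assuming fixed single-letter operands instead of variable context (that assumption and the shortened string are only for illustration). Because the entire pattern sits inside the lookahead, every match is zero-length, and /g will not match an empty string twice at the same position, so overlapping hits are found with no pos arithmetic at all:

```perl
use strict;
use warnings;

my $str = "D plus E plus F";
my @hits;

# The whole pattern is a lookahead, so nothing is consumed;
# the text of interest is captured in group 1 instead.
push @hits, $1 while $str =~ /(?=([A-Z] plus [A-Z]))/g;

print "'$_'\n" for @hits;   # 'D plus E' then 'E plus F'
```

With variable-length context this shortcut is not enough, which is why the code above consumes \2: without it, the same word would match again at the next position with one fewer leading character.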

Alan Moore
Thanks for posting, I learned more about regex looking at this. I wonder, which is faster, this solution or Sinan's which uses pos()?
dlw
They're not really equivalent. Sinan's code, which is based on your original question, matches exactly 2 extra characters at either end, and bumps `pos` back exactly one position. Mine allows for a variable number of context characters (with 2 being the max in this case), which seems more realistic after reading your "Full Disclosure" comment. My solution can more usefully be compared to ysth's, and I would expect his to be faster because it lets the regex engine find the match for `\b$word\b` without putting a reluctant quantifier in its way.
Alan Moore