ansaurus

Question

How can a Perl regex re-use part of the previous match for the next match?

Answer 1

A:

Here's one way to do it:

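use strict;
use warnings;

my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my @results = ();
my $i = 0;    # offset in $str where the next search starts
while (substr($str, $i) =~ /(.{2}\b$word\b.{2})/) {
    push @results, "$1\n";
    # $-[0] is the match start relative to substr($str, $i); resume one
    # character past it so the next match can overlap this one.
    $i += $-[0] + 1;
}
print @results;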

It's not terribly Perl-ish, but it works and it doesn't use too many obscure regular expression tricks. However, you might have to look up the function of the special variable @- in perlvar.
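
As a rough illustration of what @- (and its companion @+) hold, here is a minimal sketch. The helper name match_span and the starting offset are made up for this example and are not part of the answer above; the point is only that the offsets are relative to whatever string was matched.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): return the start and end
# offsets of the first match of $re in $text, or an empty list.
sub match_span {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    # $-[0] is the offset where the whole match starts, $+[0] where it ends.
    return ($-[0], $+[0]);
}

my $str = 'A plus B equals C, D plus E';
my $i   = 10;    # arbitrary starting offset, chosen for illustration

# Offsets are relative to the string that was matched, here substr($str, $i),
# so $i has to be added back to get a position in $str.
my ($start, $end) = match_span(substr($str, $i), qr/\bplus\b/);
printf "relative start %d, absolute start %d\n", $start, $start + $i;
```

This is why the loop above adds $-[0] to $i rather than assigning it.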

Greg Hewgill 2009-08-16 02:29:48

Answer 2

+6 A:

General advice: Don't use s/// when you want m//. Be specific in what you match.

The answer is pos:

#!/usr/bin/perl -l

use strict;
use warnings;

my $str = 'In this example, ' . 'A plus B equals C, ' .
          'D plus E plus F equals G ' .
          'and H plus I plus J plus K equals L';

my $word = "plus";

my @results;

while ( $str =~ /([A-Z] $word [A-Z])/g ) {
    push @results, $1;
    pos($str) -= 1;
}

print "'$_'" for @results;

Output:

C:\Temp> b
'A plus B'
'D plus E'
'E plus F'
'H plus I'
'I plus J'
'J plus K'
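
To see what the pos adjustment buys you, the following sketch runs the same pattern with and without the rewind. The helper find_pairs and its $rewind flag are hypothetical, introduced only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): collect matches with or
# without moving pos back by one character after each match.
sub find_pairs {
    my ($str, $word, $rewind) = @_;
    my @found;
    while ( $str =~ /([A-Z] \Q$word\E [A-Z])/g ) {
        push @found, $1;
        # Step back onto the trailing letter so that it can serve as the
        # leading letter of the next match.
        pos($str) -= 1 if $rewind;
    }
    return @found;
}

my $str = 'A plus B equals C, D plus E plus F equals G';
print "without rewind: ", join(', ', find_pairs($str, 'plus', 0)), "\n";
print "with rewind:    ", join(', ', find_pairs($str, 'plus', 1)), "\n";
```

Without the rewind, 'E plus F' is skipped because the E has already been consumed by 'D plus E'.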

Sinan Ünür 2009-08-16 02:49:30

Ah, essentially the same answer but you cleaned up the regex too.

Michael Carman 2009-08-16 03:00:46

`pos` just feels cleaner than `substr`.

Sinan Ünür 2009-08-16 03:04:05

+1 for `pos`. Didn't know about that one.

Greg Hewgill 2009-08-16 04:11:57

Answer 3

+3 A:

You can use m//g instead of s/// and assign to the pos function to rewind the match location to just before the second term:

use strict;
use warnings;

my $str  = 'In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L';
my $word = 'plus';
my @results;

while ($str =~ /(.{2}\b$word\b(.{2}))/g) {
    push @results, "$1\n";
    pos $str -= length $2;
}
print @results;

Michael Carman 2009-08-16 02:56:37

Answer 4

A:

You don't have to use a regex. Just split the string into words, loop over the items, check each one for "plus", and take the words before and after it.

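use strict;
use warnings;

my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my @s = split /\s+/, $str;
# Start at index 1 and stop before the last index so that both
# neighbours of "plus" always exist.
for my $i (1 .. $#s - 1) {
    if ( $s[$i] eq "plus" ) {
        print "$s[$i-1] plus $s[$i+1]\n";
    }
}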

ghostdog74 2009-08-16 03:44:17

Answer 5

+1 A:

Given the "Full Disclosure" comment (but assuming .{0,35}, not .{35}), I'd do

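use strict;
use warnings;
use List::Util qw/max min/;

# Sample input and search word as used in the other answers.
my $str  = 'In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L';
my $word = 'plus';
my @results;
my $context = 35;
while ( $str =~ /\b$word\b/g ) {
    # $-[0] and $+[0] are the start and end offsets of the match.
    my $pre = substr( $str, max(0, $-[0] - $context), min( $-[0], $context ) );
    my $post = substr( $str, $+[0], $context );
    my $match = substr( $str, $-[0], $+[0] - $-[0] );
    # Trim the context at line boundaries, as . would not cross a newline.
    $pre =~ s/.*\n//s;
    $post =~ s/\n.*//s;
    push @results, "$pre$match$post";
}
print "$_\n" for @results;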

You'd skip the substitutions if you really meant (?s:.{0,35}).

ysth 2009-08-16 09:01:35

Answer 6

+2 A:

Another option is to use a lookahead:

use strict; 
use warnings; 
my $str = "In this example, A plus B equals C, D plus E "
        . "plus F equals G and H plus I plus J plus K equals L"; 
my $word = "plus"; 
my $chars = 2;
my @results = ();

push @results, $1 
  while $str =~ /(?=((.{0,$chars}?\b$word\b).{0,$chars}))\2/g;

print "'$_'\n" for @results;

Within the lookahead, capturing group 1 matches the word along with a variable number of leading and trailing context characters, up to whatever maximum you've set. When the lookahead finishes, the backreference \2 matches "for real" whatever was captured by group 2, which is the same as group 1 except that it stops at the end of the word. That sets pos where you want it, without requiring you to calculate how many characters you actually matched after the word.
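
One way to check where pos ends up is to record it after every iteration. This is only a sketch: the helper trace_lookahead is a hypothetical name, and \Q...\E has been added around the word as a precaution that the pattern above does not include.

```perl
use strict;
use warnings;

# Hypothetical helper: run the lookahead pattern and return one
# [captured context, pos after the match] pair per iteration.
sub trace_lookahead {
    my ($str, $word, $chars) = @_;
    my @trace;
    while ( $str =~ /(?=((.{0,$chars}?\b\Q$word\E\b).{0,$chars}))\2/g ) {
        # $1 includes the trailing context, but only group 2 was consumed,
        # so pos sits right after the word.
        push @trace, [ $1, pos($str) ];
    }
    return @trace;
}

for my $step ( trace_lookahead('D plus E plus F', 'plus', 2) ) {
    printf "captured '%s', pos is now %d\n", @$step;
}
```

For this input the first iteration captures 'D plus E' and leaves pos at 6, just after the first "plus", so the E is still available as leading context for 'E plus F'.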

Alan Moore 2009-08-16 18:26:25

Thanks for posting, I learned more about regex looking at this. I wonder, which is faster, this solution or Sinan's which uses pos()?

dlw 2009-08-18 02:56:19

They're not really equivalent. Sinan's code, which is based on your original question, matches exactly 2 extra characters at either end, and bumps `pos` back exactly one position. Mine allows for a variable number of context characters (with 2 being the max in this case), which seems more realistic after reading your "Full Disclosure" comment. My solution can more usefully be compared to ysth's, and I would expect his to be faster because it lets the regex engine find the match for `\b$word\b` without putting a reluctant quantifier in its way.
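
If you want to measure rather than guess, the core Benchmark module can compare the two approaches. The following is a sketch under stated assumptions: the sub names via_lookahead and via_substr are made up here, and via_substr is a simplified variant of ysth's code that omits the newline trimming. Results will depend on the input, the context size and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(max);

my $str = 'In this example, A plus B equals C, D plus E plus F equals G '
        . 'and H plus I plus J plus K equals L';
my ($word, $context) = ('plus', 2);

# Lookahead approach: context is captured inside the lookahead,
# and only the leading part plus the word is consumed.
sub via_lookahead {
    my @r;
    push @r, $1
        while $str =~ /(?=((.{0,$context}?\b$word\b).{0,$context}))\2/g;
    return @r;
}

# Simplified substr approach (assumption: no newline handling).
# The regex finds only the word; the context is cut out afterwards.
sub via_substr {
    my @r;
    while ( $str =~ /\b$word\b/g ) {
        my $start = max( 0, $-[0] - $context );
        push @r, substr( $str, $start, $+[0] + $context - $start );
    }
    return @r;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { lookahead => \&via_lookahead, substr => \&via_substr } );
```

With a context of 2 both subs return the same six strings for this input. With a larger context they can differ, because the lookahead version cannot reach back past the end of the previous word.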

Alan Moore 2009-08-18 13:00:24
