ansaurus

Question

How can I count overlapping substrings in Perl?

Answer 1

+7 A:

See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.

You can use positive lookahead as suggested by others, and write the function as:

sub countnmstr {
    my ($haystack, $needle) = @_;
    my ($first, $rest) = $needle =~ /^(.)(.*)$/;
    return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}

You can also use pos to adjust where the next search picks up from:

#!/usr/bin/perl

use strict; use warnings;

sub countnmstr {
    my ($haystack, $needle) = @_;
    my $adj = length($needle) - 1;
    die "Search string cannot be empty!" if $adj < 0;

    my $count = 0;
    while ( $haystack =~ /\Q$needle/g ) {
        pos $haystack -= $adj;
        $count += 1;
    }
    return $count;
}

print countnmstr("aaa","aa"), "\n";

Output:

C:\Temp> t
2

Sinan Ünür 2010-01-22 01:01:51

yes, a bit of dwimmery there; you'd naively expect something like /\b/g to match at the same position an infinite number of times, but there's a special rule that zero-width matches aren't allowed twice at the same starting position (even across separate scalar-context m//g's).

ysth 2010-01-22 21:36:47

you may be interested in http://perlmonks.org/?node_id=322751

ysth 2010-01-22 21:43:37

Answer 2

+2 A:

You could use a lookahead assertion in the regular expression:

sub countnmstr {
    my @matches = $_[0] =~ /(?=($_[1]))/g;

    return scalar @matches;
}

I suspect Sinan's suggestion will be quicker though.

martin clayton 2010-01-22 01:10:40

Answer 3

+2 A:

you can try this, no more regex than needed.

$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
    $ind = index($haystack,$needle);
    if ( $ind == -1 ) {last};
    $haystack = substr($haystack,$ind+1);
    $count++;
}
print "Total count: $count\n";

output

$ ./perl.pl
Total count: 4

ghostdog74 2010-01-22 01:13:00

There's no need for `substr` here; `index` takes an optional 3rd parameter telling it where to start the search.

cjm 2010-01-22 09:40:35

Answer 4

+10 A:

Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.

The Perl idiom for counting matches is then:

 my $count = () = $_[0] =~ /($pattern)/g;

The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.

Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:

 my $count = () = 'aaa' =~ /((?<=a)a)/g;

Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.

brian d foy 2010-01-22 01:21:48

If you don't know why it's called the "goatse operator" do *not* try to find out what it means. Some things you're better off not knowing.

Michael Carman 2010-01-22 02:40:45

"some things cannot be unseen"

Ether 2010-01-22 03:08:55

The fear comes from the name ;). I didn't post it as an answer, because I didn't think of the lookahead/lookbehind idea, which is the real answer. The idea of using a 'lookbehind' with the goatse operator makes a certain sort of horrible sense.

daotoad 2010-01-22 18:16:45

Answer 5

+5 A:

sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}

$count = countnmstr("aaa","aa");

print "$count\n";

A few points:

//g in list context matches as many times as possible.

\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.

Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.

This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.

$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().

Update: s///g is faster, though not as fast as using index:

sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( $string =~ s/(?=\Q$substr\E)//g );
}

ysth 2010-01-22 09:48:41

Answer 6

+2 A:

If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.

use strict;
use warnings;

sub countnmstr_regex {
    my ($haystack, $needle) = @_;
    return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}

sub countnmstr_index {
    my ($haystack, $needle) = @_;
    my $i = 0;
    my $tally = 0;
    while (1){
        $i = index($haystack, $needle, $i);
        last if $i == -1;
        $tally ++;
        $i ++;
    }
    return $tally;
}

use Benchmark qw(cmpthese);

my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';

cmpthese( -2, {
    countnmstr_regex => sub { countnmstr_regex($h, $n) },
    countnmstr_index => sub { countnmstr_index($h, $n) },
} );

__END__

# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
                     Rate countnmstr_regex countnmstr_index
countnmstr_regex  93701/s               --             -66%
countnmstr_index 271893/s             190%               --

# Result using a large haystack ($size = 100).
                   Rate countnmstr_regex countnmstr_index
countnmstr_regex  929/s               --             -81%
countnmstr_index 4960/s             434%               --

FM 2010-01-22 14:01:28

This benchmark has a problem in that it only looks at one very specific example where index is sure to win. Now try it with a non-literal pattern: index loses because it can't even work. Be very careful of special case benchmarks.

brian d foy 2010-01-22 19:04:55

@brian Sure, `index` only works with literal text -- I think we knew that already. The OP isn't clear whether he or she needs to search for literal text or a pattern. Since most of the answers focused on regex, it seems reasonable to illustrate an alternative approach that also has a speed advantage.

FM 2010-01-22 20:08:09

@brian: The OP never said anything about matching a regex; you said that when you edited the title. He simply used a regex in his original implementation.

cjm 2010-01-23 11:05:35

Fair enough and I've broadened the title. However, "never said anything" is a bit strong since he used one himself. :)

brian d foy 2010-01-23 18:03:56

ansaurus

tags:

views:

answers:

How can I count overlapping substrings in Perl?

related questions