tags:

views:

385

answers:

6

i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows

sub countnmstr
{
  $count =0;
  $count++ while $_[0] =~ /$_[1]/g;
  return $count;
}

$count = countnmstr("aaa","aa");

print "$count\n";

now this is what i would normally do. however, in the implementation above i want to count occurrence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.

can anyone suggest how to implement such a function??

+7  A: 

See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.

You can use positive lookahead as suggested by others, and write the function as:

sub countnmstr {
    my ($haystack, $needle) = @_;
    my ($first, $rest) = $needle =~ /^(.)(.*)$/;
    return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}

You can also use pos to adjust where the next search picks up from:

#!/usr/bin/perl

use strict; use warnings;

sub countnmstr {
    my ($haystack, $needle) = @_;
    my $adj = length($needle) - 1;
    die "Search string cannot be empty!" if $adj < 0;

    my $count = 0;
    while ( $haystack =~ /\Q$needle/g ) {
        pos $haystack -= $adj;
        $count += 1;
    }
    return $count;
}

print countnmstr("aaa","aa"), "\n";

Output:

C:\Temp> t
2
Sinan Ünür
yes, a bit of dwimmery there; you'd naively expect something like /\b/g to match at the same position an infinite number of times, but there's a special rule that zero-width matches aren't allowed twice at the same starting position (even across separate scalar-context m//g's).
ysth
you may be interested in http://perlmonks.org/?node_id=322751
ysth
+2  A: 

You could use a lookahead assertion in the regular expression:

sub countnmstr {
    my @matches = $_[0] =~ /(?=($_[1]))/g;

    return scalar @matches;
}

I suspect Sinan's suggestion will be quicker though.

martin clayton
+2  A: 

you can try this, no more regex than needed.

$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
    $ind = index($haystack,$needle);
    if ( $ind == -1 ) {last};
    $haystack = substr($haystack,$ind+1);
    $count++;
}
print "Total count: $count\n";

output

$ ./perl.pl
Total count: 4
ghostdog74
There's no need for `substr` here; `index` takes an optional 3rd parameter telling it where to start the search.
cjm
+10  A: 

Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.

The Perl idiom for counting matches is then:

 my $count = () = $_[0] =~ /($pattern)/g;

The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.

Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:

 my $count = () = 'aaa' =~ /((?<=a)a)/g;

Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.

brian d foy
If you don't know why it's called the "goatse operator" do *not* try to find out what it means. Some things you're better off not knowing.
Michael Carman
"some things cannot be unseen"
Ether
The fear comes from the name ;). I didn't post it as an answer, because I didn't think of the lookahead/lookbehind idea, which is the real answer. The idea of using a 'lookbehind' with the goatse operator makes a certain sort of horrible sense.
daotoad
+5  A: 
sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}

$count = countnmstr("aaa","aa");

print "$count\n";

A few points:

//g in list context matches as many times as possible.

\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.

Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.

This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.

$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().

Update: s///g is faster, though not as fast as using index:

sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( $string =~ s/(?=\Q$substr\E)//g );
}
ysth
+2  A: 

If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.

use strict;
use warnings;

sub countnmstr_regex {
    my ($haystack, $needle) = @_;
    return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}

sub countnmstr_index {
    my ($haystack, $needle) = @_;
    my $i = 0;
    my $tally = 0;
    while (1){
        $i = index($haystack, $needle, $i);
        last if $i == -1;
        $tally ++;
        $i ++;
    }
    return $tally;
}

use Benchmark qw(cmpthese);

my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';

cmpthese( -2, {
    countnmstr_regex => sub { countnmstr_regex($h, $n) },
    countnmstr_index => sub { countnmstr_index($h, $n) },
} );

__END__

# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
                     Rate countnmstr_regex countnmstr_index
countnmstr_regex  93701/s               --             -66%
countnmstr_index 271893/s             190%               --

# Result using a large haystack ($size = 100).
                   Rate countnmstr_regex countnmstr_index
countnmstr_regex  929/s               --             -81%
countnmstr_index 4960/s             434%               --
FM
This benchmark has a problem in that it only looks at one very specific example where index is sure to win. Now try it with a non-literal pattern: index loses because it can't even work. Be very careful of special case benchmarks.
brian d foy
@brian Sure, `index` only works with literal text -- I think we knew that already. The OP isn't clear whether he or she needs to search for literal text or a pattern. Since most of the answers focused on regex, it seems reasonable to illustrate an alternative approach that also has a speed advantage.
FM
@brian: The OP never said anything about matching a regex; you said that when you edited the title. He simply used a regex in his original implementation.
cjm
Fair enough and I've broadened the title. However, "never said anything" is a bit strong since he used one himself. :)
brian d foy