The "goatse operator" or the =()= idiom in Perl causes an expression to be evaluated in list context.

An example is:

my $str = "5 and 4 and a 3 and 2 1 BLAST OFF!!!";
my $count =()= $str =~ /\d/g; # 5 matches...
print "There are $count numbers in your countdown...\n\n";

As I interpret it, this is what happens:

  1. $str =~ /\d/g matches all the digits. The g switch and list context together produce a list of those matches. Call this the "List Producer"; in Perl it could be many things.
  2. The =()= assigns that list to an empty list, so all the actual matches are copied to an empty list.
  3. Assigning the list from step 2 to $count in scalar context gives the number of items in the list, here 5.
  4. The reference count of the empty list =()= goes to zero after the scalar assignment. The copy of the list elements is then deleted by Perl.

The questions on efficiency are these:

  1. Am I wrong in how I am parsing this?
  2. If you have some List Producer and all you are interested in is the count, is there a more efficient way to do this?

It works great with this trivial list, but what if the list were hundreds of thousands of matches? With this method you are producing a full copy of every match and then deleting it just to count them.

+11  A: 

Perl 5 is smart about copying lists. It only copies as many items as there are targets on the left-hand side. The idiom works because list assignment in scalar context yields the number of items on the right-hand side. So, n items will be created by the regex, but they won't be copied and discarded, just discarded. You can see the difference the extra copy makes in the naive case in the benchmark below.
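
A minimal sketch of that rule, reusing the string from the question: the count returned by a list assignment in scalar context depends on the right-hand side, not on how many targets are on the left. The variable names ($none, $two, $x, $y) are made up for illustration.

```perl
use strict;
use warnings;

my $str = "5 and 4 and a 3 and 2 1 BLAST OFF!!!";

# No targets on the left: nothing is stored, but the count is still
# the number of items produced on the right-hand side.
my $none = () = $str =~ /\d/g;

# Two targets on the left: only two matches are stored, yet the
# list assignment in scalar context still reports all five.
my ($x, $y);
my $two = (($x, $y) = $str =~ /\d/g);

print "none=$none two=$two x=$x y=$y\n";   # none=5 two=5 x=5 y=4
```

This only shows the value the assignment returns; it says nothing by itself about what is copied internally.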

As for efficiency, an iterative solution is often easier on memory and CPU usage, but this must be weighed against the succinctness of the goatse secret operator. Here are the results of benchmarking the various solutions:

naive: 10
iterative: 10
goatse: 10

for 0 items:
               Rate iterative    goatse     naive
iterative 4365983/s        --       -7%      -12%
goatse    4711803/s        8%        --       -5%
naive     4962920/s       14%        5%        --

for 1 items:
               Rate     naive    goatse iterative
naive      749594/s        --      -32%      -69%
goatse    1103081/s       47%        --      -55%
iterative 2457599/s      228%      123%        --

for 10 items:
              Rate     naive    goatse iterative
naive      85418/s        --      -33%      -82%
goatse    127999/s       50%        --      -74%
iterative 486652/s      470%      280%        --

for 100 items:
             Rate     naive    goatse iterative
naive      9309/s        --      -31%      -83%
goatse    13524/s       45%        --      -76%
iterative 55854/s      500%      313%        --

for 1000 items:
            Rate     naive    goatse iterative
naive     1018/s        --      -31%      -82%
goatse    1478/s       45%        --      -75%
iterative 5802/s      470%      293%        --

for 10000 items:
           Rate     naive    goatse iterative
naive     101/s        --      -31%      -82%
goatse    146/s       45%        --      -75%
iterative 575/s      470%      293%        --

Here is the code that generated it:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my $s = "a" x 10;

my %subs = (
    naive => sub {
        my @matches = $s =~ /a/g;
        return scalar @matches;
    },
    goatse => sub {
        my $count =()= $s =~ /a/g;
        return $count;
    },
    iterative => sub {
        my $count = 0;
        $count++ while $s =~ /a/g;
        return $count;
    },
);

for my $sub (keys %subs) {
    print "$sub: @{[$subs{$sub}()]}\n";
}

for my $n (0, 1, 10, 100, 1_000, 10_000) {
    $s = "a" x $n;
    print "\nfor $n items:\n";
    Benchmark::cmpthese -1, \%subs;
}
Chas. Owens
+1: Thanks. I really appreciate how you approached the logic of this and you captured what I was imagining to be the case: The more you have, the better iteration is. But if Perl is "smart" about copying the number that are needed on the left hand side, with `=()=` wouldn't that be all of them?
drewk
No, there are no targets on the lefthand side, so no data is copied (but the regex still has to generate the ones on the righthand side).
Chas. Owens
Agreed that something like `($i, $j, $k)=/a/g;` will copy 3 items even if there are 10 matches. But if you have `()=/a/g;`, is Perl smart enough to see that there are zero assignments and copy 0?
drewk
@drewk Yes, it is that smart.
Chas. Owens
Please excuse my blatantly plugging my own software, but for benchmarking, have a look at my `Dumbbench` tool or the `Benchmark::Dumb` compatibility wrapper that works almost the same as Benchmark.pm, just better. The docs attempt to explain why.
tsee
+11  A: 

In your particular example, a benchmark is useful:

my $str = "5 and 4 and a 3 and 2 1 BLAST OFF!!!";

use Benchmark 'cmpthese';

cmpthese -2 => {
    goatse => sub {
        my $count =()= $str =~ /\d/g;
        $count == 5 or die
    },
    while => sub {
        my $count; 
        $count++ while $str =~ /\d/g;
        $count == 5 or die
    },
};

which returns:

           Rate goatse  while
goatse 285288/s     --   -57%
while  661659/s   132%     --

The $str =~ /\d/g in list context is capturing every matched substring even though none of them is needed. The while example has the regex in scalar (boolean) context, so the regex engine just has to return true or false, and not the actual matches.
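
To make the scalar-context behaviour concrete, here is a small sketch (not part of the benchmark above): each evaluation of the match returns true once and resumes from where the previous one stopped, which pos() exposes. The @offsets array is only there for illustration.

```perl
use strict;
use warnings;

my $str = "5 and 4 and a 3 and 2 1 BLAST OFF!!!";

my $count = 0;
my @offsets;                      # illustrative only: where each match ended
while ($str =~ /\d/g) {           # scalar context: one match per evaluation
    $count++;
    push @offsets, pos($str);     # offset just past the current match
}

print "count=$count\n";           # count=5
print "match ends: @offsets\n";
```

No list of matched substrings is ever built here; the loop only sees a true or false result per iteration.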

And in general, if you have a list producing function and only care about the number of items, writing a short count function is faster:

sub make_list {map {$_**2} 0 .. 1000}

sub count {scalar @_}

use Benchmark 'cmpthese';

cmpthese -2 => {
    goatse => sub {my $count =()= make_list; $count == 1001 or die},
    count  => sub {my $count = count make_list; $count == 1001 or die},
};

which gives:

         Rate goatse  count
goatse 3889/s     --   -26%
count  5276/s    36%     --

My guess is that the sub is faster because subroutine calls are optimized to pass lists without copying them (the arguments are passed as aliases).
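
The aliasing itself is easy to see, even though this does not prove it is the reason for the speed difference. In this sketch, bump is a hypothetical helper that changes the caller's array through @_:

```perl
use strict;
use warnings;

# Elements of @_ are aliases to the caller's arguments, not copies,
# so modifying them changes the caller's data.
sub bump {
    $_++ for @_;
    return scalar @_;
}

my @n = (1, 2, 3);
my $how_many = bump(@n);

print "how_many=$how_many n=@n\n";   # how_many=3 n=2 3 4
```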

Eric Strom
+1: Benchmarks are always better than idle supposition. Thanks!
drewk
+4  A: 

If you need to run something in list context you have to run it in list context. In some cases, like the one you present, you might be able to work around it with another technique, but in most cases you won't.

Before you benchmark, however, the most important question is "Does it even matter?". Profile before you benchmark, and only worry about these sorts of things when you've run out of real problems to solve. :)

If you're looking for the ultimate in efficiency though, Perl's a bit too high level. :)

brian d foy
"Does it even matter" is a fair question. It does matter to *me* for two reasons: 1) I am curious! If I use 1 idiom vs another, I like to think in the back of my head why I do that. 2) If I use a shortcut, I like to understand the nuts and bolts of it. I can just as easily be in the habit of typing the idiom of `$count++ while $s =~/a/g` as I can `$count =()= $s =~ /a/g;`. If one tends to be a lot faster than the other, I will tend to favor it without saying the other is "wrong."
drewk
@brian: are you up to creating a tag wiki for this "operator"? http://stackoverflow.com/tags/goatse/info
Ether
I am not up to it.
brian d foy