tags:

views:

1003

answers:

4

I want to repeatedly search for values in array, that does not change during the script.

So far, I have been doing it this way: I put the values in a hash (so I had an array and a hash with essentialy the same thing) and I searched through exists.

I found out, that there is a ~~ operator in perl 5.10. The question is - how efficient it is for searching scalar in an array? (as I said, the array does not change at all)

I don't like having two different variables (the array and the hash) that both store the same thing; however, the hash is much faster for searching.

+4  A: 

If you don't care about order, why have the array at all? It seems to be redundant, and you can always get the values out of the hash when you need them.

Here is an article indicating that smart match does a linear search: Smart match with scalar and array. In fact, from perlsyn for 5.10 a linear search is indicated: perlsyn indicates linear search

Smart matching in detail

$a      $b        Type of Match Implied    Matching Code
======  =====     =====================    =============
...
Array   Num       array contains number    grep $_ == $b, @$a
Array   Any       array contains string    grep $_ eq $b, @$a

Updated: From the latest perlsyn (5.13.5): latest perlsyn

Any     Array     match against an array element[Footnote 3]
                                           grep $a ~~ $_, @$b

Footnote 3 - If a circular reference is found, we fall back to referential equality.

Conclusion: Based on all of the information available, yes it goes through the entire array (or at least until it finds the item sought, if there is any sort of optimization).

Michael Goldshteyn
cjm
Great answer, thanks. (1) however, I do need the array for iterating (very occasionaly, but I do); (2) there is this line: "The "matching code" doesn't represent the real matching code, of course: it's just there to explain the intended meaning. Unlike grep, the smart match operator will short-circuit whenever it can." - I was not really sure what it means.
Karel Bílek
Don't quote that part of perlsyn, first because there's a part right after it (which you didn't quote) which says specifically that the part you quoted *doesn't* mean what you say it means, and second because it's from the obsolete 5.10.0.
hobbs
@Karel is there something wrong with iterating over `keys %hash`?
hobbs
@hobbs, my question exactly (re: iterating keys %hash)
Michael Goldshteyn
@Karel Bílek, it means smart-match will stop as soon as it finds the first match (because all it cares about is true or false), while `grep` would continue to look for additional matches in the rest of the array (because it returns the number of matching elements).
cjm
btw, my conclusion did mention short circuiting.
Michael Goldshteyn
@hobbs: I am talking about my own special case here. In my case, it is, because some scalars can be repeated (it doesn't matter in the search, of course, but it matters somewhere else).
Karel Bílek
@cjm oh, that's it. Thanks.
Karel Bílek
@Michael my point is that that column is only meant to express the "moral equivalent" of what the smartmatch does. If tomorrow the perl optimizer started turning smartmatches against fixed arrays into hash lookups, the docs wouldn't even have to change because they never actually say there's a linear scan (although in fact there is :)
hobbs
@Karel: If performance is still an issue, you could use a hash with integer keys that store the number of times the given element occurs (something like `$hash{$key}++ for my $key (@array);`). If you still need the ordering of values sometimes, just use the original array there.
jnylen
I really hate that matching code column. It's very misleading because it tells you how to accomplish the same thing without the smart match, not what the smart match actually has to do.
brian d foy
+8  A: 

Fast for small numbers of potential matches, but not faster than the hash. Hashes are really the right tool for testing set membership. Since hash access is O(log n) and smartmatch on an array is still O(n) linear scan (albeit short-circuiting, unlike grep), with larger numbers of values in the allowed matches, smartmatch gets relatively worse.

Benchmark code (matching against 3 values):

#!perl
use 5.12.0;
use Benchmark qw(cmpthese);

my @hits = qw(one two three);
my @candidates = qw(one two three four five six); # 50% hit rate
my %hash;
@hash{@hits} = ();

sub count_hits_hash {
  my $count = 0;
  for (@_) {
    $count++ if exists $hash{$_};
  }
  $count;
}

sub count_hits_smartmatch {
  my $count = 0;
  for (@_) {
    $count++ when @hits;
  }
  $count;
}

say count_hits_hash(@candidates);
say count_hits_smartmatch(@candidates);

cmpthese(-5, {
    hash => sub { count_hits_hash((@candidates) x 1000) },
    smartmatch => sub { count_hits_smartmatch((@candidates) x 1000) },
  }
);

Benchmark results:

             Rate smartmatch       hash
smartmatch  404/s         --       -65%
hash       1144/s       183%         --
hobbs
This is with a small candidates array. I bet that if the array was 25+ items, there would be a much more significant difference.
Michael Goldshteyn
The size of candidates makes no real difference to the relative performance. Did you mean the size of hits?
hobbs
thanks for benchmark.... it's a pity I can't select two right answers.
Karel Bílek
I re-jiggered the benchmark to try it with different candidate array sizes and use different positions of the hits in the candidate array. It makes a huge difference.
brian d foy
+7  A: 

The "smart" in "smart match" isn't about how the searching. It's about doing the right thing at the right time based on context.

The question of whether it's faster to loop through an array or index into a hash is something you'd have to benchmark, but in general, it'd have to be a pretty small array to be quicker to skim through than indexing into a hash.

Andy Lester
+25  A: 

If you want to search for a single scalar in an array, you can use List::Util's first subroutine. It stops as soon as it knows the answer. I don't expect this to be faster than a hash lookup if you already have the hash, but when you consider creating the hash and having it in memory, it might be more convenient for you to just search the array you already have.

As for the smarts of the smart-match operator, if you want to see how smart it is, test it. :)

There are at least three cases you want to examine. The worst case is that every element you want to find is at the end. The best case is that every element you want to find is at the beginning. The likely case is that the elements you want to find average out to being in the middle.

Now, before I start this benchmark, I expect that if the smart match can short circuit (and it can; its documented in perlsyn), that the best case times will stay the same despite the array size, while the other ones get increasingly worse. If it can't short circuit and has to scan the entire array every time, there should be no difference in the times because every case involves the same amount of work.

Here's a benchmark:

#!perl
use 5.12.2;
use strict;
use warnings;

use Benchmark qw(cmpthese);

my @hits = qw(A B C);
my @base = qw(one two three four five six) x ( $ARGV[0] || 1 );

my @at_end       = ( @base, @hits );
my @at_beginning = ( @hits, @base );

my @in_middle = @base;
splice @in_middle, int( @in_middle / 2 ), 0, @hits;

my @random = @base;
foreach my $item ( @hits ) {
    my $index = int rand @random;
    splice @random, $index, 0, $item;
    }

sub count {
    my( $hits, $candidates ) = @_;

    my $count;
    foreach ( @$hits ) { when( $candidates ) { $count++ } }
    $count;
    }

cmpthese(-5, {
    hits_beginning => sub { my $count = count( \@hits, \@at_beginning ) },
    hits_end       => sub { my $count = count( \@hits, \@at_end ) },
    hits_middle    => sub { my $count = count( \@hits, \@in_middle ) },
    hits_random    => sub { my $count = count( \@hits, \@random ) },
    control        => sub { my $count = count( [], [] ) },
  }
);

Here's how the various parts did. Note that this is a logarithmic plot on both axes, so the slopes of the plunging lines aren't as close as they look:

Smart match speed

So, it looks like the smart match operator is a bit smart, but that doesn't really help you because you still might have to scan the entire array. You probably don't know ahead of time where you'll find your elements. I expect a hash will perform the same as the best case smart match, even if you have to give up some memory for it.


Okay, so the smart match being smart times two is great, but the real question is "Should I use it?". The alternative is a hash lookup, and it's been bugging me that I haven't considered that case.

As with any benchmark, I start off thinking about what the results might be before I actually test them. I expect that if I already have the hash, looking up a value is going to be lightning fast. That case isn't a problem. I'm more interested in the case where I don't have the hash yet. How quickly can I make the hash and lookup a key? I expect that to perform not so well, but is it still better than the worst case smart match?

Before you see the benchmark, though, remember that there's almost never enough information about which technique you should use just by looking at the numbers. The context of the problem selects the best technique, not the fastest, contextless micro-benchmark. Consider a couple of cases that would select different techniques:

  • You have one array you will search repeatedly
  • You always get a new array that you only need to search once
  • You get very large arrays but have limited memory

Now, keeping those in mind, I add to my previous program:

my %old_hash = map {$_,1} @in_middle; 

cmpthese(-5, {
    ...,
    new_hash       => sub { 
        my %h = map {$_,1} @in_middle; 
        my $count = 0;
        foreach ( @hits ) { $count++ if exists $h{$_} }
        $count;
        },
    old_hash       => sub { 
        my $count = 0;
        foreach ( @hits ) { $count++ if exists $old_hash{$_} }
        $count;
        },
    control_hash   => sub { 
        my $count = 0;
        foreach ( @hits ) { $count++ }
        $count;
        },
    }
);

Here's the plot. The colors are a bit difficult to distinguish. The lowest line there is the case where you have to create the hash any time you want to search it. That's pretty poor. The highest two (green) lines are the control for the hash (no hash actually there) and the existing hash lookup. This is a log/log plot; those two cases are faster than even the smart match control (which just calls a subroutine).

Smart match v. hash

There are a few other things to note. The line for the "random" case are a bit different. That's understandable because each benchmark (so, once per array scale run) randomly places the hit elements in the candidate array. Some runs put them a bit earlier and some a bit later, but since I only make the @random array once per run of the entire program, they move around a bit. That means that the bumps in the line aren't significant. If I tried all positions and averaged, I expect that "random" line to be the same as the "middle" line.

Now, looking at these results, I'd say that a smart-match is much faster in its worst case than the hash lookup is in its worst case. That makes sense. To create a hash, I have to visit every element of the array and also make the hash, which is a lot of copying. There's no copying with the smart match.

Here's a further case I won't examine though. When does the hash become better than the smart match? That is, when does the overhead of creating the hash spread out enough over repeated searches that the hash is the better choice?

brian d foy