Is it possible to use a Perl hash in a manner that has O(log(n)) lookup and insertion?

By default, I assume the lookup is O(n) since it's represented by an unsorted list.

I know I could create a data structure to satisfy this (i.e., a tree, etc.), but it would be nicer if it were built in and could be used like a normal hash (i.e., with %).
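
For what it's worth, Perl's tie mechanism lets a tree-backed structure sit behind the normal %hash syntax. A minimal sketch using the DB_File module's B-tree mode (inserts and lookups go through a B-tree, i.e. O(log n)); this assumes DB_File is installed:

#!/usr/bin/perl

use strict;
use warnings;

use DB_File; # B-tree (and hash) backends for tied hashes
use Fcntl;   # for the O_RDWR and O_CREAT flags

# An undef filename keeps the B-tree entirely in memory.
my %h;
tie %h, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie hash: $!";

# Use it like a normal hash; the B-tree keeps the keys ordered.
$h{banana} = 2;
$h{apple}  = 1;
print "$_ => $h{$_}\n" for keys %h;  # keys come back in sorted order

untie %h;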

+7  A: 

A Perl hash is a hash table, so it already has amortized O(1) insertion and lookup.

tzaman
Now ... what if you fill it up with MANY elements ... wouldn't it be `O(log n)` then?
Hamish Grubijan
No, insertion and lookup will still be amortized `O(1)`: if the fill ratio gets too high, the underlying array is resized.
tzaman
Perl automatically re-tunes the hash distribution as the structure grows. If the hash gets big enough to actually matter, you can switch to a disk-based hash table like BerkeleyDB.
friedo
Thanks for the answer; could you provide a link about this, though? I had always believed the underlying representation was an array.
Mike
@Mike perlguts has some information about how Perl hashes are implemented. http://perldoc.perl.org/perlguts.html#Working-with-HVs
Schwern
@Mike - hash tables generally _are_ implemented on top of arrays; the only difference is that items are indexed by their hash value instead of sequentially.
tzaman
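
As an illustration of that indexing, a toy sketch; toy_hash here is a made-up stand-in for Perl's real hash function, and collisions are ignored:

my @buckets;
my $nbuckets = 8;

# Made-up hash function: fold the key's characters into a bucket index.
sub toy_hash {
    my $h = 0;
    $h = ( $h * 33 + ord ) % $nbuckets for split //, $_[0];
    return $h;
}

$buckets[ toy_hash("foo") ] = "value for foo"; # O(1): compute index, store
print $buckets[ toy_hash("foo") ], "\n";       # O(1): compute index, fetch
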
+15  A: 

Associative arrays in Perl 5 are implemented with hash tables, which have amortized O(1) (i.e., constant time) insert and lookup. That is why we tend to call them hashes, not associative arrays.

It is difficult to find documentation stating that Perl 5 uses a hash table to implement associative arrays (beyond the fact that we call associative arrays "hashes"), but there is at least this in perldoc perlfaq4:

What happens if I add or remove keys from a hash while iterating over it?
   (contributed by brian d foy)

   The easy answer is "Don't do that!"

   If you iterate through the hash with each(), you can delete the key
   most recently returned without worrying about it.  If you delete or add
   other keys, the iterator may skip or double up on them since perl may
   rearrange the hash table.  See the entry for "each()" in perlfunc.
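
In other words, the documented-safe pattern is only the delete-what-each-just-returned case; a minimal sketch:

my %h = map { $_ => $_ ** 2 } 1 .. 10;

while ( my ( $k, $v ) = each %h ) {
    delete $h{$k} if $v > 25;  # safe: this is the key each() just returned
    # adding or deleting *other* keys here could make the iterator skip
    # or repeat entries, because perl may rearrange the hash table
}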

An even better quote from perldoc perldata:

   If you evaluate a hash in scalar context, it returns false if the hash
   is empty.  If there are any key/value pairs, it returns true; more
   precisely, the value returned is a string consisting of the number of
   used buckets and the number of allocated buckets, separated by a slash.
   This is pretty much useful only to find out whether Perl's internal
   hashing algorithm is performing poorly on your data set.  For example,
   you stick 10,000 things in a hash, but evaluating %HASH in scalar
   context reveals "1/16", which means only one out of sixteen buckets has
   been touched, and presumably contains all 10,000 of your items.  This
   isn't supposed to happen.  If a tied hash is evaluated in scalar
   context, a fatal error will result, since this bucket usage information
   is currently not available for tied hashes.
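
You can watch that bucket string grow yourself (note this is version-dependent: since Perl 5.26, %h in scalar context returns the key count instead, and Hash::Util::bucket_ratio(%h) reports the used/allocated string):

my %h;
for my $i (1 .. 10_000) {
    $h{$i} = 1;
    # On Perls before 5.26, %h in scalar context is "used/allocated" buckets.
    printf "%6d keys: %s\n", $i, scalar %h if $i =~ /\A10*\z/;
}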

Of course, O(1) is only theoretical performance. In the real world we do not have perfect hash functions, so hashes do slow down as they get larger, and there are degenerate cases that turn a hash into O(n), but Perl does its best to prevent this from happening. Here is a benchmark of Perl hash lookups with 10, 100, 1,000, 10,000, and 100,000 keys:

Perl version 5.012000
               Rate 10^5 keys 10^4 keys 10^3 keys 10^2 keys 10^1 keys
10^5 keys 5688029/s        --       -1%       -4%       -7%      -12%
10^4 keys 5748771/s        1%        --       -3%       -6%      -11%
10^3 keys 5899429/s        4%        3%        --       -4%       -9%
10^2 keys 6116692/s        8%        6%        4%        --       -6%
10^1 keys 6487133/s       14%       13%       10%        6%        --

Here is the benchmark code:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

print "Perl version $]\n";

my %subs;
for my $n (1 .. 5) {
    my $m = 10 ** $n;
    keys(my %h) = $m; # preallocate the hash so it doesn't have to keep growing
    my $k = "a";
    %h = ( map { $k++ => 1 } 1 .. $m );
    $subs{"10^$n keys"} = sub {
        # look up one key that exists ("a") and one that doesn't
        # ($k is one past the last key the map above generated)
        return @h{"a", $k};
    };
}

Benchmark::cmpthese -1, \%subs;
Chas. Owens
Thank you for your thorough response.
Mike
Since strings (hash keys) are sortable, perl could implement hashes to have O(log(n)) worst-case performance. I haven't checked whether it does, though.
tsee
@tsee Every version of Perl 5 up to at least 5.12.1 uses a hash. Since Perl 5.8.1, `perl` will change the hashing function if it detects the degenerate case (see http://perldoc.perl.org/perlsec.html#Algorithmic-Complexity-Attacks).
Chas. Owens
@Chas, I think you got me wrong there. Of course Perl uses hashing for hashes. But for collisions, it falls back to... something. That something could be an O(n) linked list with low overhead, which would be a good choice when optimizing for smallish hashes. Or it could be a tree-like structure that gives O(log(n)) in case of a collision. I knew Perl has randomized the hash seeds since 5.8.1; I had totally forgotten that was tweaked to apply only sometimes in 5.8.2.
tsee
@tsee Ah, yes, the buckets themselves could be implemented as a tree, but it looks like the hash entry `struct` only has a next pointer. This means it is likely that Perl is always using a linked list for collisions.
Chas. Owens
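
To make the linked-list strategy concrete, a toy pure-Perl open-hashing table (illustrative only; all names are made up) where each bucket holds a chain of [key, value] pairs:

my $NBUCKETS = 8;
my @buckets  = map { [] } 1 .. $NBUCKETS;  # each bucket is a collision chain

sub toy_hash {
    my $h = 0;
    $h = ( $h * 33 + ord ) % $NBUCKETS for split //, $_[0];
    return $h;
}

sub toy_insert {
    my ( $key, $value ) = @_;
    my $chain = $buckets[ toy_hash($key) ];
    for my $pair (@$chain) {               # walk the chain: O(chain length)
        return $pair->[1] = $value if $pair->[0] eq $key;
    }
    push @$chain, [ $key, $value ];        # collision: append to the chain
}

sub toy_lookup {
    my ($key) = @_;
    for my $pair ( @{ $buckets[ toy_hash($key) ] } ) {
        return $pair->[1] if $pair->[0] eq $key;
    }
    return undef;
}

toy_insert( $_, length $_ ) for qw(apple banana cherry);
print toy_lookup("banana"), "\n";  # 6
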
+1  A: 

Anybody who thinks that hash insert or lookup time is O(1) on modern hardware is extraordinarily naive. Measuring repeated gets of the same value is plain wrong. The following results will give you a much better picture of what's going on.

Perl version 5.010001
            Rate 10^6 keys 10^5 keys 10^1 keys 10^4 keys 10^3 keys 10^2 keys
10^6 keys 1.10/s        --      -36%      -64%      -67%      -68%      -69%
10^5 keys 1.73/s       57%        --      -43%      -49%      -50%      -52%
10^1 keys 3.06/s      177%       76%        --      -10%      -12%      -15%
10^4 keys 3.40/s      207%       96%       11%        --       -3%       -5%
10^3 keys 3.49/s      216%      101%       14%        3%        --       -3%
10^2 keys 3.58/s      224%      107%       17%        6%        3%        --

The result above was measured on a system with a 5 MB CPU cache. Note that performance drops significantly, from roughly 3.5M lookups per second down to 1M, once the whole hash no longer fits in the cache. Even so, it is still very fast, and in some cases you can beat even systems like an RDBMS if you know what you are doing. You can measure your own system using the following code:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

print "Perl version $]\n";

my %subs;
for my $n ( 1 .. 6 ) {
    my $m = 10**$n;
    keys( my %h ) = $m;    # preallocate the hash so it doesn't have to keep growing
    my $k = "a";
    %h = ( map { $k++ => 1 } 1 .. $m );
    my $l = 10**( 6 - $n );
    my $a;
    $subs{"10^$n keys"} = sub {
        for ( 1 .. $l ) {
            $a = $h{$_} for keys %h;
        }
    };
}

Benchmark::cmpthese -3, \%subs;

You also shouldn't forget that hash lookup time depends on key length, since the whole key must be hashed and then compared. Simply put, there is no real technology with O(1) access time; every known real technology offers at best O(log N) access time. Systems that claim O(1) access time do so only by limiting their maximal N, and they degrade performance for small N as a result. That is how things work in the real world, and it is the reason people design structures like the Judy array.
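
To check the key-length claim on your own machine, a quick sketch in the same Benchmark style (the key lengths chosen here are arbitrary):

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

# Same number of keys, different key lengths: the hash function has to
# scan every byte of the key, so longer keys cost more per lookup.
my %subs;
for my $len ( 10, 100, 1000 ) {
    my @keys = map { sprintf "%0${len}d", $_ } 1 .. 1000;
    my %h    = map { $_ => 1 } @keys;
    my $a;
    $subs{"$len-byte keys"} = sub { $a = $h{$_} for @keys };
}

Benchmark::cmpthese -1, \%subs;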

Hynek -Pichi- Vychodil