I am sorting a hash in Perl. I encountered an Out of memory error when running my Perl Script:

foreach my $key (sort keys %hash) {
   ....
}

How do I sort a hash that has tons of data?

A: 

The Perl FAQ has some examples of sorting a hash; see "How do I sort a hash?" and the article "A Fresh Look at Efficient Perl Sorting".
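For reference, the FAQ-style approach is just a `sort` over the key list, for instance sorting by value (the `%hash` contents here are made up for illustration):

```perl
use strict;
use warnings;

my %hash = (banana => 3, apple => 1, cherry => 2);

# Sort keys by their numeric values; for a huge hash this still
# materializes the full key list, which is what runs out of memory.
my @by_value = sort { $hash{$a} <=> $hash{$b} } keys %hash;

print "$_\n" for @by_value;
```

This works fine for modest hashes, but as the comments below note, it is exactly the pattern that blows up here.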

Space
I'm wondering if Perl is smart enough not to re-sort the keys if it finds the sort function couched in the for loop condition
syker
I think Perl is smart enough not to re-sort the keys :).
Space
The faq answer is the same thing that is giving him the out of memory problem.
brian d foy
+12  A: 

sort keys %hash is inefficient for a large %hash because, memory-wise, it's roughly equivalent to:

my @keys = keys %hash;
@keys = sort @keys;

It has to keep three copies of the keys in memory while sorting: one in the hash, one in the list of keys, and one in the sorted list being created. foreach's memory optimization for iterators does not apply.

Since the hash is so large, the best option is to get it entirely out of memory. Stick it in a BerkeleyDB file. And if you want to keep the keys in order, a hash isn't the best structure; a tree is. I'd suggest a Berkeley BTree file: trees efficiently keep your data sorted like an array while providing fast lookup like a hash.

Here's an example using BerkeleyDB. DB_File is simpler and better documented but does not take advantage of modern features of BerkeleyDB. YMMV.

use BerkeleyDB;

my $db  = tie my %hash, 'BerkeleyDB::Btree',
              -Filename => "your.db",
              -Compare  => sub { $_[1] cmp $_[0] },  # descending; omit for default ascending order
              -Flags    => DB_CREATE;

-Compare illustrates how to supply your own sorting function. The tied interface will be sluggish. Unless you need it to act like a hash, use the object interface.
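A minimal sketch of the object interface, assuming the same "your.db" file name and a few made-up keys: a cursor walks the BTree in key order, so you never hold the full key list in memory.

```perl
use strict;
use warnings;
use BerkeleyDB;

unlink "your.db";  # start fresh for this demo

my $db = BerkeleyDB::Btree->new(
    -Filename => "your.db",
    -Flags    => DB_CREATE,
) or die "cannot open your.db: $BerkeleyDB::Error";

# Insert a few demo keys (values don't matter here).
$db->db_put($_, 1) for qw(pear fig plum);

# Walk the BTree in sorted key order via a cursor.
my $cursor = $db->db_cursor;
my ($k, $v) = ('', '');
my @sorted;
while ($cursor->c_get($k, $v, DB_NEXT) == 0) {
    push @sorted, $k;
    print "$k\n";
}
```

Because the BTree keeps keys ordered on disk, iteration is already sorted and memory use stays flat regardless of hash size.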

Schwern
Schwern, for newer perls, I believe the number of sets of keys required in memory might actually be two (not three) *if* using in-place sort. But I may have misread the commits and am too lazy to dig them up again. Of course, this has no bearing on the validity of your answer at all.
tsee
@tsee What in-place sort are you referring to?
Schwern
`@ary = sort @ary`. Testing on a large array yields resident memory increase from 101 to 155 MB, whereas `my @ary2 = sort @ary` ends up at 293 MB. So yes, sorting does incur a memory overhead, but it's not the full size of the array. (Half in this artificial case.) Curiously, if one *wanted* to copy the array for sorting, this would actually use less memory than the obvious: `@ary2 = @ary; @ary2=sort@ary2`.
tsee
"@ary = sort @ary" is special cased in core. It *does* do in-place sort, never mind the syntax. Cf. for example core commit fe1bc4cf71e7b04d33e679798964a090d9fa7b46 from 2004. The copying, then sorting wasn't a suggestion as much as a curiosity I had just discovered myself when I realized @ary2 = sort @ary was *that* much worse memory-wise. Either way, I think your deck-chair analogy applies. :)
tsee
@tsee I sit corrected! Please feel free to edit my rough memory sketch above.
Schwern
A: 

If your keys are integers, floating-point numbers, or strings with a small maximum length, you can use Sort::Packed:

use Sort::Packed qw(sort_packed);

my $hash_size   = keys %hash;
my $max_key_len = 4;
# Preallocate one fixed-width, NUL-padded slot per key:
my $packed_keys = "\0" x ($max_key_len * $hash_size);
my $ix = 0;
while (my ($key, $value) = each %hash) {
  my $key_len = length $key;
  $key_len <= $max_key_len or die "key $key is too big";
  substr($packed_keys, $ix, $key_len, $key);
  $ix += $max_key_len;
}

# Sort the packed buffer in place, $max_key_len bytes per record:
sort_packed("C$max_key_len", $packed_keys);

$ix = 0;
while ($ix < length $packed_keys) {
  my $key = substr($packed_keys, $ix, $max_key_len);
  $key =~ s/\0+$//;  # strip the NUL padding
  print "$key\n";
  $ix += $max_key_len;
}

Admittedly, this code is quite ugly, but it will keep memory usage to the minimum.
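The fixed-width record layout above can also be illustrated with core `pack` and `sort` alone (no CPAN modules; the key list here is made up). This copies the records into a Perl list, so it loses sort_packed's in-place memory advantage, but it shows the record format:

```perl
use strict;
use warnings;

my $max  = 4;
my @keys = qw(fig plum date pear);  # stand-in for keys %hash

# Pack each key into a fixed-width, NUL-padded record:
my $buf = join '', map { pack "a$max", $_ } @keys;

# Split the buffer into $max-byte records and sort them as plain strings:
my @sorted = sort $buf =~ /(.{$max})/gs;

for my $rec (@sorted) {
    (my $key = $rec) =~ s/\0+$//;   # strip NUL padding
    print "$key\n";
}
```

Since NUL sorts before every printable byte, the padded records compare in the same order as the unpadded keys.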

salva