I am sorting a hash in Perl. I encountered an Out of memory error when running my Perl Script:
foreach $key (sort (keys(%hash))) {
....
}
How do I sort a hash that has tons of data?
I am sorting a hash in Perl. I encountered an Out of memory error when running my Perl Script:
foreach $key (sort (keys(%hash))) {
....
}
How do I sort a hash that has tons of data?
Perl FAQ has some examples to sort a hash. Look at How do I sort a hash? and here is A Fresh Look at Efficient Perl Sorting.
sort keys %hash
is inefficient for a large %hash
in that, memory wise, its roughly equivalent to:
my @keys = keys %hash;
@keys = sort @keys;
In that it has to keep three copies of the keys in memory while doing the sorting (one in the hash, one in the list of keys, one in the sorted list being created). foreach
's memory optimizations for iterators do not apply.
Since the hash is so large, the best option is to get it entirely out of memory. Stick it in a BerkeleyDB file. And if you want to keep the keys in order a hash isn't the best option, a tree is. I'd suggest using a Berkeley BTree file. Trees will efficiently keep your data sorted like an array while providing fast lookup like a hash.
Here's an example using BerkeleyDB. DB_File is simpler and better documented but does not take advantage of modern features of BerkeleyDB. YMMV.
use BerkeleyDB;
my $db = tie my %hash, 'BerkeleyDB::Btree',
-Filename => "your.db",
-Compare => sub { $_[1] cmp $_[0] },
-Flags => DB_CREATE;
-Compare
illustrates how to supply your own sorting function. The tied interface will be sluggish. Unless you need it to act like a hash, use the object interface.
If your keys are integers, numbers or strings of a small maximum size, you can use Sort::Packed:
use Sort::Packed qw(sort_packed);
my $hash_size = keys %hash;
my $max_key_len = 4;
my $packed_keys = '\0' x ($max_key_len * $hash_size);
my $ix = 0;
while (my ($key, $value) = each %hash) {
my $key_len = length $k;
$key_len <= $max_key_len or die "key $key is too big";
substr($packed_keys, $ix, $key_len, $key);
$ix += $max_key_len;
}
sort_packed("C$max_key_len", $packed_keys);
$ix = 0;
while ($ix < length $packed_keys) {
my $key = substr($packed_keys, $ix, $max_key_len);
$key =~ s/\0+$//;
print "$key\n";
$ix += $max_key_len;
}
Admittedly, this code is quite ugly, but it will keep memory usage to the minimum.