I have a big XML file, and parsing it consumes a lot of memory.
Since I believe most of that is due to the many user names in the file,
I changed the length of each user name from ~28 bytes to 10 bytes
and ran it again, but it still takes almost the same amount of memory.
So far the XML file is parsed with SAX, and during handling the result is stored in a nested hash structure, like this:
$this->{'date'}->{'school 1'}->{$class}->{$student}...

Why is memory usage still so high after I reduced the length of the student names? Is it possible that when the data is stored in a hash there is a lot of overhead, no matter how long the strings are?
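For context, a minimal sketch of the kind of handler described, using XML::Parser's stream interface with made-up element and attribute names (the question doesn't show the real schema or which SAX module is in use):

use XML::Parser;

my %result;
my ($school, $class, $student);

my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $elem, %attr) = @_;
        $school  = $attr{name} if $elem eq 'school';
        $class   = $attr{name} if $elem eq 'class';
        $student = $attr{name} if $elem eq 'student';
    },
    Char => sub {
        my ($expat, $text) = @_;
        # every distinct student name becomes a hash key at the innermost level
        $result{date}{$school}{$class}{$student} .= $text if defined $student;
    },
    End => sub {
        my ($expat, $elem) = @_;
        undef $student if $elem eq 'student';
    },
});

$parser->parsefile('big.xml');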

A: 

Yes - there is a LOT of overhead. If possible, don't store the data as a full tree, especially since you're using a SAX parser, which, unlike a DOM parser, doesn't force you to.

If you MUST store the entire tree, one possible workaround is storing arrays of arrays - e.g. you put all student names in an array (say, "mary123456" ends up in $students[11]) and then store the value that would have been at ...->{"mary123456"} at ->[11] instead, as sketched below.
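A rough illustration of that idea, with a hypothetical student_id helper (not part of the original answer) for handing out indices; $class and $record here stand in for whatever the real handler uses:

my (@students, %student_index);

sub student_id {
    my ($name) = @_;
    # hand out a small integer the first time a name is seen
    $student_index{$name} = push(@students, $name) - 1
        unless exists $student_index{$name};
    return $student_index{$name};
}

# instead of $this->{date}{'school 1'}{$class}{'mary123456'} = $record;
my $id = student_id('mary123456');                  # e.g. 11
$this->{date}{'school 1'}{$class}[$id] = $record;

# the name itself is recovered later with $students[$id]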

It WILL increase processing time because of the extra layer of indirection, but it might also gain some back from lower memory usage and therefore less swapping/thrashing.

Another option is using hashes tied to files (sketched below), though that would of course be REALLY slow because of the disk I/O bottleneck.
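A minimal sketch of a hash tied to an on-disk database with DB_File (file name and record layout are invented for illustration):

use Fcntl;
use DB_File;

# keys and values live on disk instead of in RAM, at the cost of disk I/O on every access
my %on_disk;
tie %on_disk, 'DB_File', 'students.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie students.db: $!";

$on_disk{'mary123456'} = 'flattened record';   # plain scalars only; nested
                                               # references need MLDBM or DBM::Deep
print $on_disk{'mary123456'}, "\n";

untie %on_disk;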

DVK
Hi DVK, thanks again. 1. SAX frees the data when it moves on to the next element, right? So far my SAX handler copies the data into my hash, and I believe all the memory is due to my hash, not to any data structure created by SAX; correct me if I am wrong. 2. I already remove some elements from my hash once they are no longer used, but it is not enough. I do not understand your array-of-arrays idea: it seems you suggest creating $students[11]='mary123456' and then $this...{'mary123456'}=11, but then you do not reduce memory, you introduce more, right? I have tried tied files; too slow to make my boss happy.
+3  A: 

Perl hashes use a technique known as bucket-chaining. All keys that have the same hash (see the macro PERL_HASH_INTERNAL in hv.h) go in the same “bucket,” a linear list.

According to the perldata documentation

If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals "1/16", which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items. This isn't supposed to happen. If a tied hash is evaluated in scalar context, a fatal error will result, since this bucket usage information is currently not available for tied hashes.

To see whether your dataset has a pathological distribution, you could inspect the various levels in scalar context, e.g.,

print scalar(%$this), "\n",
      scalar(%{ $this->{date} }), "\n",
      scalar(%{ $this->{date}{"school 1"} }), "\n",
      ...

For a somewhat dated overview, see How Hashes Really Work at perl.com.

The modest reduction in the lengths of students' names, keys that are four levels down, won't make a significant difference. In general, the perl implementation has a strong bias toward throwing memory at problems. It ain't your father's FORTRAN.

Greg Bacon
Very helpful. Thank you very much.
A: 

It may be useful to use the Devel::Size module, which can report how big various data structures are:

use Devel::Size qw(total_size);
print "Total Size is: ".total_size($hashref)."\n";
Mark Fowler
Thanks. I heard Devel::Size itself consumes a lot of memory, but it's good to know and I will try it.