I have a big XML file, and parsing it consumes a lot of memory.
Since I believe most of that is due to the many user names in the file,
I changed the length of each user name from ~28 bytes to 10 bytes
and ran it again, but it still takes almost the same amount of memory.
So far the XML file is parsed with SAX, and during handling the result is stored in a nested hash structure, like this:
$this->{'date'}->{'school 1'}->{$class}->{$student}...

Why is memory usage still so high after I reduced the length of the student names? Is it possible that when the data is stored in a hash there is a lot of overhead, no matter how long the strings are?
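For context, a minimal sketch of the kind of handler described, using XML::Parser's stream interface with made-up element and attribute names (the question doesn't show the real schema or which SAX module is in use):

use XML::Parser;

my %result;
my ($school, $class, $student);

my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $elem, %attr) = @_;
        $school  = $attr{name} if $elem eq 'school';
        $class   = $attr{name} if $elem eq 'class';
        $student = $attr{name} if $elem eq 'student';
    },
    Char => sub {
        my ($expat, $text) = @_;
        # every distinct student name becomes a hash key at the innermost level
        $result{date}{$school}{$class}{$student} .= $text if defined $student;
    },
    End => sub {
        my ($expat, $elem) = @_;
        undef $student if $elem eq 'student';
    },
});

$parser->parsefile('big.xml');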

A: 

Yes - there is a LOT of overhead. If possible, don't store the data as a full tree, especially since you're using a SAX parser, which, unlike a DOM parser, doesn't force you to.

If you MUST store the entire tree, one possible workaround is storing arrays of arrays - e.g. you put all student names in an array (say, "mary123456" ends up in $students[11]) and then store the value that would have been at ...->{"mary123456"} at ->[11] instead, as sketched below.
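A rough illustration of that idea, with a hypothetical student_id helper (not part of the original answer) for handing out indices; $class and $record here stand in for whatever the real handler uses:

my (@students, %student_index);

sub student_id {
    my ($name) = @_;
    # hand out a small integer the first time a name is seen
    $student_index{$name} = push(@students, $name) - 1
        unless exists $student_index{$name};
    return $student_index{$name};
}

# instead of $this->{date}{'school 1'}{$class}{'mary123456'} = $record;
my $id = student_id('mary123456');                  # e.g. 11
$this->{date}{'school 1'}{$class}[$id] = $record;

# the name itself is recovered later with $students[$id]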

It WILL increase processing time because of the extra layer of indirection, but it might also gain some back from lower memory usage and therefore less swapping/thrashing.

Another option is using hashes tied to files (sketched below), though that would of course be REALLY slow because of the disk I/O bottleneck.
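A minimal sketch of a hash tied to an on-disk database with DB_File (file name and record layout are invented for illustration):

use Fcntl;
use DB_File;

# keys and values live on disk instead of in RAM, at the cost of disk I/O on every access
my %on_disk;
tie %on_disk, 'DB_File', 'students.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie students.db: $!";

$on_disk{'mary123456'} = 'flattened record';   # plain scalars only; nested
                                               # references need MLDBM or DBM::Deep
print $on_disk{'mary123456'}, "\n";

untie %on_disk;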

DVK
Hi DVK, thanks again. 1. SAX frees the data when it moves on to the next element, right? So far my SAX handler copies the data into my hash, and I believe all the memory is due to my hash, not to any data structure created by SAX; correct me if I am wrong. 2. I already remove some elements from my hash once they are no longer used, but it is not enough. I do not understand your array-of-arrays idea: it seems you suggest creating $students[11]='mary123456' and then $this...{'mary123456'}=11, but then you do not reduce memory, you introduce more, right? I have tried tied files; too slow to make my boss happy.
+3  A: 

Perl hashes use a technique known as bucket-chaining. All keys that have the same hash (see the macro PERL_HASH_INTERNAL in hv.h) go in the same “bucket,” a linear list.

According to the perldata documentation

If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals "1/16", which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items. This isn't supposed to happen. If a tied hash is evaluated in scalar context, a fatal error will result, since this bucket usage information is currently not available for tied hashes.

To see whether your dataset has a pathological distribution, you could inspect the various levels in scalar context, e.g.,

print scalar(%$this), "\n",
      scalar(%{ $this->{date} }), "\n",
      scalar(%{ $this->{date}{"school 1"} }), "\n",
      ...

For a somewhat dated overview, see How Hashes Really Work at perl.com.

The modest reduction in the lengths of students' names, keys that are four levels down, won't make a significant difference. In general, the perl implementation has a strong bias toward throwing memory at problems. It ain't your father's FORTRAN.

Greg Bacon
Very helpful. Thank you very much.
A: 

It may be useful to use the Devel::Size module, which can report how big various data structures are:

use Devel::Size qw(total_size);
print "Total Size is: ".total_size($hashref)."\n";
Mark Fowler
Thanks. I heard Devel::Size itself consumes a lot of memory, but it's good to know and I will try it.