views: 1441

answers: 6

Does a hash shrink in Perl as you delete elements?

More specifically, I inherited a Perl program that would parse a huge file (1 GB) and load up a hash of hashes. It would do the same for another file and then compare different elements. The memory consumption was huge during this process, and even though I added code to delete hash elements as they were used, the memory consumption seemed to be unaffected.

The script was extremely slow and such a memory hog. I know it was not well designed, but any ideas about the hash memory usage?

+6  A: 

In general, Perl cannot return memory to the operating system. It may be able to reuse memory internally, though, which could reduce the amount of memory needed by a program.

See perlfaq3: How can I free an array or hash so my program shrinks?

If the memory used by the hashes is excessive (i.e. > physical memory) you could tie them to a file on disk. This would greatly reduce your memory usage but be warned that accessing a structure on disk is much slower than accessing one in memory. (So is disk thrashing.)
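
As an illustration of tying a hash to a file on disk (the answer doesn't name a module; the core DB_File module and the filename here are my choices, so treat this as a sketch rather than a prescription):

    use strict;
    use warnings;
    use Fcntl;      # O_RDWR, O_CREAT
    use DB_File;

    # Keys and values live in big_hash.db on disk instead of in RAM.
    tie my %big_hash, 'DB_File', 'big_hash.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie hash to big_hash.db: $!";

    $big_hash{some_key} = 'some value';   # written through to the file

    untie %big_hash;                      # flush and close the file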

Michael Carman
+10  A: 

You might want to check out something like DBM::Deep. It does that tie-ing stuff that Michael mentioned so you don't have to think about it. Everything is stored on disk instead of in memory. It's just short of needing a fancier database server.

Also, if you want to track down the performance bottleneck, check out Devel::NYTProf, the new hotness in Perl profiling that came out of the New York Times.
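
For illustration, a minimal sketch of DBM::Deep usage (the filename and keys are invented); nested structures such as a hash of hashes are stored on disk transparently:

    use strict;
    use warnings;
    use DBM::Deep;

    my $db = DBM::Deep->new('comparison.db');   # everything lives in this file

    # Nested hashes work transparently, so a hash of hashes maps over directly.
    $db->{record_1}{field_a} = 'value from file one';

    print $db->{record_1}{field_a}, "\n";

For the profiler, the usual invocation is perl -d:NYTProf yourscript.pl, followed by nytprofhtml to turn the collected data into an HTML report.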

brian d foy
+5  A: 

If your hash is truly gigantic, a better strategy is probably to use an on-disk hash and let the OS worry about getting things into and out of memory. I'm particularly fond of Berkeley DB for storing big hashes on disk, and the Perl BerkeleyDB module provides a full-featured interface, including a tied API.

DBM::Deep can also be used as a drop-in hash replacement, but relies on its own format. This can be a pain if your structure needs to be read by other (non-Perl) systems.
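
A minimal sketch of the tied API mentioned above (the filename is illustrative):

    use strict;
    use warnings;
    use BerkeleyDB;

    # Keys and values are stored in big_data.db on disk.
    tie my %on_disk, 'BerkeleyDB::Hash',
        -Filename => 'big_data.db',
        -Flags    => DB_CREATE
        or die "Cannot open big_data.db: $BerkeleyDB::Error";

    $on_disk{some_key} = 'some value';

    untie %on_disk;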

friedo
+4  A: 

If inputs in the second file are needed only once (as they are read), you could potentially cut the memory usage in half.

Depending on your algorithm, you might even be able to just hold both filehandles open and a small hash of not-yet-used values in memory. An example would be a merge or comparison of sorted data -- you only need to hold the current line from each file and compare them as you go, skipping ahead until the cmp result changes.
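
A minimal sketch of that sorted-merge comparison, assuming each file contains one pre-sorted key per line (the file names are invented):

    use strict;
    use warnings;

    open my $fh_a, '<', 'file_a.sorted' or die "file_a.sorted: $!";
    open my $fh_b, '<', 'file_b.sorted' or die "file_b.sorted: $!";

    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;

    # Only the current line from each file is ever held in memory.
    while ( defined $line_a and defined $line_b ) {
        chomp( my $key_a = $line_a );
        chomp( my $key_b = $line_b );
        my $order = $key_a cmp $key_b;

        if    ( $order < 0 ) { print "only in A: $key_a\n"; $line_a = <$fh_a>; }
        elsif ( $order > 0 ) { print "only in B: $key_b\n"; $line_b = <$fh_b>; }
        else                 { $line_a = <$fh_a>; $line_b = <$fh_b>; }  # in both
    }
    # (Anything left on either handle exists only in that file.)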

Another approach might be to make multiple passes, especially if you have one or more otherwise-idle cores in your machine. Open read pipes and have subprocesses feed you the data in manageable pre-organized chunks.
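
As a sketch of the read-pipe idea (the external sort command is just an illustrative choice of subprocess):

    use strict;
    use warnings;

    # The '-|' mode forks a child running sort(1) and gives us a read
    # handle on its output, one already-sorted record at a time.
    open my $sorted, '-|', 'sort', 'file_a'
        or die "Cannot start sort: $!";

    while ( my $line = <$sorted> ) {
        # process one pre-organized record at a time
    }

    close $sorted or die "sort exited with status $?";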

For more generic algorithms, you can only avoid paying for the memory size by trading it for the cost of disk speed.

In most cases, loading every data source into memory only wins on development time -- then you pay for it in footprint and/or speed when N gets large.

Eric Wilhelm
+4  A: 

Regarding the specific question: No, deleting hash keys does not reduce your program's memory consumption.

Regarding the more general case: the substantial majority of programs and languages will continue to hold on to memory which they have previously used but are not currently using. This is because requesting a memory allocation from the operating system is a relatively slow operation, so they keep the memory in case it's needed again later.

So, if you want to improve on this situation, you'll need to reduce the peak amount of memory required by your program, whether by revising your algorithms to not need access to as much data at once, by using on-disk storage (such as the aforementioned DBM::Deep), or by releasing space from unneeded variables back to perl (let them go out of scope or set them to undef) so that it can be reused.
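
As a small, self-contained illustration of the last point:

    use strict;
    use warnings;

    {
        my %scratch = map { $_ => 1 } 1 .. 1_000_000;
        # ... use %scratch ...
    }   # %scratch goes out of scope here; perl can reuse its storage

    my %big = map { $_ => 1 } 1 .. 1_000_000;
    # ... use %big ...
    undef %big;   # releases the contents for reuse by perl (the process
                  # footprint doesn't shrink, but later allocations recycle it)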

Dave Sherohman
+4  A: 

Workaround: fork a child process that allocates all that memory. Let it pass back some aggregate info when it's done doing its thing; when the forked process dies, its memory will go with it. A bit of a pain, but works for some cases. An example of a case where this helps would be if you are processing many files, each file one at a time, only a few of the files are big, and little intermediate state needs to be kept.
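
A minimal sketch of that pattern, where process_one_file is a hypothetical stand-in for the real memory-hungry work:

    use strict;
    use warnings;

    for my $file (@ARGV) {
        # The two-argument '-|' open forks; the parent gets a read handle
        # on the child's STDOUT.
        my $pid = open my $from_child, '-|';
        die "fork failed: $!" unless defined $pid;

        if ( $pid == 0 ) {        # child: do the big work, print a small summary
            print process_one_file($file), "\n";
            exit 0;               # child exits and its memory goes back to the OS
        }

        my $summary = <$from_child>;   # parent keeps only the aggregate
        close $from_child;
        print "$file: $summary";
    }

    # Hypothetical stand-in for the real per-file processing.
    sub process_one_file { my ($file) = @_; return "summary for $file" }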

SquareCog