views:

1162

answers:

4

I enjoy developing algorithms using the STL, however, I have this recurring problem where my data sets are too large for the heap.

I have been searching for drop-in replacements for STL containers and algorithms which are disk-backed, i.e. the data structures on stored on disk rather than the heap.

A friend recently pointed me towards stxxl. Before I get too involved with it... Are any other disk-backed STL replacements available that I should be considering?

NOTE: I'm not interested in persistence or embedded databases. Please don't mention boost::serialization, POST++, Relational Template Library, Berkeley DB, sqlite, etc. I am aware of these projects and use them when they are appropriate for my purposes.

UPDATE: Several people have mentioned memory-mapping files and using a custom allocator, good suggestions BTW, but I would point them to the discussion here where David Abraham suggests that custom iterators would be needed for disk-backed containers. Meaning the custom allocator approach isn't likely to work.

+1  A: 

I don't know much about the subject, but it might be possible to write an STL-like interface to a memory mapped file?

edit: This approach might be suitable if you're trying to get at a specific part of a huge file. If you're attempting to do something with the entire file, you'll likely generate a huge number of page faults as you read in uncached parts of the file.

luke
I have considered doing this... but I'd rather not write it myself if something useable has already been done.
ceretullis
I can appreciate not wanting to reinvent the wheel. I'm not familiar with a library that does this; hopefully someone can recommend one.
luke
I believe Boost.Interprocess can be used to write to a memory mapped file. Haven't actually tried it, though.
Nemanja Trifunovic
+1  A: 

If (as you write) you're not interested in persistence the simplest solution would be to increase your heap size and use your operating system's virtual memory facilities. The part of the heap that will not fit in your computer's physical memory will end up being paged on disk, giving you exactly what you want: normal STL access to data often stored on disk. The operating system will take care of caching the most used pages in the physical memory and evicting to disk those you don't use a lot. Your code will remain the same, and you can increase its performance simply by adding more physical memory.

To increase your heap size check your operating system's parameters, like ulimit(1) on Unix systems and System properties - Advanced - Performance - Advanced - Virtual Memory on Windows XP. If you've hit the 32-bit 4GB limit consider moving to a 64 bit architecture or compiling your program for 64 bits.

Diomidis Spinellis
I am a proficient systems administrator, I have considered the approach that you suggested. I have a amd64 machine running unix. I cannot afford adding more physical memory. My swap space is 2GB, my data set is 42GB, my hard drive is 1TB...
ceretullis
So, how about increasing your swap space?
Arkadiy
Sure, I've thought about rebuilding my server to accommodate this project. However this isn't dedicated hardware, rather a machine used for general research. Beyond that, eventually collaborators will need to run the code too, and I don't think requiring them to rebuild their machines will work.
ceretullis
+3  A: 

I've never had to do anything quite like this, but It might be possible to do what you want to do by writing a custom allocator that makes use of a memory mapped files to back your data.

See boost::interprocesses for docs on their easy to use implementation of memory mapped files and this Dr. Dobbs article for a detailed discussion on writing allocators.

christopher_f
ceretullis
+4  A: 

I have implemented some thing very similar. Implementing the iterators is the most challenging. I used boost::iterator_facade to implement the iterators. Using boost::iterator_facade you can easy adapt any cached on disk data structures to have a STL container interface.

witkamp
I use Boost, but iterator_facade is new to me. I'll have to take a look at that, thanks for sharing.
ceretullis