views:

187

answers:

5

I have an array of objects (say, images), which is too large to fit into memory (e.g. 40GB). But my code needs to be able to randomly access these objects at runtime.

What is the best way to do this?

From my code's point of view, it shouldn't matter, of course, if some of the data is on disk or temporarily stored in memory; it should have transparent access:

container.getObject(1242)->process();
container.getObject(479431)->process();

But how should I implement this container? Should it just send the requests to a database? If so, which one would be the best option? (If a database, then it should be free and not too much administration hassle, maybe Berkeley DB or sqlite?)

Should I just implement it myself, memoizing objects after acces sand purging the memory when it's full? Or are there good libraries (C++) for this out there?

The requirements for the container would be that it minimizes disk access (some elements might be accessed more frequently by my code, so they should be kept in memory) and allows fast access.

UPDATE: I turns out that STXXL does not work for my problem because the objects I store in the container have dynamic size, i.e. my code may update them (increasing or decreasing the size of some objects) at runtime. But STXXL cannot handle that:

STXXL containers assume that the data types they store are plain old data types (POD). http://algo2.iti.kit.edu/dementiev/stxxl/report/node8.html

Could you please comment on other solutions? What about using a database? And which one?

+1  A: 

You could look into memory mapped files, and then access one of those too.

Liz Albin
+7  A: 

Consider using the STXXL:

The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the compatibility to the STL supports ease of use and compatibility with existing applications, another design priority is high performance.

James McNellis
This looks nice, but I don't know if it's possible to tell it to cache or preload certain results? For example, once I access element n it might be likely that I will access some from n-100 to n+100 soon, so it should load and store them in memory. Maybe I need my own custom solution in such a case?
Frank
STXXL does not work for me, see the update in my question. Any other ideas?
Frank
+1  A: 

I would implement a basic cache. With this workingset size you will have the best results with a set-associative-cache with x byte cache-lines ( x == what best matches your access pattern ). Just implement in software what every modern processor already has in hardware. This should give you imho the best results. You could than optimize it further if you can optimize the accesspattern to be somehow linear.

DarthCoder
A: 

One solution is to use a structure similar to a B-Tree, indices and "pages" of arrays or vectors. The concept is that the index is used to determine which page to load into memory to access your variable.

If you make the page size smaller, you can store multiple pages in memory. A caching system based on frequency of use or other rule, will reduce the number of page loads.

Thomas Matthews
A: 

I've seen some very clever code that overloads operator[]() to perform disk access on the fly and load required data from disk/database transparently.

SF.
Sure, I was asking if it's worth writing that code myself (and if so, what's the best approach: database access, etc.?) or if that code is available.
Frank