views: 64
answers: 4

I have a situation where I need to work with a number (15-30) of large (several hundred MB) data structures. They won't all fit into memory at the same time. To make things worse, the algorithms operating on them work across all of those structures, i.e. not first one, then the other, etc. I need to make this as fast as possible.

So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data as it would be laid out in memory, and use memory-mapped files to access the data. I use mmap 'views' of, for example, 50 MB (so 50 MB of each file is loaded into memory at a time), which means that with 15 data sets my process uses 750 MB of memory for the data. That was OK initially (for testing); when I have more data I adjust the 50 MB down, at the cost of some speed.
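
Roughly what I'm doing now, as a simplified sketch (the names and sizes are made up for illustration, not my actual code):

    // Simplified sketch of the sliding-view approach (error handling and the
    // end-of-file clamp omitted). The offset handed to MapViewOfFile must be
    // a multiple of the allocation granularity, usually 64 KB.
    #include <windows.h>

    struct SlidingView {
        HANDLE file;
        HANDLE mapping;
        void*  view;      // currently mapped window, or NULL
        SIZE_T viewSize;  // e.g. 50 * 1024 * 1024
    };

    bool OpenSlidingView(SlidingView& sv, const wchar_t* path, SIZE_T viewSize)
    {
        sv.view = NULL;
        sv.viewSize = viewSize;
        sv.file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (sv.file == INVALID_HANDLE_VALUE) return false;
        sv.mapping = CreateFileMappingW(sv.file, NULL, PAGE_READONLY, 0, 0, NULL);
        return sv.mapping != NULL;
    }

    // Map the window that contains 'offset'; any previous window is unmapped.
    const char* MapAt(SlidingView& sv, unsigned __int64 offset)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        unsigned __int64 base = offset - (offset % si.dwAllocationGranularity);

        if (sv.view) UnmapViewOfFile(sv.view);
        sv.view = MapViewOfFile(sv.mapping, FILE_MAP_READ,
                                (DWORD)(base >> 32), (DWORD)(base & 0xFFFFFFFF),
                                sv.viewSize);
        return sv.view ? (const char*)sv.view + (offset - base) : NULL;
    }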

However, this heuristic is hard-coded for now (I know the size of the data set I will test with). 'In the wild', my software will need to determine the 'right' amount of memory to allocate on its own, to maximize performance. I could say 'I will target a memory use of 500 MB' and then divide 500 by the number of data structures to arrive at a mmap view size. But I have found that when I set this 'target memory usage' too high, the disk thrashing caused by the virtual memory manager will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.

So my questions, all somewhat different approaches to the problem:

  • What is the 'best' target size for a single process? Should I just try to max out the 2 GB that I have (assuming 32-bit Windows XP and up, non-/3GB for now), or try to keep my process size smaller so that my software won't hog the machine? When I have two copies of Visual Studio, Outlook and Firefox open on my machine, those easily use half a GB of virtual memory by themselves - if I let my software use 2 GB of virtual memory, the swapping will severely slow down the machine. But then how do I determine the 'best' process size?

  • What can I do to keep the performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means that it zips over hundreds of megabytes of data very quickly, causing the whole set of memory-mapped files (several gigabytes) to be paged in and swapped out again over and over (think Monte Carlo style simulation).

  • Is there any chance that dropping memory-mapped files and just using fseek/fread is going to be faster or less intrusive?

  • Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.

Thanks.

+1  A: 

It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.

Start conservatively low. If a run comes in below your "too darn slow" threshold, bump the view size up a little for the next file, and keep doing this iteratively. When you go above the threshold, slowly back the size off again, also iteratively.
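
Something like this, as a minimal sketch of that feedback loop (the threshold and step sizes are placeholders you would have to tune, not recommendations):

    // Minimal sketch of the feedback loop described above; the threshold and
    // step sizes are placeholders, not tuned values.
    #include <windows.h>

    SIZE_T AdjustViewSize(SIZE_T currentSize, double secondsForLastFile)
    {
        const double tooDarnSlow = 5.0;                // seconds per file
        const SIZE_T step        = 8  * 1024 * 1024;   // adjust by 8 MB
        const SIZE_T minSize     = 16 * 1024 * 1024;   // never below 16 MB
        const SIZE_T maxSize     = 256 * 1024 * 1024;  // never above 256 MB

        if (secondsForLastFile < tooDarnSlow && currentSize + step <= maxSize)
            return currentSize + step;   // fast enough: grow the view a bit
        if (secondsForLastFile >= tooDarnSlow && currentSize >= minSize + step)
            return currentSize - step;   // too slow: back the view off
        return currentSize;
    }

    // Time each pass (QueryPerformanceCounter or GetTickCount will do) and
    // feed the elapsed seconds back in before processing the next file.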

San Jacinto
+1  A: 

I think this is a good case for trying Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx

It will allow you to use more than 4 GB of memory by providing a sliding window into it. The drawback is that not all versions of Windows have it.
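
A rough sketch of what the API looks like (the sizes are placeholders, the process needs the "Lock pages in memory" privilege, and error handling is mostly omitted):

    // Rough outline of the AWE mechanism.
    #include <windows.h>
    #include <vector>

    int main()
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const SIZE_T pageSize   = si.dwPageSize;
        const SIZE_T windowSize = 64 * 1024 * 1024;               // 64 MB window
        ULONG_PTR pageCount     = (256 * 1024 * 1024) / pageSize; // 256 MB of RAM

        // 1. Allocate physical pages; these are never paged out.
        std::vector<ULONG_PTR> pfns(pageCount);
        if (!AllocateUserPhysicalPages(GetCurrentProcess(), &pageCount, &pfns[0]))
            return 1;

        // 2. Reserve a virtual address range to serve as the sliding window.
        void* window = VirtualAlloc(NULL, windowSize,
                                    MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);

        // 3. Map a slice of the physical pages into the window; remap a
        //    different slice whenever the algorithm moves on.
        ULONG_PTR windowPages = windowSize / pageSize;
        MapUserPhysicalPages(window, windowPages, &pfns[0]);

        // ... work on 'window', remap, repeat ...

        MapUserPhysicalPages(window, windowPages, NULL);   // unmap
        FreeUserPhysicalPages(GetCurrentProcess(), &pageCount, &pfns[0]);
        VirtualFree(window, 0, MEM_RELEASE);
        return 0;
    }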

ruslik
This would require my users to get machines with 6 or 8 or 10 GB of RAM, which I'd love to put in the requirements (like the 64-bit machine suggested above), but which is infeasible for my target market. I need to manage the memory issues in software; I can't throw hardware at it.
Roel
A: 

It will probably be best to fix the size of the memory-mapped views to some percentage of total system memory, with a set minimum.
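
For example (the 25% budget and the 16 MB floor are placeholder numbers, nothing more):

    // One way to turn "a percentage of system memory, with a minimum" into a
    // per-data-set view size.
    #include <windows.h>

    SIZE_T ChooseViewSize(int datasetCount)
    {
        MEMORYSTATUSEX ms = { sizeof(ms) };
        GlobalMemoryStatusEx(&ms);

        // Budget a fraction of physical RAM across all data sets...
        unsigned __int64 budget = ms.ullTotalPhys / 4;
        unsigned __int64 perSet = budget / datasetCount;

        // ...but never drop below a fixed minimum per data set.
        const unsigned __int64 minView = 16 * 1024 * 1024;
        if (perSet < minView) perSet = minView;
        return (SIZE_T)perSet;
    }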

Remember that the operating system will effectively load a whole memory page when you access even a single byte. This may well happen in the background, but it will only be fast if sequential data accesses tend to be close together.

You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look at preloading strategies that access your data speculatively, before it is actually required. These are the same considerations you face when optimizing for memory-cache efficiency.
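
As an illustration of the speculative-preloading idea (not from the original answer, just one crude way to do it): touch one byte per page of the region you are about to work on, so the OS starts paging it in before the computation needs it.

    // Touch one byte per page of a mapped region to trigger early page-ins.
    #include <windows.h>

    void TouchPages(const volatile char* base, SIZE_T bytes)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        volatile char sink = 0;
        for (SIZE_T off = 0; off < bytes; off += si.dwPageSize)
            sink ^= base[off];   // the read forces the page to be faulted in
        (void)sink;
    }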

If sequential data accesses are scattered widely across your file, you may be better off using fseek and fread to access the data, since that gives you finer-grained control over which data is brought into memory and when.

Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive, so try a whole bunch of different things and see which works best on the platform this will need to operate on.

doron
+1  A: 

I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.

However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (i.e. do one large read rather than a lot of small reads).
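
A bare-bones sketch of that chunked read (the 64-bit seek is an MSVC-ism; adjust to taste):

    // Read one large chunk at a given offset. _fseeki64 is the MSVC way to
    // seek past 2 GB; for files under 2 GB, plain fseek works just as well.
    #include <cstdio>
    #include <vector>

    std::vector<char> ReadChunk(FILE* f, __int64 offset, size_t bytes)
    {
        std::vector<char> buf(bytes);
        _fseeki64(f, offset, SEEK_SET);
        size_t got = fread(&buf[0], 1, bytes, f);   // one large read, not many small ones
        buf.resize(got);
        return buf;
    }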

The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say, 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.

Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.

Anon
After reading your question again, I'd suggest thinking about how you could interleave the original files, or break apart the computation using a map-reduce type design (nb: map-reduce does *not* require multiple machines).
Anon