I need to read from a dataset which is very large and highly interlinked; the data is fairly localized, and reads are fairly expensive. Specifically:

  1. The data sets are 2 gigs to 30 gigs in size, so I have to map sections of the file into memory to read them. This is very expensive compared to the rest of the work I do in the algorithm: from profiling I've found roughly 60% of the time is spent reading the memory, so this is the right place to start optimizing.
  2. When operating on a piece of this dataset, I have to follow links inside of it (imagine it as something similar to a linked list), and while those reads aren't guaranteed to be anywhere near sequential, they are fairly localized. This means:
  3. Let's say, for example, we operate on 2 megs of memory at a time. If you read 2 megs of data into memory, roughly 40% of the reads I will subsequently have to do will be in that same 2 megs of memory. Roughly 20% of the reads will be purely random access in the rest of the data, and the other 40% will very likely link back into the 2 meg segment which pointed to this one.

From knowledge of the problem and from profiling, I believe that introducing a cache to the program will help greatly. What I want to do is create a cache which holds N chunks of X megs of memory (N and X configurable so I can tune it) which I can check first, before having to map another section of memory. Additionally, the longer something has been in the cache, the less likely it is that we will request that memory in the short term, and so the oldest data will need to be expired.

After all that, my question is very simple: What data structure would be best to implement a cache of this nature?

I need to have very fast lookups to see if a given address is in the cache. With every "miss" of the cache, I'll want to expire the oldest member of it, and add a new member. However, I plan to try to tune it (by changing the amount that's cached) such that 70% or more of reads are hits.

My current thinking is that an AVL tree (O(log n) for search/insert/delete) would be the safest choice (no degenerate cases). My other option is a sparse hash table, so that lookups would be O(1) in the best case. In theory this could degenerate into O(n), but in practice I could keep collisions low. The concern there would be how long it takes to find and remove the oldest entry in the hash table.
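
To make that concrete, here is a rough sketch of the interface I have in mind (all names are placeholders; the backing structure behind it is exactly what I'm asking about):

    #include <cstddef>

    // Rough interface only: names are placeholders, and the data structure
    // backing lookup()/eviction is the open question.
    class ChunkCache {
    public:
        ChunkCache(std::size_t chunkCount, std::size_t chunkSize);   // N and X, tunable

        // Fast path: return the cached chunk covering 'offset', or NULL on a miss.
        const char* lookup(unsigned long long offset) const;

        // Miss path: map the section containing 'offset' into memory,
        // expiring the oldest resident chunk if the cache is already full.
        const char* mapAndInsert(unsigned long long offset);

    private:
        std::size_t chunkCount_;   // N chunks...
        std::size_t chunkSize_;    // ...of X megs each
        // backing store: AVL tree? hash table? (see below)
    };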

Does anyone have any thoughts or suggestions on what data structure would be best here, and why?

+2  A: 

Seems like you are looking for an LRU (Least Recently Used) cache: http://stackoverflow.com/questions/2504178/lru-cache-design

Moron
That's what I thought too. Note (even though it was not mentioned in the linked question) that Boost.MultiIndex can really help in implementing this; it's cited in the docs if I remember correctly.
Matthieu M.
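For reference, a minimal sketch of the Boost.MultiIndex approach Matthieu mentions, keyed by file offset (the struct and type names here are illustrative, not from the linked question):

    #include <vector>
    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/sequenced_index.hpp>
    #include <boost/multi_index/hashed_index.hpp>
    #include <boost/multi_index/member.hpp>

    namespace bmi = boost::multi_index;

    struct Chunk {
        unsigned long long offset;   // key: position of the chunk in the file
        std::vector<char>  bytes;    // the cached data
    };

    // One container, two views: index 0 keeps use order (back = oldest),
    // index 1 is a hash table keyed by offset for the "is it cached?" check.
    typedef bmi::multi_index_container<
        Chunk,
        bmi::indexed_by<
            bmi::sequenced<>,
            bmi::hashed_unique<
                bmi::member<Chunk, unsigned long long, &Chunk::offset>
            >
        >
    > ChunkCache;

    // On a hit, move the entry to the front so the back stays the oldest:
    //   cache.relocate(cache.begin(), cache.project<0>(hashIt));
    // On a miss with a full cache, cache.pop_back() expires the oldest chunk.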
+1  A: 

If 60% of your algorithm is I/O, I suggest that the actual cache design doesn't really matter that much; any sort of cache could be a substantial speed-up.

However, the design depends a lot on what data you're using to access your chunks: string, int, etc. If you had an int, you could do a hash map into a linked list: erase the back on a cache miss, and erase then push on top on a cache hit.

Hash maps are provided under varying names (most commonly, unordered_map) in many implementations: Boost has one, there's one in TR1, etc. A big advantage of a hash map is less performance loss as the number of entries grows, and more flexibility about key values.
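
For example, a hand-rolled sketch of that scheme, assuming the cache is keyed by file offset (all names are illustrative): a list keeps chunks in most-recently-used order, and a hash map from offset to list position gives the fast membership check.

    #include <cstddef>
    #include <list>
    #include <utility>
    #include <vector>
    #include <boost/unordered_map.hpp>

    // Hand-rolled LRU cache, keyed by file offset. A list holds chunks in
    // most-recently-used order; a hash map points from offset into the list.
    class LruChunkCache {
    public:
        explicit LruChunkCache(std::size_t capacity) : capacity_(capacity) {}

        // Returns the cached chunk, or NULL on a miss.
        std::vector<char>* find(unsigned long long offset) {
            IndexMap::iterator it = index_.find(offset);
            if (it == index_.end())
                return NULL;                               // miss
            // Hit: move the entry to the front (most recently used).
            chunks_.splice(chunks_.begin(), chunks_, it->second);
            return &chunks_.front().second;
        }

        // Call only after a miss: inserts a chunk, expiring the oldest if full.
        void insert(unsigned long long offset, const std::vector<char>& data) {
            if (chunks_.size() >= capacity_) {
                index_.erase(chunks_.back().first);        // drop the oldest entry
                chunks_.pop_back();
            }
            chunks_.push_front(std::make_pair(offset, data));
            index_[offset] = chunks_.begin();
        }

    private:
        typedef std::list<std::pair<unsigned long long, std::vector<char> > > ChunkList;
        typedef boost::unordered_map<unsigned long long, ChunkList::iterator> IndexMap;

        std::size_t capacity_;
        ChunkList   chunks_;   // front = most recently used, back = oldest
        IndexMap    index_;    // offset -> position in chunks_
    };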

DeadMG
Unfortunately it's not simple data we are talking about. Each entry I'm looking at is an arbitrarily sized structure. Each structure is self-describing and can contain different types of data including ints, strings, etc. That's why I basically have to implement a cache at the memory level, not at the data level, if that makes sense. Thanks for the tip about the hash map though; that might save me some time implementing a custom data structure.
LCC
Blimey. That sounds like quite a job. Good luck.
DeadMG
A: 

Put the cache into two sorted trees (AVL or any other reasonably balanced tree implementation is fine; you're better off using one from a library than creating your own).

One tree should sort by position in the file. This lets you do log(n) lookups to see whether the block you want is in your cache.

The other tree should sort by time used (which can be represented by a number that increments by one on each use). When you use a cached block, you remove it, update the time, and insert it again. This will take log(n) also. When you miss, remove the smallest element of the tree, and add the new block as the largest. (Don't forget to also remove/add that block to the by-position-in-file tree.)
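
A minimal sketch of that two-tree scheme, using std::map (a red-black tree, which is "reasonably balanced") for both trees; the names and types here are illustrative:

    #include <cstddef>
    #include <map>
    #include <vector>

    // Two-tree cache: one std::map sorted by position in the file, the other
    // sorted by a use counter. Both are kept in sync on every operation.
    class TwoTreeCache {
    public:
        explicit TwoTreeCache(std::size_t capacity) : capacity_(capacity), clock_(0) {}

        // Hit path: log(n) lookup by file position, then re-stamp the block
        // as most recently used (remove the old stamp, insert the new one).
        std::vector<char>* find(unsigned long long offset) {
            std::map<unsigned long long, Block>::iterator p = byPosition_.find(offset);
            if (p == byPosition_.end()) return 0;     // miss
            byAge_.erase(p->second.stamp);            // drop the old timestamp
            p->second.stamp = ++clock_;               // mark as just used
            byAge_[p->second.stamp] = offset;         // reinsert under the new one
            return &p->second.data;
        }

        // Miss path (call only after a miss): evict the smallest stamp (the
        // oldest block), then insert the new block with the largest stamp.
        void insert(unsigned long long offset, const std::vector<char>& data) {
            if (byPosition_.size() >= capacity_) {
                byPosition_.erase(byAge_.begin()->second);   // keep the trees in sync
                byAge_.erase(byAge_.begin());
            }
            Block& b = byPosition_[offset];
            b.stamp = ++clock_;
            b.data  = data;
            byAge_[b.stamp] = offset;
        }

    private:
        struct Block { unsigned long long stamp; std::vector<char> data; };

        std::size_t capacity_;
        unsigned long long clock_;                        // increments on every use
        std::map<unsigned long long, Block> byPosition_;  // sorted by position in file
        std::map<unsigned long long, unsigned long long> byAge_;  // stamp -> offset
    };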

If your cache doesn't have very many items in it, you'll still be better off just storing everything in a sorted array (using insertion sort to add new elements); moving 16 items down one spot in an array is incredibly fast. A sketch of that variant follows.
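
This sketches only the by-position side, keeping (offset, slot) pairs sorted by offset in a plain vector; the names are illustrative and expiry by age is left out:

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Entries are (file offset, chunk slot) pairs kept sorted by offset.
    typedef std::pair<unsigned long long, int> Entry;
    typedef std::vector<Entry> SmallCache;

    // Returns the slot holding 'offset', or -1 on a miss. With ~16 entries a
    // linear scan would be just as fast as the binary search used here.
    int findSlot(const SmallCache& cache, unsigned long long offset) {
        SmallCache::const_iterator it =
            std::lower_bound(cache.begin(), cache.end(), Entry(offset, 0));
        return (it != cache.end() && it->first == offset) ? it->second : -1;
    }

    // vector::insert shifts the tail elements down one spot, which is the
    // "insertion sort" step: cheap when the cache is this small.
    void insertSlot(SmallCache& cache, unsigned long long offset, int slot) {
        SmallCache::iterator it =
            std::lower_bound(cache.begin(), cache.end(), Entry(offset, 0));
        cache.insert(it, Entry(offset, slot));
    }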

Rex Kerr