I have a service that collects a constantly updating stream of data off the network. The entire data set must be available for read-only use at any time, meaning client code should be able to access everything from the oldest message to the one that arrived most recently.

The current plan is to use a memory mapped file on Windows, primarily because the data set is enormous, spanning tens of GiB. There is no way to know in advance which part of the data will be needed, but when it is needed, the client may have to jump around at will.

Memory mapped files fit the bill. However, I have read that they are best for data sets that are already defined and not constantly changing. Is this true? Can the scenario I described above work reasonably well with memory mapped files?

Or am I better off keeping a memory mapped file for all but the most recent data, so that the memory mapped file holds roughly 99% of the history while the most recent, say, 100 MB sits in a separate memory buffer? Every time this buffer becomes full, I would move it to the memory mapped file and then clear it.
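To make the second option concrete, here is a minimal sketch of what I have in mind, written in Python only because its `mmap` module keeps the example short. The class name `BufferedHistory`, the `buffer_limit` parameter and the single-threaded design are all assumptions for illustration, not a finished design. Reads have to stitch together the part already flushed to the file and the part still in the buffer.

```python
import mmap
import os
import tempfile


class BufferedHistory:
    """Hypothetical sketch: flushed history in a file, recent data in RAM."""

    def __init__(self, path: str, buffer_limit: int) -> None:
        self.path = path
        self.buffer_limit = buffer_limit  # assumed flush threshold in bytes
        self.buffer = bytearray()         # most recent, not yet flushed data
        self.flushed = 0                  # number of bytes already in the file
        open(path, "wb").close()          # start with an empty history file

    def append(self, payload: bytes) -> None:
        self.buffer += payload
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self) -> None:
        """Move the in-memory buffer to the end of the file and clear it."""
        with open(self.path, "ab") as f:
            f.write(self.buffer)
        self.flushed += len(self.buffer)
        self.buffer.clear()

    def read(self, offset: int, length: int) -> bytes:
        """Read a logical range that may span the file and the buffer."""
        end = offset + length
        parts = []
        if offset < self.flushed:
            # The file is non-empty here because flushed > offset >= 0.
            with open(self.path, "rb") as f:
                with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
                    parts.append(view[offset:min(end, self.flushed)])
        if end > self.flushed:
            start = max(offset, self.flushed) - self.flushed
            parts.append(bytes(self.buffer[start:end - self.flushed]))
        return b"".join(parts)
```

The extra bookkeeping in `read` is exactly the cost I am unsure about: every client access has to check which side of the flush boundary it falls on.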

+1  A: 

Any data set that is defined and doesn't change is best!
Memory mapped files generally win over anything else, since most OSs will cache the accesses in RAM anyway. And the performance will be predictable; you don't fall off a cliff when you start to swap.

Martin Beckett
So is this a vote for separating the most recent ~n MB into a lightweight memory buffer and just appending periodically when the buffer nears capacity?
ApplePieIsGood
No, this was a vote for putting the whole thing into a memory mapped file and letting the OS caching mechanism worry about it.
Martin Beckett
A: 

Sounds like a database fits your description. Paging is something most commercial ones do well out of the box.

Mitch Wheat
Commercial databases have too much overhead and will be much slower than what this will achieve. This is in essence a highly tailored in-memory database for a very narrow problem domain. The question is whether it should use a separate buffer for the recently changed portion of the data or whether everything should just sit in one memory mapped file.
ApplePieIsGood
+1  A: 

From your problem statement, I see the following requirements:

  1. data must always be available
  2. data is written once; I assume it is append-only and never overwritten
  3. the read access pattern is random, i.e. jumping around
  4. there also appears to be an implicit latency requirement

It seems to me that the memory mapped file was chosen to address 3) and 4). If your data fits in memory, this may well be a reasonable solution. However, if the data is too large to fit in memory, a memory mapped file may cause performance problems due to frequent page faults.

You did not describe how the "jumping around" is done. If it is possible to build an index, you may be able to save the data into multiple files, keep the index in memory, use the index to load and serve data, and also cache the most frequently used data. The basic idea is similar to a disk-based hash. This is probably a more scalable solution.
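As a rough illustration of the index idea, the sketch below keeps a dictionary in memory that maps a message key to (segment number, offset, length) and spreads the payloads over several append-only segment files. Everything here is hypothetical: the class name `SegmentedStore`, the `segment_limit` and `cache_size` parameters, the file naming, and the tiny LRU cache are assumptions chosen to keep the example small, and Python is used only for brevity.

```python
import os
import tempfile
from collections import OrderedDict


class SegmentedStore:
    """Hypothetical sketch: in-memory index over append-only segment files."""

    def __init__(self, directory: str, segment_limit: int, cache_size: int = 4):
        self.directory = directory
        self.segment_limit = segment_limit  # assumed max bytes per segment
        self.index = {}                     # key -> (segment, offset, length)
        self._segment = 0                   # segment currently being written
        self._size = 0                      # bytes written to that segment
        self._cache = OrderedDict()         # small LRU of recently read payloads
        self._cache_size = cache_size

    def _path(self, segment: int) -> str:
        return os.path.join(self.directory, f"segment-{segment:06d}.bin")

    def put(self, key, payload: bytes) -> None:
        # Roll over to a new file when the current segment would overflow.
        if self._size > 0 and self._size + len(payload) > self.segment_limit:
            self._segment += 1
            self._size = 0
        with open(self._path(self._segment), "ab") as f:
            f.write(payload)
        self.index[key] = (self._segment, self._size, len(payload))
        self._size += len(payload)

    def get(self, key) -> bytes:
        if key in self._cache:
            self._cache.move_to_end(key)    # mark as most recently used
            return self._cache[key]
        segment, offset, length = self.index[key]
        with open(self._path(segment), "rb") as f:
            f.seek(offset)
            payload = f.read(length)
        self._cache[key] = payload
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict least recently used
        return payload
```

Because the data is append-only, cached payloads never go stale, which is what makes such a simple cache safe in this sketch. Whether a dictionary index fits in memory depends on the number of messages, which the question does not state.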

Journeyman Programmer
A: 

Since you tagged this Win32, I'm assuming you're working on a 32-bit machine, in which case you simply don't have enough address space to memory map your whole data set. This means you will have to create and destroy mappings into the file as you "jump around", which is going to make this less efficient than you might expect.

In practice, you typically have a bit more than 1 GB of contiguous address space to memory map the file into on a 32-bit Windows box, and you can end up with less if you fragment your address space.

That being said, doing this with memory maps does have a benefit if you are memory (not address space) constrained, since when you memory map a file as read only (as opposed to explicitly reading it into memory) the OS will not have a second copy in the file system cache.
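To show what creating and destroying mappings while jumping around looks like, here is a small sketch using Python's `mmap` module rather than the raw Win32 calls. The helper name `read_via_window` is made up for this example. The one real constraint it demonstrates is that a view's file offset must be a multiple of `mmap.ALLOCATIONGRANULARITY`, so the requested offset is rounded down and the difference is skipped inside the view. The sketch assumes the requested range lies entirely within the file.

```python
import mmap
import os
import tempfile


def read_via_window(path: str, offset: int, length: int) -> bytes:
    """Map only the part of the file that covers [offset, offset + length).

    Hypothetical helper: the caller must make sure the range is inside
    the file, otherwise creating the mapping fails.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // granularity) * granularity  # round down to a valid view offset
    delta = offset - aligned                         # bytes to skip inside the view
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as window:
            return window[delta:delta + length]
```

Each call sets up and tears down a view, which is the overhead mentioned above. A real implementation would probably keep a larger window alive and only remap when a request falls outside it.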

Don Neufeld
I should have said this is a 64-bit machine. Win32 really refers to the API; there is no Win64 API, but rather the Win32 API with 64-bit cases, I suppose. Good catch, I was not clear at all.
ApplePieIsGood
Also, the file can't really be mapped as read only, right, since it has to be written to?
ApplePieIsGood
A: 

The file can be mapped as read-only in one thread that presents the data, while a background worker thread has the file mapped as read-write to do the appending.
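A minimal sketch of that split, in Python for brevity. It assumes the file is preallocated to a fixed capacity, because a mapping cannot see bytes beyond the size it was created with. The names `create_store`, `Writer`, `Reader` and `CAPACITY` are hypothetical, and the sketch leaves out how the writer tells the reader how far valid data extends, which a real design needs (for example a shared length counter).

```python
import mmap
import os
import tempfile

CAPACITY = 1 << 16  # assumed preallocated file size in bytes


def create_store(path: str, capacity: int = CAPACITY) -> None:
    """Preallocate the backing file so both mappings cover the same range."""
    with open(path, "wb") as f:
        f.truncate(capacity)


class Writer:
    """Background side: holds a read-write mapping and appends into it."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_WRITE)
        self.end = 0  # offset of the next free byte

    def append(self, payload: bytes) -> int:
        start = self.end
        if start + len(payload) > len(self._map):
            raise ValueError("store is full; a real design would grow and remap")
        self._map[start:start + len(payload)] = payload
        self.end = start + len(payload)
        return start  # where the payload was stored

    def close(self) -> None:
        self._map.close()
        self._file.close()


class Reader:
    """Presentation side: holds a read-only mapping of the same file."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def read(self, offset: int, length: int) -> bytes:
        return self._map[offset:offset + length]

    def close(self) -> None:
        self._map.close()
        self._file.close()
```

Usage would be along the lines of `writer.append(message)` in the worker thread and `reader.read(offset, length)` in the presenting thread, with the offset coming from whatever index the client uses.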

Andy Dent