I am working on an analysis tool that reads output from a process and continuously converts this to an internal format. After the "logging phase" is complete, analysis is done on the data. The data is all held in memory.

However, due to the fact that all logged information is held in memory, there is a limit on the duration of the logging. For most use cases this is ok, but it should be possible to run for longer, even if this will hurt performance.

Ideally, the program should be able to start using hard drive space in addition to RAM once the RAM usage reaches a certain limit.

This leads to my question: Are there any existing solutions for doing this? It has to work on both Unix and Windows. Other suggestions for how to "fix" this problem are also appreciated.

A: 

Without knowing more about your application it's not possible to provide a perfect answer. However, it does sound a bit like you are re-inventing the wheel. Have you considered using an in-process database library like SQLite?

If you use that or something similar, it will take care of moving the data between memory and disk and give you powerful SQL query capabilities at the same time. Even if your logging data is in a custom format, a small, light database may be a good fit as long as each item has a key or index of some kind.
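
As a rough sketch of what that could look like, assuming each logged item can be flattened to a timestamp plus a text payload (the file name, schema and values below are made up for illustration):

    #include <sqlite3.h>
    #include <cstdio>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("log.db", &db) != SQLITE_OK) return 1;   // on-disk store

        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS log (ts INTEGER, payload TEXT);",
            nullptr, nullptr, nullptr);

        // Logging phase: append records as they arrive instead of keeping
        // them all in RAM; SQLite decides what stays cached in memory.
        sqlite3_stmt* ins = nullptr;
        sqlite3_prepare_v2(db, "INSERT INTO log VALUES (?, ?);", -1, &ins, nullptr);
        sqlite3_bind_int64(ins, 1, 1234567890);
        sqlite3_bind_text(ins, 2, "example record", -1, SQLITE_TRANSIENT);
        sqlite3_step(ins);
        sqlite3_finalize(ins);

        // Analysis phase: pull back only the slice you need.
        sqlite3_stmt* sel = nullptr;
        sqlite3_prepare_v2(db,
            "SELECT ts, payload FROM log WHERE ts BETWEEN ? AND ?;",
            -1, &sel, nullptr);
        sqlite3_bind_int64(sel, 1, 0);
        sqlite3_bind_int64(sel, 2, 2000000000);
        while (sqlite3_step(sel) == SQLITE_ROW)
            std::printf("%lld %s\n",
                        (long long)sqlite3_column_int64(sel, 0),
                        (const char*)sqlite3_column_text(sel, 1));
        sqlite3_finalize(sel);
        sqlite3_close(db);
    }

SQLite keeps only a configurable page cache in RAM, so the data itself lives in the file on disk and memory use stays bounded.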

sipwiz
I have considered using SQLite, but that would require rewriting large parts of the application. Since this is quite a rare use case, I would prefer to be able to simply use a file on disk as needed.
ehamberg
A: 

To spill to disk once memory is full, you can use caching technologies such as EhCache. They can be configured with a limit on how much memory to use, and to overflow to disk beyond that.

They also have smarter eviction policies that you can configure as needed, such as sending to disk any data that has not been used in the last 10 minutes. This could be a plus for you.
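
EhCache itself is a Java library, so in a C or C++ tool the same idea would have to come from another library or be hand-rolled. As a rough illustration of the overflow-to-disk part only (the class, method and file names below are invented for this sketch and are not EhCache's API):

    #include <cstddef>
    #include <fstream>
    #include <iterator>
    #include <list>
    #include <optional>
    #include <string>
    #include <unordered_map>

    // Bounded in-memory store that spills least-recently-used entries to
    // one file per key and reloads them on demand. Keys are assumed to be
    // filesystem-safe; O(n) LRU bookkeeping is fine for a sketch.
    class DiskBackedCache {
    public:
        DiskBackedCache(std::size_t capacity, std::string spill_dir)
            : capacity_(capacity), spill_dir_(std::move(spill_dir)) {}

        void put(const std::string& key, const std::string& value) {
            touch(key);
            mem_[key] = value;
            if (mem_.size() > capacity_) evictOldest();
        }

        std::optional<std::string> get(const std::string& key) {
            auto it = mem_.find(key);
            if (it != mem_.end()) { touch(key); return it->second; }
            std::ifstream in(pathFor(key), std::ios::binary);   // try disk
            if (!in) return std::nullopt;
            std::string value((std::istreambuf_iterator<char>(in)),
                              std::istreambuf_iterator<char>());
            put(key, value);                                    // promote to RAM
            return value;
        }

    private:
        void touch(const std::string& key) {
            lru_.remove(key);
            lru_.push_back(key);
        }
        void evictOldest() {
            std::string victim = lru_.front();
            lru_.pop_front();
            std::ofstream out(pathFor(victim), std::ios::binary | std::ios::trunc);
            out << mem_[victim];               // spill to disk before dropping
            mem_.erase(victim);
        }
        std::string pathFor(const std::string& key) const {
            return spill_dir_ + "/" + key + ".bin";
        }

        std::size_t capacity_;
        std::string spill_dir_;
        std::unordered_map<std::string, std::string> mem_;
        std::list<std::string> lru_;
    };

A real implementation would want constant-time LRU bookkeeping and a binary record format, but the shape is the same: a bounded in-memory map that spills its least-recently-used entries to disk and reloads them on demand.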

KLE
A: 

This might seem too obvious, but what about memory-mapped files? This does what you want and even allows a 32-bit application to use much more than 4 GB of memory. The principle is simple: you allocate the memory you need (on disk) and then map just a portion of it into system memory. You could, for example, map something like 75% of the available physical memory size. Then work on it, and when you need another portion of the data, just re-map. The downside is that you have to do the mapping manually, but that's not necessarily bad. The good thing is that you can use more data than fits into physical memory and into the per-process memory limit. It works really well if you actually use only part of the data at any given time.
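
For example, on a POSIX system a read-only window into a large data file could be mapped roughly like this (on Windows the corresponding calls are CreateFile, CreateFileMapping and MapViewOfFile); the file name and window size are just placeholders:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("log.dat", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);

        // Map a 64 MiB window starting near the middle of the file;
        // the offset must be a multiple of the page size.
        long page = sysconf(_SC_PAGE_SIZE);
        off_t offset = (st.st_size / 2) / page * page;
        size_t window = 64 * 1024 * 1024;
        if (offset + (off_t)window > st.st_size)
            window = (size_t)(st.st_size - offset);

        void* p = mmap(nullptr, window, PROT_READ, MAP_PRIVATE, fd, offset);
        if (p == MAP_FAILED) return 1;

        const char* data = static_cast<const char*>(p);
        std::printf("first byte of window: %d\n", data[0]);

        // To look at a different part of the file, unmap and map again
        // with a new offset -- this is the manual bookkeeping step.
        munmap(p, window);
        close(fd);
    }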

There may be libraries that do this automatically, like the one KLE suggested (though I don't know that one). Doing it manually means you'll learn a lot about it and have more control, though I'd prefer a library if it does exactly what you want with regard to how and when the disk is being used.

This works similarly on both Windows and Unix. For Windows, here is an article by Raymond Chen that shows a simple example.

OregonGhost
That would work, but it would be quite complicated if I have 20 GiB of data in a file and need to look at data at the start, middle and end of the file. If my understanding is correct, that would mean I have to mmap parts of the file and do a lot of bookkeeping, basically writing my own memory manager.
ehamberg
That's true. Random access makes this more complicated. You'd basically have to do the same things the OS does when it's swapping. Your advantage may be that you know in advance which sections you have to map. However, if you want to work on more data than fits into the virtual address space, you have to do something like that anyway. If you don't want to do it yourself, you can try to find a library that does it for you. You can also try to operate on a real file instead, relying on the OS's file system caching mechanism for performance.
OregonGhost