So the scenario is as follows: I have a 2-3 GB file of binary-serialized objects, and an index file which contains the id of each object and its offset in the data file.

I need to write a method that, given a set of ids, deserializes the corresponding objects into memory. Performance is the most important benchmark, and keeping the memory requirements reasonable is the second.
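For concreteness, here's a minimal sketch of loading such an index into a dictionary. The layout (a flat sequence of little-endian Int64 (id, offset) pairs) is just an assumption for illustration; the real format may differ:

```csharp
using System.Collections.Generic;
using System.IO;

static class IndexFile
{
    // Hypothetical index layout: a flat sequence of (id, offset) pairs,
    // each stored as a little-endian Int64. Adjust to the actual format.
    public static Dictionary<long, long> Load(string path)
    {
        var index = new Dictionary<long, long>();
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            long length = reader.BaseStream.Length;
            while (reader.BaseStream.Position < length)
            {
                long id = reader.ReadInt64();
                long offset = reader.ReadInt64();
                index[id] = offset;
            }
        }
        return index;
    }
}
```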

Using MemoryMappedFile seems the way to go, but I'm a bit unsure how to handle the large file. I can't create a MemoryMappedViewAccessor for the entire file since it's so large. Can I have several MemoryMappedViewAccessors over different segments open simultaneously without affecting memory too much, and if so, how large should those segments be?

The views might be kept alive for a while if the data is accessed frequently, and then disposed of.
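To illustrate that "keep views around, dispose later" idea, here's a rough sketch of a segment cache over fixed-size chunks. The 256 MB segment size is an arbitrary placeholder, not a measured recommendation, and a real version would also have to handle records that straddle a segment boundary:

```csharp
using System;
using System.Collections.Generic;
using System.IO.MemoryMappedFiles;

// Sketch of a simple segment cache: the file is mapped in fixed-size
// chunks, and views are created lazily and kept until Dispose().
class SegmentedView : IDisposable
{
    const long SegmentSize = 256 * 1024 * 1024; // 256 MB, arbitrary choice
    readonly MemoryMappedFile _mmf;
    readonly long _fileLength;
    readonly Dictionary<long, MemoryMappedViewAccessor> _segments
        = new Dictionary<long, MemoryMappedViewAccessor>();

    public SegmentedView(MemoryMappedFile mmf, long fileLength)
    {
        _mmf = mmf;
        _fileLength = fileLength;
    }

    MemoryMappedViewAccessor GetSegment(long offset)
    {
        long index = offset / SegmentSize;
        MemoryMappedViewAccessor view;
        if (!_segments.TryGetValue(index, out view))
        {
            long start = index * SegmentSize;
            long size = Math.Min(SegmentSize, _fileLength - start);
            view = _mmf.CreateViewAccessor(start, size);
            _segments[index] = view;
        }
        return view;
    }

    // Reads 'count' bytes at 'offset'; assumes a record never straddles
    // a segment boundary, which real code would have to handle.
    public byte[] Read(long offset, int count)
    {
        var view = GetSegment(offset);
        var buffer = new byte[count];
        view.ReadArray(offset % SegmentSize, buffer, 0, count);
        return buffer;
    }

    public void Dispose()
    {
        foreach (var v in _segments.Values) v.Dispose();
        _segments.Clear();
    }
}
```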

A perhaps naive method would be to order the objects to be fetched by offset and simply call CreateViewAccessor for each offset with a small buffer. Another would be to figure out the smallest number of MemoryMappedViewAccessors needed and their sizes, but I'm unsure of the overhead of calling CreateViewAccessor and how much space you can safely access in one go. I can do some testing, but if someone has a better idea... :)
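The naive version might look something like this (sketch only; the Record type, the index dictionary, and the idea that each entry knows its serialized length are assumptions, since the index as described only stores offsets, so lengths would have to be derived, e.g. from the gap to the next offset):

```csharp
using System;
using System.Collections.Generic;
using System.IO.MemoryMappedFiles;
using System.Linq;

class Record
{
    public long Offset;   // from the index file
    public int Length;    // serialized size, assumed known or derived
}

class Loader
{
    // Hypothetical index: object id -> (offset, length) in the data file.
    readonly Dictionary<long, Record> _index;
    readonly MemoryMappedFile _mmf;

    public Loader(string path, Dictionary<long, Record> index)
    {
        _index = index;
        _mmf = MemoryMappedFile.CreateFromFile(path, System.IO.FileMode.Open);
    }

    // Naive approach: sort the requested ids by file offset (to get a
    // sequential access pattern), then map one small view per object.
    public IEnumerable<byte[]> ReadAll(IEnumerable<long> ids)
    {
        foreach (var rec in ids.Select(id => _index[id]).OrderBy(r => r.Offset))
        {
            using (var view = _mmf.CreateViewAccessor(rec.Offset, rec.Length))
            {
                var buffer = new byte[rec.Length];
                view.ReadArray(0, buffer, 0, rec.Length);
                yield return buffer; // deserialize elsewhere
            }
        }
    }
}
```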

I guess another way to go would be to split the large data file into several, but I'm not sure that would do any good in this case...

A: 

What kind of storage is the file on, a normal HDD or an SSD? In the case of a normal HDD you should minimize seek times, so you might need to order your accesses by offset.

I think having large memory-mapped segments doesn't cost much RAM. They only cost address space, since they can be backed by the file itself. So most of the RAM used is the OS cache.
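As a rough illustration (this assumes a 64-bit process, where a multi-GB view only reserves address space; pages are faulted into RAM by the OS cache as they're actually touched):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class Program
{
    static void Main()
    {
        string path = "data.bin"; // hypothetical data file
        long length = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(
                   0, length, MemoryMappedFileAccess.Read))
        {
            // Only the pages actually read here get faulted into memory,
            // even though the view spans the whole file.
            byte b = view.ReadByte(12345);
            Console.WriteLine(b);
        }
    }
}
```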

From what I've heard, async I/O using I/O Completion Ports is fastest, but I haven't used them myself yet.
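For reference, on Windows a FileStream opened with FileOptions.Asynchronous has its ReadAsync calls serviced by I/O Completion Ports under the hood. An untested sketch, with path, offset, and length as placeholders:

```csharp
using System.IO;
using System.Threading.Tasks;

static class AsyncReader
{
    // Reads 'length' bytes at 'offset' using async I/O; with
    // FileOptions.Asynchronous this goes through completion ports.
    public static async Task<byte[]> ReadAtAsync(string path, long offset, int length)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096,
                                       FileOptions.Asynchronous))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            var buffer = new byte[length];
            int read = 0;
            while (read < length)
            {
                int n = await fs.ReadAsync(buffer, read, length - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }
            return buffer;
        }
    }
}
```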

CodeInChaos
It can be either, but it's a good idea to order the accesses by offset, which I can do thanks to the index file. Thanks!
MattiasK
A: 

My question to you is: why do you have 2-3 GB files of serialized objects? Loading these up is always going to be a performance issue.
Do you really need to handle all this information at once? The best approach might be some kind of database that you could query for the elements you need, when needed, and rebuild them at that point. Can you provide more information on what kind of data you are storing and how you are using it? It seems to me that your design needs a little work.

Romain Hippeau
This is more of a low-level storage library than a solution per se (this *is* the database :). I don't need to handle all the objects at once, but I need to be able to pull out a set of objects on demand.
MattiasK
@MattiasK Why don't you use an existing solution? By building your own you will take on way too much complexity, and your performance will probably not be as good.
Romain Hippeau
It's something of a niche solution with some specific needs.
MattiasK
@MattiasK - If you provide more detail, you might be able to get more help.
Romain Hippeau
I appreciate the help, but the above question is something I need to evaluate, if nothing else so I can do performance comparisons.
MattiasK