ansaurus

Question

Answer 1

+1 A:

Use a memory-mapped file (http://en.wikipedia.org/wiki/Memory-mapped_file).

bb 2009-03-12 22:49:38

Answer 2

+10 A:

Since you do not mention an OS that you are running this on, have you looked at memory mapping the file and then using standard memory routines to "walk" the file as you go along?

This way you use pointer arithmetic instead of fseek/fread, which may improve performance. Here is an mmap example that copies a source file to a destination file.
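Separately from that copy example, the pointer-walking idea can be sketched roughly as follows. This assumes a POSIX system (open, fstat, mmap, munmap); the Sample record and the function names are hypothetical stand-ins, since the real record layout is not given here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical fixed-size record; substitute whatever the simulation really writes.
struct Sample {
    std::int32_t time_unit;
    std::int32_t value;
};

// Helper for the sketch: dumps the records to a binary file.
void write_samples(const std::string& path, const std::vector<Sample>& samples) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(samples.data()),
              static_cast<std::streamsize>(samples.size() * sizeof(Sample)));
}

// Maps the file read-only and sums `value` over all records by walking the
// mapping with a pointer instead of calling fseek/fread.
long long sum_values_mmap(const std::string& path) {
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed: " + path);

    struct stat st{};
    if (::fstat(fd, &st) != 0) {
        ::close(fd);
        throw std::runtime_error("fstat failed: " + path);
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) {  // mmap rejects zero-length mappings
        ::close(fd);
        return 0;
    }
    // Debug builds flag a partial trailing record; the loop below ignores it.
    assert(size % sizeof(Sample) == 0);

    void* base = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);

    const char* p = static_cast<const char*>(base);
    const char* end = p + (size / sizeof(Sample)) * sizeof(Sample);
    long long total = 0;
    for (; p != end; p += sizeof(Sample)) {
        Sample s;
        std::memcpy(&s, p, sizeof s);  // avoids alignment and aliasing assumptions
        total += s.value;
    }
    ::munmap(base, size);
    return total;
}
```

The OS pages the file in on demand, so a sequential walk like this reads ahead without any explicit buffering code. On Windows the equivalent calls are different, so treat this as POSIX-only.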

Another thing you could look into is splitting the data into smaller files and using a hash value derived from the time unit to decide when to close the current file and open the next one as the simulation continues. Smaller files can be cached more aggressively by the host OS.
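A minimal sketch of the file-splitting idea, assuming the "hash" is simply the time unit divided by a fixed number of units per file. The naming scheme "<base>.<k>.bin" and the ChunkedReader class are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>

// Assumption: records for time units [k*units_per_file, (k+1)*units_per_file)
// all live in chunk number k.
std::uint64_t chunk_index(std::uint64_t time_unit, std::uint64_t units_per_file) {
    assert(units_per_file > 0);
    return time_unit / units_per_file;
}

// Hypothetical naming scheme: "<base>.<k>.bin".
std::string chunk_path(const std::string& base, std::uint64_t time_unit,
                       std::uint64_t units_per_file) {
    return base + "." + std::to_string(chunk_index(time_unit, units_per_file)) + ".bin";
}

// Keeps one chunk open and switches files only when a chunk boundary is crossed.
class ChunkedReader {
public:
    ChunkedReader(std::string base, std::uint64_t units_per_file)
        : base_(std::move(base)), units_per_file_(units_per_file) {}

    // Returns the stream for the chunk holding `time_unit`, reopening if needed.
    // The caller should check the stream state, since the chunk may not exist.
    std::ifstream& stream_for(std::uint64_t time_unit) {
        const std::uint64_t idx = chunk_index(time_unit, units_per_file_);
        if (!in_.is_open() || idx != current_) {
            in_.close();
            in_.clear();  // drop any error or EOF state left by the previous chunk
            in_.open(chunk_path(base_, time_unit, units_per_file_), std::ios::binary);
            current_ = idx;
        }
        return in_;
    }

private:
    std::string base_;
    std::uint64_t units_per_file_;
    std::uint64_t current_ = 0;
    std::ifstream in_;
};
```

Whether this helps depends on how the simulation moves through time: it pays off when consecutive accesses mostly stay inside one chunk.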

X-Istence 2009-03-12 22:50:27

Answer 3

+1 A:

Store the computed data in a relational database.

anon 2009-03-12 22:51:27

Answer 4

+1 A:

Maybe not relevant in this case, but I managed to increase performance in an application with heavy file reads and writes by writing compressed data (zlib) and decompressing it on the fly. The reduced read/write time outweighed the increased CPU load.

Alternatively, if your problem is that the amount of data does not fit in memory and you want to use the disk as a cache, you can look into memcached, which provides a scalable and distributed memory cache.

small_duck 2009-03-12 22:51:54

Answer 5

+1 A:

"Millions" of maps does not sound like a lot of data. What prevents you from keeping all data in memory?

Another option is to use a standard file format suited to your needs, e.g. sqlite (use SQL to store/retrieve data), a specialized format like hdf5, or your own format defined with something like Google Protocol Buffers.

J.F. Sebastian 2009-03-12 23:08:59

Answer 6

+2 A:

You might consider using memory-mapped files. For example, see boost::interprocess, which provides a convenient implementation.

You might also consider STXXL, which provides STL-like functionality aimed at large file-based datasets.

One more thing: if you want iterator-like access to your data, have a look at boost::iterator_facade.

If you don't want to play with fancy tricks, you could provide an additional binary file that serves as an index: it stores the starting offset of each structure in the data file. This would give you indirect random access.
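A rough sketch of the index-file idea using only standard C++ streams. The on-disk layout (a uint32 length prefix per record in the data file, one uint64 offset per record in the index file) and the function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Writes each payload to `data_path` as [uint32 length][bytes] and stores the
// starting offset of every record in `index_path` as a uint64.
void write_with_index(const std::string& data_path, const std::string& index_path,
                      const std::vector<std::string>& payloads) {
    std::ofstream data(data_path, std::ios::binary | std::ios::trunc);
    std::ofstream index(index_path, std::ios::binary | std::ios::trunc);
    std::uint64_t offset = 0;
    for (const std::string& p : payloads) {
        assert(p.size() <= std::numeric_limits<std::uint32_t>::max());
        const std::uint32_t len = static_cast<std::uint32_t>(p.size());
        index.write(reinterpret_cast<const char*>(&offset), sizeof offset);
        data.write(reinterpret_cast<const char*>(&len), sizeof len);
        data.write(p.data(), static_cast<std::streamsize>(len));
        offset += sizeof len + len;
    }
}

// Fetches record `i` with one seek in the index and one seek in the data file.
std::string read_record(const std::string& data_path, const std::string& index_path,
                        std::uint64_t i) {
    std::ifstream index(index_path, std::ios::binary);
    index.seekg(static_cast<std::streamoff>(i * sizeof(std::uint64_t)));
    std::uint64_t offset = 0;
    index.read(reinterpret_cast<char*>(&offset), sizeof offset);
    if (!index) throw std::out_of_range("no such record in index");

    std::ifstream data(data_path, std::ios::binary);
    data.seekg(static_cast<std::streamoff>(offset));
    std::uint32_t len = 0;
    data.read(reinterpret_cast<char*>(&len), sizeof len);
    std::string payload(len, '\0');
    data.read(&payload[0], static_cast<std::streamsize>(len));
    if (!data) throw std::runtime_error("truncated data file");
    return payload;
}
```

In a real program you would keep both streams open, or load the whole index into memory, rather than reopening the files on every lookup. The files are written in the host's byte order, so they are not portable across architectures as-is.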

Anonymous 2009-03-12 23:17:14

Answer 7

+3 A:

The effectiveness of this idea depends on your access pattern, but if you are not looking at the variable-size data on every cycle, you might speed up access by rearranging your file structure.
Instead of writing a direct dump of a structure like this:

struct {
  int x;
  enum t;
  int sz;                  // length of variable_data in bytes
  char variable_data[sz];  // pseudocode: sz bytes stored inline after the fixed fields
};

you could write all the fixed size parts up front, then store the variable portions afterward:

struct {
  int x;
  enum t;
  int sz;
  long offset_to_variable_data;
};

Now, as you parse the file each cycle, you can linearly read N records at a time. You will only have to deal with fseek when you need to fetch the variable-sized data. You might even consider keeping that variable portion in a separate file so that you also only read forward through that file.

This strategy may even improve your performance if you do go with a memory-mapped file as others suggested.
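To make the layout concrete, here is a small sketch under some assumptions of mine: the file starts with a uint64 record count, followed by all fixed-size headers, followed by the variable data. The names RecordHeader, Record, write_split, read_headers and read_variable are hypothetical, and the enum is represented as a plain 32-bit integer.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

struct RecordHeader {
    std::int32_t x;
    std::int32_t type;                     // stands in for the enum
    std::int32_t sz;                       // length of the variable part in bytes
    std::int64_t offset_to_variable_data;  // absolute offset within the file
};

struct Record {
    std::int32_t x;
    std::int32_t type;
    std::string variable_data;
};

// Layout: [uint64 count][count headers][all variable data back to back].
void write_split(const std::string& path, const std::vector<Record>& records) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    const std::uint64_t count = records.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    std::int64_t offset = static_cast<std::int64_t>(sizeof count + count * sizeof(RecordHeader));
    for (const Record& r : records) {
        assert(r.variable_data.size() <=
               static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
        RecordHeader h{};  // zero-initialized, including padding in practice
        h.x = r.x;
        h.type = r.type;
        h.sz = static_cast<std::int32_t>(r.variable_data.size());
        h.offset_to_variable_data = offset;
        out.write(reinterpret_cast<const char*>(&h), sizeof h);
        offset += h.sz;
    }
    for (const Record& r : records)
        out.write(r.variable_data.data(), static_cast<std::streamsize>(r.variable_data.size()));
}

// Reads every fixed-size header with one linear read, no seeking per record.
std::vector<RecordHeader> read_headers(std::ifstream& in) {
    std::uint64_t count = 0;
    in.seekg(0);
    in.read(reinterpret_cast<char*>(&count), sizeof count);
    std::vector<RecordHeader> headers(count);
    in.read(reinterpret_cast<char*>(headers.data()),
            static_cast<std::streamsize>(count * sizeof(RecordHeader)));
    if (!in) throw std::runtime_error("truncated header section");
    return headers;
}

// Seeks only when the variable-sized part of a record is actually needed.
std::string read_variable(std::ifstream& in, const RecordHeader& h) {
    std::string data(static_cast<std::size_t>(h.sz), '\0');
    in.clear();
    in.seekg(static_cast<std::streamoff>(h.offset_to_variable_data));
    in.read(&data[0], static_cast<std::streamsize>(h.sz));
    if (!in) throw std::runtime_error("truncated variable section");
    return data;
}
```

Reading N records per cycle then becomes a single read of N * sizeof(RecordHeader) bytes. The headers are dumped with the compiler's padding and the host byte order, so this format is only meant to be read back by the same build on the same machine.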

AShelly 2009-03-12 23:35:12

Answer 8

A:

Frameworks like Boost and ACE provide platform-independent access to memory-mapped files. That should speed up your parsing significantly.

lothar 2009-04-07 02:39:39


Optimize read/write huge data (C++)
