I have created an application that does the following:

  1. Make some calculations and write the calculated data to a file; repeat 500,000 times (writing 500,000 files one after the other); then repeat the whole step 2 more times (so 1.5 million files written in total).
  2. Read the data from a file and run some intense calculations on it; repeat for 1,500,000 iterations (iterating over all the files written in step 1).
  3. Repeat step 2 for 200 iterations.

Each file is ~212 KB, so overall I have ~300 GB of data. It looks like the entire process will take ~40 days on a 2.8 GHz Core 2 Duo CPU.

My problem is (as you can probably guess) the time it takes to complete the entire process. All the calculations are serial (each calculation depends on the one before it), so I can't parallelize the process across different CPUs or PCs. I'm trying to think how to make the process more efficient, and I'm pretty sure most of the overhead goes to file system access (duh...). Every time I access a file I open a handle to it and then close it once I finish reading the data.

One of my ideas to improve the run time was to use one big 300 GB file (or several big files of 50 GB each); then I would use only one open file handle and simply seek to each relevant piece of data and read it. But I'm not sure what the overhead of opening and closing file handles actually is. Can someone shed some light on this?

Another idea I had was to group the files into bigger ~100 MB files and then read 100 MB at a time instead of doing many 212 KB reads, but this is much more complicated to implement than the idea above.

Anyway, if anyone can give me advice on this or has any idea how to improve the run time, I would appreciate it!

Thanks.

Profiler update:

I ran a profiler on the process; it looks like the calculations take 62% of the runtime and the file reads take 34%. That means even if I miraculously cut the file I/O cost down to nothing, I'm still left with ~24 days, which is quite an improvement, but still a long time :)

+6  A: 

Opening a file handle probably isn't the bottleneck; actual disk I/O is. If you can parallelize disk access (e.g. by using multiple disks, faster disks, a RAM disk, ...) you may benefit far more. Also, make sure I/O doesn't block the application: read from disk, and process while waiting for I/O — e.g. with a reader thread and a processor thread.

Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.

Oh yes, and measure it :)

xtofl
I only save data in the first step and use it in the second step, so I must save the data to files. Actually, the advice of using another thread to read from disk sounds like a good idea.
dudico
"why go through the effort of saving it to disk?" Most likely he doesn't have 300GB of RAM.
Steve Jessop
@onebyone: but he *says* he relies on the result of the previous step to calculate the current one, and one step is about 212 KB, so...
xtofl
In step 2 I need all the data from step 1 (300 GB), and I need it 200 times, so I can't calculate the data from step 1 on the fly.
dudico
I cut the time in half: I added thread support, and I even use Intel's TBB to take maximum advantage of my two cores :)
dudico
+2  A: 

Using memory-mapped files should be investigated, as it will reduce the number of system calls.

steve
As in using one big file that is mapped to memory?
dudico
Not exactly; it can be done for smaller and bigger files, but it is limited by the maximum process address space (around 4 GB on 32-bit). You need OS support for memory-mapped files. Basically, the virtual memory manager takes care of mapping the files into the process address space. On Unix this is provided by the mmap call, and on Windows by the CreateFileMapping call.
steve
A 64-bit OS should be able to map all 300 GB in one contiguous address range.
Marsh Ray
+7  A: 

Each file is ~212k, so over all i have ~300Gb of data. It looks like the entire process takes ~40 days ... all the calculations are serial (each calculation is dependent on the one before), so i can't parallel this process to different CPUs or PCs. ... pretty sure the most of the overhead goes to file system access ... Every time i access a file i open a handle to it and then close it once i finish reading the data.

Writing 300 GB of data serially might take 40 minutes, only a tiny fraction of 40 days. Disk write performance shouldn't be an issue here.

Your idea of opening the file only once is spot-on. Closing the file after every operation is probably causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
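To make that concrete, here is a minimal sketch of the single-file idea: fixed-size records addressed by seeking to `index * record_size` through one handle that stays open for the whole run. The file name and the 212 KB record size are assumptions taken from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <vector>

constexpr std::size_t kRecordSize = 212 * 1024;  // ~212 KB per record (assumed)

// Read record `index` from an already-open stream: one seek + one read,
// with no per-record open/close cost.
std::vector<char> read_record(std::ifstream& in, std::uint64_t index) {
    std::vector<char> buf(kRecordSize);
    in.seekg(static_cast<std::streamoff>(index) * static_cast<std::streamoff>(kRecordSize),
             std::ios::beg);
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    return buf;
}
```

The stream would be opened once, before the 1,500,000-iteration loop, and reused for every record.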

My bet is that the fastest implementation of this application will use a memory-mapped file; all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system, but you should not need 300 GB of RAM. Map the whole file into address space at once and just read and write your data through pointers.
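A POSIX-only sketch of that approach (on Windows the equivalent calls are CreateFileMapping/MapViewOfFile). The file name is hypothetical and error handling is minimal:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <fstream>

// Map an entire file read-only and return a pointer to its bytes.
// On a 64-bit OS even a 300 GB file can be mapped in one go; the
// kernel pages data in and out on demand.
const char* map_file(const char* path, std::size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                   PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return nullptr;
    *size_out = static_cast<std::size_t>(st.st_size);
    return static_cast<const char*>(p);
}
```

With the whole data set mapped, reading record `i` is just pointer arithmetic (`data + i * record_size`), and the OS handles caching and read-ahead.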

Marsh Ray
+5  A: 

Before making any changes, it might be useful to run a profiler trace to figure out where most of the time is spent, to make sure you actually optimize the real problem.

steve
+4  A: 

What about using SQLite? I think you can get away with a single table.

Vijay Mathew
I have considered it, but would that make the data extraction faster?
dudico
Faster by how much? I don't think it will be any slower. Consider the overhead of opening and closing thousands of files, or of searching for information in a single big file. With SQLite you get a single file, optimized with indexing. Moreover, SQLite supports in-memory databases; that means you can cache part of your data for faster access without writing new code for it.
Vijay Mathew
I don't need indexing, because I will know the exact offset from the beginning of the file. Caching might help, but that depends on the cache implementation.
dudico
+3  A: 

From your brief explanation, it sounds like xtofl's suggestion of threads is the right way to go. I would recommend you profile your application first, though, to ensure that the time really is divided between I/O and CPU.

Then I would consider three threads joined by two queues.

  1. Thread 1 reads files and loads them into RAM, then places the data/pointers in the first queue. If the queue grows over a certain size the thread sleeps; if it drops below a certain size it starts again.
  2. Thread 2 reads the data off the first queue, does the calculations, then writes the results to the second queue.
  3. Thread 3 reads the second queue and writes the data to disk.

You could consider merging threads 1 and 3; this might reduce contention on the disk, as your app would only do one disk operation at a time.
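The three-thread pipeline above could be sketched like this, with bounded queues doing the sleeping/waking via condition variables rather than literal sleeps. The stages here just pass integers around as stand-ins; in the real application they would read a file, calculate, and write a file. All names are illustrative.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A queue with a capacity limit: push() blocks when full, pop() blocks
// when empty — this implements the "sleep over a certain size" rule.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(v));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }

private:
    std::size_t cap_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

// Push n items through reader -> worker -> writer and collect the output.
// -1 is used as an end-of-stream marker.
std::vector<int> run_pipeline(int n) {
    BoundedQueue<int> raw(8), done(8);
    std::vector<int> out;

    std::thread reader([&] {             // stands in for "read file into RAM"
        for (int i = 0; i < n; ++i) raw.push(i);
        raw.push(-1);
    });
    std::thread worker([&] {             // stands in for the calculation
        for (;;) {
            int v = raw.pop();
            if (v < 0) { done.push(-1); break; }
            done.push(v * 2);
        }
    });
    std::thread writer([&] {             // stands in for "write file to disk"
        for (;;) {
            int v = done.pop();
            if (v < 0) break;
            out.push_back(v);
        }
    });
    reader.join(); worker.join(); writer.join();
    return out;
}
```

The queue capacity (8 here) would be tuned so the reader stays a little ahead of the worker without filling RAM.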

Also, how does the operating system handle all those files? Are they all in one directory? What is performance like when you browse the directory (GUI file manager / dir / ls)? If that performance is bad, you might be working outside your file system's comfort zone. Although you can really only change this on Unix, some file systems are optimized for different types of file usage, e.g. large files, lots of small files, etc. You could also consider splitting the files across different directories.

iain