views: 193

answers: 6
I've been searching the web about this question, and although there are many similar questions about read/write in C/C++, I haven't found anything about this specific task.

I want to read only sizeof(double) bytes located at a certain position in each of many files (256x256 files). Right now my solution is, for each file:

  1. Open the file (read, binary mode):

    fstream fTest("current_file", ios_base::in | ios_base::binary);

  2. Seek the position I want to read:

    fTest.seekg(position*sizeof(test_value), ios_base::beg);

  3. Read the bytes:

    fTest.read((char *) &(output[i][j]), sizeof(test_value));

  4. And close the file:

    fTest.close();

This takes about 350 ms to run inside a for{ for {} } structure with 256x256 iterations (one for each file).


Q: Do you think there is a better way to implement this operation? How would you do it?

+2  A: 

Maybe threading would help.

But first you could try something simpler. Make two copies of your program, with one reading the first 32768 files, and the other the second half. Run both programs at the same time. Does this take less than 14 hours?

If not, then adding threads is probably useless. Defragmenting, as roygbiv suggests above, might help.

Added: The 14 hours is obviously wrong, since that would be almost 1 second per file. Alejandro's comment above says that with a Solid State Drive, the time is only 0.1 ms per file, totaling 6.5 s. That seems fast to me.

So I'm guessing that Alejandro has to repeat this about 7000 times, each time with a different slice of data from the 65536 files. If so, two further suggestions are:

  • Write a program to cat the files into a single new file. You probably have enough space on your SSD to do this, since your other SO question indicates 32 GB of data and the SSD is likely several times that. Then each run uses just this single huge file, which removes 65535 opens and closes.

  • And, instead of just concatenating, while creating the huge file you could 'reverse the rows and columns' or 'stripe the data' to provide locality (a rough sketch follows this list).
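A minimal sketch of such a one-time conversion tool, assuming 256x256 input files of equal size whose names can be generated from their indices; the `file_%03d_%03d.bin` pattern and the number of doubles per file are placeholders (roughly 7000 slices was the guess above):

    #include <cstdio>
    #include <fstream>
    #include <vector>

    const int GRID = 256;
    const std::size_t SLICES = 7000;   // doubles per file -- an assumption, see above

    int main() {
        std::ofstream big("all_data.bin", std::ios::binary);

        for (int i = 0; i < GRID; ++i)
            for (int j = 0; j < GRID; ++j) {
                char name[64];
                std::snprintf(name, sizeof(name), "file_%03d_%03d.bin", i, j);
                std::ifstream in(name, std::ios::binary);

                std::vector<double> buf(SLICES);
                in.read(reinterpret_cast<char*>(buf.data()), buf.size() * sizeof(double));

                // "Striping": slice p of file (i, j) lands at offset
                // (p * GRID * GRID + i * GRID + j) * sizeof(double), so each slice
                // becomes one contiguous 512 KiB block of the big file.
                const std::size_t idx = std::size_t(i) * GRID + j;
                for (std::size_t p = 0; p < SLICES; ++p) {
                    big.seekp((p * GRID * GRID + idx) * sizeof(double));
                    big.write(reinterpret_cast<const char*>(&buf[p]), sizeof(double));
                }
            }
    }

After this conversion, each pass reads one contiguous 512 KiB slice with a single seek instead of doing 65536 opens.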

Further addition: You've probably already considered this, with your phrase "writing of the read data into a single file".

Joseph Quinsey
You may saturate your I/O subsystem if your files are stored on a single drive. However, if they are on different drives you may see a slight increase in performance.
@roygbiv: I agree.
Joseph Quinsey
I wouldn't use a for loop; I would use a queue and have different threads pull files from it to process. @roygbiv is correct, though, that you will saturate your I/O if you are using a local drive, and if you are running across a network you will saturate your network connection. Without knowing more about the hows and whys of your system, we can't help more.
graham.reeds
If possible, run this test with half the files on one physical hard drive and the other half on another physical hard drive.
jmucchiello
@Joseph: You're right, I was wrong about the 14 hours. On the way home I was thinking about this, and concatenating the files into a single one was one of my own proposals; now I see you agree! That would probably restore the 350 ms read time (a bit more, since the larger file size could affect seek times).
Alejandro Cámara
A: 

Given the nature of the problem, I'm not sure how much more performance you can squeeze out of it. If the files are distributed among several different disks, then I could see creating a thread for each disk; that way you could parallelize several reads at a time. If they're all on the same disk, though, then at some level all the reads will be serialized (I think; I'm not a storage expert).

I/O is your limiting factor here, not the algorithm.

John Bode
A: 

Doesn't the fstream API include buffering by default? I wonder if switching to an API that doesn't buffer, or disabling buffering with `setvbuf`, might result in a speed-up. The underlying OS's caching might well mean there is no difference, but it would be interesting to know.
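For what it's worth, a minimal sketch of both variants (the file name and offset are placeholders, error checking is omitted, and whether either one actually helps is implementation- and OS-dependent):

    #include <cstdio>
    #include <fstream>

    int main() {
        double value;

        // C stdio: request unbuffered I/O right after fopen, before any read.
        FILE* f = std::fopen("current_file", "rb");
        std::setvbuf(f, nullptr, _IONBF, 0);   // _IONBF = no buffering
        std::fseek(f, 1234L * sizeof(double), SEEK_SET);
        std::fread(&value, sizeof(double), 1, f);
        std::fclose(f);

        // C++ streams: pubsetbuf(nullptr, 0) before open asks the filebuf
        // not to buffer; the exact effect is implementation-defined.
        std::ifstream in;
        in.rdbuf()->pubsetbuf(nullptr, 0);
        in.open("current_file", std::ios::binary);
        in.seekg(1234L * sizeof(double));
        in.read(reinterpret_cast<char*>(&value), sizeof(double));
    }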

torak
+1  A: 

If you really want to optimize this, you probably want to drop the C++ fstream stuff, or at least turn off its buffering. fstream does a lot of memory allocation and deallocation, and its buffering can read in more data than is needed. The OS will likely have to read an entire page to get the few bytes you need, but fstream will probably make it copy at least that much (and maybe more, requiring more reads) into its own buffers, which takes time.

Now, we can move on to bigger wins. You probably want to use the OS's IO routines directly. If you are using a POSIX system (like Linux), then `open`, `lseek`, `read`, and `close` are a good first attempt, and they are what you will have to fall back on if the system calls mentioned below aren't available.
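A sketch of that first step, with error handling mostly omitted (the path and `position` are the same placeholders as in the question):

    #include <fcntl.h>     // open
    #include <unistd.h>    // lseek, read, close

    // Minimal POSIX equivalent of the fstream version, for one file.
    double read_one(const char* path, long position) {
        double value = 0.0;
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return value;                     // real code should report the error
        lseek(fd, position * sizeof(double), SEEK_SET);
        read(fd, &value, sizeof(value));      // check the return value in real code
        close(fd);
        return value;
    }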

If all of the files you are trying to read live in (or under) one directory (folder), you may find that opening that directory with `opendir` or `open("directory_name", O_DIRECTORY)` (depending on whether you need to read the directory entries yourself) and then calling `openat`, which takes a directory file descriptor as one of its arguments, speeds up opening each file: the OS won't have to work as hard to look up the file you're trying to open each time (that data will probably be in the OS's file system cache, but the lookup still takes time and involves lots of checks).

Then you may be able to read your data with the `pread` system call, without first seeking to the location of the data you need: `pread` takes an offset rather than using the OS's idea of the current file position. That saves you at least one system call per file.
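Putting `openat` and `pread` together might look roughly like this (the directory and file names are placeholders):

    #include <fcntl.h>     // open, openat, O_DIRECTORY
    #include <unistd.h>    // pread, close

    double read_one_at(int dir_fd, const char* name, long position) {
        double value = 0.0;
        int fd = openat(dir_fd, name, O_RDONLY);   // resolved relative to dir_fd
        if (fd < 0)
            return value;
        // pread needs no lseek: it reads sizeof(value) bytes at the given offset.
        pread(fd, &value, sizeof(value), position * (off_t)sizeof(double));
        close(fd);
        return value;
    }

    // Usage sketch:
    //   int dir_fd = open("data_directory", O_RDONLY | O_DIRECTORY);
    //   double v = read_one_at(dir_fd, "file_000_000.bin", position);
    //   close(dir_fd);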

edit

If your system supports asynchronous IO, this should speed things up, since you can tell the OS everything you want up front before you go and retrieve it (this lets the OS schedule the disk reads better, especially on rotating disks). It can get complicated, but it would likely save you a lot of time.
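One portable option is the POSIX AIO interface in `<aio.h>` (on Linux, link with `-lrt`). A rough sketch for a small batch of files, with the file names and the offset as placeholders:

    #include <aio.h>       // aio_read, aio_suspend, aio_error, aio_return
    #include <cerrno>
    #include <cstring>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        const int N = 4;                           // placeholder: a small batch of files
        const char* names[N] = { "f0", "f1", "f2", "f3" };
        long position = 1234;                      // placeholder slice index
        double values[N];
        aiocb cbs[N];
        const aiocb* wait_list[N];

        // Queue all the reads up front so the OS can schedule them together.
        for (int i = 0; i < N; ++i) {
            std::memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = open(names[i], O_RDONLY);
            cbs[i].aio_buf    = &values[i];
            cbs[i].aio_nbytes = sizeof(double);
            cbs[i].aio_offset = position * sizeof(double);
            aio_read(&cbs[i]);
            wait_list[i] = &cbs[i];
        }

        // Wait for each request to finish (real code should check the results).
        for (int i = 0; i < N; ++i) {
            while (aio_error(&cbs[i]) == EINPROGRESS)
                aio_suspend(wait_list, N, nullptr);
            aio_return(&cbs[i]);
            close(cbs[i].aio_fildes);
        }
    }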

nategoose
So, is it really worthwhile to use `open`, `lseek`, `read`, and `close` instead of the C++ `fstream` functions? Personally I feel more comfortable with plain C, but I was advised to use C++ for this purpose. I'll run some tests on it.
Alejandro Cámara
If you are trying to get the highest possible performance out of this, yes. C++ streams are very complicated and do lots of things under the hood, and you are doing something very specific many, many times. If the system has `strace`, try running your program under it for a few iterations and see what it spits out. You can do the same thing with `ltrace` and see how many times `malloc` and `free` get called as you open, seek, read, and close your streams.
nategoose
+1  A: 

If possible, I suggest reorganizing the data. For example, put all those doubles into one file instead of spreading them across multiple files.

If you need to run the program multiple times and the data doesn't change, you may want to create a tool that will optimize the data first.

The performance issue with the files is the overhead of:

  1. Ramping up the hard drive.
  2. Locating the file.
  3. Positioning within the file.
  4. Reading the data.

(Closing a file adds very little to the overall cost.)

In most file-based systems that handle a lot of data, reads are long enough that the transfer time dominates any overhead, and requests can be cached and sorted for optimal disk access. Unfortunately, in your case you are not reading enough data, so the overhead now takes longer than the reading itself.

I suggest queuing the read operations. Take 4 threads; each opens a file, reads the doubles, and places them into a buffer. The idea is to stagger the operations:

  • Thread 1 opens a file.
  • Thread 2 opens a file while thread 1 is positioning.
  • Thread 3 opens a file while thread 2 is positioning and thread 1 is reading the data.
  • Thread 4 opens a file, thread 3 positions, thread 2 reads, thread 1 closes.

Hopefully, these threads can keep the hard drive busy enough that it never idles; continuous activity. You may want to try this in a single thread first. If you need better performance, you could consider sending commands directly to the disk drive (ordering them first). A sketch of the queue-plus-threads idea follows.
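A minimal sketch of that idea, using a shared atomic counter as the work queue (the file-name pattern and `position` are placeholders, and C++11 threads are only one way to spell it):

    #include <atomic>
    #include <cstdio>
    #include <fstream>
    #include <thread>
    #include <vector>

    const int GRID = 256;
    const long position = 1234;                 // placeholder slice index

    double output[GRID][GRID];
    std::atomic<int> next_index(0);             // shared "queue": just a counter

    void worker() {
        for (;;) {
            int idx = next_index.fetch_add(1);  // claim the next file
            if (idx >= GRID * GRID)
                return;
            int i = idx / GRID, j = idx % GRID;

            char name[64];
            std::snprintf(name, sizeof(name), "file_%03d_%03d.bin", i, j);

            std::ifstream f(name, std::ios::binary);
            f.seekg(position * sizeof(double));
            f.read(reinterpret_cast<char*>(&output[i][j]), sizeof(double));
            // While this thread waits on the disk, the others can be opening,
            // seeking, or closing their own files.
        }
    }

    int main() {
        std::vector<std::thread> pool;
        for (int t = 0; t < 4; ++t)             // 4 threads, as suggested above
            pool.emplace_back(worker);
        for (auto& t : pool)
            t.join();
    }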

Thomas Matthews
A: 

Invert the order of iteration. Or at least read an entire page of data from the disk (say 4 kB per file) and keep it in memory until the next pass. Then you would only need to touch the filesystem every 512th pass. It would cost 256 MB of RAM but save hundreds of GB of file I/O (even when you request only 8 bytes, the disk has to transfer a full page into cache). And your OS's disk-cache replacement algorithm is likely to evict a file that is 65k `open` calls old, so don't trust it to do the optimization for you (a sketch follows).
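A minimal sketch of that caching scheme, assuming the passes walk through consecutive positions (file names are placeholders): each file is read once per 512 positions, and the other 511 accesses are served from the in-memory pages.

    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <fstream>
    #include <vector>

    const int GRID = 256;
    const std::size_t PAGE = 4096;                               // bytes cached per file
    const std::size_t DOUBLES_PER_PAGE = PAGE / sizeof(double);  // 512

    // 256 MB of cached pages: one 4 kB block per file.
    std::vector<std::vector<char>> cache(GRID * GRID, std::vector<char>(PAGE));
    std::size_t cached_base = std::size_t(-1);

    // Refill every file's cached page, starting at double index `base`.
    void refill(std::size_t base) {
        for (int i = 0; i < GRID; ++i)
            for (int j = 0; j < GRID; ++j) {
                char name[64];
                std::snprintf(name, sizeof(name), "file_%03d_%03d.bin", i, j);
                std::ifstream f(name, std::ios::binary);
                f.seekg(base * sizeof(double));
                f.read(cache[i * GRID + j].data(), PAGE);
            }
    }

    // Fetch file (i, j) at double index `position`, refilling on page boundaries.
    double get(int i, int j, std::size_t position) {
        std::size_t base = position - position % DOUBLES_PER_PAGE;
        if (base != cached_base) {              // new page: one filesystem pass
            refill(base);
            cached_base = base;
        }
        double value;
        std::memcpy(&value,
                    cache[i * GRID + j].data()
                        + (position % DOUBLES_PER_PAGE) * sizeof(double),
                    sizeof(value));
        return value;
    }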

Ben Voigt