views: 86
answers: 4

Hi,

I'm processing data from one large file on a hard disk (the processing itself is fast, without a lot of overhead) and then have to write the results back as hundreds of thousands of files.

I started by writing the results straight to files, one at a time, which was the slowest option. I found it gets a lot faster if I collect a certain number of the files in a vector and then write them all at once, then go back to processing while the hard disk is busy writing out everything I poured into it (at least that seems to be what happens).
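
Roughly what the batching looks like, as a simplified sketch (the container layout and the flush threshold are just placeholders, not what I claim is optimal):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Simplified sketch of the batching idea: collect (path, contents) pairs
    // while processing, then flush them to disk in one burst.
    struct PendingFile {
        std::string path;
        std::string contents;
    };

    void flushBatch(std::vector<PendingFile>& batch) {
        for (size_t i = 0; i < batch.size(); ++i) {
            FILE* f = std::fopen(batch[i].path.c_str(), "wb");
            if (!f) continue;  // real code should report the error
            std::fwrite(batch[i].contents.data(), 1, batch[i].contents.size(), f);
            std::fclose(f);
        }
        batch.clear();
    }

    // In the processing loop (threshold value is arbitrary here):
    //   batch.push_back(pendingFile);
    //   if (bytesBuffered >= 10 * 1024 * 1024) flushBatch(batch);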

My question is: can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints? To me it seems to be a hard disk buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):

Buffer size      Time (min:sec)
-------------------------------
No buffer        ~ 8:30
 1 MB            ~ 6:15
10 MB            ~ 5:45
50 MB            ~ 7:00

Or is this just a coincidence?

I would also be interested in experience and rules of thumb about how to optimize write performance in general: for example, are larger hard disk blocks helpful, etc.?

Edit:

The hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5" 1 TB / 7200 rpm / 16 MB / USB 2.0, HFS+ journaled, OS is Mac OS X 10.5. I'll soon give it a try on ext3/Linux and with an internal rather than an external disk.

+2  A: 

The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.

That being said, you should also look at optimizing your read access. OSes (at least Windows) are already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't much they can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
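
A minimal sketch of the dedicated-writer-thread variant, assuming a simple mutex/condition-variable queue (the names here are illustrative, not a particular library):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Illustrative sketch: the parsing thread queues (path, contents) jobs and a
    // writer thread drains the queue, so the parser never waits on the disk.
    struct WriteJob {
        std::string path;
        std::string contents;
    };

    std::queue<WriteJob> jobs;
    std::mutex jobsMutex;
    std::condition_variable jobsReady;
    bool done = false;

    void writerThread() {
        for (;;) {
            std::unique_lock<std::mutex> lock(jobsMutex);
            jobsReady.wait(lock, [] { return done || !jobs.empty(); });
            if (jobs.empty() && done) return;
            WriteJob job = jobs.front();
            jobs.pop();
            lock.unlock();  // do the actual write with the lock released
            if (FILE* f = std::fopen(job.path.c_str(), "wb")) {
                std::fwrite(job.contents.data(), 1, job.contents.size(), f);
                std::fclose(f);
            }
        }
    }

    // Producer side (e.g. after std::thread t(writerThread);):
    //   { std::lock_guard<std::mutex> g(jobsMutex); jobs.push(job); }
    //   jobsReady.notify_one();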

Terry Mahaffey
I think I'm pretty close to the wall with read performance: this is SAX parsing of a huge XML file, and certain nodes of that file (after some processing, which is almost nothing) are written to new small files. It's really fast, as I can see when I turn off the file saving.
Homer J. Simpson
+1  A: 

Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.

You might want to use some dirty tricks. Writing hundreds of thousands of files is not going to be efficient with the normal API.

Test this by first writing everything sequentially to a single file rather than 100,000 files, and compare the performance. If the difference is interesting, read on.

If you really understand the file system you're writing to, you can make sure you're writing one contiguous block that you only later split into multiple files in the directory structure.

You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.

[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
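
If the single-file-with-an-index route is an option, here is a minimal sketch of the idea (the file names and record format are made up; any layout that records offsets works):

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative sketch: append every record to one big data file and keep a
    // separate index of (name, offset, length) so individual records can still
    // be looked up later.
    void writePacked(const std::vector<std::pair<std::string, std::string> >& records) {
        FILE* data = std::fopen("records.dat", "wb");
        FILE* index = std::fopen("records.idx", "w");
        if (!data || !index) return;  // real code should report the error
        for (size_t i = 0; i < records.size(); ++i) {
            long offset = std::ftell(data);
            std::fwrite(records[i].second.data(), 1, records[i].second.size(), data);
            std::fprintf(index, "%s %ld %lu\n", records[i].first.c_str(),
                         offset, (unsigned long)records[i].second.size());
        }
        std::fclose(index);
        std::fclose(data);
    }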

Stephan Eggermont
Reading speed is no problem, as I said. The 'split one file' idea is something I had in mind too, but I didn't really know how to implement it; I should probably invest some time into figuring that out.
Homer J. Simpson
Yes, that's the way it is ;-)
Homer J. Simpson
+2  A: 

"Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?"

Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:

  • Which filesystem you're using

  • What disk-scheduling algorithm the kernel is using

  • The hardware characteristics of your disk

  • The hardware interconnect you're using

For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.

If I were you I'd take these two steps:

  • Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.

  • Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file, or even compile it into your code (a sketch of this follows below). This strategy was pioneered by Matteo Frigo for FFTW, and it is remarkably effective.

Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
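
A minimal sketch of that measure-and-pick step, assuming the work can be parameterized by a batch size (the candidate sizes, the dummy workload, and the config file name are all made up for illustration):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Illustrative sketch: time the same write workload with several candidate
    // batch sizes and remember the winner, so later runs can skip the search.
    // The dummy workload just writes ~50 MB of zeros in batches of the given
    // size; in the real program it would be the actual process-then-write loop.
    double timeWorkload(std::size_t batchSizeBytes) {
        const std::size_t totalBytes = 50u << 20;
        std::vector<char> batch(batchSizeBytes, 0);
        auto start = std::chrono::steady_clock::now();
        FILE* f = std::fopen("tune_scratch.bin", "wb");
        if (!f) return 1e30;
        for (std::size_t written = 0; written < totalBytes; written += batch.size())
            std::fwrite(batch.data(), 1, batch.size(), f);
        std::fclose(f);
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
        return dt.count();
    }

    int main() {
        const std::size_t candidates[] = { 1u << 20, 10u << 20, 50u << 20 };  // 1/10/50 MB
        std::size_t best = candidates[0];
        double bestSeconds = 1e30;
        for (std::size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
            double seconds = timeWorkload(candidates[i]);
            if (seconds < bestSeconds) { bestSeconds = seconds; best = candidates[i]; }
        }
        if (FILE* cfg = std::fopen("tuned_batch_size.txt", "w")) {
            std::fprintf(cfg, "%lu\n", (unsigned long)best);  // re-read this on later runs
            std::fclose(cfg);
        }
        return 0;
    }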

Norman Ramsey
That's a very interesting approach, thanks a lot for that!
Homer J. Simpson
A: 

Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.

Communication between the read thread and write helper(s) consists of a double buffer of two std::vectors per helper (one buffer owned by the writer and one by the reader). The read thread fills the buffer until a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever and adjusts the limit: if the write went faster than last time, increase the limit by X%; if it went slower, decrease it by X%. X can be small.
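
A rough sketch of the adaptive part, using std::chrono in place of gettimeofday (the starting limit and the 10% step are arbitrary; the thread and buffer-swap plumbing is the same as in the writer-thread sketch above):

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // Rough sketch: flush one buffer of (path, contents) pairs, measure the
    // throughput, and nudge the fill limit up or down depending on whether this
    // flush was faster or slower than the previous one.
    static double lastBytesPerSec = 0;
    static std::size_t limitBytes = 10u << 20;  // starting point, ~10 MB
    static const double kStep = 0.10;           // X = 10%, an arbitrary choice

    void flushAndAdapt(const std::vector<std::pair<std::string, std::string> >& buf) {
        std::size_t bytes = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < buf.size(); ++i) {
            if (FILE* f = std::fopen(buf[i].first.c_str(), "wb")) {
                std::fwrite(buf[i].second.data(), 1, buf[i].second.size(), f);
                std::fclose(f);
                bytes += buf[i].second.size();
            }
        }
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
        double rate = bytes / (dt.count() + 1e-9);
        limitBytes = static_cast<std::size_t>(
            limitBytes * (rate >= lastBytesPerSec ? 1.0 + kStep : 1.0 - kStep));
        lastBytesPerSec = rate;
    }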

Potatoswatter