views: 86
answers: 4

Hi,

I'm processing data from one large file on a hard disk (the processing itself is fast, without a lot of overhead) and then have to write the results back as hundreds of thousands of files.

I started by writing the results straight to files, one at a time, which was the slowest option. I found it gets a lot faster if I collect a certain number of the files in a vector and then write them all at once, then go back to processing while the hard disk is busy writing out everything I poured into it (at least that seems to be what happens).
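
Roughly what the batching looks like, as a simplified sketch (the container layout and the flush threshold are just placeholders, not what I claim is optimal):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Simplified sketch of the batching idea: collect (path, contents) pairs
    // while processing, then flush them to disk in one burst.
    struct PendingFile {
        std::string path;
        std::string contents;
    };

    void flushBatch(std::vector<PendingFile>& batch) {
        for (size_t i = 0; i < batch.size(); ++i) {
            FILE* f = std::fopen(batch[i].path.c_str(), "wb");
            if (!f) continue;  // real code should report the error
            std::fwrite(batch[i].contents.data(), 1, batch[i].contents.size(), f);
            std::fclose(f);
        }
        batch.clear();
    }

    // In the processing loop (threshold value is arbitrary here):
    //   batch.push_back(pendingFile);
    //   if (bytesBuffered >= 10 * 1024 * 1024) flushBatch(batch);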

My question is: can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints? To me it seems to be a hard disk buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):

Buffer size      Time (min:sec)
-------------------------------
No buffer        ~ 8:30
 1 MB            ~ 6:15
10 MB            ~ 5:45
50 MB            ~ 7:00

Or is this just a coincidence?

I would also be interested in experience and rules of thumb about how to optimize write performance in general: for example, are larger hard disk blocks helpful, etc.?

Edit:

The hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5" 1 TB / 7200 rpm / 16 MB / USB 2.0, HFS+ journaled, OS is Mac OS X 10.5. I'll soon give it a try on ext3/Linux and with an internal rather than an external disk.

+2  A: 

The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.

That being said, you should also look at optimizing your read access. OSes (at least Windows) are already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't much they can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
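
A minimal sketch of the dedicated-writer-thread variant, assuming a simple mutex/condition-variable queue (the names here are illustrative, not a particular library):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Illustrative sketch: the parsing thread queues (path, contents) jobs and a
    // writer thread drains the queue, so the parser never waits on the disk.
    struct WriteJob {
        std::string path;
        std::string contents;
    };

    std::queue<WriteJob> jobs;
    std::mutex jobsMutex;
    std::condition_variable jobsReady;
    bool done = false;

    void writerThread() {
        for (;;) {
            std::unique_lock<std::mutex> lock(jobsMutex);
            jobsReady.wait(lock, [] { return done || !jobs.empty(); });
            if (jobs.empty() && done) return;
            WriteJob job = jobs.front();
            jobs.pop();
            lock.unlock();  // do the actual write with the lock released
            if (FILE* f = std::fopen(job.path.c_str(), "wb")) {
                std::fwrite(job.contents.data(), 1, job.contents.size(), f);
                std::fclose(f);
            }
        }
    }

    // Producer side (e.g. after std::thread t(writerThread);):
    //   { std::lock_guard<std::mutex> g(jobsMutex); jobs.push(job); }
    //   jobsReady.notify_one();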

Terry Mahaffey
I think I'm pretty close to the wall with read performance: this is SAX parsing of a huge XML file, and certain nodes of that file (after some processing, which is almost nothing) are written to new small files. It's really fast, as I can see when I turn off the file saving.
Homer J. Simpson
+1  A: 

Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.

You might want to use some dirty tricks. Writing hundreds of thousands of files is not going to be efficient with the normal API.

Test this by first writing everything sequentially to a single file rather than 100,000 files, and compare the performance. If the difference is interesting, read on.

If you really understand the file system you're writing to, you can make sure you're writing one contiguous block that you only later split into multiple files in the directory structure.

You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.

[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
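
If the single-file-with-an-index route is an option, here is a minimal sketch of the idea (the file names and record format are made up; any layout that records offsets works):

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative sketch: append every record to one big data file and keep a
    // separate index of (name, offset, length) so individual records can still
    // be looked up later.
    void writePacked(const std::vector<std::pair<std::string, std::string> >& records) {
        FILE* data = std::fopen("records.dat", "wb");
        FILE* index = std::fopen("records.idx", "w");
        if (!data || !index) return;  // real code should report the error
        for (size_t i = 0; i < records.size(); ++i) {
            long offset = std::ftell(data);
            std::fwrite(records[i].second.data(), 1, records[i].second.size(), data);
            std::fprintf(index, "%s %ld %lu\n", records[i].first.c_str(),
                         offset, (unsigned long)records[i].second.size());
        }
        std::fclose(index);
        std::fclose(data);
    }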

Stephan Eggermont
Reading speed is no problem, as I said. The 'split one file' idea is something I had in mind too, but I didn't really know how to implement it; I should probably invest some time into figuring that out.
Homer J. Simpson
Yes, that's the way it is ;-)
Homer J. Simpson
+2  A: 

"Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?"

Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:

  • Which filesystem you're using

  • What disk-scheduling algorithm the kernel is using

  • The hardware characteristics of your disk

  • The hardware interconnect you're using

For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.

If I were you I'd take these two steps:

  • Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.

  • Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file, or even compile it into your code (a sketch of this follows below). This strategy was pioneered by Matteo Frigo for FFTW, and it is remarkably effective.

Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
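
A minimal sketch of that measure-and-pick step, assuming the work can be parameterized by a batch size (the candidate sizes, the dummy workload, and the config file name are all made up for illustration):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Illustrative sketch: time the same write workload with several candidate
    // batch sizes and remember the winner, so later runs can skip the search.
    // The dummy workload just writes ~50 MB of zeros in batches of the given
    // size; in the real program it would be the actual process-then-write loop.
    double timeWorkload(std::size_t batchSizeBytes) {
        const std::size_t totalBytes = 50u << 20;
        std::vector<char> batch(batchSizeBytes, 0);
        auto start = std::chrono::steady_clock::now();
        FILE* f = std::fopen("tune_scratch.bin", "wb");
        if (!f) return 1e30;
        for (std::size_t written = 0; written < totalBytes; written += batch.size())
            std::fwrite(batch.data(), 1, batch.size(), f);
        std::fclose(f);
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
        return dt.count();
    }

    int main() {
        const std::size_t candidates[] = { 1u << 20, 10u << 20, 50u << 20 };  // 1/10/50 MB
        std::size_t best = candidates[0];
        double bestSeconds = 1e30;
        for (std::size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
            double seconds = timeWorkload(candidates[i]);
            if (seconds < bestSeconds) { bestSeconds = seconds; best = candidates[i]; }
        }
        if (FILE* cfg = std::fopen("tuned_batch_size.txt", "w")) {
            std::fprintf(cfg, "%lu\n", (unsigned long)best);  // re-read this on later runs
            std::fclose(cfg);
        }
        return 0;
    }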

Norman Ramsey
That's a very interesting approach, thanks a lot for that!
Homer J. Simpson
A: 

Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.

Communication between the read thread and write helper(s) consists of a double buffer of two std::vectors per helper (one buffer owned by the writer and one by the reader). The read thread fills the buffer until a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever and adjusts the limit: if the write went faster than last time, increase the limit by X%; if it went slower, decrease it by X%. X can be small.
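
A rough sketch of the adaptive part, using std::chrono in place of gettimeofday (the starting limit and the 10% step are arbitrary; the thread and buffer-swap plumbing is the same as in the writer-thread sketch above):

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // Rough sketch: flush one buffer of (path, contents) pairs, measure the
    // throughput, and nudge the fill limit up or down depending on whether this
    // flush was faster or slower than the previous one.
    static double lastBytesPerSec = 0;
    static std::size_t limitBytes = 10u << 20;  // starting point, ~10 MB
    static const double kStep = 0.10;           // X = 10%, an arbitrary choice

    void flushAndAdapt(const std::vector<std::pair<std::string, std::string> >& buf) {
        std::size_t bytes = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < buf.size(); ++i) {
            if (FILE* f = std::fopen(buf[i].first.c_str(), "wb")) {
                std::fwrite(buf[i].second.data(), 1, buf[i].second.size(), f);
                std::fclose(f);
                bytes += buf[i].second.size();
            }
        }
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
        double rate = bytes / (dt.count() + 1e-9);
        limitBytes = static_cast<std::size_t>(
            limitBytes * (rate >= lastBytesPerSec ? 1.0 + kStep : 1.0 - kStep));
        lastBytesPerSec = rate;
    }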

Potatoswatter