Hi,

I'm writing C code with some real-time constraints. I tested out the speed I can write to a disk with dd:

dd if=/dev/zero of=/dev/sdb bs=32K count=32768 oflag=direct

This writes 1 GB of zeros to /dev/sdb in 32K blocks.

I reach about 103 MB/s with this.

Now I programmatically do something similar:

open("/dev/sdb",O_WRONLY|O_CREAT|O_DIRECT|O_TRUNC, 0666);

I take a timestamp, write a 32K buffer to /dev/sdb 10,000 times in a for loop, take another timestamp, and do a bit of number crunching to get the rate in MB/s. It comes out at about 49 MB/s.
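A minimal sketch of a timing loop like the one described above, not the original code. The helper name timed_write_mbps, the 4096-byte buffer alignment, and the choice of CLOCK_MONOTONIC are assumptions made for illustration. The aligned allocation matters because O_DIRECT generally requires the buffer address and transfer size to be aligned to the device's block size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on Linux */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not from the original post): writes `count` blocks of
 * `block_size` bytes to `path` and returns the rate in MB/s (10^6 bytes per
 * second), or -1.0 on any error. `extra_flags` is OR-ed into the open flags,
 * so the caller decides whether to pass O_DIRECT. */
double timed_write_mbps(const char *path, int extra_flags,
                        size_t block_size, long count)
{
    assert(block_size > 0 && count > 0);

    /* Assumption: 4096-byte alignment is enough for the target device. */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, block_size) != 0)
        return -1.0;
    memset(buf, 0, block_size);

    int fd = open(path, O_WRONLY | extra_flags, 0666);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < count; i++) {
        /* Treat a short write as a failure to keep the sketch simple. */
        if (write(fd, buf, block_size) != (ssize_t)block_size) {
            close(fd);
            free(buf);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        secs = 1e-9;          /* guard against a zero interval */
    return (double)block_size * (double)count / 1e6 / secs;
}
```

For the case in the question, the call would look like timed_write_mbps("/dev/sdb", O_DIRECT, 32 * 1024, 10000). Note that 10,000 blocks of 32K is roughly 328 MB, while the dd command above writes 1 GB, so the two runs do not cover the same amount of data.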

Why can't I reach the same speed as dd? An strace reveals the same open command that I use.

+5  A: 

Check what system calls dd makes, not just the open but also the subsequent reads and writes. Using the right buffer sizes can make a significant difference in this kind of large copy. Note that /dev/zero is not a good test for benchmarking if your final goal is a disk-to-disk copy.

If you can't match dd's speed by matching it system call for system call... well, read the source.

Gilles
I am actually interested in writing directly from memory to /dev/sdb, so I feel /dev/zero should work pretty well. Also, what do you mean with regard to the reads and writes? I specify the block size in the command to be 32K.
dschatz
+1 read dd's source. That's why it's there.
Nathon
The source is 2000 lines. Not exactly a magnum opus. Just check it out: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/dd.c. It seems it uses read, write, memcpy, memset. Nothing magic there. It does seem to have a few strategies for reading/writing, though some of those variations seem to be designed for special filesystem/OS requirements.
Merlyn Morgan-Graham
I have looked at it and I don't see any differences:
fd_reopen (STDOUT_FILENO, output_file, O_WRONLY | opts, perms)
nread = iread_fnc (STDIN_FILENO, ibuf, input_blocksize);
size_t nwritten = iwrite (STDOUT_FILENO, obuf, n_bytes_read);
If anyone can tell me what it does that I'm not seeing then I would be grateful.
dschatz
You don't even need `dd`'s source to see the syscalls - that's what `strace` is for
qrdl
@dschatz: again, did you check that your code makes the exact same sequence of `read` and `write` calls (diff their `strace`s)? If they do, and if you're sure you've eliminated any caching effect in your benchmark, then you'll have to study the source harder. Try copying part of the code of `dd` and seeing if that helps. It might not be easy to find the clincher(s)!
Gilles
A: 

I'm leaving the part about matching the system calls to somebody else. This answer is about the buffering part.

Try benchmarking the buffer size you use. Experiment with a range of values.
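One way to run that experiment is sketched below in C, since that is what the question uses. The names write_rate and sweep_block_sizes, the 4096-byte alignment, and the fsync before stopping the clock are all assumptions for illustration, not anything taken from dd. The fsync is there so that a run without O_DIRECT still measures data reaching the device rather than the page cache.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on Linux */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: writes whole blocks of `block_size` bytes to `path`
 * until `total_bytes` would be exceeded, then returns MB/s (10^6 bytes per
 * second), or -1.0 on error. `flags` is OR-ed into the open flags. */
static double write_rate(const char *path, int flags,
                         size_t block_size, size_t total_bytes)
{
    struct timespec t0, t1;
    double secs, rate = -1.0;
    size_t done = 0;
    void *buf = NULL;

    /* Assumption: 4096-byte alignment satisfies O_DIRECT on the target. */
    if (posix_memalign(&buf, 4096, block_size) != 0)
        return -1.0;
    memset(buf, 0, block_size);

    int fd = open(path, O_WRONLY | flags, 0666);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (done + block_size <= total_bytes) {
        if (write(fd, buf, block_size) != (ssize_t)block_size)
            goto out;
        done += block_size;
    }
    if (fsync(fd) != 0)       /* flush anything still cached by the OS */
        goto out;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        secs = 1e-9;          /* guard against a zero interval */
    rate = (double)done / 1e6 / secs;
out:
    close(fd);
    free(buf);
    return rate;
}

/* Measures each entry of `sizes` in turn and stores the result in `rates`.
 * Returns how many of the `n` measurements succeeded. */
int sweep_block_sizes(const char *path, int flags, const size_t *sizes,
                      int n, size_t total_bytes, double *rates)
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        assert(sizes[i] > 0);
        rates[i] = write_rate(path, flags, sizes[i], total_bytes);
        if (rates[i] >= 0.0)
            ok++;
    }
    return ok;
}
```

Calling sweep_block_sizes with sizes such as 4K, 32K, 256K and 1M, the same total byte count for each, and O_DIRECT in flags gives one rate per size that can be compared directly.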

When learning Java, I wrote a simple clone of 'copy' and then tried to match its speed. Since the code did byte-by-byte reads and writes, the buffer size was what really made the difference. I wasn't buffering it myself, but I was asking the read to fetch chunks of a given size. The bigger the chunk, the faster it went, up to a point.

As for using a 32K block size, remember that the OS still uses separate IO buffers for user-mode processes. Even if you are working with specific hardware, e.g. writing a driver for a device with a physical limitation such as a CD-RW drive with fixed sector sizes, the block size is only part of the story. The OS will still have its own buffer too.

Kelly French
The buffer is 32K, which is the same as the block size I use in dd. These translate into the same system calls, so what else is there to experiment with? Also, I open both with the direct flag, so the OS will not buffer it.
dschatz