I have a thread that needs to write data from an in-memory buffer to a disk thousands of times. I have requirements on how long each write may take, because the buffer has to be freed up so a separate thread can write into it again.

I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.

In my application, I noticed I wasn't able to write data to the disk at anywhere near this speed. So I looked into what was happening and found that some writes take very long. My block of code looks like this (it's in C, by the way):

last = get_timestamp();
write();
now = get_timestamp();
if (longest_write < now - last)
  longest_write = now - last;

At the end I print out the longest write time. I found that for a 32K buffer, I am seeing a longest write time of about 47 ms. This is far too long to meet the requirements of my application, and I don't think it can be attributed solely to the rotational latency of the disk. Any ideas what is going on, and what I can do to get more consistent write latencies? Thanks
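
(get_timestamp() isn't shown above; for reference, here is a minimal sketch of how such a measurement might look using clock_gettime(CLOCK_MONOTONIC). The fd, buf, and BLOCK_SIZE names are placeholders for illustration, not the asker's actual code.)

/* Hypothetical sketch: measure the worst-case latency of one 32K write.
   fd is assumed to be the raw device opened with O_DIRECT, and buf must
   be aligned as O_DIRECT requires. */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE (32 * 1024)

static uint64_t longest_write;                 /* worst latency seen, in ns */

static uint64_t get_timestamp(void)            /* returns nanoseconds */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);       /* monotonic: immune to clock adjustments */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* ... inside the writer loop ... */
uint64_t last = get_timestamp();
ssize_t n = write(fd, buf, BLOCK_SIZE);        /* one 32K block per call, like dd bs=32K */
uint64_t now = get_timestamp();
if (n == (ssize_t)BLOCK_SIZE && now - last > longest_write)
    longest_write = now - last;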

Edit: I am in fact using multiple buffers of the type described above and striping them across multiple disks. One solution to my problem would be simply to increase the number of buffers to amortize the cost of the long writes. However, I would like to keep the amount of memory used for buffering as small as possible, to avoid dirtying the cache of the thread that produces the data written into the buffers. So please read my question as being about the variance in the latency of writing a small block to disk, and how to reduce it.

A: 

Are you writing to a new file or overwriting the same file?

The big difference compared with dd is likely to be seek time. dd is streaming to a (mostly) contiguous list of blocks; if you are writing lots of small files, the head may be seeking all over the drive to allocate them.

The best way of solving the problem is likely to be removing the requirement for the log to be written in a specific time. Can you use a set of buffers, so that one is being written (or at least sent to the drive's buffer) while new log data is arriving in another one?

Martin Beckett
I'm writing directly to the disk; there is no filesystem on it. I am writing data in 32K blocks (not files). I understand that there is some rotational latency involved if I am not constantly streaming data to the disk, but I can't imagine that some writes are 100x slower because of this. Correct me if I'm wrong.
dschatz
Even with your solution I would still have a time requirement on my writes. I can always stripe to more disks, but I need to know why an individual write is taking much longer than I expect.
dschatz
Having no filesystem complicates it a bit. One possibility is that the device driver in direct mode does not return until all the data is physically on the drive; with a single huge dd this has no effect, but it might add a lot to each individual write.
Martin Beckett
dd also requires a block size. I set it to 32K, which is the same as what I am doing. This translates into the exact same write() syscall, so there is no "single huge dd". I also doubt that any device driver claims a write is complete only once the data is on disk rather than in its SRAM. And not having a filesystem simplifies things significantly; it doesn't complicate them.
dschatz
If you did dd with, say, count=1000 bs=32K, then there is a single write to disk, and dd will take bytes*rate plus a single wait for the buffers to flush. If you write 1000 individual 32K blocks, there is a commit overhead on each write(). Different OSs do or don't return only once the data is fully committed - it comes up here a lot, with people finding Linux/Windows much slower.
Martin Beckett
There are actually many writes to the disk for a dd invocation like the one you describe; this can be seen by running strace on it. The syscalls are identical between my application and dd.
dschatz
+1  A: 

I have a thread that needs to write data from an in-memory buffer to a disk thousands of times.

I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.

dd's block size is aligned with the file system's block size. I am guessing your log file's isn't.

Plus, your application probably doesn't just write the log file; it does other file operations too. Or your application isn't the only thing using the disk.

Generally, disk I/O isn't optimized for latency; it is optimized for throughput. High latencies are normal - and networked file systems have even higher ones.

In my application, I noticed I wasn't able to write data to the disk at anywhere near this speed. So I looked into what was happening and found that some writes take very long.

Some writes take longer because at some point you saturate the write queue and the OS finally decides to actually flush the data to disk. The I/O queues are configured to be pretty short by default, to avoid excessive buffering and information loss due to a crash.

N.B. If you want to see the real speed, try setting the O_DSYNC flag when opening the file.

If your blocks are really aligned, you might try using the O_DIRECT flag, since that bypasses the Linux disk cache and removes contention (with other applications) at that level. The writes would then work at the real speed of the disk.
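
For illustration, here is a minimal sketch of opening a raw device with both flags and allocating a buffer that satisfies O_DIRECT's alignment rules. The device path is a placeholder and the 4096-byte alignment is an assumption; the real requirement is the device's logical block size.

#define _GNU_SOURCE                 /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE (32 * 1024)

/* Hypothetical helper: open the raw device with O_DIRECT to bypass the page
   cache and O_DSYNC so that write() returns only after the device reports
   the data as written. */
int open_raw_device(const char *path, void **buf)
{
    int fd = open(path, O_WRONLY | O_DIRECT | O_DSYNC);
    if (fd < 0)
        return -1;

    /* O_DIRECT requires the buffer, transfer size, and file offset to be
       aligned to the device's logical block size (often 512 or 4096 bytes). */
    if (posix_memalign(buf, 4096, BLOCK_SIZE) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}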

100 MB/s with dd - without any syncing - is a highly synthetic benchmark, as you never know whether the data has really hit the disk. Try adding oflag=dsync (or conv=fdatasync) to dd's command line.
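
For example (GNU dd assumed, with /dev/sdX standing in for the raw device), a run that bypasses the page cache and waits for each block to reach the device would look something like:

dd if=/dev/zero of=/dev/sdX bs=32K count=10000 oflag=direct,dsync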

Also try using a larger block size. 32K is still small; IIRC 128K was the optimal size when I was testing sequential vs. random I/O a few years ago.

I am seeing a longest write time of about 47 ms.

"Real time" != "fast". If I define max response time of 50ms, and your app consistently responds within the 50ms (47 < 50) then your app would classify as real-time.

I don't think this can be attributed solely to the rotational latency of the disk. Any ideas what is going on, and what I can do to get more consistent write latencies?

I do not think you can avoid the write() delays. Latencies are an inherent property of disk I/O. You can't avoid them - you have to expect and handle them.

I can think of only the following option: use two buffers. The first is being written by write(); the second is used for storing new incoming data from the threads. When write() finishes, switch the buffers and, if there is something to write, start writing it. That way there is always a buffer for the threads to put information into. Overflow might still happen if the threads generate information faster than write() can write it; dynamically adding more buffers (up to some limit) might help in that case.
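
A minimal sketch of that idea, assuming a pthread-based program; the names, the mutex, and the omission of wake-up/overflow handling are simplifications, not a full implementation:

/* Illustrative double buffering: the producer fills 'fill_buf' while the
   writer thread flushes 'flush_buf'; the pointers are swapped once the
   writer finishes.  For O_DIRECT these buffers would also need alignment. */
#include <pthread.h>

#define BLOCK_SIZE (32 * 1024)

static char buf_a[BLOCK_SIZE], buf_b[BLOCK_SIZE];
static char *fill_buf  = buf_a;   /* producer writes into this one  */
static char *flush_buf = buf_b;   /* writer thread flushes this one */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void swap_buffers(void)
{
    pthread_mutex_lock(&lock);
    char *tmp = fill_buf;
    fill_buf  = flush_buf;
    flush_buf = tmp;
    pthread_mutex_unlock(&lock);
}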

Otherwise, you can achieve some sort of real-time-ness for (rotational) disk I/O only if your application is the sole user of the disk. (The old rule of real-time applications applies: there can be only one.) O_DIRECT helps somewhat by removing the influence of the OS itself from the equation. (Though you would still have the overhead of the file system, in the form of occasional delays due to block allocation as the file is extended. Under Linux that works pretty fast, but it can still be avoided by preallocating the whole file in advance, e.g. by writing zeros.) If the timing is really important, consider buying a dedicated disk for the job. SSDs have excellent throughput and do not suffer from seeking.

Dummy00001
As I said in the question, I am not using a filesystem, and I am already using the O_DIRECT flag. This is a real-time constraint: if the write isn't finished by the time I need to write the buffer again, then the data is garbage. Real-time isn't defined by needing <50 ms to complete; it is defined as needing to be completed within a certain time, or the result is worthless or diminished in worth. I'll look at the O_DSYNC flag.
dschatz
@dschatz: O_DIRECT + no file system: then you have a problem with your hardware. E.g. all modern hard drives have write caches too. Pretty much all desktop drives ignore the sync command ("to improve performance", as they say); only the so-called "RAID" or "server-ready" variety really support syncing. As for real-time: you do not mention what write latency you actually need.
Dummy00001
@dschatz: and BTW, do not forget about the limit on the number of IOs you can do. The caching mechanisms exist precisely for that reason: to get above the IO limit. Most drives are not capable of more than 200 IOs per second. IOW, if you write unbuffered in 32K chunks, 200 * 32K = 6.4 MB/s is the limit of your write speed.
Dummy00001
dd is able to get significantly above 6.4MB/s with write blocks of 32K. This is with the direct flag on and no filesystem.
dschatz
@dschatz: That would mean that you are using desktop HDDs, which do not really support syncing of any sort; O_DIRECT is a no-op to them and they buffer on their own. I bet `dd` would have the same latency problem. Try changing the HDDs for server/RAID grade ones. (That is pretty much the only difference between them: the former buffer data as they wish, the latter abide by the host's command to flush data to the platters.)
Dummy00001
+7  A: 

I'm assuming that you are using an ATA or SATA drive connected to the built-in disk controller in a standard computer. Is this a valid assumption, or are you using anything out of the ordinary (hardware RAID controller, SCSI drives, external drive, etc)?

As an engineer who does a lot of disk I/O performance testing at work, I would say that this sounds a lot like your writes are being cached somewhere. Your "high latency" I/O is a result of that cache finally being flushed. Even without a filesystem, I/O operations can be cached in the I/O controller or in the disk itself.

To get a better view of what is going on, record not just your max latency, but your average latency as well. Consider recording your top 10-15 latency samples so you can get a better picture of how (in)frequent these high-latency samples are. Also, throw out the data recorded in the first two or three seconds of your test and start your data logging after that. There can be high-latency I/O operations at the start of a disk test that aren't indicative of the disk's true performance (caused by things like the disk having to spin up to full speed, the head having to do a large initial seek, the disk write cache being flushed, etc).
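
For instance, a small bookkeeping sketch along those lines; the sample count, the 3-second warm-up cutoff, and the function names are arbitrary choices for illustration:

/* Illustrative latency bookkeeping: skip a warm-up period, keep a running
   average, and remember the N worst samples seen so far. */
#include <stdint.h>

#define N_WORST 15

static uint64_t worst[N_WORST];          /* worst latencies, in ns, unsorted  */
static uint64_t total_ns, count;         /* average latency = total_ns / count */

void record_latency(uint64_t latency_ns, uint64_t elapsed_since_start_ns)
{
    if (elapsed_since_start_ns < 3000000000ULL)   /* ignore the first ~3 seconds */
        return;

    total_ns += latency_ns;
    count++;

    /* Replace the smallest stored "worst" sample if this one is larger. */
    int min_i = 0;
    for (int i = 1; i < N_WORST; i++)
        if (worst[i] < worst[min_i])
            min_i = i;
    if (latency_ns > worst[min_i])
        worst[min_i] = latency_ns;
}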

If you want to benchmark disk I/O performance, I would recommend using a tool like IOMeter instead of using dd or rolling your own. IOMeter makes it easy to see what kind of a difference it makes to change the I/O size, alignment, etc., plus it keeps track of a number of useful statistics.

Requiring an I/O operation to happen within a certain amount of time is a risky thing to do. For one, other applications on the system can compete with you for disk access or CPU time, and it is nearly impossible to predict their exact effect on your I/O speeds. Your disk might encounter a bad block, in which case it has to do some extra work to remap the affected sectors before processing your I/O. This introduces an unpredictable delay. You also can't control what the OS, driver, and disk controller are doing. Your I/O request may get backed up in one of those layers for any number of unforeseeable reasons.

If the only reason you have a hard limit on I/O time is because your buffer is being re-used, consider changing your algorithm instead. Try using a circular buffer so that you can flush data out of it while writing into it. If you see that you are filling it faster than flushing it, you can throttle back your buffer usage. Alternatively, you can create multiple buffers and cycle through them. When one buffer fills up, write that buffer to disk and switch to the next one. You can be filling the new buffer even while the first is still being written to disk.
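
A rough sketch of the rotating-buffer variant; the buffer count, the names, and the simplistic index-based handshake (real code would need atomics or a lock) are assumptions for illustration:

/* Illustrative buffer rotation: the producer fills buffers in order and the
   writer flushes them in the same order.  If the producer catches up to the
   writer, data is being generated faster than the disk can absorb it. */
#include <stdbool.h>

#define NBUF        4
#define BLOCK_SIZE  (32 * 1024)

static char bufs[NBUF][BLOCK_SIZE];
static unsigned head;                /* next buffer the producer will fill */
static unsigned tail;                /* next buffer the writer will flush  */

bool producer_can_fill(void)  { return head - tail < NBUF; }   /* a free slot exists */
char *next_fill_buffer(void)  { return bufs[head % NBUF]; }
void producer_done(void)      { head++; }      /* hand the buffer to the writer */

bool writer_has_work(void)    { return tail != head; }
char *next_flush_buffer(void) { return bufs[tail % NBUF]; }
void writer_done(void)        { tail++; }      /* release the buffer */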

Response to comment: You can't really "get the kernel out of the way"; it's the lowest level in the system and you have to go through it to one degree or another. You might be able to build a custom version of the driver for your disk controller (provided it's open source) and build in a "high-priority" I/O path for your application to use. You are still at the mercy of the disk controller's firmware and the firmware/hardware of the drive itself, which you can't necessarily predict or do anything about.

Hard drives traditionally perform best when doing large, sequential I/O operations. Drivers, device firmware, and OS I/O subsystems take this into account and try to group smaller I/O requests together so that they only have to generate a single, large I/O request to the drive. If you are only flushing 32K at a time, then your writes are probably being cached at some level, coalesced, and sent to the drive all at once. By defeating this coalescing, you should reduce the number of I/O latency "spikes" and see more uniform disk access times. However, these access times will be much closer to the large times seen in your "spikes" than the moderate times that you are normally seeing.

The latency spike corresponds to an I/O request that didn't get coalesced with any others and thus had to absorb the entire overhead of a disk seek. Request coalescing is done for a reason; by bundling requests you are amortizing the overhead of a drive seek operation over multiple commands. Defeating coalescing leads to doing more seek operations than you would normally, giving you overall slower I/O speeds. It's a trade-off: you reduce your average I/O latency at the expense of occasionally having an abnormal, high-latency operation. It is a beneficial trade-off, however, because the increase in average latency associated with disabling coalescing is nearly always more of a disadvantage than having a more consistent access time is an advantage.

I'm also assuming that you have already tried adjusting thread priorities, and that this isn't a case of your high-bandwidth producer thread starving out the buffer-flushing thread for CPU time. Have you confirmed this?

You say that you do not want to disturb the high-bandwidth thread that is also running on the system. Have you actually tested various output buffer sizes/quantities and measured their impact on the other thread? If so, please share some of the results you measured so that we have more information to use when brainstorming.

Given the amount of memory that most machines have, moving from a 32K buffer to a system that rotates through 4 32K buffers is a rather inconsequential jump in memory usage. On a system with 1GB of memory, the increase in buffer size represents only 0.0092% of the system's memory. Try moving to a system of alternating/rotating buffers (to keep it simple, start with 2) and measure the impact on your high-bandwidth thread. I'm betting that the extra 32K of memory isn't going to have any sort of noticeable impact on the other thread. This shouldn't be "dirtying the cache" of the producer thread. If you are constantly using these memory regions, they should always be marked as "in use" and should never get swapped out of physical memory. The buffer being flushed must stay in physical memory for DMA to work, and the second buffer will be in memory because your producer thread is currently writing to it. It is true that using an additional buffer will reduce the total amount of physical memory available to the producer thread (albeit only very slightly), but if you are running an application that requires high bandwidth and low latency then you would have designed your system such that it has quite a lot more than 32K of memory to spare.

Instead of solving the problem by trying to force the hardware and low-level software to perform to specific performance measurements, the easier solution is to adjust your software to fit the hardware. If you measure your max write latency to be 1 second (for the sake of nice round numbers), write your program such that a buffer that is flushed to disk will not need to be re-used for at least 2.5-3 seconds. That way you cover your worst-case scenario, plus provide a safety margin in case something really unexpected happens. If you use a system where you rotate through 3-4 output buffers, you shouldn't have to worry about re-using a buffer before it gets flushed. You aren't going to be able to control the hardware too closely, and if you are already writing to a raw volume (no filesystem) then there's not much between you and the hardware that you can manipulate or eliminate. If your program design is inflexible and you are seeing unacceptable latency spikes, you can always try a faster drive. Solid-state drives don't have to "seek" to do I/O operations, so you should see a fairly uniform hardware I/O latency.

bta
Thanks for your response. I did further testing and noticed that it is generally only a select few large write latencies that affect my machine. I have a very high-bandwidth output from one thread on the machine. I want to disturb the cache of that thread as little as possible, so I don't want to use a large amount of memory to buffer the data; instead I only buffer into a small area and then write that to disk before the buffer needs to be written to again (in the meantime, the data producer will use another buffer). Are there ways to get the kernel out of my way to prevent high latencies?
dschatz
Thanks for the response to my previous comment. Unfortunately, I can't measure the effect that increasing the number of buffers has on the producer thread, due to the nature of the application. I have found that I need on the order of 100 32K buffers to amortize the cost of a long write. This amount of memory is a significant part of the CPU's cache and should have an effect on the operation of an application on the producing CPU. I am pinning the producer and writer(s) to separate CPUs, so I don't have any concerns with regard to CPU resources.
dschatz
+2  A: 

As long as you are using O_DIRECT | O_SYNC, you can use ioprio_set() to set the IO scheduling priority of your process/thread (although the man page says "process", I believe you can pass a TID as given by gettid()).

If you set a real-time IO class, then your IO will always be given first access to the disk - it sounds like this is what you want.
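
For reference, a sketch of what that might look like on Linux; glibc provides no ioprio_set() wrapper, so syscall() is used, and the constants below are taken from the ioprio_set(2) man page (granting the real-time class requires CAP_SYS_ADMIN, and at the time it was honored by the CFQ I/O scheduler):

/* Illustrative: put the calling thread/process into the real-time I/O
   scheduling class at the highest priority level (0). */
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_RT     1
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))

int set_rt_ioprio(void)
{
    /* who == 0 means "the calling thread/process" */
    return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                   IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
}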

caf
Thanks, I'll check this out.
dschatz