views:

190

answers:

7

At first glance it seems like a good idea to let the hard disk write to RAM on its own, without CPU instructions copying data, particularly with the success of asynchronous networking in mind. But the Wikipedia article on DMA states this:

With DMA, the CPU gets freed from this overhead and can do useful tasks during data transfer (though the CPU bus would be partly blocked by DMA).

I don't understand how a bus line can be "partly blocked". Presumably memory can be accessed by only one device at a time, so it seems there is little useful work the CPU can actually do. It would be blocked on the first attempt to read uncached memory, which I expect happens very quickly with a 2 MB cache.

The goal of freeing up the CPU to do other tasks seems gratuitous. Does hard disk DMA foster any performance increase in practice?

+2  A: 

Disk controllers often have special block transfer instructions that enable fast data transfers. They may also transfer data in bursts, permitting interleaved CPU bus access. CPUs also tend to access memory in bursts, with the cache controller filling cache lines as they become available. So even though the CPU may be blocked on the bus, the end result is simply that cache usage drops; the CPU doesn't actually stall.

TMN
As I write this, only 2 MB of my 390 MB memory usage resides in my Core Duo's L2 cache. I'd think the CPU would stall rather quickly. Hitting the cache and getting a performance boost is one thing; missing the cache and stalling completely is another.
henle
A: 

Processing doesn't happen on the CPU bus anyway. CPUs issue instructions that might or might not touch memory. When they do, they're typically resolved first against L1 cache, and then L2 and L3 before memory is tried. Therefore, DMA transfers don't block processing.

Even when the CPU and the DMA transfer would both need memory, it's expected that they will not access the same bytes in memory. A memory controller might in fact be able to process both requests at the same time.

MSalters
"A memory controller might be able to process both requests at the same time." It might, but do you have a source on this?
henle
+3  A: 

One possible performance increase can come from the fact that a computer can have multiple DMA devices. So with DMA you can have multiple memory reads occurring in parallel without the CPU having to perform all the overhead.

Reason Enough
+3  A: 

I don't understand how a bus line can be "partly blocked"

Over a period of many clock cycles, some will be blocked and some will not. Quoting the University of Melbourne:

Q2. What is cycle stealing? Why are there cycles to steal?

A2. When a DMA device transfers data to or from memory, it will (in most architectures) use the same bus as the CPU would use to access memory. If the CPU wants to use the bus at the same time as a DMA device, the CPU will stall for a cycle, since the DMA device has the higher priority. This is necessary to prevent overruns with small DMA buffers. (The CPU never suffers from overruns.)

Most modern CPUs have caches that satisfy most memory references without having to go to main memory through the bus. DMA will therefore have much less impact on them.

Even if the CPU is completely starved while a DMA block transfer is occurring, it will happen faster than if the CPU had to sit in a loop shifting bytes to/from an I/O device.

Hugh Allen
So what you are saying is that the DMA controller shifts bytes in a loop faster than the CPU shifts bytes in a loop?
henle
@henle: He is using *"shift"* to mean *"transfer;"* it has nothing to do with binary shifts. See [Gonzalo's answer](http://stackoverflow.com/questions/3716826/what-is-the-purpose-of-hard-disk-direct-memory-access/3772593#3772593), which I believe is much more clear.
BlueRaja - Danny Pflughoeft
Actually I was also talking about transfer, not binary shift.
henle
+2  A: 

I don't know if I'm missing anything.

Let's suppose we don't have a DMA controller. For the CPU, every transfer from the "slow" devices to memory would be a loop:

ask_for_a_block_to_device 
wait_until_device_answer (or change_task_and_be_interrupted_when_ready)
write_to_memory

So the CPU would have to write to memory itself, chunk by chunk.
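To make that loop concrete, here is a minimal sketch that simulates it in Python. Nothing here talks to real hardware: `FakeDisk`, `read_port` and `pio_read` are hypothetical names invented for illustration, and the device is assumed to always be ready, which a real disk would not be.

```python
class FakeDisk:
    """Hypothetical device that hands out one data word at a time via a 'port'."""

    def __init__(self, data: bytes, word_size: int = 4):
        self._data = data
        self._word_size = word_size
        self._pos = 0

    def ready(self) -> bool:
        # Assumption: this fake device is ready as long as data remains.
        return self._pos < len(self._data)

    def read_port(self) -> bytes:
        # Return the next word and advance the device's internal position.
        word = self._data[self._pos:self._pos + self._word_size]
        self._pos += self._word_size
        return word


def pio_read(disk: FakeDisk, memory: bytearray) -> int:
    """Copy all data from the disk into memory, one word per loop iteration.

    Returns the number of port reads the CPU had to perform itself.
    """
    port_reads = 0
    offset = 0
    while disk.ready():
        word = disk.read_port()                    # CPU fetches the word
        memory[offset:offset + len(word)] = word   # CPU stores it in memory
        offset += len(word)
        port_reads += 1
    return port_reads


sector = bytes(range(256)) * 2      # 512 bytes of sample data
ram = bytearray(len(sector))
reads = pio_read(FakeDisk(sector), ram)
print(reads)                        # 128 port reads for a 512-byte sector
```

Every one of those iterations occupies the CPU, which is the work a DMA controller takes over.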

Is it necessary to use the CPU for memory transfers? No. We use another device (or a mechanism like DMA bus mastering) which transfers data to/from memory.

Meanwhile, the CPU could be doing something different: working out of its cache, or even accessing other parts of memory for a great share of the time.

This is the crucial part: data is not being transferred 100% of the time, because the other device is very slow (compared to memory and the CPU).

Here is an attempt to represent the shared memory bus usage (C when accessed by the CPU, D when accessed by DMA):

Memory Bus ----CCCCCCCC---D----CCCCCCCCCDCCCCCCCCC----D

As you can see, memory is accessed by one device at a time: sometimes by the CPU, sometimes by the DMA controller, and by the DMA controller only rarely.
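A timeline like the one above can be generated with a toy model. This is only a sketch under made-up assumptions: the device has a word ready every `dma_period` cycles, the CPU misses its cache every `cpu_miss_period` cycles, and DMA wins whenever both want the bus in the same cycle. The function name and both parameters are hypothetical.

```python
def bus_timeline(cycles: int, dma_period: int, cpu_miss_period: int):
    """Return (timeline, stalled_cycles) for a single shared memory bus.

    The timeline has one character per cycle:
    'D' = DMA uses the bus, 'C' = CPU uses the bus, '-' = bus idle.
    """
    timeline = []
    cpu_waiting = False
    stalled = 0
    for t in range(cycles):
        if t % cpu_miss_period == 0:
            cpu_waiting = True            # cache miss: CPU now needs the bus
        if t % dma_period == dma_period - 1:
            timeline.append("D")          # DMA has priority this cycle
            if cpu_waiting:
                stalled += 1              # CPU loses this cycle to DMA
        elif cpu_waiting:
            timeline.append("C")          # CPU gets the bus
            cpu_waiting = False
        else:
            timeline.append("-")          # nobody needs the bus
    return "".join(timeline), stalled


line, stalls = bus_timeline(cycles=20, dma_period=4, cpu_miss_period=3)
print(line)     # C--DC-CD-C-DC--DC-CD
print(stalls)   # 2
```

In this run the CPU is delayed on only 2 of 20 cycles, and each time it gets the bus on the very next cycle. With a realistically slow disk, `dma_period` would be far larger and the D cycles rarer still.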

Gonzalo
+1 for the only answer that addresses henle's misunderstanding of how systems *without* DMA work (which makes the necessity of DMA very clear), and which also answers henle's main question
BlueRaja - Danny Pflughoeft
It should also be mentioned that DMA devices tend to write memory in bursts rather than continuous streams, which accounts (along with the fact that it's *"slower"*) for the long times between reads/writes in the above diagram.
BlueRaja - Danny Pflughoeft
However, this is also possible: (C = CPU) (P = CPU transfer) `CCCCCCCCCCPCCCCCCCCCCPCCCCCCCCCCC`
henle
A: 

If you're using Linux, you can test this very easily by disabling DMA with hdparm. The effect is dramatic.

Chris
+3  A: 

1: PIO (programmed IO) thrashes the CPU caches. The data read from the disk will, most of the time, not be processed immediately afterwards. Data is often read in large chunks by the application, but PIO is done in smaller blocks (typically 64K IIRC). So the data-reading application will wait until the large chunk has been transferred, and not benefit from the smaller blocks being in the cache just after they have been fetched from the controller. Meanwhile other applications will suffer from large parts of the cache being evicted by the transfer. This could probably be avoided by using special instructions which tell the CPU not to cache data but to write it "directly" to main memory; however, I'm pretty certain that this would slow down the copy loop and thereby hurt even more than the cache thrashing.

2: PIO, as it's implemented on x86 systems, and probably most other systems, is really slow compared to DMA. The problem is not that the CPU wouldn't be fast enough. The problem stems from the way the bus and the disk controller's PIO modes are designed. If I'm not mistaken, the CPU has to read every byte (or every DWORD when using 32 bit PIO modes) from a so-called IO port. That means for every DWORD of data, the port's address has to be put on the bus, and the controller must respond by putting the data DWORD on the bus. Whereas when using DMA, the controller can transfer bursts of data, utilizing the full bandwidth of the bus and/or memory controller. Of course there is much room for optimizing this legacy PIO design. DMA transfers are such an optimization. Other solutions that would still be considered PIO might be possible too, but then again they would still suffer from other problems (e.g. the cache thrashing mentioned above).
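To put rough numbers on the per-DWORD overhead, the sketch below counts bus transactions for one block under each scheme. The 64-byte burst size is an assumption picked for illustration, not a figure from any particular controller, and both function names are hypothetical.

```python
def pio_transactions(total_bytes: int, word_bytes: int = 4) -> int:
    """PIO: one port read (address on the bus, data back) per word."""
    return -(-total_bytes // word_bytes)     # ceiling division


def dma_transactions(total_bytes: int, burst_bytes: int = 64) -> int:
    """DMA: one bus transaction per burst; burst_bytes is an assumed size."""
    return -(-total_bytes // burst_bytes)    # ceiling division


block = 64 * 1024                            # one 64K block
print(pio_transactions(block))               # 16384
print(dma_transactions(block))               # 1024
```

Under these assumptions the burst transfer needs 16 times fewer bus transactions for the same data, before even counting the CPU time spent executing the copy loop.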

3: Memory- and/or bus-bandwidth is not the limiting factor for most applications, so the DMA transfer will not stall anything. It might slow some applications down a little, but usually it should be hardly noticeable. After all disks are rather slow compared with the bandwidth of the bus and/or memory controller. A "disk" (SSD, RAID array) that delivers > 500 MB/s is really fast. A bus or memory subsystem that cannot at least deliver 10 times that number must be from the stone ages. OTOH PIO really stalls the CPU completely while it's transferring a block of data.
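The arithmetic behind that claim is short enough to spell out. The 5000 MB/s bus figure below is simply "10 times 500 MB/s" taken from the paragraph above, not a measured value, and `bus_share` is a hypothetical helper.

```python
def bus_share(device_mb_per_s: float, bus_mb_per_s: float) -> float:
    """Fraction of bus bandwidth a device consumes at full speed."""
    return device_mb_per_s / bus_mb_per_s


share = bus_share(500, 5000)
print(f"{share:.0%}")    # 10%
```

So even a very fast disk running flat out would leave about 90% of such a bus to the CPU.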

pgroke
Ok, this is the best answer yet. It makes sense that DMA is an optimization which could in theory be implemented without a special controller.
henle