How to mitigate host + device memory tranfer bottlenecks in OpenCL/CUDA

There are a couple things you can try to mitigate the PCIe bottleneck:

Asynchronous transfers - permits overlapping computation and bulk transfer
Mapped memory - allows a kernel to stream data to/from the GPU during execution

Note that neither of these techniques makes the transfer go faster, they just reduce the time the GPU is waiting on the data to arrive.

With the cudaMemcpyAsync API function you can initiate a transfer, launch one or more kernels that do not depend on the result of the transfer, synchronize the host and device, and then launch kernels that were waiting on the transfer to complete. If you can structure your algorithm such that you're doing productive work while the transfer is taking place, then asynchronous copies are a good solution.

With the cudaHostAlloc API function you can allocate host memory that can read and written directly from the GPU. The reason this is faster is that a block that needs host data only needs to wait for a small portion of the data to be transferred. In contrast, the usual approach makes all blocks wait until the entire transfer is complete. Mapped memory essentially breaks a big monolithic transfer into a bunch or smaller copy operations, so the latency is reduced.

You can read more about these topics in Section 3.2.6-3.2.7 of the CUDA Programming Guide and Section 3.1 of the CUDA Best Practices Guide. Chapter 3 of the OpenCL Best Practices Guide explains how to use these features in OpenCL.

ansaurus

tags:

views:

answers:

How to mitigate host + device memory tranfer bottlenecks in OpenCL/CUDA

related questions