views: 1130

answers: 3
I'm trying to figure out a way to allocate a block of memory that is accessible by both the host (CPU) and the device (GPU). Other than using the cudaHostAlloc() function to allocate page-locked memory that is accessible to both the CPU and GPU, are there any other ways of allocating such blocks of memory? Thanks in advance for your comments.
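
For context, the cudaHostAlloc() approach I mean looks roughly like this (a minimal sketch, error checking omitted):

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;
        float *h_pinned = NULL, *d_buf = NULL;

        // Page-locked (pinned) host allocation: the CPU can use h_pinned
        // directly, and DMA transfers to the device are faster than from
        // pageable memory.
        cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
        cudaMalloc((void**)&d_buf, bytes);

        // Without a mapped allocation, the device still needs an explicit copy.
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
        return 0;
    }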

+2  A: 

The only way for the host and the device to "share" memory is the newer zero-copy functionality. This is available on GT200-architecture cards and some newer laptop cards. As you note, this memory must be allocated with cudaHostAlloc (with the cudaHostAllocMapped flag) so that it is page-locked and mapped into the device's address space. There is no alternative, and even this functionality is not available on older CUDA-capable cards.
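
For what it's worth, a minimal sketch of the zero-copy setup looks like this (error checking omitted; note that cudaSetDeviceFlags must be called before the CUDA context is created):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;   // each access crosses the bus to host RAM
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        if (!prop.canMapHostMemory) return 1;     // device cannot do zero-copy

        cudaSetDeviceFlags(cudaDeviceMapHost);    // must precede context creation

        const int n = 1024;
        float *h_ptr = NULL, *d_ptr = NULL;
        cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);  // device alias

        for (int i = 0; i < n; ++i) h_ptr[i] = (float)i;
        scale<<<(n + 255) / 256, 256>>>(d_ptr, n);
        cudaDeviceSynchronize();                  // make kernel writes visible

        printf("h_ptr[1] = %f\n", h_ptr[1]);      // expect 2.0
        cudaFreeHost(h_ptr);
        return 0;
    }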

If you're just looking for an easy (possibly non-performant) way to manage host-to-device transfers, check out the Thrust library. Its vector class (thrust::device_vector) allocates memory on the device but lets you read and write elements from host code as if they were on the host.
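
A minimal sketch of the Thrust approach (note that every host-side element access triggers an implicit transfer, which is why it's possibly non-performant):

    #include <thrust/device_vector.h>
    #include <cstdio>

    int main() {
        thrust::device_vector<float> d_vec(4, 1.0f);  // storage lives on the GPU

        d_vec[2] = 5.0f;     // host write: Thrust issues a copy under the hood
        float x = d_vec[2];  // host read: another implicit transfer

        printf("%f\n", x);   // prints 5.0
        return 0;
    }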

Another alternative is to write your own wrapper that manages the transfers for you.
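
For example, a hand-rolled wrapper might look something like this; the class name and interface here are made up for illustration:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Hypothetical helper that pairs a host buffer with a device buffer and
    // makes the transfers explicit. Error checking omitted for brevity.
    template <typename T>
    class MirroredBuffer {
    public:
        explicit MirroredBuffer(size_t n) : n_(n), h_(new T[n]) {
            cudaMalloc((void**)&d_, n * sizeof(T));
        }
        ~MirroredBuffer() { cudaFree(d_); delete[] h_; }

        T* host()   { return h_; }   // use on the CPU
        T* device() { return d_; }   // pass to kernels

        void toDevice() { cudaMemcpy(d_, h_, n_ * sizeof(T), cudaMemcpyHostToDevice); }
        void toHost()   { cudaMemcpy(h_, d_, n_ * sizeof(T), cudaMemcpyDeviceToHost); }

    private:
        size_t n_;
        T *h_;
        T *d_;
    };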

Eric
A: 

No, there is no automatic way of uploading buffers to GPU memory.

fabrizioM
No, but the question is about memory that is accessible from both the host and the device. Zero-copy (pinned host memory) provides this in newer versions of CUDA. It is apparently quite useful where the GPU is integrated into the chipset and is using system memory as GPU memory. For discrete GPUs (i.e., plugged into a PCIe slot), zero-copy incurs a bus transfer.
mch
+1  A: 

There is no way to allocate a buffer that is accessible by both the GPU and the CPU unless you use cudaHostAlloc(). This is because you must not only allocate the pinned memory on the CPU (which you could do outside of CUDA), but also map the memory into the GPU's (or, more specifically, the context's) virtual address space.

It's true that on a discrete GPU zero-copy does incur a bus transfer. However, if your accesses are nicely coalesced and you only consume the data once, it can still be efficient, since the alternative is to transfer the data to the device and then read it into the multiprocessors, in two stages.
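
To illustrate, a kernel like this one consumes each element of a mapped buffer exactly once with coalesced reads, so every value crosses the bus a single time instead of being staged through device memory first (sketch only; assumes 256-thread blocks and a device pointer obtained via cudaHostGetDevicePointer):

    // Per-block partial sums over zero-copy memory: d_in is a mapped host
    // pointer, so each coalesced read crosses the bus exactly once.
    __global__ void blockSums(const float *d_in, float *d_block, int n) {
        __shared__ float partial[256];           // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        partial[threadIdx.x] = (i < n) ? d_in[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) d_block[blockIdx.x] = partial[0];
    }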

Tom