views:

251

answers:

3

The CUDA programming guide states that

"Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth."

It goes on to calculate theoretical bandwidth which is in the order of hundreds of gigabytes per second. I am at a loss as to why how many bytes one can read/write to global memory is a reflection of how well optimised a kernel is.

If I have a kernel which does intensive computation on data stored in shared memory and/or registers, with only a single read at the start and write out at the end from and to global memory, surely the effective bandwidth will be small, while the kernel itself may be very efficient.

Could any one further explain bandwidth in this context?

Thanks

A: 

Typically kernels are fairly small and simple and perform the same operation on a lot of data. You might have a bunch of kernels that you invoke in sequence to perform some more complex operation (think of it as a processing pipeline). Obviously the throughput of your pipeline will depend both on how efficient your kernels are and whether you are limited by memory bandwidth in any way.

Paul R
+1  A: 

most all nontrivial computational kernels, in CPU and GPU land, memory bound. GPU has very high computational intensity and throughput, but access to main memory is very slow and has high latency, few hundred cycles per read/store versus four cycles for mmany arithmetic operations.

It sounds like your kernel is computation bound, so your luck. However you still have to watch out for shared memory bank conflict, which can serialize portions of code unexpectedly.

aaa
+1  A: 

Most kernels are memory bound so maximising memory throughput is critical. If you're lucky enough to have a compute bound kernel then optimizing for computation is generally easier. You do need to look out for divergence and you should still ensure you have enough threads to hide memory latency.

Check out the Advanced CUDA C presentation for more information, including some tips for how to compare your realised performance with theoretical performance. The CUDA Best Practices Gude also has some good information, it's available as part of the CUDA toolkit (download from the NVIDIA site).

Tom