tags:
views: 285
answers: 1

I'm trying to harness the power of the GPU (NVIDIA Quadro NVS 140M) to speed up some matrix computations in my project. I'm reading through the documentation (programming guide, best practices guide, and reference manual), but I'm not sure which sections I should focus on. It would be great to receive some advice on this.

Also, I'm wondering if there are third-party maintained SDKs, such as CuBLAS.net, that might simplify CUBLAS development, before I commit to the features CUBLAS itself offers for achieving my project's goals. Again, thanks in advance for the comments.

+2  A: 

Most of the documentation that comes with the CUDA toolkit & SDK downloads is about CUDA generally, not CUBLAS specifically. Start with the CUBLAS_Library_2.3.pdf file if you're just going to use CUBLAS--you won't need to write your own CUDA kernels. If you're already using a CPU BLAS, CUBLAS shouldn't be difficult to pick up. (And if you're not, consider trying an optimized CPU BLAS before CUBLAS, since it will be easier to program.)

If you're coding on .NET, then the easiest way to use CuBLAS is probably via platform-invoke calls into cublas.dll. Be sure to keep straight which arrays are in host (CPU) memory, and which are in device (GPU) memory.
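To make the host/device distinction concrete, here is a minimal sketch using the legacy CUBLAS 2.x C API (the routines documented in CUBLAS_Library_2.3.pdf--check cublas.h in your toolkit for the exact signatures; error checking is omitted for brevity). These are the same entry points a .NET wrapper would bind via platform invoke:

```c
/* Sketch of the host/device memory distinction, legacy CUBLAS 2.x API.
 * Requires an NVIDIA GPU and the CUDA toolkit; error checks omitted. */
#include "cublas.h"

int main(void)
{
    const int n = 1024;
    float host_x[1024];      /* lives in host (CPU) memory */
    float *dev_x = 0;        /* will point into device (GPU) memory */
    int i;

    for (i = 0; i < n; ++i) host_x[i] = (float)i;

    cublasInit();

    /* Allocate n floats on the device and copy the host array across. */
    cublasAlloc(n, sizeof(float), (void **)&dev_x);
    cublasSetVector(n, sizeof(float), host_x, 1, dev_x, 1);

    /* dev_x can now be passed to CUBLAS routines; host_x cannot.
     * Mixing the two up is a common early bug. */
    cublasSscal(n, 2.0f, dev_x, 1);

    /* Copy the result back to the host and clean up. */
    cublasGetVector(n, sizeof(float), dev_x, 1, host_x, 1);
    cublasFree(dev_x);
    cublasShutdown();
    return 0;
}
```

From .NET, you would declare each of these functions with a DllImport attribute against cublas.dll and pass the device pointer around as an IntPtr.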

Keep in mind that CUDA & CuBLAS aren't magic bullets. Performance depends on a lot of factors (especially transfers across the PCIe bus), and simply swapping CUBLAS calls for CPU-BLAS calls may not give you speedups. You may have to make more substantial changes to your own code to get performance improvements. Those other guides you mention are very useful for understanding the CUDA architecture and its bottlenecks.

EDIT: I wasn't clear about the boundary between user code and kernel code. CUBLAS is a library of pre-built, optimized CUDA kernels. If you only need BLAS functionality, you do not need to write your own kernels. Instead, just call CUBLAS functions. When performance tuning, you shouldn't need to tweak the CUBLAS kernels, but you may need to change how and when you call them, and how you use memory, so as to minimize the number of transfers across the PCI express bus.
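As an illustration of that last point, here is a hedged sketch (legacy CUBLAS 2.x API, hypothetical helper name, init/shutdown assumed done by the caller) of keeping operands resident on the GPU across repeated SGEMM calls, so the PCIe transfers are paid once rather than once per multiplication:

```c
/* Sketch: repeatedly apply A (n x n) to X (n x m) on the GPU,
 * uploading once and downloading once. Error checks omitted. */
#include "cublas.h"

void repeated_multiply(const float *A, float *X, int n, int m, int steps)
{
    float *dA, *dX, *dY, *tmp;
    int s;

    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * m, sizeof(float), (void **)&dX);
    cublasAlloc(n * m, sizeof(float), (void **)&dY);

    /* One upload each, regardless of how many multiplications follow. */
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, m, sizeof(float), X, n, dX, n);

    for (s = 0; s < steps; ++s) {
        /* dY = 1.0 * dA * dX + 0.0 * dY; all operands stay on the GPU. */
        cublasSgemm('N', 'N', n, m, n, 1.0f, dA, n, dX, n, 0.0f, dY, n);
        tmp = dX; dX = dY; dY = tmp;   /* ping-pong device buffers */
    }

    /* One download at the end. */
    cublasGetMatrix(n, m, sizeof(float), dX, n, X, n);

    cublasFree(dA); cublasFree(dX); cublasFree(dY);
}
```

The naive alternative--uploading A and X and downloading the result inside the loop--would often be slower than the CPU BLAS for modest matrix sizes, which is exactly the bus-transfer pitfall described above.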

Gabriel
I'm already using the Intel Math Kernel Library, which is a CPU BLAS according to my understanding. I may need to modify the algorithm a bit using CUDA kernels, etc.
stanigator