I have an existing MFC application that performs matrix computation using CPU-optimized BLAS libraries. I'm interested in adding CUBLAS computational functionality to my project, but I have the following two questions:

1) I'm not sure whether I would need to specify my own CUDA kernel, thread, and block configurations at this point. If so, which parts of the architecture would you recommend paying the most attention to when modifying the algorithm?

2) I'm interested in either (a) creating a new Visual Studio project with CUBLAS features in the program, or (b) integrating CUBLAS capabilities into an existing MFC project. However, I'm having trouble configuring the Visual Studio project to work with the CUDA SDK properly, other than by following a guide like this, which may not work if I'm trying to integrate this with an existing project. What would be your recommendations on this?

Thanks in advance for the comments.

+2  A: 

1) If you're just going to use CUBLAS, you do not need to bother writing your own kernel. CUBLAS is a collection of kernels and C wrappers packaged into a library and designed to be easily callable from other programs. Understanding the details of CUDA threads, blocks, etc., is not necessary to use CUBLAS.

2) Since you don't need to write your own kernels, you don't need to build a separate Visual Studio project with the CUDA SDK. You only need to make function calls from your existing project into cublas.dll (or cublas.lib for static linking). Chapter 1 of CUBLAS_Library_2.3.pdf in the CUDA Toolkit contains an example program showing how to call CUBLAS from C.
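To make that concrete, here is a minimal sketch of what such a call into the library might look like, using the legacy CUBLAS C API of that era (link against cublas.lib and include cublas.h from the Toolkit). The 2x2 matrices and the SGEMM call are just illustrative; see the example in the CUBLAS manual for the authoritative version:

```c
#include <stdio.h>
#include "cublas.h"   /* legacy CUBLAS API from the CUDA Toolkit */

int main(void)
{
    const int N = 2;
    /* column-major 2x2 matrices on the host */
    float hA[4] = {1.f, 2.f, 3.f, 4.f};
    float hB[4] = {5.f, 6.f, 7.f, 8.f};
    float hC[4] = {0};
    float *dA, *dB, *dC;

    cublasInit();  /* initialise the CUBLAS runtime */
    cublasAlloc(N * N, sizeof(float), (void **)&dA);
    cublasAlloc(N * N, sizeof(float), (void **)&dB);
    cublasAlloc(N * N, sizeof(float), (void **)&dC);

    /* copy input matrices across the PCIe bus to video memory */
    cublasSetMatrix(N, N, sizeof(float), hA, N, dA, N);
    cublasSetMatrix(N, N, sizeof(float), hB, N, dB, N);

    /* C = 1.0 * A * B + 0.0 * C, computed on the GPU */
    cublasSgemm('n', 'n', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);

    /* copy the result back to host memory */
    cublasGetMatrix(N, N, sizeof(float), dC, N, hC, N);
    printf("C[0,0] = %f\n", hC[0]);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
    cublasShutdown();
    return 0;
}
```

Note that there are no kernel launches, thread counts, or block dimensions anywhere in your code; the library chooses those internally. Your MFC project only needs the include path, the library path, and cublas.lib added to the linker inputs.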

Gabriel
I know understanding the details of CUDA threads, blocks, etc. is not necessary to use CUBLAS, but would it be helpful for optimization purposes?
stanigator
Threads, blocks, grids, warps, and shared memory are internal to a kernel, so unless you're creating your own kernel, or editing the source of an existing one, then that level of detail is probably irrelevant to you. As I mentioned in my answer to your other question, the biggest bottleneck is frequently the data transfer from main memory to video memory across the PCIe bus. But the CUDA programming guide explains all of this quite well. Perhaps you should read that over if you want a deeper understanding of the CUDA architecture.
Gabriel
Thanks for your in-depth explanation.
stanigator