cuda

Trying to 'Make' CUDA SDK, ld cannot find library, ldconfig says it can.

I know there are many other questions similar to this one, but none of the solutions posited there are working for me Basically, making the SDK sample files, i get /usr/bin/ld: cannot find -lcuda which would be an easy enough 'find the library and throw it to ldconfig', except ldconfig already says it has it... $ sudo ldconfig -v | gr...

BLAS and CUBLAS

I'm wondering about Nvidia's CUBLAS Library. Does anybody have experience with it? For example if I write a C program using BLAS will I be able to replace the calls to BLAS with calls to CUBLAS? Or even better implement a mechanism which let's the user choose at runtime? What about if I use the BLAS Library provided by Boost with C++? ...

How do pass data to a shared variable in CUDA?

I have a kernel which passes 3 arrays, the first array d_A1 has no data and is used only to write back data, the other two arrays d_D1 and d_ST1 have data. The size of the first array is: d_A1[13000000] The size of the second array is: d_D1[421] The size of the third array is: d_ST1[21] N is 13000000 TestArray<<>>(d_A1,N, d_D1, d...

Why can't nvcc find my Visual C++ installation?

I'm running Windows 7 Pro x64 on a Core i5 with a NVIDIA 3100m, which is CUDA compatible. I've tried installing both the 32-bit and 64-bit CUDA toolkits from NVIDIA, unfortunately from with either of them I cannot compile anything; nvcc says "cannot find a supported cl version. Only MSVC 8.0 and MSVC 9.0 are supported". I have the x86 ...

What is the best algorithm for this array-comparison problem?

What is the most efficient for speed algorithm to solve the following problem? Given 6 arrays, D1,D2,D3,D4,D5 and D6 each containing 6 numbers like: D1[0] = number D2[0] = number ...... D6[0] = number D1[1] = another number D2[1] = another number .... ..... .... ...

How to make different threads execute different parts in CUDA ?

I am working on CUDA and I have a problem related to thread synchronization. In my code I need threads to execute different parts of the code, like: one thread -> all thread -> one thread -> This is what I want. In the initial part of code only one thread will execute and then some part will be executed by all threads then again sing...

Counting FLOPS/GFLOPS in program - CUDA

Already finished my application which multiplies CRS matrix and vector (SpMV) and the only thing to do now is to count FLOPS my application did. In my opinion it's really hard to estimate number of floating point operation in case of sparse matrix - vector multiplication, because the number of multiplies in one row is really "jumpy" or f...

Nvidia Tesla vs 480 for CUDA programming

I am doing research on CUDA programming. i have the option to buy a single NVidia Tesla or buy around 4-5 NVidia 480? what do you recommend? ...

Solving problems involving more complex data structures with CUDA

So I read a bit about CUDA and GPU programming. I noticed a few things such that access to global memory is slow (therefore shared memory should be used) and that the execution path of threads in a warp should not diverge. I also looked at the (dense) matrix multiplication example, described in the programmers manual and the nbody probl...

Financial applications on GPGPU

I want to know what sort of financial applications can be implemented using a GPGPU. I'm aware of Option pricing/ Stock price estimation using Monte Carlo simulation on GPGPU using CUDA. Can someone enumerate the various possibilities of utilizing GPGPU for any application in Finance domain, ...

How to synchronize cuda threads when they are in the same loop and we need to synchronize them to execute only a limited part

Hi all, I have written a code and Now I want to implement this on cuda GPU but I'm new to synchronization so please help me with this, It's little urgent to me. Below I'm presenting the code and I want to that LOOP1 to be executed by all threads (heance I want to this portion to take advantage of cuda and the remaining portion (the porti...

Double precision floating point in CUDA

Does CUDA support double precision floating point numbers ? Also need reasons for the same. ...

cmake: Target-specific preprocessor definitions for CUDA targets seems not to work

I'm using cmake 2.8.1 on Mac OSX 10.6 with CUDA 3.0. So I added a CUDA target which needs BLOCK_SIZE set to some number in order to compile. cuda_add_executable(SimpleTestsCUDA SimpleTests.cu BlockMatrix.cpp Matrix.cpp ) set_target_properties(SimpleTestsCUDA PROPERTIES COMPI...

Optimize code performance when odd/even threads are doing different things in CUDA

I have two large vectors, I am trying to do some sort of element multiplication, where an even-numbered element in the first vector is multiplied by the next odd-numbered element in the second vector... and where the odd-numbered element in the first vector is multiplied by the preceding even-numbered element in the second vector. For e...

CUDA SDK compilation error

I am in the process of setting up a CUDA workstation. Platform specs: Intel Core 2 Duo Nvidia GTX 280 Fedora 10 GCC version 4.3.2 I have installed the developer driver, toolkit, and the SDK. When I try to compile the SDK example code I get the following errors: make[1]: * [obj/i386/release/cutil.cpp.o] Error 1 make: * [lib/libcutil....

Optimizing Vector elements swaps using CUDA

Hi all, Since I am new to cuda .. I need your kind help I have this long vector, for each group of 24 elements, I need to do the following: for the first 12 elements, the even numbered elements are multiplied by -1, for the second 12 elements, the odd numbered elements are multiplied by -1 then the following swap takes place: Graph: b...

Cuda program results are always zero in HW, correct in EMU??

Hi all! I am having a weird problem .. I have written a CUDA code which executes correctly in emulation and all results show up.. however, when executed on hardware "G210" .. the results in the result memory are always 0 I am passing two vectors to the kernel, one with random variables the other is initialized to zero, the code copies ...

how many processors can I get in a block on cuda GPU?

hi all, I have three questions to ask If I create only one block of threads in cuda and execute the parallel program on it then is it possible that more than one processors would be given to single block so that my program get some benefit of multiprocessor platform ? To be more clear, If I use only one block of threads then how many p...

DirectCompute versus OpenCL for GPU programming?

I have some (financial) tasks which should map well to GPU computing, but I'm not really sure if I should go with OpenCL or DirectCompute. I did some GPU computing, but it was a long time ago (3 years). I did it through OpenGL since there was not really any alternative back then. I've seen some OpenCL presentations and it looks really n...

How CudaMalloc work?

I am trying to modify the imageDenosing class in CUDA SDK, I need to repeat the filter many time incase to capture the time. But my code doesn't work properly. //start __global__ void F1D(TColor *image,int imageW,int imageH, TColor *buffer) { const int ix = blockDim.x * blockIdx.x + threadIdx.x; const int iy = blockDim.y * blockIdx....