cuda

Cuda Runtime API and driver API questions

I am new to CUDA and graphics. I have several questions about CUDA; I hope someone will have proper answers. These are for the driver API: -- What is the meaning of a CUDA context? When I was reading the CUDA C book (3.1), I learned that it is analogous to a process on the CPU. I don't understand this; the actual host C code becomes a process in cp...
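
A minimal driver API sketch (assuming a single default device) that may make the process analogy concrete: the context created below owns every allocation and module loaded while it is current, much as a process owns its address space.

#include <cuda.h>

int main() {
    cuInit(0);                        // initialize the driver API
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);             // pick device 0
    cuCtxCreate(&ctx, 0, dev);        // create a context: a per-device "address space" for this host thread

    CUdeviceptr buf;
    cuMemAlloc(&buf, 1024);           // this allocation belongs to ctx, like heap memory belongs to a process

    cuMemFree(buf);
    cuCtxDestroy(ctx);                // destroying the context releases everything it owned, like process exit
    return 0;
}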

Is there any implementation of JVM that supports CUDA?

Is there any implementation of the JVM that supports CUDA? Please provide links =) ...

cublas link in visual studio

I am trying to use cublas.h in Visual Studio. The program doesn't build because the linker can't resolve some of the external symbols. Can someone tell me how to link against the .dll file? I believe it is in ../C/common/bin. Thanks. ...
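
Not a definitive fix, but the linker usually wants cublas.lib from the CUDA Toolkit's lib directory (the .dll is only needed at run time), added under Linker -> Input -> Additional Dependencies or pulled in from source as in this sketch using the legacy CUBLAS API:

#include <cublas.h>
#pragma comment(lib, "cublas.lib")   // MSVC only: equivalent to adding cublas.lib to Additional Dependencies

int main() {
    // Legacy CUBLAS API from the CUDA 3.x era.
    if (cublasInit() != CUBLAS_STATUS_SUCCESS)
        return 1;
    // ... cublasAlloc / cublasSetVector / cublasSgemm calls would go here ...
    cublasShutdown();
    return 0;
}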

CUDA Linking Error Visual Studio 2008

Hi, I am getting the following linking error while compiling ConvolutionFFT2D from the CUDA src: 1>------ Rebuild All started: Project: FinalTest, Configuration: Release Win32 ------ 1>Deleting intermediate and output files for project 'FinalTest', configuration 'Release|Win32' 1>Compiling with CUDA Build Rule... 1>"C:\CUDA\bin\nvc...

cuda library to calculate the diagonal of a matrix

I want to multiply two matrices together, and I only want the diagonal of the result matrix, so I don't want to calculate the other elements. I am wondering if there is a function for this in some existing library, such as CUBLAS or another C++ library. I know I can do it through a kernel wrapper, and a CUDA kernel for this is doable. But I...
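
I don't know of a CUBLAS routine for exactly this, but a hand-written kernel is short; a minimal sketch for square, row-major N x N matrices (names here are illustrative):

// Each thread computes one diagonal entry d[i] = sum_k A[i][k] * B[k][i],
// so the total work is O(N^2) instead of the O(N^3) of a full product.
__global__ void diagOfProduct(const float* A, const float* B, float* d, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[i * N + k] * B[k * N + i];
        d[i] = sum;
    }
}

// launch: diagOfProduct<<<(N + 255) / 256, 256>>>(dA, dB, dDiag, N);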

How to read successfully from a 2D texture

So I posted on Nvidia's forums (with no luck), and either 1. they just want to talk about how awesome graphics cards are, or 2. my question is dumb so they don't look at it (they don't look at my question in either case). Here's my question: how can I bind cudaMallocPitch float memory to a 2D texture reference, copy some host data to ...
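
For reference, a sketch of the sequence that usually works with the (CUDA 3.x era) texture reference API; hostData, width and height are assumed to come from the surrounding code:

texture<float, 2, cudaReadModeElementType> texRef;   // file-scope texture reference

__global__ void copyFromTex(float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);   // +0.5f samples the texel center
}

// Host side:
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
cudaMemcpy2D(devPtr, pitch, hostData, width * sizeof(float),
             width * sizeof(float), height, cudaMemcpyHostToDevice);
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaBindTexture2D(NULL, texRef, devPtr, desc, width, height, pitch);   // pitch is in bytes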

Cannot load .cubin module in CUDA Driver API

I am using JCuda 0.3.1 and the NVIDIA CUDA 3.1 SDK. I am trying to run JCudaRuntimeDriverMixSample.java from here. I compiled the .cu file with "nvcc -keep invertVectorElements.cu". I set the cuModuleLoad filename to the generated .sm_10.cubin file. When I run the compiled Java file, I get CUDA_ERROR_INVALID_SOURCE. I am running nvidia driv...
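
CUDA_ERROR_INVALID_SOURCE often points at a cubin built for an architecture the driver/device won't accept. A hedged sketch of the native driver API calls that JCuda's cuModuleLoad/cuModuleGetFunction mirror (the kernel name and cubin filename below are guesses based on the question):

// One common cause is an architecture mismatch; building explicitly for the card can help, e.g.:
//   nvcc -cubin -arch=sm_10 invertVectorElements.cu
// If the kernel is not declared extern "C", the mangled name from the .cubin must be used instead.
CUmodule module;
CUfunction kernel;
cuModuleLoad(&module, "invertVectorElements.sm_10.cubin");      // path must match the generated file
cuModuleGetFunction(&kernel, module, "invertVectorElements");   // assumed kernel name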

How to display latency, memory ops, and arithmetic ops in Nvidia Compute Profiler

Hey all, I heard that with the Nvidia compute profiler it should be possible to get a comparison of how much time is being spent on arithmetic ops, memory ops, or latency. I searched the profiler after running my program and I tried googling, but I don't see anything related to figuring out these metrics. Can anybody help? Is my qu...

What is a bank conflict? (Doing Cuda/OpenCL programming)

I have been reading the programming guides for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. I tried googling for "bank conflict" and "bank conflict computer science" but I couldn't find much. Can anybody help me understand or p...
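
A small illustration (assuming 32 banks, as on newer parts; compute 1.x hardware has 16 banks and evaluates half-warps, but the idea is the same): shared memory is split into banks that can each serve one request per cycle, so threads of a warp that hit the same bank are serialized.

__global__ void bankConflictDemo(const float* in, float* out) {
    __shared__ float s[32 * 32];
    int tid = threadIdx.x;            // assume a single warp: launched as <<<1, 32>>>
    s[tid] = in[tid];                 // fill both access patterns with something
    s[tid * 32] = in[tid];
    __syncthreads();

    float noConflict = s[tid];        // stride 1: the 32 threads hit 32 different banks
    float conflict   = s[tid * 32];   // stride 32: all 32 threads hit bank 0 -> 32-way conflict, serialized
    out[tid] = noConflict + conflict;
}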

Convolution, array with filter, in CUDA

I'm trying to take the convolution of an array of data, 256x256, with a 3x3 filter on a GPU using shared memory. I understand that I'm to break the array up into blocks, and then apply the filter within each block. This ultimately means that blocks will overlap along the edges, and some padding will need to be done around the edges where ...
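
A rough sketch of the tiling idea: a 16x16 output tile with a one-pixel halo loaded into shared memory, the 3x3 filter kept in constant memory, and clamp-at-border as one possible edge policy.

#define TILE 16
#define R    1            // radius of the 3x3 filter

__constant__ float d_filter[9];

__global__ void conv3x3(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    // Cooperatively load the tile plus its halo, clamping reads at the image border.
    for (int dy = threadIdx.y; dy < TILE + 2 * R; dy += blockDim.y)
        for (int dx = threadIdx.x; dx < TILE + 2 * R; dx += blockDim.x) {
            int gx = min(max((int)(blockIdx.x * TILE + dx) - R, 0), width  - 1);
            int gy = min(max((int)(blockIdx.y * TILE + dy) - R, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        float sum = 0.0f;
        for (int fy = 0; fy < 3; ++fy)
            for (int fx = 0; fx < 3; ++fx)
                sum += d_filter[fy * 3 + fx] * tile[threadIdx.y + fy][threadIdx.x + fx];
        out[y * width + x] = sum;
    }
}

// launch: dim3 block(TILE, TILE), grid((256 + TILE - 1) / TILE, (256 + TILE - 1) / TILE);
//         conv3x3<<<grid, block>>>(d_in, d_out, 256, 256);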

Why aren't there bank conflicts in global memory for Cuda/OpenCL?

One thing I haven't figured out, and Google isn't helping me with, is why it is possible to have bank conflicts with shared memory but not with global memory. Can there be bank conflicts with registers? UPDATE: Wow, I really appreciate the two answers from Tibbit and Grizzly. It seems that I can only give a green check mark to one answer, though....
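
As far as I understand it, banks are a property of the on-chip shared memory arrays (and the register file has its own, mostly hidden, banking), while the analogous concern for off-chip global memory is coalescing rather than bank conflicts; a small contrast sketch:

__global__ void coalescedCopy(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];          // consecutive threads touch consecutive addresses: few wide transactions
}

__global__ void stridedCopy(const float* in, float* out, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];          // with a large stride the warp's accesses scatter into many transactions
}                            // (uncoalesced: the global-memory analogue of a "bad" access pattern)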

Fortran interface to call a C function that return a pointer

I have a C function double* foofunc() {...} and I don't know how to declare the interface in Fortran to call this C function. The second question is: what if this pointer is supposed to be pointing to GPU device memory? How could I define that in the Fortran interface, i.e. do I need to use the DEVICE attribute? Thanks, T. Edit: Use any featu...

Question about Compute Prof's fields for incoherent and coherent gst/gld? (Cuda/OpenCL)

Hey all, I am using Compute Prof 3.2 and a GeForce GTX 280, so I believe I have compute capability 1.3. This file, http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/visual_profiler_cuda/CUDA_Profiler_3.0.txt, seems to show that I should be able to see these fields since I am using a 1.x compute device. Well, I don't s...

Rationalizing what is going on in my simple OpenCL kernel in regards to global memory

const char programSource[] =
    "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
    "{"
    "    int gid = get_global_id(0);"
    "    for (int i = 0; i < 10; i++) {"
    "        a[gid] = b[gid] + c[gid];"
    "    }"
    "}";

The kernel above performs the vector addition ten times in a loop. I have used the prog...

matrix multiplication in cuda

Say I want to multiply two matrices together, 50 by 50. I have two ways to arrange threads and blocks: a) one thread calculates each element of the result matrix, so each thread has a loop that multiplies one row by one column; b) one thread does each multiplication, so each element of the result matrix requires 50 threads. After multiplica...
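
Option (a) is the usual starting point; a minimal sketch with one thread per output element (the 16x16 block shape is just a convenient choice for a 50x50 problem):

__global__ void matMul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];   // dot product of a row of A with a column of B
        C[row * n + col] = sum;
    }
}

// launch: dim3 block(16, 16), grid((50 + 15) / 16, (50 + 15) / 16);
//         matMul<<<grid, block>>>(dA, dB, dC, 50);

Option (b) needs an extra reduction step to sum the 50 partial products per output element, which is usually not worth the overhead at this size.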

Struct Member Pointer in kernel function, the values associated to the pointer are not copied into device memory

Hi, we have the following struct defined:

typedef struct PurchaseOrder {
    char* Value1;
    double Value2;
    int Value3Length;
    __device__ int GetValue3Length() { return Value3Length; }
    double* Value3;
    __device__ double GetValue3(int i) { return Value3[i]; }
    __device__ void SetValue3(int i, double value) { Value3[i] = value; }
};

The PurchaseOrde...
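
A sketch of the usual workaround (names like h_order, h_values and n are placeholders): copying the struct copies only the pointer values, so the Value3 array needs its own device allocation first, and the device copy of the struct must point at that.

// Allocate and fill the device array that Value3 should point to.
double* d_value3;
cudaMalloc((void**)&d_value3, n * sizeof(double));
cudaMemcpy(d_value3, h_values, n * sizeof(double), cudaMemcpyHostToDevice);

// Build a staging copy of the struct whose pointer members refer to device memory.
PurchaseOrder staging = h_order;     // h_order is the host-side struct
staging.Value3 = d_value3;
staging.Value3Length = n;

// Copy the patched struct itself to the device (Value1 would need the same treatment).
PurchaseOrder* d_order;
cudaMalloc((void**)&d_order, sizeof(PurchaseOrder));
cudaMemcpy(d_order, &staging, sizeof(PurchaseOrder), cudaMemcpyHostToDevice);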

Trying to mix in openCL with CUDA in Nvidia's SDK template

Hey all, I have been having a tough time setting up an experiment where I allocate memory with CUDA on the device, take that pointer to memory on the device, use it in OpenCL, and return the results. I want to see if this is possible. I had a tough time getting a CUDA project to work so I just used Nvidia's template project in their SDK...

cuda SM register limit

I know the number of blocks running on one SM is limited by block count, threads, shared memory, and registers. Is there any strategy for avoiding using too many registers? I mean, I just don't want to use too many of them, since eventually that limits the number of blocks I can run on one SM. ...
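
Two knobs I'm aware of, sketched below: __launch_bounds__ on the kernel, or the --maxrregcount compiler flag. Both trade registers for possible spills to local memory, so it is worth checking occupancy versus runtime in the profiler.

// Ask the compiler to keep register use low enough that 4 blocks of 256 threads
// can be resident on one SM (it may spill registers to local memory to comply).
__global__ void __launch_bounds__(256, 4) scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Alternatively, cap registers per thread for the whole file at compile time:
//   nvcc --maxrregcount=32 kernel.cu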

kernel function parameter as const

Say I have a kernel foo(int a, int b) { __shared__ int array[a]; } It seems a has to be a constant value, so I added const in front of int. It still didn't work out; any idea? foo(const int a, const int b) { __shared__ int array[a]; } ...
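
The size of a statically declared __shared__ array has to be known at compile time, so a const parameter doesn't help; the usual route for a launch-time size is dynamically allocated shared memory, sketched here:

__global__ void foo(int a, int b) {
    extern __shared__ int array[];   // size is supplied by the third <<< >>> launch parameter
    int tid = threadIdx.x;
    if (tid < a)
        array[tid] = b;
    __syncthreads();
    // ... use array ...
}

// launch: foo<<<grid, block, a * sizeof(int)>>>(a, b);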

Parallel reduction and find index on CUDA

I have an array of 20K values and I am reducing it over 50 blocks with 400 threads each. num_blocks = 50 and block_size = 400. My code looks like this: getmax <<< num_blocks,block_size >>> (d_in, d_out1, d_indices); __global__ void getmax(float *in1, float *out1, int *index) { // Declare arrays to be in shared memory. __shared...
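
A sketch of one common pattern, carrying the index alongside the value in shared memory. Since 400 is not a power of two, the halving step below rounds up; each block writes one candidate, and a second pass (or the host) picks the overall winner.

__global__ void getmax(const float* in1, float* out1, int* index) {
    __shared__ float s_val[400];
    __shared__ int   s_idx[400];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // 50 * 400 covers the 20K elements exactly
    s_val[tid] = in1[gid];
    s_idx[tid] = gid;
    __syncthreads();

    // Tree reduction that keeps value and index together; handles non-power-of-two block sizes.
    for (int n = blockDim.x; n > 1; ) {
        int half = (n + 1) / 2;
        if (tid < n - half && s_val[tid + half] > s_val[tid]) {
            s_val[tid] = s_val[tid + half];
            s_idx[tid] = s_idx[tid + half];
        }
        __syncthreads();
        n = half;
    }

    if (tid == 0) {
        out1[blockIdx.x]  = s_val[0];   // per-block maximum
        index[blockIdx.x] = s_idx[0];   // index of that maximum in the original array
    }
}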