
How to display latency, memory ops, and arithmetic ops in Nvidia Compute Profiler

Hey all, I heard that with the Nvidia compute profiler, it should be possible to get a comparison of how much time is being spent on arithmetic ops, memory ops, or latency. I searched the profiler after running my program and I tried googling, but I don't see anything related to figuring out these metrics. Can anybody help, is my qu...

What is a bank conflict? (Doing Cuda/OpenCL programming)

I have been reading the programming guide for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. I tried googling for bank conflict and bank conflict computer science but I couldn't find much. Can anybody help me understand or p...

Why aren't there bank conflicts in global memory for Cuda/OpenCL?

One thing I haven't figured out, and Google isn't helping me with, is why it is possible to have bank conflicts with shared memory but not in global memory. Can there be bank conflicts with registers? UPDATE: Wow, I really appreciate the two answers from Tibbit and Grizzly. It seems that I can only give a green check mark to one answer though....

Question about Compute Prof's fields for incoherent and coherent gst/gld? (Cuda/OpenCL)

Hey all, I am using Compute Prof 3.2 and a GeForce GTX 280, so I believe I have compute capability 1.3. This file, http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/visual_profiler_cuda/CUDA_Profiler_3.0.txt, seems to show that I should be able to see these fields since I am using a 1.x compute device. Well I don't s...

Rationalizing what is going on in my simple OpenCL kernel with regard to global memory

const char programSource[] = "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)" "{" " int gid = get_global_id(0);" "for(int i=0; i<10; i++){" " a[gid] = b[gid] + c[gid];}" "}"; The kernel above performs a vector addition ten times in a loop. I have used the prog...

Countless warnings with SDL/OpenGL on OS X when dynamic graphics device switching is active

On Snow Leopard, with a MacBook Pro that has two graphics devices, the following error is printed to stderr multiple times a second: Wed Oct 6 02:35:27 nausicaa.local TestApp[92464] <Warning>: CGDisplayIsCaptured: Fixing up display ID 0x4272ec2 for offline mux head to 0x4272ec0 When I force the graphics device to be either Nvidia or In...

Trying to mix in OpenCL with CUDA in Nvidia's SDK template

Hey all, I have been having a tough time setting up an experiment where I allocate memory with CUDA on the device, take that pointer to memory on the device, use it in OpenCL, and return the results. I want to see if this is possible. I had a tough time getting a CUDA project to work so I just used Nvidia's template project in their SDK...

Size of statically allocated shared memory per block question with Compute Prof (Cuda/OpenCL)

In Nvidia's Compute Prof there is a column called "static private mem per work group", and its tooltip says "Size of statically allocated shared memory per block". My application shows that I am getting 64 (bytes I assume) per block. Does that mean I am using somewhere between 1-64 of those bytes or is the profiler just telling me t...

Bitmap conversion using GPU

I don't know whether this is the right forum. Anyway, here is the question. In one of our applications we display medical images and, on top of them, an algorithm-generated bitmap. The real bitmap is a 16-bit grayscale bitmap. From this we generate a color bitmap based on a lookup table, e.g. (0-100)->green (100-200)->blue (200>above)...

.NET library/wrapper that would abstract away the differences between ATI and Nvidia APIs for computing on the GPU?

I want to use the GPU for computation. I need it to fall back to the CPU if no GPU is found and to provide me with a unified API. (I am interested in anything for .NET, for example version 4.) ...

How to mitigate host + device memory transfer bottlenecks in OpenCL/CUDA

If my algorithm is bottlenecked by host to device and device to host memory transfers, is the only solution a different or revised algorithm? ...

CUDA kernels produce different results on 2 different GPUs (GeForce 8600M GT vs Quadro FX 770M)

Hi everybody, I've been working on an AES CUDA application and I have a kernel which performs ECB encryption on the GPU. In order to ensure the logic of the algorithm is not modified when running in parallel, I send a known input test vector provided by NIST and then from host code compare the output with the known test vector output prov...

Why does the OpenCL vector addition Nvidia SDK example use async writes?

The vector addition example has this code: // Asynchronous write of data to GPU device ciErr1 = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcA, 0, NULL, NULL); ciErr1 |= clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcB, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcB, 0, NULL,...

Question about gld_efficiency and gst_efficiency in Nvidia Compute Visual Profiler

I have a compute capability 1.2 card. It reports gld_efficiency and gst_efficiency for me. My problem is that I sometimes get values beyond the 0-1 range, sometimes greater than 2. Page 57 of the User Guide for the Compute Visual Profiler states that they should be between 0-1, so I am confused. Can anybody explain? ...

CL_OUT_OF_RESOURCES for 2 million floats with 1GB VRAM?

It seems like 2 million floats should be no big deal, only 8 MB of the 1 GB of GPU RAM. I am able to allocate that much at times and sometimes more than that with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like thi...

Example for rendering with Cg to an offscreen frame buffer object

I would like to see an example of rendering with Nvidia Cg to an offscreen frame buffer object. The computers I have access to have graphics cards but no monitors (or X server), so I want to render my scenes and write them to disk as images. The graphics cards are GTX 285s. ...

CUDA Matrix multiplication breaks for large matrices

I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows Server 2008 R2 Enterprise with an Nvidia GTX 480. The following code works fine with values of "Width" (matrix width) up to about 2500 or so. int size = Width*Width*sizeof(float); float* Md, *Nd, *Pd; cudaError_t err ...