cuda

How do NVAPI device IDs relate to CUDA device IDs?

I'm working on getting a CUDA application to also monitor the GPU's core temperature. That information is accessible via NVAPI. The problem is that I want to make sure I'm monitoring the same GPU that my code is running on. However, there seems to be information suggesting that the device IDs I get from NvAPI_EnumPhysicalGPUs do not correspond...

printf inside CUDA __global__ function.

I am currently writing a matrix multiplication on a GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function: __global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){ int tx = threadI...

screensaver hurts CUDA performance?

I've noticed that the running times of my CUDA kernels are almost tripled the moment the screensaver kicks in. This happens even if it's the blank screensaver. Oddly enough, this appears to have nothing to do with the power settings. When I disable the screen saver and let the screen power off, the performance stays the same. When I set...

How to structure data for optimal speed in a CUDA app

I am attempting to write a simple particle system that leverages CUDA to update the particle positions. Right now I am defining a particle as an object with a position defined by three float values and a velocity also defined by three float values. When updating the particles, I am adding a constant value to the Y com...

Computing the null space of a matrix as fast as possible

I need to compute the null space of several thousand small matrices (8x9, not 4x3 as I wrote previously) in parallel (CUDA). All references point to SVD, but the algorithm in Numerical Recipes seems very expensive and gives me lots of things other than the null space that I don't really need. Is Gaussian elimination really not an option...

creating arrays in nvidia cuda kernel

Hi, I just wanted to know whether it is possible to do the following inside an NVIDIA CUDA kernel: __global__ void compute(long *c1, long size, ...) { ... long d[1000]; ... } or the following: __global__ void compute(long *c1, long size, ...) { ... long d[size]; ... } ...

Learn Nvidia CUDA

I am a C++ programmer who develops image and video algorithms. Should I learn NVIDIA CUDA, or is it one of those technologies that will disappear? ...

How to generate pseudo-random numbers in CUDA

I am attempting to build a particle system utilizing CUDA to do the heavy lifting. I want to randomize some of the particles' initial values, like velocity and life span. The random numbers don't have to be super random since it's just for visual effect. I found this post that addresses the same subject http://stackoverflow.com/questions/8...

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code. There was...

help me understand cuda

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA. Could anybody please clarify the following: an 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs. I was watching Stanford's video presentation, and it said that every SP is capable of running 96 threads concurrently. Does this mean that it (SP)...

Mapping a thread number to a (non sequential) position in an array

I would like to map a thread_id. This is in C/CUDA, but it is more of an algebraic problem that I am trying to solve. The mapping I am trying to achieve is along these lines: Threads 0-15: read value array[0] Threads 16-31: read value array[3] Threads 32-47: read value array[0] Threads 48-63: read value array[3] Threads 64-79: read value array[6] Thread...

nvidia cuda using all cores of the machine

Hi, I was running a CUDA program on a machine that has a CPU with four cores. How is it possible to change a CUDA C program to use all four cores and all available GPUs? I mean, my program also does things on the host side before computing on the GPUs... Thanks! ...

online server with cuda compiler?

I wrote a CUDA program and I am testing it in emulation mode since I don't have a CUDA-capable NVIDIA card. So my question is: do you know of any server that I can ssh or telnet to that has a CUDA compiler on it? ...

cuda program on VMware

I wrote a CUDA program and I am testing it on Ubuntu running as a virtual machine. The reason is that I have Windows 7, I don't want to install Ubuntu as a secondary operating system, and I need to use a Linux operating system for testing. My question is: will the virtual machine limit the GPU resources? So will my CUDA code be faster if I r...

Reducing Number of Registers Used in CUDA Kernel

I have a kernel which uses 17 registers; reducing it to 16 would bring me 100% occupancy. My question is: are there methods that can be used to reduce the number of registers used, short of completely rewriting my algorithms in a different manner? I have always kind of assumed the compiler is a lot smarter than I am, so for example I o...

Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks

As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast). At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices: __shared__ float A[34*N] Where N is the nu...

Easiest way to test for existence of cuda-capable GPU from cmake?

We have some nightly build machines that have the CUDA libraries installed but do not have a CUDA-capable GPU installed. These machines are capable of building CUDA-enabled programs, but they are not capable of running them. In our automated nightly build process, our CMake scripts use the CMake command find_package(C...

How to render non trivial particles in OpenGL.

I have a particle system where the positions and various properties are stored in a vertex buffer object. The values are continuously updated by a CUDA kernel. Presently I am just rendering them using GL_POINTS as flat circles. What I am interested in is rendering these particles as more involved things like 3D animated bird models f...

cuda research papers

I am currently doing my BS in computer science and I am interested in graduate studies. I realize that most universities ask for student research experience and publications. I am very interested in CUDA programming. So my question is: how can I write papers about CUDA? I searched a lot on Google and did not find a lot of research papers...

How to optimize a CUDA program to get better performance?

Hi, I wrote a MATLAB program (using CUDA) to generate a key. How can I optimize the CUDA program to get better performance? ...