cuda

Incremental compilation in nvcc (CUDA)

I have many structs (classes) and standalone functions that I would like to compile separately and then link into the CUDA kernel, but I am getting the "External calls are not supported" error while compiling (not linking) the kernel. nvcc forces every function called from the kernel to be inlined. This is very frustrating!! If somebody has figured ...
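For context, a sketch of the separate-compilation pattern the asker wants. At the time of this question nvcc could not link device code across translation units, so every `__device__` function had to be visible in the kernel's own compilation unit (usually by #including the implementation). Since CUDA 5.0, relocatable device code makes true separate compilation possible; file names below are illustrative:

```cuda
// helpers.cu -- compiled separately
__device__ float scale(float x) { return 2.0f * x; }

// kernel.cu -- only a declaration is needed; the definition is linked in
extern __device__ float scale(float x);

__global__ void apply(float *d)
{
    d[threadIdx.x] = scale(d[threadIdx.x]);
}

// build with device linking enabled:
//   nvcc -rdc=true helpers.cu kernel.cu -o app
```

Without `-rdc=true` (or on pre-5.0 toolkits), the `extern __device__` declaration produces exactly the "External calls are not supported" error described above.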

CUDA with map<value, key> & atomic operations

Hi, as far as I know I can use C++ templates in CUDA device code. So if I'm using std::map to create a dictionary, will the operation of inserting new values be atomic? I want to count the number of appearances of certain values, i.e. create a code dictionary with probabilities of the codes. Thanks Macs ...
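A note on what is actually possible here: std::map is a host-only container and cannot be used in device code at all, let alone atomically. A common substitute for counting occurrences on the GPU is a fixed-size histogram updated with `atomicAdd`. A minimal sketch, assuming the codes are integers in `[0, NUM_BINS)`:

```cuda
#define NUM_BINS 256   // illustrative; size to the code range

// Each thread classifies one element and atomically bumps its bin,
// so concurrent increments to the same bin cannot be lost.
__global__ void count_codes(const int *codes, int n, unsigned int *counts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&counts[codes[i]], 1u);
}
```

Probabilities can then be computed on the host (or in a second kernel) by dividing each bin by `n`.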

How to read back a CUDA Texture for testing?

OK, so far I can create an array on the host computer (of type float), copy it to the GPU, then bring it back to the host as another array (to test whether the copy was successful by comparing it to the original). I then create a CUDA array from the array on the GPU and bind that array to a CUDA texture. I now want to read that text...
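Textures cannot be read directly from the host, so the usual test is a kernel that fetches every texel into a plain global array, which is then copied back with `cudaMemcpy` and compared to the original. A sketch using the (legacy) texture reference API, assuming a 1D float cudaArray bound to `tex` with point filtering:

```cuda
texture<float, 1, cudaReadModeElementType> tex;   // bound to the cudaArray on the host

__global__ void dump_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1D(tex, i + 0.5f);   // texel centers sit at i + 0.5
}
```

On the host (error checks omitted): `cudaMalloc` the output buffer, launch `dump_texture`, `cudaMemcpy` the result back, and compare element by element.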

Using Macros to Define Constants for CUDA

I'm trying to reduce the number of instructions and constant memory reads for a CUDA kernel. As a result, I have realised that I can pull out the tile sizes from constant memory and turn them into macros. How do I define macros that evaluate to constants during preprocessing so that I can simply adjust three values and reduce the number...
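The idea in the question can be sketched as follows: tile dimensions defined as macros are folded into immediate constants during preprocessing, so the kernel reads nothing from constant memory for them, and changing the three `#define`s retunes the kernel (names here are illustrative):

```cuda
#define TILE_W 16
#define TILE_H 16
#define TILE_N (TILE_W * TILE_H)   // derived constant, also folded at compile time

__global__ void process(const float *in, float *out, int width)
{
    __shared__ float tile[TILE_H][TILE_W];   // shared-memory size fixed at compile time
    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}
```

The trade-off is that the tile sizes are baked in at compile time, so adjusting them requires a rebuild rather than a new constant-memory upload.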

Meaning of bandwidth in CUDA and why it is important

The CUDA programming guide states that "Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth." It goes on to calculate theoretical bandwidth which is in the order of hundreds of gigabytes per second. I am at a loss as to why ho...
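Bandwidth matters because most kernels are limited by how fast they can move data, not by arithmetic; comparing achieved ("effective") bandwidth against the theoretical figure tells you how close a kernel is to that ceiling. A minimal sketch of measuring effective bandwidth with CUDA events, assuming a simple copy kernel that reads and writes N floats:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // one 4-byte read, one 4-byte write per element
}

void measure(float *d_out, const float *d_in, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // effective GB/s = (bytes read + bytes written) / 1e9 / seconds
    float gbps = (2.0f * n * sizeof(float)) / 1e6f / ms;
    printf("effective bandwidth: %.1f GB/s\n", gbps);
}
```

If the measured figure is far below the theoretical one, the access pattern (coalescing, strides) is usually the first thing to inspect.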

OpenCL and CUDA

Should I learn OpenCL if I only want to program NVIDIA GPUs? ...

How to use textures and arrays in CUDA?

Steps for using textures and arrays in CUDA? ...

Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

How are threads organized to be executed by a GPU? ...
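The core of the usual answer can be sketched in a few lines: a launch specifies a grid of blocks and a block of threads, and each thread combines its block and thread coordinates into a unique global index.

```cuda
// Each thread computes a unique 1D global id from its coordinates.
__global__ void fill(int *out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = gid;
}

// Launch example: 4 blocks of 256 threads cover 1024 elements.
//   fill<<<4, 256>>>(d_out, 1024);
// Grids and blocks may also be 2D or 3D:
//   dim3 grid(gx, gy); dim3 block(bx, by);
//   fill<<<grid, block>>>(...);
```

Blocks are scheduled independently onto multiprocessors, which is why threads in different blocks cannot synchronize with `__syncthreads()`.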

CUDA source files get a .cu extension. What do header files get?

The standard convention seems to be to give CUDA source-code files a .cu extension, to distinguish them from C files with a .c extension. What's the corresponding convention for CUDA-specific header files? Is there one? ...

Thrust (CUDA Library) Compile error like "'vectorize_from_shared_kernel__entry' : is not a member of 'thrust::detail::device::cuda'"

I created a VS project using the CUDA VS Wizard, and I am trying to build a CUDA program using Thrust. The test program is quite simple: // ignore headers int main(void) { thrust::device_vector<double> X; X.resize(100); } I get compile errors like: 1>C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp/tmpxft_00003cc0_00000000-3_sample.cudafe1.stub.c(2...

Calling handwritten CUDA kernel with thrust

Hi, since I needed to sort large arrays of numbers with CUDA, I ended up using Thrust. So far, so good... but what about when I want to call a "handwritten" kernel, having a thrust::host_vector containing the data? My approach was (backcopy is missing): int CUDA_CountAndAdd_Kernel(thrust::host_vector<float> *samples, thrust::host_vect...
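A sketch of the usual pattern for mixing Thrust with a hand-written kernel: a `thrust::host_vector` lives in host memory, so copy it into a `device_vector` first, then hand the kernel a raw device pointer via `thrust::raw_pointer_cast`. The kernel body here is a placeholder for the real work:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

__global__ void count_and_add(float *samples, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        samples[i] += 1.0f;   // placeholder for the actual per-sample work
}

void run(thrust::host_vector<float> &h_samples)
{
    thrust::device_vector<float> d_samples = h_samples;      // host -> device copy
    float *raw = thrust::raw_pointer_cast(d_samples.data()); // raw device pointer
    int n = (int)d_samples.size();
    count_and_add<<<(n + 255) / 256, 256>>>(raw, n);
    thrust::copy(d_samples.begin(), d_samples.end(),         // the missing backcopy
                 h_samples.begin());
}
```

Passing the `host_vector` (or a pointer to it) straight to the kernel fails because its storage is host memory the GPU cannot dereference.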

Optimize CUDA with Thrust in a loop

Given the following piece of code, generating a kind of code dictionary with CUDA using thrust (C++ template library for CUDA): thrust::device_vector<float> dCodes(codes->begin(), codes->end()); thrust::device_vector<int> dCounts(counts->begin(), counts->end()); thrust::device_vector<int> newCounts(counts->size()); for (int i = 0; i < ...
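A per-value loop like the one described can usually be replaced by a single pass: sort the codes (carrying their counts along), then sum the counts of each distinct code with `reduce_by_key`, which requires sorted keys. A sketch, assuming the output vectors are pre-sized to at least the number of distinct codes:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

void build_dictionary(thrust::device_vector<float> &dCodes,
                      thrust::device_vector<int>   &dCounts,
                      thrust::device_vector<float> &outCodes,
                      thrust::device_vector<int>   &outCounts)
{
    // Sort codes, permuting the counts the same way.
    thrust::sort_by_key(dCodes.begin(), dCodes.end(), dCounts.begin());

    // Collapse equal adjacent codes, summing their counts.
    thrust::reduce_by_key(dCodes.begin(), dCodes.end(), dCounts.begin(),
                          outCodes.begin(), outCounts.begin());
}
```

This removes the host-side loop entirely, which also avoids launching one kernel per distinct value.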

GPU programming - transfer bottlenecks

As I would like my GPU to do some of the calculation for me, I am interested in measuring the speed of 'texture' upload and download, because my 'textures' are the data the GPU should crunch. I know that transfer from main memory to GPU memory is the preferred way to go, so I expect such an application to be efficient only if ther...

CUDA GPU optimization

I have read that there is a 100X acceleration on certain problems when you use an NVIDIA GPU instead of a CPU. What are the best performance acceleration factors achieved using CUDA on different problems? Please state the problem and the acceleration factor, along with links to papers if possible. ...

graph algorithms on GPU

Current GPU threads are somewhat limited (memory limits, limits on data structures, no recursion...). Do you think it would be feasible to implement a graph theory problem on a GPU? For example vertex cover? Dominating set? Independent set? Max clique? ... Is it also feasible to have branch-and-bound algorithms on GPUs? Recursive bac...

What's the most trivial function that would benefit from being computed on a GPU?

Hi. I'm just starting out learning OpenCL. I'm trying to get a feel for what performance gains to expect when moving functions/algorithms to the GPU. The most basic kernel given in most tutorials is one that takes two arrays of numbers, sums the values at corresponding indexes, and stores them in a third array, like so: __ker...
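For reference, the CUDA equivalent of that tutorial kernel is an element-wise vector add:

```cuda
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // one add per two reads and one write
}
```

A kernel like this does so little arithmetic per byte moved that it is memory-bound, so the speedup over a CPU is usually modest unless the data is already resident on the GPU; the transfer cost alone can exceed the compute saving.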

CUDA: more threads for the same work = longer run time despite better occupancy. Why?

I encountered a strange problem where increasing my occupancy by increasing the number of threads reduced performance. I created the following program to illustrate the problem: #include <stdio.h> #include <stdlib.h> #include <cuda_runtime.h> #include <cutil.h> __global__ void less_threads(float * d_out) { int num_inliers; for...

Allocate constant memory

I'm trying to set my simulation params in constant memory, but without luck (CUDA.NET). The cudaMemcpyToSymbol function returns cudaErrorInvalidSymbol. The first parameter to cudaMemcpyToSymbol is a string... Is it the symbol name? Actually I don't understand how it is resolved. Any help appreciated. //init, load .cubin float[] arr = new f...
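A sketch of the plain CUDA C pattern that CUDA.NET wraps, for reference: the symbol passed to `cudaMemcpyToSymbol` is the name of a `__constant__` variable defined in device code (the variable and size below are illustrative).

```cuda
#include <cuda_runtime.h>

__constant__ float simParams[64];   // lives in constant memory

void upload_params(const float *host_params)
{
    // Copies 64 floats from host memory into the constant-memory symbol.
    cudaMemcpyToSymbol(simParams, host_params, 64 * sizeof(float));
}
```

When the symbol is looked up by its string name in a compiled module, C++ name mangling can make the lookup fail with cudaErrorInvalidSymbol; one common workaround (an assumption worth verifying for CUDA.NET specifically) is to ensure the string matches the exact name as it appears in the .cubin.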

Using device variable by multiple threads on CUDA

I am playing around with CUDA. At the moment I have a problem: I am testing a large array for particular responses, and when I get a response, I have to copy the data into another array. For example, my test array of six elements looks like this: [ ][ ][v1][ ][ ][v2] The result must look like this: [v1][v2] The problem is how do I calc...
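What the question describes is stream compaction. One simple approach is a global atomic counter that hands each matching element a unique output slot; a sketch, assuming "response" means a nonzero value:

```cuda
// Compact nonzero elements of `in` into `out`; *out_count must start at 0.
// Output order is NOT preserved (atomicAdd assigns slots in arrival order).
__global__ void compact(const float *in, int n, float *out, int *out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != 0.0f) {
        int pos = atomicAdd(out_count, 1);   // claim the next free output index
        out[pos] = in[i];
    }
}
```

If the original order must be kept, the standard alternative is a prefix-sum (scan) based compaction, or simply `thrust::copy_if`, which is order-preserving.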

Memory error detected by Intel Inspector when using cublasAlloc (CUDA BLAS Library)

I've written the following code: N_Vector_cuda v; if (N <= 0) return(NULL); v = (N_Vector_cuda) malloc(sizeof *v); if (v == NULL) return(NULL); v->inc = 1; v->elemsize = sizeof(real); v->status = cublasAlloc(N, v->elemsize, (void**)&(v->data)); if (v->status != CUBLAS_STATUS_SUCCESS) { free(v); return(NULL); } v->length = ...