I discovered that my computer has NVIDIA CUDA technology, and I want to measure its processing power on both the CPU and the GPU.
Instead of searching for a program that does this, I want a deeper understanding of how it works. What kind of code (C/C++) do I need?
...
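One possible starting point for the CPU side is to time a loop with a known number of floating-point operations and divide by the elapsed time. The sketch below is only an illustration under stated assumptions: the helper name `measure_saxpy_gflops`, the SAXPY workload and the buffer sizes are hypothetical choices, not something from the question, and the figure it reports depends heavily on compiler flags and memory bandwidth.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs y[i] = a * x[i] + y[i] over n elements `reps`
// times and returns the measured rate in GFLOP/s, counting 2 floating-point
// operations (one multiply, one add) per element.
double measure_saxpy_gflops(std::size_t n, int reps) {
    assert(n > 0 && reps > 0);
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 0.5f;

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    const auto stop = std::chrono::steady_clock::now();

    // Read one result so the compiler cannot discard the loop entirely.
    volatile float sink = y[n / 2];
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double flops = 2.0 * static_cast<double>(n) * reps;
    return flops / seconds / 1e9;
}
```

Calling `measure_saxpy_gflops(1u << 20, 100)` gives a rough single-threaded figure. A GPU measurement would follow the same pattern (known operation count divided by elapsed time), with the loop body moved into a kernel and the timing done around the kernel launch.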
I am looking for information on how double precision is implemented in hardware on the Tesla GPU. I have read that two stream processors work on a single double value, but I haven't found any official paper from NVIDIA.
Thanks in advance.
PPS
Why do most GPUs compute only in single precision (because colors can be stored as...
I am calling cudaMemcpy and the copy returns successfully; however, the source values are not being copied to the destination. I wrote a similar piece using memcpy() and that works fine. What am I missing here?
// host externs
extern unsigned char landmask[DIMX * DIMY];
// use device constant memory for landmask
unsigned char *tempmask;
...
I am doing some programming with NVIDIA's CUDA C. I am using Visual Studio 2008 as my development environment and I am having some trouble with linking. I am wondering if someone knows a way to fix it, or has had the same problem and could offer a solution.
My program is made up of 3 files. 1 header file (stuff.h), 1 C source fi...
I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements - set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.
What is the most e...
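The index-array idea described above can be modelled on the CPU as follows. This is a minimal sketch, not a tuned GPU solution: the function name `scale_selected` and the scaling operation are hypothetical stand-ins for whatever processing set S needs. On a GPU, the loop body would become the kernel and `t` the global thread index, so that only as many threads as there are entries in A are launched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of an index-driven kernel: "thread" t handles element indices[t].
// Elements whose index is not listed are left untouched.
void scale_selected(std::vector<float>& data,
                    const std::vector<std::size_t>& indices,
                    float factor) {
    for (std::size_t t = 0; t < indices.size(); ++t) {
        const std::size_t i = indices[t];
        assert(i < data.size());  // every index must point inside the array
        data[i] *= factor;
    }
}
```

Because the selected elements are randomly distributed, the accesses through `indices` are scattered, which is the part of this pattern that tends to cost the most on a GPU.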
I was reading Supercomputing for the Masses: Part 5 on Dr. Dobb's and I have a question concerning the author's code for (fast) reversing arrays.
I understand the need to use shared memory, but I don't understand where the performance gain in reverseArray_multiblock_fast.cu comes from.
In reverseArray_multiblock_fast.cu an array element is trans...
I have just learned that many C++ features (including the STL vector class) do not work in .cu files, even when they are used only in host code.
Since I have to use a C++ class that uses the STL, I cannot compile the .cu file that invokes the kernel. (I don't use any STL features in the .cu file itself, but I think the include is the problem.)
I tried t...
I first process a matrix in cuBLAS. I have already sent it to the device, and I want to process some column vectors of the matrix, still using cuBLAS functions. I first tried using pointer arithmetic on the host to offset the device pointer, but it doesn't seem to work.
Is there any way I can process a vector in the matrix without copying it back t...
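For reference, cuBLAS uses column-major storage, so a column is a contiguous run of elements that starts `col * lda` elements after the base pointer. The sketch below shows that arithmetic on an ordinary host array. The helper names `column_ptr` and `sum_column` are hypothetical, and nothing here calls cuBLAS itself. The same offset can be computed on the host for a device pointer, provided the resulting pointer is only passed on to device-side routines and never dereferenced on the host.

```cpp
#include <cassert>
#include <cstddef>

// Column-major layout: element (row, col) lives at base[row + col * lda],
// where lda (the leading dimension) is at least the number of rows.
// Returns a pointer to the first element of column `col`.
const float* column_ptr(const float* base, std::size_t lda, std::size_t col) {
    return base + col * lda;
}

// Example consumer: sums the `rows` contiguous elements of one column.
float sum_column(const float* base, std::size_t rows,
                 std::size_t lda, std::size_t col) {
    assert(rows <= lda);
    const float* c = column_ptr(base, lda, col);
    float s = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) s += c[r];
    return s;
}
```

With this layout a column has stride 1 between consecutive elements, whereas a row has stride `lda`, which is the value a vector routine would need as its increment argument when walking along a row.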
Is it better to use a float instead of an int in CUDA? Does a float decrease bank conflicts and ensure coalescing, or does it have nothing to do with this?
...
What is the difference between coalescing and bank conflicts when programming with CUDA?
Is it only that coalescing applies to global memory while bank conflicts apply to shared memory?
Should I worry about coalescing if I have a GPU with compute capability above 1.2? Does it handle coalescing by itself?
...
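The shared-memory side of this can be illustrated with a small host-side model. It assumes 4-byte-wide banks with successive words mapped to successive banks, which is the usual description (16 banks on compute capability 1.x, 32 on 2.x and later). The function name `conflict_degree` is hypothetical, and the model ignores broadcast of identical addresses by requiring a non-zero stride.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the worst-case number of threads that land on the same bank when
// `threads` threads each access the 4-byte word at index thread * stride.
// A result of 1 means conflict-free; k means a k-way bank conflict.
int conflict_degree(int threads, int stride, int banks) {
    assert(threads > 0 && banks > 0 && stride > 0);
    std::vector<int> hits(banks, 0);
    for (int t = 0; t < threads; ++t)
        ++hits[(t * stride) % banks];  // bank that holds word t * stride
    return *std::max_element(hits.begin(), hits.end());
}
```

For example, `conflict_degree(32, 1, 32)` is 1, while `conflict_degree(32, 2, 32)` is 2. Since `float` and `int` are both 4 bytes wide, this model gives the same answer for either type: the access pattern determines the conflicts, not whether the word holds an integer or a floating-point value.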
What is the relationship between a CUDA core, a streaming multiprocessor, and the CUDA model of blocks and threads?
What gets mapped to what, and what is parallelized and how? And which is more efficient: maximizing the number of blocks or the number of threads?
Thanks,
ExtremeCoder
My current understanding is that there are 8 cuda core...
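As a small aid for the blocks-and-threads part of the question, the sketch below models the arithmetic that a one-dimensional launch typically relies on. The function names `blocks_for` and `global_index` are hypothetical, and this is host-side C++ that only mirrors the CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`. It says nothing about how blocks are scheduled onto multiprocessors.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threads_per_block >= n
// (integer ceiling division).
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Global index of a thread in a 1D launch; mirrors
// blockIdx.x * blockDim.x + threadIdx.x inside a kernel.
int global_index(int block_idx, int block_dim, int thread_idx) {
    assert(block_idx >= 0 && thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}
```

With 1000 elements and 256 threads per block, `blocks_for(1000, 256)` yields 4 blocks, so the last block contains threads whose global index is 1000 or more. A kernel normally guards against those with an `if (i < n)` check.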
I am developing a program using the CUDA SDK and an NVIDIA 9600 card with 1 GB of memory. In this program:
0) A pointer to a 2D int array of size 3000x6 is passed to a kernel as an input argument.
1) The kernel has to sort it on up to 3 levels (1st, 2nd and 3rd column).
2) For this purpose, the kernel declares an array of int pointers of size 3000.
3) The kernel then ...
The previous tutorials have not shown anybody else having this problem: compiling OpenCV and CUDA projects in VS2008 on Windows 7 x64. But I have been stuck on it for over a week.
I have zero problems building the OpenCV samples, my own code, and CUDA within their own projects. I cannot get them to build together in a single project no matter ...
I've written a small CUDA program on my MacBook Pro, and now that I have tried it on my Linux box I get different results.
In order to ensure correctness, I wrote unit tests: An array of floats, which contains the values to check, is copied to the device and then back. Worst thing is that it sometimes returns different values on Linux (and ver...
I'm working on translating a CUDA application (this if you must know) to OpenCL. The original application uses the C-style CUDA API, with a single stream just to avoid the automatic busy-wait when reading the results.
Now I notice that OpenCL command queues look a lot like CUDA streams. But in the device read command, and likewise in ...
clock() is not accurate enough.
...
Is there a maximum number of streams that can be created in CUDA?
To clarify, I mean CUDA streams, as in the streams that allow you to execute kernels and memory operations.
...
Is there an efficient implementation of a solver for sparse linear systems using CUDA?
...
I have been working with CUDA for a while now, and I have started getting bus errors on the first attempt to allocate any memory on the GPU after working for a short period of time. The only way that I have found to fix this is to restart the machine.
The memory should be cleared up automatically but it does not seem to happen if the ap...
Over the summer, I started to learn CUDA C because the NVIDIA performance claims were simply unbelievable. This past week, I started another semester of my undergrad studies. My major is computer science.
One of the classes I am taking this semester is undergrad research, and I want to further practice with CUDA C. Does anyone have an...