Why is there a button for enabling and disabling profiling on the Compute Profiler? If I disable profiling, then I can't launch my application for profiling. So why does profiling need to be disabled at all?
...
Is it possible to compare more than two kernel executions at a time in Compute Prof?
...
It seems apparent that each core of the GPU could allow for handling of a request, rather than one main processor (the system's CPU) handling all requests. On the surface, it seems like it is possible, perhaps with Templates in GPU + Redis database in GPU GDDR5?
Is it possible and worthwhile?
...
On page 51 of the Compute Visual Profiler User Guide it states that:
"Note that in case the number blocks in a kernel is less than or not a multiple of the number of multiprocessors the counters values across multiple runs will not be consistent."
Is that an inclusive or exc...
Hi, I'm trying to measure GPU and CPU FLOPS, and I got the source from http://norma.mbg.duth.gr/index.php?id=about:benchmarks:cuda_flops
I renamed it to cudaflops.cu and compiled it with this makefile:
################################################################################
#
# Build script for project
#
####################...
Hello all,
I have used Visual Studio 2008 to compile and run CUDA applications before. I have switched to Visual Studio 2010 and Windows 7. I've been trying to get integration set up all morning, but haven't had complete success. I've downloaded the toolkit, installed Nsight, made sure the libraries/include/bin paths are set, checked ...
I have some knowledge of C/C++ programming and want to learn CUDA. I'm also on a Mac. So what is the best way to learn CUDA?
...
I'm working on a number crunching app using the CUDA framework. I have some static data that should be accessible to all threads, so I've put it in constant memory like this:
__device__ __constant__ CaseParams deviceCaseParams;
I use the call cudaMemcpyToSymbol to transfer these params from the host to the device:
void copyMetaData(C...
I have a class, say
class AddElement {
    int a, b, c;
};
With methods to set/get a,b,c... My question is definitely a logic question - say I implement AddElement as follows:
int Value=1;
Value += AddElement.get_a() + AddElement.get_b() + AddElement.get_c();
Now imagine I want to do the above except 'a,b,c' are now arrays, and instead of '...
I'm running a CUDA library that I need to debug for memory problems and other issues. But when I attach cuda-gdb to the process I get the error
error: All CUDA devices are used for X11 and cannot be used while debugging.
I understand the error, but there has to be a way that I can debug the issues. Since I only have 1 GPU, it real...
Say I want to time a memory fetch from device global memory:
cudaMemcpy(...cudaMemcpyHostToDevice);
cudaThreadSynchronize();
time1 ...
kernel_call();
cudaThreadSynchronize();
time2 ...
cudaMemcpy(...cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
time3 ...
I don't understand why my time3 and time2 always give the same results. My ke...
Hello,
I wrote a CUDA application that has some hardcoded parameters in it (via #defines). Everything seemed to work right, so I tried some other parameters. Now, the program doesn't work correctly anymore.
So, I want to debug it. I compile the application with -deviceemu -g -O0 options, because I read that I can then use gdb to debug...
This is a best-practices question. I am making an array:
type * x = malloc(size*sizeof(type));
AFAIK sizeof yields a value of type size_t. Does that mean I should use size_t to declare or pass around size? Also, when indexing the array, should I use size_t for the index variable as well? What is the best practice for these th...
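To make the question above concrete, here is a minimal C++ sketch, offered as an illustration rather than a definitive answer. It assumes the element type is double, and the helper name make_array is hypothetical (it does not appear in the original post). The sketch uses size_t for both the element count and the loop index, since that is the type sizeof yields and malloc accepts.

```cpp
#include <cstddef>   // std::size_t
#include <cstdlib>   // std::malloc, std::free

// Hypothetical helper (not from the original post): allocate `count`
// doubles and fill them with 0, 1, 2, ...
// Returns nullptr if the allocation fails.
double* make_array(std::size_t count) {
    // count * sizeof(double) is computed entirely in size_t,
    // so no signed/unsigned conversion takes place.
    double* x = static_cast<double*>(std::malloc(count * sizeof(double)));
    if (x == nullptr) {
        return nullptr;
    }
    // The index uses the same type as the count, which avoids
    // signed/unsigned comparison warnings in the loop condition.
    for (std::size_t i = 0; i < count; ++i) {
        x[i] = static_cast<double>(i);
    }
    return x;
}
```

A caller would write something like `double* x = make_array(8);`, check the result against nullptr, and release the buffer with `std::free(x)` when done.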
Hey,
I am using Visual Studio 2008, with CUDA 3.2. I am trying to debug into a function with this signature:
MatrixMultiplication_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
I can step into the function, however when I get into the function it doesn't let me step over any of the code and tells me that no source is available. An...
Hello,
I am interested in using F# for numerical computation. How can I access the GPU using NVIDIA's CUDA standard under F#?
...
kernel1 <<< blocks1, threads1, 0, stream1 >>> ( args ... );
...
kernel2 <<< blocks2, threads2, 0, stream2 >>> ( args ... );
...
I have two kernels to run concurrently,
and the device is a GTX460, so it's the Fermi architecture.
The CUDA toolkit and SDK are 3.2 RC.
As in the code above, the two kernels are written to run concurrently,
but there ar...
Since a CUDA ".cu" file is basically C, is there a way we can use Doxygen to generate documentation for ".cu" files? I noticed that NVIDIA uses Doxygen to generate CUDA's documentation. However, when I use Doxygen, the ".cu" files are ignored.
...
Hi,
I am a senior undergrad majoring in CS. At the moment I am taking a Computer Architecture class. We need to do a project. I want to do something related to CUDA, where the performance of the computation will show a moderate increase compared to a serial implementation.
I am really interested in databases so I decided to do something...
I have these template functions for inline use on the device with CUDA:
template <class T> __device__ inline T& cmin(T&a,T&b){return (a<b)?(a):(b);};
template <class T> __device__ inline T& cmax(T&a,T&b){return (a>b)?(a):(b);};
In the code I have
cmin(z[i],y[j])-cmax(x[i],z[j])
for int arrays x, y, and z. I get the error:
error: no ...
Hey guys,
I am trying to debug into my kernel code, using the device emulation mode.
However, I set break points in my kernel and it doesn't break.
MatrixMultiplication_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Can anyone assist me with this?
...