I'm trying to think of a way to implement the following algorithm using CUDA:
I'm working on a large volume of voxels. For each voxel I calculate an index i and a value c. After the calculation I need to perform histogram[i] += c.
c is a float, and the histogram can have up to 15,000 bins.
I'm looking for a way to implement this effici...
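A minimal sketch, assuming a GPU of compute capability 2.0 or later, where atomicAdd on a float in global memory is available: one thread per voxel, with atomicAdd resolving collisions when several voxels land in the same bin. compute_bin and compute_value are hypothetical placeholders for the real per-voxel calculations. With up to 15,000 float bins (about 60 KB), a per-block copy of the histogram exceeds the default 48 KB shared-memory limit per block, so this sketch accumulates directly into global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-ins for the real per-voxel calculations.
__device__ int compute_bin(float voxel, int num_bins) {
    int i = static_cast<int>(voxel);        // placeholder mapping voxel -> bin
    return min(max(i, 0), num_bins - 1);    // clamp to a valid bin index
}

__device__ float compute_value(float voxel) {
    return 0.5f * voxel;                    // placeholder weight c
}

// One thread per voxel; atomicAdd serializes updates that hit the same bin.
__global__ void voxel_histogram(const float *voxels, int n,
                                float *histogram, int num_bins) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = voxels[idx];
        atomicAdd(&histogram[compute_bin(v, num_bins)], compute_value(v));
    }
}

// Host wrapper: uploads the voxels, runs the kernel, downloads the histogram.
std::vector<float> build_histogram(const std::vector<float> &voxels, int num_bins) {
    int n = static_cast<int>(voxels.size());
    float *d_voxels = nullptr, *d_hist = nullptr;
    cudaMalloc(&d_voxels, n * sizeof(float));
    cudaMalloc(&d_hist, num_bins * sizeof(float));
    cudaMemcpy(d_voxels, voxels.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, num_bins * sizeof(float));   // all-zero bits == 0.0f

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    voxel_histogram<<<blocks, threads>>>(d_voxels, n, d_hist, num_bins);

    std::vector<float> result(num_bins);
    cudaMemcpy(result.data(), d_hist, num_bins * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_voxels);
    cudaFree(d_hist);
    return result;
}
```

If many voxels fall into a few bins, contention on those atomics will dominate; the sketch makes no attempt to reduce it.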
I guess the question speaks for itself. I'm interested in doing some serious computations but am not a programmer by trade. I can string together enough Python to get done what I want. But can I write a program in Python and have the GPU execute it using CUDA? Or do I have to use some mix of Python and C?
The examples on Klockner's ...
Is there any CUDA library that performs comparison/search operations?
...
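Thrust, the C++ template library that ships with recent CUDA toolkits, provides GPU search primitives such as thrust::find and a vectorized thrust::binary_search. A small sketch, assuming plain int keys; contains_each and find_first are hypothetical helper names:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/find.h>
#include <thrust/copy.h>
#include <vector>

// For each query, 1 if it occurs in `data`, else 0 (sort + vectorized binary search).
std::vector<int> contains_each(const std::vector<int> &data, const std::vector<int> &queries) {
    thrust::device_vector<int> d_data(data.begin(), data.end());
    thrust::device_vector<int> d_queries(queries.begin(), queries.end());
    thrust::device_vector<int> d_found(queries.size());

    thrust::sort(d_data.begin(), d_data.end());          // binary_search needs sorted input
    thrust::binary_search(d_data.begin(), d_data.end(),
                          d_queries.begin(), d_queries.end(),
                          d_found.begin());              // one result per query

    std::vector<int> found(queries.size());
    thrust::copy(d_found.begin(), d_found.end(), found.begin());
    return found;
}

// Index of the first element equal to `value` in unsorted data, or -1 if absent.
int find_first(const std::vector<int> &data, int value) {
    thrust::device_vector<int> d_data(data.begin(), data.end());
    auto it = thrust::find(d_data.begin(), d_data.end(), value);
    return it == d_data.end() ? -1 : static_cast<int>(it - d_data.begin());
}
```

Sorting once and then answering many queries with the vectorized binary_search amortizes the sort; for a single lookup in unsorted data, thrust::find avoids the sort entirely.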
Hi,
I am in a real fix. Please help. It's urgent.
I have a host process that spawns multiple host (CPU) threads (pthreads). These threads in turn launch the CUDA kernels. These CUDA kernels are written by external users, so some of them may be bad kernels that enter an infinite loop. To overcome this I have put a time-out of 2 minutes that will ...
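One way to implement such a time-out without blocking forever in a synchronize call is to record an event right after the launch and poll it with cudaEventQuery until it completes or a deadline passes. A sketch under these assumptions: user_kernel is a hypothetical stand-in for an external kernel, clock64() requires compute capability 2.0 or later, and the function only detects the time-out; it does not stop a runaway kernel.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <thread>

// Hypothetical user kernel: busy-waits for roughly `cycles` clock ticks.
__global__ void user_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launches the kernel, then polls an event instead of calling cudaDeviceSynchronize.
// Returns true if the kernel finished within `timeout_ms`, false otherwise.
bool launch_with_timeout(long long cycles, int timeout_ms) {
    cudaEvent_t done;
    cudaEventCreate(&done);
    user_kernel<<<1, 1>>>(cycles);
    cudaEventRecord(done);                       // completes only after the kernel does

    auto deadline = std::chrono::steady_clock::now()
                  + std::chrono::milliseconds(timeout_ms);
    bool finished = false;
    while (std::chrono::steady_clock::now() < deadline) {
        cudaError_t status = cudaEventQuery(done);
        if (status == cudaSuccess) { finished = true; break; }
        if (status != cudaErrorNotReady) break;  // a real error, not "still running"
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    cudaEventDestroy(done);  // safe even if pending: resources are freed once it completes
    return finished;
}
```

Polling with a short sleep keeps the calling pthread from spinning a CPU core. Recovering the GPU after a genuine time-out generally means tearing down the CUDA context or the process, which this sketch does not attempt.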
Hi,
I've worked on many data matching problems, and very often they boil down to running many instances of CPU-intensive algorithms such as Hamming or edit distance, quickly and in parallel. Is this the kind of thing that CUDA would be useful for?
What kinds of data processing problems have you solved with it? Is there really an upl...
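Hamming distance maps naturally onto the GPU: each thread can compare one pair of bit strings, XOR them word by word, and count the set bits with the __popc intrinsic. A sketch, assuming each item is packed into words_per_item 32-bit words stored contiguously (hamming_kernel and hamming_distances are hypothetical names). Edit distance has loop-carried dynamic-programming dependencies and is not covered here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread computes the Hamming distance of one pair of packed bit strings.
__global__ void hamming_kernel(const unsigned int *a, const unsigned int *b,
                               int words_per_item, int num_pairs, int *dist) {
    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    if (pair >= num_pairs) return;
    int d = 0;
    for (int w = 0; w < words_per_item; ++w) {
        int off = pair * words_per_item + w;
        d += __popc(a[off] ^ b[off]);            // count differing bits in this word
    }
    dist[pair] = d;
}

// Host wrapper: a and b hold num_pairs items of words_per_item words each.
std::vector<int> hamming_distances(const std::vector<unsigned int> &a,
                                   const std::vector<unsigned int> &b,
                                   int words_per_item) {
    int num_pairs = static_cast<int>(a.size()) / words_per_item;
    size_t bytes = a.size() * sizeof(unsigned int);
    unsigned int *d_a, *d_b;
    int *d_dist;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_dist, num_pairs * sizeof(int));
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128, blocks = (num_pairs + threads - 1) / threads;
    hamming_kernel<<<blocks, threads>>>(d_a, d_b, words_per_item, num_pairs, d_dist);

    std::vector<int> dist(num_pairs);
    cudaMemcpy(dist.data(), d_dist, num_pairs * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_dist);
    return dist;
}
```

Storing each item's words contiguously keeps the code simple; interleaving them so that neighboring threads read neighboring addresses would give better-coalesced loads.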
If I want to rewrite my application so that it leverages the power of NVIDIA's CUDA SDK, are there any differences at all in runtime performance between the different SDK offerings: C++, Java, Python?
Is there any difference at all between these three SDKs, besides the obvious one of the language being used?
...
Hi everyone,
I'm working on a cross-platform C++ OpenGL application (Windows, Linux and MacOS), and I am wondering if some of you could share some advice on porting a large application to OpenGL 3. I am looking into OpenGL 3 because I think we could benefit a lot from using the new "sync objects". Nvidia has supported such...
Hi,
I've been searching extensively for a solution to my error for the past two weeks. I have successfully installed the 64-bit CUDA compiler (tools) and SDK, as well as the 64-bit version of Visual Studio Express 2008 and the Windows 7 SDK with Framework 3.5. I'm using Windows XP 64-bit. I have confirmed that VSE is able to compile i...
This is a repost of a question I posted a few days ago; I lost that account and registered another one.
I am trying to modify the imageDenoising sample in the CUDA SDK. I need to repeat the filter many times in order to measure the time, but my code doesn't work properly.
//start
__global__ void F1D(TColor *image,int imageW,int imageH, TColor *buffer)
{
con...
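For repeating a filter and timing it, a common pattern is to launch the kernel in a loop, swapping the input and output buffers after each pass, and to bracket the loop with CUDA events. filter_pass below is a hypothetical 3-tap horizontal box blur on float pixels, standing in for the SDK's TColor-based denoising kernels; it is not the SDK code itself.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Hypothetical filter pass: 3-tap horizontal box blur with clamped borders.
__global__ void filter_pass(const float *src, float *dst, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int xl = max(x - 1, 0), xr = min(x + 1, w - 1);
    dst[y * w + x] = (src[y * w + xl] + src[y * w + x] + src[y * w + xr]) / 3.0f;
}

// Applies the filter `passes` times in place on `image` and returns the
// total GPU time in milliseconds, measured with CUDA events.
float run_passes(std::vector<float> &image, int w, int h, int passes) {
    size_t bytes = image.size() * sizeof(float);
    float *a, *b;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMemcpy(a, image.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < passes; ++i) {
        filter_pass<<<grid, block>>>(a, b, w, h);
        std::swap(a, b);                 // this pass's output is the next pass's input
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until every pass has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(image.data(), a, bytes, cudaMemcpyDeviceToHost);  // `a` holds the latest result
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    return ms;
}
```

Reading and writing the same buffer within one pass would let neighboring threads see partially filtered pixels; the ping-pong swap avoids that.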
Hi all,
I am writing a CUDA kernel that computes the histogram of an image, but I have no idea how to return an array from the kernel, and the array changes while other threads are reading it. Is there any possible solution?
__global__ void Hist(
TColor *dst, //input image
int imageW,
int imageH,
int*data
){
const int ix = blockDim.x * blo...
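A kernel cannot return an array, but it can write into device memory that the host allocated beforehand and copies back afterwards with cudaMemcpy; concurrent increments of the same bin are made safe with atomicAdd. The sketch below assumes 8-bit grayscale pixels and 256 bins (the question's TColor pixels would first need converting to a bin index). It keeps a per-block histogram in shared memory, which relies on shared-memory atomics (compute capability 1.2 or later), then merges it into the global result. hist_kernel and image_histogram are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NUM_BINS 256

// Each block builds a private histogram in shared memory, then adds it into
// the global histogram; atomics make concurrent updates to a bin safe.
__global__ void hist_kernel(const unsigned char *pixels, int n, unsigned int *hist) {
    __shared__ unsigned int local[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) local[i] = 0;
    __syncthreads();

    // Grid-stride loop so any grid size covers all n pixels.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[pixels[i]], 1u);
    __syncthreads();

    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local[i]) atomicAdd(&hist[i], local[i]);
}

// The "returned array" is device memory written by the kernel, copied back here.
std::vector<unsigned int> image_histogram(const std::vector<unsigned char> &pixels) {
    int n = static_cast<int>(pixels.size());
    unsigned char *d_pixels;
    unsigned int *d_hist;
    cudaMalloc(&d_pixels, n);
    cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_pixels, pixels.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));

    hist_kernel<<<64, 256>>>(d_pixels, n, d_hist);

    std::vector<unsigned int> hist(NUM_BINS);
    cudaMemcpy(hist.data(), d_hist, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_pixels);
    cudaFree(d_hist);
    return hist;
}
```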
I would like to be able to use a feature in PTX 1.3 which is not yet implemented in the C interface. Is there a way to write my own function in PTX and inject it into an existing binary?
The feature I'm looking for is getting the value of %smid
...
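Later CUDA toolkits accept inline PTX in CUDA C through asm(), which gives access to special registers such as %smid without editing the binary. A sketch, assuming a toolkit with inline-asm support (get_smid, record_smid and smids_for_blocks are hypothetical names):

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reads the PTX special register %smid (ID of the SM running this thread).
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Thread 0 of each block records which SM the block ran on.
__global__ void record_smid(unsigned int *out) {
    if (threadIdx.x == 0) out[blockIdx.x] = get_smid();
}

std::vector<unsigned int> smids_for_blocks(int blocks) {
    unsigned int *d_out;
    cudaMalloc(&d_out, blocks * sizeof(unsigned int));
    record_smid<<<blocks, 32>>>(d_out);
    std::vector<unsigned int> out(blocks);
    cudaMemcpy(out.data(), d_out, blocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

The PTX ISA describes %smid as volatile (a thread may be rescheduled onto another SM) and does not guarantee contiguous SM numbering, so the value is best treated as a diagnostic hint.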
Is there a way to install Parallel Nsight and use it with Visual Studio 2010 without having VS2008 SP1 installed?
The setup checks if VS2008 is installed and won't continue if not.
I know there is no official support for VS2010, but on a forum I found a small application that can integrate Nexus into VS2010, and it seems to work.
Thanks i...
I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md...
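The kernel body is cut off above, but a frequent cause of "only block (0,0) is correct" is computing the row and column from threadIdx alone, so every block writes the top-left tile. A sketch of a naive kernel that includes blockIdx in the global indices, assuming Width x Width row-major float matrices and reusing the 4x4 block size from the question (the matmul host wrapper is hypothetical):

```cuda
#include <cuda_runtime.h>
#include <vector>

// Pd = Md * Nd for Width x Width row-major matrices; one thread per output element.
__global__ void MatrixMulKernel(const float *Md, const float *Nd, float *Pd, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row, not just threadIdx.y
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column
    if (row < Width && col < Width) {
        float sum = 0.0f;
        for (int k = 0; k < Width; ++k)
            sum += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = sum;
    }
}

std::vector<float> matmul(const std::vector<float> &M, const std::vector<float> &N, int Width) {
    size_t bytes = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    cudaMalloc(&Md, bytes);
    cudaMalloc(&Nd, bytes);
    cudaMalloc(&Pd, bytes);
    cudaMemcpy(Md, M.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N.data(), bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(4, 4, 1);
    dim3 dimGrid((Width + 3) / 4, (Width + 3) / 4, 1);   // Width = 16 gives the 4x4 grid above
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    std::vector<float> P(Width * Width);
    cudaMemcpy(P.data(), Pd, bytes, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return P;
}
```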
Hi,
Can I improve this function using CUDA?
What this function does is:
Given a min and a max (ELM1 and ELM), check whether any three numbers of the array ans[6] are found in any row, from min to max, of the arrays D1, D2, D3, D4, D5, D6; if found, return 1.
I tried every other way I could think of, like looping, OR-ing, AND-ing, replacing goto with a flag, etc., but this se...
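A sketch of one possible parallelization, under stated assumptions: row r consists of D1[r]..D6[r], packed here into a single column-major array D with D[c * rows + r]; the row range min..max is inclusive; and a row matches when at least three values of ans[6] appear in it. Each thread checks one row, and any matching thread sets a result flag. match_rows and any_row_matches are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per row in [min_row, max_row]; sets *found = 1 if the row
// contains at least three of the six values in ans.
__global__ void match_rows(const int *D, int rows, int min_row, int max_row,
                           const int *ans, int *found) {
    int r = min_row + blockIdx.x * blockDim.x + threadIdx.x;
    if (r > max_row || r >= rows) return;
    int hits = 0;
    for (int a = 0; a < 6; ++a) {
        for (int c = 0; c < 6; ++c) {
            if (D[c * rows + r] == ans[a]) { ++hits; break; }  // count each ans value once
        }
    }
    if (hits >= 3) *found = 1;   // benign race: every writer stores the same value
}

// Host wrapper: D holds 6 columns of `rows` ints each; ans holds 6 ints.
int any_row_matches(const std::vector<int> &D, int rows, int min_row, int max_row,
                    const std::vector<int> &ans) {
    int *d_D, *d_ans, *d_found;
    int found = 0;
    cudaMalloc(&d_D, D.size() * sizeof(int));
    cudaMalloc(&d_ans, 6 * sizeof(int));
    cudaMalloc(&d_found, sizeof(int));
    cudaMemcpy(d_D, D.data(), D.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ans, ans.data(), 6 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_found, &found, sizeof(int), cudaMemcpyHostToDevice);

    int count = max_row - min_row + 1;
    int threads = 256, blocks = (count + threads - 1) / threads;
    match_rows<<<blocks, threads>>>(d_D, rows, min_row, max_row, d_ans, d_found);

    cudaMemcpy(&found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_D);
    cudaFree(d_ans);
    cudaFree(d_found);
    return found;
}
```

The 36 comparisons per row replace the goto-based early exit; on a GPU, the parallelism across rows usually matters more than exiting a single row early.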
How can I define which file types Visual Assist handles in Visual Studio 2010? I don't like how this tool works with OpenCL and CUDA files, so I would like to turn it off for these file types (otherwise it highlights 1000 errors).
thx.
...
The new MacBook Pros come with two graphics adapters, the Intel HD Graphics and the NVIDIA GeForce GT 330M. OS X switches back and forth between them depending on the workload, detection of an external monitor, or activation of Rosetta.
I want to get my feet wet with CUDA programming, and unfortunately the CUDA SDK doesn't seem to take...
Is it possible to launch two kernels that do independent tasks simultaneously? For example, if I have this CUDA code:
// host and device initialization
.......
.......
// launch kernel1
myMethod1 <<<.... >>> (params);
// launch kernel2
myMethod2 <<<.....>>> (params);
Assuming that these kernels are independent, is there a facility to...
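Kernels launched into the same (default) stream run one after another. Launching them into different non-default streams allows them to overlap on GPUs that support concurrent kernel execution (compute capability 2.0 and later). A sketch with trivial hypothetical bodies for myMethod1 and myMethod2 (run_two_kernels is also a hypothetical name):

```cuda
#include <cuda_runtime.h>

// Hypothetical independent kernels.
__global__ void myMethod1(int *out) { out[threadIdx.x] = 1; }
__global__ void myMethod2(int *out) { out[threadIdx.x] = 2; }

// Launches the two kernels in separate streams so the hardware may run them concurrently.
bool run_two_kernels() {
    int *a, *b;
    cudaMalloc(&a, 32 * sizeof(int));
    cudaMalloc(&b, 32 * sizeof(int));
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    myMethod1<<<1, 32, 0, s1>>>(a);   // 4th launch parameter selects the stream
    myMethod2<<<1, 32, 0, s2>>>(b);
    cudaDeviceSynchronize();          // wait for both streams to finish

    int ha[32], hb[32];
    cudaMemcpy(ha, a, sizeof(ha), cudaMemcpyDeviceToHost);
    cudaMemcpy(hb, b, sizeof(hb), cudaMemcpyDeviceToHost);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return ha[0] == 1 && hb[31] == 2;
}
```

Overlap is permitted, not guaranteed: if the first kernel occupies the whole GPU, the second one simply starts as resources free up.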
For work, I am converting the Image Denoising program that comes with the CUDA SDK into a MATLAB program. As far as I know, I have made all the necessary changes required by MATLAB, but when I try to call mex on it, MATLAB returns a bunch of linkage errors that I have no idea how to fix. If anyone has any suggestions on what I might be d...
When compiling the CUDA SDK, I get "nvcc fatal : Unsupported gpu architecture 'compute_20'". My toolkit is 2.3 and I'm on a shared system (i.e. I can't really upgrade); the driver version is also 2.3, running on four Tesla C1060s.
If it helps, the error comes up in radixsort.
It appears that a few people online have had this ...
Hi,
I'd like to handle 64-bit words directly on the CUDA platform (e.g. uint64_t variables).
I understand, however, that the address space, registers and the SP architecture are all 32-bit based.
I actually found this to work correctly (on my CUDA cc1.1 card):
__global__ void test64Kernel( uint64_t *word )
{
(*word) <<= 56;
}
but I don'...
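uint64_t is supported in device code; the compiler typically emulates 64-bit integer operations with several 32-bit instructions, so they work but cost more than their 32-bit counterparts. A sketch extending the question's kernel with a 64-bit rotate and the __popcll population-count intrinsic (rotate64Kernel and run64 are hypothetical names):

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// word[0]: input, replaced by its rotation right by 8 bits; word[1]: its popcount.
__global__ void rotate64Kernel(uint64_t *word) {
    uint64_t w = *word;
    w = (w << 56) | (w >> 8);            // shifts that cross the 32-bit halves
    word[0] = w;
    word[1] = (uint64_t)__popcll(w);     // 64-bit population count intrinsic
}

// Runs the kernel on one input word and returns both results through pointers.
void run64(uint64_t in, uint64_t *rotated, uint64_t *bits) {
    uint64_t *d;
    cudaMalloc(&d, 2 * sizeof(uint64_t));
    cudaMemcpy(d, &in, sizeof(uint64_t), cudaMemcpyHostToDevice);
    rotate64Kernel<<<1, 1>>>(d);
    uint64_t h[2];
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    *rotated = h[0];
    *bits = h[1];
}
```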