Let's assume that I have a computer with a multicore processor and a GPU. I would like to write an OpenCL program that runs on all cores of the platform. Is this possible, or do I need to choose a single device on which to run the kernel?
...
The array to sort has approximately one million strings, where each string can be up to one million characters long.
I am looking for any implementation of a sorting algorithm for the GPU.
I have a block of data approximately 1 MB in size and I need to construct a suffix array. Now you can see how it is possible to have one million strings i...
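To illustrate why a suffix array over about 1 MB of data amounts to sorting about one million strings: every byte offset starts one suffix, and the suffix array is the list of those offsets in sorted order. Below is a minimal CPU-side sketch in C, not a GPU implementation. It assumes NUL-terminated text with no embedded zero bytes, and the names `g_text`, `compare_suffixes`, and `build_suffix_array` are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Text whose suffixes are compared. It lives at file scope because the
   standard qsort() comparator receives no user context argument. */
static const char *g_text;

/* Compare two suffixes, each identified by its start offset in g_text. */
static int compare_suffixes(const void *a, const void *b)
{
    size_t i = *(const size_t *)a;
    size_t j = *(const size_t *)b;
    return strcmp(g_text + i, g_text + j);
}

/* Fill sa[0..n-1] with the start offsets of all suffixes of text,
   in lexicographic order. Assumes strlen(text) == n. */
static void build_suffix_array(const char *text, size_t n, size_t *sa)
{
    for (size_t i = 0; i < n; ++i)
        sa[i] = i;              /* one suffix per offset */
    g_text = text;
    qsort(sa, n, sizeof sa[0], compare_suffixes);
}
```

This naive comparison sort can be very slow on long repetitive inputs, since each comparison may scan many characters; it is meant only to show what is being sorted.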
Hi,
I have a program that uses the GPU to perform certain computations. I can get the program to run correctly from the command line, but when I try to execute the same statement through PHP, I run into trouble.
I'm using WAMP 2.0, and I've tried the exec and proc_open functions to try to get the program to run, but even though th...
This is part of my header file ("aes_locl.h"):
.
.
# define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00)
# define GETU32(p) SWAP(*((u32 *)(p)))
# define PUTU32(ct, st) { *((u32 *)(ct)) = SWAP((st)); }
.
.
Now, in the .cu file, I have declared a __global__ function and included the header file like this:
#include "...
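For readers following the macros above: `SWAP(x)` reverses the byte order of a 32-bit value using the compiler-provided `_lrotl`/`_lrotr` rotate intrinsics. The following is a minimal sketch in plain C of the same byte swap written with shifts only, so it does not depend on those intrinsics. Whether this relates to the problem in the truncated question is an assumption; `rotl32`, `rotr32`, and `swap32` are hypothetical names, and `uint32_t` stands in for the header's `u32`.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (valid for 0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* Rotate a 32-bit value right by n bits (valid for 0 < n < 32). */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Byte-swap a 32-bit value; mirrors the SWAP(x) macro term by term. */
static uint32_t swap32(uint32_t x)
{
    return (rotl32(x, 8) & 0x00ff00ffu) | (rotr32(x, 8) & 0xff00ff00u);
}
```

For example, `swap32(0x11223344u)` yields `0x44332211u`, and applying it twice returns the original value.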
Hi,
Can someone please explain the difference between texture memory as used in the context of CUDA and texture memory as used in the context of DirectX? Suppose a graphics card has 512 MB of advertised memory; how is it divided into constant memory, texture memory, and global memory?
E.g. I have a Tesla card that has totalConstMem as ...
I am calling cudaMemcpy and the copy returns successfully; however, the source values are not being copied to the destination. I wrote a similar piece using memcpy() and that works fine. What am I missing here?
// host externs
extern unsigned char landmask[DIMX * DIMY];
// use device constant memory for landmask
unsigned char *tempmask;
...
I have a large array (say 512K elements), GPU resident, where only a small fraction of elements (say 5K randomly distributed elements - set S) needs to be processed. The algorithm to find out which elements belong to S is very efficient, so I can easily create an array A of pointers or indexes to elements from set S.
What is the most e...
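The index-array approach described above can be sketched on the CPU as follows. This is a minimal illustration in C, not a claim about the most efficient GPU method: on a GPU, each loop iteration would typically map to one thread or work item, so only as many threads as there are entries in A are launched. The names `process_selected` and `double_value` are hypothetical, and `double_value` is a placeholder for the real per-element work.

```c
#include <stddef.h>

/* Placeholder per-element operation (hypothetical). */
static float double_value(float v)
{
    return 2.0f * v;
}

/* Apply op only to the elements of data listed in indices[0..count-1].
   Elements not listed are left untouched. */
static void process_selected(float *data, const size_t *indices,
                             size_t count, float (*op)(float))
{
    for (size_t k = 0; k < count; ++k) {
        size_t i = indices[k];   /* position of the k-th element of S */
        data[i] = op(data[i]);
    }
}
```

Because the indices are randomly distributed, the accesses to `data` are scattered, which is usually the main cost of this pattern on a GPU.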
Some of the concepts and designs of the "SIMT" architecture are still unclear to me.
From what I've seen and read, diverging code paths and if() altogether are a rather bad idea, because many threads might execute in lockstep. What exactly does that mean? What about something like:
kernel void foo(..., int flag)
{
if (flag)
...
Hello, is there any way to allocate memory on the host that is directly accessible from the GPU, without copying?
Like cudaHostGetDevicePointer in CUDA.
...
Hi,
I am using OpenGL to do some GPGPU computations through the combination of one vertex shader and one fragment shader. I need to do computations on an image at different scales. I would like to use mipmaps since their generation can be automatic and hardware accelerated. However, I can't manage to get access to the mipmap textures in th...
I need some quick advice.
I would like to simulate a cellular automaton (from A Simple, Efficient Method for Realistic Animation of Clouds) on the GPU. However, I am limited to OpenGL ES 2.0 shaders (in WebGL), which do not support any bitwise operations.
Since every cell in this cellular automaton represents a boolean value, storing 1 ...
I am getting started with OpenCL on .NET. How does OpenTK compare to OpenCL.NET? Which is better?
...
Consider this the complete form of the question in the title: Since OpenCL may become the common standard for serious GPU programming (and for programming other devices), why not, when programming for OpenGL, implement all GPU operations in OpenCL as a future-proof approach? That way you get the advantages of GLSL, without its program...
To what extent does OpenGL's GLSL utilize SLI setups? Is it utilized at all at the point of execution or only for end rendering?
Similarly, I know that OpenCL is alien to SLI but assuming one has several GPUs, how does it compare to GLSL in multiprocessing?
Since it might depend on the application, e.g. common transformation, or ray tr...
What would happen if four concurrent CUDA applications were competing for resources on a single GPU so that they can offload work to the graphics card? The CUDA Programming Guide 3.1 mentions that certain methods are asynchronous:
Kernel launches
Device ↔ device memory copies
Host ↔ device memory copies of a memory...
I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.
So here is the kernel code, and I'll try to explain it a bit more clearly with the variables.
__global__ void kernelSumWeights(float* sumArray, float* weightArray, int...
I'm trying to do some simple image processing using OpenGL. Since I couldn't find any good library that already does this, I've been trying to write my own solution.
I simply want to compose a few images on the GPU and then read them back. However, the performance of my implementation seems almost equal to what it takes to do on the CPU... someth...
Hello, everyone!
Please tell me which GPGPU technologies already exist and which hardware vendors implement GPGPU.
I've been reading articles on various sites since this morning and I've become confused.
...
Hey all,
I have been having a tough time setting up an experiment where I allocate memory with CUDA on the device, take that pointer to memory on the device, use it in OpenCL, and return the results. I want to see if this is possible. I had a tough time getting a CUDA project to work so I just used Nvidia's template project in their SDK...
I don't know whether this is the right forum. Anyway, here is the question. In one of our applications we display medical images and, on top of them, an algorithm-generated bitmap. The real bitmap is a 16-bit grayscale bitmap. From this we generate a color bitmap based on a lookup table, e.g.
(0-100)->green
(100-200)->blue
(200>above)...