1. Is it possible that more than one processor would be given to a single block, so that my program gets some benefit from a multiprocessor platform?
Simple answer: No.
The CUDA programming model maps one threadblock to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not move from one multiprocessor to another.
As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low-cost operation, partly because all the threads within a block are in close proximity (on the same SM); if a block could be split across multiprocessors, this would no longer be possible. In addition, threads within a block can cooperate by sharing data in shared memory; shared memory is local to an SM, so splitting a block would break this too.
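To make that concrete, here is a minimal sketch of threads cooperating through shared memory; the kernel name, the fixed block size of 256 and the reversal operation are all just illustrative:

__global__ void reverseBlock(int *data)
{
    __shared__ int tile[256];   // shared memory lives on the block's SM
    int tid = threadIdx.x;      // assumes a one-dimensional block of 256 threads

    tile[tid] = data[tid];      // each thread stages one element
    __syncthreads();            // cheap, because the whole block is on one SM

    data[tid] = tile[blockDim.x - 1 - tid];   // read an element staged by another thread
}

None of this would work if the threads of one block could be spread over more than one SM.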
2. Can I synchronize the threads of different blocks?
Not really, no. There are some things you can do, like having the very last block do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of the blocks onto the multiprocessors, so any attempt at global synchronisation would risk deadlock.
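For flavour, here is a minimal sketch of that last-block trick; the kernel and variable names are mine, and the per-block "work" is kept trivial for brevity:

__device__ unsigned int blocksDone = 0;

__global__ void lastBlockSum(float *partial, float *total)
{
    if (threadIdx.x == 0)
    {
        partial[blockIdx.x] = 1.0f;   // stand-in for this block's real result
        __threadfence();              // publish the write before taking a ticket
        unsigned int ticket = atomicAdd(&blocksDone, 1);
        if (ticket == gridDim.x - 1)  // this is the very last block to finish
        {
            float sum = 0.0f;
            for (int i = 0; i < gridDim.x; ++i)
                sum += partial[i];    // safe: every other block has already written
            *total = sum;
            blocksDone = 0;           // reset for the next launch
        }
    }
}

Note that only the last block is special; nothing here makes the other blocks wait for one another, which is exactly why general synchronisation is off the table.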
3. How do I find out the warp size? Is it fixed for a particular piece of hardware?
Yes, it is fixed. In fact, for all current CUDA-capable devices (both 1.x and 2.0) it is fixed at 32. If your code relies on the warp size, you should still check it rather than hard-coding 32, to ensure forward compatibility.
In device code you can just use the special variable warpSize.
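For example (the kernel is purely illustrative):

__global__ void laneKernel(int *lane)
{
    // warpSize is built in; no query is needed inside device code.
    lane[threadIdx.x] = threadIdx.x % warpSize;   // this thread's lane within its warp
}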
In host code you can query the warp size for a specific device with:
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaError_t result;
    int deviceID;
    struct cudaDeviceProp prop;

    // Find out which device the calling host thread is using.
    result = cudaGetDevice(&deviceID);
    if (result != cudaSuccess)
    {
        // Handle the error; reporting and bailing out is one option.
        fprintf(stderr, "cudaGetDevice: %s\n", cudaGetErrorString(result));
        return 1;
    }
    result = cudaGetDeviceProperties(&prop, deviceID);
    if (result != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(result));
        return 1;
    }
    int warpSize = prop.warpSize;
    printf("warp size: %d\n", warpSize);
    return 0;
}