views: 291

answers: 5

Hi all, I have three questions:

  1. If I create only one block of threads in CUDA and run the parallel program on it, is it possible that more than one processor would be given to that single block, so that my program gets some benefit from the multiprocessor platform? To be more clear: if I use only one block of threads, how many processors will be allocated to it? As far as I know (I might have misunderstood), one warp is given to only a single processing element.
  2. Can I synchronize the threads of different blocks? If yes, please give some hints on how to do it.
  3. How do I find out the warp size? Is it fixed for particular hardware?

Thanks in advance; I know I'll get replies, as I always do.

A: 
  1. As of CUDA 2.3, it is one multiprocessor per thread block. It might be different in CUDA 3 / on Fermi processors; I do not remember.

  2. Not really, but... (depending on your requirements you may find a workaround) read this post: http://stackoverflow.com/questions/1644985/cuda-synchronizing-threads

aaa
1. It means I'll not get any benefit if I use only one block, since only one processor is assigned to a block? 2. I saw the post, but I need to synchronize threads across blocks. 3. Also, I need to update a global array; update means insert and delete elements.
Vickey
@Vic 1 - basically no benefit. 2 - that post mentions the thread fence operation, which may provide some benefit. 3 - not sure what you're asking.
aaa
@Vickey It may be worth posting more details about your specific problem (either in a new question here or on the NVIDIA forums), since it may be possible to avoid the need for global synchronisation.
Tom
A: 

#3. You can query SIMDWidth using cuDeviceGetProperties - see the documentation.
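
For example, with the driver API that query might look like this (a minimal sketch; error checking is omitted, and it assumes device ordinal 0):

#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUdevprop prop;

    cuInit(0);                         /* initialise the driver API        */
    cuDeviceGet(&dev, 0);              /* take the first CUDA device       */
    cuDeviceGetProperties(&prop, dev); /* fill in the device property struct */
    printf("SIMDWidth (warp size): %d\n", prop.SIMDWidth);
    return 0;
}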

Edric
there is a predefined variable `warpSize` you can use
aaa
Thanks for your post, I'm grateful. Please see my post again; I have pasted some code there. Please look at that code and share your view.
Vickey
A: 
Vickey
It is a good idea, but it is not going to work. The problem is that different thread blocks are not guaranteed to execute at the same time. In general, thread block A may run *after* block B has finished, so you cannot synchronize things which are not running simultaneously.
aaa
I have made a minor change to fix a bug I had introduced by mistake. Now, as I understand it, atomicAdd() adds a value atomically in global memory (or in shared memory for a block). Since in this code all the blocks stay in the for loop until the last block has executed the loop once (the last block may be somewhat lagging), it should provide synchronization (almost, since the last block will still need to finish its own iteration). Can you please explain in some more detail why this code will not work?
Vickey
Firstly, you should ask this as a separate question rather than posting a new question as an answer to your first question! No one gets points for answering via comments. Secondly, there is no such variable as threadDim.x. Thirdly, aaa correctly said that you have no control over the scheduling. For example, if total_threads exceeds the number that can be executed on the GPU, then it will be impossible for glob_var to ever reach total_threads and you will have deadlock.
Tom
+1  A: 

To synchronize threads across multiple blocks (at least as far as memory updates are concerned), you can use the new __threadfence_system() call, which is only available on Fermi devices (Compute Capability 2.0 and higher). This function is described in the CUDA Programming Guide for CUDA 3.0.
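
The typical use is to publish data behind a flag (a minimal sketch; the names payload, flag and producer are made up). Note, as pointed out in the comment below, that this only orders the writes; it does not make other threads wait:

__device__ int payload;
__device__ volatile int flag = 0;

__global__ void producer(int value)
{
    payload = value;         /* write the data first                        */
    __threadfence_system();  /* make the write visible to the whole system */
    flag = 1;                /* only then publish the "ready" flag          */
}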

Edric
Note that `__threadfence_system()` is *not* the same as synchronisation. A thread fence merely ensures that all memory operations from the thread are visible to the rest of the system. It does not synchronise, i.e. it does not cause threads to wait at this point until all threads within the grid have reached this point.
Tom
+1  A: 

1. Is it possible that more than one processor would be given to a single block, so that my program gets some benefit from the multiprocessor platform?

Simple answer: No.

The CUDA programming model maps one thread block to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not move from one multiprocessor to another.

As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low cost operation, and that's partly because all the threads within a block are in close proximity (on the same SM). If they were allowed to split then this would no longer be possible. In addition, threads within a block can cooperate by sharing data in the shared memory; the shared memory is local to a SM and hence splitting the block would break this too.
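
For example (a minimal sketch; the kernel name and the assumption of a single 256-thread block are mine), threads can stage data in shared memory and must call __syncthreads() before reading elements written by other threads:

__global__ void reverseBlock(int *data)
{
    __shared__ int tile[256];             /* shared memory, local to one SM         */
    int t = threadIdx.x;

    tile[t] = data[t];                    /* each thread stages one element         */
    __syncthreads();                      /* wait until the whole block has written */
    data[t] = tile[blockDim.x - 1 - t];   /* now safe to read other threads' data   */
}

Launched as reverseBlock<<<1, 256>>>(d_data), each thread reads an element that a different thread wrote, which is only safe because of the barrier.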

2. Can I synchronize the threads of different blocks?

Not really, no. There are some things you can do, like getting the very last block to do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of the blocks onto the multiprocessors, so any attempt to do global synchronisation would risk deadlock.
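
That last-block trick looks roughly like this (a minimal sketch in the spirit of the SDK sample, not the sample itself; sumKernel, partial and retirementCount are made-up names, and the real per-block reduction is reduced to a stand-in):

__device__ unsigned int retirementCount = 0;

__global__ void sumKernel(const float *in, float *partial, float *out)
{
    __shared__ bool amLast;

    /* Stand-in for a real per-block reduction. */
    if (threadIdx.x == 0)
        partial[blockIdx.x] = in[blockIdx.x];

    __threadfence();    /* make this block's partial result visible to other blocks */

    if (threadIdx.x == 0) {
        /* The block that takes the last ticket knows every other block is done. */
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        amLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            sum += partial[b];
        *out = sum;
        retirementCount = 0;    /* reset so the kernel can be launched again */
    }
}

This gives the last block a consistent view of the other blocks' results, but it is not a general barrier: all other blocks simply finish and exit.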

3. How do I find out the warp size? Is it fixed for particular hardware?

Yes, it is fixed. In fact, for all current CUDA-capable devices (both 1.x and 2.0) it is fixed to 32. If you are relying on the warp size then you should ensure forward compatibility by querying it rather than hard-coding it.

In device code you can just use the special variable warpSize. In host code you can query the warp size for a specific device with:

cudaError_t result;
int deviceID;
struct cudaDeviceProp prop;

/* Find out which device the host thread is currently using. */
result = cudaGetDevice(&deviceID);
if (result != cudaSuccess)
{
    ...
}
/* Fetch that device's properties, which include the warp size. */
result = cudaGetDeviceProperties(&prop, deviceID);
if (result != cudaSuccess)
{
    ...
}

int warpSize = prop.warpSize;
Tom
minor addition: at the device level, warpSize is predefined, like threadIdx and the rest
aaa
@aaa: Good point, I've updated my post. Thanks.
Tom
Thanks for the reply. Could you take a look at the code I posted earlier and suggest something for synchronizing the blocks? Thanks again for your reply.
Vickey