1. Is it possible that more than one processor would be given to a single block, so that my program gets some benefit from a multiprocessor platform?
Simple answer: No.
The CUDA programming model maps one threadblock to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not move from one multiprocessor to another.
As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low-cost operation, partly because all the threads within a block are in close proximity (on the same SM); if a block could be split across multiprocessors, this would no longer be possible. In addition, threads within a block can cooperate by sharing data in shared memory; shared memory is local to an SM, so splitting a block would break this too.
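To make that concrete, here is a minimal sketch of threads cooperating through shared memory; the kernel name, the fixed block size of 256 and the reversal operation are all just illustrative:

__global__ void reverseBlock(int *data)
{
    __shared__ int tile[256];   // shared memory lives on the block's SM
    int tid = threadIdx.x;      // assumes a one-dimensional block of 256 threads

    tile[tid] = data[tid];      // each thread stages one element
    __syncthreads();            // cheap, because the whole block is on one SM

    data[tid] = tile[blockDim.x - 1 - tid];   // read an element staged by another thread
}

None of this would work if the threads of one block could be spread over more than one SM.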
2. Can I synchronize the threads of different blocks?
Not really, no. There are some things you can do, like having the very last block do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of the blocks onto the multiprocessors, so any attempt at global synchronisation would risk deadlock.
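For flavour, here is a minimal sketch of that last-block trick; the kernel and variable names are mine, and the per-block "work" is kept trivial for brevity:

__device__ unsigned int blocksDone = 0;

__global__ void lastBlockSum(float *partial, float *total)
{
    if (threadIdx.x == 0)
    {
        partial[blockIdx.x] = 1.0f;   // stand-in for this block's real result
        __threadfence();              // publish the write before taking a ticket
        unsigned int ticket = atomicAdd(&blocksDone, 1);
        if (ticket == gridDim.x - 1)  // this is the very last block to finish
        {
            float sum = 0.0f;
            for (int i = 0; i < gridDim.x; ++i)
                sum += partial[i];    // safe: every other block has already written
            *total = sum;
            blocksDone = 0;           // reset for the next launch
        }
    }
}

Note that only the last block is special; nothing here makes the other blocks wait for one another, which is exactly why general synchronisation is off the table.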
3. How do I find out the warp size? Is it fixed for a particular piece of hardware?
Yes, it is fixed. In fact, for all current CUDA-capable devices (both 1.x and 2.0) it is fixed at 32. If your code relies on the warp size, you should still check it rather than hard-coding 32, to ensure forward compatibility.
In device code you can just use the special variable warpSize.
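For example (the kernel is purely illustrative):

__global__ void laneKernel(int *lane)
{
    // warpSize is built in; no query is needed inside device code.
    lane[threadIdx.x] = threadIdx.x % warpSize;   // this thread's lane within its warp
}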
In host code you can query the warp size for a specific device with:
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaError_t result;
    int deviceID;
    struct cudaDeviceProp prop;

    // Find out which device the calling host thread is using.
    result = cudaGetDevice(&deviceID);
    if (result != cudaSuccess)
    {
        // Handle the error; reporting and bailing out is one option.
        fprintf(stderr, "cudaGetDevice: %s\n", cudaGetErrorString(result));
        return 1;
    }
    result = cudaGetDeviceProperties(&prop, deviceID);
    if (result != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(result));
        return 1;
    }
    int warpSize = prop.warpSize;
    printf("warp size: %d\n", warpSize);
    return 0;
}