ansaurus

Question

CUDA, more threads for same work = Longer run time despite better occupancy, Why?

Answer 1

A:

two functions have different number of code lines, so different number of instructions
for loop is implemented using branches. last line of code always divergent
global store request is not the same as global score. operation is set up, but never commited.

aaa 2010-03-16 01:35:36

Answer 2

+4 A:

Since your code has only arithmetic instructions, you don't need very high occupancy to hide the latency of the arithmetic units. Indeed, even if you do have memory instructions you can maximise performance with ~50% occupancy provided your reads/writes are efficient. See the recorded Advanced CUDA C presentation for more information on occupancy and performance.

In your case, given that your kernel doesn't need high occupancy to saturate the arithmetic units, you will have better performance using fewer larger blocks than more smaller blocks since there is a cost for launching blocks. In general however the cost of launching blocks is negligible compared with the time to actually run the code.

Why are there more instructions?

Remember that the counters are not counting per block (aka CTA) but instead per SM (Streaming Multiprocessor) or per TPC (Texture Processing Cluster) which is a group of two or three SMs depending on your device. The instructions count is per SM.

It is fair to expect the less_threads kernel to have fewer instructions, however you are launching four times as many warps per block which means each SM will execute the code approximately four times as many times. Taking into account the shorter kernel code, your measurement doesn't seem unreasonable.

Why is there any branching?

Actually you do have conditional code:

for (int j=0;j<800;++j)

This has a condition, however all threads within a warp are indeed executing the same path so it is not divergent. My guess is the divergence is in the administration code somewhere, you could take a look at the PTX code to analyse this if you were worried. 26 is very low compared with the number of instructions executed, so this will not affect your performance.

Why are there any gst requests?

In your code you have:

if (threadIdx.x == -1)
  d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers;

This will be handled by the load/store unit and hence counted even though it results in no actual transaction. The gst_32/gst_64/gst_128 counters indicate actual memory transfers (your device has compute capability 1.2 or 1.3, older devices have different sets of counters).

Tom 2010-03-16 08:27:44

Thanks for the informative answers. I am still unsure about the number of instructions. As you noted, less_threads does 4 times as many num_inliers += j*(j+n) but uses 4 times more threads. The number of instructions is extrapolated from a single SM or TPC, but overall shouldn't the numbers be equal?

zenna 2010-03-16 12:15:12

From examining the PTX it is apparent that due to the overhead of the for loop, more_threads is actually executing less instructions. Could this account for the difference? If it would, then proportionally increasing the work done within the loops of both kernels would diminish this for loop overhead, but it does not from my tests.

zenna 2010-03-16 12:15:41

I have seen the Advanced CUDA tutorial and I understand performance may not increase above 50%, but what we are seeing here is a decrease in performance. The number of blocks launched is the same, just the number of threads per block in less_threads is less.

zenna 2010-03-16 12:16:22

Answer 3

+2 A:

The two functions are not doing the same amount of work.

more_threads<<<780, 128>>>():

780 blocks
128 threads per block
4 mul per loop
8 add per loop
780*128*800*(4+8) = 958,464,000 flops

less_threads<<<780, 32>>>():

780 blocks
32 threads per block
12 mul per loop
24 add per loop
780*32*800*(12+24) = 718,848,000 flops

So, more_threads is doing more work than less threads, which is why the number of instructions goes up and why more_threads is slower. To fix more_threads, do only 3 computations inside the loop: 780*128*800*(3+6) = 718,848,000.

mch 2010-03-18 15:31:07

ansaurus

tags:

views:

answers:

CUDA, more threads for same work = Longer run time despite better occupancy, Why?

related questions