tags:
views: 368
answers: 4

I have some CUDA code that nvcc (well, technically ptxas) likes to take upwards of 10 minutes to compile. While it isn't small, it certainly isn't huge (~5000 lines).

The delay seems to come and go between CUDA version updates, but previously it only took a minute or so instead of 10.

When I used the -v option, it seemed to get stuck after displaying the following:

ptxas --key="09ae2a85bb2d44b6" -arch=sm_13 "/tmp/tmpxft_00002ab1_00000000-2_trip3dgpu_kernel.ptx" -o "/tmp/tmpxft_00002ab1_00000000-9_trip3dgpu_kernel.sm_13.cubin"

The kernel does have a fairly large parameter list, and a structure with a good number of pointers is passed around, but I do know there was at least one point in time when very nearly the exact same code compiled in only a couple of seconds.

I am running 64-bit Ubuntu 9.04, if that helps.

Any ideas?

A: 

You should note that there is a limit on the size of the parameter list that can be passed to a function, currently 256 bytes (see section B.1.4 of the CUDA Programming Guide). Has the function signature changed at all?

There is also a limit of 2 million PTX instructions per kernel, but you shouldn't be approaching that ;-)

What version of the toolkit are you using? The 3.0 beta, which is a major update, is available if you are a registered developer. If you still have the problem you should contact NVIDIA; they will, of course, need to be able to reproduce it.

Tom
I'm aware of the parameter restriction; I've run into that problem before. (Annoyingly, I don't seem able to get around it just by using structs.) It's unclear to me, however, whether that's a factor in the slowdown.
rck
A: 

I have the same problem, and the compile time is about 2 hours, yet only a few seconds in emulation mode. Have you solved this problem?

Liangxu Wang
Not yet. I did a significant code restructuring for other reasons, and that helped a lot, but I wasn't able to determine the root cause. (It's still far slower than compiling in emulation mode.)
rck
+1  A: 

I had a similar problem: without optimization, compilation failed by running out of registers, and with optimizations it took nearly half an hour. My kernel had expressions like

t1itern[II(i,j)] = (1.0 - overr) * t1itero[II(i,j)] + overr * (rhs[IJ(i-1,j-1)].rhs1 - abiter[IJ(i-1,j-1)].as  * t1itern[II(i,j - 1)] - abiter[IJ(i-1,j-1)].ase * t1itero[II(i + 1,j - 1)] - abiter[IJ(i-1,j-1)].ae  * t1itern[II(i + 1,j)] - abiter[IJ(i-1,j-1)].ane * t1itero[II(i + 1,j + 1)] - abiter[IJ(i-1,j-1)].an  * t1itern[II(i,j + 1)] - abiter[IJ(i-1,j-1)].anw * t1itero[II(i - 1,j + 1)] - abiter[IJ(i-1,j-1)].aw  * t1itern[II(i - 1,j)] - abiter[IJ(i-1,j-1)].asw * t1itero[II(i - 1,j - 1)] - rhs[IJ(i-1,j-1)].aads * t2itern[II(i,j - 1)] - rhs[IJ(i-1,j-1)].aadn * t2itern[II(i,j + 1)] - rhs[IJ(i-1,j-1)].aade * t2itern[II(i + 1,j)] - rhs[IJ(i-1,j-1)].aadw * t2itern[II(i - 1,j)] - rhs[IJ(i-1,j-1)].aadc * t2itero[II(i,j)]) / abiter[IJ(i-1,j-1)].ac;

and when I rewrote them:

tt1 = lrhs.rhs1;
tt1 = tt1 - labiter.as  * t1itern[II(1,j - 1)];
tt1 = tt1 - labiter.ase * t1itern[II(2,j - 1)];
tt1 = tt1 - labiter.ae  * t1itern[II(2,j)];
//etc

it significantly reduced compilation time and register usage.

aland
Interesting. When you rewrote them, did it still fail without optimization? That is, did rewriting it like that only give the optimizer enough of a hint to save registers, or is the basic compiler able to as well?
rck
It helped even without optimization. It looks like nvcc has trouble in the basic compiler, and the optimizer issue is just a consequence.
aland
A: 

Setting -maxrregcount 64 on the compile line helps, since it causes the register allocator to spill to local memory (lmem) earlier.
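For example, a sketch of such a compile line (the file name and architecture flag are placeholders, not the asker's actual build):

```shell
# Cap registers per thread at 64; values that don't fit spill to
# local memory instead of driving up register allocation time.
nvcc -arch=sm_13 -maxrregcount 64 -c kernel.cu -o kernel.o
```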

James Sharpe