tags:
views: 368
answers: 4

I have some CUDA code that nvcc (well, technically ptxas) likes to take upwards of 10 minutes to compile. While it isn't small, it certainly isn't huge (~5000 lines).

The delay seems to come and go between CUDA version updates, but previously it only took a minute or so instead of 10.

When I used the -v option, it seemed to get stuck after displaying the following:

ptxas --key="09ae2a85bb2d44b6" -arch=sm_13 "/tmp/tmpxft_00002ab1_00000000-2_trip3dgpu_kernel.ptx" -o "/tmp/tmpxft_00002ab1_00000000-9_trip3dgpu_kernel.sm_13.cubin"

The kernel does have a fairly large parameter list, and a structure with a good number of pointers is passed around, but I do know there was at least one point in time when very nearly the exact same code compiled in only a couple of seconds.

I am running 64-bit Ubuntu 9.04, if that helps.

Any ideas?

A: 

You should note that there is a limit on the size of the parameter list that can be passed to a function, currently 256 bytes (see section B.1.4 of the CUDA Programming Guide). Has the function signature changed at all?

There is also a limit of 2 million PTX instructions per kernel, but you shouldn't be approaching that ;-)

What version of the toolkit are you using? The 3.0 beta, which is a major update, is available if you are a registered developer. If you still have the problem you should contact NVIDIA; they will, of course, need to be able to reproduce it.

Tom
I'm aware of the parameter restriction; I've run into that problem before. (Annoyingly, I don't seem able to get around it just by using structs.) It's unclear to me, however, whether that's a factor in the slowdown.
rck
A: 

I have the same problem, and the compile time is about 2 hours, yet only a few seconds in emulation mode. Have you solved this problem?

Liangxu Wang
Not yet. I did a significant code restructuring for other reasons, and that helped a lot, but I wasn't able to determine the root cause. (It's still far slower than compiling in emulation mode.)
rck
+1  A: 

I had a similar problem: without optimization, compilation failed by running out of registers, and with optimizations it took nearly half an hour. My kernel had expressions like

t1itern[II(i,j)] = (1.0 - overr) * t1itero[II(i,j)] + overr * (rhs[IJ(i-1,j-1)].rhs1 - abiter[IJ(i-1,j-1)].as  * t1itern[II(i,j - 1)] - abiter[IJ(i-1,j-1)].ase * t1itero[II(i + 1,j - 1)] - abiter[IJ(i-1,j-1)].ae  * t1itern[II(i + 1,j)] - abiter[IJ(i-1,j-1)].ane * t1itero[II(i + 1,j + 1)] - abiter[IJ(i-1,j-1)].an  * t1itern[II(i,j + 1)] - abiter[IJ(i-1,j-1)].anw * t1itero[II(i - 1,j + 1)] - abiter[IJ(i-1,j-1)].aw  * t1itern[II(i - 1,j)] - abiter[IJ(i-1,j-1)].asw * t1itero[II(i - 1,j - 1)] - rhs[IJ(i-1,j-1)].aads * t2itern[II(i,j - 1)] - rhs[IJ(i-1,j-1)].aadn * t2itern[II(i,j + 1)] - rhs[IJ(i-1,j-1)].aade * t2itern[II(i + 1,j)] - rhs[IJ(i-1,j-1)].aadw * t2itern[II(i - 1,j)] - rhs[IJ(i-1,j-1)].aadc * t2itero[II(i,j)]) / abiter[IJ(i-1,j-1)].ac;

and when I rewrote them:

tt1 = lrhs.rhs1;
tt1 = tt1 - labiter.as  * t1itern[II(1,j - 1)];
tt1 = tt1 - labiter.ase * t1itern[II(2,j - 1)];
tt1 = tt1 - labiter.ae  * t1itern[II(2,j)];
//etc

it significantly reduced compilation time and register usage.

aland
Interesting. When you rewrote them, did it still fail without optimization? That is, did rewriting it like that only give the optimizer enough of a hint to save registers, or is the basic compiler able to as well?
rck
It helped even without optimization. It looks like nvcc has trouble in the basic compiler, and the optimizer issue is just a consequence.
aland
A: 

Setting -maxrregcount 64 on the compile line helps, since it causes the register allocator to spill to local memory (lmem) earlier.
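For example, a sketch of such a compile line (the file name and architecture flag are placeholders, not the asker's actual build):

```shell
# Cap registers per thread at 64; values that don't fit spill to
# local memory instead of driving up register allocation time.
nvcc -arch=sm_13 -maxrregcount 64 -c kernel.cu -o kernel.o
```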

James Sharpe