I'm trying to reduce the number of instructions and constant memory reads in a CUDA kernel.
To that end, I have realised that I can take the tile sizes out of constant memory and turn them into macros. How do I define macros that evaluate to constants during preprocessing, so that I only need to adjust three values and the derived values cost no extra instructions in each kernel?
Here's an example:
#define TX 8
#define TY 6
#define TZ 4
#define TX2 (TX * 2)
#define TY2 (TY * 2)
#define OVER_TX (1.0f / float(TX))
Maybe this is already the case (or possibly handled by the nvcc compiler), but I want the derived macros (TX2, TY2 and OVER_TX) to be evaluated to constants by the preprocessor, rather than pasted into the code as expressions that are then computed in every kernel. Any suggestions?
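One way to test the "maybe this is already the case" assumption is sketched below as plain host-side C++ (assuming C++11 or later). The preprocessor only substitutes text, so the arithmetic is left to the compiler. `static_assert` accepts only constant expressions, so if these lines compile, the expanded macros are being treated as compile-time constants. The names `kTX2`, `kTY2`, `kOverTX`, `tile_index_x` and `scale_by_over_tx` are hypothetical, introduced only for illustration.

```cpp
#include <cassert>

#define TX 8
#define TY 6
#define TZ 4
#define TX2 (TX * 2)
#define TY2 (TY * 2)
#define OVER_TX (1.0f / float(TX))

// static_assert rejects anything that is not a constant expression, so these
// compiling shows the expanded macros are usable as compile-time constants.
static_assert(TX2 == 16, "TX2 should be a compile-time 16");
static_assert(TY2 == 12, "TY2 should be a compile-time 12");

// Hypothetical typed alternatives to the derived macros.
constexpr int kTX2 = TX * 2;
constexpr int kTY2 = TY * 2;
constexpr float kOverTX = 1.0f / float(TX);
static_assert(kOverTX == 0.125f, "1/8 is exactly representable as a float");

// Hypothetical helper: maps a non-negative x coordinate to its tile index.
inline int tile_index_x(int x) {
    assert(x >= 0);
    return x / TX;
}

// Hypothetical helper: multiplies by the constant reciprocal instead of dividing.
inline float scale_by_over_tx(float x) {
    return x * OVER_TX;
}
```

The same macros expand identically inside device code. To confirm what a kernel actually executes, one option is to inspect the generated PTX (for example with `nvcc -ptx`) and check that the derived values appear as immediates rather than as computed results.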