ansaurus

Question

Can I call a "function-like macro" defined in a header file from a CUDA __global__ function?

Answer 1

+1 A:

The error says what the problem really is. You are calling a function/macro defined in another file (which belongs to the CPU code), from inside the CUDA function. This is impossible!

You cannot call CPU functions/macros/code from a GPU function.

You should put your definitions (does _lrotl() exist in CUDA?) inside the same file that will be compiled by nvcc.

karlphillip 2010-07-28 16:17:33

Macros are fine, since the preprocessor just expands them out as you would expect. The problem in this case, as Edric has answered, is that the macro contains function calls and those *functions* are host only.

Tom 2010-08-03 10:44:43

Answer 2

+4 A:

I think the problem is not the macros themselves - the compilation process used by nvcc for CUDA code runs the C preprocessor in the usual way and so using header files in this way should be fine. I believe the problem is in your calls to _lrotl and _lrotr.

You ought to be able to confirm that this is indeed the problem by temporarily removing those calls.

You should check the CUDA programming guide to see what functionality you need to replace those calls to run on the GPU.

Edric 2010-07-29 07:34:37

Thanks, that was indeed the problem: if I remove these calls, everything works fine. Now I just need to replace these functions with valid CUDA equivalents. I appreciate it!

Bartzilla 2010-07-29 09:25:03

Exactly, the C preprocessor will treat macros exactly the same in host and device code. So the problem is that after processing, the device code is attempting to call a host function.

Tom 2010-08-03 10:59:13
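To make the point in the comments above concrete, here is a minimal sketch of a header that keeps the function-like macro but selects a device-safe expansion when nvcc is compiling device code. ROTL8 and ROTR8 are hypothetical helper names, not something from the original question, and the sketch assumes x is a 32-bit value. __CUDA_ARCH__ is the macro nvcc defines during its device compilation pass; the _MSC_VER branch assumes the host compiler is MSVC, where _lrotl and _lrotr are declared in stdlib.h.

```c
#include <stdint.h>
#include <stdlib.h>   /* declares _lrotl/_lrotr when the host compiler is MSVC */

/* ROTL8/ROTR8 are hypothetical helper names used only in this sketch. */
#if defined(_MSC_VER) && !defined(__CUDA_ARCH__)
  /* Host pass with MSVC: the CRT rotate functions can be called. */
  #define ROTL8(x) ((uint32_t)_lrotl((x), 8))
  #define ROTR8(x) ((uint32_t)_lrotr((x), 8))
#else
  /* Device pass (or a non-MSVC host): operators only, no function calls. */
  #define ROTL8(x) ((((uint32_t)(x)) << 8) | (((uint32_t)(x)) >> 24))
  #define ROTR8(x) ((((uint32_t)(x)) >> 8) | (((uint32_t)(x)) << 24))
#endif

/* Same shape as the macro in the question; expands to plain
   shifts and masks wherever the second branch is selected. */
#define SWAP(x) ((ROTL8(x) & 0x00ff00ffu) | (ROTR8(x) & 0xff00ff00u))
```

With a header like this, a __global__ function should be able to use SWAP(data[i]) directly, because after preprocessing the device code contains no call to a host-only function.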

Answer 3

+2 A:

The hardware doesn't have a built-in rotate instruction, and so there is no intrinsic to expose it (you can't expose something that doesn't exist!).

It's fairly simple to implement with shifts and masks though. For example, if x is an unsigned 32-bit value, then to rotate it left by eight bits you can do:

((x << 8) | (x >> 24))

Where x << 8 will push everything left eight bits (i.e. discarding the leftmost eight bits), x >> 24 will push everything right twenty-four bits (i.e. discarding all but the leftmost eight bits), and bitwise ORing them together gives the result you need.
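The same idea generalises to any rotate count. The following is a sketch rather than a definitive implementation: rotl32, rotr32 and the HD helper macro are names assumed for illustration. HD expands to __host__ __device__ only when nvcc is compiling the file, so the functions can be called from both CPU code and kernels. The count is masked to avoid shifting a 32-bit value by 32, which is undefined in C.

```c
#include <stdint.h>

/* HD is a hypothetical helper macro: under nvcc it marks the function
   as callable from both host and device; elsewhere it expands to nothing. */
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

/* Rotate a 32-bit value left by n bits (n is taken modulo 32). */
static inline HD uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    /* The second mask keeps the right shift in range when n == 0. */
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Rotate a 32-bit value right by n bits (n is taken modulo 32). */
static inline HD uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}
```

Using unsigned types matters here: a right shift of a negative signed value would typically copy the sign bit into the vacated positions and corrupt the result.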

// # define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00)
# define SWAP(x) (((((x) << 8) | ((x) >> 24)) & 0x00ff00ff) | ((((x) >> 8) | ((x) << 24)) & 0xff00ff00))

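You might be tempted to shorten this by masking first and shifting afterwards, but note that the form below is not equivalent on its own: it swaps the two bytes within each 16-bit half (0x11223344 becomes 0x22114433) rather than reversing all four bytes (0x44332211), so its result still needs a 16-bit rotate to match the macro above:

# define SWAP(x) ((((x) & 0xff00ff00) >> 8) | (((x) & 0x00ff00ff) << 8))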

Tom 2010-08-03 10:57:49
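It is worth checking any rewritten SWAP against the original rotate-based macro, which reverses the four bytes of a 32-bit value. As a sketch (swap_pairs and bswap32 are hypothetical names), masking and shifting by eight only swaps the bytes inside each 16-bit half; a further 16-bit rotate completes the full reversal.

```c
#include <stdint.h>

/* Swap the two bytes inside each 16-bit half: b3 b2 b1 b0 -> b2 b3 b0 b1. */
static inline uint32_t swap_pairs(uint32_t x)
{
    return ((x & 0xff00ff00u) >> 8) | ((x & 0x00ff00ffu) << 8);
}

/* Full byte reversal: b3 b2 b1 b0 -> b0 b1 b2 b3.
   Swapping the 16-bit halves of the pair-swapped value finishes the job. */
static inline uint32_t bswap32(uint32_t x)
{
    uint32_t t = swap_pairs(x);
    return (t << 16) | (t >> 16);
}
```

For example, bswap32(0x11223344) gives 0x44332211, while swap_pairs alone gives 0x22114433. In device code it may also be worth looking at the __byte_perm intrinsic in the CUDA documentation, which permutes bytes directly.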
