views: 1712
answers: 4
I am trying to separate a CUDA program into two separate .cu files in an effort to edge closer to writing a real app in C++. I have a simple little program that:

- Allocates memory on the host and the device.
- Initializes the host array to a series of numbers.
- Copies the host array to a device array.
- Finds the square of all the elements in the array using a device kernel.
- Copies the device array back to the host array.
- Prints the results.

This works great if I put it all in one .cu file and run it. When I split it into two separate files I start getting linking errors. Like all my recent questions, I know this is something small, but what is it?

KernelSupport.cu

#ifndef _KERNEL_SUPPORT_
#define _KERNEL_SUPPORT_

#include <iostream>
#include <MyKernel.cu>

int main( int argc, char** argv) 
{
    int* hostArray;
    int* deviceArray;
    const int arrayLength = 16;
    const unsigned int memSize = sizeof(int) * arrayLength;

    hostArray = (int*)malloc(memSize);
    cudaMalloc((void**) &deviceArray, memSize);

    std::cout << "Before device\n";
    for(int i=0;i<arrayLength;i++)
    {
        hostArray[i] = i+1;
        std::cout << hostArray[i] << "\n";
    }
    std::cout << "\n";

    cudaMemcpy(deviceArray, hostArray, memSize, cudaMemcpyHostToDevice);
    TestDevice <<< 4, 4 >>> (deviceArray);
    cudaMemcpy(hostArray, deviceArray, memSize, cudaMemcpyDeviceToHost);

    std::cout << "After device\n";
    for(int i=0;i<arrayLength;i++)
    {
        std::cout << hostArray[i] << "\n";
    }

    cudaFree(deviceArray);
    free(hostArray);

    std::cout << "Done\n";
}

#endif

MyKernel.cu

#ifndef _MY_KERNEL_
#define _MY_KERNEL_

__global__ void TestDevice(int *deviceArray)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    deviceArray[idx] = deviceArray[idx]*deviceArray[idx];
}


#endif

Build Log:

1>------ Build started: Project: CUDASandbox, Configuration: Debug x64 ------
1>Compiling with CUDA Build Rule...
1>"C:\CUDA\bin64\nvcc.exe"    -arch sm_10 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MT  "  -maxrregcount=32  --compile -o "x64\Debug\KernelSupport.cu.obj" "d:\Stuff\Programming\Visual Studio 2008\Projects\CUDASandbox\CUDASandbox\KernelSupport.cu" 
1>KernelSupport.cu
1>tmpxft_000016f4_00000000-3_KernelSupport.cudafe1.gpu
1>tmpxft_000016f4_00000000-8_KernelSupport.cudafe2.gpu
1>tmpxft_000016f4_00000000-3_KernelSupport.cudafe1.cpp
1>tmpxft_000016f4_00000000-12_KernelSupport.ii
1>Linking...
1>KernelSupport.cu.obj : error LNK2005: __device_stub__Z10TestDevicePi already defined in MyKernel.cu.obj
1>KernelSupport.cu.obj : error LNK2005: "void __cdecl TestDevice__entry(int *)" (?TestDevice__entry@@YAXPEAH@Z) already defined in MyKernel.cu.obj
1>D:\Stuff\Programming\Visual Studio 2008\Projects\CUDASandbox\x64\Debug\CUDASandbox.exe : fatal error LNK1169: one or more multiply defined symbols found
1>Build log was saved at "file://d:\Stuff\Programming\Visual Studio 2008\Projects\CUDASandbox\CUDASandbox\x64\Debug\BuildLog.htm"
1>CUDASandbox - 3 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

I am running Visual Studio 2008 on Windows 7 64bit.


Edit:

I think I need to elaborate on this a little bit. The end result I am looking for here is to have a normal C++ application with something like Main.cpp with the int main() event and have things run from there. At certain points in my .cpp code I want to be able to reference CUDA bits. So my thinking (and correct me if there is a more standard convention here) is that I will put the CUDA kernel code into their own .cu files, and then have a supporting .cu file that will take care of talking to the device, calling kernel functions, and so on.

+2  A: 

You are including MyKernel.cu in KernelSupport.cu, so when you link, the linker sees the symbols from MyKernel.cu twice. You'll have to create a header declaring TestDevice and include that instead.

re comment:

Something like this should work

// MyKernel.h
#ifndef mykernel_h
#define mykernel_h
__global__ void TestDevice(int* devicearray);
#endif

and then change the including file to

//KernelSupport.cu
#ifndef _KERNEL_SUPPORT_
#define _KERNEL_SUPPORT_

#include <iostream>
#include <MyKernel.h>
// ...

re your edit

As long as the header you use in C++ code doesn't have any CUDA-specific stuff (`__global__`, `__device__`, etc.) you should be fine linking C++ and CUDA code.

Scott Wales
Please elaborate with a simple code example
Mr Bell
Your MyKernel.h should have `void TestDeviceWrapper(dim3 grid, dim3 block, int *devicearray)` since when KernelSupport.cu becomes KernelSupport.cpp, cl.exe will not understand the `__global__` and `<<< >>>` syntax. Then in MyKernel.cu, `TestDeviceWrapper()` just calls `TestDevice<<<grid, block>>>`.
Tom
That sounds reasonable, the code given assumes it will be included in a cuda file, as is given in the question.
Scott Wales
Yes, but he also says "The end result I am looking for here is to have a normal C++ application with something like Main.cpp with the int main() event and have things run from there." That was added in an edit to the question though.
Tom
A: 

The simple solution is to turn off building of your MyKernel.cu file.

Properties -> General -> Excluded from build

The better solution imo is to split your kernel into a .cu and a .cuh file, and include that, for example:

//kernel.cu
#include "kernel.cuh"
#include <cuda_runtime.h>

__global__ void increment_by_one_kernel(int* vals) {
  vals[threadIdx.x] += 1;
}

void increment_by_one(int* a) {
  int* a_d;

  // allocate and copy a full int (sizeof(int) bytes, not 1 byte)
  cudaMalloc((void**)&a_d, sizeof(int));
  cudaMemcpy(a_d, a, sizeof(int), cudaMemcpyHostToDevice);
  increment_by_one_kernel<<<1, 1>>>(a_d);
  cudaMemcpy(a, a_d, sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(a_d);
}

 

//kernel.cuh
#pragma once

void increment_by_one(int* a);

 

//main.cpp
#include "kernel.cuh"

int main() {
  int a[] = {1};

  increment_by_one(a);

  return 0;
}
thebaldwin
Please elaborate with a simple code example
Mr Bell
This will only work while you have your main in a .cu file. As soon as you put it into a .cpp file this is unsuitable.
Tom
Once you split out all your CUDA/kernel code into appropriate cu/cuh files, there should be no problem renaming or moving your main to a cpp file. Please see my example, I'm unclear why it is unsuitable.
thebaldwin
+1  A: 

If you look at the CUDA SDK code examples, they have `extern "C"` declarations that reference functions compiled from .cu files. This way, the .cu files are compiled by nvcc and only linked into the main program, while the .cpp files are compiled normally.

For example, marchingCubes_kernel.cu has the function body:

extern "C" void
launch_classifyVoxel( dim3 grid, dim3 threads, uint* voxelVerts, uint *voxelOccupied, uchar *volume,
                      uint3 gridSize, uint3 gridSizeShift, uint3 gridSizeMask, uint numVoxels,
                      float3 voxelSize, float isoValue)
{
    // calculate number of vertices need per voxel
    classifyVoxel<<<grid, threads>>>(voxelVerts, voxelOccupied, volume, 
                                     gridSize, gridSizeShift, gridSizeMask, 
                                     numVoxels, voxelSize, isoValue);
    cutilCheckMsg("classifyVoxel failed");
}

while marchingCubes.cpp (where main() resides) just has a declaration:

extern "C" void
launch_classifyVoxel( dim3 grid, dim3 threads, uint* voxelVerts, uint *voxelOccupied, uchar *volume,
                      uint3 gridSize, uint3 gridSizeShift, uint3 gridSizeMask, uint numVoxels,
                      float3 voxelSize, float isoValue);

You can put these in a .h file too.

tkerwin
You should not need to use `extern "C"` in recent versions of the CUDA toolkit. In the past this was required since nvcc treated host code as C, however the default is now C++. Drop the `extern "C"`, it obfuscates the code!
Tom
Good to know. They should update the SDK examples to reflect that. However, you still need to perform the CUDA call wrapping, I don't think there's any easy way around that.
tkerwin
Yeah, the SDK samples haven't been updated since they were created, so while the newer ones reflect the latest standards, the older ones are a little out-of-date. They do still illustrate the coding techniques though, if not the style. You are correct, there is no way to avoid the CUDA call wrapping. That makes total sense though: the triple-chevron syntax (`<<< >>>`) is part of CUDA C and not C, and hence you will need a CUDA C compiler (i.e. nvcc) to compile it. It's a small price to pay for the elegance of the Runtime API, I think.
Tom
+1  A: 

Getting the separation is actually quite simple; please check out this answer for how to set it up. Then you simply put your host code in .cpp files and your device code in .cu files, and the build rules tell Visual Studio how to link them together into the final executable.

The immediate problem in your code is that you are defining the `__global__` TestDevice function twice: once when you #include MyKernel.cu and once when you compile MyKernel.cu independently.

You will need to put a wrapper into a .cu file too. At the moment you are calling TestDevice<<<>>> from your main function, but when you move this into a .cpp file it will be compiled by cl.exe, which doesn't understand the <<< >>> syntax. Therefore you would simply call TestDeviceWrapper(griddim, blockdim, params) in the .cpp file and provide this function in your .cu file.
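A minimal sketch of that wrapper, using the TestDeviceWrapper name suggested in the comments above (illustrative only, not the poster's actual code; int grid/block sizes stand in for dim3 for brevity):

```cuda
// MyKernel.h - safe to include from .cpp files: no CUDA-specific syntax
#ifndef MY_KERNEL_H
#define MY_KERNEL_H
void TestDeviceWrapper(int gridSize, int blockSize, int* deviceArray);
#endif

// MyKernel.cu - compiled by nvcc, which understands __global__ and <<< >>>
#include "MyKernel.h"

__global__ void TestDevice(int* deviceArray)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    deviceArray[idx] = deviceArray[idx] * deviceArray[idx];
}

void TestDeviceWrapper(int gridSize, int blockSize, int* deviceArray)
{
    TestDevice<<<gridSize, blockSize>>>(deviceArray);
}
```

KernelSupport can then become a plain .cpp file that includes MyKernel.h and replaces the direct launch with TestDeviceWrapper(4, 4, deviceArray).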

If you want an example, the SobolQRNG sample in the SDK achieves nice separation, although it still uses cutil and I would always recommend avoiding cutil.

Tom