So I posted on Nvidia's forums (with no luck) and either 1. they just want to talk about how awesome graphics cards are, or 2. my question is dumb so they don't look at it. (Either way, they don't look at my question.)

Here's my question:

How can I:

  1. Bind cudaMallocPitch float memory to a 2D texture reference
  2. Copy some host data to the 2D array on the device
  3. Add one to the texture reference and write to either a.) the Pitch 2D array OR b.) write to a linear memory array
  4. Read the answer back and display it.

Below is code that should accomplish this. Note that for NxN array sizes, my code works. For NxM where N != M, my code bites the dust (incorrect results). If you can solve this problem I will award you 1 internets (supply limited). Maybe I'm crazy, but according to the documentation this should work (and it does work for square arrays!). The attached code should compile with 'nvcc whateveryoucallit.cu -o runit'.

Help is appreciated!

#include<stdio.h>
#include<cuda.h>
#include<iostream>
#define height 16
#define width 11
#define BLOCKSIZE 16

using namespace std;

// Device Kernels

//Texture reference Declaration
texture<float,2> texRefEx;


__global__ void kernel_w_textures(float* devMPPtr, float * devMPtr, int pitch)
{
 // Thread indexes
        unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
        unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;

 // Texture coordinates
 float u=(idx)/float(width);
 float v=(idy)/float(height);
 devMPtr[idy*width+idx]=devMPPtr[idy*pitch/sizeof(float)+idx];
 // Write Texture Contents to malloc array +1
 devMPtr[idy*width+idx]= tex2D(texRefEx,u,v);//+1.0f;
}
int main()
{
 // memory size
 size_t memsize=height*width;
 size_t offset;
 float * data,  // input from host
  *h_out,  // host space for output
  *devMPPtr, // malloc Pitch ptr
  *devMPtr; // malloc ptr

 size_t pitch;

 // Allocate space on the host
 data=(float *)malloc(sizeof(float)*memsize);
 h_out=(float *)malloc(sizeof(float)*memsize);


// Define data
for (int i = 0; i <  height; i++)
 for (int j=0; j < width; j++)
  data[i*width+j]=float(j);

// Define the grid
dim3 grid((int)(width/BLOCKSIZE)+1,(int)(height/BLOCKSIZE)+1), threads(BLOCKSIZE,BLOCKSIZE);

// allocate Malloc Pitch
cudaMallocPitch((void**)&devMPPtr,&pitch, width * sizeof(float), height);

// Print the pitch
printf("The pitch is %zu \n",pitch/sizeof(float));

// Texture Channel Description
//cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32,0,0,0,cudaChannelFormatKindFloat);

// Bind texture to pitch mem:
cudaBindTexture2D(&offset,&texRefEx,devMPPtr,&channelDesc,width,height,pitch);
cout << "My Description x is " << channelDesc.x << endl;
cout << "My Description y is " << channelDesc.y << endl;
cout << "My Description z is " << channelDesc.z << endl;
cout << "My Description w is " << channelDesc.w << endl;
cout << "My Description kind is " << channelDesc.f << endl;
cout << "Offset is " << offset << endl;

// Set mutable properties:
texRefEx.normalized=true;
texRefEx.addressMode[0]=cudaAddressModeWrap;
texRefEx.addressMode[1]=cudaAddressModeWrap;
texRefEx.filterMode= cudaFilterModePoint;

// Allocate cudaMalloc memory
cudaMalloc((void**)&devMPtr,memsize*sizeof(float));

// Read data from host to device
cudaMemcpy2D((void*)devMPPtr,pitch,(void*)data,sizeof(float)*width,
  sizeof(float)*width,height,cudaMemcpyHostToDevice);

//Read back and check this memory
cudaMemcpy2D((void*)h_out,width*sizeof(float),(void*)devMPPtr,pitch,
  sizeof(float)*width,height,cudaMemcpyDeviceToHost);

// Print the memory
 for (int i=0; i<height; i++){
  for (int j=0; j<width; j++){
   printf("%2.2f ",h_out[i*width+j]);
  }
 cout << endl;
 }

 cout << "Done" << endl;
// Memory is fine... 

kernel_w_textures<<<grid,threads>>>(devMPPtr, devMPtr, pitch);

// Copy back data to host
cudaMemcpy((void*)h_out,(void*)devMPtr,width*height*sizeof(float),cudaMemcpyDeviceToHost);


// Print the Result
 cout << endl;
 for (int i=0; i<height; i++){
  for (int j=0; j<width; j++){
   printf("%2.2f ",h_out[i*width+j]);
  }
 cout << endl;
 }
 cout << "Done" << endl;

return(0);
}

Edit October 17: So I still haven't found a solution to this issue. Nvidia is pretty silent on this, and it seems the world is too. I found a workaround using shared mem, but if anyone has a texture solution I would be very pleased.

Edit October 26: Still no solution, but I'm still interested in one if anyone knows.

A: 

Graphics cards usually expect textures to have dimensions that are powers of 2; this is especially true for nVidia cards. CUDA's cudaMallocPitch and cudaMemcpy2D work with these pitches, and looking at your code, the safest solution is to round the width and height up yourself to be on the safe side. Otherwise, CUDA might write to invalid memory because it would be expecting the wrong offsets:

#define height 16
#define width 11

...

size_t roundUpToPowerOf2(size_t v)
{
  // See http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
  --v;
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  ++v;
  return v;
}
...

size_t horizontal_pitch = roundUpToPowerOf2(width);
size_t vertical_pitch = roundUpToPowerOf2(height);
size_t memsize = horizontal_pitch * vertical_pitch;

...

// Read data from host to device
cudaMemcpy2D((void*)devMPPtr,pitch,(void*)data,sizeof(float)*horizontal_pitch,
  sizeof(float)*width,height,cudaMemcpyHostToDevice);

//Read back and check this memory
cudaMemcpy2D((void*)h_out,horizontal_pitch*sizeof(float),(void*)devMPPtr,pitch,
  sizeof(float)*width,height,cudaMemcpyDeviceToHost);

// Print the memory
 for (int i=0; i<height; i++){
  for (int j=0; j<width; j++){
   printf("%2.2f ",h_out[i*horizontal_pitch+j]);
  }
 cout << endl;
 }

...

// Copy back data to host
cudaMemcpy((void*)h_out,(void*)devMPtr,horizontal_pitch*vertical_pitch*sizeof(float),cudaMemcpyDeviceToHost);

// Print the Result
 cout << endl;
 for (int i=0; i<height; i++){
  for (int j=0; j<width; j++){
   printf("%2.2f ",h_out[i*horizontal_pitch+j]);
  }
 cout << endl;
 }
 cout << "Done" << endl;

Hopefully I haven't overlooked any place where horizontal_pitch/vertical_pitch should be used instead of plain width/height.

dark_charlie
I just tried this and I am still getting incorrect results (with this small array it doesn't output much). Can someone please tell me how to get this working? Basically the first output is 0 1 2 ... N, where N = width-1. The second output should be 1 2 3 ... N+1.
Marm0t
A: 

It might have to do with your block size. In this code you are trying to have a block of 16x16 threads write to an 11x16 memory block. That means some of your threads are writing to unallocated memory. That also explains why your tests of (16*M by 32*N) worked: there were no threads writing to unallocated memory, since your dimensions were multiples of 16.

An easy way to fix this problem is something like this:

if ((idx < width) && (idy < height)) {
   // write output 
  devMPtr[idy*width+idx]= tex2D(texRefEx,u,v); 
}

You'll need to either pass the height and width to the kernel function or copy a constant to the card before you call the kernel.

tkerwin
From the programming guide, cudaMallocPitch does padding (I'm guessing with zeros; they don't explicitly state that): "width rounded up to the closest multiple of this [pitch] size and its rows padded accordingly." So when the texture reference accesses memory not in the defined region, it should be accessing zeros (the access is defined). You can test this by writing 2D memory to 2D memory (without textures) and it works fine. If you read back a region that represents the padded 2D array defined by cudaMallocPitch, you see zeros in the appropriate places. Thanks for your response, much appreciated.
Marm0t
A: 
 // Texture coordinates
 float u=(idx + 0.5)/float(width);
 float v=(idy + 0.5)/float(height);

You need an offset to get to the center of the texel. I think there might have been some rounding error for your non-multiple of 16 textures. I tried this and it worked for me (both outputs were identical).

tkerwin
I think I've done this before, but it shouldn't matter: I used 'texRefEx.filterMode= cudaFilterModePoint' so it filters to a single value. I will try again as a sanity check :)
Marm0t
Point sampling wouldn't fix this problem, since it's actually falling just outside the edge of the texel. It only seems to work on wrap mode and not clamp though.
tkerwin
Well, that's good. I was specifically interested in wrap mode (this whole problem I was encountering was just a curiosity/roadblock). I'll let you know how it goes. If this works I will be 95% happy (if it works it means I need to re-implement things with textures after having a shared memory solution...)
Marm0t