This is a question I posted a few days ago; I lost access to that account and registered a new one.

I am trying to modify the imageDenoising sample in the CUDA SDK. I need to repeat the filter many times in order to measure how long it takes, but my code doesn't work properly.

//start

__global__ void F1D(TColor *image, int imageW, int imageH, TColor *buffer)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if(iy != 0 && iy < imageH-1  && ix < imageW)
    {
        float4 fresult = get_color(image[imageW * iy + ix]);
        float4 fresult4 = get_color(image[imageW * (iy+1) + ix]);
        float4 fresult5 = get_color(image[imageW * (iy-1) + ix]);

        float4 fresult7;
        fresult7.x = fresult.x*0.5+fresult4.x*.25+fresult5.x*.25;
        fresult7.y = fresult.y*0.5+fresult4.y*.25+fresult5.y*.25;
        fresult7.z = fresult.z*0.5+fresult4.z*.25+fresult5.z*.25;

        buffer[imageW * iy + ix] =
            make_color(fresult7.x,fresult7.y,fresult7.z,0);
    }

    image[imageW * iy + ix] =   buffer[imageW * iy + ix];
    // this should use cudaMemcpy, but that fails
}
//extern

extern "C" void
cuda_F1D(TColor *dst, int imageW, int imageH)
{
    dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
    Copy<<<grid, threads>>>(dst, imageW, imageH);

    size_t size = imageW*imageH*sizeof(TColor);
    TColor *host =(TColor*) malloc(size);
    TColor *dst2;
    //TColor *dst3;
    //TColor *d = new TColor(imageW*imageH*sizeof(TColor));
    dim3 threads2(imageW,1);
    dim3 grid2(iDivUp(imageW, imageW), iDivUp(imageH, 1));

    for(int i = 0;i<1;i++)
    {
        cudaMalloc( (void **)&dst2, size);
        cudaMemcpy(dst2, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);
        F1D<<<grid2, threads2>>>(dst, imageW, imageH,dst2);
        cudaFree(dst2);
    }
}

This code runs, but the image array is not kept in sync between passes, which leads to many synchronisation problems.

Here is the assignment question:

Lab Class Description:

In this assignment, you will have to implement a small CUDA program for image processing. You may want to re-use the image denoising Visual Studio project available with the CUDA SDK.

Question 1: Write a CUDA kernel that applies the 1D filtering kernel [0.25 0.5 0.25] to the image (similar to the previous CW). Measure how much time it takes using a CUDA timer. For that, you may want to repeat the filtering n times so that your measurement is accurate enough.

I have finished the filtering kernel and the time measurement, but looping n times fails. The result blurs out, and as i becomes larger the image goes black.

A: 

I already answered this for you when you posted the same question previously - you need to wait for a kernel to complete before running it again - add:

cudaThreadSynchronize(); // *** wait for kernel to complete ***

after the kernel call.
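For readers who want to see the waiting step in context, here is a minimal sketch of timing repeated launches with CUDA events. The kernel `addOne`, the helper `timeKernel`, and the 256-thread block size are hypothetical stand-ins, not code from the SDK sample. Note that `cudaThreadSynchronize()` has since been deprecated in favour of `cudaDeviceSynchronize()`; in this sketch `cudaEventSynchronize(stop)` plays the same role of making the host wait until the queued kernels have finished before the timer is read.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: adds 1 to every element.
// It exists only so that the timing loop has some work to launch.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Launches the kernel `repeats` times and returns the average
// time per launch in milliseconds, measured with CUDA events.
float timeKernel(float *d_data, int n, int repeats)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;                        // assumed block size
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    for (int i = 0; i < repeats; ++i)
        addOne<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);

    // Block the host until everything queued before `stop` has finished.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / repeats;
}
```

Calling `timeKernel(d_data, n, 100)` would give the average cost of one launch over 100 repetitions; averaging over many launches is what the assignment's "repeat the filtering n times" is asking for.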

Paul R
Yes, thanks for your answer, but cudaThreadSynchronize() does not help. I had tried it before, and somehow the result is still not correct.
kitw
http://www.mypicx.com/uploadimg/30130182_06042010_1.jpg
The results: the left one is after applying the filter once, the middle one is after this program loops 100 times, and the right one is what I expected to get after 100 times.
kitw
@kitw: you still don't have `cudaThreadSynchronize()` in the above code though. Why would you leave this out? The code as it is can never work in a loop without this.
Paul R
A: 

The statement

image[imageW * iy + ix] =   buffer[imageW * iy + ix];

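is causing the problem. You are overwriting your input image in the kernel while other threads are still reading it. A thread that reads the row above or below may therefore see a pixel that has already been filtered, so depending on thread execution order, parts of the image are blurred more than once per pass.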
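One common way to avoid the read/write hazard described above is to keep two device buffers and swap their roles after each pass (often called ping-pong buffering). The sketch below assumes a single-channel `float` image rather than the SDK's `TColor`, and the names `filterVertical` and `filterRepeated` are made up for illustration; it is not the asker's code.

```cuda
#include <cuda_runtime.h>
#include <utility>

// Vertical [0.25 0.5 0.25] filter on a single-channel float image (an assumption;
// the original code uses TColor). Reads only from src and writes only to dst,
// so no thread can see a half-updated input.
__global__ void filterVertical(const float *src, float *dst, int w, int h)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= w || iy >= h) return;           // guard every access, including the write

    int idx = iy * w + ix;
    if (iy == 0 || iy == h - 1)
        dst[idx] = src[idx];                  // border rows pass through unchanged
    else
        dst[idx] = 0.25f * src[idx - w] + 0.5f * src[idx] + 0.25f * src[idx + w];
}

// Applies the filter n times, swapping the two device buffers after each pass.
// Returns whichever pointer holds the final result.
float *filterRepeated(float *d_a, float *d_b, int w, int h, int n)
{
    dim3 threads(16, 16);                     // assumed block shape
    dim3 grid((w + threads.x - 1) / threads.x, (h + threads.y - 1) / threads.y);
    for (int i = 0; i < n; ++i) {
        filterVertical<<<grid, threads>>>(d_a, d_b, w, h);
        std::swap(d_a, d_b);                  // this pass's output is the next pass's input
    }
    cudaDeviceSynchronize();                  // wait for the last pass before returning
    return d_a;
}
```

With this layout the result after n passes should not depend on thread scheduling, because each pass reads a buffer that no thread is writing.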

Also, I don't see the purpose of

cudaMemcpy(dst2, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);

dst appears to be device memory, since you access it in the CUDA kernel, so cudaMemcpyHostToDevice is the wrong copy kind for it.
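If a scratch copy of the image is still wanted, a device-to-device copy might look like the sketch below. It assumes `d_src` already points to device memory, uses `float` elements instead of `TColor`, and the helper name `cloneOnDevice` is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Makes a scratch copy of a device buffer.
// d_src is assumed to be a device pointer, so the copy kind is DeviceToDevice.
// Returns NULL if the allocation or the copy fails.
float *cloneOnDevice(const float *d_src, size_t count)
{
    float *d_copy = NULL;
    size_t bytes = count * sizeof(float);

    if (cudaMalloc((void **)&d_copy, bytes) != cudaSuccess)
        return NULL;

    cudaError_t err = cudaMemcpy(d_copy, d_src, bytes, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "device copy failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_copy);
        return NULL;
    }
    return d_copy;
}
```

Allocating the scratch buffer once before the loop, instead of calling `cudaMalloc` and `cudaFree` on every iteration, also keeps allocation cost out of the measured time.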

sjchoi
I know that is the problem, but I can't solve it.
kitw