Hi,
I'm trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.
What I've done is split the 4800x9600 matrix into 9,600 column vectors of 4800 elements each, and then run a reduction over each set of 4800 elements.
The trouble is, this is really slow...
Anyone got any suggestions?
Basically, I'm trying to implement MATLAB's sum(...) function.
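For reference, this is the result I'm after, written out on the CPU (reduceRowsCPU is just an illustrative name; Matrix is the struct given further down, and the data is stored column-major, so summing each column of 4800 elements gives one entry of the 1x9600 answer):

// CPU version of what I'm trying to do on the GPU: MATLAB's sum(A),
// i.e. one column sum per output element.
// Matrix is defined below; storage is column-major, so column i starts at data[i*h].
void reduceRowsCPU(Matrix result, Matrix A) //result is 1 x A.w, in host memory
{
    long col, row;
    for(col = 0; col < A.w; col++)
    {
        float sum = 0.0f;
        for(row = 0; row < A.h; row++)
            sum += A.data[col*A.h + row];
        result.data[col] = sum;
    }
}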
Here is the code, which I've verified works fine; it's just really slow:
void reduceRows(Matrix Dresult, Matrix DA)
{
    //split DA into chunks, one per column of DA
    Matrix Dchunk;
    Dchunk.h = 1;
    Dchunk.w = DA.h;
    cudaMalloc((void**)&Dchunk.data, Dchunk.h*Dchunk.w*sizeof(float));

    Matrix DcolSum;
    DcolSum.h = 1;
    DcolSum.w = 1;
    //no cudaMalloc for DcolSum.data: it points straight into Dresult below
    //cudaMalloc((void**)&DcolSum.data,DcolSum.h*DcolSum.w*sizeof(float));

    int i;
    for(i = 0; i < DA.w; i++) //loop over each column
    {
        //printf("%d ",i);
        //copy column i into the chunk buffer
        cudaMemcpy(Dchunk.data, &DA.data[i*DA.h], DA.h*sizeof(float), cudaMemcpyDeviceToDevice);
        //write the column sum directly into element i of the result
        DcolSum.data = &Dresult.data[i];
        reduceTotal(DcolSum, Dchunk);
    }

    cudaFree(Dchunk.data);
}
Matrix is defined as:
typedef struct {
    long w;
    long h;
    float* data;
} Matrix;
reduceTotal() just calls the standard NVIDIA reduction; it sums all the elements in Dchunk and puts the answer in DcolSum.
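In case it matters, the kernel it launches is more or less the classic shared-memory tree reduction from the SDK; a simplified sketch (not the exact SDK code) looks like this:

// Rough sketch of the kind of kernel reduceTotal launches: each block
// produces one partial sum. Launched with blockDim.x*sizeof(float) bytes
// of shared memory; assumes blockDim.x is a power of two.
__global__ void reduceKernel(float* out, const float* in, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    //tree reduction in shared memory
    for(unsigned int s = blockDim.x/2; s > 0; s >>= 1)
    {
        if(tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if(tid == 0)
        out[blockIdx.x] = sdata[0]; //one partial sum per block
}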
I'm about to do all this on the CPU if I can't find an answer... ;(
Many thanks in advance,