Out of curiosity I decided to benchmark my own matrix multiplication function versus the BLAS implementation... I was to say the least surprised at the result:
Custom Implementation, 10 trials of 1000x1000 matrix multiplication:
Took: 15.76542 seconds.
BLAS Implementation, 10 trials of 1000x1000 matrix multiplication:
Took: 1.32432 seconds.
This is using single precision floating point numbers.
My Implementation:
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
if ( ADim2!=BDim1 )
throw std::runtime_error("Error sizes off");
memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
int cc2,cc1,cr1;
for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
C[cc2*ADim2+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
}
I have two questions:
- Given that a matrix-matrix multiplication say: nxm * mxn requires n*n*m multiplications, so in the case above 1000^3 or 1e9 operations. How is it possible on my 2.6Ghz processor for BLAS to do 10*1e9 operations in 1.32 seconds? Even if multiplcations were a single operation and there was nothing else being done, it should take ~4 seconds.
- Why is my implementation so much slower?