I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A
member where row i
of the matrix is A[i]
. The other class (Mat2) stores A
and T
where T
is the transpose of A
.
Let's say we have a square matrix M and we want the product of M.mult(M)
. Call the product P
.
When M is a Mat1 instance the algorithm used was the straightforward one:
P[i][j] += M.A[i][k] * M.A[k][j]
for k in range(0, M.A.length)
In the case where M is a Mat2 I used:
P[i][j] += M.A[i][k] * M.T[j][k]
which is the same algorithm because T[j][k]==A[k][j]
. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than 1 word, and the second algorithm benefits from this by traversing only rows, while the first ignores the data pulled into the caches by going immediately to the row below (which is ~1000 words in memory, because arrays are stored in row major order), none of the data for which is cached.
I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.
So, which is it? Or is there some other reason for the performance difference?