I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.

Let's say we have a square matrix M and we want to compute the product M.mult(M). Call the product P.

When M is a Mat1 instance, the algorithm used is the straightforward one:

for k in range(0, M.A.length):
    P[i][j] += M.A[i][k] * M.A[k][j]

In the case where M is a Mat2 I used:

P[i][j] += M.A[i][k] * M.T[j][k]

which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
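
For concreteness, the two inner loops look roughly like this (a sketch only, not the exact Mat1/Mat2 code; A, T and P stand for the raw double[n][n] arrays, with T the transpose of A):

    // Mat1 variant: the A[k][j] factor jumps to a new row of A on every k.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                P[i][j] += A[i][k] * A[k][j];

    // Mat2 variant: both factors walk along a row, so accesses are sequential in memory.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                P[i][j] += A[i][k] * T[j][k];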

My only guess is that the second one makes better use of the CPU caches. Data is pulled into the caches in chunks larger than one word, and the second algorithm benefits from this by traversing only rows. The first ignores the data that was just pulled into the caches: it jumps straight down to the next row, which is ~1000 words away in memory (arrays are stored in row-major order), so none of that row's data is cached.

I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.

So, which is it? Or is there some other reason for the performance difference?

A: 

The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.

Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24MB.)

Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.
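
For example, something along these lines on the command line (the 256m figure and the YourBenchmark class name are placeholders, not values from your setup; -Xms sets the initial heap, -Xmx the maximum):

    java -Xms256m -Xmx256m YourBenchmark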

Stephen C
Perhaps there was a misunderstanding, but the case that stores more data is the faster one. And there isn't really much GC going on until after the multiplications are finished, so that couldn't have interfered with the timing.
CromTheDestroyer
You are right. I misunderstood your results.
Stephen C
+5  A: 

This is because of the locality of your data.

In RAM, a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by combining the two indices that you use.

This means that if you access the element at position x,y, the offset x*row_length + y is calculated, and that offset is used to reference the element at the specified position.

What happens is that a big matrix isn't stored in just one page of memory (that is how your OS manages RAM, by splitting it into chunks), so the correct page has to be loaded into the CPU cache whenever you try to access an element that is not already present.

As long as you traverse the data contiguously while doing your multiplication you don't create any problems, since you mostly use all the coefficients of a page and then switch to the next one. But if you invert the indices, every single element may be contained in a different memory page, so a different page has to be requested from RAM for almost every single multiplication you do. That is why the difference is so stark.

(I have rather simplified the whole explanation; it's just to give you the basic idea behind this problem.)
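
To make the access patterns concrete, here is a sketch using a single flat backing array (just an illustration of the locality argument, not necessarily how the JVM lays out double[][]; see the comments below):

    int n = 1000;
    double[] flat = new double[n * n];   // one flat backing array, element (x, y) at index x * n + y
    double sum = 0.0;

    // Row-wise traversal: consecutive iterations touch adjacent memory locations.
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            sum += flat[x * n + y];

    // Column-wise traversal: consecutive iterations jump n doubles (a whole row) apart,
    // so nearly every access can land on a different cache line, and often a different page.
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            sum += flat[x * n + y];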

In any case, I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.

Jack
*"In RAM a matrix, although bidimensional from your point of view, it's of course stored as a contiguous array of bytes."*. That's NOT TRUE for Java. In Java, a 2-D array is represented as an array of arrays. The locality of the arrays at each level depends on 1) how they were allocated and 2) whether the garbage collector has kept them together.
Stephen C
Stephen C: that's true, but my arrays were allocated as `int n; new double[n][n];` so obviously the JVM will try to allocate it in one contiguous chunk
CromTheDestroyer
The JIT will intervene and everything will be optimized, especially if it's a primitive data type. Don't think that the JVM won't care about the fact that you are working with a matrix of numbers; otherwise Java could never obtain performance so near to C/C++. Without using a native type, performance would just be bad in both cases :)
Jack
@Jack - have you any references for this? Or is it just hypothesis?
Stephen C
@CromTheDestroyer - "obviously" != "fact" :-)
Stephen C
My reference is that I had to write a data analysis framework that used matrices intensively, so I really could appreciate the difference between using primitive types and compound types (especially when working with big matrices of ~100k rows). In any case, it should be easy to see whether the JVM specification defines specific instructions for matrices. Let me check...
Jack
In any case I wouldn't worry about whether it's a contiguous chunk or not: even assuming it is stored as an array of arrays, if you load one of the arrays (think of a row) and use it for just one value, the problem remains exactly the same. (By the way, "gut feeling" was a nice expression, even if I needed a dictionary :)
Jack
I haven't been able to find anything online that is recent enough to be considered trustworthy, so until we have proven facts we can assume that it is implemented as an array of arrays, as Stephen C stated. This is certainly how it was implemented in older Java versions (at least 1.4.2), but unfortunately, when it comes to this kind of performance, the JIT really behaves like a black box :(
Jack
@Jack - that's not what I meant by reference. I meant to a link to a webpage (preferably from Sun/Oracle) or a reference to a published paper that describes how HotSpot implements multi-dimensional arrays.
Stephen C
@Jack - Up to a point, the JVM **has to** implement double[][] as an array of array objects. The JLS requires that `((double[][])obj)[1]` evaluates to a reference that is indistinguishable from a normal `double[]` reference. In theory, the JIT could do a global analysis to determine that no "interior" references to an array are needed, and a contiguous memory block could be used. However, subsequent dynamic loading could invalidate this analysis, leaving the JVM in the impossible situation of having to find and convert the representations of existing multi-dimensional array *objects*.
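For example, a minimal sketch of what that requirement means in practice:
    double[][] m = new double[3][3];
    double[] row = m[1];            // an ordinary double[] reference to the middle row
    row[2] = 42.0;                  // writing through the row reference...
    System.out.println(m[1][2]);    // ...prints 42.0, visible through the 2-D array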
Stephen C
Related on SO http://stackoverflow.com/questions/2512082/java-multi-dimensional-array-vs-one-dimensional and http://stackoverflow.com/questions/2368761/performance-comparison-of-array-of-arrays-vs-multidimensional-arrays
andersoj
A: 

It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.

It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.

If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.

If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.

If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.
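
On the JVM, one low-tech way to take those random samples is to grab a few thread dumps while the slow multiply is running (jps and jstack ship with the JDK; <pid> is whatever jps reports for your benchmark process):

    jps                # list the running JVM process ids
    jstack <pid>       # print every thread's stack; repeat a handful of times and compare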

Mike Dunlavey
A: 

You might try comparing performance between JDK6 and OpenJDK7, given this set of results...

andersoj