views:

4965

answers:

15

We are computing something whose runtime is bound by matrix operations. (Some details below if interested.) This experience prompted the following question:

Do folk have experience with the performance of Java libraries for matrix math (e.g., multiply, inverse, etc.)? For example:

I searched and found nothing.


Details of our speed comparison:

We are using Intel FORTRAN (ifort (IFORT) 10.1 20070913). We have reimplemented it in Java (1.6) using Apache commons math 1.2 matrix ops, and it agrees to all of its digits of accuracy. (We have reasons for wanting it in Java.) (Java doubles, Fortran real*8). Fortran: 6 minutes, Java 33 minutes, same machine. jvisualm profiling shows much time spent in RealMatrixImpl.{getEntry,isValidCoordinate} (which appear to be gone in unreleased Apache commons math 2.0, but 2.0 is no faster). Fortran is using Atlas BLAS routines (dpotrf, etc.).

Obviously this could depend on our code in each language, but we believe most of the time is in equivalent matrix operations.

In several other computations that do not involve libraries, Java has not been much slower, and sometimes much faster.

+2  A: 

Have you taken a look at the Intel Math Kernel Library? It claims to outperform even ATLAS. MKL can be used in Java through JNI wrappers.

Zach Scrivena
We have that. a) Its licensing is more restrictive than Atlas (so we can't use all our computers); b) it's not Java (and as I said we have reasons to want to be in Java).
dfrankow
i.e., this is not an answer to my question about Java libraries (but I don't have the reputation to downvote it).
dfrankow
@dfrankow: I've updated my answer to address your concern on using it in Java.
Zach Scrivena
Ah, okay. Thanks.
dfrankow
+1, If it's speed you're looking for, this seems to be the way to go
Gab Royer
+2  A: 

I can't really comment on specific libraries, but in principle there's little reason for such operations to be slower in Java. Hotspot generally does the kinds of things you'd expect a compiler to do: it compiles basic math operations on Java variables to corresponding machine instructions (it uses SSE instructions, but only one per operation); accesses to elements of an array are compiled to use "raw" MOV instructions as you'd expect; it makes decisions on how to allocate variables to registers when it can; it re-orders instructions to take advantage of processor architecture... A possible exception is that as I mentioned, Hotspot will only perform one operation per SSE instruction; in principle you could have a fantastically optimised matrix library that performed multiple operations per instruction, although I don't know if, say, your particular FORTRAN library does so or if such a library even exists. If it does, there's currently no way for Java (or at least, Hotspot) to compete with that (though you could of course write your own native library with those optimisations to call from Java).

So what does all this mean? Well:

  • in principle, it is worth hunting around for a better-performing library, though unfortunately I can't recomend one
  • if performance is really critical to you, I would consider just coding your own matrix operations, because you may then be able perform certain optimisations that a library generally can't, or that a particular library your using doesn't (if you have a multiprocessor machine, find out if the library is actually multithreaded)

A hindrance to matrix operations is often data locality issues that arise when you need to traverse both row by row and column by column, e.g. in matrix multiplication, since you have to store the data in an order that optimises one or the other. But if you hand-write the code, you can sometimes combine operations to optimise data locality (e.g. if you're multiplying a matrix by its transformation, you can turn a column traversal into a row traversal if you write a dedicated function instead of combining two library functions). As usual in life, a library will give you non-optimal performance in exchange for faster development; you need to decide just how important performance is to you.

Neil Coffey
+2  A: 

Linalg code that relies heavily on Pentiums and later processors' vector computing capabilities (starting with the MMX extensions, like LAPACK and now Atlas BLAS) is not "fantastically optimized", but simply industry-standard. To replicate that perfomance in Java you are going to need native libraries. I have had the same performance problem as you describe (mainly, to be able to compute Choleski decompositions) and have found nothing really efficient: Jama is pure Java, since it is supposed to be just a template and reference kit for implementers to follow... which never happened. You know Apache math commons... As for COLT, I have still to test it but it seems to rely heavily on Ninja improvements, most of which were reached by building an ad-hoc Java compiler, so I doubt it's going to help. At that point, I think we "just" need a collective effort to build a native Jama implementation...

Varkhan
Good point! An alpha-stage project with JNI wrappers for Atlas: http://www.jblas.org. Author's blog post: http://mikiobraun.blogspot.com/2008/10/matrices-jni-directbuffers-and-number.html
dfrankow
A: 

Building on Varkhan's post that Pentium-specific native code would do better:

dfrankow
A: 

You aren't, by any chance, thrashing in the Java version? The fact that you're spending most of your time in getEntry() is a clue that this is happening.

The other issue that I can see with Commons-Math is that the matrix is managed as a two-dimensional array. While I don't expect the required double-dereference to be the cause of the slowdown, it does mean that the individual rows (assuming [rowIdx][colIdx]) are in fact separate objects that will be dealt with individually by the GC. Are you spending a lot of time in GC?

kdgregory
No thrashing, plenty of memory. My guess on getEntry() is that jvisualvm is not good at estimating the time of a very short function, even though upon setup it tried to benchmark an empty function call. Thus, I think getEntry() is actually showing up higher than it is in non-profiled running.
dfrankow
+2  A: 

We have used COLT for some pretty large serious financial calculations and have been very happy with it. In our heavily profiled code we have almost never had to replace a COLT implementation with one of our own.

In their own testing (obviously not independent) I think they claim within a factor of 2 of the Intel hand-optimised assembler routines. The trick to using it well is making sure that you understand their design philosophy, and avoid extraneous object allocation.

Nick Fortescue
A: 

There are many different freely available java linear algebra libraries. http://www.ujmp.org/java-matrix/benchmark/ Unfortunately that benchmark only gives you info about matrix multiplication (with transposing the test does not allow the different libraries to exploit their respective design features).

What you should look at is how these linear algebra libraries perform when asked to compute various matrix decompositions. http://ojalgo.org/matrix_compare.html

A: 

I have found that if you are creating a lot of high dimensional Matrices, you can make Jama about 20% faster if you change it to use a single dimensional array instead of a two dimensional array. This is because Java doesn't support multi-dimensional arrays as efficiently. ie. it creates an array of arrays.

Colt does this already, but I have found it is more complicated and more powerful than Jama which may explain why simple functions are slower with Colt.

The answer really depends on that you are doing. Jama doesn't support a fraction of the things Colt can do which make make more of a difference.

Peter Lawrey
A: 

Matrix Tookits Java (MTJ) was already mentioned before, but perhaps it's worth mentioning again for anyone else stumbling onto this thread. For those interested, it seems like there's also talk about having MTJ replace the linalg library in the apache commons math 2.0, though I'm not sure how that's progressing lately.

Steve Lianoglou
A: 

You may want to check out the jblas project. It's a relatively new Java library that uses BLAS, LAPACK and ATLAS for high-performance matrix operations.

The developer has posted some benchmarks in which jblas comes off favourably against MTJ and Colt.

Mark Reid
A: 

There's also UJMP

Chad Okere
+1  A: 

Hi there,

I'm the main author of jblas and wanted to point out that I've released Version 1.0 in late December 2009. I worked a lot on the packaging, meaning that you can now just download a "fat jar" with ATLAS and JNI libraries for Windows, Linux, Mac OS X, 32 and 64 bit (except for Windows). This way you will get the native performance just by adding the jar file to your classpath. Check it out at http://jblas.org!

-M

Mikio Braun
A: 

You should add Apache Mahout to your shopping list.

bmargulies
+1  A: 

Just to add my 2 cents. I've compared some of these libraries. I attempted to matrix multiply a 3000 by 3000 matrix of doubles with itself. The results are as follows.

Using multithreaded ATLAS with C/C++, Octave, Python and R, the time taken was around 4 seconds.

Using Jama with Java, the time taken was 50 seconds.

Using Colt and Parallel Colt with Java, the time taken was 150 seconds!

Using JBLAS with Java, the time taken was again around 4 seconds as JBLAS uses multithreaded ATLAS.

So for me it was clear that the Java libraries didn't perform too well. However if someone has to code in Java, then the best option is JBLAS. Jama, Colt and Parallel Colt are not fast.

Hamaad Shah
A: 

There's a benchmark of various matrix packages available in java up on http://code.google.com/p/java-matrix-benchmark/ for a few different hardware configurations. But it's no substitute for doing your own benchmark.

Performance is going to vary with the type of hardware you've got (cpu, cores, memory, L1-3 cache, bus speed), the size of the matrices and the algorithms you intend to use. Different libraries have different takes on concurrency for different algorithms, so there's no single answer. You may also find that the overhead of translating to the form expected by a native library negates the performance advantage for your use case (some of the java libraries have more flexible options regarding matrix storage, which can be used for further performance optimizations).

Generally though, JAMA, Jampack and COLT are getting old, and do not represent the state of the current performance available in Java for linear algebra. More modern libraries make more effective use of multiple cores and cpu caches. JAMA was a reference implementation, and pretty much implements textbook algorithms with little regard to performance. COLT and IBM Ninja were the first java libraries to show that performance was possible in java, even if they lagged 50% behind native libraries.

culana