views:

477

answers:

3

hello

My question is regarding the performance of Java versus compiled code, for example C++/Fortran/assembly, in high-performance numerical applications. I know this is a contentious topic, but I am looking for specific answers/examples. Also community wiki. I have asked similar questions before, but I think I put them too broadly and did not get the answers I was looking for.

Double-precision matrix-matrix multiplication, commonly known as dgemm in the BLAS library, is able to achieve nearly 100 percent of peak CPU performance (in terms of floating-point operations per second).
Several factors allow it to achieve that performance:

  • cache blocking, to achieve maximum memory locality

  • loop unrolling to minimize control overhead

  • vector instructions, such as SSE

  • memory prefetching

  • guaranteed absence of memory aliasing
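To make the first two points concrete, here is a minimal Java sketch of a cache-blocked matrix multiply with the innermost multiplier hoisted out of the loop. The class name, the row-major 1-D storage, and the block size of 64 are illustrative assumptions, not a tuned implementation; a real dgemm kernel would add packing, vectorization, and prefetching on top of this.

```java
// Sketch of a cache-blocked C += A * B for square N x N matrices.
// Assumptions: row-major storage in flat double[] arrays, block size 64.
public class BlockedGemm {
    static final int BS = 64; // block size: a tuning assumption, not a measured optimum

    static void dgemm(int n, double[] a, double[] b, double[] c) {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS) {
                    int iMax = Math.min(ii + BS, n);
                    int kMax = Math.min(kk + BS, n);
                    int jMax = Math.min(jj + BS, n);
                    for (int i = ii; i < iMax; i++)
                        for (int k = kk; k < kMax; k++) {
                            // hoist a[i][k]; valid because a, b, c do not alias here
                            double aik = a[i * n + k];
                            for (int j = jj; j < jMax; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
                }
    }

    public static void main(String[] args) {
        int n = 128;
        double[] a = new double[n * n], b = new double[n * n], c = new double[n * n];
        java.util.Arrays.fill(a, 1.0);
        java.util.Arrays.fill(b, 1.0);
        dgemm(n, a, b, c);
        // each entry of C is the dot product of two all-ones vectors of length n
        System.out.println(c[0]); // prints 128.0
    }
}
```

The jj/kk/ii ordering keeps a BS x BS tile of B resident in cache while a row block of A streams past it; whether a JIT then unrolls and vectorizes the inner loop is exactly the question being asked.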

I have seen lots of benchmarks using assembly, C++, Fortran, Atlas, and vendor BLAS (typical cases are matrices of dimension 512 and above). On the other hand, I have heard that in principle byte-compiled languages/implementations such as Java can be as fast or nearly as fast as machine-compiled languages. However, I have not seen definite benchmarks showing that this is so. On the contrary, it seems (from my own research) that byte-compiled languages are much slower.

Do you have good matrix-matrix multiplication benchmarks for Java/C#? Is a just-in-time compiler (an actual implementation, not a hypothetical one) able to produce instructions which satisfy the points I have listed?

Thanks

With regards to performance: every CPU has a peak performance, which depends on the number of instructions the processor can execute per second. For example, a modern 2 GHz Intel CPU can achieve 8 billion double-precision adds/multiplies a second, resulting in 8 GFLOPS peak performance. Matrix-matrix multiplication is one of the algorithms able to achieve nearly full performance in terms of operations per second, the main reason being its high ratio of compute to memory operations (N^3 over N^2). The sizes I am interested in are on the order of N > 500.
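The peak figure and the compute-to-memory ratio above can be sanity-checked with simple arithmetic. The per-cycle factors below (a SIMD width of 2 doubles and one add plus one multiply issued per cycle) are assumptions chosen to match the 2 GHz example, not specifications of a particular chip.

```java
// Back-of-the-envelope check of the numbers in the question.
public class PeakFlops {
    // peak GFLOPS = clock (GHz) * SIMD lanes * flops issued per lane per cycle
    static double peakGflops(double ghz, int simdWidth, int flopsPerCycle) {
        return ghz * simdWidth * flopsPerCycle;
    }

    // dgemm performs 2*N^3 flops over roughly 3*N^2 matrix elements,
    // giving O(N) flops per element touched
    static double flopsPerElement(int n) {
        return (2.0 * n * n * n) / (3.0 * n * n);
    }

    public static void main(String[] args) {
        // 2 GHz, 2-wide SSE2 doubles, add + multiply each cycle (assumptions)
        System.out.println(peakGflops(2.0, 2, 2) + " GFLOPS"); // prints 8.0 GFLOPS
        System.out.println(flopsPerElement(500)); // ~333 flops per element at N = 500
    }
}
```

The O(N) flops-per-element ratio is what lets a blocked dgemm hide memory latency at N > 500, which is why it can approach peak while bandwidth-bound kernels cannot.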

With regards to implementation: higher-level details such as blocking are handled at the source-code level. Lower-level optimization is handled by the compiler, perhaps with compiler hints regarding alignment/aliasing. A byte-compiled implementation can be written using the blocked approach as well, so in principle the source-code details of a decent implementation will be very similar.

A: 

Addressing the SSE issue: Java has been using SSE instructions since J2SE 1.4.2.

Otto Allmendinger
As far as I know, it doesn't use SSE instructions to vectorize code, though, and neither does the .NET CLR. Mono does have some structs (Vectors and Matrices) that are treated specially by the JIT compiler and get turned into vectorized code.
JulianR
@JR that was my impression as well
aaa
+1  A: 

All the factors you specify are probably achieved by manual memory/code optimization for your specific task. But a JIT compiler doesn't have enough information about your domain to make the code as optimal as you would by hand, and can apply only general optimization rules. As a result it will be slower than C/C++ matrix-manipulation code (but it can utilize 100% of the CPU, if you want it to :)

splix
True. But vectorization and aliasing issues are often handled by compilers. Moreover, loop unrolling is something I would expect a compiler to do. Cache access is pretty straightforward in compiled languages, but how does a byte-compiled language handle it?
aaa
@aaa: the JIT engine/compiler takes care of that.
LiraNuna
+2  A: 
JulianR
Thank you. What accounts for the difference between the two SIMD implementations? Looking at the data, it appears to be memory-related?
aaa
Speaking of SSE in C++, I suggest you also compare GCC 4.4, just for completeness, as MSVC's SSE code generation is really horrible (see http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/ for details).
LiraNuna