views:

547

answers:

2

Does anybody know a faster way to do what java.nio.charset.Charset.decode(..)/encode(..) does?

It's currently one of the bottlenecks of a technology that I'm using.

[EDIT] Specifically, in my application, I changed one segment from a Java solution to a JNI solution (because there was a C++ technology more suitable for my needs than the Java technology I was using).

This change brought about a significant decrease in speed (and a significant increase in CPU and memory usage).

Looking deeper into the JNI solution, the Java application communicates with the C++ application via byte[]. These byte[] are produced by Charset.encode(..) on the Java side and passed to the C++ side. Then, when the C++ side responds with a byte[], it gets decoded on the Java side via Charset.decode(..).
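
For reference, the round trip currently looks roughly like this (class and method names are made up for illustration; they're not the actual ones from my application):

 import java.nio.ByteBuffer;
 import java.nio.CharBuffer;
 import java.nio.charset.Charset;

 public class Bridge {
     private static final Charset CHARSET = Charset.forName("UTF-8");

     // hypothetical JNI entry point into the C++ side
     private static native byte[] nativeCall(byte[] request);

     public static String call(String request) {
         // Java -> C++: encode the chars into a byte[]
         ByteBuffer encoded = CHARSET.encode(request);
         byte[] raw = new byte[encoded.remaining()];
         encoded.get(raw);

         byte[] responseBytes = nativeCall(raw);

         // C++ -> Java: decode the returned byte[] back into chars
         CharBuffer decoded = CHARSET.decode(ByteBuffer.wrap(responseBytes));
         return decoded.toString();
     }
 }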

Running this against a profiler, I see that Charset.decode(..) and Charset.encode(..) both take a significant amount of time compared to the whole execution time of the JNI solution (I profiled only the JNI solution because it's something I could whip up quite quickly; I'll profile the whole application at a later date once I free up my schedule :-) ).

Upon reading further into my problem, it seems this is a known issue with Charset.encode(..) and decode(..), and that it is being addressed in Java 7. However, moving to Java 7 is not an option for me (for now) due to some constraints.

Which is why I'm asking here if somebody knows a Java 5 solution / alternative to this (sorry, I should have mentioned sooner that this is for Java 5) :-)

+2  A: 

The javadocs for encode() and decode() make it clear that these are convenience methods. For example, for encode():

Convenience method that encodes Unicode characters into bytes in this charset.

An invocation of this method upon a charset cs returns the same result as the expression

 cs.newEncoder()
   .onMalformedInput(CodingErrorAction.REPLACE)
   .onUnmappableCharacter(CodingErrorAction.REPLACE)
   .encode(bb); 

except that it is potentially more efficient because it can cache encoders between successive invocations.

The language is a bit vague there, but you might get a performance boost by not using these convenience methods. Create and configure the encoder once, and then re-use it:

 // create and configure the encoder once ...
 CharsetEncoder encoder = cs.newEncoder()
   .onMalformedInput(CodingErrorAction.REPLACE)
   .onUnmappableCharacter(CodingErrorAction.REPLACE);

 // ... then re-use it across calls instead of paying the setup cost each time
 encoder.encode(...);
 encoder.encode(...);
 encoder.encode(...);
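
Spelled out as a complete (single-threaded) example, that reuse pattern might look something like this; the charset and the buffer handling are illustrative:

 import java.nio.ByteBuffer;
 import java.nio.CharBuffer;
 import java.nio.charset.CharacterCodingException;
 import java.nio.charset.Charset;
 import java.nio.charset.CharsetEncoder;
 import java.nio.charset.CodingErrorAction;

 public class ReusedEncoder {
     // configured once; note that CharsetEncoder is NOT thread-safe, so this
     // instance must be confined to one thread (or held in a ThreadLocal)
     private final CharsetEncoder encoder = Charset.forName("UTF-8")
         .newEncoder()
         .onMalformedInput(CodingErrorAction.REPLACE)
         .onUnmappableCharacter(CodingErrorAction.REPLACE);

     public byte[] encode(String s) throws CharacterCodingException {
         // encode(CharBuffer) is a complete operation that resets the encoder
         // first, so the same instance can be reused call after call
         ByteBuffer bb = encoder.encode(CharBuffer.wrap(s));
         byte[] bytes = new byte[bb.remaining()];
         bb.get(bytes);
         return bytes;
     }
 }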

It always pays to read the javadoc, even if you think you already know the answer.

skaffman
In Java 1.6 (at least) the implementation of `Charset.encode(...)` uses an encoder that is cached in a thread local, and repeats the setup calls (`onMalformedInput(...)` etc.) each time. By doing your own caching, you would only save the overhead of a thread-local fetch and the setup calls. This is probably insignificant ... though the profiler should tell you that.
Stephen C
Fair point. There is a multi-threaded use case here, though.
skaffman
Actually, I've read the javadoc ;).
Franz See
A: 

There are very few reasons to "squeeze" a string into a byte array. I would recommend writing the C functions to take UTF-16 strings as parameters. That way there is no need for any conversion.
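
For instance, if the native method takes a java.lang.String directly, the C++ side can read the UTF-16 code units via JNI's GetStringChars (and build the response with NewString), so no Charset.encode/decode is needed on either side. A rough sketch of the Java side, with illustrative names:

 public class NativeBridge {
     static {
         System.loadLibrary("bridge"); // illustrative library name
     }

     // The String crosses the JNI boundary as-is; on the C++ side,
     // GetStringChars exposes its UTF-16 code units (a jchar*) directly.
     public static native String process(String input);
 }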

Mihai Nita
Ok, I will try that one.
Franz See