is there any free java library which i can use to convert string in one encoding to other encoding, something like icnov in php? i'm using java version 1.3
CharsetDecoder
should be what you are looking for, no ?
Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1
(ISO-Latin-1
).
However, Java's native character encoding is Unicode UTF16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).
See Charset
. That doesn't mean UTF16
is the default charset (i.e.: the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII
,ISO-8859-1
a.k.a.ISO-LATIN-1
,UTF-8
,UTF-16BE
,UTF-16LE
,UTF-16
]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
This example demonstrates how to convert ISO-8859-1
encoded bytes in a ByteBuffer
to a string in a CharBuffer
and visa versa.
// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
// Convert a string to ISO-LATIN-1 bytes in a ByteBuffer
// The new ByteBuffer is ready to be read.
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));
// Convert ISO-LATIN-1 bytes in a ByteBuffer to a character ByteBuffer and then to a string.
// The new ByteBuffer is ready to be read.
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
} catch (CharacterCodingException e) {
}
You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)
EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
It is a whole lot easier if you think of unicode as a character set (which it actually is - it is very basically the numbered set of all known characters). You can encode it as UTF-8 (1-3 bytes per character depending) or maybe UTF-16 (2 bytes per character or 4 bytes using surrogate pairs).
Back in the mist of time Java used to use UCS-2 to encode the unicode character set. This could only handle 2 bytes per character and is now obsolete. It was a fairly obvious hack to add surrogate pairs and move up to UTF-16.
A lot of people think they should have used UTF-8 in the first place. When Java was originally written unicode had far more than 65535 characters anyway...