ansaurus

Question

Answer 1

+3 A:

CharsetDecoder should be what you are looking for, no?

Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is ~~Unicode~~ UTF-16 (sixteen-bit UCS Transformation Format): in memory, each char holds one UTF-16 code unit.

See Charset. That doesn't mean UTF-16 is the default charset (i.e., the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[The standard charsets are US-ASCII, ISO-8859-1 (a.k.a. ISO-LATIN-1), UTF-8, UTF-16BE, UTF-16LE and UTF-16.]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
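To make the point about the default charset concrete, here is a minimal sketch. The class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later (on older versions, Charset.forName("ISO-8859-1") does the same job). It prints whatever default the JVM picked at startup, then encodes with an explicitly named charset so the result does not depend on that default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
public class DefaultCharsetDemo {
    // Encodes text with an explicitly named charset rather than the JVM default.
    static byte[] encodeExplicitly(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The default is fixed at JVM startup and varies between machines.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // U+00E9 (e with acute accent) is a single byte, 0xE9, in ISO-8859-1,
        // whatever the default charset happens to be.
        byte[] bytes = encodeExplicitly("\u00e9");
        System.out.println("ISO-8859-1 byte count: " + bytes.length);
    }
}
```

Calling text.getBytes() with no argument would use the default charset instead, which is why the same code can produce different bytes on different machines.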

This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.

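import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();

try {
    // Convert a string to ISO-8859-1 bytes in a ByteBuffer.
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));

    // Convert the ISO-8859-1 bytes in the ByteBuffer to a CharBuffer and then to a string.
    // The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String s = cbuf.toString();
} catch (CharacterCodingException e) {
    // Thrown on malformed input or on a character that ISO-8859-1 cannot represent.
    e.printStackTrace();
}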
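One practical difference between the encoder above and the String methods is what happens to characters the target charset cannot represent. The sketch below (class and method names are hypothetical) assumes the encoder's default settings, which report such characters by throwing, whereas String.getBytes silently substitutes the charset's replacement byte.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original answer.
public class UnmappableDemo {
    // Returns true if every character of text can be encoded in the given charset.
    static boolean encodesCleanly(String text, Charset charset) {
        try {
            // A fresh encoder reports unmappable characters by throwing.
            charset.newEncoder().encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        String euro = "\u20ac"; // the euro sign is not part of ISO-8859-1

        System.out.println(encodesCleanly("a string", latin1)); // true
        System.out.println(encodesCleanly(euro, latin1));       // false

        // String.getBytes does not throw; it substitutes '?' (0x3F).
        byte[] replaced = euro.getBytes(latin1);
        System.out.println((char) replaced[0]);                 // ?
    }
}
```

If silent data loss matters for your conversion, the stricter encoder behaviour is the safer choice.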

VonC 2008-10-23 08:57:21

From http://www.exampledepot.com/egs/java.nio.charset/ConvertChar.html

VonC 2008-10-23 08:59:26

Unicode is not an encoding! UTF-8, UTF-16 etc. are. See http://www.joelonsoftware.com/articles/Unicode.html

SealedSun 2010-08-03 14:19:52

@SealedSun: very true. I have fixed that "java native encoding" section in my answer.

VonC 2010-08-03 17:35:06

Answer 2

+3 A:

You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)

EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
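As a rough sketch of that approach (the Transcode class and its transcode method are made-up names, and both String overloads taking a Charset need Java 6 or later), converting ISO-8859-1 bytes to UTF-8 looks something like this:

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original answer.
public class Transcode {
    // Decodes bytes from one charset and re-encodes them in another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // bytes -> chars
        return text.getBytes(to);              // chars -> bytes
    }

    public static void main(String[] args) {
        // Look the charsets up once; a typo in the name fails here, up front.
        Charset latin1 = Charset.forName("ISO-8859-1");
        Charset utf8 = Charset.forName("UTF-8");

        // "caf" followed by e-acute (0xE9) in ISO-8859-1: four bytes.
        byte[] latin1Bytes = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        byte[] utf8Bytes = transcode(latin1Bytes, latin1, utf8);

        // e-acute becomes the two bytes 0xC3 0xA9 in UTF-8, so this prints 5.
        System.out.println(utf8Bytes.length);
    }
}
```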

Jon Skeet 2008-10-23 08:57:41

I prefer new String(byte[], encoding) and String.getBytes(encoding) in most cases, because they are simple one-liners as opposed to the more powerful but more complicated API of Charset (which, BTW, is only available in Java 1.4+).

Alexander 2008-10-23 09:06:29

Yes, it's a shame that the Charset API is so complicated. The .NET System.Text.Encoding class does this really well, IMO - and keeps the functionality out of String.

Jon Skeet 2008-10-23 09:08:53

Links fixed. See http://www.free-scripts.net/html_tutorial/html/topics/urlencoding.htm

VonC 2008-10-23 10:34:48

@VonC: Thanks. Shame the UI doesn't help do this automatically :)

Jon Skeet 2008-10-23 10:54:16

And now, a brand new sofaq section ;) http://stackoverflow.com/questions/229364

VonC 2008-10-23 11:27:27

Answer 3

A:

It is a whole lot easier if you think of Unicode as a character set (which it actually is: essentially the numbered set of all known characters). You can encode it as UTF-8 (1 to 4 bytes per character) or as UTF-16 (2 bytes per character, or 4 bytes using surrogate pairs).
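A quick way to see those sizes for yourself is to encode a few sample characters and count the bytes. This is only an illustrative sketch: the class and method names are invented, StandardCharsets assumes Java 7+, and UTF-16BE is used so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
public class EncodedLengths {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE writes no byte-order mark, so only the character is counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041:  UTF-8 1 byte,  UTF-16 2 bytes
            "\u00e9",       // U+00E9:  UTF-8 2 bytes, UTF-16 2 bytes
            "\u20ac",       // U+20AC:  UTF-8 3 bytes, UTF-16 2 bytes
            "\uD83D\uDE00"  // U+1F600: UTF-8 4 bytes, UTF-16 4 bytes (surrogate pair)
        };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + " utf8=" + utf8Length(s) + " utf16=" + utf16Length(s));
        }
    }
}
```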

Back in the mists of time, Java used UCS-2 to encode the Unicode character set. UCS-2 can only handle 2 bytes per character and is now obsolete. Adding surrogate pairs and moving up to UTF-16 was a fairly obvious hack.
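The visible consequence of that move is that a Java char is a UTF-16 code unit, not a whole character. A small sketch (the class and method names are hypothetical; the Character and String methods used have been in the standard library since Java 5):

```java
// Hypothetical demo class, not part of the original answer.
public class SurrogatePairs {
    // Counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the 16-bit range, so it needs a surrogate pair.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());                            // 2 chars (code units)
        System.out.println(codePoints(s));                         // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

So code that indexes a string with charAt or sizes it with length() is still effectively working in UCS-2 terms unless it accounts for surrogates.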

A lot of people think Java should have used UTF-8 in the first place. Then again, when Java was originally written, Unicode had far fewer than 65,536 characters anyway...

2009-08-29 17:34:35


Encoding conversion in java
