According to this documentation (http://java.sun.com/docs/books/jls/third_edition/html/lexical.html, section 3.10.6), an OctalEscape is converted to a Unicode character. My problem is that the following code produces a two-byte sequence with the wrong contents.

for (byte b : "\222".getBytes()) {
     System.out.format("%02x ", b);
}

The result is "c2 92". I was expecting only "92", because that is 222 octal converted to hex (92). If I test this with a single character, the byte is correct.

System.out.format("%02x ", (byte)'\222');

The result is "92", a single byte. My default encoding is "UTF-8" on Linux with Java/javac 1.6.0_18.

The background of my question is that I'm looking for a way to convert an octal-escaped string from the input encoding Cp1252 to UTF-8. This fails because the octal-escaped character is converted to 2 bytes. Does anybody know why an extra byte "c2" is always added? A simple count shows that the string contains only one character.

System.out.println("\222".toCharArray().length); // will result in "1"

Thank you for your hints.

Update: As BalusC mentioned, the octal-escaped value is interpreted as a Unicode character and then encoded as UTF-8, which causes the problem. As long as this value is stored in the (UTF-8) source code, I have no way to read this string with another encoding. Am I right? If I read a Cp1252-encoded file, I have to declare the correct charset on the InputStreamReader and then encode to UTF-8 to process and save the content as UTF-8.

+4  A: 

The String#getBytes() call without a specified encoding uses the platform default encoding to convert characters to bytes. Since c2 is a typical lead byte of a two-byte UTF-8 sequence, you're apparently using UTF-8 as the platform default encoding. If you want CP1252 bytes, you need to specify that explicitly with the String#getBytes(String charsetName) method.

for (byte b : "\222".getBytes("cp1252")) {
     System.out.format("%02x ", b);
}
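To make the effect of the charset argument visible, here is a minimal sketch that encodes the same one-character string three ways. The class name EncodeDemo and the helper hex are made up for illustration, and StandardCharsets assumes Java 7 or later (on 1.6, Charset.forName("UTF-8") would be the equivalent).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Hypothetical helper: encode s with the given charset and
    // return the bytes as space-separated two-digit hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\222"; // a single char, U+0092

        // UTF-8 needs two bytes for any code point in U+0080..U+07FF.
        System.out.println(hex(s, StandardCharsets.UTF_8));          // c2 92
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes.
        System.out.println(hex(s, StandardCharsets.ISO_8859_1));     // 92
        // windows-1252 has no mapping for U+0092, so '?' is substituted.
        System.out.println(hex(s, Charset.forName("windows-1252"))); // 3f
    }
}
```

The ISO-8859-1 line is the only one that gives back the numeric value of the octal escape as a byte, because that charset is an identity mapping for the first 256 code points.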

Update as per your update:

As long as this value is stored in the (UTF-8) source code, I have no way to read this string with another encoding. Am I right?

That's correct. You need to read the file using the same encoding it was saved in, otherwise you risk ending up with mojibake.

If I read a Cp1252-encoded file, I have to declare the correct charset on the InputStreamReader and then encode to UTF-8 to process and save the content as UTF-8.

Just read the file as CP1252 using InputStreamReader. Once read as characters (strings), Java stores the data internally as Unicode (UTF-16), and you can treat it as Unicode. There's no need for an intermediate UTF-8 file step. If you want to save the file, use OutputStreamWriter with the desired charset, which can be different from CP1252. Just keep in mind that any character the target charset can't represent will end up as ?.
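The read-as-CP1252, write-as-UTF-8 flow described above can be sketched roughly as follows. The class and method names (Transcode, transcode) are hypothetical, and the demo uses in-memory streams instead of real files so it stays self-contained. With files you would pass a FileInputStream and a FileOutputStream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Decode bytes from 'in' using charset 'from', then encode the
    // resulting characters to 'out' using charset 'to'.
    static void transcode(InputStream in, Charset from,
                          OutputStream out, Charset to) throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // push any buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // "It's" with a CP1252 right single quotation mark (byte 0x92).
        byte[] cp1252 = { 0x49, 0x74, (byte) 0x92, 0x73 };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(cp1252),
                  Charset.forName("windows-1252"),
                  out, StandardCharsets.UTF_8);
        for (byte b : out.toByteArray()) {
            System.out.format("%02x ", b); // 49 74 e2 80 99 73
        }
        System.out.println();
    }
}
```

The single byte 92 becomes the three bytes e2 80 99 because the reader decodes it to U+2019 before the writer encodes it again.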

BalusC
How do you know this stuff, lolz.
Tony Ennis
This results in an output of "3f", which is not the octal value that I passed in.
DrDol
It's the hexadecimal representation of the byte which is representing the **CP1252 character** as represented by the octal `\222`. `92` is the value when you convert the **literal string** `222` from octal to hexadecimal. I think that you need to revise the functional requirement after all. Maybe you're missing one conversion layer (the character encoding itself, I bet).
BalusC
Little correction: `3f` is a `?`. It's printed because Unicode character `U+0092` can't be represented in CP1252.
axtavt
@axtavt: Ah, I didn't check for that. It was just off the top of my head. But indeed, CP1252 doesn't support that character.
BalusC
+2  A: 

All chars and strings in Java are UTF-16. So, you have entered the control character U+0092 PRIVATE USE TWO and encoded it to UTF-8 (this character takes two bytes when encoded as UTF-8). Characters encoded as anything other than UTF-16 must be represented by byte arrays.

U+2019: ’

I'm guessing you intend to transcode the character U+2019 RIGHT SINGLE QUOTATION MARK. In windows-1252, this has a byte value of 0x92. I hate to disappoint, but when encoded as UTF-8 this is going to end up as the multi-byte sequence E2 80 99.

Also note that U+2019 can't be represented by octal escape sequences in Java as it has a value over U+00FF. You'd have to use the Unicode escape sequence \u2019. I wrote a blog post about transcoding in different languages here and encoding in Java source files here.
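If the octal escapes in the string are really meant as raw windows-1252 byte values, one possible workaround is to turn the chars back into bytes via ISO-8859-1 and then decode those bytes as windows-1252. This is only a sketch under that assumption; the class and method names are invented for illustration, and it only makes sense when every char in the input is U+00FF or below.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class OctalEscapeFix {
    // Hypothetical helper. Assumes each char of 'escaped' holds a raw
    // windows-1252 byte value (0x00..0xFF), e.g. from an octal escape.
    static String reinterpretAsCp1252(String escaped) {
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes, so this
        // recovers the intended byte values. Chars above U+00FF would
        // become '?' here, which is why the assumption matters.
        byte[] raw = escaped.getBytes(StandardCharsets.ISO_8859_1);
        // Now decode those bytes with the charset they were meant for.
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String fixed = reinterpretAsCp1252("\222");
        System.out.format("U+%04X%n", (int) fixed.charAt(0)); // U+2019
        for (byte b : fixed.getBytes(StandardCharsets.UTF_8)) {
            System.out.format("%02x ", b); // e2 80 99
        }
        System.out.println();
    }
}
```

Writing \u2019 directly in the source avoids the round trip entirely, so this is mainly useful when the escaped text comes from somewhere you don't control.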

McDowell
Actually, I read some of your blog posts and they helped a lot in understanding the (IMHO) complex topic of character encoding. It's like regex: voodoo in the beginning, but really powerful if you dig deeper. THX
DrDol