ansaurus

Question

Answer 1

A:

You should get the proper results if you take the string and call its .getBytes("UTF-8") method. This will force it back to a byte array using the same charset that the string was decoded with.
From there you should be able to turn the byte array back into a string using the new String(byte[] bytes, String charset) constructor, passing in the appropriate charset.

See javadocs for details.

Mike Clark 2010-04-12 14:53:53

As others have pointed out: no, this won't work.

Joachim Sauer 2010-04-12 15:07:50

Answer 2

A:

You can use this tutorial

The charset you need should be defined in rt.jar (according to this)

LB 2010-04-12 14:54:37

Answer 3

A:

What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.

Edit: As a commenter said, this won't work. When you decode a Windows-1252 byte array as if it were UTF-8, you are bound to get encoding errors. (See here and here).
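As a side note on how those errors surface in Java: the String constructors do not throw on malformed input but silently substitute the replacement character U+FFFD. A strict CharsetDecoder does report the problem. The following is only a minimal sketch; the class name StrictDecodeDemo and the helper isValidUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class StrictDecodeDemo {

    /** Returns true if the bytes are well-formed UTF-8, false otherwise. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of inserting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] windows1252 = { 0x42, (byte) 0xE4, 0x72 };          // "Bär" in Windows-1252
        byte[] utf8 = "B\u00E4r".getBytes(StandardCharsets.UTF_8); // "Bär" in UTF-8

        System.out.println(isValidUtf8(windows1252)); // false: 0xE4 starts a sequence that 0x72 cannot continue
        System.out.println(isValidUtf8(utf8));        // true
    }
}
```

A check like this can only detect the problem on the raw bytes; it does not help once the bytes have already been turned into a String.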

kgiannakakis 2010-04-12 14:54:41

That's what I was afraid of...

Nico 2010-04-12 15:05:18

Answer 4

+3 A:

As there seems to be some confusion about whether this is possible or not, I think I need to provide an extensive example.

The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").

For this example I'll choose the German word "Bär" (meaning bear) as the input:

byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.

(If your JVM doesn't support that encoding, you can use ISO-8859-1 instead, because those three letters, and most others, are at the same position in both encodings.)

The question goes on to state that some other code (outside of our influence) has already converted that byte[] to a String using the UTF-8 encoding. I'll call that String is (for "input String"). That String is the only input available to achieve our goal (if ib were available, it would be trivial):

String is = new String(ib, "UTF-8");
System.out.println(is);

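This obviously produces the incorrect output "B�" (recent JVMs keep the trailing r and print "B�r"). Either way, the ä has been replaced by U+FFFD, the replacement character.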

The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.

Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:

byte[] utf8Again = is.getBytes("UTF-8");

But that returns the UTF-8 encoding of the two characters B and � and definitely returns the wrong result when re-interpreted as Windows-1252:

System.out.println(new String(utf8Again, "Windows-1252"));

This line produces the output "Bï¿½", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).

So in this case you can't undo the operation, because information is lost.
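To make the loss concrete, here is a small self-contained sketch (the class name LossDemo and the helper misdecode are made up for illustration). It shows that two different Windows-1252 inputs collapse into the same String once they are decoded as UTF-8, so no function of that String alone can tell them apart.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class LossDemo {

    /** Simulates the faulty step: decodes the bytes as UTF-8 regardless of their real encoding. */
    public static String misdecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] baer = { 0x42, (byte) 0xE4, 0x72 }; // "Bär" in Windows-1252
        byte[] buer = { 0x42, (byte) 0xFC, 0x72 }; // "Bür" in Windows-1252

        String a = misdecode(baer);
        String b = misdecode(buer);

        // Both invalid bytes were replaced by the same character, U+FFFD.
        System.out.println(a.equals(b)); // true

        // Re-encoding yields the bytes of U+FFFD (EF BF BD), not the original 0xE4.
        System.out.println(Arrays.equals(a.getBytes(StandardCharsets.UTF_8), baer)); // false
    }
}
```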

There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in that encoding. Since UTF-8 has several byte sequences that are simply not valid, you will have problems.
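For contrast, here is a sketch of the opposite situation, where the mistake can be undone. It assumes the wrong decoding step used ISO-8859-1, in which every byte value maps to exactly one character, so nothing is thrown away. The class name ReversibleDemo and the helper repair are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class ReversibleDemo {

    /** Undoes "UTF-8 bytes decoded as ISO-8859-1" by reversing the wrong step first. */
    public static String repair(String garbled) {
        // ISO-8859-1 maps every byte to one char, so this restores the original bytes.
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "B\u00E4r".getBytes(StandardCharsets.UTF_8);       // 42 C3 A4 72
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);  // "BÃ¤r"

        System.out.println(repair(garbled).equals("B\u00E4r")); // true
    }
}
```

This only works because the garbled String contains characters that ISO-8859-1 can encode; it is not a fix for the case in the question.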

Joachim Sauer 2010-04-12 16:53:22

I get the problem now. Sorry. It's like an overflow issue during the encoding of an _invalid_ UTF-8 byte[] that causes loss of data. Thanks for the lesson.

nicerobot 2010-04-12 18:37:18

I'm glad we sorted it out! ;-)

Joachim Sauer 2010-04-12 22:49:40

"Fix" String encoding in Java
