I'm having trouble manipulating a Java string. I have a String, s, whose value is 丞 (a Chinese character I chose at random; I don't speak Chinese). If I run:

String t = new String(s.getBytes());
if (s.equals(t))
    System.out.println("String unchanged");
else
    System.out.println("String changed");

Then I get the "String changed" result. Does anyone know what's going on?

+6  A: 

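Because that method (String.getBytes() with no arguments):

Encodes this String into a sequence of bytes using the platform's default charset

If your default charset is, e.g., US-ASCII, it cannot represent that Chinese character, so you won't get bytes that decode back to it.

In practice, a character the charset cannot map is typically replaced by a substitute byte (often '?'), so the original character is lost in the process.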
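To make the loss visible, here is a minimal sketch that forces US-ASCII instead of relying on whatever the platform default happens to be. The class and method names (AsciiLossDemo, roundTripAscii) are made up for illustration, and it assumes Java 7+ for StandardCharsets. The character is written as the escape \u4E1E so the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class AsciiLossDemo {
    // Encodes with US-ASCII and decodes with the same charset.
    // getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String roundTripAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String s = "\u4E1E"; // 丞
        String t = roundTripAscii(s);
        System.out.println(t);           // prints ?
        System.out.println(s.equals(t)); // prints false
    }
}
```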

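Try using getBytes(String charsetName):

public byte[] getBytes(String charsetName)

with a charsetName that can represent the character (for example "UTF-8"), and pass the same charsetName to the String constructor when converting the bytes back.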
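As a rough sketch of that suggestion, the question's snippet could be changed to name the charset on both the encode and the decode side. The helper name roundTrip and the class name are hypothetical. The checked UnsupportedEncodingException is rethrown unchecked here only to keep the example short.

```java
import java.io.UnsupportedEncodingException;

class Utf8RoundTrip {
    // Encodes and decodes with the same explicitly named charset,
    // so the platform default never comes into play.
    static String roundTrip(String s, String charsetName) {
        try {
            byte[] bytes = s.getBytes(charsetName);
            return new String(bytes, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String s = "\u4E1E"; // 丞
        String t = roundTrip(s, "UTF-8");
        if (s.equals(t))
            System.out.println("String unchanged"); // this branch runs
        else
            System.out.println("String changed");
    }
}
```

On Java 7 and later, the overloads that take a java.nio.charset.Charset (for example StandardCharsets.UTF_8) do the same job without the checked exception.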

OscarRyz
+2  A: 

The getBytes() method uses the default encoding. According to the docs:

The CharsetEncoder class should be used when more control over the encoding process is required.
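For what "more control" can look like, here is a small sketch, not the only way to do it. It uses a CharsetEncoder configured to report unmappable characters instead of silently substituting them. The class and method names are invented for this example.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncodeDemo {
    // Returns true only if every character of s can be encoded in the
    // given charset; REPORT makes the encoder throw instead of replacing.
    static boolean canEncodeStrictly(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String s = "\u4E1E"; // 丞
        System.out.println(canEncodeStrictly(s, StandardCharsets.US_ASCII)); // prints false
        System.out.println(canEncodeStrictly(s, StandardCharsets.UTF_8));    // prints true
    }
}
```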

Vincent Ramdhanie
+1  A: 

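String t = new String(s.getBytes()); may build the string using ASCII (or whatever the platform default charset is). Use the following constructor to create the string, with charsetName set to "UTF-8", and encode with s.getBytes("UTF-8") so that both sides agree: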

String(byte[] bytes, int offset, int length, String charsetName)
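A quick sketch of how that constructor might be used, assuming the bytes were produced with UTF-8 in the first place. The offset and length select a sub-range of the array, and they count bytes, not characters. The names SliceDecodeDemo and decodeSlice are hypothetical.

```java
import java.io.UnsupportedEncodingException;

class SliceDecodeDemo {
    // Decodes length bytes starting at offset as UTF-8.
    static String decodeSlice(byte[] bytes, int offset, int length) {
        try {
            return new String(bytes, offset, length, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "A丞B" is 5 bytes in UTF-8: 1 for 'A', 3 for 丞, 1 for 'B'.
        byte[] bytes = "A\u4E1EB".getBytes("UTF-8");
        String middle = decodeSlice(bytes, 1, 3);
        System.out.println(middle.equals("\u4E1E")); // prints true
    }
}
```

To decode the whole array, pass 0 and bytes.length, or use the shorter String(byte[] bytes, String charsetName) constructor.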

jatanp
+1  A: 

Actually, I figured this out, sorry for the post. I was relying on the default Java charset instead of explicitly specifying UTF-8. It works now.

Jon