I am working on a proxy server. I receive data as a byte[], which I convert into a String to perform certain operations. When I convert this new String back into a byte[], it causes problems I can't explain.

So mainly I need to know how to correctly convert a byte[] into a String and then back into a byte[] again.

I tried just converting the byte[] to a String and then back to a byte[] again (to make sure that it's not my operations that are causing the problems).

So it's like:

// where reply is a byte[]

String str= new String(reply,0, bytesRead);
streamToClient.write(str.getBytes(), 0, bytesRead);

is not equivalent to

streamToClient.write(reply, 0, bytesRead);

My proxy works fine when I just send the byte[] without any conversion, but when I convert it from a byte[] to a String and then back to a byte[], it causes problems.

Can someone please help? =]

+1  A: 

You will need to know the character encoding used, decode the bytes using that and re-encode using the same character encoding. For example:

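// decode only the bytes actually read, using an explicit charset
String str = new String(reply, 0, bytesRead, Charset.forName("UTF-8"));
// re-encode with the same charset; the length may differ from bytesRead
byte[] out = str.getBytes(Charset.forName("UTF-8"));
streamToClient.write(out, 0, out.length);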

If no encoding is specified, Java uses a default character encoding, which is typically UTF-8 (it may even be mandated as such), but HTML will often be something else. I suspect that's your problem.
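
To check which default your own JVM picks, something like the following should do. It is only a minimal sketch: the class and method names are made up for illustration, and what it prints depends on your platform and JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the proxy code in the question.
final class DefaultCharsetDemo {
    // new String(byte[]) without a charset decodes with the JVM default.
    static String decodeWithDefault(byte[] data) {
        return new String(data);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());

        // U+00E9 encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] data = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Yields one char only if the default happens to be UTF-8.
        System.out.println("Decoded length: " + decodeWithDefault(data).length());
    }
}
```

If the printed charset is not the one the data was written in, the implicit conversion is where the bytes get mangled.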

cletus
The default character encoding is not typically UTF-8, at least not on Windows.
Michael Borgwardt
Most modern Linux distributions default to UTF-8, however.
Joachim Sauer
+5  A: 

The best way to convert a byte[] to String and back into a byte[] is not to do it at all.

If you have to, you must know the encoding that was used to produce the byte[] and pass it explicitly. Otherwise the conversion uses the platform default encoding, which can corrupt the data: not all encodings can encode all possible strings, and not all possible byte sequences are legal in all encodings. This is what's happening in your case.
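
A small sketch of that corruption, using made-up bytes that are not valid UTF-8 (think of the start of a binary payload passing through the proxy). The class and method names are hypothetical. ISO-8859-1 is shown only for contrast, because it maps every byte value to exactly one char; that does not make it the right encoding for your data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class illustrating a lossy vs. lossless round trip.
final class RoundTripDemo {
    // Decode with the given charset, then encode again with the same one.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in UTF-8, and 0xD8 must be followed by a
        // continuation byte, which 0x41 ('A') is not.
        byte[] raw = {(byte) 0xFF, (byte) 0xD8, 0x41};

        // The decoder substitutes U+FFFD for the malformed input, so the
        // original bytes cannot be recovered.
        System.out.println(Arrays.equals(raw, roundTrip(raw, StandardCharsets.UTF_8)));      // false

        // Every byte maps to one char and back, so nothing is lost.
        System.out.println(Arrays.equals(raw, roundTrip(raw, StandardCharsets.ISO_8859_1))); // true
    }
}
```

Note that the re-encoded array can also have a different length than the input, so reusing the original bytesRead when writing it out is a second source of trouble.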

As for how to find out the encoding, that depends:

  • If you're using HTTP, look at the Content-Type header
  • If your data is XML, you should be using an XML parser, which will handle the encoding for you
  • If your data is HTML pages, there might also be a <meta http-equiv> tag in the document
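
For the HTTP case, pulling the charset out of a Content-Type value could look roughly like this. It is a sketch only: charsetOf is a hypothetical helper, the parsing is deliberately simple, and the fallback you pass in is your own assumption about what to do when the header names no charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not a complete Content-Type parser.
final class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs); // UTF-8
    }
}
```

The charset found this way would then be passed to both new String(...) and getBytes(...), rather than relying on the platform default.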

If there is no way to find out the encoding, you have random garbage, not text data.

Michael Borgwardt
Aha... thanks for the detailed answer. I guess I'd better start working on a new approach to the proxy now.
Sid