In Java, I have a String and I want to encode it as a byte array (in UTF-8, or some other encoding). Alternatively, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?

+3  A: 
String original = "hello world";
byte[] utf8Bytes = original.getBytes("UTF-8");
smink
Thanks! I wrote it up again myself adding the other direction of conversion.
mcherm
ok great :) glad to have helped.
smink
+15  A: 

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes("UTF-8");

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, "US-ASCII");

You should, of course, use the correct encoding name. My examples used "US-ASCII" and "UTF-8", the two most common encodings.
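
As a hedged sketch of the round trip above, assuming Java 7 or later: passing a `StandardCharsets` constant instead of an encoding name avoids the checked `UnsupportedEncodingException` that the String-name overloads declare. The class and method names here are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {
    // Encode a String as UTF-8 bytes; the Charset overload declares no checked exception.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes that are known to contain only US-ASCII characters.
    static String fromAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "\u00e9" (e with acute accent) takes two bytes in UTF-8.
        byte[] encoded = toUtf8("h\u00e9llo");
        System.out.println(Arrays.toString(encoded));
        System.out.println(fromAscii(new byte[] {99, 97, 116}));
    }
}
```

Decoding with the wrong charset does not throw; it silently produces wrong characters, so the encoding of the bytes still has to be known from context.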

mcherm
US-ASCII is actually not a very common encoding nowadays. Windows-1252 and ISO-8859-1 (which are supersets of ASCII) are far more widespread.
Michael Borgwardt
Actually, I find it fairly common in my work. I often read streams of bytes which may have been saved as Windows-1252 or ISO-8859-1 or even just as "output of that legacy program we've had for the past 10 years", but which contain bytes guaranteed to be valid US-ASCII characters. I also often have a requirement to GENERATE such files (for consumption by code which may or may not be able to handle non-ASCII characters). Basically, US-ASCII is the "greatest common denominator" of many pieces of software.
mcherm
+3  A: 

You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.
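
A minimal sketch of querying the `Charset` class mentioned above. The class name `CharsetLookupDemo` and the helper `canonicalName` are hypothetical; the `Charset` calls themselves are standard JDK methods.

```java
import java.nio.charset.Charset;

final class CharsetLookupDemo {
    // Resolve an encoding name or alias to its canonical name.
    // Throws an unchecked exception if the name is illegal or unsupported.
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        // UTF-8 is one of the charsets every Java platform must support.
        System.out.println(Charset.isSupported("UTF-8"));
        // "UTF8" is an alias that resolves to the canonical name.
        System.out.println(canonicalName("UTF8"));
        // The platform default, used by getBytes() and new String(byte[]) without arguments.
        System.out.println(Charset.defaultCharset());
        // Everything this particular JVM can encode or decode.
        System.out.println(Charset.availableCharsets().keySet());
    }
}
```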

90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You would not incrementally decode using the String methods on arbitrary byte streams - you would leave yourself open to bugs involving multibyte characters.
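
To illustrate the multibyte problem, here is a hedged sketch that decodes the same bytes two ways: chunk by chunk with the String constructor, and through an `InputStreamReader`. The class and method names are made up for the example, and it assumes Java 8 or later for `UncheckedIOException`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamDecodeDemo {
    // Naive approach: decode each fixed-size chunk on its own.
    // A character whose bytes straddle a chunk boundary gets corrupted.
    static String decodeChunksNaively(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    // Reader approach: the decoder keeps its state between reads,
    // so partial multibyte sequences are completed by later input.
    static String decodeWithReader(byte[] data) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[4];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u00e9" occupies bytes 1 and 2, so a chunk size of 2 splits it.
        byte[] data = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeChunksNaively(data, 2).equals("h\u00e9llo"));
        System.out.println(decodeWithReader(data).equals("h\u00e9llo"));
    }
}
```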

McDowell
+1 for mentioning multibyte characters.
sleske
A: 

Terribly late, but I just encountered this issue and this is my fix:

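// Requires: import java.nio.charset.StandardCharsets;
private static String removeNonUtf8CompliantCharacters( final String inString ) {
    if (null == inString ) return null;
    // Encode explicitly as UTF-8 rather than relying on the platform default charset.
    byte[] byteArr = inString.getBytes( StandardCharsets.UTF_8 );
    for ( int i=0; i < byteArr.length; i++ ) {
        // Java bytes are signed, so mask to the range 0..255 before comparing.
        int ch = byteArr[i] & 0xFF;
        // replace with a space any byte outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
            byteArr[i]=' ';
        }
    }
    return new String( byteArr, StandardCharsets.UTF_8 );
}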
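
A different technique from the loop above, offered only as a sketch and not as this answer's method: when the input is a byte array that may contain invalid UTF-8, a `CharsetDecoder` configured with `CodingErrorAction.REPLACE` can substitute a chosen replacement for each malformed sequence. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientUtf8 {
    // Decode bytes as UTF-8, replacing each malformed sequence with a space.
    static String decodeReplacingInvalid(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(" ");
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bytes = {97, (byte) 0xFF, 98};
        System.out.println(decodeReplacingInvalid(bytes));
    }
}
```

This works on raw bytes before they become a String, which is where invalid UTF-8 can actually occur; it does not strip control characters.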
savio
A: 

The blog at http://infomani.wordpress.com may be helpful.

Manikandan.M
+1  A: 

Here's a solution that avoids performing the Charset lookup for every conversion:

import java.nio.charset.Charset;

private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");

String decodeUTF8(byte[] bytes) {
    return new String(bytes, UTF8_CHARSET);
}

byte[] encodeUTF8(String string) {
    return string.getBytes(UTF8_CHARSET);
}
mleonhard
That's a good point... if performance is critical, then this would save a tiny amount of time. Only significant inside a very tight loop that isn't doing much else, but it could be helpful.
mcherm
A: 

Hi,

I have used the following logic to eliminate non-ASCII characters from my String, but it removes the double quotes in my string. Any idea how to keep the double quotes?

  text = text.replaceAll("\\\\n", "\n");
  text = text.replaceAll("\\\\t", "\t");
  text = text.replaceAll("\\\\r", "\r");
  text = text.replaceAll("\\\\,", ",");
  text = text.replaceAll("\\\\:", ":");
  text = text.replaceAll("[^\\p{ASCII}]", "");
  return text;

For example, if my String is: I rcvd "Achievers Award", I want to keep the double quotes in this string.

Thanks

Ash
Rather than posting your question here, as an "answer" to my existing question, you should post it as a question of its own. In the code you are showing, the first 5 lines have nothing to do with removing non-ASCII characters. The 6th line looks like it will leave normal "doublequotes" alone (Unicode 34) but will simply remove "curly quotes" (Unicode 8220, 8216, 8221, and 8217).
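
A small sketch to check the behavior of that 6th line in isolation. The class name `AsciiFilterDemo` and method name `stripNonAscii` are invented for the example; the regex is the one from the code above.

```java
final class AsciiFilterDemo {
    // Remove every character that is not in the ASCII range (0..127).
    static String stripNonAscii(String text) {
        return text.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Straight double quotes (Unicode 34) are ASCII and are kept.
        System.out.println(stripNonAscii("I rcvd \"Achievers Award\""));
        // Curly quotes (Unicode 8220 and 8221) are not ASCII and are removed.
        System.out.println(stripNonAscii("I rcvd \u201CAchievers Award\u201D"));
    }
}
```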
mcherm