ansaurus

Question

How to convert Strings to and from UTF8 byte arrays in Java

Answer 1

+3 A:

String original = "hello world";
byte[] utf8Bytes = original.getBytes("UTF-8");

smink 2008-09-18 00:13:15

Thanks! I wrote it up again myself adding the other direction of conversion.

mcherm 2008-09-18 00:18:17

ok great :) glad to have helped.

smink 2008-09-18 00:23:48

Answer 2

+15 A:

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes("UTF-8");

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, "US-ASCII");

You should, of course, use the correct encoding name. My examples used "US-ASCII" and "UTF-8", the two most common encodings.
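
To illustrate why the encoding name matters, here is a minimal sketch (the class and method names are made up for this example, not part of the answer). It encodes a string containing one non-ASCII character as UTF-8, then decodes the bytes once with the matching name and once with "US-ASCII". Note that the overloads taking a charset name declare the checked UnsupportedEncodingException.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; the names are for illustration only.
final class EncodingMismatchDemo {

    // Decode bytes using a charset name, as in the snippets above.
    static String decode(byte[] bytes, String charsetName) throws UnsupportedEncodingException {
        return new String(bytes, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "caf\u00e9"; // "cafe" with an accented e
        byte[] utf8 = original.getBytes("UTF-8");

        System.out.println(utf8.length);                               // 5: the accented e takes two bytes
        System.out.println(decode(utf8, "UTF-8").equals(original));    // true
        System.out.println(decode(utf8, "US-ASCII").equals(original)); // false: wrong charset
    }
}
```

Decoding with the wrong name does not throw; the String constructor silently substitutes replacement characters, so a mismatch can go unnoticed until the text is displayed.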

mcherm 2008-09-18 00:16:39

US-ASCII is actually not a very common encoding nowadays. Windows-1252 and ISO-8859-1 (which are supersets of ASCII) are far more widespread.

Michael Borgwardt 2009-10-09 13:26:06

Actually, I find it fairly common in my work. I often read streams of bytes which may have been saved as Windows-1252 or ISO-8859-1 or even just as "output of that legacy program we've had for the past 10 years", but which contain bytes guaranteed to be valid US-ASCII characters. I also often have a requirement to GENERATE such files (for consumption by code which may or may not be able to handle non-ASCII characters). Basically, US-ASCII is the "greatest common denominator" of many pieces of software.

mcherm 2009-10-13 18:01:55

Answer 3

+3 A:

You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.
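
As a rough sketch of the Charset lookup mentioned above (the class and helper names are hypothetical), the following checks whether a name is supported before resolving it. Which charsets are available beyond the standard ones depends on the JVM.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are for illustration only.
final class CharsetLookupDemo {

    // Returns the Charset for a name or alias, or null if this JVM does not support it.
    // Charset.isSupported throws IllegalCharsetNameException for syntactically illegal names.
    static Charset lookup(String name) {
        return Charset.isSupported(name) ? Charset.forName(name) : null;
    }

    public static void main(String[] args) {
        // "UTF8" is an alias; name() reports the canonical name.
        System.out.println(lookup("UTF8").name()); // UTF-8
        // List every charset this JVM knows about.
        System.out.println(Charset.availableCharsets().keySet());
    }
}
```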

90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You would not incrementally decode using the String methods on arbitrary byte streams: a multibyte character that is split across two chunks would be decoded incorrectly.
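
A minimal sketch of the difference, assuming Java 7 or later (for StandardCharsets and try-with-resources); the class and method names are invented for this example. The Reader keeps decoder state between reads, while decoding each byte chunk separately can cut a multibyte character in half.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are for illustration only.
final class StreamDecodeDemo {

    // Decode a whole stream through a Reader. The Reader carries partial
    // multibyte sequences over from one read to the next.
    static String readAll(InputStream in, int bufferSize) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[bufferSize];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        }
        return out.toString();
    }

    // Anti-pattern: decode each fixed-size byte chunk on its own.
    // A character whose bytes straddle a chunk boundary is corrupted.
    static String decodeChunked(byte[] bytes, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int off = 0; off < bytes.length; off += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - off);
            out.append(new String(bytes, off, len, StandardCharsets.UTF_8));
        }
        return out.toString();
    }
}
```

For instance, two accented characters occupy four bytes in UTF-8; decodeChunked with a chunk size of 3 splits the second character and yields replacement characters, whereas readAll returns the original text.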

McDowell 2008-09-18 11:32:38

+1 for mentioning multibyte characters.

sleske 2010-09-23 10:57:43

Answer 4

A:

Terribly late, but I just encountered this issue and this is my fix:

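private static String removeNonUtf8CompliantCharacters( final String inString ) {
    if (null == inString ) return null;
    // encode explicitly as UTF-8 instead of relying on the platform default charset
    byte[] byteArr = inString.getBytes( java.nio.charset.StandardCharsets.UTF_8 );
    for ( int i=0; i < byteArr.length; i++ ) {
        // Java bytes are signed, so mask to 0..255 before doing range comparisons
        int ch = byteArr[i] & 0xFF;
        // replace any bytes outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines with a space
        if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
            byteArr[i]=' ';
        }
    }
    return new String( byteArr, java.nio.charset.StandardCharsets.UTF_8 );
}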
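
If the underlying goal is to detect byte sequences that are not well-formed UTF-8, one alternative (a sketch, not the approach in the code above; the class and method names are hypothetical) is to run the bytes through a CharsetDecoder configured to report errors instead of replacing them. It assumes Java 7 or later for StandardCharsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are for illustration only.
final class Utf8Validation {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for malformed or truncated sequences.
            return false;
        }
    }
}
```

Switching the two actions to CodingErrorAction.REPLACE would decode anyway and substitute a replacement character for each bad sequence.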

savio 2010-02-19 00:04:18

Answer 5

A:

The blog at http://infomani.wordpress.com may be helpful.

Manikandan.M 2010-06-08 13:31:18

Answer 6

+1 A:

Here's a solution that avoids performing the Charset lookup for every conversion:

import java.nio.charset.Charset;

private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");

String decodeUTF8(byte[] bytes) {
    return new String(bytes, UTF8_CHARSET);
}

byte[] encodeUTF8(String string) {
    return string.getBytes(UTF8_CHARSET);
}
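
On Java 7 and later (an assumption, since this answer predates that release), the JDK provides ready-made constants in java.nio.charset.StandardCharsets, so no lookup or cached field is needed. A minimal sketch, with a hypothetical utility class name:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical utility class; the name is for illustration only.
final class Utf8 {

    private Utf8() {
        // static helpers only
    }

    // Decode UTF-8 bytes into a String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode a String as UTF-8 bytes.
    static byte[] encode(String string) {
        return string.getBytes(StandardCharsets.UTF_8);
    }
}
```

The Charset overloads also avoid the checked UnsupportedEncodingException that the charset-name overloads declare.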

mleonhard 2010-08-02 09:53:47

That's a good point... if performance is critical, then this would save a tiny amount of time. Only significant inside a very tight loop that isn't doing much else, but it could be helpful.

mcherm 2010-08-06 15:39:06

Answer 7

A:

Hi,

I have used the following logic to eliminate non-ASCII chars in my String, but it removes the double quotes in my string. Any idea how to keep the double quotes?

  text = text.replaceAll("\\\\n", "\n");
  text = text.replaceAll("\\\\t", "\t");
  text = text.replaceAll("\\\\r", "\r");
  text = text.replaceAll("\\\\,", ",");
  text = text.replaceAll("\\\\:", ":");
  text = text.replaceAll("[^\\p{ASCII}]", "");
  return text;

For example, if my String is: I rcvd "Achievers Award", ... I want to keep the double quotes in this string.

Thanks

Ash 2010-09-13 07:14:30

Rather than posting your question here, as an "answer" to my existing question, you should post it as a question of its own. In the code you are showing, the first 5 lines have nothing to do with removing non-ASCII characters. The 6th line looks like it will leave normal "doublequotes" alone (Unicode 34) but will simply remove "curly quotes" (Unicode 8220, 8216, 8221, and 8217).
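
A quick way to check this is a small sketch like the following (the class and method names are invented; the pattern is the one from the 6th line of the posted code):

```java
// Hypothetical demo class; the names are for illustration only.
final class AsciiFilterDemo {

    // Same pattern as the 6th line of the posted code:
    // remove every character that is not in the ASCII range.
    static String stripNonAscii(String text) {
        return text.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Straight double quotes (Unicode 34) are ASCII and survive.
        System.out.println(stripNonAscii("I rcvd \"Achievers Award\""));
        // Curly quotes (Unicode 8220 and 8221) are not ASCII and are removed.
        System.out.println(stripNonAscii("I rcvd \u201CAchievers Award\u201D"));
    }
}
```

If the input contains curly quotes that should be kept as straight quotes, they would need to be replaced before the non-ASCII filter runs.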

mcherm 2010-09-16 13:41:26
