ansaurus

Question

Can an empty Java string be created from a non-empty UTF-8 byte array?

Answer 1

A:

UTF-8 is a variable-length encoding scheme, with most "normal" characters being a single byte. So any given non-empty byte[] will always translate into a non-empty String, I'd have thought.

If you want to play it safe, write a unit test which iterates over every possible byte value, passing in a single-element array containing that value, and asserts that the resulting string is non-empty.
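A minimal sketch of such a check, not a definitive test. The class and method names are hypothetical, and it assumes the String(byte[], Charset) constructor (Java 6+) with StandardCharsets (Java 7+) rather than the charset-name overload discussed in this thread. The Charset overload is documented to replace malformed input with the charset's replacement string, so it may behave differently from the name-based overload on some JVMs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are for illustration only.
class SingleByteCheck {

    // Counts how many one-byte arrays decode to an empty String.
    static int countEmptyResults() {
        int empty = 0;
        for (int value = 0; value < 256; value++) {
            byte[] input = { (byte) value };
            // Assumption: Charset overload, which substitutes a
            // replacement character for malformed input.
            String decoded = new String(input, StandardCharsets.UTF_8);
            if (decoded.isEmpty()) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        System.out.println("Empty results: " + countEmptyResults());
    }
}
```

In a real unit test the count would be asserted to be zero instead of printed.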

skaffman 2009-05-07 15:44:50

Answer 2

+5 A:

According to the Javadoc for java.lang.String, the behavior of new String(byte[], "UTF-8") is unspecified when the byte array contains data that is not valid in the given charset. If you want more predictability in your resulting string, use a CharsetDecoder: http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CharsetDecoder.html.
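One way this could look, as a sketch only: a decoder configured with CodingErrorAction.REPORT throws on invalid bytes instead of silently substituting or dropping them. The class and method names (StrictUtf8, decode, isValid) are hypothetical and not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are for illustration only.
class StrictUtf8 {

    // Decodes the bytes as UTF-8 and throws if they are not valid UTF-8.
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Convenience wrapper: true if the bytes decode without error.
    static boolean isValid(byte[] bytes) {
        try {
            decode(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[] { 0x41 }));
        System.out.println(isValid(new byte[] { (byte) 0xFF }));
    }
}
```

With REPORT, the caller decides what to do with bad input rather than relying on whatever the String constructor happens to do.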

Trey 2009-05-07 15:49:31

Answer 3

+1 A:

Possibly.

From the Java 5 API docs: "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."

I guess that it depends on which version of Java you're using and which vendor wrote your JVM (Sun, HP, IBM, the open source one, etc).

Once the docs say "unspecified", all bets are off.

Edit: Beaten to it by Trey. Take his advice about using a CharsetDecoder.

Glen 2009-05-07 15:50:30

Answer 4

+1 A:

If Java handles the BOM correctly (I'm not sure whether that has been fixed yet), then it should be possible to pass in a byte array containing just the BOM (U+FEFF, which in UTF-8 is the byte sequence EF BB BF) and get an empty string.

Update:

I tested that method with all values of 1-3 bytes. None of them returned an empty string on Java 1.6. Here is the test code that I used, with different byte array lengths:

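import java.io.UnsupportedEncodingException;
import java.util.Arrays;

// The class name is arbitrary; it only makes the snippet compile on its own.
public class EmptyStringTest {

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] test = new byte[3];            // change the length to test other sizes
        byte[] end = new byte[test.length];   // all zeros: reached again after a full wrap-around

        // Check the initial all-zero array, then every other combination.
        if (impossible(test)) {
            System.out.println(Arrays.toString(test));
        }
        do {
            increment(test, 0);
            if (impossible(test)) {
                System.out.println(Arrays.toString(test));
            }
        } while (!Arrays.equals(test, end));
    }

    // Treats the array as a little-endian counter and adds one,
    // carrying into the next position when a byte wraps around to 0.
    private static void increment(byte[] arr, int i) {
        arr[i]++;
        if (arr[i] == 0 && i + 1 < arr.length) {
            increment(arr, i + 1);
        }
    }

    // Returns true if a non-empty byte array decodes to an empty string.
    public static boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
        if (myBytes.length == 0) {
            return false;
        }
        String string = new String(myBytes, "UTF-8");
        return string.length() == 0;
    }
}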

Esko Luontola 2009-05-07 15:52:07

Unfortunately, Java does not handle the UTF-8 BOM correctly. Doesn't handle it at all, really; just treats it as part of the content.
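A small sketch of what that looks like in practice. The class and method names are hypothetical, and it assumes the String(byte[], Charset) constructor with StandardCharsets (Java 7+). Decoding only the three BOM bytes gives a one-character string holding U+FEFF rather than an empty string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are for illustration only.
class BomCheck {

    // Decodes a byte array that contains nothing but the UTF-8 BOM.
    static String decodeBomOnly() {
        byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        return new String(bom, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeBomOnly();
        // The BOM survives decoding as the character U+FEFF.
        System.out.println("length = " + decoded.length());
        System.out.println("first char is U+FEFF: " + (decoded.charAt(0) == '\uFEFF'));
    }
}
```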

Alan Moore 2009-05-07 22:08:03
