tags:

views:

424

answers:

4

I'm trying to debug something and I'm wondering if the following code could ever return true

public boolean impossible(byte[] myBytes) {
  if (myBytes.length == 0)
    return false;
  String string = new String(myBytes, "UTF-8");
  return string.length() == 0;
}

Is there some value I can pass in that will return true? I've fiddled with passing in just the first byte of a 2 byte sequence, but it still produces a single character string.

To clarify, this happened on a PowerPC chip on Java 1.4 code compiled through GCJ to a native binary executable. This basically means that most bets are off. I'm mostly wondering if Java's 'normal' behaviour, or Java's spec made any promises.

A: 

UTF-8 is a variable length encoding scheme, with most "normal" characters being single byte. So any given non-empty byte[] will always translate into a String, I'd have thought.

If you want to play it says, write a unit test which iterates over every possible byte value, passing in a single-value array of that value, and assert that the string is non-empty.

skaffman
+5  A: 

According to the javadoc for java.util.String, the behavior of new String(byte[], "UTF-8") is not specified when the bytearray contains invalid or unexpected data. If you want more predictability in your resultant string use http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CharsetDecoder.html.

Trey
+1  A: 

Possibly.

From the Java 5 API docs "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."

I guess that it depends on : Which version of java you're using Which vendor wrote your JVM (Sun, HP, IBM, the open source one, etc)

Once the docs say "unspecified" all bets are off

Edit: Beaten to it by Trey Take his advice about using a CharsetDecoder

Glen
+1  A: 

If Java handles the BOM mark correctly (which I'm not sure whether they have fixed it yet), then it should be possible to input a byte array with just the BOM (U+FEFF, which is in UTF-8 the byte sequence EF BB BF) and to get an empty string.


Update:

I tested that method with all values of 1-3 bytes. None of them returned an empty string on Java 1.6. Here is the test code that I used with different byte array lenghts:

public static void main(String[] args) throws UnsupportedEncodingException {
    byte[] test = new byte[3];
    byte[] end = new byte[test.length];

    if (impossible(test)) {
        System.out.println(Arrays.toString(test));
    }
    do {
        increment(test, 0);
        if (impossible(test)) {
            System.out.println(Arrays.toString(test));
        }
    } while (!Arrays.equals(test, end));

}

private static void increment(byte[] arr, int i) {
    arr[i]++;
    if (arr[i] == 0 && i + 1 < arr.length) {
        increment(arr, i + 1);
    }
}

public static boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
    if (myBytes.length == 0) {
        return false;
    }
    String string = new String(myBytes, "UTF-8");
    return string.length() == 0;
}
Esko Luontola
Unfortunately, Java does not handle the UTF-8 BOM correctly. Doesn't handle it at all, really; just treats it as part of the content
Alan Moore