Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?

The array may be generated by code similar to:

byte[] utf8 = "Hello World".getBytes("UTF-8");

Alternatively it may have been generated by code similar to:

byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
    messageContent[i] = (byte) i;
}

The key point is that we don't know what the array contains but need to find out in order to fill in the following function:

public final String getString(final byte[] dataToProcess) {
    // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
    // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
    // If dataToProcess contains an encoded string then we will decode it and return.
}

How would this be extended to also cover UTF-16 or other encoding mechanisms?

+1  A: 

Try decoding it. If you do not get any errors, then it is a valid UTF-8 string.
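
A minimal sketch of that approach, matching the signature from the question and assuming java.util.Base64 (Java 8+) is acceptable for the binary branch. Note that plain new String(bytes, "UTF-8") never reports an error (malformed bytes are silently replaced with the replacement character), so the strict decode has to go through a CharsetDecoder configured to REPORT:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Base64;

public final String getString(final byte[] dataToProcess) {
    try {
        // Strict decode: malformed or unmappable input raises CharacterCodingException
        // instead of being quietly replaced as new String(byte[], "UTF-8") would do.
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(dataToProcess))
                .toString();
    } catch (CharacterCodingException e) {
        // Decoding failed, so treat the array as arbitrary binary data.
        return Base64.getEncoder().encodeToString(dataToProcess);
    }
}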

Thorbjørn Ravn Andersen
-1: Factual error. It is possible for a non-textual binary stream to be decoded as a valid UTF-8 string. If the UTF-8 decoding fails, that implies that your binary data is not UTF-8; but if the UTF-8 decoding _doesn't_ fail, that does not _guarantee_ that the binary data _is_ UTF-8.
Daniel Fortunov
+1 Absolutely correct. If it decodes without error, it's valid UTF-8 textual data. It may be textual data that makes absolutely no sense, such as a wild mix of Latin, Chinese, Thai and Greek characters, but that's a semantic distinction, not a technical one.
Michael Borgwardt
Fair point Michael. I guess in that case I should have said: -1 Not answering the question. Asserting that it is a valid UTF-8 string is not answering the question, which was trying to find out if it was a string or binary data. Just because it is a valid UTF-8 representation doesn't tell you much about whether the original data is binary (which just happens to be valid UTF-8 by coincidence) or whether the original was genuine textual data.
Daniel Fortunov
+5  A: 

It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.

If your array is large enough, this should work well, since such invalid sequences are very likely to appear in "random" binary data such as compressed data or image files.

However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
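
One possible shape for that closer analysis, sketched under the assumption of Java 7+ (for Character.UnicodeScript) and with an arbitrary threshold that would need tuning:

import java.util.EnumSet;
import java.util.Set;

// Rough heuristic: after a successful UTF-8 decode, count how many distinct
// scripts the letters belong to. A wild mix of unrelated scripts in a short
// string is a hint (not proof) that the bytes were not really text, and, as
// noted above, genuinely mixed-script text will produce false negatives.
static boolean looksLikeFewScripts(String decoded) {
    Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
    for (int i = 0; i < decoded.length(); ) {
        int cp = decoded.codePointAt(i);
        if (Character.isLetter(cp)) {
            scripts.add(Character.UnicodeScript.of(cp));
        }
        i += Character.charCount(cp);
    }
    return scripts.size() <= 2;   // arbitrary threshold
}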

Michael Borgwardt
A: 

If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.

If you do not have a BOM in your byte data, this will be substantially more difficult. The .NET classes can perform statistical analysis to try to work out the encoding, but I think that assumes you already know you are dealing with text data and just don't know which encoding was used.

If you have any control over the format of your input data, your best choice would be to ensure that it contains a Byte Order Mark.
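
For illustration, a sketch of checking for the standard BOM signatures by hand (the byte values are the published BOM signatures; the method name is made up):

// Returns the charset name implied by a leading byte order mark, or null if there is none.
// The UTF-32 checks must run before the UTF-16 ones, because a UTF-32LE BOM
// starts with the same two bytes as a UTF-16LE BOM.
static String charsetFromBom(byte[] data) {
    if (data.length >= 3 && data[0] == (byte) 0xEF && data[1] == (byte) 0xBB && data[2] == (byte) 0xBF) {
        return "UTF-8";
    }
    if (data.length >= 4 && data[0] == 0x00 && data[1] == 0x00
            && data[2] == (byte) 0xFE && data[3] == (byte) 0xFF) {
        return "UTF-32BE";
    }
    if (data.length >= 4 && data[0] == (byte) 0xFF && data[1] == (byte) 0xFE
            && data[2] == 0x00 && data[3] == 0x00) {
        return "UTF-32LE";
    }
    if (data.length >= 2 && data[0] == (byte) 0xFE && data[1] == (byte) 0xFF) {
        return "UTF-16BE";
    }
    if (data.length >= 2 && data[0] == (byte) 0xFF && data[1] == (byte) 0xFE) {
        return "UTF-16LE";
    }
    return null; // no BOM; the data could still be any encoding, including BOM-less UTF-8
}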

Daniel Fortunov
Java does not insert a BOM automatically and will not remove it on decode.
McDowell
Erk, I should say that Java does not handle BOMs for UTF-8. Whether it does or not for UTF-16/UTF-32 depends on the encoding mechanism chosen: http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
McDowell
+2  A: 

The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.

A Java String is a sequence of 16-bit quantities, each corresponding to one of the (almost) 2**16 code points of the Unicode Basic Multilingual Plane. But if you look at those 16-bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about them that says what they represent.

Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)

The best we can do here is the following:

  1. Test if the bytes are a valid UTF-8 encoding.
  2. Test if the decoded values are all legal, "assigned" Unicode code points. (Some values are illegal or reserved, e.g. 0xFFFF, and others are not currently assigned to any character.) But what if a text document really uses an unassigned code point? (A sketch of this check follows the list.)
  3. Test if the Unicode code points belong to the "blocks" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if the document uses multiple languages?
  4. Test if the sequences of code points look like words, sentences, or whatever. But what if some "binary data" happens to include embedded text sequences?
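
As a sketch of test 2 (referenced above), Character.isDefined can be applied to the decoded code points; note that it reflects the Unicode version bundled with the running JDK, which may lag behind the standard:

// Sketch of test 2: are all decoded code points assigned characters?
// Assumes the byte[] has already passed a strict UTF-8 decode (test 1).
static boolean allCodePointsAssigned(String decoded) {
    for (int i = 0; i < decoded.length(); ) {
        int cp = decoded.codePointAt(i);
        if (!Character.isDefined(cp)) {
            return false; // unassigned (or otherwise undefined) code point
        }
        i += Character.charCount(cp);
    }
    return true;
}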

In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.

IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.

Stephen C
+1  A: 

Here's a way to use the UTF-8 "binary" regex from the W3C site

static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException 
{
  Pattern p = Pattern.compile("\\A(\n" +
    "  [\\x09\\x0A\\x0D\\x20-\\x7E]             # ASCII\\n" +
    "| [\\xC2-\\xDF][\\x80-\\xBF]               # non-overlong 2-byte\n" +
    "|  \\xE0[\\xA0-\\xBF][\\x80-\\xBF]         # excluding overlongs\n" +
    "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}  # straight 3-byte\n" +
    "|  \\xED[\\x80-\\x9F][\\x80-\\xBF]         # excluding surrogates\n" +
    "|  \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}      # planes 1-3\n" +
    "| [\\xF1-\\xF3][\\x80-\\xBF]{3}            # planes 4-15\n" +
    "|  \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}      # plane 16\n" +
    ")*\\z", Pattern.COMMENTS);

  String phonyString = new String(utf8, "ISO-8859-1");
  return p.matcher(phonyString).matches();
}

As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.

As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.

And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.

UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
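
For what it's worth, a sketch of that surrogate-pairing check, assuming big-endian UTF-16 without a BOM (endianness and BOM handling would have to be dealt with separately):

// Checks only the surrogate-pairing rule described above, treating the bytes as UTF-16BE.
// An odd length or an unpaired surrogate rules UTF-16BE out; passing the check proves
// very little, since almost any pair of bytes forms a "valid" UTF-16 code unit.
static boolean couldBeUtf16BE(byte[] data) {
    if (data.length % 2 != 0) {
        return false;
    }
    for (int i = 0; i < data.length; i += 2) {
        char c = (char) (((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF));
        if (Character.isHighSurrogate(c)) {
            if (i + 3 >= data.length) {
                return false; // high surrogate at the very end of the data
            }
            char next = (char) (((data[i + 2] & 0xFF) << 8) | (data[i + 3] & 0xFF));
            if (!Character.isLowSurrogate(next)) {
                return false; // high surrogate not followed by a low surrogate
            }
            i += 2; // skip the low surrogate that completes the pair
        } else if (Character.isLowSurrogate(c)) {
            return false; // low surrogate with no preceding high surrogate
        }
    }
    return true;
}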

Alan Moore