I have a byte stream that may be UTF-8 data or it may be a binary image. I should be able to make an educated guess about which one it is by inspecting the first 100 bytes or so.

However, I haven't figured out exactly how to do this in Java. I've tried doing things like the following:

    new String( bytes, "UTF-8").substring(0,100).matches(".*[^\p{Print}]")

to see if the first 100 chars contain non-printable characters, but that doesn't seem to work.

Is there a better way to do this?

+3  A: 

In well-formed UTF-8, any byte with the top bit set belongs to a run of such bytes: it must be either followed or preceded by another byte that has the top bit set. The first byte of a run must have its two topmost bits set, and each of the remaining bytes must have the next-to-top bit clear. More precisely, the first byte of a run of N top-bit bytes (N is 2, 3, or 4) must have its top N bits set and the next bit clear.

Those characteristics should be easy enough to look for.
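
As a rough illustration of that check, here is a minimal sketch in Java. The class name `Utf8BitPatterns`, the method `looksLikeUtf8`, and the `limit` parameter are made up for this example. It only tests the lead/continuation bit layout described above; it does not reject overlong encodings or surrogate code points, so treat it as a heuristic rather than a full validator.

```java
final class Utf8BitPatterns {

    /**
     * Returns true if the first `limit` bytes follow the UTF-8 bit layout:
     * a lead byte announcing N bytes, followed by N-1 bytes of the form 10xxxxxx.
     */
    static boolean looksLikeUtf8(byte[] bytes, int limit) {
        int n = Math.min(bytes.length, limit);
        int i = 0;
        while (i < n) {
            int b = bytes[i] & 0xFF;
            int continuation;
            if ((b & 0x80) == 0x00) {
                continuation = 0;            // 0xxxxxxx: plain ASCII
            } else if ((b & 0xE0) == 0xC0) {
                continuation = 1;            // 110xxxxx: run of 2
            } else if ((b & 0xF0) == 0xE0) {
                continuation = 2;            // 1110xxxx: run of 3
            } else if ((b & 0xF8) == 0xF0) {
                continuation = 3;            // 11110xxx: run of 4
            } else {
                return false;                // stray continuation byte or invalid lead
            }
            i++;
            for (int k = 0; k < continuation; k++, i++) {
                if (i >= n) {
                    // The run is cut off. That is acceptable only if the
                    // inspection window, not the data itself, ended here.
                    return n < bytes.length;
                }
                if ((bytes[i] & 0xC0) != 0x80) {
                    return false;            // expected 10xxxxxx
                }
            }
        }
        return true;
    }
}
```

Calling `Utf8BitPatterns.looksLikeUtf8(bytes, 100)` would inspect only the first 100 bytes. Note that pure ASCII input always passes, so binary formats that happen to start with ASCII bytes need a longer window or an additional check.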

Steve Gilham
+3  A: 
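    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    final Charset charset = Charset.forName("UTF-8");
    final CharsetDecoder decoder = charset.newDecoder();
    // Report malformed byte sequences as an exception rather than replacing them
    decoder.onMalformedInput(CodingErrorAction.REPORT);

    try {
        // Succeeds only if the whole array is well-formed UTF-8
        final String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
        Log.d( s );
    } catch( CharacterCodingException e ) {
        // Not valid UTF-8: treat as binary data and don't log it
    }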
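
If only the first 100 bytes or so are to be inspected, as in the question, the cut-off can land in the middle of a multi-byte character and a strict whole-buffer decode would then report an error. A possible variation, sketched below, uses the three-argument `CharsetDecoder.decode(in, out, endOfInput)` and passes `endOfInput = false` when the window is shorter than the data, so an incomplete trailing sequence is not counted as malformed. The class name `Utf8PrefixCheck`, the method `prefixLooksLikeUtf8`, and the `limit` parameter are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8PrefixCheck {

    /** Returns true if the first `limit` bytes decode as UTF-8 without errors. */
    static boolean prefixLooksLikeUtf8(byte[] bytes, int limit) {
        int n = Math.min(bytes.length, limit);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        ByteBuffer in = ByteBuffer.wrap(bytes, 0, n);
        // UTF-8 never produces more chars than input bytes, so n chars is enough.
        CharBuffer out = CharBuffer.allocate(n);

        // Only claim "end of input" when the window covers the whole array;
        // otherwise a character split by the window is left undecoded, not flagged.
        boolean endOfInput = (n == bytes.length);
        CoderResult result = decoder.decode(in, out, endOfInput);
        return !result.isError();
    }
}
```

With this sketch, `Utf8PrefixCheck.prefixLooksLikeUtf8(bytes, 100)` returning false would be the signal to skip logging.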
Mike
A: 

I suggest using ICU4J; its CharsetDetector class can guess the encoding of a byte array.

Eric Anderson