ansaurus

Question

What's a good heuristic for checking whether an array of bytes is encoded as UTF-8 in Java?

Answer 1

+3 A:

In well-formed UTF-8, a byte with the top bit set never appears alone: it is always followed or preceded by another byte with the top bit set. The first byte of such a run has its two topmost bits set, and every later byte in the run has the top bit set and the next-to-top bit clear. More precisely, the first byte of a run of N top-bit bytes has its top N bits set and the next bit clear.

Those characteristics should be easy enough to look for.
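As a rough sketch of the bit-pattern check described above: the class and method names below are hypothetical, and the scan is purely structural. It assumes runs of 2 to 4 bytes and does not reject overlong encodings, surrogate code points, or values above U+10FFFF, so treat it as a heuristic rather than a full validator.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the lead-byte / continuation-byte rules.
final class Utf8Heuristic {
    private Utf8Heuristic() {}

    /** Returns true if every top-bit byte belongs to a structurally valid run. */
    static boolean looksLikeUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            int continuation; // number of 10xxxxxx bytes expected after the lead byte
            if (b < 0x80) {
                continuation = 0;            // 0xxxxxxx: plain ASCII
            } else if ((b & 0xE0) == 0xC0) {
                continuation = 1;            // 110xxxxx: run of 2
            } else if ((b & 0xF0) == 0xE0) {
                continuation = 2;            // 1110xxxx: run of 3
            } else if ((b & 0xF8) == 0xF0) {
                continuation = 3;            // 11110xxx: run of 4
            } else {
                return false;                // stray continuation byte or invalid lead byte
            }
            if (i + continuation >= bytes.length) {
                return false;                // run is cut off at the end of the array
            }
            for (int k = 1; k <= continuation; k++) {
                if ((bytes[i + k] & 0xC0) != 0x80) {
                    return false;            // expected 10xxxxxx
                }
            }
            i += continuation + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));   // prints "true"
        System.out.println(looksLikeUtf8(latin1)); // prints "false"
    }
}
```

Pure ASCII input also passes this check, so a positive result only means the bytes are compatible with UTF-8, not that they were necessarily written as UTF-8.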

Steve Gilham 2009-08-20 23:12:34

Answer 2

+3 A:

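    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    // A strict decoder: fail on invalid input instead of substituting U+FFFD.
    final Charset charset = Charset.forName("UTF-8");
    final CharsetDecoder decoder = charset.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

    try {
        final String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
        Log.d( s ); // decoded cleanly, so treat it as text
    } catch( CharacterCodingException e ) {
        // not valid UTF-8: don't log binary data
    }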
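To reuse this check outside of logging, the decode attempt can be wrapped in a boolean helper. This is only a sketch: the class name `Utf8DecodeCheck` and the method name `isValidUtf8` are hypothetical, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the strict-decoder approach.
final class Utf8DecodeCheck {
    private Utf8DecodeCheck() {}

    /** Returns true if the bytes decode as UTF-8 without any reported error. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder per call: CharsetDecoder instances are not thread-safe.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x01};
        System.out.println(isValidUtf8(text));   // prints "true"
        System.out.println(isValidUtf8(binary)); // prints "false"
    }
}
```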

Mike 2009-08-20 23:40:09

Answer 3

A:

I suggest using ICU4J; its CharsetDetector class (in com.ibm.icu.text) can guess the encoding of a byte array.

Eric Anderson 2009-08-21 00:07:47
