views:

5652

answers:

4

How do I truncate a java String so that I know it will fit in a given number of bytes storage once it is UTF-8 encoded?

+5  A: 

UTF-8 encoding has a neat trait that lets you tell where in a byte sequence you are.

Check the byte at the limit you want:

  • If its high bit is 0, it's a single-byte char: just replace it with 0 and you're fine.
  • If its high bit is 1 and so is the next bit (11xxxxxx), you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
  • If the high bit is 1 but the next bit is 0 (10xxxxxx), you're in the middle of a character: walk back along the buffer until you hit a byte with two or more 1s in the high bits (a lead byte), and replace that byte with 0.

Example: If your stream is: 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C1, which is the start of a multi-byte char.
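The walk-back rule can be sketched like this (a minimal sketch; the class and method names are mine, and it assumes the array already holds valid UTF-8). Rather than writing a 0 terminator, it returns the safe length to keep:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {
    // Returns how many bytes of the UTF-8 array to keep so that the cut
    // never lands in the middle of a multi-byte character.
    static int safeLength(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;
        }
        int cut = maxBytes;
        // 10xxxxxx marks a continuation byte: we are mid-character, so
        // walk back to the lead byte and cut there instead.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    public static void main(String[] args) {
        byte[] utf8 = "a\u00e9b".getBytes(StandardCharsets.UTF_8); // 61 C3 A9 62
        System.out.println(safeLength(utf8, 2)); // cutting at 2 would split C3 A9 -> 1
        System.out.println(safeLength(utf8, 3)); // 3
    }
}
```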

Bill James
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8 explains the modified UTF-8 encoding used by Java and demonstrates why this answer is correct.
Alexander
BTW, this solution (the one by @Bill James) is much more efficient than the currently accepted answer by @Matt Quail, because the former requires you to test 3 bytes at most, whereas the latter requires you to test all characters in the text.
Alexander
Alexander: the former requires you to *first convert the string to UTF8*, which requires iterating over all the characters in the text.
Matt Quail
True, but the question does state "Once it is UTF-8 encoded". Presumably that price has been paid.
Bill James
A: 

You should use [CharsetEncoder][1]; the simple getBytes()-plus-copy-as-many-as-you-can approach can cut UTF-8 characters in half.

[1]: http://java.sun.com/javase/6/docs/api/java/nio/charset/CharsetEncoder.html#encode(java.nio.CharBuffer, java.nio.ByteBuffer, boolean)

Something like this:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public static int truncateUtf8(String input, byte[] output) {

    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    // The encoder stops at the buffer boundary rather than emitting a
    // partial multi-byte sequence.
    Charset utf8 = Charset.forName("UTF-8");
    utf8.newEncoder().encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of "
            + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
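For illustration, here is a self-contained run of the same technique (the method is repeated so the snippet compiles on its own; the class name is mine):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class EncoderDemo {
    // Same approach as above: the encoder stops at the buffer boundary
    // instead of splitting a multi-byte character.
    static int truncateUtf8(String input, byte[] output) {
        ByteBuffer outBuf = ByteBuffer.wrap(output);
        CharBuffer inBuf = CharBuffer.wrap(input);
        Charset.forName("UTF-8").newEncoder().encode(inBuf, outBuf, true);
        return outBuf.position();
    }

    public static void main(String[] args) {
        // "a\u00e9b" is 4 bytes in UTF-8: 61 C3 A9 62
        System.out.println(truncateUtf8("a\u00e9b", new byte[4])); // 4
        // only "a" fits in 2 bytes: writing \u00e9 would need both remaining bytes plus one
        System.out.println(truncateUtf8("a\u00e9b", new byte[2])); // 1
    }
}
```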
mitchnull
+6  A: 

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence rather than two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.

I haven't done a lot of testing on that code, but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Updated: modified the code example; it now handles surrogate pairs.

Matt Quail
UTF-8 can encode any UCS2 character in 3 bytes or less. Check that page you reference. However, if you want to comply with UCS4 or UTF16 (which can both reference the entire charset), you'll need to allow for up to 6-byte characters in UTF8.
Bill James
Bill: see the CESU-8 discussion on the wikipedia page. My understanding is UTF-8 is supposed to encode surrogate pairs as a single 4-byte sequence, not two 3-byte sequences.
Matt Quail
It's not 2 three-byte, it's up to 1 6-byte sequence to store UCS4, which is a full 31-bit character, not 2 16-bit "pairs" (that's UTF16). A 6-byte seq = 1111110C 10CCCCCC 10CCCCCC 10CCCCCC 10CCCCCC 10CCCCCC where the C's are data bits. Right now, only enough chars are in use to need 4 bytes.
Bill James
But 8 years ago, more than 16-bits wasn't even necessary. Expect to see 5-byte chars in the next decade as more dialects and "Klingon"-type language planes are added.
Bill James
Bill: you are correct, my code does not handle code points above U+10FFFF -- which is where more than 4 UTF-8 bytes are required. But Java can't encode characters past U+10FFFF anyway. Each `char` in Java is a 16-bit code unit between U+0000 and U+FFFF; surrogate pairs give you up to U+10FFFF.
Matt Quail
Well, then, it would seem my solution is in excess. Didn't know that about Java's character (my I18n work was done for EQ in C++). Nice chat. :)
Bill James
+2  A: 

You can calculate the number of bytes without doing any conversion.

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff  // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else  // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream

You would have to detect surrogate pairs (U+D800–U+DBFF followed by U+DC00–U+DFFF) and count 4 bytes for each valid pair. If the first value is in the first range and the second is in the second range, all is well: skip both and add 4. But if not, it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
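One possible Java rendering of the pseudocode, with the surrogate-pair check folded in (class and method names are mine; an unpaired surrogate is counted as 3 bytes to match the pseudocode, though Java's actual encoder may substitute a replacement character instead):

```java
public class Utf8Length {
    // Counts UTF-8 bytes without doing any conversion.
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7f) {
                count += 1;
            } else if (c <= 0x7ff) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid surrogate pair -> one 4-byte sequence
                i++;        // skip the low surrogate
            } else {
                count += 3; // other BMP char, or an unpaired surrogate
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("abcd"));         // 4
        System.out.println(utf8Length("a\u0800b"));     // 5
        System.out.println(utf8Length("\uD834\uDD1E")); // 4
    }
}
```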