ansaurus

Question

Answer 1

+1 A:

Why not convert to bytes and walk forward, obeying UTF-8 character boundaries as you go, until you've reached the maximum number of bytes, then convert those bytes back into a string?
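
A minimal sketch of that first idea, for illustration only: the class and method names (Utf8ByteTruncate, truncate) are made up here, and it assumes the input has no unpaired surrogates (Java encodes those as '?', so the decoded result would differ from the original text). It walks the encoded bytes one character at a time and stops before the first character that would exceed the limit.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteTruncate {
    // Hypothetical helper: keep as many whole characters as fit in maxBytes of UTF-8.
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int end = 0; // always points at a lead byte (or the end of the array)
        while (end < utf8.length) {
            int b = utf8[end] & 0xFF;
            // The lead byte tells us how many bytes the character occupies.
            int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            if (end + len > maxBytes) break; // this character would not fit
            end += len;
        }
        // Decode only the bytes that made the cut.
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "h" is 1 byte and "\u00e9" is 2 bytes, so only "h" fits in 2 bytes.
        System.out.println(truncate("h\u00e9llo", 2)); // prints "h"
    }
}
```

The extra decode step is what the alternative below avoids by cutting the original string directly.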

Or you could just cut the original string if you keep track of where the cut should occur:

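// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
import java.nio.charset.StandardCharsets;

public class UTF8Cutter {
  public static String cut(String s, int n) {
    // Ask for UTF-8 explicitly; a bare getBytes() uses the platform default charset.
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    if (utf8.length < n) n = utf8.length;
    int n16 = 0;  // number of UTF-16 chars to keep
    int i = 0;    // index of the next lead byte
    while (i < n) {
      int advance = 1;  // UTF-16 chars used by this code point
      if ((utf8[i] & 0x80) == 0) i += 1;          // 0xxxxxxx: 1 byte
      else if ((utf8[i] & 0xE0) == 0xC0) i += 2;  // 110xxxxx: 2 bytes
      else if ((utf8[i] & 0xF0) == 0xE0) i += 3;  // 1110xxxx: 3 bytes
      else { i += 4; advance = 2; }               // 11110xxx: 4 bytes, a surrogate pair
      if (i <= n) n16 += advance;  // count the code point only if all of it fits
    }
    return s.substring(0, n16);
  }
}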

Rex Kerr 2010-08-26 15:46:51

I definitely could do that. Is there any reason why using String.substring is any worse? It seems like doing it the way you describe would have to account for all the code points, which isn't a whole lot of fun. (depending on your definition of fun :) ).

stevebot 2010-08-26 16:04:53

@stevebot - To be efficient, you need to take advantage of the known structure of the data. If you don't care about efficiency and want it to be easy, or you want to support every possible Java encoding without having to know what it is, your method seems reasonable enough.

Rex Kerr 2010-08-26 16:22:44
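
For the "any encoding" case mentioned in the comment above, one possible sketch (not necessarily what either commenter had in mind) is to let a CharsetEncoder write into a buffer of exactly the allowed size and see how many chars it consumed. The class and method names are made up for illustration. It assumes the encoder leaves the input position at the first character that did not fit, which holds for encoders such as UTF-8 and ISO-8859-1 but should be verified for stateful encodings or ones that emit a byte-order mark.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetTruncate {
    // Hypothetical helper: keep the longest prefix of s that encodes to at most maxBytes in cs.
    static String truncate(String s, Charset cs, int maxBytes) {
        if (maxBytes <= 0) return "";
        CharsetEncoder enc = cs.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // capacity is the byte budget
        CharBuffer in = CharBuffer.wrap(s);
        // Encoding stops with an overflow result once the next character will not fit.
        enc.encode(in, out, true);
        // in.position() is the number of chars that were fully encoded.
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncate("h\u00e9llo", StandardCharsets.UTF_8, 2));      // prints "h"
        System.out.println(truncate("h\u00e9llo", StandardCharsets.ISO_8859_1, 2)); // prints "hé"
    }
}
```

This trades the hand-written bit tests for a single pass through the encoder, at the cost of allocating a buffer of maxBytes.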

Answer 2

A:

You could convert the string to bytes one character at a time and keep only the characters whose bytes fit within the limit.

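// Requires: import java.nio.charset.StandardCharsets;
public static String substring(String text, int maxBytes) {
   StringBuilder ret = new StringBuilder();
   for (int i = 0; i < text.length(); ) {
       // Step by code point so a surrogate pair is measured and kept as one unit.
       int end = i + Character.charCount(text.codePointAt(i));
       // Work out how many bytes this code point takes in UTF-8
       // and subtract them from the total allowed.
       maxBytes -= text.substring(i, end).getBytes(StandardCharsets.UTF_8).length;
       if (maxBytes < 0) break;
       ret.append(text, i, end);
       i = end;
   }
   return ret.toString();
}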

Peter Lawrey 2010-08-27 21:51:52

Truncating Strings in Java by Bytes
