The JDK's String.trim() method is pretty naive: it only removes characters with a code point of U+0020 or lower, i.e. the ASCII space and the ASCII control characters.

Apache Commons' StringUtils.strip() is slightly better, but uses the JDK's Character.isWhitespace(), which doesn't recognize non-breaking space as whitespace.
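To make the problem concrete, here is a minimal check using only the JDK. The class name TrimDemo and the sample string are made up for illustration; the calls themselves are standard java.lang methods.

```java
public class TrimDemo {
    // Sample padded with NO-BREAK SPACE (U+00A0) on both ends; invented for this demo.
    static final String SAMPLE = "\u00a0 hello \u00a0";

    public static void main(String[] args) {
        // trim() only strips chars at or below U+0020, so the outer NBSPs block it.
        System.out.println(SAMPLE.trim().equals(SAMPLE));      // true: nothing was removed
        // isWhitespace() explicitly excludes the non-breaking spaces...
        System.out.println(Character.isWhitespace('\u00a0'));  // false
        // ...while isSpaceChar() follows the Unicode space separator category.
        System.out.println(Character.isSpaceChar('\u00a0'));   // true
    }
}
```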

So what would be the most complete, Unicode-compatible, safe and proper way to trim a string in Java?

And incidentally, is there a better library than commons-lang that I should be using for this sort of stuff?

+1  A: 

I've always found trim to work pretty well for almost every scenario.

However, if you really want to include more characters, you can edit the strip method from commons-lang to include not only the test for Character.isWhitespace, but also for Character.isSpaceChar which seems to be what's missing. Namely, the following lines at stripStart and stripEnd, respectively:

  • while ((start != strLen) && Character.isWhitespace(str.charAt(start)))
  • while ((end != 0) && Character.isWhitespace(str.charAt(end - 1)))
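
A rough sketch of that idea as a standalone method, rather than a patch to commons-lang. The class and method names (UnicodeStrip, isBlank, strip) are my own and not part of any library; only Character.isWhitespace and Character.isSpaceChar come from the JDK.

```java
public class UnicodeStrip {
    // Treat a char as blank if either JDK test accepts it.
    static boolean isBlank(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Same loop shape as stripStart/stripEnd, with the combined test swapped in.
    public static String strip(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        int start = 0;
        int end = str.length();
        while (start != end && isBlank(str.charAt(start))) {
            start++;
        }
        while (end != start && isBlank(str.charAt(end - 1))) {
            end--;
        }
        return str.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("\u00a0\t testing \u00a0") + "]"); // prints "[testing]"
    }
}
```

Working on chars rather than code points should be enough here, assuming all the space characters you care about are in the Basic Multilingual Plane.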
JG
+9  A: 

Google has made guava-libraries available recently. It may have what you are looking for:

CharMatcher.inRange('\0', ' ').trimFrom(str)

is equivalent to String.trim(), but you can customize what to trim; refer to the JavaDoc for details.

For instance, it has its own definition of WHITESPACE which differs from the JDK and is defined according to the latest Unicode standard, so what you need can be written as:

CharMatcher.WHITESPACE.trimFrom(str)
CrazyCoder
Upvoted for making me feel like a jerk
itsadok
Thanks for the pointer to Guava. I'd missed that.
CPerkins
Any idea when there will be an actual release of Guava? Right now you can only check out their latest trunk code (it seems) and the project page says "the libraries are still subject to change". Looks very promising, but not necessarily something you'd want all your production code to depend on.
Jonik
+1  A: 

I swear I only saw this after I posted the question: Google just released Guava, a library of core Java utilities.

I haven't tried this yet, but from what I can tell, this is fully Unicode compliant:

String s = "  \t testing \u00a0";
s = CharMatcher.WHITESPACE.trimFrom(s);
itsadok
Haha, I've provided the same answer just 5 minutes earlier, but then edited it to include the exact code you need to use, and just then saw your comment that you found it yourself.
CrazyCoder
+1  A: 

It's really hard to define what constitutes white space. Sometimes I use non-breaking spaces precisely to make sure they don't get stripped. So it will be hard to find a library that does exactly what you want.

I use my own trim() if I want to trim every kind of white space. Here is the function I use to check for white space:

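  public static boolean isWhitespace (int ch)
  {
    // U+0020 SPACE, or U+0009..U+000D (TAB, LF, VT, FF, CR)
    if (ch == ' ' || (ch >= 0x9 && ch <= 0xD))
      return true;
    if (ch < 0x85) // short-circuit optimization.
      return false;
    // NEXT LINE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL SEPARATOR
    if (ch == 0x85 || ch == 0xA0 || ch == 0x1680 || ch == 0x180E)
      return true;
    if (ch < 0x2000 || ch > 0x3000)
      return false;
    // U+2000..U+200A (EN QUAD through HAIR SPACE), LINE SEPARATOR,
    // PARAGRAPH SEPARATOR, NARROW NO-BREAK SPACE, MEDIUM MATHEMATICAL SPACE,
    // IDEOGRAPHIC SPACE
    return ch <= 0x200A || ch == 0x2028 || ch == 0x2029
      || ch == 0x202F || ch == 0x205F || ch == 0x3000;
  }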
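
For completeness, a trim() built on a check like that could look roughly like the sketch below. This is not the poster's actual trim(); PredicateTrim and its trim method are hypothetical names, and the predicate in main is just a JDK-based stand-in so the example compiles on its own. Any int-based test, such as the isWhitespace function above, could be passed instead.

```java
import java.util.function.IntPredicate;

public class PredicateTrim {
    // Strips leading and trailing code points for which isBlank returns true.
    public static String trim(String s, IntPredicate isBlank) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isBlank.test(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isBlank.test(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Stand-in predicate built from the JDK's own tests (an assumption for this demo).
        IntPredicate blank = cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp);
        System.out.println("[" + trim("\u3000 testing\u00a0", blank) + "]"); // prints "[testing]"
    }
}
```

Passing the test in as a parameter keeps the "what counts as white space" decision separate from the trimming loop, which seems to be the real point of disagreement in this thread.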
ZZ Coder
Please name those magic constants..
Thorbjørn Ravn Andersen
ZZ Coder -- you say, "it will be hard to find a library to do exactly what you want." Not true! Say you want to match all whitespace _except_ a \u00a0 (non-breaking space). Easy: CharMatcher.WHITESPACE.and(CharMatcher.isNot('\u00a0')).trimFrom(input)
Kevin Bourrillion