I have the following Japanese string: "　ユーザー名". The first character looks like whitespace, but its Unicode code point is 12288 (U+3000), so "　ユーザー名".trim() returns the same string (trim has no effect). If I trim in C++ it works fine. Does anyone know how to solve this in Java? Is there a special trim method for Unicode?
The Java docs explain why this doesn't work:
If this String object represents an empty character sequence, or the first and last characters of character sequence represented by this String object both have codes greater than '\u0020' (the space character), then a reference to this String object is returned.
You could roll your own version easily enough. Perhaps the method codePointAt could be used for this purpose.
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html
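To make the quoted rule concrete, here is a small sketch that checks the first code point with codePointAt. The class name and the survivesTrim helper are made up for illustration; they only mirror the documented "greater than '\u0020'" test and are not part of the JDK.

```java
public class TrimRuleDemo {
    // Hypothetical helper: true when the first character has a code above
    // U+0020, which is the case where trim() leaves the start of the string alone.
    static boolean survivesTrim(String s) {
        return !s.isEmpty() && s.codePointAt(0) > '\u0020';
    }

    public static void main(String[] args) {
        String ideographic = "\u3000abc"; // starts with U+3000 IDEOGRAPHIC SPACE
        String ascii = " abc";            // starts with an ordinary ASCII space

        System.out.println(ideographic.codePointAt(0));   // 12288
        System.out.println(survivesTrim(ideographic));    // true
        System.out.println(survivesTrim(ascii));          // false
        System.out.println(ideographic.trim().length());  // 4, nothing removed
        System.out.println(ascii.trim().length());        // 3, leading space removed
    }
}
```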
You'll have to write your own trim() method based on Character.isWhitespace(). Unfortunately, trim() is narrower than the word "whitespace" in its API doc suggests: it strips only characters with codes up to '\u0020' (the ASCII space and control characters), not any other kind of Unicode whitespace.
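A minimal sketch of such a method might look like the following. The class and method names are my own, and it assumes that "whitespace" means whatever Character.isWhitespace(int) accepts; note that this deliberately excludes the non-breaking spaces (U+00A0, U+2007, U+202F).

```java
public class UnicodeTrim {
    // Strips leading and trailing code points for which
    // Character.isWhitespace(int) returns true; interior whitespace is kept.
    static String trimWhitespace(String s) {
        int start = 0;
        int end = s.length();

        // Walk forward over leading whitespace, one code point at a time.
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }

        // Walk backward over trailing whitespace.
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }

        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes:
        // U+3000 followed by the five characters of the user-name label.
        String input = "\u3000\u30E6\u30FC\u30B6\u30FC\u540D";
        System.out.println(input.length());                  // 6
        System.out.println(trimWhitespace(input).length());  // 5
    }
}
```

Working in code points rather than chars is not strictly needed for U+3000, but it keeps the loop correct if a supplementary character ever sits at either end of the string.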
Try the Apache Commons' StringUtils class. The StringUtils.strip() method should work for you.
Have a look at Unicode Normalization and the Normalizer class. The class is new in Java 6, but you'll find an equivalent version in the ICU4J library if you're on an earlier JRE.
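import java.text.Normalizer;

// 12288 is U+3000, the ideographic space
int character = 12288;
char[] ch = Character.toChars(character);
String input = new String(ch);
// NFKC compatibility normalization maps U+3000 to the ASCII space U+0020
String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
System.out.println("Hex value:\t" + Integer.toHexString(character));
// trim() leaves the ideographic space in place, so this length is 1
System.out.println("Trimmed length :\t" + input.trim().length());
// After normalization trim() removes the space, so this length is 0
System.out.println("Normalized trimmed length:\t" + normalized.trim().length());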
As an alternative to the StringUtils class mentioned by Mike, you can also use a Unicode-aware regular expression, using only Java's own libraries:
" ユーザー名".replaceAll("\\p{Z}", "")
Or, to really only trim, and not remove whitespace inside the string:
" ユーザ ー名 ".replaceAll("(^\\p{Z}+|\\p{Z}+$)", "")