We currently have a method that strips out every character that is not an ASCII letter or a space. It is simply:

String clean(String input)
{
   return input==null?"":input.replaceAll("[^a-zA-Z ]","");
}

This really ought to be fixed to support non-English characters (e.g. ś, ũ, ...). Unfortunately, the Java regex classes (e.g. "\W", a non-word character, or "\p{Alpha}", which is US-ASCII only) don't seem to support this. Is there a way of doing this with a Java regex rather than looping manually through each character to test it?

+1  A: 

Java 6's Pattern class handles Unicode; see the Pattern documentation, which says:

Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
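A minimal sketch of what that passage describes: the same character written once as a compiler-level escape and once as an escape left for the regex parser. The class and method names (UnicodeEscapeDemo, bothMatch) are made up for illustration.

```java
import java.util.regex.Pattern;

public class UnicodeEscapeDemo {
    // Returns true if both spellings of the pattern match the whole input.
    public static boolean bothMatch(String input) {
        // One backslash: the Java compiler turns the escape into the character itself.
        Pattern fromCompiler = Pattern.compile("\u2014");
        // Two backslashes: the regex parser receives the six-character escape and decodes it.
        Pattern fromRegexParser = Pattern.compile("\\u2014");
        return fromCompiler.matcher(input).matches()
            && fromRegexParser.matcher(input).matches();
    }

    public static void main(String[] args) {
        // The input is the single character with hexadecimal value 0x2014.
        System.out.println(bothMatch(String.valueOf((char) 0x2014)));
    }
}
```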

Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.
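Applied to the method in the question, that would look roughly like the sketch below: the ASCII range a-zA-Z is swapped for the Unicode letter category \p{L}. The class name UnicodeClean and the precompiled NON_LETTER constant are my own additions, not part of the original code.

```java
import java.util.regex.Pattern;

public class UnicodeClean {
    // Anything that is neither a Unicode letter (category L) nor a plain space.
    private static final Pattern NON_LETTER = Pattern.compile("[^\\p{L} ]");

    public static String clean(String input) {
        return input == null ? "" : NON_LETTER.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Escapes stand in for the accented s and u from the question,
        // so the example does not depend on the source file encoding.
        System.out.println(clean("\u015B\u0169 abc 123!"));
    }
}
```

One caveat: \p{L} covers letters only, so if the input may contain decomposed characters (a base letter followed by a combining accent), the accents would be stripped unless \p{M} is added to the class as well.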

Charlie Martin