I need to identify which character set my input belongs to. The goal is to distinguish between Arabic and English words in mixed input (the input is Unicode, extracted from XML text nodes). I noticed the class Character.UnicodeBlock: is it related to my problem? How can I get it to work?

Edit: The Character.UnicodeBlock approach worked for Arabic, but apparently it doesn't work for English (or other European languages), because the BASIC_LATIN Unicode block covers symbols and non-printing characters as well as letters. So now I am using the matches() method of String with the regex "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer/faster way.

+4  A: 

Yes, you can simply use Character.UnicodeBlock.of(char).
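A minimal sketch of how that call might be used to test a whole word, not necessarily how the asker applied it. The class name ArabicBlockCheck and the method isArabicWord are made up for illustration. It assumes that "Arabic" means the single ARABIC block (U+0600 to U+06FF); that block also contains Arabic punctuation and digits, and the separate presentation-form blocks are not covered here.

```java
final class ArabicBlockCheck {

    /**
     * Returns true if the word is non-empty and every code point lies in the
     * ARABIC Unicode block. Hypothetical helper, not a JDK method.
     */
    static boolean isArabicWord(String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            // UnicodeBlock.of returns null for code points outside any block,
            // so a plain reference comparison is safe here.
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.ARABIC) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // An Arabic word written with Unicode escapes to avoid source-encoding issues.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        System.out.println(isArabicWord(arabic));
        System.out.println(isArabicWord("hello"));
    }
}
```

Iterating by code point rather than by char keeps the loop correct for supplementary characters, although every character in the ARABIC block fits in a single char.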

Dennis Cheung
A: 

You have the opposite problem to this one, but ironically, what doesn't work for him should work great for you: just look for English words (ASCII-only characters) with the regex "\w". Note that "\w" also matches digits and the underscore.
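A small sketch of what the \w class accepts in Java when no flags are set, where it is equivalent to [a-zA-Z_0-9]. The class name WordCharDemo and the method isAsciiWord are hypothetical. The pattern is compiled once and reused, since String.matches() recompiles the regex on every call.

```java
import java.util.regex.Pattern;

final class WordCharDemo {

    // Without UNICODE_CHARACTER_CLASS, \w covers ASCII letters, digits and '_' only.
    private static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    /** Returns true if the whole string consists of \w characters. */
    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAsciiWord("hello"));      // ASCII letters
        System.out.println(isAsciiWord("abc_123"));    // digits and underscore also pass
        System.out.println(isAsciiWord("\u0645\u0631\u062D\u0628\u0627")); // Arabic letters
    }
}
```

If digits and underscores should not count as part of an English word, the explicit class [A-Za-z]+ from the question is the stricter choice.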

Fernando Miguélez
+1  A: 

If [A-Za-z]+ meets your requirement, you aren't going to find anything faster or prettier. However, if you want to match all letters in the Latin1 block (including accented letters and ligatures), you can use this:

Pattern p = Pattern.compile("[\\pL&&\\p{L1}]+");

That's the intersection of the set of all Unicode letters and the set of all Latin1 characters.
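To illustrate the intersection idea, here is a sketch that spells out the Latin-1 range explicitly instead of using \p{L1}. It assumes that \p{L1} stands for the code points U+0000 to U+00FF. The class name Latin1Letters and the method isLatin1Word are invented for this example.

```java
import java.util.regex.Pattern;

final class Latin1Letters {

    // Intersection of "any Unicode letter" with the range U+0000..U+00FF.
    // Assumption: this range is what \p{L1} denotes in the pattern above.
    private static final Pattern LATIN1_WORD =
            Pattern.compile("[\\p{L}&&[\\u0000-\\u00FF]]+");

    /** Returns true if the whole string consists of Latin-1 letters. */
    static boolean isLatin1Word(String s) {
        return LATIN1_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Word("caf\u00E9"));  // accented letter U+00E9
        System.out.println(isLatin1Word("abc1"));       // contains a digit
        System.out.println(isLatin1Word("\u00D7"));     // multiplication sign, not a letter
    }
}
```

Compared with [A-Za-z]+, this accepts accented letters such as U+00E9 while still rejecting digits, symbols, and anything outside Latin-1, including Arabic.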

Alan Moore