I need to identify which script (character set) each word of my input belongs to.
The goal is to distinguish between Arabic and English words in mixed input (the input is Unicode, extracted from XML text nodes).
I have noticed the class Character.UnicodeBlock: is it related to my problem? How can I get it to work?
Edit:
The Character.UnicodeBlock approach was useful for Arabic, but apparently doesn't work for English (or other European languages), because the BASIC_LATIN Unicode block covers symbols and non-printing characters as well as letters.
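To illustrate, here is a minimal sketch of the block-based check, assuming the words have already been split out of the text node. The class and method names (BlockCheck, isArabic) are hypothetical, and the Arabic sample word is written with Unicode escapes so the source compiles regardless of file encoding.

```java
class BlockCheck {
    // True if the word is non-empty and every code point lies in the
    // ARABIC block (U+0600..U+06FF).
    static boolean isArabic(String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.ARABIC) {
                return false;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return true;
    }

    public static void main(String[] args) {
        String arabicWord = "\u0645\u0631\u062D\u0628\u0627"; // an Arabic word
        System.out.println(isArabic(arabicWord)); // true
        System.out.println(isArabic("hello"));    // false

        // The problem with BASIC_LATIN: punctuation is in the same block as letters.
        System.out.println(
            Character.UnicodeBlock.of('!') == Character.UnicodeBlock.BASIC_LATIN); // true
    }
}
```

The last line shows why the same test does not isolate English letters: '!' and 'a' report the same block.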
So now I am using the matches() method of String with the regex "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer or faster way.
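For reference, the regex check looks roughly like the sketch below. It is a variant of what I described rather than my exact code: the names RegexCheck and isEnglish are hypothetical, and it uses a precompiled java.util.regex.Pattern, since String.matches() compiles the pattern again on every call.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // Compiled once and reused; matches one or more unaccented ASCII letters.
    private static final Pattern ASCII_LETTERS = Pattern.compile("[A-Za-z]+");

    // True only if the whole word consists of the letters A-Z or a-z.
    static boolean isEnglish(String word) {
        return ASCII_LETTERS.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEnglish("hello"));     // true
        System.out.println(isEnglish("hello!"));    // false: punctuation
        System.out.println(isEnglish("caf\u00E9")); // false: accented letter
    }
}
```

Note that this rejects accented letters, so it only covers plain English and not other European languages.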