ansaurus

Question

How to classify Japanese characters as either Kanji or Kana

Answer 1

+4 A:

You need a reference that gives the separate ranges for kana and kanji characters. From what I've seen, alphabets and their equivalents typically each get a contiguous block of characters.

mP 2010-09-30 00:42:06

Well, in Unicode, kanji (the CJK Unified Ideographs block) run from U+4E00 to U+9FFF, katakana from U+30A0 to U+30FF, and hiragana from U+3040 to U+309F. With that, 'splitting' the text should be easy, depending on what splitting actually means.

Crag 2010-09-30 00:45:47

Answer 2

+6 A:

Use a table like this one to determine which Unicode values are used for kana and kanji. Then you can simply cast the character to an int and check where it belongs, something like:

int val = (int) 'て';
// U+3040..U+309F is the hiragana block; katakana is U+30A0..U+30FF
if (val >= 0x3040 && val <= 0x309f)
  return HIRAGANA;
..
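
For completeness, the fragment above could be fleshed out roughly as follows. This is only a sketch: the class, enum, and method names are made up for illustration, the kana ranges are the ones quoted in this thread, and the kanji range is assumed to be the whole CJK Unified Ideographs block (U+4E00 to U+9FFF), which ignores the extension blocks.

```java
// Illustrative sketch; JapaneseRanges, Kind and classify are hypothetical names.
final class JapaneseRanges {
    enum Kind { HIRAGANA, KATAKANA, KANJI, OTHER }

    // Classifies a single UTF-16 char by comparing it against fixed ranges.
    static Kind classify(char c) {
        if (c >= 0x3040 && c <= 0x309F) return Kind.HIRAGANA;  // hiragana block
        if (c >= 0x30A0 && c <= 0x30FF) return Kind.KATAKANA;  // katakana block
        // Assumption: treat the whole CJK Unified Ideographs block as kanji.
        if (c >= 0x4E00 && c <= 0x9FFF) return Kind.KANJI;
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        // Print the category of every character in a mixed string.
        for (char c : "誰か確認上記これらのフ".toCharArray()) {
            System.out.println(c + " -> " + classify(c));
        }
    }
}
```

Returning an enum rather than a bare constant keeps the "none of the above" case explicit, which matters for punctuation, Latin letters, and digits mixed into Japanese text.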

Jack 2010-09-30 00:48:02

+1 - Nice table!

BrunoLM 2010-09-30 01:54:36

Note that jleedev has essentially the same method, but using a table provided by the JVM.

MSalters 2010-09-30 11:40:25

Answer 3

+2 A:

This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Jack's answer, I created this:

public class JapaneseCharMatchers {
  public static final CharMatcher HIRAGANA = 
      CharMatcher.inRange((char) 0x3040, (char) 0x309f);

  public static final CharMatcher KATAKANA = 
      CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);

  public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);

  public static final CharMatcher KANJI = 
      CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);

  public static void main(String[] args) {
    test("誰か確認上記これらのフ");
  }

  private static void test(String string) {
    System.out.println(string);
    System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
    System.out.println("Katakana: " + KATAKANA.retainFrom(string));
    System.out.println("Kana: " + KANA.retainFrom(string));
    System.out.println("Kanji: " + KANJI.retainFrom(string));
  }
}

Running this prints the expected output:

誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記

This gives you a lot of power for working with Japanese text: the rules for deciding whether a character belongs to one of these groups live in an object that can do a lot of useful things itself and can also be used with other APIs, such as Guava's Splitter class.

Edit:

Based on jleedev's answer, you could also write a method like:

public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
  return new CharMatcher() {
    public boolean matches(char c) {
      return Character.UnicodeBlock.of(c) == block;
    }
  };
}

and use it like:

CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);

I think this might be a bit slower than the other version though.

ColinD 2010-09-30 01:40:07

Right, if you only want to test for membership in a specific range, it might be faster to do it yourself. Surprisingly, the UnicodeBlock class doesn’t have a method to test a character for membership, and it seems the only way is its static `of` method, which loops through every block until it finds one.

jleedev 2010-09-30 03:10:35

Answer 4

+12 A:

This functionality is built into the Character.UnicodeBlock class. For example:

Character.UnicodeBlock.of('誰') == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == Character.UnicodeBlock.HIRAGANA
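
Applied to a whole string, the same check might look like the sketch below. The class name and the string labels are hypothetical, and only the three blocks mentioned in this thread are handled, so kanji in the CJK extension blocks and any characters outside the Basic Multilingual Plane would fall through to "other".

```java
// Illustrative sketch; BlockClassifier and its labels are hypothetical names.
final class BlockClassifier {
    // Returns a coarse label for one char, based on its Unicode block.
    static String classify(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        if (block == Character.UnicodeBlock.HIRAGANA) return "hiragana";
        if (block == Character.UnicodeBlock.KATAKANA) return "katakana";
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) return "kanji";
        return "other";
    }

    public static void main(String[] args) {
        String text = "誰か確認上記これらのフ";
        // Walk the string char by char and print each label.
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.println(c + ": " + classify(c));
        }
    }
}
```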

jleedev 2010-09-30 02:04:35

Interesting, I didn't know about that.

ColinD 2010-09-30 02:09:47
