ansaurus

Question

To split only Chinese characters in java

Answer 1

+7 A:

Chinese characters lie within certain Unicode ranges:

2F00-2FDF: Kangxi Radicals
4E00-9FFF: CJK Unified Ideographs
3400-4DBF: CJK Unified Ideographs Extension A
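As a rough sketch, the ranges listed above can be tested directly against a code point. The class and method names here (ChineseRanges, isChinese) are made up for illustration, and the upper bound of the main CJK range is taken as U+9FFF, the end of the CJK Unified Ideographs block.

```java
/** Hypothetical helper: tests a code point against the three ranges listed above. */
class ChineseRanges {
    static boolean isChinese(int codePoint) {
        return (codePoint >= 0x2F00 && codePoint <= 0x2FDF)   // Kangxi radicals
            || (codePoint >= 0x3400 && codePoint <= 0x4DBF)   // CJK Extension A
            || (codePoint >= 0x4E00 && codePoint <= 0x9FFF);  // CJK Unified Ideographs
    }

    public static void main(String[] args) {
        // U+6CD5 is a CJK ideograph; it is followed by three ASCII digits.
        String text = "\u6CD5210";
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            System.out.println(new String(Character.toChars(cp)) + " -> " + isChinese(cp));
            i += Character.charCount(cp);
        }
    }
}
```

This only covers the three ranges above; characters outside the Basic Multilingual Plane (such as Extension B) would need additional ranges.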

So all you basically need to do is check whether the character's code point lies within the known ranges. The example below is a good starting point for writing a stack-based parser/splitter; you only need to extend it to separate digits from Latin letters, which should be straightforward (hint: Character#isDigit()):

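import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

public class ChineseCheck {
    public static void main(String[] args) {
        // Unicode blocks treated as "Chinese" for this check.
        Set<UnicodeBlock> chineseUnicodeBlocks = new HashSet<UnicodeBlock>();
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_COMPATIBILITY);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_COMPATIBILITY_FORMS);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_RADICALS_SUPPLEMENT);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A);
        chineseUnicodeBlocks.add(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B);
        chineseUnicodeBlocks.add(UnicodeBlock.KANGXI_RADICALS);
        chineseUnicodeBlocks.add(UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS);

        String mixedChinese = "查詢促進民間參與公共建設法（210ＢＯＴ法）";

        // Iterate by code point rather than by char, so that supplementary
        // characters (e.g. Extension B) are not seen as two surrogate halves.
        for (int i = 0; i < mixedChinese.length(); ) {
            int cp = mixedChinese.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (chineseUnicodeBlocks.contains(UnicodeBlock.of(cp))) {
                System.out.println(ch + " is Chinese");
            } else {
                System.out.println(ch + " is not Chinese");
            }
            i += Character.charCount(cp);
        }
    }
}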
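The extension hinted at above, separating Chinese text from digits and from everything else, could be sketched as a single pass that starts a new token whenever the character category changes. This is only one possible reading of "splitter" and not necessarily the stack-based design intended here; the names RunSplitter, Kind, kindOf and split are illustrative, and only a few of the blocks are checked to keep it short.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

class RunSplitter {
    enum Kind { CHINESE, DIGIT, OTHER }

    // Classify one code point; the block list is deliberately abbreviated.
    static Kind kindOf(int cp) {
        UnicodeBlock block = UnicodeBlock.of(cp);
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || block == UnicodeBlock.KANGXI_RADICALS) {
            return Kind.CHINESE;
        }
        return Character.isDigit(cp) ? Kind.DIGIT : Kind.OTHER;
    }

    // Split text into maximal runs of code points that share the same Kind.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Kind currentKind = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Kind kind = kindOf(cp);
            if (currentKind != null && kind != currentKind) {
                // Category changed: close the current run.
                tokens.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            currentKind = kind;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two ideographs, then "210", then "BOT", then one ideograph.
        System.out.println(split("\u6CD5\u5F8B210BOT\u6CD5"));
    }
}
```

Note that Character.isDigit also accepts non-ASCII digits (fullwidth digits, for instance), which may or may not be what you want.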

Good luck.

BalusC 2009-11-04 18:46:42

As an extension, I believe a regex character class spanning the above Unicode ranges would also work.
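A minimal sketch of that idea, assuming the three ranges from the answer (with U+9FFF as the end of the main CJK block). ChineseRegex and chineseRuns are hypothetical names, and the pattern only extracts the Chinese runs; it does not group the remaining text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChineseRegex {
    // One character class covering Kangxi radicals, Extension A and the main CJK block.
    static final Pattern CHINESE_RUN =
            Pattern.compile("[\u2F00-\u2FDF\u3400-\u4DBF\u4E00-\u9FFF]+");

    // Collect every maximal run of characters matched by the class.
    static List<String> chineseRuns(String text) {
        List<String> runs = new ArrayList<String>();
        Matcher m = CHINESE_RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Two ideographs, then "210BOT", then one ideograph.
        System.out.println(chineseRuns("\u6CD5\u5F8B210BOT\u6CD5"));
    }
}
```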

pst 2009-11-04 18:55:57

Not really, if you also want to split out groups of digits, letters, hyphens, or other Latin characters. A stack-based parser is a better tool for this kind of job.

BalusC 2009-11-04 18:58:56

Answer 2

A:

Here's an approach I would take.

You can use Character.codePointAt(char[] charArray, int index) to get the Unicode code point at a given index of your char array.

You will also need a mapping of Latin Unicode characters.

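If you look in the source of Character.UnicodeBlock, there is no single LATIN block; the Latin blocks it defines (BASIC_LATIN, LATIN_1_SUPPLEMENT, LATIN_EXTENDED_A and LATIN_EXTENDED_B) together cover the interval [0x0000, 0x024F]. So basically you check whether your Unicode code point falls somewhere within that interval.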
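A small sketch of that interval check, taking U+024F (the end of Latin Extended-B) as the upper bound. LatinRange and isLatin are illustrative names only.

```java
class LatinRange {
    // True if the code point at chars[index] falls in [0x0000, 0x024F].
    static boolean isLatin(char[] chars, int index) {
        int cp = Character.codePointAt(chars, index);
        return cp >= 0x0000 && cp <= 0x024F;
    }

    public static void main(String[] args) {
        // 'a', 'B', then the ideograph U+6CD5.
        char[] chars = "aB\u6CD5".toCharArray();
        for (int i = 0; i < chars.length; i++) {
            System.out.println(i + " -> " + isLatin(chars, i));
        }
    }
}
```

Keep in mind that this interval also contains ASCII digits, punctuation and control characters, so digits still need a separate Character.isDigit check.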

I suspect there is a way to just use a Character.Subset to check if it contains your char, but I haven't looked into that.
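One way this might look: Character.UnicodeBlock is itself a subclass of Character.Subset, so a set of the Latin blocks can be checked against UnicodeBlock.of(...). The class name LatinBlocks and the choice of these four blocks are assumptions for the sketch.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LatinBlocks {
    // The four Latin blocks that together span U+0000 to U+024F.
    static final Set<Character.Subset> LATIN = new HashSet<Character.Subset>(Arrays.asList(
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.LATIN_1_SUPPLEMENT,
            UnicodeBlock.LATIN_EXTENDED_A,
            UnicodeBlock.LATIN_EXTENDED_B));

    // UnicodeBlock.of returns null for code points in no block; contains(null) is then false.
    static boolean isLatin(int codePoint) {
        return LATIN.contains(UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(isLatin('a'));     // Basic Latin
        System.out.println(isLatin(0x6CD5));  // a CJK ideograph
    }
}
```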

simmbot 2009-11-04 19:01:23
