I am working on a language segmentation project. I implemented segmentation for English by using a regular expression that breaks the string at "." (full stop). Now I want to support the following languages: Chinese, Arabic, Japanese, Russian, Korean, Dutch, Hindi, Greek and Urdu. I want to break strings in these languages at the full stop as well.

e.g.

For Chinese, the full stop is 。 (Unicode U+3002). Input string:

以有效應對各種事態」。他還表示,希望以符合21世紀的方式切實深化美日同盟關係。

Expected Result

Segment 1 :- 以有效應對各種事態」。
Segment 2 :- 他還表示,希望以符合21世紀的方式切實深化美日同盟關係。

I have to apply the same logic to the other languages (Arabic, Japanese, Russian, Korean, Dutch, Hindi, Greek, Urdu).

Thanks in advance.

A: 

See String.split. You can use /([。])/ as the regular expression separator and add the other punctuation characters inside the square brackets. The parentheses form a capturing group, so the delimiters are included in the resulting array instead of being discarded.
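As a minimal sketch of this approach: because the capturing group makes split return the text pieces and the delimiters alternately, each piece has to be glued back onto the delimiter that follows it. The function name `segment` and the exact set of terminators are assumptions for illustration. The list covers 。 (Chinese, Japanese), the ASCII full stop (used by Russian, Korean, Dutch, Greek and commonly Arabic), the danda । (Hindi) and ۔ (Urdu), and it is not meant to be exhaustive.

```javascript
// Sentence terminators (an illustrative, non-exhaustive assumption):
//   。 U+3002  ideographic full stop (Chinese, Japanese)
//   .  U+002E  full stop (English, Russian, Korean, Dutch, Greek, Arabic)
//   ।  U+0964  devanagari danda (Hindi)
//   ۔  U+06D4  arabic full stop (Urdu)
// Inside a character class the "." is a literal dot, not a wildcard.
const TERMINATORS = /([。.।۔])/;

// Hypothetical helper: splits text into segments, keeping each terminator
// attached to the segment it ends.
function segment(text) {
  // With a capturing group, split() returns [text, delim, text, delim, ...].
  const parts = text.split(TERMINATORS);
  const segments = [];
  for (let i = 0; i < parts.length; i += 2) {
    // Re-attach the delimiter that followed this piece, if there is one.
    const sentence = (parts[i] + (parts[i + 1] || "")).trim();
    if (sentence !== "") {
      segments.push(sentence);
    }
  }
  return segments;
}

const input = "以有效應對各種事態」。他還表示,希望以符合21世紀的方式切實深化美日同盟關係。";
segment(input).forEach((s, i) => console.log("Segment " + (i + 1) + " :- " + s));
```

Note that splitting on the plain "." will also break at decimal points and abbreviations, so a real segmenter would likely need extra rules for those cases.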

janmoesen