IBM has apparently open-sourced ICU, its library for Unicode and globalization support. Part of it is a text boundary locator that detects where breaks can occur in text.

However, the break detection relies on rules, and I cannot find the rules files anywhere.

Where can I get the word break rules text files for com.ibm.icu.text.BreakIterator and com.ibm.icu.text.RuleBasedBreakIterator?

+2  A: 

http://www.icu-project.org/ hosts all the source code for ICU4J, which IBM has released under an open-source license. This includes the boundary analysis code, such as the dictionary- and rule-based break iterators.

However, there doesn't appear to be a rules text file that you can simply browse. I'm not sure that IBM would have released their rule set as open source (since it's a pretty big technological advantage to them). Instead, the idea is to create your own rule set; there is a tutorial on that here.

That same tutorial states that you can dump the default rules by running:

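import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;
import java.util.Locale;
RuleBasedBreakIterator rbbi = (RuleBasedBreakIterator)
    BreakIterator.getWordInstance(Locale.getDefault());
String defaultRules = rbbi.toString();  // the rule source as text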
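
If you want to see what a word break iterator actually produces before digging into the rule text, here is a rough sketch. Note the assumption: it uses the JDK's own java.text.BreakIterator as a stand-in so that it runs without the ICU jar. The ICU class exposes the same method names used here (getWordInstance, setText, first, next, DONE), but the JDK's rules are not ICU's, so results may differ on trickier text. The class name WordBoundaryDemo and the helper words are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {
    // Walks the word boundaries of 'text' and keeps only the segments that
    // start with a letter or digit, dropping whitespace and punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // Each pair (start, end) of consecutive boundaries delimits one segment.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US));
    }
}
```

To try it against ICU instead, swapping the first import for com.ibm.icu.text.BreakIterator should be enough, assuming the ICU4J jar is on your classpath.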
paxdiablo