I'm trying to understang how to use icu::BreakIterator
to find specific words.
For example I have following sentence:
To be or not to be? That is the question...
Word instance of break iterator would put breaks there:
|To| |be| |or| |not| |to| |be|?| |That| |is| |the| |question|.|.|.|
Now, not every pair of break points is actual word.
In derived class icu::RuleBasedBreakIterator
there is a "getRuleStatus()" that returns some kind of information about break, and it gives "Word status at following points (marked "/")"
|To/ |be/ |or/ |not/ |to/ |be/?| |That/ |is/ |the/ |question/.|.|.|
But... It all depends on specific rules, and there is absolutely no documentation to understand it (unless I just try), but what would happend with different locales and languages where dictionaries are used? what happens with backware iteration?
Is there any way to get "Begin of Word" or "End of Word" information like in Qt QTextBoundaryFinder: http://qt.nokia.com/doc/4.5/qtextboundaryfinder.html#BoundaryReason-enum?
How should I solve such problem in ICU correctly?