I'm trying to understand how to use icu::BreakIterator to find specific words.

For example, I have the following sentence:

To be or not to be? That is the question...

The word instance of the break iterator would put breaks here:

|To| |be| |or| |not| |to| |be|?| |That| |is| |the| |question|.|.|.|

Now, not every pair of break points delimits an actual word.

The derived class icu::RuleBasedBreakIterator has a getRuleStatus() member that returns some kind of information about the break, and it reports a "word" status at the following points (marked "/"):

|To/ |be/ |or/ |not/ |to/ |be/?| |That/ |is/ |the/ |question/.|.|.|

But it all depends on the specific rules, and I could find no documentation that explains them (short of just trying it out). What would happen with different locales and languages where dictionaries are used? What happens with backward iteration?

Is there any way to get "begin of word" or "end of word" information, as in Qt's QTextBoundaryFinder: http://qt.nokia.com/doc/4.5/qtextboundaryfinder.html#BoundaryReason-enum?

How should I solve this problem correctly in ICU?

+2  A: 

Have you tried the ICU documentation? It appears to explain everything you are asking about including handling of internationalisation, reverse iteration, and the rules, both default and how to create your own custom set. They also have code snippets to help.
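
As a rough illustration of how the rule status can be turned into "begin of word" / "end of word" information, here is a minimal sketch. It assumes the behaviour described in the ICU documentation: after each call to next(), getRuleStatus() describes the text between the previous boundary and the current one, and values at or above UBRK_WORD_NONE_LIMIT (100) mean that text was a word rather than spaces or punctuation. The names collectWords, WordSpan, FakeWordIterator and makeSampleIterator are hypothetical helpers invented for this example, and the fake iterator's offsets and statuses are hard-coded to mirror "To be or not to be?" so that the sketch compiles without ICU installed.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Mirrors UBRK_WORD_NONE_LIMIT from ICU's <unicode/ubrk.h>: statuses below this
// tag spaces and punctuation; statuses at or above it tag numbers, letters,
// kana and ideographs.
constexpr int32_t kWordNoneLimit = 100;
// Mirrors icu::BreakIterator::DONE.
constexpr int32_t kDone = -1;

// Hypothetical result type: [start, end) offsets of one word.
struct WordSpan {
    int32_t start;  // "begin of word" boundary
    int32_t end;    // "end of word" boundary
};

// Walks forward over all boundaries and keeps only the segments whose rule
// status says "this was a word". Iter only needs first(), next() and
// getRuleStatus(), which is the subset of icu::BreakIterator used here.
template <class Iter>
std::vector<WordSpan> collectWords(Iter& it) {
    std::vector<WordSpan> words;
    int32_t start = it.first();
    for (int32_t end = it.next(); end != kDone; start = end, end = it.next()) {
        if (it.getRuleStatus() >= kWordNoneLimit) {
            words.push_back({start, end});
        }
    }
    return words;
}

// Hypothetical stand-in for a word break iterator (NOT part of ICU).
// Each mark is (boundary offset, status of the text preceding that boundary).
struct FakeWordIterator {
    std::vector<std::pair<int32_t, int32_t>> marks;
    std::size_t pos;

    explicit FakeWordIterator(std::vector<std::pair<int32_t, int32_t>> m)
        : marks(std::move(m)), pos(0) {}

    int32_t first() {
        pos = 0;
        return marks.empty() ? kDone : marks[0].first;
    }
    int32_t next() {
        if (pos + 1 >= marks.size()) return kDone;
        return marks[++pos].first;
    }
    int32_t getRuleStatus() const { return marks[pos].second; }
};

// Assumed boundaries for "To be or not to be?": 200 (UBRK_WORD_LETTER) after
// each word, 0 (UBRK_WORD_NONE) after each space and after the "?".
FakeWordIterator makeSampleIterator() {
    std::vector<std::pair<int32_t, int32_t>> marks = {
        {0, 0},  {2, 200},  {3, 0},  {5, 200},  {6, 0},  {8, 200}, {9, 0},
        {12, 200}, {13, 0}, {15, 200}, {16, 0}, {18, 200}, {19, 0}};
    return FakeWordIterator(marks);
}
```

With ICU itself, the same collectWords template should accept an iterator obtained from icu::BreakIterator::createWordInstance() once setText() has been called, provided the ICU version in use exposes getRuleStatus() on the iterator type you hold. For backward iteration the status still describes the text before the current boundary, so the loop would need restructuring rather than just swapping next() for previous().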

Troubadour
"The function getRuleStatus() returns an enum giving additional information on the text preceding the last break position found." -- from the documentation, after re-reading it. So I accept this answer. However, I would expect something like that to appear in the Doxygen description of this member function.
Artyom
@Artyom: Thanks. Regarding your point about Doxygen comments, I would never rely on that sort of thing completely. They are great when they exist and are well maintained, but as we all know, developers often give them lower priority than features or bug fixes. In a less than ideal situation you just have to use every form of documentation at your disposal, including the actual source code if it is available. Anyway, glad to help, even if it just made you re-read the docs!
Troubadour