views: 64
answers: 2

Can somebody please provide some sample code that strips diacritical marks from a UnicodeString using the ICU library in C++? That is, characters having accents, umlauts, etc., should be replaced with their unaccented, unumlauted equivalents, so that, e.g., every accented é becomes a plain ASCII e. For example:

UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}

Assume that s has already been normalized. Thanks.

A: 

Is there a problem with using

result.findAndReplace( "ü", "u" );
result.findAndReplace( "ö", "o" );

...and so on?

Edited to add:

The official UNIDATA Index is plain text. You can run it through a filter to emit a table of the needed transformations. For example,

E WITH DIAERESIS, LATIN CAPITAL LETTER 00CB

becomes

{ "\0x00CB", "E" }

As for performance: there is no magic bullet here, as there's no "translate these to those" assembly instruction, only "scan this string for that char". So that leaves you two choices, here expressed in C:

/* Choice 1: for each character of the string, scan the translation table. */
for (int n = 0; str[n]; ++n)
    for (int p = 0; translateList[p]; ++p)
        if (str[n] == translateList[p]->original)
        {
            str[n] = translateList[p]->translation;
            break;  /* this character is handled; move on to the next one */
        }

/* Choice 2: for each table entry, scan the string (effectively what
   string.findAndReplace() does), replacing matches in place. */
for (int n = 0; translateList[n]; ++n)
{
    char *ptr = str;
    while ((ptr = strchr(ptr, translateList[n]->original)) != NULL)
        *ptr++ = translateList[n]->translation;
}

On the x86 architecture, at least, the second choice is much faster.

The real performance hit is that the string gets copied every time, even if no substitutions are made. (In the C pseudocode above, the substitutions are made in place.)
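
For concreteness, here is a compilable sketch of the second, in-place approach; since the accented characters are not single bytes in a real string, this version works on UTF-16 code units using ICU's u_strchr() (from unicode/ustring.h) in place of strchr(), and the table contents are illustrative only:

#include <unicode/ustring.h>  /* UChar, u_strchr() */
#include <stddef.h>           /* NULL */

/* One entry per transformation; a zero 'original' terminates the table.
   Only two sample entries are shown. */
typedef struct {
  UChar original;     /* accented UTF-16 code unit    */
  UChar translation;  /* unaccented ASCII replacement */
} TranslateEntry;

static TranslateEntry const translateTable[] = {
  { 0x00CB, 'E' },  /* U+00CB -> E */
  { 0x00FC, 'u' },  /* U+00FC -> u */
  { 0, 0 }
};

/* Replace, in place, every occurrence of each table entry in a
   NUL-terminated UTF-16 buffer. */
static void translate_in_place( UChar *str ) {
  for ( int n = 0; translateTable[n].original; ++n ) {
    UChar *ptr = str;
    while ( (ptr = u_strchr( ptr, translateTable[n].original )) != NULL )
      *ptr++ = translateTable[n].translation;
  }
}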

egrunin
(1) I don't want to have to hand-code every single marked character used in every single language on the planet -- that's error-prone. (2) It's inefficient, since it has to go through the entire string N times, where N is the number of marked characters across every language on the planet. Ideally, the algorithm should go through the string exactly once.
Paul J. Lucas
You've taken what should be an O(n) algorithm and made it O(n^2).
Paul J. Lucas
@Paul: "should be"? Please point me to any table-of-substitutions algorithm that is O(n); I'm happy to correct my answer.
egrunin
A: 

After more searching elsewhere:

UErrorCode status = U_ZERO_ERROR;
UnicodeString result;

// 's16' is the UTF-16 string to have diacritics removed
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) ) {
  // complain
}

// code to convert the normalized UTF-16 'result' to a UTF-8 std::string 's8' elided

string buf8;
buf8.reserve( s8.length() );
// NFKD separates each base character from its combining marks; the marks
// (and all other non-ASCII characters) occupy multi-byte UTF-8 sequences,
// so keeping only the ASCII bytes discards them.
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
  char const c = *i;
  if ( isascii( c ) )
    buf8.push_back( c );
}
// result is in buf8

which is O(n).
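
Assembling the above into the function from the question, a minimal sketch might look like the following; the UTF-8 round trip uses toUTF8String() and fromUTF8() (available since ICU 4.2), and the error handling is only a placeholder:

#include <unicode/normlzr.h>      // Normalizer, UNORM_NFKD
#include <unicode/unistr.h>       // UnicodeString
#include <unicode/stringpiece.h>  // StringPiece (for fromUTF8)
#include <ctype.h>                // isascii()
#include <string>

using namespace icu;  // may already be implicit, depending on ICU configuration

UnicodeString strip_diacritics( UnicodeString const &s ) {
  UErrorCode status = U_ZERO_ERROR;
  UnicodeString decomposed;
  Normalizer::normalize( s, UNORM_NFKD, 0, decomposed, status );
  if ( U_FAILURE( status ) )
    return s;  // placeholder: return the input unchanged on error

  // Convert to UTF-8 and keep only the ASCII bytes; the combining marks
  // produced by NFKD are multi-byte sequences and so are dropped.
  std::string s8;
  decomposed.toUTF8String( s8 );

  std::string buf8;
  buf8.reserve( s8.length() );
  for ( std::string::size_type i = 0; i < s8.length(); ++i ) {
    char const c = s8[i];
    if ( isascii( c ) )
      buf8.push_back( c );
  }

  return UnicodeString::fromUTF8( buf8 );
}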

Paul J. Lucas