views: 64
answers: 2

Can somebody please provide some sample code that strips diacritical marks from a UnicodeString using the ICU library in C++? That is, characters having accents, umlauts, etc., should be replaced with their unaccented, unumlauted equivalents, so that, e.g., every accented é becomes a plain ASCII e. For example:

UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}

Assume that s has already been normalized. Thanks.

A: 

Is there a problem with using

result.findAndReplace( "ü", "u" );
result.findAndReplace( "ö", "o" );

...and so on?

Edited to add:

The official UNIDATA Index is plain text. You can run it through a filter to emit a table of the needed transformations. For example,

E WITH DIAERESIS, LATIN CAPITAL LETTER 00CB

becomes

{ "\0x00CB", "E" }

As for performance: there is no magic bullet here, as there's no "translate these to those" assembly instruction, only "scan this string for that char". So that leaves you two choices, here expressed in C:

/* Choice 1: for each character of the string, scan the translation table. */
for (int n = 0; str[n]; ++n)
    for (int p = 0; translateList[p]; ++p)
        if (str[n] == translateList[p]->original)
        {
            str[n] = translateList[p]->translation;
            break;  /* this character is handled; move on to the next one */
        }

/* Choice 2: for each table entry, scan the string (effectively what
   string.findAndReplace() does), replacing matches in place. */
for (int n = 0; translateList[n]; ++n)
{
    char *ptr = str;
    while ((ptr = strchr(ptr, translateList[n]->original)) != NULL)
        *ptr++ = translateList[n]->translation;
}

On the x86 architecture, at least, the second choice is much faster.

The real performance hit is that the string gets copied every time, even if no substitutions are made. (In the C pseudocode above, the substitutions are made in place.)
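
For concreteness, here is a compilable sketch of the second, in-place approach; since the accented characters are not single bytes in a real string, this version works on UTF-16 code units using ICU's u_strchr() (from unicode/ustring.h) in place of strchr(), and the table contents are illustrative only:

#include <unicode/ustring.h>  /* UChar, u_strchr() */
#include <stddef.h>           /* NULL */

/* One entry per transformation; a zero 'original' terminates the table.
   Only two sample entries are shown. */
typedef struct {
  UChar original;     /* accented UTF-16 code unit    */
  UChar translation;  /* unaccented ASCII replacement */
} TranslateEntry;

static TranslateEntry const translateTable[] = {
  { 0x00CB, 'E' },  /* U+00CB -> E */
  { 0x00FC, 'u' },  /* U+00FC -> u */
  { 0, 0 }
};

/* Replace, in place, every occurrence of each table entry in a
   NUL-terminated UTF-16 buffer. */
static void translate_in_place( UChar *str ) {
  for ( int n = 0; translateTable[n].original; ++n ) {
    UChar *ptr = str;
    while ( (ptr = u_strchr( ptr, translateTable[n].original )) != NULL )
      *ptr++ = translateTable[n].translation;
  }
}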

egrunin
(1) I don't want to have to hand-code every single marked character used in every single language on the planet -- that's error-prone. (2) It's inefficient, since it has to go through the entire string N times, where N is the number of marked characters across every language on the planet. Ideally, the algorithm should go through the string exactly once.
Paul J. Lucas
You've taken what should be an O(n) algorithm and made it O(n^2).
Paul J. Lucas
@Paul: "should be"? Please point me to any table-of-substitutions algorithm that is O(n); I'm happy to correct my answer.
egrunin
A: 

After more searching elsewhere:

UErrorCode status = U_ZERO_ERROR;
UnicodeString result;

// 's16' is the UTF-16 string to have diacritics removed
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) ) {
  // complain
}

// code to convert the normalized UTF-16 'result' to a UTF-8 std::string 's8' elided

string buf8;
buf8.reserve( s8.length() );
// NFKD separates each base character from its combining marks; the marks
// (and all other non-ASCII characters) occupy multi-byte UTF-8 sequences,
// so keeping only the ASCII bytes discards them.
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
  char const c = *i;
  if ( isascii( c ) )
    buf8.push_back( c );
}
// result is in buf8

which is O(n).
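
Assembling the above into the function from the question, a minimal sketch might look like the following; the UTF-8 round trip uses toUTF8String() and fromUTF8() (available since ICU 4.2), and the error handling is only a placeholder:

#include <unicode/normlzr.h>      // Normalizer, UNORM_NFKD
#include <unicode/unistr.h>       // UnicodeString
#include <unicode/stringpiece.h>  // StringPiece (for fromUTF8)
#include <ctype.h>                // isascii()
#include <string>

using namespace icu;  // may already be implicit, depending on ICU configuration

UnicodeString strip_diacritics( UnicodeString const &s ) {
  UErrorCode status = U_ZERO_ERROR;
  UnicodeString decomposed;
  Normalizer::normalize( s, UNORM_NFKD, 0, decomposed, status );
  if ( U_FAILURE( status ) )
    return s;  // placeholder: return the input unchanged on error

  // Convert to UTF-8 and keep only the ASCII bytes; the combining marks
  // produced by NFKD are multi-byte sequences and so are dropped.
  std::string s8;
  decomposed.toUTF8String( s8 );

  std::string buf8;
  buf8.reserve( s8.length() );
  for ( std::string::size_type i = 0; i < s8.length(); ++i ) {
    char const c = s8[i];
    if ( isascii( c ) )
      buf8.push_back( c );
  }

  return UnicodeString::fromUTF8( buf8 );
}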

Paul J. Lucas