ansaurus

Question

Answer 1

+4 A:

Well this dirty trick will not work. First, what is the value of mask after this:

   const unsigned char mask = 0x11000000;
   const unsigned char notUtf8Begin = 0x10000000;

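Perhaps you are mixing up hex and binary representation. 0x11000000 and 0x10000000 are hex literals that do not fit in an unsigned char, so both constants end up as 0. The bit patterns 11000000 and 10000000 are 0xC0 and 0x80 in hex.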

Second, as you correctly say, in UTF-8 a character may be several bytes long. std::count_if will iterate through every byte of a UTF-8 sequence. But what you actually need is to look at the leading byte of each character and skip the rest until the next character begins.

It is not hard to implement a single loop that does the counting and jumps forward using a simple mask table for leading bytes.

In the end you still get O(n) for counting the characters, and it will work with every UTF-8 string.
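A minimal sketch of such a loop, assuming the input is well-formed UTF-8. The function names are made up for illustration, and an if-chain over the lead-byte masks stands in for the mask table; invalid lead bytes are simply treated as one-byte characters so the scan always advances.

```cpp
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence, judged from its leading byte alone.
// Hypothetical helper: returns 1 for stray continuation or invalid
// bytes so the caller always moves forward.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // continuation or invalid byte
}

// Counts code points by reading each leading byte and jumping over
// the trailing bytes of that sequence.
inline std::size_t utf8_count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        i += utf8_sequence_length(static_cast<unsigned char>(s[i]));
    }
    return count;
}
```

Note that this counts code points, not user-perceived characters: a base letter followed by a combining diacritic still counts as two.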

AlexKR 2009-10-02 08:42:40

Yep, got my masks mixed up, sorry. However, the count_if is still correct, apart from the combining diacritics problem.

Matthieu M. 2009-10-02 12:23:56
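For comparison, here is a sketch of what the count_if approach might look like with the masks written as the intended bit patterns. The constant names follow the snippet quoted in the answer; the function name and the lambda are assumptions, not the asker's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Counts every byte that is NOT a continuation byte (10xxxxxx).
// For well-formed UTF-8 that equals the number of code points.
inline std::size_t utf8_count_non_continuation(const std::string& s) {
    const unsigned char mask = 0xC0;          // binary 11000000
    const unsigned char notUtf8Begin = 0x80;  // binary 10000000
    return static_cast<std::size_t>(std::count_if(
        s.begin(), s.end(), [=](char c) {
            return (static_cast<unsigned char>(c) & mask) != notUtf8Begin;
        }));
}
```

This visits every byte rather than jumping from lead byte to lead byte, but it is still O(n) and gives the same result on well-formed input.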

I was working on a UTF-8 string class where ++ would step over multi-byte code points correctly, and I gave up on the array of offsets for jumping from byte to byte. It works great going forward, but for -- it provides no benefit. The pedantic code is easier to maintain.

jmucchiello 2009-10-08 19:34:17
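The reason a lead-byte table does not help with -- is that going backward you meet the trailing bytes first, so you have to test each one until you reach a lead byte. A rough sketch of that backward step, assuming well-formed input and using hypothetical helper names:

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (10xxxxxx).
inline bool utf8_is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Returns the index of the first byte of the code point that ends
// just before pos. Returns 0 if pos is already 0.
inline std::size_t utf8_prev(const std::string& s, std::size_t pos) {
    if (pos == 0) return 0;
    --pos;
    while (pos > 0 &&
           utf8_is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

Calling utf8_prev repeatedly from s.size() walks the string one code point at a time toward the start.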

Answer 2

A:

Sorting UTF-8 as binary will not sort in 'Unicode' order. BOCU-1 would. As was said, your "as expected" is a pretty low bar for non-English content.

Steven R. Loomis 2009-10-08 19:22:57

Answer 3

A:

We also handle it like this in OpenLieroX (which I think is perfectly fine for a game).

We have a bunch of useful functions/algorithms for such UTF-8 std::strings. See Unicode.h and Unicode.cpp. For example, there are UTF-8 iterators, some simple manipulation operations (insert or erase), upper/lower case conversions, case-insensitive search, etc.

But don't expect those functions to always be correct. For example, they don't really know about combining diacritics or the different possible ways to encode the same text.

Albert 2010-09-03 17:49:37
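To illustrate the caveat about combining diacritics, here is a small sketch (not OpenLieroX code; all names are invented for this example). The same visible character "é" can be stored precomposed as U+00E9 or as "e" followed by the combining acute accent U+0301, and byte-level functions see two different strings.

```cpp
#include <cstddef>
#include <string>

// Counts code points by skipping continuation bytes (10xxxxxx).
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "é" as a single precomposed code point, U+00E9.
inline std::string precomposed_e_acute() { return "\xC3\xA9"; }

// "é" as U+0065 followed by the combining acute accent U+0301.
inline std::string decomposed_e_acute() { return "e\xCC\x81"; }
```

The two strings compare unequal and yield code point counts of 1 and 2, even though they render identically. Treating them as equal would need Unicode normalization, which is beyond a quick and dirty approach.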


Utf-8 in c++: quick & dirty tricks
