Hi,

I am aware that there have been various questions about UTF-8, mainly about libraries for manipulating UTF-8 'string'-like objects.

However, I am working on an 'internationalized' project (a website, for which I code a C++ backend... don't ask) where, even though we deal with UTF-8, we don't actually need such libraries. Most of the time the plain std::string methods or STL algorithms are quite sufficient for our needs, and indeed this is the goal of using UTF-8 in the first place.

So, what I am looking for here is a collection of the "Quick & Dirty" tricks that you know of related to UTF-8 stored as std::string (no const char*, I don't care about C-style code really, I've got better things to do than constantly worrying about my buffer size).

For example, here is a "Quick & Dirty" trick to obtain the number of characters (which is useful to know if it will fit in your display box):

#include <cstddef>
#include <string>
#include <algorithm>

// Let's remember that in utf-8 encoding, a character may be
// 1 byte: '0.......'
// 2 bytes: '110.....' '10......'
// 3 bytes: '1110....' '10......' '10......'
// 4 bytes: '11110...' '10......' '10......' '10......'
// Therefore '10......' is not the beginning of a character ;)

const unsigned char mask = 0xC0;
const unsigned char notUtf8Begin = 0x80;

struct Utf8Begin
{
  bool operator()(char c) const { return (c & mask) != notUtf8Begin; }
};

// Let's count
size_t countUtf8Characters(const std::string& s)
{
  return std::count_if(s.begin(), s.end(), Utf8Begin());
}

In fact, I have yet to encounter a use case where I need something, other than the number of characters, that std::string or the STL algorithms don't offer for free, since:

  • sorting works as expected
  • no part of a word can be confused as a word or part of another word
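To make the two points above concrete, here is a minimal sketch. It assumes "as expected" means code point order (not locale-aware collation), and the helper names sortUtf8 and containsUtf8 are made up for this example:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Byte-wise sorting of UTF-8 strings. std::string compares bytes as
// unsigned values, so for well-formed UTF-8 the result is in code point
// order (which is not the same thing as dictionary order in a given language).
std::vector<std::string> sortUtf8(std::vector<std::string> words)
{
  std::sort(words.begin(), words.end());
  return words;
}

// Byte-wise search. A well-formed needle starts with a leading byte, and a
// leading byte never looks like a continuation byte ('10......'), so a match
// cannot begin in the middle of a character of a well-formed haystack.
bool containsUtf8(const std::string& haystack, const std::string& needle)
{
  return haystack.find(needle) != std::string::npos;
}
```

With this, sorting "zebra", "été" and "apple" puts "été" last, because the leading byte of 'é' (0xC3) is greater than any ASCII byte.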

I would like to know if you have other comparable tricks, both for counting and for other simple tasks.
I repeat, I know about ICU and Utf8-CPP, but I am not interested in them since I don't need a full-fledged treatment (and in fact I have never needed more than the count of characters).
I also repeat that I am not interested in treating char*'s, they are old-fashioned.

+4  A: 

Well this dirty trick will not work. First, what is the value of mask after this:

   const unsigned char mask = 0x11000000;
   const unsigned char notUtf8Begin = 0x10000000;

Perhaps you are mixing hex representation with binary.

Second, as you correctly say, in UTF-8 encoding a character may be several bytes long. std::count_if will iterate through all bytes in a UTF-8 sequence. But what you actually need is to look at the leading byte of every character and skip the rest until the next character begins.

It would not be hard to implement a single loop that does the counting and jumps forward using a simple mask table for the leading bytes.

In the end you get the same O(n) for checking the characters and it will work with every UTF-8 string.
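A minimal sketch of such a loop, for illustration only. The function names are hypothetical, and the choice to count a malformed byte as one unit is an assumption of this example, not something stated above:

```cpp
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence judged from its leading byte.
// Returns 0 if the byte cannot start a sequence (for example '10......').
std::size_t utf8SequenceLength(unsigned char lead)
{
  if (lead < 0x80) return 1;            // '0.......'
  if ((lead & 0xE0) == 0xC0) return 2;  // '110.....'
  if ((lead & 0xF0) == 0xE0) return 3;  // '1110....'
  if ((lead & 0xF8) == 0xF0) return 4;  // '11110...'
  return 0;
}

// Counts characters by jumping from leading byte to leading byte.
std::size_t countUtf8ByLeadingBytes(const std::string& s)
{
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    std::size_t len = utf8SequenceLength(static_cast<unsigned char>(s[i]));
    if (len == 0) len = 1;  // assumption: step over a malformed byte, counting it once
    i += len;
    ++count;
  }
  return count;
}
```

On well-formed input this should give the same count as testing every byte for "is not a continuation byte"; the two approaches only differ in how they treat malformed sequences.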

AlexKR
Yep, got my masks mixed up, sorry. However the count_if is still correct apart from the combining diacritics problem.
Matthieu M.
I was working on a utf8 string class where ++ would walk over wide code points correctly and gave up on the array of offsets for jumping from byte to byte. It works great going forward but for -- it provides no benefit. The pedantic code is easier to maintain.
jmucchiello
A: 

Sorting UTF-8 as binary will not sort in 'Unicode' order. BOCU-1 would. As was said, your "as expected" is a pretty low bar for non-English content.

Steven R. Loomis
A: 

We also handle it like this in OpenLieroX (which I think is perfectly fine for a game).

We have a bunch of useful functions/algorithms for such UTF-8 std::strings. See Unicode.h and Unicode.cpp. For example, there are UTF8 iterators, some simple manipulation operators (insert or erase), upper/lower case conversions, case independent search, etc.

But don't expect those functions to always be correct. For example, they don't really know about combining diacritics or the different possible ways to encode the same text.
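To show what that limitation looks like in practice, here is a small sketch (not code from OpenLieroX; countCodePoints and the two helper functions are names invented for this example). The same visible 'é' can be stored as one code point or as 'e' followed by a combining accent, and byte-level code treats the two differently:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Counts code points: every byte that is not a continuation byte ('10......').
std::size_t countCodePoints(const std::string& s)
{
  return static_cast<std::size_t>(std::count_if(s.begin(), s.end(),
      [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; }));
}

// 'é' as a single precomposed code point, U+00E9.
std::string precomposedEAcute() { return "\xC3\xA9"; }

// 'é' as 'e' (U+0065) followed by the combining acute accent U+0301.
std::string decomposedEAcute() { return "e\xCC\x81"; }
```

Here countCodePoints gives 1 for the first form and 2 for the second, and the two strings do not compare equal with ==, although a reader sees the same character. Handling that properly needs Unicode normalization, which is beyond these quick tricks.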

Albert