Hi,
I am aware that there have been various questions about utf-8, mainly about libraries to manipulate utf-8 'string'-like objects.
However, I am working on an 'internationalized' project (a website, for which I code a C++ backend... don't ask) where, even though we deal with utf-8, we don't actually need such libraries. Most of the time the plain std::string methods or STL algorithms are quite sufficient for our needs, and indeed this is the goal of using utf-8 in the first place.
So, what I am looking for here is a collection of the "Quick & Dirty" tricks that you know of for utf-8 stored as std::string (no const char*, I don't really care about C-style code, I've got better things to do than constantly worry about my buffer size).
For example, here is a "Quick & Dirty" trick to obtain the number of characters (which is useful to know if it will fit in your display box):
#include <algorithm>
#include <cstddef>
#include <string>
// Let's remember that in utf-8 encoding, a character may be
// 1 byte: '0.......'
// 2 bytes: '110.....' '10......'
// 3 bytes: '1110....' '10......' '10......'
// 4 bytes: '11110...' '10......' '10......' '10......'
// Therefore '10......' is not the beginning of a character ;)
const unsigned char mask = 0xC0;         // keeps the two top bits of a byte
const unsigned char notUtf8Begin = 0x80; // '10......' once masked
struct Utf8Begin
{
    // True for every byte that starts a character (i.e. is not a continuation byte)
    bool operator()(char c) const { return (c & mask) != notUtf8Begin; }
};
// Let's count
std::size_t countUtf8Characters(const std::string& s)
{
    return std::count_if(s.begin(), s.end(), Utf8Begin());
}
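To make it clear what this trick counts, here is a small self-contained sketch of the same idea. It assumes C++11 (for the lambda) and valid utf-8 input; the names countCodePoints and checkCountExamples are just mine for illustration. Note that the count is a number of code points, so an 'e' followed by a combining accent counts as 2 even though it displays as one character.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Counts every byte that is NOT a continuation byte ('10......').
// Assumption: s holds valid utf-8; no validation is done here.
std::size_t countCodePoints(const std::string& s)
{
    return static_cast<std::size_t>(std::count_if(
        s.begin(), s.end(),
        [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; }));
}

// Hypothetical self-check; the strings are written as raw utf-8 bytes.
void checkCountExamples()
{
    assert(countCodePoints("abc") == 3);               // 3 bytes, 3 code points
    assert(countCodePoints("\xC3\xA9t\xC3\xA9") == 3); // "été": 5 bytes, 3 code points
    assert(countCodePoints("\xE2\x82\xAC") == 1);      // euro sign: 3 bytes, 1 code point
    assert(countCodePoints("e\xCC\x81") == 2);         // 'e' + combining acute: 2 code points
}
```

Calling checkCountExamples() from main should pass silently. If the display box has to deal with combining marks, this count will overestimate the visible width.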
In fact I have yet to encounter a use case where I would need anything other than the number of characters that std::string or the STL algorithms don't offer for free, since:
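- sorting works as expected: comparing the bytes gives the same order as comparing the code points
- no part of a word can be confused with a word or part of another word, so a plain substring search is safe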
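The two points above can be sketched as follows. This is only an illustration under my own assumptions: valid utf-8 on both sides, and the helper names sortedByCodePoint, containsWord and checkSortAndFind are made up for the example. Keep in mind that the resulting order is code point order, not a locale-aware alphabetical order (so 'é' ends up after 'z').

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// std::string's operator< compares bytes as unsigned values,
// which for valid utf-8 matches code point order.
std::vector<std::string> sortedByCodePoint(std::vector<std::string> words)
{
    std::sort(words.begin(), words.end());
    return words;
}

// A plain byte search: a valid utf-8 needle can only match at a
// character boundary of a valid utf-8 text.
bool containsWord(const std::string& text, const std::string& word)
{
    return text.find(word) != std::string::npos;
}

// Hypothetical self-check; the strings are written as raw utf-8 bytes.
void checkSortAndFind()
{
    // euro sign (U+20AC), "z", "é" (U+00E9), "a"
    const std::vector<std::string> v =
        sortedByCodePoint({"\xE2\x82\xAC", "z", "\xC3\xA9", "a"});
    assert(v[0] == "a" && v[1] == "z");
    assert(v[2] == "\xC3\xA9" && v[3] == "\xE2\x82\xAC");

    assert(containsWord("caf\xC3\xA9 au lait", "\xC3\xA9")); // "é" found in "café au lait"
    assert(!containsWord("\xC3\xA1", "a"));                  // 'a' is not hidden inside "á"
}
```

Calling checkSortAndFind() from main should pass silently.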
I would like to know if you have other comparable tricks, both for counting and for other simple tasks.
I repeat, I know about ICU and Utf8-CPP, but I am not interested in them since I don't need a full-fledged treatment (and in fact I have never needed more than the count of characters).
I also repeat that I am not interested in dealing with char*'s; they are old-fashioned.