ansaurus

Question

Detecting locale from unicode string in c++

Answer 1

+1 A:

The first step is writing a functor to tell whether a given wchar_t is Hindi, i.e. whether it falls in the Devanagari block U+0900-U+097F. This will be (derived from) a std::unary_function<wchar_t, bool>. The implementation is trivial: return c >= 0x0900 && c < 0x0980;. The second step is using it: std::find_if(begin, end, is_hindi()).
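A minimal sketch of the functor and the std::find_if call described above. The helper name contains_hindi is made up for illustration. The functor here does not derive from std::unary_function, since that base class was deprecated in C++11 and removed in C++17, and std::find_if does not need it.

```cpp
#include <algorithm>
#include <string>

// Predicate: true if c lies in the Devanagari block (U+0900..U+097F).
struct is_hindi {
    bool operator()(wchar_t c) const {
        return c >= 0x0900 && c < 0x0980;
    }
};

// Hypothetical helper: true if any element of s is a Devanagari character.
bool contains_hindi(const std::wstring& s) {
    return std::find_if(s.begin(), s.end(), is_hindi()) != s.end();
}
```

Calling contains_hindi(L"\u0938\u0939\u0938") should return true, while a plain ASCII string such as L"hello" should return false. Note that this only detects the script, so other languages written in Devanagari (Marathi or Nepali, for example) would match as well.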

Since you'll need Unicode, you should probably use wchar_t and therefore std::wstring. Neither std::string nor Glib::ustring supports Unicode proper. On some systems (Windows in particular) wchar_t is restricted to 16 bits, so a single wchar_t can only hold code points from the Basic Multilingual Plane, but that should still be enough for 99.9% of the world's population.

You'll need to convert from/to UTF-8 on I/O, but the advantage of "one character = one wchar_t" is big. For instance, std::wstring::substr() will work reasonably. You might still have issues with "characters" like U+094B (DEVANAGARI VOWEL SIGN O), though. When iterating over a std::wstring, that will appear to be a character by itself, instead of a modifier. That's still better than std::string with UTF-8, where you'd end up iterating over the individual bytes of U+094B. And to take just your original examples, none of the bytes in UTF8(U+094B) are reserved for Hindi.
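To make the last point concrete, here is a small hand-written encoder (an illustrative sketch, not a library API) that turns one code point into its UTF-8 bytes. It does not validate its input. For U+094B it produces the three bytes E0 A5 8B, and each of those bytes also occurs in the encodings of many non-Devanagari characters, so a per-byte range check cannot work.

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// No validation: surrogates and values above U+10FFFF are not rejected.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                     // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every code point in U+0900-U+097F encodes to three bytes, the first of which is always E0 and the second either A4 or A5.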

MSalters 2009-08-17 13:46:31

Thanks for the answer. What would the comparison statement inside the functor look like?

Pallavi 2009-08-17 13:52:32

Why do you say Glib::ustring doesn't support Unicode properly?

ltcmelo 2009-08-17 14:39:19

@ltcmelo, he didn't write "properly", he wrote "proper". What this means is that one can use, for instance, std::string to support Unicode, but std::string itself knows nothing about Unicode.

Rob K 2009-08-17 15:58:39

I tried a Hindi word with Glib::ustring and it supports Unicode very well. I tried with GCC 4.3.3 on Linux and with GCC 4.4.0 on Windows.

Sahasranaman MS 2009-08-17 16:43:31

@Rob K - Yes, I know that. But I asked about Glib::ustring, which he also said doesn't support Unicode. I'm curious about that because in my understanding the whole point of Glib::ustring is to represent UTF-8 properly. Perhaps he was talking about other encodings than UTF-8?

ltcmelo 2009-08-18 12:18:22

Answer 2

+1 A:

If the string is already encoded as UTF-8, I would not convert it to UTF-16 (I assume that's what MSalters calls "Unicode proper") but iterate through the UTF-8 encoded string and check whether there is a Hindi character in it.

With std::string, you can easily iterate with the help of the UTF8-CPP library: take a look at the utf8::next() function, or the iterator class.
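For readers without the library at hand, the idea of walking a UTF-8 std::string one code point at a time can be sketched with a hand-rolled decoder. This is a stand-in for illustration, not UTF8-CPP's code. The names next_code_point and contains_devanagari are hypothetical, and the decoder assumes the input is valid UTF-8 and does no error checking, which a real library would do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the code point starting at s[i] and advance i
// past it. Assumes s is valid UTF-8; malformed input is not detected.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                      // 1-byte sequence (ASCII)
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;  // continuation bytes
    char32_t cp = b0 & (0x3F >> extra);            // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// True if the UTF-8 string contains a code point in U+0900..U+097F.
bool contains_devanagari(const std::string& utf8) {
    std::size_t i = 0;
    while (i < utf8.size()) {
        char32_t cp = next_code_point(utf8, i);
        if (cp >= 0x0900 && cp <= 0x097F) return true;
    }
    return false;
}
```

The range check is applied to decoded code points rather than to bytes, so no conversion of the whole string to UTF-16 or UTF-32 is needed.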

Glib::ustring has an iterator that seems to enable the same functionality (haven't tried it).

Nemanja Trifunovic 2009-08-17 16:43:38

Answer 3

+2 A:

Here is how you do it with Glib::ustring :

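#include <glibmm/ustring.h>

using Glib::ustring;

ustring x("सहस");    // Hindi string; assumes the source file is saved as UTF-8
bool is_hindi = false;
// ustring iterators step over whole code points, not bytes
for (ustring::iterator i = x.begin(); i != x.end(); ++i)
    if (*i >= 0x0900 && *i <= 0x097f) {    // Devanagari block
        is_hindi = true;
        break;                             // one match is enough
    }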

Sahasranaman MS 2009-08-17 16:50:21

The "सहस" bit isn't portable, neither in theory nor in practice. It works iff you have a `char` encoding that supports Hindi. As Hindi is just U+0900-U+097F, you can append that to ASCII and still fit it in 8 bits, so I'll assume such encodings exist.

MSalters 2009-08-18 08:10:49
