All internal strings in the project are kept in UTF-8. The project has been ported to Linux and Windows, and it now needs to_lower functionality.

On a POSIX OS I could use std::ctype_byname("ru_RU.UTF-8"). But with g++ (Debian 4.3.4-1), ctype::tolower() doesn't recognize Russian UTF-8 characters (Latin text is lowercased fine).

On Windows, MinGW's standard library throws the exception "std::runtime_error: locale::facet::_S_create_c_locale name not valid" when I try to construct std::ctype_byname with the "ru_RU.UTF-8" argument.

How do I implement/find std::ctype for utf-8 on Windows? The project already depends on libiconv (codecvt facet is based on it), but I don't see an obvious way to implement to_lower with it.

A: 

Some STL implementations (Apache's STDCXX, for example) ship with several locales of their own. In other cases, the available locales depend only on the system.

If the name "ru_RU.UTF-8" works on one operating system, that doesn't mean other systems use the same name for this locale. Debian and Windows probably use different names, and that is why you get a runtime exception.
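One way to see which names a given runtime accepts is to try them in turn and catch the exception. This is only a sketch: first_available_locale is a hypothetical helper, not part of any library, and it relies only on the fact that the std::locale constructor throws std::runtime_error for a name it does not know.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the first candidate name that the C++ runtime
// accepts, or an empty string if none of them can be constructed.
std::string first_available_locale(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        try {
            std::locale loc(name.c_str());  // throws std::runtime_error if the name is unknown
            return name;
        } catch (const std::runtime_error&) {
            // This name is not available here; try the next candidate.
        }
    }
    return std::string();
}
```

The candidate list is up to the caller, for example "ru_RU.UTF-8" followed by whatever spelling the target system uses. An empty result means none of the names is usable with that standard library.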

You should install the locales you want on the system first, or use an STL that already has this locale.

My two cents...

dudewat
I'm pretty sure Windows knows how to handle UTF-8 encoding. I've even got the code page number: 65001. The question is what locale name should be used in my case. Anyway, it seems I'm trying to do a fundamentally wrong thing (see the comment on the question).
Basilevs
Does this page help you: http://msdn.microsoft.com/en-us/library/dd373814(VS.85).aspx?
dudewat
+2  A: 
Dmitriy
All of that can be done on Linux or Windows without STLport. There is no ctype in your example, and your codecvt would convert UTF-8 to a different encoding (CP???? or WCHAR_T), whereas my question was about UTF-8 as the internal representation.
Basilevs
+2  A: 

If all you need is to_lower for Cyrillic characters, you can write the function yourself.

АБВГДЕЖ in UTF8  D0 90 D0 91 D0 92 D0 93 D0 94 D0 95 D0 96 0A
абвгдеж in UTF8  D0 B0 D0 B1 D0 B2 D0 B3 D0 B4 D0 B5 D0 B6 0A

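But don't forget that UTF-8 is a multibyte encoding: each of the letters above takes two bytes, so the function has to work on byte pairs rather than on single chars.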
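A minimal sketch of such a hand-written function, assuming the input is valid UTF-8 and that only ASCII and the basic Cyrillic capitals U+0400..U+042F need lowercasing. The name utf8_to_lower_cyrillic is made up for illustration. Note that the byte dumps above only cover the easy case (lead byte D0, second byte shifted by 0x20); for U+0420..U+042F the lowercase form moves to lead byte D1, and the same happens for U+0400..U+040F (which includes Ё).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: lowercases ASCII A-Z and Cyrillic U+0400..U+042F in a
// UTF-8 string. All other bytes are copied through unchanged.
std::string utf8_to_lower_cyrillic(std::string s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead >= 'A' && lead <= 'Z') {
            s[i] = static_cast<char>(lead + 0x20);           // ASCII
        } else if (lead == 0xD0 && i + 1 < s.size()) {
            unsigned char trail = static_cast<unsigned char>(s[i + 1]);
            if ((trail & 0xC0) != 0x80) continue;             // not a continuation byte
            if (trail >= 0x90 && trail <= 0x9F) {             // U+0410..U+041F -> U+0430..U+043F
                s[i + 1] = static_cast<char>(trail + 0x20);
            } else if (trail >= 0xA0 && trail <= 0xAF) {      // U+0420..U+042F -> U+0440..U+044F
                s[i] = static_cast<char>(0xD1);
                s[i + 1] = static_cast<char>(trail - 0x20);
            } else if (trail <= 0x8F) {                       // U+0400..U+040F -> U+0450..U+045F
                s[i] = static_cast<char>(0xD1);
                s[i + 1] = static_cast<char>(trail + 0x10);
            }
            ++i;                                              // skip the trail byte
        }
    }
    return s;
}
```

Because the result has the same byte length as the input, the string can be modified in place. That would stop being true for a general Unicode to_lower, where some case mappings change the encoded length.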

You can also try converting the string from UTF-8 to wchar_t (using libiconv) and then using a Windows-specific function to implement to_lower.

Dmitriy