I have a problem with the functions in the string_algo package.

Consider this piece of code:

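#include <boost/algorithm/string.hpp>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

using namespace std;
using boost::to_upper;

int main() {
   try {
      string s = "meißen";
      locale l("de_DE.UTF-8");  // throws std::runtime_error if the locale is not installed
      to_upper(s, l);
      cout << s << endl;
   } catch (const std::runtime_error& e) {
      cerr << e.what() << endl;
   }

   try {
      string s = "composición";
      locale l("es_CO.UTF-8");
      to_upper(s, l);
      cout << s << endl;
   } catch (const std::runtime_error& e) {
      cerr << e.what() << endl;
   }
}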

The expected output for this code would be:

MEISSEN
COMPOSICIÓN

however the only thing I get is

MEIßEN
COMPOSICIóN

So clearly the locale is not being taken into account. I even tried setting the global locale, with no success. What can I do?

A: 

In the standard library there is std::toupper (which boost::to_upper uses) that operates on one character at a time.

This explains why the ß doesn't work. You didn't say which standard library and codepage you are using so I don't know why the ó didn't work.

What happens if you use wstring instead?

adrianm
+1  A: 

std::toupper assumes a 1:1 conversion, so there is no hope for the ß to SS case, Boost.StringAlgo or not.

Looking at StringAlgo's code, we see that it does use the locale (except on Borland, it seems). So, for the other case, I'm curious: what is the result of toupper('ó', std::locale("es_CO.UTF-8")) on your platform?

Writing the above makes me think of something else: what is the encoding of the strings in your sources? UTF-8? In that case, std::toupper will see two code units for 'ó', so there is no hope. Latin1? In that case, using a locale whose name ends in ".UTF-8" is inconsistent.
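To make the "two code units" point concrete, here is a minimal sketch of a per-byte upper-casing loop. The helper name upper_bytewise and the constant kComposicion are made up for illustration, and the string is spelled with explicit byte escapes so the result does not depend on how the source file is encoded. It uses the classic "C" locale by default, which only maps the ASCII letters.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: upper-cases a string one char (one byte) at a time,
// which is what a per-character std::toupper loop amounts to.
std::string upper_bytewise(std::string s,
                           const std::locale& loc = std::locale::classic()) {
    for (char& c : s) {
        // Each call sees a single code unit, never the whole two-byte 'ó'.
        c = std::toupper(c, loc);
    }
    return s;
}

// "composición" in UTF-8: the 'ó' is the two bytes 0xC3 0xB3,
// so the string holds 11 characters but 12 bytes.
const std::string kComposicion = "composici\xC3\xB3n";
```

Calling upper_bytewise(kComposicion) upper-cases the ASCII letters and leaves the bytes 0xC3 0xB3 alone, which matches the COMPOSICIóN output in the question.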

Éric Malenfant
My sources are encoded as UTF-8. When I try std::toupper('ó', std::locale("es_CO.UTF-8")) I get a warning, "multi-character character constant", and instead of a result I get an error: terminate called after throwing an instance of 'std::bad_cast' what(): std::bad_cast
Sambatyon
As I wrote in my answer: toupper handles strings char by char. 'ó' in UTF-8 is two bytes, so there is no hope that toupper yields anything meaningful for it. You need an encoding in which each of your characters fits in a single code unit. In the present case, I see two choices: Latin1 with char strings, or UTF-16 (or UTF-32, depending on sizeof(wchar_t) on your platform) with wchar_t strings.
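The wchar_t route could look roughly like the sketch below. Both function names are hypothetical, and the decoder is deliberately minimal: it assumes well-formed input, handles only 1- to 3-byte sequences, and skips the validation a real library performs. Whether the facet then maps 'ó' to 'Ó' still depends on the locale you pass in and on your platform, so no output is claimed here.

```cpp
#include <cstddef>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical minimal decoder: UTF-8 in, one wchar_t per code point out.
// Covers U+0000..U+FFFF only; no checks for overlong forms or surrogates.
std::wstring decode_utf8_bmp(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t extra;  // continuation bytes expected after the lead byte
        unsigned long cp;   // code point being assembled
        if (b0 < 0x80)                { extra = 0; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
        else throw std::runtime_error("unsupported or invalid lead byte");
        if (i + extra >= in.size()) throw std::runtime_error("truncated sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return out;
}

// Upper-cases every wide character through the locale's ctype<wchar_t> facet.
// This is still a 1:1 mapping, so it cannot turn 'ß' into "SS".
std::wstring upper_wide(std::wstring s, const std::locale& loc) {
    if (!s.empty()) {
        std::use_facet<std::ctype<wchar_t> >(loc).toupper(&s[0], &s[0] + s.size());
    }
    return s;
}
```

After decoding, "composición" is 11 wide characters and the 'ó' is the single code unit U+00F3, so upper_wide(decode_utf8_bmp(s), loc) at least gives the facet a whole character to work with.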
Éric Malenfant
A: 

In addition to Éric Malenfant's answer: std::locale facets work on a single character at a time. You may get better results with std::wstring, since more characters will be converted, but as you can see it is still not perfect (ß, for example).
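To show why ß needs something other than a character-by-character facet call, here is a hand-rolled sketch that is allowed to emit more characters than it reads. The function name upper_latin1_sketch is hypothetical, and the mapping covers only ASCII and the Latin-1 Supplement letters (for instance, it leaves ÿ untouched). It illustrates the idea and is no substitute for real Unicode case mapping.

```cpp
#include <string>

// Hypothetical sketch: upper-cases ASCII and Latin-1 Supplement letters only.
// Unlike a 1:1 facet call, the output may be longer than the input.
std::wstring upper_latin1_sketch(const std::wstring& in) {
    std::wstring out;
    out.reserve(in.size());
    for (wchar_t c : in) {
        if (c == 0x00DF) {
            out += L"SS";  // U+00DF (sharp s) expands to two characters
        } else if (c >= L'a' && c <= L'z') {
            out += static_cast<wchar_t>(c - (L'a' - L'A'));
        } else if (c >= 0x00E0 && c <= 0x00FE && c != 0x00F7) {
            // U+00E0..U+00FE map to U+00C0..U+00DE; U+00F7 is the division sign
            out += static_cast<wchar_t>(c - 0x20);
        } else {
            out += c;  // everything else is left unchanged
        }
    }
    return out;
}
```

With this, upper_latin1_sketch(L"mei\u00DFen") yields L"MEISSEN" and upper_latin1_sketch(L"composici\u00F3n") yields L"COMPOSICI\u00D3N", the two results the question expected.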

I would suggest trying Boost.Locale (a new library proposed for Boost, not yet part of Boost), which is designed to handle these cases:

http://cppcms.sourceforge.net/boost_locale/docs/

See especially http://cppcms.sourceforge.net/boost_locale/docs/index.html#conversions, which deals with the problem you are describing.

Artyom