ansaurus

Question

C++ unicode question

Answer 1

+1 A:

Formatting dates, times, etc. can be done by specifying a particular locale. As for rolling your own: it is always possible, taking as much or as little from the underlying library as you need.

Also, having looked at the C++0x standard and noticed literals for UTF-8, UTF-16 and UTF-32, does that mean that the standard library (e.g. strings, streams, etc.) will fully support those encodings and the conversion between them?

Yes. But note these are different data types and not your regular wchar sequence or a wstring.
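
To make the "different data types" point concrete, here is a minimal sketch of the C++0x literal prefixes and their string types. The helper name code_units and the constants are made up for illustration, and the sketch covers only literals and strings; it makes no claim about how far stream support for these types goes.

```cpp
#include <cstddef>
#include <string>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1F600 lies outside the Basic Multilingual Plane, so it takes
// 4 UTF-8 code units, 2 UTF-16 code units (a surrogate pair),
// or 1 UTF-32 code unit.
constexpr std::size_t kUtf8Units  = code_units(u8"\U0001F600");
constexpr std::size_t kUtf16Units = code_units(u"\U0001F600");
constexpr std::size_t kUtf32Units = code_units(U"\U0001F600");

// char16_t and char32_t come with their own string typedefs, which are
// distinct types from std::string and std::wstring.
inline std::u16string euro_utf16() { return std::u16string(u"\u20AC"); }
inline std::u32string euro_utf32() { return std::u32string(U"\u20AC"); }
```

Because std::u16string and std::u32string are separate instantiations of std::basic_string, code written against std::wstring does not accept them without an explicit conversion.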

If so, has anyone got any idea how long it will be until Visual Studio supports those features?

To the best of my knowledge: vc9 (VS2008) only has partial support for some TR1 features. vc10 (VS2010) is expected to have better support.

dirkgently 2009-05-07 15:54:18

Yes, but it doesn't format it to a certain encoding. Sure, I could format it to an ASCII string and then encode it, but what if I wanted to use long month names in Chinese, which isn't possible in ASCII?

Fire Lancer 2009-05-07 15:56:05

That's where the encoding part of the locale comes into play. Also, look up facets.
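
As a rough sketch of what that can look like: a wide stream imbued with a locale hands month-name formatting to that locale's time_put facet, so the result is not limited to ASCII. The function names long_month_name and month_of_year are hypothetical, and which named locales exist (for example a Chinese one) depends entirely on the platform.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Returns the full month name for `when` as a wide string, formatted by
// the time_put facet of `loc`. Only tm_mon matters for "%B".
std::wstring long_month_name(const std::tm& when, const std::locale& loc) {
    std::wostringstream out;
    out.imbue(loc);                      // the stream now uses loc's facets
    out << std::put_time(&when, L"%B");  // %B = full month name
    return out.str();
}

// Hypothetical helper: builds a std::tm for a month number, 1 = January.
std::tm month_of_year(int month) {
    std::tm t = std::tm();
    t.tm_mon = month - 1;
    t.tm_mday = 1;
    return t;
}
```

With std::locale::classic() this yields the English name. Passing a named locale such as std::locale("zh_CN.UTF-8") should give the localized name instead, assuming that locale is installed; the constructor throws std::runtime_error if it is not.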

dirkgently 2009-05-07 15:57:15

Yes. The locale functionality is much underutilized. Do not enforce a format on your user. Let the system decide the format; all you have to do is make sure the locale is set correctly, and the stream will then work correctly. (+1)

Martin York 2009-05-07 16:29:22

"But note these are different data types and not your regular wchar sequence or a wstring." So when I create a class with overloaded >> and << operators, I will now have to write an implementation for char, wchar_t and each of the Unicode data types (assuming I don't use templates, since I may not want them in a header, but instead in, say, a DLL)? Or will there be some way to have a "generic" stream type?

Fire Lancer 2009-05-07 16:53:49

No, with C++0x, you'd use those new types and not wchar_t or wstring.

dirkgently 2009-05-07 16:59:43

Answer 2

+1 A:

What I really want is something like ICU but wrapped up in a more friendly manner

Unfortunately, there is no such thing. Their API is not so terrible, though, so you can get used to it with some effort.

Can format time, dates etc in a locale dependent manner (eg dd/mm/yy in the UK and mm/dd/yy in the US).

There is full support for this in the std::locale class; read up on how to use it. You can also specify a locale for a std::iostream so that it formats numbers and dates correctly.
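
A minimal sketch of imbuing a stream with a locale, assuming nothing beyond the standard library. The CommaGrouping facet and the helper names are made up for illustration; in real code you would more likely pass std::locale("") or a named locale and let the system supply the facets.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: groups digits in threes with commas.
struct CommaGrouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const { return ','; }
    std::string do_grouping() const { return "\3"; }
};

// Formats a number using whatever numpunct facet `loc` carries.
std::string format_number(long value, const std::locale& loc) {
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}

// Formats a date using the locale's preferred date representation (%x).
std::string format_date(const std::tm& when, const std::locale& loc) {
    std::ostringstream out;
    out.imbue(loc);
    out << std::put_time(&when, "%x");
    return out.str();
}

// Hypothetical helper: builds a std::tm from a calendar date.
std::tm make_date(int year, int month, int day) {
    std::tm t = std::tm();
    t.tm_year = year - 1900;
    t.tm_mon = month - 1;
    t.tm_mday = day;
    return t;
}
```

The classic "C" locale prints dates as mm/dd/yy and numbers without grouping. A locale built as std::locale(std::locale::classic(), new CommaGrouping) changes only the number grouping, and a named UK locale (if one is installed) would be expected to switch %x to day-first order.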

Easy converting of strings between encodings

std::locale provides facets for converting an 8-bit local encoding to a wide one and back.
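
The facet in question is std::codecvt<wchar_t, char, std::mbstate_t>. Here is a hedged sketch of the narrow-to-wide direction; the function name widen is hypothetical, and the buffer sizing assumes the encoding never produces more wide characters than input bytes, which holds for the common encodings but is not guaranteed in general.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Converts a narrow string in the encoding of `loc` to a wide string,
// using the locale's codecvt facet.
std::wstring widen(const std::string& narrow, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;
    if (narrow.empty())
        return std::wstring();

    const Converter& cvt = std::use_facet<Converter>(loc);

    // Assumption: at most one wchar_t is produced per input byte.
    std::wstring wide(narrow.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r = cvt.in(
        state,
        narrow.data(), narrow.data() + narrow.size(), from_next,
        &wide[0], &wide[0] + wide.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("widen: conversion failed");

    // Trim to the number of wide characters actually written.
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}
```

The reverse direction uses the same facet's out() member with the argument roles swapped. Which 8-bit encoding gets decoded is decided by the locale you pass in, so the result for non-ASCII input depends on the locales available on your system.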

so I can for example make it use UTF-16

ICU uses UTF-16 internally, and on Win32 wchar_t and wstring use UTF-16 as well. Under other OSes, most implementations make wchar_t 32 bits wide, so wstring holds UTF-32.
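
If you need to know which case you are in, a small check along these lines may help. Both function names are made up for illustration, and the sketch assumes 8-bit bytes.

```cpp
#include <cstddef>
#include <string>

// True when wchar_t is 16 bits wide (as on Win32), which implies UTF-16
// with surrogate pairs for code points above U+FFFF.
inline bool wchar_is_16_bit() { return sizeof(wchar_t) == 2; }

// Number of wchar_t units a std::wstring needs to hold U+1F600,
// a code point outside the Basic Multilingual Plane.
inline std::size_t wide_units_for_non_bmp() {
    const std::wstring s = L"\U0001F600";
    return s.size();
}
```

The practical consequence is that wstring::size() counts code units, not characters, so the same text can report different lengths on different platforms.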

Remarks: Support for std::locale is not perfect, but it already gives many tools that are useful for character manipulation.

See: http://www.cplusplus.com/reference/std/locale/

Artyom 2009-05-07 16:12:45

Answer 3

A:

I did my own small wrapper. I can share if you want.

piotr 2009-05-07 16:41:08

Does it support the C++ streams? That is my main issue with ICU, given that I have a very large app I want to make work with Unicode.

Fire Lancer 2009-05-07 16:42:14

Yes, it uses boost::iostreams filters.

piotr 2009-05-26 11:27:56

Answer 4

A:

Tough luck. I know that Dinkumware libraries offer some Unicode support - you may look at the documentation at their web site. AFAIK, it is not free.

Nemanja Trifunovic 2009-05-07 16:44:50

Answer 5

A:

This is how I use ICU to convert between std::string (in UTF-8) and std::wstring:

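#include <string>
#include <vector>

#include <unicode/ustring.h> // ICU: u_strFromWCS, u_strToUTF8, u_strFromUTF8, u_strToWCS

/** Converts a std::wstring into a std::string with UTF-8 encoding.
 */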
template < typename StringT >
StringT utf8 ( std::wstring const & rc_string );

/** Converts a std::string with UTF-8 encoding into a std::wstring.
 */
template < typename StringT >
StringT utf8 ( std::string const & rc_string );

/** Nop specialization for std::string.
 */
template < >
inline std::string utf8 ( std::string const & rc_string )
{
  return rc_string;
}

/** Nop specialization for std::wstring.
 */
template < >
inline std::wstring utf8 ( std::wstring const & rc_string )
{
  return rc_string;
}

template < >
std::string utf8 ( std::wstring const & rc_string )
{
  std::string result;
  if(rc_string.empty())
    return result;

  std::vector<UChar> buffer;

  result.resize(rc_string.size() * 4); // UTF-8 needs at most 4 bytes per code point
  buffer.resize(rc_string.size() * 2); // UTF-16 needs at most 2 code units per code point

  UErrorCode status = U_ZERO_ERROR;
  int32_t len = 0;

  u_strFromWCS(
    &buffer[0],
    buffer.size(),
    &len,
    &rc_string[0],
    rc_string.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strFromWCS failed");
  }
  buffer.resize(len);

  u_strToUTF8(
    &result[0],
    result.size(),
    &len,
    &buffer[0],
    buffer.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strToUTF8 failed");
  }
  result.resize(len);

  return result;
}/* end of utf8 ( ) */


template < >
std::wstring utf8 ( std::string const & rc_string )
{
  std::wstring result;
  if(rc_string.empty())
    return result;

  std::vector<UChar> buffer;

  result.resize(rc_string.size());
  buffer.resize(rc_string.size());

  UErrorCode status = U_ZERO_ERROR;
  int32_t len = 0;

  u_strFromUTF8(
    &buffer[0],
    buffer.size(),
    &len,
    &rc_string[0],
    rc_string.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strFromUTF8 failed");
  }
  buffer.resize(len);

  u_strToWCS(
    &result[0],
    result.size(),
    &len,
    &buffer[0],
    buffer.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strToWCS failed");
  }
  result.resize(len);

  return result;
}/* end of utf8 ( ) */

Using it is as simple as this:

std::string s = utf8<std::string>(std::wstring(L"some string"));
std::wstring ws = utf8<std::wstring>(std::string("some string"));

lothar 2009-05-07 16:55:05

One bug: UTF-8 uses max *4* bytes per character. One incorrect use of a term: UTF-16 uses max 2 *code units* per character.
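
The worst-case figures can be checked with a couple of small helpers. This is just an illustrative sketch (the function names are hypothetical), based on the standard UTF-8 and UTF-16 code point ranges.

```cpp
#include <cassert>
#include <cstddef>

// Number of UTF-8 bytes needed to encode one Unicode code point.
inline std::size_t utf8_length(char32_t cp) {
    assert(cp <= 0x10FFFF);        // upper limit of the Unicode range
    if (cp < 0x80)    return 1;    // U+0000..U+007F
    if (cp < 0x800)   return 2;    // U+0080..U+07FF
    if (cp < 0x10000) return 3;    // U+0800..U+FFFF
    return 4;                      // U+10000..U+10FFFF
}

// Number of UTF-16 code units needed to encode one Unicode code point.
inline std::size_t utf16_length(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;   // above U+FFFF a surrogate pair is needed
}
```

So when sizing an output buffer from a wstring, 4 bytes per wchar_t is the safe bound where wchar_t is 32 bits. Where wchar_t is a UTF-16 code unit, 3 bytes per unit happens to be enough, because the 4-byte case always consumes two units.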

dalle 2010-10-08 09:54:23
