When it comes to internationalization & Unicode, I'm an idiot American programmer. Here's the deal.

#include <string>
using namespace std;

typedef basic_string<unsigned char> ustring;

int main()
{
    static const ustring my_str = "Hello, UTF-8!"; // <== error here
    return 0;
}

This emits a not-unexpected complaint:

cannot convert from 'const char [14]' to 'std::basic_string<_Elem>'

Maybe I've had the wrong portion of coffee today. How do I fix this? Can I keep the basic structure:

ustring something = {insert magic incantation here};

?

+3  A: 

Narrow string literals are defined to be arrays of const char, and there are no unsigned char string literals[1], so you'll have to cast:

ustring s = reinterpret_cast<const unsigned char*>("Hello, UTF-8");

Of course you can put that long thing into an inline function:

inline const unsigned char *uc_str(const char *s){
  return reinterpret_cast<const unsigned char*>(s);
}

ustring s = uc_str("Hello, UTF-8");

Or you can just use basic_string<char> and get away with it 99.9% of the time you're dealing with UTF-8.
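A minimal sketch of that last approach, using nothing beyond the standard &lt;string&gt; header: a plain std::string simply stores the UTF-8 bytes, as long as you remember that one char is one byte, not one character.

#include <iostream>
#include <string>

int main()
{
    // Plain std::string used as an opaque container of UTF-8 bytes; no cast needed.
    const std::string utf8 = "Hello, UTF-8!";

    // size() counts bytes, not code points.
    std::cout << utf8.size() << " bytes\n";
    return 0;
}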

[1] Unless char is unsigned, but whether it is or not is implementation-defined, blah, blah.

Steve M
I *think* this is the answer...
John Dibling
+1  A: 

Using different character types for different encodings has the advantage that the compiler barks at you when you mix them up. The downside is that you have to convert manually.

A few helper functions to the rescue:

inline ustring convert(const std::string& sys_enc) {
  return ustring( sys_enc.begin(), sys_enc.end() );
}

template< std::size_t N >
inline ustring convert(const char (&array)[N]) {
  // N counts the terminating '\0' of a string literal, so drop it.
  return ustring( array, array + N - 1 );
}

inline ustring convert(const char* pstr) {
  // Cast the pointer, not the pointee: char* to const unsigned char*.
  return ustring( reinterpret_cast<const ustring::value_type*>(pstr) );
}
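A quick usage sketch, assuming the ustring typedef from the question and the overloads above:

int main()
{
    std::string sys = "hello";                 // string in the system encoding
    ustring a = convert(sys);                  // std::string overload
    ustring b = convert("Hello, UTF-8!");      // string literal
    return 0;
}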

Of course, all these fail silently and fatally when the string to convert contains anything other than ASCII.

sbi
A: 

Make your life easier: use a UTF-8 string library such as http://utfcpp.sourceforge.net/, or go with std::wstring and use UTF-16. You may be interested in the discussion from another Stack Overflow question: C++ strings: UTF-8 or 16-bit encoding?
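In case it helps, here is a rough sketch of the utfcpp route; it assumes the library's header-only utf8.h API (utf8::is_valid and utf8::utf8to16):

#include <string>
#include <vector>
#include <iterator>
#include "utf8.h"   // utfcpp, header-only

int main()
{
    std::string line = "Hello, UTF-8!";   // bytes assumed to be UTF-8

    // Reject input that is not well-formed UTF-8.
    if (!utf8::is_valid(line.begin(), line.end()))
        return 1;

    // Convert to UTF-16 code units only if you actually need them.
    std::vector<unsigned short> utf16;
    utf8::utf8to16(line.begin(), line.end(), std::back_inserter(utf16));
    return 0;
}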

imafishimafish
Can't use UTF-16. Incoming file is UTF-8.
John Dibling
I guess the next question is, what do you need to do with the data from the file after it's loaded? It may make sense to convert it to UTF-16, or it may be easier and more efficient to keep it as UTF-8.
imafishimafish
@imafishimafish: UTF-16 doesn't really have that many advantages over UTF-8. In fact, the only two I can think of are A) it's Windows' native Unicode encoding, so when you're programming for Windows it makes things easier, and B) when you're using lots of those (CJK?) characters that need three bytes in UTF-8 but only two in UTF-16, UTF-16 needs less memory.
sbi