ansaurus

Question

How to create a UTF-8 string literal in Visual C++ 2008

Answer 1

A:

Brofield,

This article might help you out.

Hope it helps

Everton 2009-03-27 07:07:45

No, it doesn't help at all. I understand what I need to do. I just need to be able to tell VC++ how to do it. i.e. do not change the strings from what they are. Copy them byte for byte into the compiled executable and use them without change.

brofield 2009-03-27 13:14:10

Answer 2

A:

I had a similar problem. The solution was to save the file as UTF-8 without a BOM, using Advanced Save Options.

2009-03-28 15:51:38

Unfortunately this doesn't work for me. I get compile errors as the compiler is then assuming that the source file is in Shift-JIS and so interprets the strings differently.

brofield 2009-03-29 03:26:16

Answer 3

A:

Read the articles. First, you don't want UTF-8. UTF-8 is only a way of representing characters. You want wide characters (wchar_t). You write them as L"yourtextgoeshere". The type of that literal is an array of const wchar_t (which decays to const wchar_t*). If you're in a hurry, just look up wprintf.

2009-03-28 21:22:11

I'm not wanting to convert to wchar because I would just need to convert all the strings back to UTF-8 again. I want VC2008 to leave my string literals unchanged.

brofield 2009-03-29 03:13:09

Answer 4

+5 A:

While it is probably better to use wide strings and convert to UTF-8 as needed, I think your best bet is, as you mentioned, to use hex escapes in the strings. Suppose you wanted code point \uC911; you could just do this:

const char *str = "\xec\xa4\x91";  // U+C911 encoded as UTF-8 (bytes EC A4 91)

I believe this will work just fine. It just isn't very readable, so if you do this, please add a comment to explain.
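
For what it's worth, the bytes for a hex-escaped literal can be computed rather than typed by hand. The following is only a minimal sketch; the helper name encode_utf8 is made up for illustration and is not from this thread. It assumes the input is a Unicode code point and returns an empty string for surrogates and out-of-range values.

```cpp
#include <string>

// Hypothetical helper (not from the thread): encode one Unicode code point
// as UTF-8. Returns an empty string for surrogates and out-of-range values.
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte: ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogate: invalid
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Under these assumptions, encode_utf8(0xC911) yields the three bytes EC A4 91, which is what the hex escapes in the literal have to spell out.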

Evan Teran 2009-03-28 22:12:35

Answer 5

A:

I agree with Theo Vosse. Read the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) on Joel On Software ...

Wacek 2009-03-28 22:40:53

Answer 6

+11 A:

Update:

I've decided that there is no guaranteed way to do this. The solution that I present below works for English version VC2003, but fails when compiling with Japanese version VC2003 (or perhaps it is Japanese OS). In any case, it cannot be depended on to work. Note that even declaring everything as L"" strings didn't work (and is painful in gcc as described below).

Instead I believe that you just need to bite the bullet and move all text into a data file and load it from there. I am now storing and accessing the text in INI files via SimpleIni (cross-platform INI-file library). At least there is a guarantee that it works as all text is out of the program.

Original:

I'm answering this myself since only Evan appeared to understand the problem. The answers regarding what Unicode is and how to use wchar_t are not relevant to this problem, as it is not about internationalization, nor about a misunderstanding of Unicode or character encodings. I appreciate your attempts to help though; apologies if I wasn't clear enough.

The problem is that I have source files that need to be cross-compiled under a variety of platforms and compilers. The program does UTF-8 processing. It doesn't care about any other encodings. I want to have string literals in UTF-8 like currently works with gcc and vc2003. How do I do it with VC2008? (i.e. backward compatible solution).

This is what I have found:

gcc (v4.3.2 20081105):

string literals are used as is (raw strings)
supports UTF-8 encoded source files
source files must not have a UTF-8 BOM

vc2003:

string literals are used as is (raw strings)
supports UTF-8 encoded source files
source files may or may not have a UTF-8 BOM (it doesn't matter)

vc2005+:

string literals are massaged by the compiler (no raw strings)
char string literals are re-encoded to a specified locale
UTF-8 is not supported as a target locale
source files must have a UTF-8 BOM

So, the simple answer is that for this particular purpose, VC2005+ is broken and does not supply a backward compatible compile path. The only way to get Unicode strings into the compiled program is via UTF-8 + BOM + wchar which means that I need to convert all strings back to UTF-8 at time of use.

There isn't any simple cross-platform method of converting wchar to UTF-8, for instance, what size and encoding is the wchar in? On Windows, UTF-16. On other platforms? It varies. See the ICU project for some details.
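
To illustrate why the wchar_t size matters, here is a rough sketch of a hand-rolled conversion. It is not a general solution: it assumes wchar_t holds UTF-16 when it is 16 bits wide and UTF-32 otherwise, which is an assumption and not something the C++ standard guarantees. The names append_utf8 and wide_to_utf8 are invented for this example, and anything invalid is replaced with U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point to a UTF-8 string.
// Surrogates and values above U+10FFFF are replaced with U+FFFD.
static void append_utf8(std::string& out, unsigned long cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert a wide string to UTF-8.
// ASSUMPTION: 16-bit wchar_t means UTF-16, anything wider means UTF-32.
std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        if (sizeof(wchar_t) == 2) {
            cp &= 0xFFFF;
            // Combine a high surrogate with the low surrogate that follows it.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
                unsigned long lo = static_cast<unsigned long>(in[i + 1]) & 0xFFFF;
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                    ++i;
                }
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Returning a std::string by value also avoids sharing a buffer between calls, at the cost of an allocation per call.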

In the end I decided that I will avoid the conversion cost on all compilers other than vc2005+ with source like the following.

#if defined(_MSC_VER) && _MSC_VER > 1310
// Visual C++ 2005 and later require the source files in UTF-8, and all strings
// to be encoded as wchar_t, otherwise the strings will be converted into the
// local multibyte encoding and cause errors. To use a wchar_t string as UTF-8,
// it then needs to be converted back to UTF-8. This function is just a rough
// example of how to do this.
# include <windows.h>  // WideCharToMultiByte, CP_UTF8
# define utf8(str)  ConvertToUTF8(L##str)
const char * ConvertToUTF8(const wchar_t * pStr) {
    static char szBuf[1024];
    WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
    return szBuf;
}
#else
// Visual C++ 2003 and gcc will use the string literals as is, so the files 
// should be saved as UTF-8. gcc requires the files to not have a UTF-8 BOM.
# define utf8(str)  str
#endif

Note that this code is just a simplified example. Production use would need to clean it up in a variety of ways (thread-safety, error checking, buffer size checks, etc).

This is used like the following code. It compiles cleanly and works correctly in my tests on gcc, vc2003, and vc2008:

std::string mText;
mText = utf8("Chinese (Traditional)");
mText = utf8("中国語 (繁体)");
mText = utf8("중국어 (번체)");
mText = utf8("Chinês (Tradicional)");

brofield 2009-03-30 04:52:22

Answer 7

+1 A:

Brofield,

I had the exact same problem and just stumbled on a solution that doesn't require converting your source strings to wide chars and back: save your source file as UTF-8 without signature and VC2008 will leave it alone. Worked great when I figured out to drop the signature. To sum up:

Saving as "Unicode (UTF-8 without signature) - Codepage 65001" doesn't trigger warning C4566 in VC2008 and doesn't cause VC to mess with the encoding, while "Unicode (UTF-8 with signature) - Codepage 65001" does trigger C4566 (as you have found).
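
The "signature" here is the UTF-8 BOM, i.e. the three bytes EF BB BF at the very start of the file. If you want to check or strip it outside the IDE, something like the following sketch could be used on the file contents. The helper names has_utf8_bom and strip_utf8_bom are hypothetical, and the functions assume the whole file has already been read into a std::string.

```cpp
#include <string>

// Hypothetical helper: true if the buffer starts with the UTF-8
// signature (BOM), i.e. the bytes EF BB BF.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: return the buffer without a leading BOM, if any.
std::string strip_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}
```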

Hope that's not too late to help you, but it might speed up your VC2008 app to remove your workaround.

2009-09-01 01:51:18

Answer 8

+1 A:

How about this? You store the strings in a UTF-8 encoded file and then preprocess them into an ASCII encoded C++ source file. You keep the UTF-8 encoding inside the string by using hexadecimal escapes. The string

"中国語 (繁体)"

is converted to

"\xE4\xB8\xAD\xE5\x9B\xBD\xE8\xAA\x9E (\xE7\xB9\x81\xE4\xBD\x93)"

Of course this is unreadable by any human, and the purpose is just to avoid problems with the compiler.

You could either use the C++ preprocessor to reference the strings in the converted header file, or you could convert your entire UTF-8 source into ASCII before compilation using this trick.
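
As a rough sketch of what such a preprocessing step might look like, the function below turns raw UTF-8 bytes into the text of an ASCII-only string literal. The name to_escaped_literal is invented for this example. One detail it has to handle: a \x escape consumes every hex digit that follows it, so the literal is closed and reopened whenever an escaped byte is followed by a character such as 'a' or '3'. Trigraphs and other corner cases are not handled here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: turn raw UTF-8 bytes into the text of an
// ASCII-only C++ string literal that uses \x escapes.
std::string to_escaped_literal(const std::string& utf8) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out = "\"";
    bool prev_was_escape = false;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c >= 0x80 || c < 0x20 || c == 0x7F) {
            // Non-ASCII or control byte: emit it as \xHH.
            out += "\\x";
            out += hex[c >> 4];
            out += hex[c & 0x0F];
            prev_was_escape = true;
        } else {
            // A \x escape swallows every following hex digit, so close and
            // reopen the literal before a hex-digit character.
            if (prev_was_escape && std::isxdigit(c)) out += "\" \"";
            if (c == '"' || c == '\\') out += '\\';
            out += static_cast<char>(c);
            prev_was_escape = false;
        }
    }
    out += '"';
    return out;
}
```

Adjacent string literals are concatenated by the compiler, so the split form produces the same bytes as a single literal would.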

Martin Liversage 2009-09-15 02:32:00

Answer 9

A:

Maybe try an experiment:

#pragma setlocale(".UTF-8")

or:

#pragma setlocale("english_england.UTF-8")

Windows programmer 2009-09-15 03:15:51

Answer 10

A:

Hi,

I had a similar problem. My UTF-8 string literals were converted to the current system codepage during compilation - I just opened .obj files in a hex-viewer and they were already mangled. For instance, character ć was just one byte.

The solution for me was to save in UTF-8 and WITHOUT BOM. That's how I tricked the compiler. It now thinks that's just a normal source, and does not translate strings. In .obj files ć is now two bytes.
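
If you'd rather check this at runtime than open the .obj file in a hex viewer, a small byte dump along these lines should do. This is only a sketch, and the helper name hex_bytes is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render each byte of a string as two lowercase hex
// digits separated by spaces, so the compiled form of a literal can be
// inspected.
std::string hex_bytes(const std::string& s) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += hex[c >> 4];
        out += hex[c & 0x0F];
    }
    return out;
}
```

Passing a literal containing ć should give "c4 87" if the compiler left the UTF-8 bytes alone; a single byte would suggest the literal was re-encoded to the system codepage.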

Disregard some commentators, please. I understand what you want - I want the same too: UTF-8 source, UTF-8 generated files, UTF-8 input files, UTF-8 over communication lines without ever translating.

Maybe this helps...

Daniel N. 2009-12-18 13:10:48

Good that it works for you. I believe that there are problems down that route if you are using a non-English system locale. I have a Japanese compiler and Japanese system locale, and this didn't work for me as it seemed to try to convert the string literals from Shift-JIS which failed because they were UTF-8.

brofield 2009-12-21 08:30:44

Answer 11

A:

File/Advanced Save Options/Encoding: "Unicode (UTF-8 without signature) - Codepage 65001"

TANKS

Vladius 2010-03-09 19:06:01

Try compiling with a Japanese version of the compiler.

brofield 2010-03-11 01:35:05

You say it doesn't work "without signature". That is surely very strange, since the compiler wouldn't recognize the input as UTF-8 without performing additional processing. You say that the Japanese version does perform such logic; very interesting. The trick works for Russian nonetheless.

Vladius 2010-03-12 15:34:55

The trick should obviously work with any encoding that leaves the ASCII part intact. That is, UTF-8, ISO-8859-x, KOI8-R and many others.

J-mster 2010-09-16 18:41:09
