Short answer:
No conversion is required if you use Unicode strings such as CString or wstring: use sqlite3_open16(). You just have to make sure you pass a WCHAR pointer, cast to void *, to the API. For a CString that looks like: (void*)(LPCWSTR)strFilename. (Seems lame! Even though this lib is cross-platform, they could have defined a wide-char type that depends on the platform and is less unfriendly than a void *.)
The longer answer:
You don't have a Unicode string that you want to convert to UTF-8 or UTF-16; you have a Unicode string represented in your program using a given encoding. Unicode is not a binary representation per se: encodings specify how the Unicode code points (numerical values) are laid out in memory (the binary layout of the number). UTF-8 and UTF-16 are the most widely used encodings, but they are very different.
When a VS project says "Unicode character set", it actually means "characters are encoded as UTF-16". Therefore, you can use sqlite3_open16() directly; no conversion required. Characters are stored in the WCHAR type (as opposed to char), which takes 16 bits. WCHAR falls back on the standard C type wchar_t, which takes 16 bits on Win32 but might be a different size on other platforms. (Thanks for the correction, Checkers.)
There's one more detail that you might want to pay attention to: UTF-16 exists in two flavors, big-endian and little-endian. That's the byte ordering of those 16-bit units. The function prototype you give for UTF-16 doesn't say which ordering is used, but you're safe here: SQLite expects UTF-16 in the machine's native byte order, and on Windows (x86) that is little-endian.
EDIT: Answer to comment by Checkers:
UTF-16 uses 16-bit code units; under Win32, wchar_t is used for such a storage unit. The trick is that some Unicode characters require a sequence of two such 16-bit units. They are called surrogate pairs.
In the same way, UTF-8 represents one character using a sequence of 1 to 4 bytes, yet UTF-8 strings use the char type.