views:

13193

answers:

5

How do I convert a Unicode string into a UTF-8 or UTF-16 string? My VS2005 project uses the Unicode character set, while SQLite's C/C++ interface provides

int sqlite3_open(
  const char *filename,   /* Database filename (UTF-8) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);
int sqlite3_open16(
  const void *filename,   /* Database filename (UTF-16) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);

for opening a database. How can I convert a string, CString, or wstring into the UTF-8 or UTF-16 charset?

Thanks very much!

A: 

WideCharToMultiByte is your friend. :-)

Chris Jester-Young
+4  A: 

Use the WideCharToMultiByte function. Specify CP_UTF8 for the CodePage parameter.

CHAR buf[256]; // or whatever
WideCharToMultiByte(
  CP_UTF8,
  0,
  StringToConvert, // the wide string you have
  -1,              // length of the string; -1 means it is null-terminated
  buf,             // output buffer
  _countof(buf),   // size of the output buffer in bytes; if you pass zero, the return value is the length required
  NULL,
  NULL
);

Also, the default encoding for Unicode apps on Windows is UTF-16LE, so you might not need to perform any conversion at all; just use the second function, sqlite3_open16().
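For readers without the Win32 API at hand, here is a minimal portable sketch of what WideCharToMultiByte(CP_UTF8, ...) computes; it is an illustration only (it uses char16_t rather than the Win32 WCHAR, and omits the validation of unpaired surrogates that a real converter needs):

```cpp
#include <cstdint>
#include <string>

// Minimal UTF-16 -> UTF-8 converter, mirroring what
// WideCharToMultiByte(CP_UTF8, ...) computes. No error handling
// for unpaired surrogates; a real converter must validate input.
std::string utf16_to_utf8(const std::u16string &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        uint32_t cp = in[i];
        // Combine a surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {                       // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, U+00E9 ("é") comes out as the two bytes C3 A9, and a code point outside the BMP such as U+1F600 takes two UTF-16 units in and produces four UTF-8 bytes out.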

1800 INFORMATION
I wouldn't recommend a fixed buffer; instead, use a dynamically allocated buffer (e.g., std::vector), expanding as necessary (when WideCharToMultiByte tells you the buffer is too small).
Chris Jester-Young
I have to disagree: you show how to convert from UTF-16 to UTF-8. That isn't what the OP needs, since there is already a function that accepts wide-char strings: sqlite3_open16(). IMO, the correct answer is: use sqlite3_open16().
Serge - appTranslator
@Chris that was why I said "or whatever" and put the comment on the output buffer size - I didn't want to complicate matters too much
1800 INFORMATION
A: 

UTF-8 and UTF-16 are both "Unicode" character encodings. What you are probably talking about is UTF-32, which is a fixed-size character encoding. Searching for

"Convert UTF-32 into UTF-8 or UTF-16"

may turn up some results or papers on this.

Johannes Schaub - litb
+3  A: 

All the C++ string types are charset-neutral: they settle on a character width and make no further assumptions. A wstring uses 16-bit characters on Windows, corresponding roughly to UTF-16, but it still depends on what you store in the string. The wstring doesn't in any way enforce that the data you put in it is valid UTF-16. Windows uses UTF-16 when UNICODE is defined, though, so most likely your strings are already UTF-16 and you don't need to do anything.

A few others have suggested the WideCharToMultiByte function, which is (one of) the way(s) to convert UTF-16 to UTF-8. But since SQLite can handle UTF-16 directly, that shouldn't be necessary.

jalf
+5  A: 

Short answer:

No conversion is required if you use Unicode strings such as CString or wstring. Just use sqlite3_open16(). You only have to make sure you pass a WCHAR pointer cast to void * to the API (which seems lame: even though the lib is cross-platform, they could have defined a wide-char type per platform that is friendlier than a void *). For a CString, that would be: (void*)(LPCWSTR)strFilename

The longer answer:

You don't have "a Unicode string" that you want to convert to UTF-8 or UTF-16. You have a Unicode string represented in your program using some given encoding: Unicode is not a binary representation per se. Encodings define how Unicode code points (numerical values) are represented in memory (the binary layout of the numbers). UTF-8 and UTF-16 are the most widely used encodings, and they are very different from each other.

When a VS project says "Unicode charset", it actually means "characters are encoded as UTF-16". So you can use sqlite3_open16() directly; no conversion is required. Characters are stored in the WCHAR type (as opposed to char), which takes 16 bits (it falls back on the standard C type wchar_t, which takes 16 bits on Win32 but might be different on other platforms; thanks for the correction, Checkers).

There's one more detail you might want to pay attention to: UTF-16 comes in two flavors, big endian and little endian; that's the byte ordering within those 16-bit units. The function prototype you give for UTF-16 doesn't say which ordering is used, but you're pretty safe assuming that SQLite uses the same endianness as Windows (little endian).

EDIT: Answer to comment by Checkers:

UTF-16 uses 16-bit code units; under Win32, wchar_t is that storage unit. The trick is that some Unicode characters require a sequence of two such 16-bit units, called a surrogate pair.

In the same way, UTF-8 represents one character using a sequence of 1 to 4 bytes, yet UTF-8 strings are stored in the char type.

Serge - appTranslator
No, no, no! sqlite3_open16() takes a 'void*' argument because it is specified as UTF-16, *NOT* wchar_t, which has a different size on different platforms and may or may not hold UTF-16 (e.g. glibc has a 4-byte wchar_t).
Alex B
Checkers: see the EDIT in my answer above.
Serge - appTranslator
Yes, I am aware of the UTF-16 representation. But you cannot assume that the internal representation of wchar_t is the same on all platforms; it is not.
Alex B
"The ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation. It only requires that this type is capable of storing all elements of the basic character set." http://www.gnu.org/software/libtool/manual/libc/Extended-Char-Intro.html
Alex B
For example, on my system, where sizeof(wchar_t)==4, L"aaa" compiles as 61 00 00 00 61 00 00 00 61 00 00 00 (UTF32-LE)
Alex B
OK. You're right. I stand corrected. Now, does your platform support VS2005, as stated by the OP?
Serge - appTranslator
Actually, I would say that UTF-16 uses 16-bit codes (not characters), just as UTF-8 uses 8-bit (octet) codes. A Unicode character code (up to 21 bits) requires one UTF-16 code for commonly used characters but two (called a surrogate pair) for others.
orcmid
orcmid, you're right. I used 'character' in its programming-type sense, which might be misleading.
Serge - appTranslator