tags:

views:

95

answers:

4

I'm trying to store a wchar_t string as octets, but I'm positive I'm doing it wrong - would anybody mind validating my attempt? What happens when one character consumes 4 bytes?

  unsigned int i;
  const wchar_t *wchar1 = L"abc";
  wprintf(L"%ls\r\n", wchar1);

  for (i=0;i< wcslen(wchar1);i++) {
    printf("(%d)", (wchar1[i]) & 255);
    printf("(%d)", (wchar1[i] >> 8) & 255);
  }
+1  A: 

You can use the wcstombs() (widechar string to multibyte string) function provided in stdlib.h

The prototype is as follows:

#include <stdlib.h>

size_t wcstombs(char *dest, const wchar_t *src, size_t n);

It will correctly convert the wchar_t string provided via src into a char (a.k.a. octet) string and write it to dest, using at most n bytes.

wchar_t wide_string[] = L"Hellöw, Wörld! :)";
char mb_string[512]; /* Might want to calculate a better, more realistic size! */
size_t i, length;

memset(mb_string, 0, sizeof mb_string);
length = wcstombs(mb_string, wide_string, 511);

/* mb_string will be zero terminated unless conversion stopped by reaching the
 * limit before finishing. If the limit WAS reached, the string will not be
 * zero terminated and you must do it yourself - not happening here. Note that
 * wcstombs() returns (size_t)-1 if it meets an unconvertible character. */

for (i = 0; i < length; i++)
   printf("Octet #%zu: '%02x'\n", i, (unsigned char)mb_string[i]);
LukeN
Thanks for the fast response and your samples, I'm trying to understand whether wcstombs() transparently 'converts' invalid sequences, do you happen to know if it does?
Doori Bar
If it encounters a wide character that cannot be converted to a multibyte representation, `wcstombs()` will return `(size_t)-1`, so nothing happens without you knowing :)
LukeN
@LukeN: That's what I was afraid of. I'm not sure what it defines as a multibyte representation - a different encoding, such as UTF-8? (My goal is to transparently store a 2-byte wchar_t - UTF-16LE - under Windows.)
Doori Bar
I think C defines a multibyte string as just that - characters that don't fit into 1 byte will be split over multiple bytes and can be "unsplit" again later. I don't know whether or not it's clean UTF-8 when it comes out of that function :( But if you can't be sure that your widechar string contains usable characters, how should C (or in a do-it-yourself attempt: you) know HOW to split them to octets?
LukeN
Thanks a lot, apparently my sample is correct - I just commented on it
Doori Bar
The behaviour of wcstombs() depends on the LC_CTYPE category of the current locale. See setlocale().
ninjalj
A: 

If you're trying to see the content of the memory buffer holding the string, you can do this:

  size_t i, len = wcslen(str) * sizeof(wchar_t);
  const char *ptr = (const char *)str;
  for (i = 0; i < len; i++) {
    printf("(%u)", (unsigned char)ptr[i]);
  }
Amnon
That's C++. The asker is looking for C.
LukeN
@LukeN, thanks. I hope it's C now.
Amnon
the idea can be used in C too; the only C++ism was the reinterpret_cast, now changed to a C-style cast ...
ShinTakezou
if wcslen() counts wchar_t units, and sizeof(wchar_t) is 2 bytes under Windows - how could this possibly handle codepoints which consume 4 bytes?
Doori Bar
@Doori Bar: it can't. Code points are 2 bytes in Windows Unicode encoding, otherwise wchar_t would have been bigger.
Amnon
I assume this is one of the flaws of UTF-16, which Microsoft picked as its Unicode default.
LukeN
@Amnon: I thought NTFS was UTF-16LE, while it's being represented as wchar_t, which is 2 bytes under Windows. So NTFS is not UTF-16LE?
Doori Bar
@LukeN: I'm not sure if I understand what the flaw is?
Doori Bar
In my opinion, UTF-16 is the worst UTF to pick. With UTF-8 every ASCII character stays a single byte - it is compatible with ASCII (very important!) and doesn't waste space where it doesn't have to (a character uses 1 byte if it fits, 2 bytes if it needs them, and so on up to 4). UTF-32 assures you that each and every character is exactly 4 bytes; it's easier to handle but wastes space. UTF-16 wastes space whenever a character would fit into 1 byte (it still uses 2) and STILL has to split bigger characters that don't fit into two 16-bit units.
LukeN
@Doori: I don't know. I'm not sure what NTFS has got to do with it.
Amnon
@LukeN I supposed the word "cast" should suggest what the "construct" tries to do, but of course it could also happen differently
ShinTakezou
Thanks a lot guys, apparently my sample is correct - I just commented on it
Doori Bar
+3  A: 

Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the latter has a fixed size per code point. UTF-16 uses surrogate pairs for code points in the upper planes; such a code point uses 2 wchar_t. UTF-8 is byte oriented, it uses between 1 and 4 bytes to encode a code point.

UTF-8 is an excellent choice if you need to transcode the text to a byte oriented stream. A very common choice for text files and HTML encoding on the Internet. If you use Windows then you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.

Be careful to avoid byte encodings that translate text to a code page, such as wcstombs(). They are lossy: glyphs that don't have a corresponding character code in the code page are replaced by `?`.

Hans Passant
Thanks for the clarification of: "such a glyph uses 2 wchar_t.", I suppose in addition to the comment of mine on the original question, I'll accept your answer.
Doori Bar
A: 

I don't know why printf and wprintf do not work together. The following code works.

unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);

for(i=0; i<wcslen(wchar1); i++)
{   
    wprintf(L"(%d)", (wchar1[i]) & 255);
    wprintf(L"(%d)", (wchar1[i] >> 8) & 255);
}   
Siddique