tags:

views:

95

answers:

4

I'm trying to store a wchar_t string as octets, but I'm positive I'm doing it wrong - would anybody mind validating my attempt? What happens when one character consumes 4 bytes?

  unsigned int i;
  const wchar_t *wchar1 = L"abc";
  wprintf(L"%ls\r\n", wchar1);

  for (i=0;i< wcslen(wchar1);i++) {
    printf("(%d)", (wchar1[i]) & 255);
    printf("(%d)", (wchar1[i] >> 8) & 255);
  }
+1  A: 

You can use the wcstombs() (widechar string to multibyte string) function provided in stdlib.h

The prototype is as follows:

#include <stdlib.h>

size_t wcstombs(char *dest, const wchar_t *src, size_t n);

It will correctly convert the wchar_t string provided via src into a char (a.k.a. octet) string and write it to dest, using at most n bytes.

wchar_t wide_string[] = L"Hellöw, Wörld! :)";
char mb_string[512]; /* Might want to calculate a better, more realistic size! */
size_t i, length;

memset(mb_string, 0, sizeof mb_string);
length = wcstombs(mb_string, wide_string, 511);

/* mb_string will be zero terminated unless conversion stopped by reaching the
 * limit before finishing. If the limit WAS reached, the string will not be
 * zero terminated and you must do it yourself - not happening here. Note that
 * wcstombs() returns (size_t)-1 if it meets an unconvertible character. */

for (i = 0; i < length; i++)
   printf("Octet #%zu: '%02x'\n", i, (unsigned char)mb_string[i]);
LukeN
Thanks for the fast response and your samples, I'm trying to understand whether wcstombs() transparently 'converts' invalid sequences, do you happen to know if it does?
Doori Bar
If it encounters a wide character that cannot be converted to a multibyte representation, `wcstombs()` will return `(size_t)-1`, so nothing happens without you knowing :)
LukeN
@LukeN: That's what I was afraid of. I'm not sure what it defines as a multibyte representation - a different encoding, such as UTF-8? (My goal is to transparently store a 2-byte wchar_t - UTF-16LE - under Windows.)
Doori Bar
I think C defines a multibyte string as just that - characters that don't fit into 1 byte will be split over multiple bytes and can be "unsplit" again later. I don't know whether or not it's clean UTF-8 when it comes out of that function :( But if you can't be sure that your widechar string contains usable characters, how should C (or in a do-it-yourself attempt: you) know HOW to split them to octets?
LukeN
Thanks a lot, apparently my sample is correct - I just commented on it
Doori Bar
The behaviour of wcstombs() depends on the LC_CTYPE category of the current locale. See setlocale().
ninjalj
A: 

If you're trying to see the content of the memory buffer holding the string, you can do this:

  size_t i, len = wcslen(str) * sizeof(wchar_t);
  const char *ptr = (const char *)str;
  for (i = 0; i < len; i++) {
    printf("(%u)", (unsigned char)ptr[i]);
  }
Amnon
That's C++. The asker is looking for C.
LukeN
@LukeN, thanks. I hope it's C now.
Amnon
the idea can be used in C too; the only C++ism was the reinterpret_cast, now changed to a C-style cast ...
ShinTakezou
if wcslen() counts wchar_t units, and sizeof(wchar_t) is 2 bytes under Windows - how could this possibly handle codepoints which consume 4 bytes?
Doori Bar
@Doori Bar: it can't. Code points are 2 bytes in Windows Unicode encoding, otherwise wchar_t would have been bigger.
Amnon
I assume this is one of the flaws of UTF-16, which Microsoft picked as its Unicode default.
LukeN
@Amnon: I thought NTFS was UTF-16LE, while it's being represented as wchar_t, which is 2 bytes under Windows. So NTFS is not UTF-16LE?
Doori Bar
@LukeN: I'm not sure if I understand what the flaw is?
Doori Bar
In my opinion, UTF-16 is the worst UTF to pick. With UTF-8 every ASCII character stays a single byte - it is compatible with ASCII (very important!) and doesn't waste space where it doesn't have to (a character uses 1 byte if it fits, 2 bytes if it needs them, and so on up to 4). UTF-32 assures you that each and every character is exactly 4 bytes; it's easier to handle but wastes space. UTF-16 wastes space whenever a character would fit into 1 byte (it still uses 2) and STILL has to split bigger characters that don't fit into two 16-bit units.
LukeN
@Doori: I don't know. I'm not sure what NTFS has got to do with it.
Amnon
@LukeN I supposed the word "cast" should suggest what the "construct" tries to do, but of course it could also happen differently
ShinTakezou
Thanks a lot guys, apparently my sample is correct - I just commented on it
Doori Bar
+3  A: 

Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the latter has a fixed size per code point. UTF-16 uses surrogate pairs for code points in the upper planes; such a code point uses 2 wchar_t. UTF-8 is byte oriented, it uses between 1 and 4 bytes to encode a code point.

UTF-8 is an excellent choice if you need to transcode the text to a byte oriented stream. A very common choice for text files and HTML encoding on the Internet. If you use Windows then you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.

Be careful to avoid byte encodings that translate text to a code page, such as wcstombs(). They are lossy: glyphs that don't have a corresponding character code in the code page are replaced by `?`.

Hans Passant
Thanks for the clarification of: "such a glyph uses 2 wchar_t.", I suppose in addition to the comment of mine on the original question, I'll accept your answer.
Doori Bar
A: 

I don't know why printf and wprintf do not work together. The following code works.

unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);

for(i=0; i<wcslen(wchar1); i++)
{   
    wprintf(L"(%d)", (wchar1[i]) & 255);
    wprintf(L"(%d)", (wchar1[i] >> 8) & 255);
}   
Siddique