Hello. This is an ANSI C question. I have the following code.

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

    int main(void)
    {
        wint_t c;

        if (!setlocale(LC_CTYPE, "")) {
            printf("Can't set the specified locale! "
                   "Check LANG, LC_CTYPE, LC_ALL.\n");
            return -1;
        }
        while ((c = getwc(stdin)) != WEOF) {
            printf("%lc", c);
        }
        return 0;
    }

I need full UTF-8 support, but even at this simple level, can I improve this somehow? Why is wint_t used, and not wchar_t, with appropriate changes?

+3  A: 

wint_t is capable of storing any valid value of wchar_t. A wint_t is also capable of holding the result of evaluating the WEOF macro (a wchar_t may be too narrow to hold that result, or the result may collide with a valid character value).

Brandon E Taylor
Ok, thanks. So, in brief: when is it better to use wchar_t then? Why not always use wint_t?
Dervin Thunk
+3  A: 

UTF-8 is one possible encoding for Unicode. It uses 1 to 4 bytes per code point (1 to 3 bytes for characters in the Basic Multilingual Plane). When you read such a character through getwc(), it fetches those bytes and composes them into a single wide-character value, which fits in a wchar_t (a type at least 16 bits wide).

But since every value in that 16-bit range, 0x0000 through 0xFFFF, can be a valid character, there are no values left over to return condition or error codes in.

One such code is end-of-file (WEOF), which is typically defined as (wint_t)-1. If you were to put the return value of getwc() into a wchar_t, there would be no way to distinguish it from the character 0xFFFF (which, BTW, is a reserved noncharacter anyway, but I digress).

So the answer is to use a wider type, wint_t (or int), which on most platforms is at least 32 bits. That leaves the lower 16 bits for the real value, and anything with a bit set outside that range means something other than a character was returned.

Why don't we always use wchar_t instead of wint_t, then? Most string-related functions use wchar_t because it can be smaller than wint_t (half the size on some platforms), so strings have a smaller memory footprint.

lavinio
A UTF-8 character can be 4 bytes long; technically a sequence can even take 5 or 6 bytes, but such sequences are not valid UTF-8.
quinmars
Well, true. It can be 4 bytes long if you go into the supplementary-plane characters at 0x10000 and higher, but that gets into surrogates when dealing with UTF-16, and I thought it outside the scope of the question. And while 5- and 6-byte sequences are syntactically possible, every valid code point can be expressed in 4 bytes or fewer; longer sequences are only generated by poor-quality serializers.
lavinio
Your answer is mostly correct, but you provide too many platform-dependent details. `wchar_t` is _not_ always 16 bits; I can think of at least two OS/compiler combinations where it's 32.
Logan Capaldo
Thanks. I was referring to the character itself needing 16 bits, but I can see now the ambiguity. Clarified, and also for wint_t.
lavinio