The printf/fprintf/sprintf family supports a width field in its format specifier. I have a question about the case of (non-wide) char array arguments:

Is the width field supposed to mean bytes or characters?

What is the correct (or de facto) behaviour if the char array corresponds to (say) a raw UTF-8 string? (I know that normally I should use some wide char type; that's not the point)

For example, in

char s[] = "ni\xc3\xb1o";  // utf8 encoded "niño"
fprintf(f,"%5s",s);

Is that function supposed to try to output just 5 bytes (plain C chars), leaving you responsible for misalignment or other problems if two of those bytes form one textual character?

Or is it supposed to try to compute the length of the array in "textual characters" (decoding it according to the current locale)? In the example, this would amount to finding out that the string has 4 Unicode chars, so it would add a space for padding.

UPDATE: I agree with the answers; it is logical that the printf family doesn't distinguish plain C chars from bytes. The problem is that my glibc does not seem to fully respect this notion if the locale has been set previously, and if one has the (today most used) LANG/LC_CTYPE=en_US.UTF-8.

Case in point:

#include <stdio.h>
#include <locale.h>

int main(void) {
        char *locale = setlocale(LC_ALL, ""); /* I have LC_CTYPE="en_US.UTF-8" */
        char s[] = {'n','i', 0xc3,0xb1,'o',0}; /* "niño" in utf8: 5 bytes, 4 unicode chars */
        printf("|%*s|\n",6,s); /* this should pad a blank - works ok */
        printf("|%.*s|\n",4,s); /* this should eat a char - works ok */
        char s3[] = {'A',0xb1,'B',0}; /* this is not valid UTF8 */
        printf("|%s|\n",s3);     /* print raw chars - ok */
        printf("|%.*s|\n",15,s3);     /* panics (why???) */
        return 0;
}

So, even when a non-POSIX/C locale has been set, printf still has the right notion for counting width: bytes (plain C chars), not Unicode chars. That's fine. However, when given a char array that is not decodable in its locale, it silently fails (it aborts the output: nothing is printed after the first '|', with no error message)... but only if it needs to count some width. I don't understand why it even tries to decode the string from UTF-8 when it doesn't need to. Is this a bug in glibc?
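
Incidentally, one way to observe the failure programmatically (a minimal sketch of my own; on the affected glibc, printf should return a negative value, presumably with errno set to EILSEQ for the encoding error):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <locale.h>

int main(void) {
        setlocale(LC_ALL, "");                 /* e.g. LC_CTYPE=en_US.UTF-8 */
        char s3[] = {'A', (char)0xb1, 'B', 0}; /* not valid UTF-8 */
        errno = 0;
        int n = printf("|%.*s|\n", 15, s3);
        if (n < 0)                             /* the affected glibc returns -1 here, */
                fprintf(stderr, "printf failed: %s\n", strerror(errno)); /* presumably EILSEQ */
        return 0;
}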

Tested with glibc 2.11.1 (Fedora 12), and also with glibc 2.3.6.

Note: it's not related to terminal display issues - you can check the output by piping to od:

$ ./a.out | od -t cx1

Here's my output:

0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a
0000020   |   A 261   B   |  \n   |
         7c  41  b1  42  7c  0a  7c
+2  A: 

Bytes (chars). There is no built-in support for Unicode semantics. You can imagine it as resulting in at least 5 calls to fputc.
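
For illustration only (a minimal sketch of that idea, not glibc's actual code), the byte-counting behaviour of "%5s" amounts to something like:

#include <stdio.h>
#include <string.h>

/* Sketch: roughly what fprintf(f, "%5s", s) does if width counts bytes. */
void put_padded(FILE *f, const char *s, size_t width)
{
    size_t len = strlen(s);            /* length in bytes, not code points */
    for (size_t i = len; i < width; i++)
        fputc(' ', f);                 /* right-justify: pad on the left */
    while (*s)
        fputc((unsigned char)*s++, f); /* one fputc per byte */
}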

Matthew Flaschen
Seems right. See my update, however.
leonbloy
+3  A: 

It will result in five bytes being output. And five chars. In ISO C, there is no distinction between chars and bytes. Bytes are not necessarily 8 bits, instead being defined as the width of a char.

The ISO term for an 8-bit value is an octet.

Your "niño" string is actually five characters wide in terms of the C environment (sans the null terminator, of course). If only four symbols show up on your terminal, that's almost certainly a function of the terminal, not C's output functions.

I'm not saying a C implementation couldn't handle Unicode. It could quite easily do UTF-32 if CHAR_BIT were defined as 32. UTF-8 would be harder since it's a variable-length encoding, but there are ways around almost any problem :-)


Based on your update, it seems like you might have a problem. However, I'm not seeing your described behaviour in my setup with the same locale settings. In my case, I'm getting the same output in those last two printf statements.

If your setup is just stopping output after the first | (I assume that's what you mean by abort; if you meant the whole program aborts, that's much more serious), I would raise the issue with GNU (try your particular distribution's bug procedures first). You've done all the important work, such as producing a minimal test case, so someone may even be happy to run it against the latest version if your distribution doesn't quite get there (most don't).


As an aside, I'm not sure what you meant by checking the od output. On my system, I get:

pax> ./qq | od -t cx1
0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a
0000020   |   A 261   B   |  \n   |   A 261   B   |  \n
         7c  41  b1  42  7c  0a  7c  41  b1  42  7c  0a
0000034

so you can see the output stream contains the UTF-8, meaning that it's the terminal program which must interpret this. C/glibc isn't modifying the output at all, so maybe I just misunderstood what you were trying to say.

Although I've just realised you may be saying that your od output also has only the starting bar on that line (unlike mine, which appears not to have the problem), meaning that something is wrong within C/glibc rather than the terminal silently dropping the characters. In all honesty, I would expect a terminal to drop either the whole line or just the offending character (i.e., output |AB|); the fact that you're getting just | seems to preclude a terminal problem. Please clarify that.

paxdiablo
Thanks. Sounds right. See my update, however.
leonbloy
Have you got UTF-8 as LC_CTYPE? Anyway, I added my output. And I think I have just traced the problem to this glibc issue (not a bug... they say): http://sources.redhat.com/bugzilla/show_bug.cgi?id=649 - see the last comment. That's nasty...
leonbloy
@leonbloy: you could add a quotation from the bug comment as an answer, to make it easier for others to find.
J.F. Sebastian
Ok, I posted my findings in my own answer.
leonbloy
A: 

To be portable, convert the string using mbstowcs and print it using printf( "%6ls", wchar_ptr ).

%ls is the specifier for a wide string according to POSIX.

There is no "de-facto" standard. Typically, I would expect stdout to accept UTF-8 if the OS and locale have been configured to treat it as a UTF-8 file, but I would expect printf to be ignorant of multibyte encoding because it isn't defined in those terms.

Potatoswatter
+1  A: 

The answer to the original question (bytes or chars?) was given correctly by several people: both per the spec and in the glibc implementation, the width (or precision) in the printf family of C functions counts bytes (or plain C chars, which are the same thing). So fprintf(f,"%5s",s) in my first example definitely means "try to output at least 5 bytes (plain chars) from the array s - if there aren't enough, pad with blanks".

It does not matter whether the string (in my example, of length 5) represents text encoded in, say, UTF-8 and in fact contains 4 "textual (Unicode) characters". To printf(), internally, it just has 5 (plain) characters, and that's what counts.

But this didn't explain my other problem.

Searching in glibc's bug tracker, I found some related (rather old) issues - I was not the first one caught by this... feature:

http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308

http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

A comment from the last one happens to apply precisely to my situation:

ISO C99 requires for %.*s to only write complete characters that fit below the
precision number of bytes.  If you are using say UTF-8 locale, but ISO-8859-1
characters as shown in the input file you provided, some of the strings are
not valid UTF-8 strings, therefore sprintf fails with -1 because of the
encoding error. That's not a bug in glibc.

Whether it is a bug (perhaps in the interpretation, or in the ISO spec itself) is debatable. But what glibc is doing is clear now:

Recall my problematic statement: printf("|%.*s|\n",15,s3). Here, glibc must find out whether the length of s3 is bigger than 15 and, if so, truncate it. For computing this length (as we all agreed) it doesn't need to mess with encodings at all. But if the string must be truncated, glibc strives to be careful: if it just kept the first 15 bytes, it could potentially break a multibyte character in half and hence produce invalid text output (I'd be OK with that, but glibc sticks to that ISO C99 interpretation). So it needs to decode the char array, using the environment locale, to find out where the real character boundaries are. Hence, for example, if LC_CTYPE says UTF-8 and the array is not a valid UTF-8 byte sequence, it aborts (not so badly, because printf then actually returns -1; not so well, because it prints part of the string anyway, so it's difficult to recover cleanly).

Apparently, only in this case (when a precision is specified for a string, so there is a possibility of truncation) does glibc need to mix some "Unicode semantics" with the plain-chars/bytes semantics. Rather ugly, IMO, but so it is.

Edit: Notice that this behaviour is relevant not only for the case of invalid encodings. For example:

char s[] = "ni\xc3\xb1o";  /* "niño" in UTF8: 5 bytes, 4 unicode chars */
printf("|%.3s|",s); /* would cut the double-byte UTF8char in two */

truncates that field to 2 bytes, not 3, because it refuses to output an invalid UTF8 string:

$ ./a.out
|ni|
$ ./a.out | od -t cx1
0000000   |   n   i   |  \n
        7c 6e 69 7c 0a
leonbloy
A: 

Don't use mbstowcs unless you also make sure that wchar_t is at least 32 bits wide; otherwise you'll likely end up with UTF-16, which has all the disadvantages of UTF-8 and all the disadvantages of UTF-32.

I'm not saying to avoid mbstowcs; I'm just saying don't let Windows programmers use it.

It might be simpler to use iconv to convert to UTF-32.
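
For example (a sketch using POSIX iconv; the converter names are whatever your iconv implementation accepts):

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char in[] = "ni\xc3\xb1o";                    /* UTF-8 "niño" */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-32LE", "UTF-8"); /* to, from */
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");                          /* e.g. EILSEQ on invalid input */
    iconv_close(cd);

    printf("converted to %zu bytes of UTF-32\n", sizeof out - outleft);
    return 0;
}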

A: 

What you've found is a bug in glibc. Unfortunately it's an intentional one which the developers refuse to fix. See here for a description:

http://www.kernel.org/pub/linux/libs/uclibc/Glibc_vs_uClibc_Differences.txt

R..