The printf/fprintf/sprintf family supports a width field in its format specifier. I have a doubt for the case of (non-wide) char arrays arguments:
Is the width field supposed to mean bytes or characters?
What is the (correct-de facto) behaviour if the char array corresponds to (say) a raw UTF-8 string? (I know that normally I should use some wide char type, that's not the point)
For example, in
char s[] = "ni\xc3\xb1o"; // utf8 encoded "niño"
fprintf(f,"%5s",s);
Is that function supposed to try to ouput just 5 bytes (plain C chars) (and you take responsability of misalignments or other problems if two bytes results in a textual characters) ?
Or is it supposed to try to compute the length of "textual characters" of the array? (decodifying it... according to the current locale?) (in the example, this would amount to find out that the string has 4 unicode chars, so it would add a space for padding).
UPDATE: I agree with the answers, it is logical that the printf family doesnt distinguish plain C chars from bytes. The problem is my glibc doest not seem to fully respect this notion, if the locale has been set previously, and if one has the (today most used) LANG/LC_CTYPE=en_US.UTF-8
Case in point:
#include<stdio.h>
#include<locale.h>
main () {
char * locale = setlocale(LC_ALL, ""); /* I have LC_CTYPE="en_US.UTF-8" */
char s[] = {'n','i', 0xc3,0xb1,'o',0}; /* "niño" in utf8: 5 bytes, 4 unicode chars */
printf("|%*s|\n",6,s); /* this should pad a blank - works ok*/
printf("|%.*s|\n",4,s); /* this should eat a char - works ok */
char s3[] = {'A',0xb1,'B',0}; /* this is not valid UTF8 */
printf("|%s|\n",s3); /* print raw chars - ok */
printf("|%.*s|\n",15,s3); /* panics (why???) */
}
So, even when a non-POSIX-C locale has been set, still printf seems to have the right notion for counting width: bytes (c plain chars) and not unicode chars. That's fine. However, when given a char array that is not decodable in his locale, it silently panics (it aborts - nothing is printed after the first '|' - without error messages)... only if it needs to count some width. I dont understand why it even tries to decode the string from utf-8, when it doesn need/have to. Is this a bug in glibc ?
Tested with glibc 2.11.1 (Fedora 12) (also glibc 2.3.6)
Note: it's not related to terminal display issues - you can check the output by piping to od : $ ./a.out | od -t cx1
Here's my output:
0000000 | n i 303 261 o | \n | n i 303 261 | \n
7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a
0000020 | A 261 B | \n |
7c 41 b1 42 7c 0a 7c