ansaurus

Question

How do I represent a Unicode character in a string literal in ISO/ANSI C when the source character set is ASCII?

Answer 1

+1 A:

wchar_t is the type you are looking for: http://opengroup.org/onlinepubs/007908799/xsh/wchar.h.html

Jeff Ober 2009-09-14 14:17:45

Just keep in mind that that's a *UNIX* spec (SUS), not part of ISO C. I only bring it up since there was no unix tag on the question.

paxdiablo 2009-09-14 14:32:22

I am more interested in how to render é in ASCII text in C. In Perl I can do it by saying `"\x{e9}"`. The problem is that the source is in ASCII, but it needs to create UTF-8 characters.

Chas. Owens 2009-09-14 14:59:48

@Chas: Why not use UTF-8 as the source file encoding? Most compilers shouldn't have any problem with that as long as the multibyte sequences only occur inside string literals...

Christoph 2009-09-14 15:56:06

Because the source is going through a system that requires it to be 7-bit clean. I am just happy I don't have to use trigraphs (e.g. `??=` for `#`). Note, the source is going through that system, not being compiled there. Yes, I know it is silly.

Chas. Owens 2009-09-14 18:48:09

Answer 2

+5 A:

For UTF-8, you have to generate the encoding yourself using the UTF-8 encoding rules. For example, the German sharp s (ß, code point 0xdf) has the UTF-8 encoding 0xc3,0x9f. Your e-acute (é, code point 0xe9) has the UTF-8 encoding 0xc3,0xa9.

And you can put arbitrary hex characters in your strings with:

const char *cv = "r\xc3\xa9sum\xc3\xa9";  /* "resume" with two e-acutes, as UTF-8 bytes */
const char *sharpS = "\xc3\x9f";          /* German sharp s, as UTF-8 bytes */
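
To show where byte pairs such as 0xc3,0xa9 come from, here is a minimal sketch of the UTF-8 encoding rules as a C function. The name utf8_encode and its signature are made up for illustration and are not part of any standard library. It is only meant for working out the bytes to paste into a \x escape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   out must have room for 4 bytes. Returns the number of bytes written,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                    /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                  /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                            /* outside the Unicode range */
}
```

Calling utf8_encode(0xe9, buf) fills buf with 0xc3, 0xa9 and returns 2, which matches the bytes used in the literals above.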

paxdiablo 2009-09-14 14:18:49

The \xHEX notation is what I was looking for, thanks.

Chas. Owens 2009-09-14 15:01:03

If the variable is wide enough to hold UTF-16, can you say \x00e9?

Chas. Owens 2009-09-14 15:02:18
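
On the \x00e9 question: a rough sketch, assuming a wide string literal (wchar_t) is acceptable as the "wide enough" type. In a wide literal each hex escape produces one element, so \x00e9 is the single value 0xE9. The sketch also shows a pitfall of narrow literals: a hex escape swallows every hex digit that follows it, so a word like "défaut" needs the literal split after the escape. The names wide_name, narrow_word and check_hex_escapes are made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Wide literal: \x00e9 is one wchar_t element with value 0xE9
   (leading zeros are allowed in a hex escape). */
static const wchar_t wide_name[] = L"r\x00e9sum\x00e9";

/* Narrow literal: written as "d\xc3\xa9faut", the second escape would be
   read as \xa9fa, which does not fit in a char. Splitting the literal ends
   the escape; adjacent string literals are concatenated by the compiler. */
static const char narrow_word[] = "d\xc3\xa9" "faut";

/* Illustrative self-check: the asserts document what each literal holds. */
void check_hex_escapes(void)
{
    assert(wcslen(wide_name) == 6);                /* r, e-acute, s, u, m, e-acute */
    assert(wide_name[1] == 0xE9);
    assert(strlen(narrow_word) == 7);              /* 'd' + 2 bytes + "faut" */
    assert((unsigned char)narrow_word[1] == 0xC3);
    assert((unsigned char)narrow_word[2] == 0xA9);
    assert(narrow_word[3] == 'f');
}
```

Whether the value 0xE9 in a wchar_t actually means é depends on the platform's wide character encoding. That is the case where wchar_t holds Unicode code points, which is common but not guaranteed by ISO C.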

Answer 3

+3 A:

If you have a C99 compiler you can use <wchar.h> (and <locale.h>) and enter the Unicode code points directly in the source as universal character names (\uXXXX).

$ cat wc.c

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
  const wchar_t *name = L"r\u00e9sum\u00e9";
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wprintf(L"name is %ls\n", name);
  return 0;
}

$ /usr/bin/gcc -std=c99 -pedantic -Wall wc.c

$ ./a.out

name is résumé
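
Since the goal in the question is UTF-8 bytes from 7-bit source, the same \u notation may also work in a narrow string literal. This is a sketch that assumes the compiler's execution character set is UTF-8, which is the default for GCC and Clang but not something ISO C promises. The names name_utf8 and check_ucn_bytes are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumption: the execution character set is UTF-8, so each \u00e9
   becomes the two bytes 0xC3 0xA9 in the compiled string. */
static const char name_utf8[] = "r\u00e9sum\u00e9";

/* Illustrative self-check of the bytes the compiler produced. */
void check_ucn_bytes(void)
{
    assert(strlen(name_utf8) == 8);                /* 4 ASCII bytes + 2 * 2 bytes */
    assert((unsigned char)name_utf8[1] == 0xC3);
    assert((unsigned char)name_utf8[2] == 0xA9);
}
```

If the execution character set is something else, the compiler converts \u00e9 to that encoding instead, so the explicit \xc3\xa9 form is the more predictable choice when the bytes must be UTF-8.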

pmg 2009-09-14 15:57:17
