views:

1755

answers:

5

Hi!

I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å, among others. When I run the program in the terminal, each of those characters is output as a "?".

Is there an easy fix?

+1  A: 

Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.

It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:

#include <stdio.h>

/* Dump each byte of the buffer as two hex digits. Casting through
   unsigned char first prevents sign extension of bytes >= 0x80,
   which would otherwise print as e.g. "ffffffc3" on platforms
   where plain char is signed. */
static void print_buffer(const char *buffer, size_t length)
{
  size_t i;

  for(i = 0; i < length; i++)
    printf("%02x ", (unsigned int)(unsigned char) buffer[i]);
  putchar('\n');
}

You can do this after loading a very short file, containing just a few characters.
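For instance, a quick driver along these lines (a sketch only; "test.txt" is just a stand-in for your file, and it assumes print_buffer() above sits in the same source file) should show Æ, Ø and Å as the two-byte sequences c3 86, c3 98 and c3 85 if the file really is UTF-8:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("test.txt", "r");  /* hypothetical short test file */
    if (!f)
        return 1;

    /* Read one line the same way the original program does,
       then inspect its raw bytes. */
    if (fgets(line, sizeof line, f) != NULL)
        print_buffer(line, strlen(line));

    fclose(f);
    return 0;
}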

Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.

unwind
My terminal is set to UTF-8 encoding. The program stores all the characters of each line from the text file into a char array via fgets(). If I'm losing bytes, I have no idea why or how to fix it... (Just starting to learn C btw)
Orolin
@Eirik, don't use fgets(), which is byte-oriented. Use fgetwc() from my post.
Aiden Bell
+8  A: 

First things first:

  1. Read the file into a buffer
  2. Use libiconv or similar to convert the UTF-8 bytes to wchar_t, and use the wide-character handling functions such as wprintf()
  3. Use the wide-character functions in C! Most file/output handling functions have a wide-character variant

Ensure that your terminal can handle UTF-8 output. Setting the correct locale and manipulating the locale data can automate a lot of the file opening and conversion for you ... depending on what you are doing.
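For example, on a POSIX system whose environment locale is UTF-8, something along these lines (a minimal sketch; "data.txt" is just an example name) reads a UTF-8 file and echoes it to the terminal as wide characters:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wint_t c;
    FILE *f;

    /* Adopt the locale from the environment (e.g. LANG=en_US.UTF-8)
       so the wide-character conversion knows the encoding. */
    setlocale(LC_CTYPE, "");

    f = fopen("data.txt", "r");
    if (!f)
        return 1;

    /* fgetwc() decodes the multibyte input into wide characters;
       putwchar() re-encodes them for output. */
    while ((c = fgetwc(f)) != WEOF)
        putwchar(c);

    fclose(f);
    return 0;
}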

Remember that the width of a code point in UTF-8 is variable (one to four bytes). This means you can't just seek to an arbitrary byte offset and begin reading, as you can with ASCII ... you might land in the middle of a code point. Good libraries can handle this resynchronisation for you in some cases.
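Fortunately UTF-8 is self-synchronising: every continuation byte has the bit pattern 10xxxxxx, so after an arbitrary seek you can back up to the start of the current code point by hand. A minimal sketch (the function name is just for illustration):

#include <stddef.h>

/* Given a byte offset into a UTF-8 buffer, move it back to the
   first byte of the code point it points into. Continuation bytes
   always match 10xxxxxx, i.e. (byte & 0xC0) == 0x80. */
static size_t utf8_align(const char *buf, size_t pos)
{
    while (pos > 0 && ((unsigned char) buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}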

Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Note: the "ccs=UTF-8" mode string is a Microsoft C runtime
       extension. On POSIX/glibc systems, open the file normally
       and call setlocale(LC_CTYPE, "") first instead. */
    FILE *f = fopen("data.txt", "r, ccs=UTF-8");
    if (!f)
        return 1;

    /* Print the numeric value of each decoded code point. */
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", (unsigned int) c);

    fclose(f);
    return 0;
}

Links

  1. libiconv
  2. Locale data in C/GNU libc
  3. Some handy info
  4. Another good Unicode/UTF-8 in C resource
Aiden Bell
Thanks dude! I'll try this...
Orolin
No problems. Stick at it, Unicode in C isn't the simplest thing in the world ... get familiar with the standards too :)
Aiden Bell
+1  A: 

Probably your text file is ISO-8859-1 encoded but your terminal is set to UTF-8. This kind of mismatch is a standard problem with byte-oriented text handling; other C programs (such as the standard ‘cat’ and ‘more’ commands) will do the same thing, and it isn't generally considered an error or something that needs to be fixed.

If you want to operate on a Unicode character level instead of bytes that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)
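One way to honour such a switch is iconv(), which converts between a caller-specified encoding and UTF-8. A minimal sketch, assuming the source encoding name comes from the user and glossing over partial-sequence handling:

#include <iconv.h>
#include <stdio.h>

/* Convert 'inlen' bytes in 'in' from the user-specified encoding
   to UTF-8 in 'out'. Returns the number of bytes written, or
   (size_t) -1 on error. Error handling is deliberately minimal. */
static size_t to_utf8(const char *from_enc, char *in, size_t inlen,
                      char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    size_t left = outlen;

    if (cd == (iconv_t) -1)
        return (size_t) -1;

    if (iconv(cd, &in, &inlen, &out, &left) == (size_t) -1) {
        iconv_close(cd);
        return (size_t) -1;
    }

    iconv_close(cd);
    return outlen - left;
}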

bobince
A: 

I don't know if it could help, but if you're sure that the encodings of the terminal and the input file are the same, you can try setlocale():

#include <locale.h>
…
/* An empty string adopts the locale from the environment (e.g. LANG);
   passing NULL would only query the current locale without changing it. */
setlocale(LC_CTYPE, "");
Michał Górny
+2  A: 

Here's also an interesting article on handling UTF-8 in C:

http://canonical.org/~kragen/strlen-utf8.html
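In the spirit of that article: because continuation bytes always match 10xxxxxx, you can count the code points in a UTF-8 string by counting only the bytes that are not continuation bytes. A minimal sketch of the idea:

#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes, which always match 10xxxxxx. */
static size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    return count;
}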

+1, An interesting link
Aiden Bell