views:

1755

answers:

5

Hi!

I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å, among others. When I run the program in the terminal, each of those characters is output as a "?".

Is there an easy fix?

+1  A: 

Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.

It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:

#include <stdio.h>

/* Dump each byte of the buffer as two hex digits. Casting through
   unsigned char first prevents sign extension of bytes >= 0x80,
   which would otherwise print as e.g. "ffffffc3" on platforms
   where plain char is signed. */
static void print_buffer(const char *buffer, size_t length)
{
  size_t i;

  for(i = 0; i < length; i++)
    printf("%02x ", (unsigned int)(unsigned char) buffer[i]);
  putchar('\n');
}

You can do this after loading a very short file, containing just a few characters.
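For instance, a quick driver along these lines (a sketch only; "test.txt" is just a stand-in for your file, and it assumes print_buffer() above sits in the same source file) should show Æ, Ø and Å as the two-byte sequences c3 86, c3 98 and c3 85 if the file really is UTF-8:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("test.txt", "r");  /* hypothetical short test file */
    if (!f)
        return 1;

    /* Read one line the same way the original program does,
       then inspect its raw bytes. */
    if (fgets(line, sizeof line, f) != NULL)
        print_buffer(line, strlen(line));

    fclose(f);
    return 0;
}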

Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.

unwind
My terminal is set to UTF-8 encoding. The program stores all the characters of each line from the text file into a char array via fgets(). If I'm losing bytes, I have no idea why or how to fix it... (Just starting to learn C btw)
Orolin
@Eirik, don't use fgets(), which is byte-oriented. Use fgetwc() from my post.
Aiden Bell
+8  A: 

First things first:

  1. Read the file into a buffer
  2. Use libiconv or similar to convert the UTF-8 bytes to wchar_t, and use the wide-character handling functions such as wprintf()
  3. Use the wide-character functions in C! Most file/output handling functions have a wide-character variant

Ensure that your terminal can handle UTF-8 output. Setting the correct locale and manipulating the locale data can automate a lot of the file opening and conversion for you ... depending on what you are doing.
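For example, on a POSIX system whose environment locale is UTF-8, something along these lines (a minimal sketch; "data.txt" is just an example name) reads a UTF-8 file and echoes it to the terminal as wide characters:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wint_t c;
    FILE *f;

    /* Adopt the locale from the environment (e.g. LANG=en_US.UTF-8)
       so the wide-character conversion knows the encoding. */
    setlocale(LC_CTYPE, "");

    f = fopen("data.txt", "r");
    if (!f)
        return 1;

    /* fgetwc() decodes the multibyte input into wide characters;
       putwchar() re-encodes them for output. */
    while ((c = fgetwc(f)) != WEOF)
        putwchar(c);

    fclose(f);
    return 0;
}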

Remember that the width of a code point in UTF-8 is variable (one to four bytes). This means you can't just seek to an arbitrary byte offset and begin reading, as you can with ASCII ... you might land in the middle of a code point. Good libraries can handle this resynchronisation for you in some cases.
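Fortunately UTF-8 is self-synchronising: every continuation byte has the bit pattern 10xxxxxx, so after an arbitrary seek you can back up to the start of the current code point by hand. A minimal sketch (the function name is just for illustration):

#include <stddef.h>

/* Given a byte offset into a UTF-8 buffer, move it back to the
   first byte of the code point it points into. Continuation bytes
   always match 10xxxxxx, i.e. (byte & 0xC0) == 0x80. */
static size_t utf8_align(const char *buf, size_t pos)
{
    while (pos > 0 && ((unsigned char) buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}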

Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Note: the "ccs=UTF-8" mode string is a Microsoft C runtime
       extension. On POSIX/glibc systems, open the file normally
       and call setlocale(LC_CTYPE, "") first instead. */
    FILE *f = fopen("data.txt", "r, ccs=UTF-8");
    if (!f)
        return 1;

    /* Print the numeric value of each decoded code point. */
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", (unsigned int) c);

    fclose(f);
    return 0;
}

Links

  1. libiconv
  2. Locale data in C/GNU libc
  3. Some handy info
  4. Another good Unicode/UTF-8 in C resource
Aiden Bell
Thanks dude! I'll try this...
Orolin
No problems. Stick at it, Unicode in C isn't the simplest thing in the world ... get familiar with the standards too :)
Aiden Bell
+1  A: 

Probably your text file is ISO-8859-1 encoded but your terminal is set to UTF-8. This kind of mismatch is a standard problem with byte-oriented text handling; other C programs (such as the standard ‘cat’ and ‘more’ commands) will do the same thing, and it isn't generally considered an error or something that needs to be fixed.

If you want to operate on a Unicode character level instead of bytes that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)
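One way to honour such a switch is iconv(), which converts between a caller-specified encoding and UTF-8. A minimal sketch, assuming the source encoding name comes from the user and glossing over partial-sequence handling:

#include <iconv.h>
#include <stdio.h>

/* Convert 'inlen' bytes in 'in' from the user-specified encoding
   to UTF-8 in 'out'. Returns the number of bytes written, or
   (size_t) -1 on error. Error handling is deliberately minimal. */
static size_t to_utf8(const char *from_enc, char *in, size_t inlen,
                      char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    size_t left = outlen;

    if (cd == (iconv_t) -1)
        return (size_t) -1;

    if (iconv(cd, &in, &inlen, &out, &left) == (size_t) -1) {
        iconv_close(cd);
        return (size_t) -1;
    }

    iconv_close(cd);
    return outlen - left;
}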

bobince
A: 

I don't know if it could help, but if you're sure that the encodings of the terminal and the input file are the same, you can try setlocale():

#include <locale.h>
…
/* An empty string adopts the locale from the environment (e.g. LANG);
   passing NULL would only query the current locale without changing it. */
setlocale(LC_CTYPE, "");
Michał Górny
+2  A: 

Here's also an interesting article on handling UTF-8 in C:

http://canonical.org/~kragen/strlen-utf8.html
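In the spirit of that article: because continuation bytes always match 10xxxxxx, you can count the code points in a UTF-8 string by counting only the bytes that are not continuation bytes. A minimal sketch of the idea:

#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes, which always match 10xxxxxx. */
static size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    return count;
}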

+1, An interesting link
Aiden Bell