Hi guys. I have a simple question that I can't find an answer to anywhere on the internet: how can I convert UTF-8 to ASCII (mostly accented characters to the same character without the accent) in C using only the standard library? I found solutions for most other languages, but not for C in particular.

Thanks!

EDIT: Some of the kind people who commented made me double-check what I actually need, and I had overstated it. I only need an idea of how to write a function that does: char with accent -> char without accent. :)

+2  A: 

There's no built-in way of doing that. There's really little difference between UTF-8 and ASCII unless you're talking about characters above 127, which cannot be represented in ASCII anyway.

If you have a specific mapping you want (such as a with accent -> a), then you should probably just handle that as a string replace operation.
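
A minimal sketch of that replace operation, assuming the input is valid UTF-8 and that only the Latin-1 Supplement letters (U+00C0 to U+00FF) matter. In UTF-8 each of those is the lead byte 0xC3 followed by one byte in the range 0x80 to 0xBF, so the pair can be swapped for a single ASCII byte. The function name `fold_utf8_accents` and the choice of `'?'` for characters with no single-letter equivalent are illustrative assumptions, not anything from the standard library.

```c
#include <stddef.h>

/* ASCII stand-ins for U+00C0..U+00FF, in code point order.
   '?' marks characters with no single-letter equivalent
   (for example the multiplication sign or the sharp s). */
static const char latin1_fold[] =
    "AAAAAA?CEEEEIIII"
    "?NOOOOO?OUUUUY??"
    "aaaaaa?ceeeeiiii"
    "?nooooo?ouuuuy?y";

/* Hypothetical helper: folds a UTF-8 string in place and returns it.
   Any byte sequence that is not 0xC3 + continuation is copied unchanged. */
char *fold_utf8_accents(char *s)
{
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = in;

    while (*in) {
        if (in[0] == 0xC3 && in[1] >= 0x80 && in[1] <= 0xBF) {
            /* two input bytes become one output byte */
            *out++ = (unsigned char)latin1_fold[in[1] - 0x80];
            in += 2;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
    return s;
}
```

Because the output is never longer than the input, the replacement can be done in the same buffer. Characters outside U+00C0 to U+00FF (the euro sign, for instance) pass through untouched, so the result is only pure ASCII if the input stays within that range.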

Billy ONeal
But when I try to do an if (c == 'á') { c = 'a'; } it gives me "comparison is always false due to limited range of data type" :(
dccarmo
@dccarmo: `'á'` looks like `'\0703\0120'` to C, so that is a constant that is bigger than a `char` can hold, so if `c` is a char there is no way for it to ever equal that. What it is likely to equal is `'\0703'` and the next character in your stream would be the `'\0120'`.
nategoose
@nategoose: Remove those leading zeros; they're not valid in C octal char escapes. `\0703\0120` is parsed as `\070`, `3`, `\012`, `0`.
R..
Not sure if it is standard C or not, but you may be able to use a wide character literal, like `L'á'`.
Merlyn Morgan-Graham
@R: You're correct, but I can't edit the comment. I don't use octal that often so I messed it up.
nategoose
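
The wide character idea from the comments above could look roughly like this. It is only a sketch: it assumes the text has already been converted to `wchar_t` (for example with `mbstowcs` after `setlocale(LC_ALL, "")` in a UTF-8 locale, which depends on the environment) and that `wchar_t` holds Unicode code points. The literals are written as universal character names so the source file encoding does not matter. The names `unaccent_wchar` and `unaccent_wcs` are made up for illustration, and the list of cases is deliberately incomplete.

```c
#include <wchar.h>

/* Hypothetical helper: maps a few accented letters to their base letter.
   A wchar_t is wide enough to compare against L'\u00E1', unlike a char. */
static wchar_t unaccent_wchar(wchar_t wc)
{
    switch (wc) {
    case L'\u00C1': return L'A';   /* A with acute */
    case L'\u00E1': return L'a';   /* a with acute */
    case L'\u00E9': return L'e';   /* e with acute */
    case L'\u00ED': return L'i';   /* i with acute */
    case L'\u00F3': return L'o';   /* o with acute */
    case L'\u00FA': return L'u';   /* u with acute */
    default:        return wc;     /* extend the list as needed */
    }
}

/* Applies the mapping to every character of a wide string, in place. */
void unaccent_wcs(wchar_t *s)
{
    for (; *s; ++s)
        *s = unaccent_wchar(*s);
}
```
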
+5  A: 

Take a look at libiconv. Even if you insist on doing it without libraries, you might find some inspiration there.

zoul
I believe that violates "using only the standard lib"
Billy ONeal
@Billy - not if a person only reads the libiconv sources (e.g. to copy code ranges).
Steve314
I'll give it a look, thanks!
dccarmo
+4  A: 

In general, you can't. UTF-8 covers much more than accented characters.

Nemanja Trifunovic
+2  A: 

Every decent Unicode support library (not the standard library, of course) has a way to decompose a string into normalization form D or KD, which separates the diacritics from the letters and gives you a shot at filtering them out. I'm not so sure this is worth pursuing: the result is just gibberish to a native reader of the language, and not every letter is decomposable. In other words, junk with question marks.
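
For illustration, the filtering step alone can be sketched in plain C, assuming the string has already been decomposed by some other means (the standard library cannot do that part). The sketch drops only the Combining Diacritical Marks block, U+0300 to U+036F, which in UTF-8 is the byte pairs 0xCC 0x80 through 0xCD 0xAF. The function name `strip_combining_marks` is a made-up helper.

```c
/* Hypothetical helper: removes combining marks U+0300..U+036F from an
   already decomposed UTF-8 string, in place, and returns the string.
   Precomposed characters such as U+00E1 are NOT touched. */
char *strip_combining_marks(char *s)
{
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = in;

    while (*in) {
        /* U+0300..U+033F -> 0xCC 0x80..0xBF, U+0340..U+036F -> 0xCD 0x80..0xAF */
        int is_mark =
            (in[0] == 0xCC && in[1] >= 0x80 && in[1] <= 0xBF) ||
            (in[0] == 0xCD && in[1] >= 0x80 && in[1] <= 0xAF);

        if (is_mark)
            in += 2;            /* skip the two-byte mark */
        else
            *out++ = *in++;     /* keep everything else */
    }
    *out = '\0';
    return s;
}
```

Given the bytes `"e\xCC\x81"` (e followed by a combining acute accent) this leaves `"e"`. A precomposed é (0xC3 0xA9) passes through unchanged, which is why the decomposition step has to happen first.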

Hans Passant
Yeah, I exaggerated on my question. I really just want to convert accented char -> char without accent. I hate my teacher for only allowing us to do this with the standard lib and also in C. If I could only use Python. :(
dccarmo
You picked the wrong class, can't help you with that.
Hans Passant
A: 

Since this is homework, I'm guessing your teacher is clueless and doesn't know anything about UTF-8, and probably is stuck in the 1980s with "code pages" and "extended ASCII" (words you should erase from your vocabulary if you haven't already). Your teacher probably wants you to write a 128-byte lookup table that maps CP437 or Windows-1252 bytes in the range 128-255 to similar-looking ASCII letters. It would go something like...

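/* Copies src to dest, replacing bytes 128-255 via a lookup table.
   dest must have room for strlen(src) + 1 bytes. */
void strip_accents(unsigned char *dest, const unsigned char *src)
{
    /* lut[i] is the ASCII replacement for byte value 128 + i */
    static const unsigned char lut[128] = { /* mapping here */ };
    do {
        *dest++ = *src < 128 ? *src : lut[*src - 128];
    } while (*src++);  /* the terminating 0 is copied before the loop ends */
}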
R..