ansaurus

Question

Char with accent to char without accent in C

Answer 1

+3 A:

The C standard says that the character constants such as 'ç' are integer constants:

§6.4.4.4/9

An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer.

If the char type is signed on your machine (it is with GCC on x86 Linux), then when comando contains 'ç' and is promoted to integer, it becomes a negative integer, whereas 'ç' is a positive integer. Hence the warning from the compiler.
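The signedness problem can be sidestepped by converting to unsigned char before comparing. A minimal sketch, assuming an 8-bit ISO 8859-1 encoding in which 'ç' is the byte 0xE7; the helper name is_latin1_byte is hypothetical:

```c
/* Hypothetical helper: test whether a stored char holds a given 8-bit
 * code, e.g. 0xE7 for 'ç' (assuming ISO 8859-1). */
int is_latin1_byte(char stored, unsigned code)
{
    /* Converting to unsigned char first gives a value in 0..255 whether
     * plain char is signed or unsigned, so the comparison is portable. */
    return (unsigned char)stored == code;
}
```

Called as is_latin1_byte(comando, 0xE7), this gives the same answer whether or not plain char is signed.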

For an 8-bit character set, by far the fastest way to do such an operation is to create a table of 256 bytes, where each position contains the unaccented version of the character.

#include <stdio.h>   /* for EOF */

int unaccented(int c)
{
     static const char map[256] =
     {
          '\x00', '\x01', ...
          ...
          '0',    '1',    '2', ...
          ...
          'A',    'B',    'C', ...
          ...
          'a',    'b',    'c', ...
          ...
          'A',    'A',    'A', ... // 0xC0 onwards...
          ...
          'a',    'a',    'a', ... // 0xE0 onwards...
          ...
     };
     if (c < 0 || c > 255)
         return EOF;
     else
         return map[c];
}

Of course, you'd write a program - probably a script - to generate the table of data, rather than doing it manually. In the range 0..127, the character at position x is the character with code x (so map['A'] == 'A').
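As a rough sketch of such a generator, written in C for consistency with the rest of the answer: it starts from the identity mapping, overrides the accented letters, and prints the result as an initializer. The code values assume ISO 8859-1, and the names build_map and print_map are hypothetical. Letters with no obvious ASCII equivalent are marked '.' and left unchanged.

```c
#include <stdio.h>

/* Fill map[] with the identity mapping, then override the accented
 * letters in 0xC0..0xFF (ISO 8859-1 assumed). */
void build_map(unsigned char map[256])
{
    /* One replacement per code from 0xC0 to 0xFF; '.' means "keep". */
    static const char upper_half[] =
        "AAAAAA.CEEEEIIII" ".NOOOOO..UUUUY.."
        "aaaaaa.ceeeeiiii" ".nooooo..uuuuy.y";
    int i;

    for (i = 0; i < 256; i++)
        map[i] = (unsigned char)i;
    for (i = 0; i < 64; i++)
        if (upper_half[i] != '.')
            map[0xC0 + i] = (unsigned char)upper_half[i];
}

/* Print the table as a C initializer, eight entries per line. */
void print_map(FILE *out, const unsigned char map[256])
{
    int i;

    fprintf(out, "static const unsigned char map[256] =\n{\n");
    for (i = 0; i < 256; i++)
        fprintf(out, "%s0x%02X,%s",
                i % 8 == 0 ? "    " : " ",
                (unsigned)map[i],
                i % 8 == 7 ? "\n" : "");
    fprintf(out, "};\n");
}
```

Calling build_map() and then print_map(stdout, map) writes a table that can be pasted into the function body.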

If you are allowed to exploit C99, you can improve the table by using designated initializers:

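static const char map[] =
{
    /* Casts keep the index non-negative where plain char is signed. */
    ['\x00']             = '\x00', ...
    ['A']                = 'A', ...
    ['a']                = 'a', ...
    [(unsigned char)'å'] = 'a', ...
    [(unsigned char)'Å'] = 'A', ...
    [(unsigned char)'ÿ'] = 'y', ...
};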
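One detail to keep in mind with designated initializers: elements that are not named are zero-initialized, so a table that lists only the accented letters needs a fallback for everything else. A minimal sketch, assuming ISO 8859-1 code values written in hex so that it does not depend on the source file's encoding; the name unaccented_sparse is hypothetical:

```c
#include <stdio.h>

/* Sparse variant: only a few accented letters are listed (ISO 8859-1
 * assumed); every other entry is implicitly zero. */
int unaccented_sparse(int c)
{
    static const unsigned char map[256] =
    {
        [0xC5] = 'A',  /* Å */
        [0xE5] = 'a',  /* å */
        [0xE7] = 'c',  /* ç */
        [0xFF] = 'y',  /* ÿ */
    };

    if (c < 0 || c > 255)
        return EOF;
    /* A zero entry means "no replacement listed": return c unchanged. */
    return map[c] != 0 ? map[c] : c;
}
```

The fallback costs one extra comparison per call, in exchange for a table that only mentions the characters that actually change.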

It isn't entirely clear what you should do with ~~diphthongs~~ letters such as 'æ' or 'ß' that have no ASCII equivalent; however, the simple rule of 'when in doubt, do not change it' can be applied sensibly. They aren't accented characters, but neither are they ASCII characters.

This does not work so well for UTF-8. For that, you need more specialized tables driven from data in the Unicode standard.

Also note that you should convert any 'char' value to 'unsigned char' before calling this. The code could also attempt to cope with callers who pass a plain (possibly negative) char; however, it is hard to distinguish 'ÿ' (0xFF, which is -1 as a signed 8-bit char) from EOF (commonly -1) when people are not careful in calling the function. The C standard character test macros are required to support all valid character values (when converted to unsigned char) and EOF as inputs - this follows that design.

§7.4/1

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
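The calling convention can be illustrated with a short sketch. To keep it self-contained, unaccented_demo below is a stand-in for the table-driven function and maps only 'ç' (0xE7, assuming ISO 8859-1); unaccent_string is a hypothetical helper, not part of the original answer.

```c
#include <stdio.h>

/* Stand-in for the table-driven function; maps only 'ç' (0xE7 in
 * ISO 8859-1, assumed) so that the example stays short. */
static int unaccented_demo(int c)
{
    if (c < 0 || c > 255)
        return EOF;
    return c == 0xE7 ? 'c' : c;
}

/* Hypothetical helper: replace accented characters in a string in place. */
void unaccent_string(char *s)
{
    for (; *s != '\0'; s++)
    {
        /* The cast guarantees a value in 0..255, never a negative one
         * that could be mistaken for EOF. */
        *s = (char)unaccented_demo((unsigned char)*s);
    }
}
```

This is the same pattern that is needed when calling toupper() or isalpha() on elements of a char array.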

Jonathan Leffler 2010-09-15 21:29:57

Just a small correction, but "diphthong" is a term used more frequently in phonology, not orthography. Specifically, æ and ß are ligatures, but the term you may be after is "digraph".

dreamlax 2010-09-15 23:29:05

@Dreamlax: if you Google search 'define:diphthong', then 'æ' is a diphthong, but 'ß' is not. Again, 'æ' is a digraph, but 'ß' is not (though it can be transliterated to 'ss' which is a digraph); and 'æ' is a ligature of sorts, but 'ß' is not. So, I'm not sure what was the best term to use - diphthong was wrong, but grapheme is a little too vague, ... maybe just 'letter'. But thanks for pointing out the erroneous usage.

Jonathan Leffler 2010-09-16 02:41:40

ß is absolutely a [ligature](http://en.wikipedia.org/wiki/Typographical_ligature#German_.C3.9F). Also, the Google results indicate that the pronunciation of æ is typically a diphthong, not that the character itself is (the word diphthong literally means "two sounds" so it is more commonly associated with phonology). In any case, 'letter' is a better choice of terminology :) .

dreamlax 2010-09-16 02:54:23

Answer 2

+2 A:

You mentioned in another similar question that this was easy enough to do in other languages that you know. If I were you and needed to do this in C but couldn't find existing C code for it, I would write a program in another language to generate a C function that does the conversion. As long as you can cycle through all the characters this shouldn't be too difficult, though the generated code may be large. I'd probably do this for UTF-16, and have a simple wrapper function that takes UTF-8, converts it to UTF-16, and calls the conversion function.

The conversion function would just be made of a very large switch/case statement, and the default case would be for characters that didn't convert.
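A hand-written fragment of what such generated code might look like is sketched below. Only a handful of cases are shown, and the function name unaccent_codepoint is hypothetical. The values are Unicode code points; for these characters they coincide with the UTF-16 code units.

```c
#include <stdint.h>

/* Sketch of a generated conversion function. A real generator would
 * emit one case label per character that has a replacement. */
uint32_t unaccent_codepoint(uint32_t cp)
{
    switch (cp)
    {
    case 0x00C0: case 0x00C1: case 0x00C2:              /* À Á Â */
        return 'A';
    case 0x00C7:                                        /* Ç */
        return 'C';
    case 0x00E0: case 0x00E1: case 0x00E2:              /* à á â */
        return 'a';
    case 0x00E7:                                        /* ç */
        return 'c';
    case 0x00E8: case 0x00E9: case 0x00EA: case 0x00EB: /* è é ê ë */
        return 'e';
    default:
        return cp;  /* characters that do not convert */
    }
}
```

Compilers typically turn a dense switch like this into a jump table or a binary search, so its size need not make it slow.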

nategoose 2010-09-15 21:31:37

Answer 3

+3 A:

In supplement to the other answers, try this for size:

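#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(int argc, char** argv)
{
    wchar_t* x = calloc(100, sizeof(wchar_t));
    char*    y = calloc(100, sizeof(char));

    if (x == NULL || y == NULL)
    {
        free(y); free(x);
        return EXIT_FAILURE;
    }

    /* mbstowcs() decodes according to the current locale, so adopt the
       user's locale rather than the default "C" locale. */
    setlocale(LC_ALL, "");

    printf("Input something: ");
    fread(y, 1, 99, stdin);   /* calloc zeroed y, so it stays terminated */

    mbstowcs(x, y, 99);       /* leave room for the terminating L'\0' */

    if ( x[0] == L'è' )
    {
        printf("Ohhh, french character!\n");
    }

    free(y); free(x);

    return 0;
}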

This code shows you two things: firstly, how to convert a multi-byte string you have read in into a wide character string. From there, you can handle nearly every character that exists (theoretically at least).

Having done this, you simply need a map from each character to its replacement, which lets you walk the string and substitute one character at a time. See the other answers for this.

Some notes: I've deliberately used fread() on stdin - press ctrl+D when done typing input. fread() never reads more than the 99 bytes requested, which prevents the buffer overflow attack you would be vulnerable to with an unbounded scanf("%s") (see NOP sled).

Secondly, I have blindly assumed y's input will be mostly single byte. In fact, if every character in the multi-byte string takes two bytes, 100 chars yield only 50 wchar_t characters. I could check lengths etc too, but that's beyond the scope of this example.
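For completeness, here is a sketch of what the length check might look like. It uses the standard mbsrtowcs() function, which counts the wide characters without storing anything when its destination is a null pointer. The helper name to_wide is hypothetical, and the conversion still depends on the current locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multi-byte string to a newly allocated
 * wide string sized from the actual character count. Returns NULL on an
 * invalid sequence or allocation failure; the caller frees the result. */
wchar_t *to_wide(const char *s)
{
    mbstate_t state;
    const char *p = s;
    size_t n;
    wchar_t *w;

    memset(&state, 0, sizeof state);
    n = mbsrtowcs(NULL, &p, 0, &state);   /* count only, nothing stored */
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    p = s;
    memset(&state, 0, sizeof state);
    mbsrtowcs(w, &p, n + 1, &state);      /* n characters plus terminator */
    return w;
}
```

With this approach the wide buffer is never larger or smaller than the input requires, whatever the mix of single-byte and multi-byte characters.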

Ninefingers 2010-09-15 21:46:03

For such small input, it would probably be better to declare `wchar_t x[100] = {0}; char y[100] = {0};` and avoid manual memory management. This gives the added benefit of being able to use `sizeof x / sizeof x[0]` as the final parameter to `mbstowcs` (which counts wide characters, not bytes) and also provide `sizeof y - 1` to `fread`. Also, `wchar_t` on many systems is 32-bit.

dreamlax 2010-09-15 23:39:24

dreamlax - true, one could do it that way and loop with while (fread(...)) over the source, which for a file would continue until EOF. That might be better than calloc, I agree.

Ninefingers 2010-09-16 13:06:15
