I'm just learning about C and got an assignment where we have to translate plain text into morse code and back. (I am mostly familiar with Java so bear with me on the terms I use).

To do this, I have an array with the strings for all letters.

char *letters[] = {
    ".- ", "-... ", "-.-. ", "-.. ", ".", "..-." /* etc. */
};

I wrote a function for returning the position of the desired letter.

int letter_nr(unsigned char c)
{
    return c-97;
}

This is working, but the assignment specifications require the handling of the Swedish umlauted letters åäö. The Swedish alphabet is the same as the English with these three letters in the end. I tried checking for these, like so:

int letter_nr(unsigned char c)
{
    if (c == 'å')
        return 26;
    if (c == 'ä')
        return 27;
    if (c == 'ö')
        return 28;
    return c - 97;
}

Unfortunately, when I test this function, I get the same value for all three of these letters: 98. Here is my main, testing function:

int main()
{
    unsigned char letter;

    while(1)
    {
        printf("Type a letter to get its position: ");
        scanf("%c", &letter);
        printf("%d\n", letter_nr(letter));
    }
    return 0;
}

What can I do to resolve this?

+9  A: 

The encoding of character constants actually depends on your locale settings.

The safest bet is to use wide characters, and the corresponding functions. You declare the alphabet as const wchar_t *alphabet = L"abcdefghijklmnopqrstuvwxyzåäö", and the individual characters as L'ö'.
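For instance, here is a minimal sketch of a wide-character lookup, assuming the Morse table is ordered a–z followed by å, ä, ö; the name letter_nr_w and the -1 return for unknown characters are my own choices, not part of the original code:

#include <stddef.h>
#include <wchar.h>

int letter_nr_w(wchar_t c)
{
    /* assumed table order: a-z, then å, ä, ö (indices 0-28) */
    const wchar_t *alphabet = L"abcdefghijklmnopqrstuvwxyzåäö";
    const wchar_t *pos = wcschr(alphabet, c);  /* locate c in the alphabet string */
    if (c == L'\0' || pos == NULL)
        return -1;                             /* not a letter we know */
    return (int)(pos - alphabet);              /* offset doubles as the Morse table index */
}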

This small example program works for me (also on a UNIX console with UTF-8) - try it.

#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char** argv)
{
    wint_t letter = L'\0';
    setlocale(LC_ALL, ""); /* Initialize locale, to get the correct conversion to/from wchars */
    while(1)
    {
        if(!letter)
            printf("Type a letter to get its position: ");

        letter = fgetwc(stdin);
        if(letter == WEOF) {
            putchar('\n');
            return 0;
        } else if(letter == L'\n' || letter == L'\r') {
            letter = L'\0'; /* skip newlines - and print the instruction again */
        } else {
            printf("%d\n", letter); /* print the character value, and don't print the instruction again */
        }
    }
    return 0;
}

Example session:

Type a letter to get its position: a
97
Type a letter to get its position: A
65
Type a letter to get its position: Ö
214
Type a letter to get its position: ö
246
Type a letter to get its position: Å
197
Type a letter to get its position: <^D>

I understand that on Windows, this does not work with characters outside the Unicode BMP, but that's not an issue here.

gnud
He is on Mac OS X, so the console is UTF-8 ready and the locale does not influence his encoding.
Michal Sznajder
Of course the platform matters - 'ö' does not fit in one byte in UTF-8, so you can't compare it as a character constant.
gnud
I like this the most so far as it seems to be working. However, it gives me two prints, apparently one for the umlaut character (195) and then another, which I assume is the letter code.
pg-robban
The problem is that most pre-Unicode languages (like, say, C) don't handle UTF-8 worth beans. If I were designing a language, I'd separate bytes from characters and build in support for standard Unicode formats.
David Thornley
What's the output of `locale charmap` in the terminal, and of calling `nl_langinfo(CODESET)` after the call to setlocale() in the c program?
gnud
locale charmap prints UTF-8, haven't tried compiling it in the terminal. I use the XCode console.
pg-robban
Should I print nl_langinfo(CODESET)? It says CODESET is undefined.
pg-robban
You have to include langinfo.h for CODESET to be defined.
gnud
It prints US-ASCII
pg-robban
By the way: I prefer this solution to mine. It is more elegant.
Michal Sznajder
Well, your terminal input is in UTF-8, but your locale is in ASCII. That's gonna cause some problems :)
gnud
+3  A: 

In general, encoding stuff is quite complicated. On the other hand, if you just want a dirty solution specific to your compiler/platform, then add something like this to your code:

printf("letter 0x%x is number %d\n", letter, letter_nr(letter));

It will give the hex value for your umlauts. Then just replace the letter in your if statements with that number.

EDIT You say that you are always getting 98, so your scanf got 98 + 97 = 195 = 0xC3 from the console. According to this table, 0xC3 is the first byte of the UTF-8 sequences for the accented LATIN SMALL LETTER characters in the Latin-1 block. Are you on Mac OS X?
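To see the two bytes yourself, here is a small sketch, assuming both your source file and your terminal use UTF-8; on such a setup it prints 0xc3 0xa5:

#include <stdio.h>

int main(void)
{
    /* "å" occupies two bytes in UTF-8 */
    const unsigned char *p = (const unsigned char *)"å";
    while (*p)
        printf("0x%02x ", *p++);
    putchar('\n');
    return 0;
}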

EDIT This is my final call. Quite hacky, but it works for me :)

#include <stdio.h>

// scanf for a letter. Return its position in the Morse table.
// Recognises UTF-8 for the Swedish letters.
int letter_nr()
{
  unsigned char letter;
  // scan for the first time
  scanf("%c", &letter);
  if(0xC3 == letter)
  {
    // we scanf again, since this is UTF-8 and a two-byte encoded character will come
    scanf("%c", &letter);
    // LATIN SMALL LETTER A WITH RING ABOVE = å
    if(0xA5 == letter)
      return 26;
    // LATIN SMALL LETTER A WITH DIAERESIS = ä
    if(0xA4 == letter)
      return 27;
    // LATIN SMALL LETTER O WITH DIAERESIS = ö
    if(0xB6 == letter)
      return 28;

    printf("Unknown letter. 0x%x. ", letter);
    return -1;
  }
  // it seems to be regular ASCII
  return letter - 97;
} // letter_nr

int main()
{
    while(1)
    {
        printf("Type a letter to get its position: ");

        int val = letter_nr();
        if(-1 != val)
          printf("Morse code is %d.\n", val);
        else
          printf("Unknown Morse code.\n");

        // strip the remaining newline
        unsigned char new_line;
        scanf("%c", &new_line);
    }
    return 0;
}
Michal Sznajder
Unfortunately, this seems to give me the same problem as before: I am getting the same hex values for these three letters.
pg-robban
Can you please explain where you get letter from? Should I make it a global variable and pass the reading into the letter_nr function?
pg-robban
This post shows a profound ignorance of UTF-8, and encodings in general. It's just plain wrong: the sum of the two bytes is NOT the Unicode code point. -1
gnud
I know that the sum of the two bytes is NOT the Unicode code point. But 0xC3 is the FIRST byte of the UTF-8 sequence for some letters.
Michal Sznajder
Sorry, I removed my -1. But still - checking if the byte equals 0xC3? Check if it's > 127, please! Otherwise, any UTF-8 sequence not starting with 0xC3 will yield wild results, because each byte in the sequence will be treated as ASCII.
gnud
This doesn't seem to work for me: 'a' and 'b' both return the value 0, 'c' gets 1, 'd' 2, etc. åäö return 98.
pg-robban
Edit: I now see that you removed the parameter. If I now type in any letter, I get: Program received signal: “EXC_BAD_ACCESS”.sharedlibrary apply-load-rules all
pg-robban
I messed with parameters to scanf... Silly me.
Michal Sznajder
This seems to be working properly, without any double prints. Thanks!
pg-robban
A: 

Hmmm ... at first I'd say the "funny" characters are not chars. You cannot pass one of them to a function accepting a char argument and expect it to work.

Try this (add the remaining bits):

char buf[100];
printf("Enter a string with funny characters: ");
fflush(stdout);
fgets(buf, sizeof buf, stdin);
/* now print it, as if it was a sequence of `char`s */
char *p = buf;
while (*p) {
    printf("The character '%c' has value %d\n", *p, *p);
    p++;
}

Now try the same with wide characters: #include <wchar.h> and replace printf with wprintf, fgets with fgetws, etc ...
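A sketch of what that replacement could look like, assuming setlocale(LC_ALL, "") is enough to make fgetws decode the terminal's multibyte input (as in gnud's answer above):

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wchar_t buf[100];
    setlocale(LC_ALL, "");  /* so multibyte input is converted to wide characters */
    wprintf(L"Enter a string with funny characters: ");
    fflush(stdout);
    if (fgetws(buf, sizeof buf / sizeof buf[0], stdin)) {
        wchar_t *p = buf;
        while (*p) {
            /* %lc prints the wide character, %d its numeric value */
            wprintf(L"The character '%lc' has value %d\n", *p, (int)*p);
            p++;
        }
    }
    return 0;
}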

pmg