ansaurus

Question

UTF-8 decoder fails on non-ASCII characters

Answer 1

+4 A:

The char type is allowed to be signed. When a negative char is converted to int and then to unsigned (which is effectively what happens when you convert it directly to unsigned), the value is sign-extended, and that shows the error:

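#include <stdio.h>

int main(void) {
  char c = '\xF4';    /* negative if char is signed */
  int i = c;          /* sign-extended to int */
  unsigned n = i;     /* wraps to a large unsigned value */
  printf("%X\n", n);
  n = c;              /* converting directly gives the same result */
  printf("%X\n", n);
  return 0;
}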

Prints (on a platform where char is signed and int is 32 bits):

FFFFFFF4
FFFFFFF4

Use unsigned char instead.
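As a rough sketch of what that looks like in a decoder, the helper below casts each byte to unsigned char before inspecting it. The name utf8_sequence_length is hypothetical and not taken from the question's code; it only classifies a lead byte and does no validation beyond that.

```c
/* Hypothetical helper (not from the original question): returns how many
   bytes a UTF-8 sequence occupies, judging only by its lead byte.
   Returns 0 for a byte that cannot start a sequence. */
int utf8_sequence_length(char c)
{
    /* Cast first, so the value is 0..255 even where plain char is signed. */
    unsigned char b = (unsigned char)c;

    if (b < 0x80)         return 1;  /* 0xxxxxxx: ASCII */
    if ((b >> 5) == 0x06) return 2;  /* 110xxxxx */
    if ((b >> 4) == 0x0E) return 3;  /* 1110xxxx */
    if ((b >> 3) == 0x1E) return 4;  /* 11110xxx */
    return 0;                        /* continuation or invalid lead byte */
}
```

Without the cast, a byte such as '\xF4' would be negative on a signed-char platform, and the shifts and comparisons above would no longer match the intended bit patterns.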

Roger Pate 2010-09-24 14:15:48

Thank you very much! Changing the prototype of `ctou` to this fixed it: `string ctou(unsigned char *old);`

Delan Azabani 2010-09-24 14:18:21

Answer 2

+2 A:

You've probably ignored the fact that char is a signed type on your platform. Always use:

unsigned char if you will be reading the actual values of bytes.
signed char if you're using bytes as small signed integers.
char for abstract strings where you don't care about the values except perhaps for 0.

By the way, your code is extremely inefficient. Instead of calling realloc once per character, why not allocate sizeof(unsigned)*(strlen(old)+1) bytes up front, then shrink the buffer at the end if it is too big? A string of n bytes can never decode to more than n code points, so that size is always enough. Of course this is only one of the many inefficiencies.
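A minimal sketch of that allocation strategy follows, written in C. The function decode_utf8 and its signature are assumptions for illustration, not the asker's ctou. It assumes well-formed input and does not reject overlong or otherwise invalid sequences.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical decoder: allocates the worst-case buffer once, fills it,
   then shrinks it. Returns a 0-terminated array of code points that the
   caller must free, or NULL if allocation fails. */
unsigned *decode_utf8(const char *old, size_t *count)
{
    const unsigned char *p = (const unsigned char *)old;
    size_t cap = strlen(old) + 1;          /* at most one code point per byte */
    unsigned *out = malloc(cap * sizeof *out);
    size_t n = 0;

    if (out == NULL)
        return NULL;

    while (*p) {
        unsigned cp;
        int extra;                         /* continuation bytes expected */

        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;

        /* Stop early at the terminator or at a non-continuation byte. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);

        out[n++] = cp;
    }
    out[n] = 0;

    /* Single shrink at the end; keep the larger buffer if it fails. */
    unsigned *shrunk = realloc(out, (n + 1) * sizeof *out);
    if (shrunk != NULL)
        out = shrunk;

    *count = n;
    return out;
}
```

This makes one malloc call and one realloc call per string, however long the input is.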

R.. 2010-09-24 14:18:14
