Hi there,

I've got an international character stored in a unichar variable. This character does not come from a file or URL. The variable only stores an unsigned short (0xce91), which is the UTF-8 encoding of the Greek capital letter Alpha ('Α'). I'm trying to put that character into an NSString variable, but I fail miserably.

I've tried two different ways, both of which were unsuccessful:

unichar greekAlpha = 0xce91; //could have written greekAlpha = 'Α' instead.

NSString *theString = [NSString stringWithFormat:@"Greek Alpha: %C", greekAlpha];

No good. I get some weird Chinese characters. As a side note, this works perfectly with English characters.

Then I also tried this:

NSString *byteString = [[NSString alloc] initWithBytes:&greekAlpha
                                                length:sizeof(unichar)
                                              encoding:NSUTF8StringEncoding];

But this doesn't work either. I'm obviously doing something terribly wrong, but I don't know what. Can someone help me, please? Thanks!

+7  A: 

Since 0xce91 is the UTF-8 encoding and %C expects a UTF-16 code unit, a simple format string like the one above won't work. For stringWithFormat:@"%C" to work, you need to pass 0x391, which is the UTF-16 code unit for that character.
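
For example, using the UTF-16 value directly (a minimal sketch; the variable name is only illustrative):

unichar greekAlpha = 0x0391; // U+0391 GREEK CAPITAL LETTER ALPHA as a UTF-16 code unit
NSString *theString = [NSString stringWithFormat:@"Greek Alpha: %C", greekAlpha];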

In order to create a string from the UTF-8 encoded value, you first need to split it into its octets and then use initWithBytes:length:encoding:.

unichar utf8char = 0xce91; // two UTF-8 bytes packed into one 16-bit value
char chars[2];
int len = 1;

if (utf8char > 127) {
    // Split the packed value into its two UTF-8 octets.
    chars[0] = (utf8char >> 8) & 0xFF;
    chars[1] = utf8char & 0xFF;
    len = 2;
} else {
    // A single ASCII byte is already valid UTF-8.
    chars[0] = utf8char;
}

NSString *string = [[NSString alloc] initWithBytes:chars
                                            length:len
                                          encoding:NSUTF8StringEncoding];
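
Logging the result should then print the Greek capital Alpha (assuming the bytes were packed as above):

NSLog(@"%@", string); // expected output: Α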
m5h
According to http://www.fileformat.info/info/unicode/char/0391/index.htm, 0xce91 is the value of the Greek capital Alpha in UTF-8 and 0x391 in UTF-16. The thing is that I rely on the value returned from the compiler, which is in UTF-8 format (0xce91). At this point I would have to somehow convert the UTF-8 value to UTF-16 and then do what you suggested, unless there is a way to feed the UTF-8 value directly into the NSString without that extra step.
Terry
I realized the same after I looked closer at the link I posted. I updated my answer with this information and a solution to your problem.
m5h
Thank you. This is exactly what I was looking for! So then, my bits were scrambled :). Even though I'm a new member on this site, I've been using it for quite some time now (for C# stuff mostly; I'm just getting my feet wet with Objective-C), and I find it amazing how far some people will go to help others. Once again, thank you! :)
Terry
A: 

The above answer is great, but it doesn't account for UTF-8 sequences longer than two bytes, e.g. the ellipsis symbol (0xE2, 0x80, 0xA6). Here's a tweak to the code:

// Note: chars must be declared as at least char[4] so the null terminator fits.
if (utf8char > 65535) {
    chars[0] = (utf8char >> 16) & 255;
    chars[1] = (utf8char >> 8) & 255;
    chars[2] = utf8char & 255;
    chars[3] = 0x00;
} else if (utf8char > 127) {
    chars[0] = (utf8char >> 8) & 255;
    chars[1] = utf8char & 255;
    chars[2] = 0x00;
} else {
    chars[0] = utf8char;
    chars[1] = 0x00;
}
NSString *string = [[[NSString alloc] initWithUTF8String:chars] autorelease];

Note the different string initialisation method, which doesn't require a length parameter but does require the byte buffer to be null-terminated.
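
As a usage sketch (assuming utf8char is declared with a type wide enough to hold three packed bytes, e.g. unsigned int, and chars is declared as char[4]):

unsigned int utf8char = 0xE280A6; // the three UTF-8 bytes of the ellipsis packed into one value
char chars[4];
// The if/else branches above then unpack the bytes, null-terminate the buffer,
// and initWithUTF8String: produces the ellipsis string.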

Jon Jardine
but 'unichar' is a 16-bit type, so `utf8char` could not hold a value longer than 16 bits.
David Gelhar