views:

651

answers:

3

Consider this example:

zh_Hant_HK format = yy'年'M'月'd'日' ah:mm

Not sure if you can see it, but I see a lot of Chinese characters in there. I got that string from a date formatter for an Asian locale. Do I have to consider anything special when going through this string "character" by "character", i.e. looking at every char separately?

A: 

Depends on the representation of the string.

Once upon a time, we had simple string representations (e.g., ASCII) in which every character code took up a single unit of space in the string (8 bits, ignoring the topmost). [There were earlier string representations with 6- and 9-bit units, but they had the same property of fixed-size units.]

Handling non-English languages (Eastern Europe, Asia, ...) led people to propose various kinds of so-called "double-byte character strings" (DBCS). In these, common characters (pretty much the same set as the ASCII characters) occupy a single unit, now almost universally 8 bits, while the other characters are encoded as two bytes: a first byte taken from the part of the 8-bit space that ASCII doesn't need, and a second byte. This gives a character encoding scheme with roughly 15-bit characters.

Tearing apart such strings is messy because the routine that does so has to understand the exact DBCS encoding scheme, and pick up 1 or 2 bytes at a time in accordance.
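To make that concrete, here is a minimal sketch in plain C (which also compiles as Objective-C) of walking a DBCS string. The lead-byte range 0x81..0xFE is an assumption chosen for illustration only; every real DBCS encoding defines its own ranges, and the names is_lead_byte and dbcs_char_count are made up for this example.

```c
#include <stddef.h>

/* ASSUMPTION: bytes 0x81..0xFE start a two-byte character.
   Real DBCS encodings each define their own lead-byte ranges. */
static int is_lead_byte(unsigned char b) {
    return b >= 0x81 && b <= 0xFE;
}

/* Count characters (not bytes) in a NUL-terminated DBCS string. */
size_t dbcs_char_count(const unsigned char *s) {
    size_t count = 0;
    while (*s != 0) {
        if (is_lead_byte(*s) && s[1] != 0) {
            s += 2;   /* lead byte plus its trail byte form one character */
        } else {
            s += 1;   /* single-byte character, or a truncated pair at the end */
        }
        count++;
    }
    return count;
}
```

The point is that the loop cannot advance by a fixed step: it has to inspect each byte to decide how far to move, and that decision is only right if it matches the actual encoding in use.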

Along came Unicode, to solve the problem by providing 16-bit characters. Most modern programming languages (Java, C#) use those 16-bit characters as the basis of their string representations. Life got a lot easier (if we ignore the fact that even 16-bit Unicode sometimes allows two sequential characters to be composed to form what amounts to another character already defined in the set).

The committee that enhances Unicode, however, couldn't resist, and extended Unicode beyond 16 bits. We're now stuck back with the dumb DBCS scheme (actually worse, some take several bytes, IIRC) that Unicode was supposed to fix. So, to process strings in those modern languages, you again have to understand when a unit represents a single character, and when it is the lead-in to a multi-unit sequence.
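For 16-bit (UTF-16) strings, the lead-in mechanism is the surrogate pair: a unit in 0xD800..0xDBFF followed by one in 0xDC00..0xDFFF together encode one character above U+FFFF. A rough sketch in plain C (also valid Objective-C), with uint16_t standing in for a 16-bit character type such as unichar; the function names are invented for this example:

```c
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates: 0xD800..0xDBFF. Low (trail) surrogates: 0xDC00..0xDFFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Decode the code point starting at units[*i] and advance *i past it.
   An unpaired surrogate is returned unchanged as a single unit. */
uint32_t utf16_next(const uint16_t *units, size_t len, size_t *i) {
    uint16_t hi = units[(*i)++];
    if (is_high_surrogate(hi) && *i < len && is_low_surrogate(units[*i])) {
        uint16_t lo = units[(*i)++];
        return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                        + ((uint32_t)lo - 0xDC00u);
    }
    return hi;
}

/* Count code points rather than 16-bit units. */
size_t utf16_code_point_count(const uint16_t *units, size_t len) {
    size_t i = 0, count = 0;
    while (i < len) {
        (void)utf16_next(units, len, &i);
        count++;
    }
    return count;
}
```

Note that this only handles surrogate pairs. Composed sequences (a base character followed by combining marks) are a separate problem that this sketch does not address.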

If you're lucky, the string you have is composed only of 16 bit single characters in Unicode. If not, you'll need to consult your Unicode manual and pray that you have a Unicode string management library to help you do this right.
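Whether you are "lucky" can be checked mechanically. A small sketch in plain C (also valid Objective-C), assuming the string has already been copied out into a buffer of 16-bit units; utf16_all_single_units is a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when no unit lies in the surrogate range 0xD800..0xDFFF,
   i.e. every 16-bit unit is a complete code point on its own. */
int utf16_all_single_units(const uint16_t *units, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (units[i] >= 0xD800 && units[i] <= 0xDFFF) {
            return 0;
        }
    }
    return 1;
}
```

The characters in the question's format string (年 is U+5E74, 月 is U+6708, 日 is U+65E5) all fit in a single 16-bit unit, so a buffer holding them passes this check. It says nothing about combining sequences, which can still span several units.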

This last bit is such a colossal hassle that a lot of coders punt and stick with treating Unicode as single-width characters. Works in Europe. Not recommended in Asia.

Ira Baxter
This answer really has nothing specific to do with Objective-C or Cocoa, so while broadly enlightening about very basic encoding topics, it has little to offer for answering the question.
groundhog
The question from the OP was whether he could do this character by character. The general background explains why he'd better do that only if he is sure how his representation works, and which characters he has to handle.
Ira Baxter
+1  A: 

If your string is aware of its encoding (which it should be if it came from a date formatter), then you can just get the unichar representation using characterAtIndex:, or whichever way you prefer to access the individual characters.

Knowing what you want to do with the characters would help. Breaking the string up into substrings is likely the best approach, since the substrings carry their encoding and locale around with them.

groundhog
So rather than fetching lots of unichars, I extract "1-char substrings" as NSString?
HelloMoon
If it's important for you to keep the locale, yes: break it into an NSArray of NSStrings.
groundhog
+2  A: 

No, you do not need to take any special care when you look at the characters of an NSString one at a time. NSString is built to work with Unicode strings.

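// -length returns an NSUInteger, so use the same type for the index.
for (NSUInteger index = 0; index < [myString length]; index++) {
    // Each element is one 16-bit unichar.
    unichar ch = [myString characterAtIndex:index];
    // Do stuff to unichar...
}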

One thing you should do is always treat a character retrieved from an NSString as the unichar type. The unichar type is not equivalent to wchar_t or any other Unicode character type.

PeyloW
Could I run into problems like those Ira Baxter described?
HelloMoon
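No, not for a string like this one. NSString behaves as an array of unichars (16-bit UTF-16 units) plus a count. The downside is that a character outside the 16-bit range does not fit in a single unichar and takes two of them (a surrogate pair); the upside is that for characters within 16 bits, such as the ones in your format string, you run into no problems once the NSString exists.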
PeyloW