Guess encoding when creating an NSString from NSData

In general, you can’t. However, you can quite reliably identify UTF-8 files – if a file is valid UTF-8, it’s not very likely that it’s supposed to be any other encoding (except if all the bytes are in the ASCII range, in which case any “extended ASCII” encoding, including UTF-8, will give you the same result). All Unicode encodings also have an optional BOM which identifies them. So a reasonable approach would be:

1. Look for a valid BOM. If there is one, use the appropriate encoding.
2. Otherwise, try to interpret it as UTF-8. You can do this by calling -[NSString initWithData:encoding:] with NSUTF8StringEncoding and checking if the result is non-nil.
3. If that fails, use an appropriate default non-UTF-8 encoding, such as +[NSString defaultCStringEncoding] (which provides a locale-appropriate guess).
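
The three steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: GuessedStringFromData is a hypothetical helper name (not a Foundation API), the code assumes ARC, and it treats FF FE 00 00 as a UTF-32LE BOM even though it could in principle be UTF-16LE text starting with a NUL character.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: BOM check, then UTF-8, then the locale default.
// Returns nil if every attempt fails; *outEncoding is set only on success.
static NSString *GuessedStringFromData(NSData *data, NSStringEncoding *outEncoding) {
    if (data == nil) return nil;
    const uint8_t *b = data.bytes;
    NSUInteger n = data.length;
    NSStringEncoding bomEncoding = 0;
    NSUInteger bomLength = 0;

    // Step 1: look for a BOM. UTF-32 is tested before UTF-16 because the
    // UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) {
        bomEncoding = NSUTF32BigEndianStringEncoding; bomLength = 4;
    } else if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) {
        bomEncoding = NSUTF32LittleEndianStringEncoding; bomLength = 4;
    } else if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        bomEncoding = NSUTF8StringEncoding; bomLength = 3;
    } else if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        bomEncoding = NSUTF16BigEndianStringEncoding; bomLength = 2;
    } else if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        bomEncoding = NSUTF16LittleEndianStringEncoding; bomLength = 2;
    }

    NSString *result = nil;
    NSStringEncoding used = 0;
    if (bomLength > 0) {
        // Decode the bytes after the BOM with the encoding it announced.
        NSData *payload = [data subdataWithRange:NSMakeRange(bomLength, n - bomLength)];
        result = [[NSString alloc] initWithData:payload encoding:bomEncoding];
        used = bomEncoding;
    }

    // Step 2: no usable BOM, so try UTF-8; nil means the bytes are not valid UTF-8.
    if (result == nil) {
        result = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        used = NSUTF8StringEncoding;
    }

    // Step 3: fall back to the locale-appropriate default encoding.
    if (result == nil) {
        used = [NSString defaultCStringEncoding];
        result = [[NSString alloc] initWithData:data encoding:used];
    }

    if (result != nil && outEncoding != NULL) *outEncoding = used;
    return result;
}
```

A caller would pass the raw NSData plus the address of an NSStringEncoding variable, and should still be prepared for a nil result if even the fallback encoding cannot decode the bytes.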

You could try to improve the guess in the last step by decoding the data with several different encodings and choosing the one that produces the fewest sequences of letters with junk in the middle, where “junk” is any character that’s not a letter, space or common punctuation mark. This would significantly increase complexity while still not being reliable.
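
For illustration only, such a junk-counting heuristic might look something like the sketch below. JunkScore and LeastJunkyEncoding are hypothetical names, the candidate list is an arbitrary assumption, and "junk" is approximated with Foundation's letter, whitespace and punctuation character sets. Many single-byte encodings map the same byte to a letter, so ties are common, which is exactly why the approach is unreliable.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts characters that are not a letter, whitespace
// or punctuation and that sit directly between two letters.
static NSUInteger JunkScore(NSString *s) {
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    NSMutableCharacterSet *acceptable = [letters mutableCopy];
    [acceptable formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    [acceptable formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];

    NSUInteger score = 0;
    for (NSUInteger i = 1; i + 1 < s.length; i++) {
        unichar c = [s characterAtIndex:i];
        if (![acceptable characterIsMember:c] &&
            [letters characterIsMember:[s characterAtIndex:i - 1]] &&
            [letters characterIsMember:[s characterAtIndex:i + 1]]) {
            score++;
        }
    }
    return score;
}

// Hypothetical helper: decodes `data` with each candidate and returns the
// encoding with the lowest junk score (first candidate wins ties), or 0 if
// none of the candidates can decode the data.
static NSStringEncoding LeastJunkyEncoding(NSData *data) {
    // Assumed candidate list; a real application would pick its own.
    const NSStringEncoding candidates[] = {
        NSWindowsCP1252StringEncoding,
        NSISOLatin1StringEncoding,
        NSMacOSRomanStringEncoding,
    };
    NSStringEncoding best = 0;
    NSUInteger bestScore = NSUIntegerMax;
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        NSString *decoded = [[NSString alloc] initWithData:data encoding:candidates[i]];
        if (decoded == nil) continue;
        NSUInteger score = JunkScore(decoded);
        if (score < bestScore) {
            bestScore = score;
            best = candidates[i];
        }
    }
    return best;
}
```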

In short, to be able to handle all available encodings you need to do what TextEdit does: shunt the decision over to the user.

Oh, one more thing: as of Mac OS X 10.5, the encoding is often stored with a file in the undocumented com.apple.TextEncoding extended attribute. If you open a file with +[NSString stringWithContentsOfFile:usedEncoding:error:] or similar, this will automatically be used if present.
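
When the data comes from a file, letting Foundation pick the encoding might look like the sketch below. StringFromFileAtPath is a hypothetical wrapper name. The method reports the encoding it settled on through the usedEncoding argument and returns nil with an error when it cannot decide.

```objc
#import <Foundation/Foundation.h>

// Hypothetical wrapper: reads a text file and lets NSString determine the
// encoding. Returns nil if the file cannot be read or decoded.
static NSString *StringFromFileAtPath(NSString *path, NSStringEncoding *outEncoding) {
    NSStringEncoding usedEncoding = 0;
    NSError *error = nil;
    NSString *contents = [NSString stringWithContentsOfFile:path
                                               usedEncoding:&usedEncoding
                                                      error:&error];
    if (contents == nil) {
        NSLog(@"Could not read %@: %@", path, error);
        return nil;
    }
    if (outEncoding != NULL) *outEncoding = usedEncoding;
    return contents;
}
```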
