I want to use NSFileHandle to write large text files so that I don't have to hold very large NSStrings in memory. My problem is that when I create the file this way and open it in TextEdit (Mac), the Unicode characters are not displayed correctly. If I write the same text to a file using the NSString writeToFile:atomically:encoding:error: method, TextEdit displays everything correctly.

I'm opening both files in TextEdit with the "opening files encoding" option set to automatic, so I'm not sure why one method works and the other doesn't. Is there some form of header that declares the format as UTF-8?

// Standard string
NSString *myString = @"This is a test with a star character \u272d";

// This works fine
// Displays: "This is a test with a star character ✭" in TextEdit
[myString writeToFile:path atomically:YES encoding:NSUTF8StringEncoding error:NULL];

// This doesn't work
// Displays: "This is a test with a star character ‚ú≠" in TextEdit
[fileManager createFileAtPath:path contents:nil attributes:nil];
fileHandle = [NSFileHandle fileHandleForWritingAtPath:path];
[fileHandle writeData:[myString dataUsingEncoding:NSUTF8StringEncoding]];
+1  A: 

The problem is not with your code, but with TextEdit: It doesn't try to decode the file as UTF-8 unless it has a UTF-8 BOM identifying it as such. Presumably, the first version of your code adds such a BOM. See this question for further discussion.

UTF-8 data generally should not include a BOM, so you probably shouldn't modify your code from the second version at all—it's working correctly. If opening the file in TextEdit has to work, you should be able to force the BOM by including it (\ufeff) explicitly at the start of the string, but, again, you should not do that unless you really need to.
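
For illustration only, here is a minimal sketch of what forcing the BOM means at the byte level, written in plain C (which also compiles as Objective-C). The function name `write_utf8_with_bom` is hypothetical and not part of any Apple API, and the sketch assumes the text is already UTF-8 encoded. With NSFileHandle the same idea would apply: write these three bytes once, before the first chunk of text.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the UTF-8 BOM (EF BB BF) followed by the
   bytes of an already UTF-8 encoded, NUL-terminated string.
   Returns 0 on success, -1 on failure. */
int write_utf8_with_bom(const char *path, const char *utf8_text)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    size_t len = strlen(utf8_text);

    FILE *f = fopen(path, "wb");
    if (f == NULL) {
        return -1;
    }

    /* The BOM goes first, exactly once, then the payload. */
    int ok = fwrite(bom, 1, sizeof bom, f) == sizeof bom
          && fwrite(utf8_text, 1, len, f) == len;

    /* fclose flushes buffered data, so its result matters too. */
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

If the file is written in several chunks, only the first chunk should be preceded by the BOM; repeating it later would insert a stray U+FEFF character into the text.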

Peter Hosey
Great, thanks for the answer and the link to the other question! I understand why it's happening now. I inspected the two files to see whether the NSString method writes a BOM. It turns out it doesn't, but it does set an extended attribute on the file. I've created an NSString category for setting this flag (adapted from some code I found online): http://gist.github.com/543667 Hope this helps anyone else with this issue!
Michael Waterfall
Just a quick question: I've looked up the BOM, and the UTF-8 BOM is stated to be `EF BB BF`. So I'm wondering how `\ufeff` ends up being output as `EF BB BF`? Thanks!
Michael Waterfall
UTF-8 is an encoding; encodings transform characters into bytes. `\ufeff` is a character; `ef bb bf` is a sequence of bytes. It's the encoding that transforms that character into that sequence. If you want to know how that transformation works, look at chapter 3 of the Unicode Standard. http://unicode.org/versions/latest/
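
As a rough sketch of that transformation, the bit layout for UTF-8 can be written out in plain C (which also compiles as Objective-C). The function name `utf8_encode` is made up for this example and is not a system API; in real code you would let NSString do the encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper: encodes one Unicode code point as UTF-8 into out.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF and so cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                     /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

U+FEFF is binary 1111 1110 1111 1111. It falls in the three-byte range, so its 16 bits are split 4 + 6 + 6 into 1110**1111** 10**111011** 10**111111**, which is `EF BB BF`. The star from the question, U+272D, comes out as `E2 9C AD` by the same rule; those three bytes read as Mac OS Roman are exactly the "‚ú≠" that TextEdit displayed.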
Peter Hosey