views: 2153 · answers: 5

I'm looking for the equivalent of Windows' _wfopen() under Mac OS X. Any ideas?

I need this in order to port a Windows library that uses wchar_t* for its File interface. As this is intended to be a cross-platform library, I am unable to rely on how the client application will get the file path before giving it to the library.

A: 

If you're using Cocoa it's fairly easy with NSString. Just load the UTF-16 data using -initWithBytes:length:encoding: (or perhaps -initWithCString:encoding:) and then get a UTF-8 version by calling UTF8String on the result. Then just call fopen with your new UTF-8 string as the parameter.

You can definitely call fopen with a UTF-8 string, regardless of language. I can't help with the C++ side on OS X though, sorry.
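
For the non-Cocoa case raised in the comments, the same idea can be sketched with CoreFoundation instead of NSString; the helper name and the fixed-size buffer below are illustrative assumptions, not part of this answer:

#include <CoreFoundation/CoreFoundation.h>
#include <limits.h>
#include <cstdio>

// Hypothetical helper: convert a UTF-16 path to UTF-8 and hand it to fopen.
FILE* OpenUtf16Path(const UniChar* path16, CFIndex length, const char* mode)
{
    CFStringRef str = CFStringCreateWithCharacters(kCFAllocatorDefault, path16, length);
    if (!str)
        return NULL;

    char utf8[PATH_MAX];
    FILE* file = NULL;
    if (CFStringGetCString(str, utf8, sizeof(utf8), kCFStringEncodingUTF8))
        file = fopen(utf8, mode);

    CFRelease(str);
    return file;
}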

Greg Hurlman
I'm not using Cocoa and am using C++, not Objective-C. If you're right about giving a UTF-8 string to fopen(), I could convert my UTF-16 string to UTF-8, but how can I do that easily on Mac OS X (again using C/C++)?
Vincent Robert
Not a definitive answer since I rely on CFString instead of NSString but the basic idea is the same. Thank you.
Vincent Robert
A: 

@vincent:

Standard C functions accept UTF-8

Is that all functions? Where did you read this? That has big implications (in terms of convenience) for people porting to OS X but I've not read it anywhere else.

jkp
Read my reply below :)
Mecki
+2  A: 

You just want to open a file handle using a path that may contain Unicode characters, right? Just pass the path in filesystem representation to fopen.

  • If the path came from the stock Mac OS X frameworks (for example, an Open panel, whether Carbon or Cocoa), you won't need to do any conversion on it and will be able to use it as-is.

  • If you're generating part of the path yourself, you should create a CFStringRef from your path and then get that in filesystem representation to pass to POSIX APIs like open or fopen (see the sketch after this list).
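
A minimal sketch of that second case, assuming the path you built arrives as a UTF-8 C string (the helper name is illustrative only):

#include <CoreFoundation/CoreFoundation.h>
#include <limits.h>
#include <cstdio>

// Hypothetical helper: run a path you constructed yourself through the
// filesystem representation before giving it to fopen.
FILE* OpenWithFileSystemRepresentation(const char* utf8Path, const char* mode)
{
    CFStringRef str = CFStringCreateWithCString(kCFAllocatorDefault, utf8Path,
                                                kCFStringEncodingUTF8);
    if (!str)
        return NULL;

    char fsPath[PATH_MAX];
    FILE* file = NULL;
    // Produces the canonical form the filesystem expects (decomposed UTF-8 on HFS+)
    if (CFStringGetFileSystemRepresentation(str, fsPath, sizeof(fsPath)))
        file = fopen(fsPath, mode);

    CFRelease(str);
    return file;
}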

Generally speaking, you won't have to do a lot of that for most applications. For example, many applications may have auxiliary data files stored in the user's Application Support directory, but as long as the names of those files are ASCII, and you use standard Mac OS X APIs to locate the user's Application Support directory, you don't need to do a bunch of paranoid conversion of a path constructed from those two components.
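
For instance, a hedged sketch using the Carbon Folder Manager to locate that directory; the "MyApp/settings.plist" name is a placeholder, not anything from this answer:

#include <CoreServices/CoreServices.h>
#include <limits.h>
#include <cstdio>

// Locate the user's Application Support directory and open an ASCII-named
// file inside it; no manual encoding conversion is needed.
FILE* OpenSupportFile(const char* mode)
{
    FSRef folder;
    if (FSFindFolder(kUserDomain, kApplicationSupportFolderType,
                     kDontCreateFolder, &folder) != noErr)
        return NULL;

    char path[PATH_MAX];
    if (FSRefMakePath(&folder, (UInt8*)path, sizeof(path)) != noErr)
        return NULL;

    char full[PATH_MAX];
    snprintf(full, sizeof(full), "%s/MyApp/settings.plist", path);  // placeholder name
    return fopen(full, mode);
}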

Edited to add: I would strongly caution against arbitrarily converting everything to UTF-8 using something like wcstombs, because the filesystem encoding is not necessarily identical to the UTF-8 you generate. Mac OS X and Windows both use specific (but different) canonical decomposition rules for the encoding used in filesystem paths.

For example, they need to decide whether "é" will be stored as one or two code units (either LATIN SMALL LETTER E WITH ACUTE or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). These will result in two different — and different-length — byte sequences, and both Mac OS X and Windows work to avoid putting multiple files with the same name (as the user perceives them) in the same directory.

The rules for how to perform this canonical decomposition can get pretty hairy, so rather than trying to implement it yourself, it's best to let the functions the system frameworks provide do the heavy lifting.
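
A quick sketch that should show the normalization at work; the byte literals are simply the two UTF-8 spellings of "café":

#include <CoreFoundation/CoreFoundation.h>
#include <cstdio>
#include <cstring>

int main()
{
    // "é" precomposed (U+00E9) vs decomposed (U+0065 U+0301), written as UTF-8 bytes
    CFStringRef precomposed = CFStringCreateWithCString(kCFAllocatorDefault,
                                  "caf\xC3\xA9", kCFStringEncodingUTF8);
    CFStringRef decomposed  = CFStringCreateWithCString(kCFAllocatorDefault,
                                  "cafe\xCC\x81", kCFStringEncodingUTF8);

    char a[64], b[64];
    CFStringGetFileSystemRepresentation(precomposed, a, sizeof(a));
    CFStringGetFileSystemRepresentation(decomposed,  b, sizeof(b));

    // Both should come out in the same canonical form the filesystem uses
    printf("same bytes: %s\n", strcmp(a, b) == 0 ? "yes" : "no");

    CFRelease(precomposed);
    CFRelease(decomposed);
    return 0;
}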

Chris Hanson
+1  A: 

@JKP:

Not all functions in Mac OS X accept UTF-8, but file names and file paths may be UTF-8, thus all POSIX functions dealing with file access (open, fopen, stat, etc.) accept UTF-8.

See here. Quote:

How a file name looks at the API level depends on the API. Current Carbon APIs handle file names as an array of UTF-16 characters; POSIX ones handle them as an array of UTF-8, which is why UTF-8 works well in Terminal. How it's stored on disk depends on the disk format; HFS+ uses UTF-16, but that's not important in most cases.

Some other POSIX functions handle UTF-8 as well. For example, functions dealing with user names, group names, or user passwords use UTF-8 to store the information (thus a user name can be Japanese and your password can be Chinese, no problem).

But not all functions handle UTF-8. For example, to the C string functions a UTF-8 string is just a normal C string, and characters above 126 have no special meaning; they don't understand the concept of multiple bytes (chars in C) forming a single Unicode character. How other APIs handle a char* pointer passed to them differs from API to API. However, as a rule of thumb you can say:

Either the function only accepts C strings with pure ASCII characters (in the range 0 to 126) or it accepts UTF-8. Functions usually don't take characters above 126 and interpret them in an encoding other than UTF-8; if one really does, that fact is documented and there is a way to pass the encoding along with the string.
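
A small illustration of why the plain C string functions are encoding-agnostic, assuming a UTF-8 literal:

#include <cstdio>
#include <cstring>

int main()
{
    // "café" in UTF-8: the "é" takes two bytes, but strlen counts bytes, not characters
    const char* s = "caf\xC3\xA9";
    printf("strlen: %zu\n", strlen(s));  // prints 5, although the user sees 4 characters
    return 0;
}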

Mecki
+2  A: 

The POSIX APIs in Mac OS X accept UTF-8 strings. To convert a wchar_t string to UTF-8, you can use Mac OS X's CoreFoundation framework.

Here is a class that wraps a UTF-8 string generated from a wchar_t string.

#include <CoreFoundation/CoreFoundation.h>
#include <cwchar>

class Utf8
{
public:
    Utf8(const wchar_t* wsz): m_utf8(NULL)
    {
        // OS X uses 32-bit wchar_t, so the buffer is UTF-32 in host byte order
        const int bytes = wcslen(wsz) * sizeof(wchar_t);
        // comp_bLittleEndian is in the lib I use in order to detect PowerPC/Intel
        CFStringEncoding encoding = comp_bLittleEndian ? kCFStringEncodingUTF32LE
                                                       : kCFStringEncodingUTF32BE;
        // Wrap the wchar_t buffer in a CFString without copying it
        CFStringRef str = CFStringCreateWithBytesNoCopy(NULL,
                                                        (const UInt8*)wsz, bytes,
                                                        encoding, false,
                                                        kCFAllocatorNull);

        // Convert to the filesystem representation (UTF-8, canonically decomposed)
        const int bytesUtf8 = CFStringGetMaximumSizeOfFileSystemRepresentation(str);
        m_utf8 = new char[bytesUtf8];
        CFStringGetFileSystemRepresentation(str, m_utf8, bytesUtf8);
        CFRelease(str);
    }

    ~Utf8()
    {
        if (m_utf8)
        {
            delete[] m_utf8;
        }
    }

public:
    operator const char*() const { return m_utf8; }

private:
    char* m_utf8;
};

Usage:

const wchar_t* wsz = L"Here is some Unicode content: éà€œæ";
const Utf8 utf8 = wsz;
FILE* file = fopen(utf8, "r");

This will work for reading or writing files.

Vincent Robert