A customer is complaining that our code used to write files with Japanese characters in the filename but no longer works in all cases. We have always just used good old char * strings to represent filenames, so it came as a bit of a shock to me that it ever worked, and we haven't done anything I am aware of that should have made it stop working. I had them send me a file with an embedded filename in it exported from our software, and it looks like the strings use bytes 0x82 and 0x83 as the lead byte of a double-byte sequence to represent the Japanese characters. Poking around online leads me to believe this is probably SHIFT_JIS and/or Windows codepage 932.

It looks to me like what is happening is previously both fopen and ofstream::open accepted filenames using this codepage; now only fopen does. I've checked the Visual Studio fopen docs, and I see no hint of what makes an acceptable string to pass to fopen.

In the short run, I'm hoping someone can shed some light on the specific Windows fopen versus ofstream::open issue for me. In the long run, I'd really like to know the accepted way of opening Unicode (and other?) filenames in C++, on Windows, Linux, and OS X.

Edited to add: I believe that the opens that work are done in the "C" locale, whereas the ones that do not work are done in whatever the customer's default locale is. However, that has been the case for years now, and the old version of the program still works today on their system, so this seems a longshot for explaining the issue we are seeing.

Update: I sent off a small test program to the customer. It has verified that fopen works fine with the SHIFT_JIS filename, and std::ofstream does not. This is in Visual Studio 2005, and happened regardless of whether I used the default locale or the "C" locale.

I'm still interested if anyone has an explanation for this behavior (and why it mysteriously changed -- perhaps a service pack of VS2005?) and hoping to put together a comprehensive "best practices" for handling Unicode filenames in portable C++ code.

A: 

I'm nearly certain that on Linux, the filename string is a UTF-8 string (on the EXT3 filesystem, for example, the only disallowed chars are slash and NUL), stored in a normal char *. The man page doesn't seem to mention character encoding, which is what leads me to believe it is the system standard of UTF-8. OS X likely uses the same, since it comes from similar roots, but I am less sure about this.

rmeador
No, all native Linux filesystems ignore character encoding (however, some non-native FS do care). File names are byte strings and the only special characters are slash and null. Any encodings must be handled by the shell.
Zan Lynx
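
To illustrate the "file names are byte strings" point in the comment above, here is a minimal sketch, assuming a native Linux filesystem (ext3/ext4, tmpfs and the like) and a writable current directory. The helper name roundtrip_bytes_filename and the constant kSjisName are made up for this example. The name is deliberately not valid UTF-8; other systems (OS X, for instance) may reject it.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: create a file whose name is an arbitrary byte string,
// write a short marker into it, read the marker back, then delete the file.
// Returns true only if every step succeeds.
bool roundtrip_bytes_filename(const std::string& name) {
    std::FILE* out = std::fopen(name.c_str(), "wb");
    if (!out) return false;
    std::fputs("ok", out);
    std::fclose(out);

    std::FILE* in = std::fopen(name.c_str(), "rb");
    if (!in) return false;
    char buf[8] = {0};
    std::size_t n = std::fread(buf, 1, sizeof buf - 1, in);
    std::fclose(in);

    std::remove(name.c_str());
    return std::string(buf, n) == "ok";
}

// The Shift-JIS bytes for the katakana "tesuto" followed by ".txt".
// None of the bytes is '/' or NUL, so the kernel treats it as an ordinary name.
const char kSjisName[] = "\x83\x65\x83\x58\x83\x67.txt";
```

Calling roundtrip_bytes_filename(kSjisName) should succeed on such a filesystem even though no locale or encoding was ever configured; how the name is displayed afterwards is up to the shell or file manager.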
A: 

You may have to set the thread locale to the system default locale. See here for a possible reason for your problems: http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100887

Stefan
Hmmm... this is interesting. Looking at my code, it's possible that the opens that work are always in the "C" locale, whereas the ones that fail are in whatever the user's machine is in. However, that is not something that has changed recently on our end....
Sol
Did you upgrade your visual studio? If yes, then that's the change on your end. If not, then I'm sorry I'm out of ideas...
Stefan
Nope, Visual Studio 2005 throughout.
Sol
+2  A: 

I'm not aware of any portable way of opening files with Unicode names using only the default system libraries. But there are some frameworks that provide portable functions, for example:

  • for C: glib uses filenames in UTF-8;
  • for C++: glibmm also uses filenames in UTF-8, requires glib;
  • for C++: boost can use wstring for filenames.

I'm pretty sure the .NET/Mono frameworks also contain portable filesystem functions, but I don't know them.

Tometzky
A: 

Mac OS X uses Unicode as its native character encoding. The basic string objects are CFString and NSString. They store their contents as arrays of Unicode characters.

mouviciel
+1  A: 

Functions like fopen or ofstream::open take the file name as char *, but that is interpreted as being in the system code page.

It means that it can be a Japanese character represented as Shift-JIS (cp932), or Chinese Simplified (GBK/cp936), Korean, Arabic, Russian, you name it (as long as it matches the OS system code page).

It also means that it can use Japanese file names on a Japanese system only. Change the system code page and the application "stops working". I suspect this is what happens here (there have been no big changes in this area of Windows since Windows 2000).

This is how you change the system code page: http://www.mihai-nita.net/article.php?artID=20050611a

In the long run you might consider moving to Unicode (and using _wfopen, wofstream).
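
As a rough sketch of what "moving to Unicode" could look like in code that also has to build on Linux and OS X: keep file names as UTF-8 in a std::string everywhere, and convert to UTF-16 only at the Windows boundary. _wfopen is Microsoft-specific, so it is hidden behind an #ifdef. The function name open_utf8 is made up for this example, and the non-Windows branch assumes the filesystem accepts UTF-8 names as they are.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: open a file whose name is given as UTF-8.
// Returns nullptr on failure, like fopen.
std::FILE* open_utf8(const std::string& utf8_path, const char* mode) {
#ifdef _WIN32
    // Convert a UTF-8 string to UTF-16; returns an empty string on error.
    auto widen = [](const std::string& s) -> std::wstring {
        if (s.empty()) return std::wstring();
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      s.data(), static_cast<int>(s.size()),
                                      nullptr, 0);
        if (len <= 0) return std::wstring();
        std::wstring w(static_cast<std::size_t>(len), L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            s.data(), static_cast<int>(s.size()),
                            &w[0], len);
        return w;
    };
    const std::wstring wpath = widen(utf8_path);
    const std::wstring wmode = widen(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Assumption: the name is passed to the kernel byte for byte.
    return std::fopen(utf8_path.c_str(), mode);
#endif
}
```

With this approach the result on Windows no longer depends on the system code page, at the cost of converting any legacy cp932 names to UTF-8 once when they enter the program.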

Mihai Nita
As I've updated the question, the weird thing here is that fopen works with the code page but ofstream::open does not. Also, are _wfopen and wofstream actually portable?
Sol
"Functions like fopen or ofstream::open take the file name as char *, but that is interpreted as being in the system code page" -- Sorry, I don't believe it. fopen and ofstream::open are functions in C and C++ libraries, so they should default to using the C locale. If an app wants CRT functions to use a Windows locale it has to call the CRT's locale function.
Windows programmer
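
For reference, the locale calls being discussed look roughly like this. It is only a sketch of the standard C mechanism: a program starts in the "C" locale and stays there until it calls setlocale. The helper names current_c_locale and adopt_user_locale are made up for this example. Whether the locale actually influences how fopen or ofstream::open interpret a file name is exactly what is disputed in this thread, and the asker's test program failed under both settings.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: name of the locale the C runtime is using right now.
// Passing nullptr only queries; it does not change anything.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Hypothetical helper: switch the C runtime to the user's default locale.
// Returns false if the runtime rejects the environment's locale settings.
bool adopt_user_locale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}
```

Logging current_c_locale() just before each open call is a cheap way to confirm which locale the working and failing code paths really run under.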