Hi all,

C# question here..

I have a UTF-8 string that is being interpreted by a non-Unicode program in C++. The text is displayed improperly but, as far as I can tell, is intact, and it is then used as an output filename..

Anyway, in a C# project, I am trying to open this file with a System.Windows.Forms.OpenFileDialog object. The filenames I am getting from this object's .FileNames[] are in Unicode (UCS-2). The string, however, has been misinterpreted. For example, if the original bytes were 0xe3 0x81 0x82, FileNames[].ToCharArray() reveals that they are now 0x00e3 0x0081 0x201a. It might seem like the OpenFileDialog object only padded each byte, but it did not: the third character it produced is different, and I cannot figure out what happened to that byte..

My question is: Is there any way to treat the filenames highlighted in the OpenFileDialog box as UTF-8?

I don't think it's relevant, but if you need to know, the string is in Japanese..

Thanks,

kreb

UPDATE

First of all, thanks to everyone who's offered their suggestions here, they're very much appreciated.

Now, to answer the suggestions to modify the C++ application to handle the strings properly: it doesn't seem to be feasible. It isn't just one application that is doing this to the strings. There are actually a great number of these applications in my company that I have to work with, and it would take a huge amount of manpower and time that simply isn't available. However, sean e's idea would probably be the best choice if I were to take this route..

@Remy Lebeau: I think you hit the nail right on the head, I will try your proposed solution and report back.. :) I guess the caveat with your solution is that the Default encoding has to be the same in the C# application's environment as in the C++ application's environment that created the file, which certainly makes sense, as it would have to use the same code page..

@Jeff Johnson: I'm not pasting the filenames from the C++ app to the C# app.. I am calling OpenFileDialog.ShowDialog() and getting the OpenFileDialog.FileNames on DialogResult.OK.. I did try to use Encoding.UTF8.GetBytes(), but like Remy Lebeau pointed out, it won't work because the original UTF8 bytes are lost..

@everyone else: Thanks for the ideas.. :)

kreb

UPDATE

@Remy Lebeau: Your idea worked perfectly! As long as the environment of the C++ app is the same as the environment of the C# app (same locale for non-Unicode programs), I am able to retrieve the correct text.. :)

Now I have more problems.. Haha.. Is there any way to determine the encoding of a string? The code now works for UTF8 strings that were mistakenly interpreted as ANSI strings, but screws up UCS-2 strings. I need to be able to determine the encoding and process each accordingly. GetEncoding() doesn't seem to be useful.. =/ And neither is StreamReader's CurrentEncoding property (always says UTF-8)..

P.S. Should I open this new question in a new post?

+1  A: 

I think your problem is at the beginning:

I have a UTF-8 string that is being interpreted by a non-Unicode program in C++.. This text which is displayed improperly, but as far as I can tell, is intact, is then applied as an output filename..

If you load a UTF-8 string with a non-unicode program and then serialize it, it will contain non-unicode chars.

Is there any way that your C++ program can handle Unicode?

bruno conde
+1  A: 

Can you use members of the System.Text namespace (e.g., the UTF8Encoding class) to convert the .NET framework's internal string representation to/from a byte array containing the text in the encoding of your choice?

Jason Musgrove
+1  A: 

If you are sure that the C++ output is fine, then in your C# app you should convert it from UTF-8 to UTF-16 using the .NET encoding class and just work with it in the Windows native format.

If you can modify the C++ app, that might be better - give the C# app input that doesn't need to be re-encoded. In the C++ app, the UTF-8 to Unicode translation can be handled via MultiByteToWideChar, using CP_UTF8 for the CodePage parameter, but it only works when none of the flags are set for dwFlags (specify 0 for dwFlags). The whole app doesn't need to be Unicode. Even if it is not compiled as Unicode, you can make selective use of Unicode APIs.

sean e
+1  A: 

In answer to your question "is there a way to treat the filenames as utf-8?" Try this code:

    List<byte[]> utf8FileNames = new List<byte[]>();
    foreach (string fileName in openFileDialog1.FileNames)
    {
        utf8FileNames.Add(Encoding.UTF8.GetBytes(fileName));
    }
    // Each byte array in utf8FileNames is a sequence of utf-8 bytes matching each file name chosen

What do you do with the file names once you have got them from the open file dialog? Can you post that code?

John JJ Curtis
That will not work. The original UTF-8 bytes are lost when the dialog is filling in its FileNames property. Since the resulting strings are not being properly decoded to begin with, passing them to UTF8.GetBytes() will not produce the same bytes as the original UTF-8 filenames.
Remy Lebeau - TeamB
Are you pasting in file names from the C++ application into the C# application?
John JJ Curtis
+2  A: 

0x201a is the Unicode "low single comma quotation mark" character. 0x82 is the Windows codepage 1252 encoding of that character (codepage 1252 is the Windows variant of Latin-1; in strict ISO-8859-1, 0x82 is a control character). That means the bytes of the filename are being interpreted as plain Ansi instead of as UTF-8, and thus being decoded from Ansi to Unicode accordingly. That is not surprising, as the filesystem has no concept of UTF-8, and Windows assumes non-Unicode filenames are using the OS's default Ansi encoding.
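
To see the mis-decoding for yourself, here is a minimal sketch that decodes the three bytes from the question once as UTF-8 and once with the system default encoding. The class and method names (MisdecodeDemo, DecodeToHex) are made up for illustration. It assumes Encoding.Default is the ANSI code page, which holds on .NET Framework; on .NET Core and later, Encoding.Default is UTF-8, so both lines would print the same thing there.

```csharp
using System;
using System.Linq;
using System.Text;

public static class MisdecodeDemo
{
    // Decodes raw bytes with the given encoding and lists the resulting
    // UTF-16 code units as four-digit hex values.
    public static string DecodeToHex(byte[] bytes, Encoding encoding)
    {
        string decoded = encoding.GetString(bytes);
        return string.Join(" ", decoded.Select(c => ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // The UTF-8 bytes from the question (one Japanese character, U+3042).
        byte[] utf8Bytes = { 0xE3, 0x81, 0x82 };

        // Decoded as intended: prints 3042
        Console.WriteLine(DecodeToHex(utf8Bytes, Encoding.UTF8));

        // Decoded the way a non-Unicode filename is: one code unit per byte,
        // with the values depending on the machine's ANSI code page.
        Console.WriteLine(DecodeToHex(utf8Bytes, Encoding.Default));
    }
}
```

On a machine whose ANSI code page is 1252, the second line should match the 0x00e3 0x0081 0x201a sequence reported in the question.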

To do what you are looking for, you need access to the original UTF-8 encoded bytes so you can decode them properly. One thing you can try is to pass the FileName to the GetBytes() method of System.Text.Encoding.Default (in theory, that is using the same encoding that was used to decode the filename, so it should be able to produce the same bytes as the original), and then pass the resulting bytes to the GetString() method of System.Text.Encoding.UTF8.
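
A rough sketch of that round trip follows. FileNameRepair, RecoverUtf8 and TryRecoverUtf8 are hypothetical names, not framework APIs. The encoding is passed in explicitly so the assumption is visible: it has to be the same ANSI code page that mangled the name in the first place (Encoding.Default on .NET Framework; on .NET Core and later Encoding.Default is UTF-8 and the trick does nothing). The Try variant is my own addition rather than part of the suggestion above: it uses exception fallbacks so that names which cannot be re-encoded, or whose bytes are not valid UTF-8, are reported instead of being silently damaged.

```csharp
using System;
using System.Text;

public static class FileNameRepair
{
    // Re-encodes the string with the ANSI encoding that (presumably) produced it,
    // then decodes the recovered bytes as UTF-8.
    public static string RecoverUtf8(string mangled, Encoding ansi)
    {
        byte[] originalBytes = ansi.GetBytes(mangled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    // Strict variant: returns false (and hands back the input unchanged) when the
    // name contains characters the ANSI code page cannot represent, or when the
    // recovered bytes are not valid UTF-8.
    public static bool TryRecoverUtf8(string name, Encoding ansi, out string recovered)
    {
        Encoding strictAnsi = Encoding.GetEncoding(
            ansi.CodePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes

        try
        {
            recovered = strictUtf8.GetString(strictAnsi.GetBytes(name));
            return true;
        }
        catch (EncoderFallbackException)
        {
            recovered = name;
            return false;
        }
        catch (DecoderFallbackException)
        {
            recovered = name;
            return false;
        }
    }

    public static void Main()
    {
        // The mis-decoded characters reported in the question.
        string mangled = "\u00e3\u0081\u201a";

        string repaired;
        bool ok = TryRecoverUtf8(mangled, Encoding.Default, out repaired);
        Console.WriteLine(ok ? "recovered" : "left unchanged");
    }
}
```

In the dialog scenario this would presumably be applied to each entry of FileNames (or just its file-name part). Treat the strict variant as a heuristic only: a name that happens to survive both steps is not guaranteed to have been UTF-8 originally.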

Remy Lebeau - TeamB