I have a C# COM server which is consumed by a C++ client.

One of the C# methods returns a string.

In C++ the returned string is encoded as Unicode (UTF-16), at least according to the memory view.

  1. Is this always the case with COM strings?
  2. Is there a way to use UTF-8 instead?
  3. I saw some code where strings were passed between C++ and C# as byte arrays. Is there any benefit to this?
A: 
  1. No.
  2. Yes. Put the attribute [return: MarshalAs(UnmanagedType.LPStr)] before the method definition in C# if you'd like to return the string as an ANSI string instead of Unicode. Note that LPStr uses the system's ANSI code page, which is not necessarily UTF-8.
  3. Yes. The author may have done that to keep fine-grained control over the encoding of the string's contents by sidestepping the default marshalling behavior.
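If the goal is just to have UTF-8 on the C++ side, another option is to leave the COM interface alone and convert after the call. Below is a minimal portable sketch, assuming the returned string's UTF-16 code units have already been copied into a std::u16string. The helper name utf16_to_utf8 is hypothetical and not part of COM or any library; on Windows, WideCharToMultiByte with CP_UTF8 does the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts UTF-16 code units to a UTF-8 byte string.
// Unpaired surrogates are replaced with U+FFFD (the replacement character).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate, if any.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // Low surrogate without a preceding high surrogate.
        }
        // Emit 1 to 4 bytes depending on the code point's magnitude.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

This keeps the default marshalling on the COM boundary and moves the encoding decision entirely into the client.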
sblom
+1  A: 
  1. Yes. The standard COM string type is BSTR. It is a Unicode string encoded in UTF-16, just like Windows' native string type.
  2. No. A COM method isn't going to understand a UTF-8 string; it will read the bytes as UTF-16 and turn them into something that looks like Chinese. UTF-8 is a good encoding for a text file, not for programs manipulating strings in memory. It requires anywhere between 1 and 4 bytes to encode a Unicode code point, which makes it a poor fit for basic string manipulations like getting the length or indexing a character.
  3. C and C++ programs tend to use 8-bit encodings, compatible with the "char" type. That's an old practice, dating from an era before Unicode was around. There's nothing attractive about it: there are many 8-bit encodings, and data entered as text can only be interpreted correctly if it is read by a program that uses the same one. In other words, when the computers are less than 1000 miles apart. Less in Europe.
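The "Chinese" effect from point 2 is easy to reproduce. The sketch below is only an illustration, with a hypothetical function name: it takes 8-bit text and pairs up its bytes as little-endian UTF-16 code units, which is roughly what happens when a consumer expecting a BSTR is handed UTF-8 or ANSI bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration: treat each pair of bytes as one little-endian
// UTF-16 code unit, the way a UTF-16 consumer would misread 8-bit text.
// A trailing odd byte is dropped.
std::u16string reinterpret_as_utf16le(const std::string& bytes) {
    std::u16string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[i]);
        const auto hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return out;
}
```

For the input "hello!" this yields the code units 0x6568, 0x6C6C and 0x216F. The first two fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which is why misread ASCII text tends to show up as Chinese-looking characters.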
Hans Passant
Sounds to me like you've got it backward. He's calling into a C# COM component from C++.
sblom
@sblom: yes, your answer mystified me. COM looks the same from both ends. Automation has always been Unicode-enabled.
Hans Passant