ansaurus

Question

Answer 1

A:

Which Windows API call wants you to pass a UTF-8 string? It takes either an ANSI string or a WideString (the A or W functions). WideStrings hold UTF-16 with two-byte code units; UTF-8 strings use one-byte code units, with one to four bytes per character. UTF-8 in a WideString just doesn't make sense. If there really is a Windows function that wants a pointer to a UTF-8 string, you probably have to cast it to a PAnsiChar.

The_Fox 2010-04-23 11:12:19

It's some (broken) legacy code using INI files. So the section, for example, is being passed as a UTF8 string. I know this is wrong, but I need to keep it like that to import old settings files. If I pass Unicode for the section name then it won't match. I cannot use the ANSI versions because the filename is Unicode.

Mick 2010-04-23 11:16:28

Answer 2

A:

Hm, why are you doing that? Why encode a WideString to UTF-8 just to store it back in a WideString? You are obviously using the Unicode version of the Windows API, so there is no need for a UTF-8 encoded string. Or am I missing something?

Windows API functions are either Unicode (UTF-16, two-byte code units) or ANSI (one-byte code units). UTF-8 would be the wrong choice here: it uses one byte per character for ASCII, but two or more bytes for everything above the ASCII range.

Otherwise, the equivalent of your old code in Unicode Delphi would be:

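var
  UnicodeStr: string;  // string is UnicodeString in Delphi 2009 and later
  UTF8Str: string;
begin
  UnicodeStr:='some unicode text';
  // Note: assigning the UTF8Encode result to a string decodes it back to UTF-16 (see EDIT2)
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;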

WideString and string (UnicodeString) are similar, but the new UnicodeString is faster because it is reference counted and WideString is not.

Your code was not correct because a UTF-8 string uses a variable number of bytes per character. "A" is stored as one byte, just the ASCII byte code. "ü", on the other hand, is stored as two bytes. And because you then cast to PWideChar, the function always expects two bytes per code unit.
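To make the byte counts concrete, here is a minimal sketch for a Unicode Delphi (2009 or later) console program. The program and variable names are made up for illustration, and "ü" is written as the code point U+00FC so the result does not depend on the encoding of the source file.

```pascal
program Utf8ByteCounts;

{$APPTYPE CONSOLE}

var
  Ascii, Umlaut: UnicodeString;
begin
  Ascii := 'A';
  Umlaut := #$00FC; // 'ü' given as a code point

  // Length of the UTF8Encode result counts bytes;
  // Length of a UnicodeString counts two-byte UTF-16 code units.
  Writeln('A:      ', Length(UTF8Encode(Ascii)), ' byte(s) in UTF-8, ',
    Length(Ascii) * SizeOf(WideChar), ' byte(s) in UTF-16');
  Writeln('U+00FC: ', Length(UTF8Encode(Umlaut)), ' byte(s) in UTF-8, ',
    Length(Umlaut) * SizeOf(WideChar), ' byte(s) in UTF-16');
end.
```

"A" should report one UTF-8 byte and "ü" two, while both take two bytes in UTF-16. That mismatch is why a PWideChar cast over raw UTF-8 bytes misreads the data.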

There is another difference. In older (ANSI) Delphi versions, Utf8String was just an AnsiString; in Unicode versions of Delphi, Utf8String is a string with the UTF-8 code page attached to it. So it behaves differently: the compiler converts it automatically when you assign it to a string of another type.
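A small sketch of that code page behaviour, again assuming Delphi 2009 or later; the program and variable names are only illustrative. It uses explicit casts so the conversions are visible in the source rather than implicit.

```pascal
program Utf8CodePage;

{$APPTYPE CONSOLE}

var
  Source, RoundTrip: UnicodeString;
  Encoded: UTF8String;
begin
  Source := #$00FC; // 'ü' given as a code point

  // The cast transcodes UTF-16 to UTF-8 because UTF8String carries code page 65001.
  Encoded := UTF8String(Source);
  Writeln('Code page: ', StringCodePage(Encoded));
  Writeln('Bytes:     ', Length(Encoded));

  // Casting back decodes the UTF-8 bytes to UTF-16 again.
  RoundTrip := UnicodeString(Encoded);
  Writeln('Round trip matches: ', RoundTrip = Source);
end.
```

The round trip preserves the text, which is exactly the "extra work" that gets in the way when the goal is to keep the raw UTF-8 bytes unconverted.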

EDIT:

Forgot to add. The old code would still work correctly:

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

This would act the same as it did in Delphi 2007, so maybe you have a problem elsewhere.

EDIT2:

Mick, you are correct. The compiler does some extra work behind the scenes (it converts the UTF-8 data back to Unicode on assignment). To avoid this, you can do something like this:

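var
  UTF8Str: AnsiString;
  UnicodeStr: WideString;
  TempString: RawByteString;
  ResultString: WideString;
begin
  UnicodeStr := 'some unicode text';
  TempString := UTF8Encode(UnicodeStr);
  // Copy the raw UTF-8 bytes into a plain AnsiString (assumes a non-empty string)
  SetLength(UTF8Str, Length(TempString));
  Move(TempString[1], UTF8Str[1], Length(UTF8Str));
  // UTF8Str carries the ANSI code page, so this widens the bytes as ANSI, not as UTF-8
  ResultString := UTF8Str;
end;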

I checked and it works just the same. Because I move the bytes directly in memory, no code page conversion is done in the background. I am sure it can be done with greater elegance, but the point is that I see this as the way to achieve what you want.

Runner 2010-04-23 11:17:45

Answer 3

+4 A:

Your original Delphi 2007 code was converting the UTF-8 string to a widestring using the ANSI codepage. To do the same thing in Delphi 2010 you should use SetCodePage with the Convert parameter false.

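var
  UnicodeStr: UnicodeString;
  UTF8Str: RawByteString;
begin
  UTF8Str := UTF8Encode('some unicode text');
  // Retag the bytes with the default ANSI code page (0) without converting them
  SetCodePage(UTF8Str, 0, False);
  // The bytes are now widened as ANSI, as in Delphi 2007
  UnicodeStr := UTF8Str;
  Windows.SomeFunction(PWideChar(UnicodeStr), ...)
end;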
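To see what ends up in the resulting string, here is a hedged sketch of the same technique on a single character. It assumes Delphi 2009 or later and, unlike the code above, passes code page 1252 explicitly instead of 0 so the result does not depend on the system's ANSI code page; the program and variable names are made up for illustration.

```pascal
program ReinterpretUtf8;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Source, Mangled: UnicodeString;
  Bytes: RawByteString;
begin
  Source := #$00FC;                // 'ü' given as a code point
  Bytes := UTF8Encode(Source);     // two bytes: C3 BC, tagged as UTF-8
  SetCodePage(Bytes, 1252, False); // retag only, the bytes stay as they are
  Mangled := UnicodeString(Bytes); // decoded as Windows-1252, one WideChar per byte

  Writeln('Length: ', Length(Mangled));
  Writeln('Code units: ', IntToHex(Ord(Mangled[1]), 4), ' ',
    IntToHex(Ord(Mangled[2]), 4));
end.
```

Each UTF-8 byte becomes its own WideChar (here U+00C3 and U+00BC), which reproduces the Delphi 2007 behaviour on a system whose ANSI code page is 1252, so the section names in old settings files should still match.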

Craig Peterson 2010-04-23 13:13:54

Nice. Didn't know about that :)

Runner 2010-04-23 13:23:50

Thanks, that works perfectly!

Mick 2010-04-23 23:56:07


Storing UTF8 string in a UnicodeString
