Suppose that for some perverse reason you want to display the raw byte contents of a UTF8String.

var
  utf8Str : UTF8String;
begin    
  utf8Str := '€ąćęłńóśźż';
end;

(1) This doesn't do it; it displays the readable form:

memo1.Lines.Add( RawByteString( utf8Str ));
// output: '€ąćęłńóśźż'

(2) This, however, does "work" - note the concatenation:

memo1.Lines.Add( 'x' + RawByteString( utf8Str ));
// output: 'x€ąćęłńóśźż'

I understand (1), though the compiler's forced coercion to UnicodeString seems to prevent ever displaying a RawByteString variable as-is. However, why does the behavior change in (2)?

(3) Stranger still - let's reverse the concatenation:

memo1.Lines.Add( RawByteString( utf8Str ) + 'x' ); 
// output: '€ąćęłńóśźżx'

I've been reading up on the newfangled string types in Delphi and thought I understood how they work, but this is a puzzle.

+4  A: 

RawByteString only exists to minimize the number of overloads required for functions that work with various flavours of AnsiStrings with different codepage affinities.

In general, don't declare variables of type RawByteString. Don't typecast values to that type. Don't do concatenations on variables of that type. About the only things you can do are:

  • Declaring a parameter of this type (the original intent)
  • Indexing on such a parameter
  • Searching in such a parameter
  • Intelligent operations that check the actual code page of the string, using the StringCodePage function.

For example, you'll note that the StringCodePage function itself uses RawByteString as its argument type. This way, it will work with any AnsiString, rather than doing a codepage translation before passing it as an argument.
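
As a rough illustration of the "parameter only" usage described above, here is a minimal console sketch. DescribeCodePage is a hypothetical helper made up for this example; only StringCodePage, Length and Format are real RTL routines. It assumes Delphi 2009 or later and a source file saved in a Unicode encoding so the literal survives.

```delphi
program RawParamDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: accepts any AnsiString flavour without a codepage
// conversion, because the parameter type is RawByteString.
function DescribeCodePage(const S: RawByteString): string;
begin
  // StringCodePage reports the code page stored in the string's header;
  // Length counts bytes here, not characters.
  Result := Format('code page %d, %d bytes', [StringCodePage(S), Length(S)]);
end;

var
  U: UTF8String;
  A: AnsiString;
begin
  U := '€ąćęłńóśźż';
  A := 'abc';
  Writeln(DescribeCodePage(U)); // passed as-is, still tagged as UTF-8
  Writeln(DescribeCodePage(A)); // passed as-is, tagged with the default ANSI code page
end.
```

Note that RawByteString appears only in the parameter list; the variables themselves keep concrete types with a known code page.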

For your case, things like concatenations are largely undefined. The behaviour changed between RTM and Update 2, but when the RTL string concatenation functions receive multiple strings with different code pages, there's no easy way for them to figure out which code page should be used for the final string. That's just one reason why you shouldn't concatenate them like you do here.

Barry Kelly
Thanks, Barry, that makes good sense. The concatenation was just a "what if I press this button" experiment, nothing of practical value. Strange though to see Delphi introduce an undefined behavior like this - there were never many of those before.
moodforaday
+1  A: 

You cannot add a string to a TMemo "as is". You always need to do some kind of conversion to Unicode, because that's all TMemo knows about in Delphi 2009.

If you want to pretend that your UTF8String uses code page 1252, do this:

var
  utf8Str : UTF8String;
  Raw: RawByteString;
begin
  utf8Str := '€ąćęłńóśźż';
  Raw := utf8Str;
  // Convert = False: relabel the existing bytes as code page 1252 without transcoding them
  SetCodePage(Raw, 1252, False);
  Memo.Lines.Add(Raw);
end;

For more details, see my article "Using RawByteString Effectively".

Jan Goyvaerts
+1  A: 

Jan - UTF-8 is an 8-bit encoding. It requires codeunits $00-$FF to be processed as-is. However, codepage 1252 maps codeunits $80-$9F to different values when converted to UTF-16. You should use codepage 28591 instead.
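
A minimal sketch of that suggestion, assuming code page 28591 (ISO-8859-1) is available on the system. BytesAsLatin1 is a hypothetical name chosen for this example; SetCodePage is the real RTL routine.

```delphi
// Hypothetical helper: returns a UnicodeString in which every byte $00-$FF
// of the input appears as the code point with the same numeric value.
function BytesAsLatin1(const S: RawByteString): UnicodeString;
var
  Raw: RawByteString;
begin
  Raw := S;                        // RawByteString to RawByteString: no conversion
  SetCodePage(Raw, 28591, False);  // Convert = False: only relabel the bytes
  Result := UnicodeString(Raw);    // the RTL now decodes the bytes as ISO-8859-1
end;

// Possible usage with the variables from the question:
//   memo1.Lines.Add(BytesAsLatin1(utf8Str));
```

Bytes in the $80-$9F range become C1 control characters under this mapping, so they may not be visible in a memo even though they are preserved in the string.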

Remy Lebeau - TeamB
Should be a comment, not an answer!
Andreas Rejbrand