Hello! I just got Delphi 2009 and have previously read some articles about modifications that might be necessary because of the switch to Unicode strings. Mostly, it is mentioned that sizeof(char) is not guaranteed to be 1 anymore. But why would this be interesting regarding string manipulation?

For example, if I use an AnsiString:='Test' and do the same with a String (which is unicode now), then I get Length() = 4 which is correct for both cases. Without having tested it, I'm sure all other string manipulation functions behave the same way and decide internally if the argument is a unicode string or anything else.

Why would the actual size of a char be of interest for me if I do string manipulations? (Of course if I use strings as strings and not to store any other data)

Thanks for any help! Holger

A: 

The actual size of a character shouldn't matter, unless you are doing the manipulation at the byte level.

1800 INFORMATION
A: 

(Of course if I use strings as strings and not to store any other data)

That's the key point, YOU don't use strings for other purposes, but some people do. They use strings just like arrays, so they (and that's including me) would need to check all such uses to make sure nothing is broken...

You're right. I got confused because I read that specifically with string manipulations the char size would be important. When I use strings to store anything else but strings, of course it's up to me to handle it correctly.
Holgerwa
A: 

I haven't tried Delphi 2009, but I am using FPC (Free Pascal), which is also slowly switching to Unicode. I'm 95% sure that everything below also holds for Delphi 2009.

In FPC (once it supports Unicode), functions like Length will take the code page into consideration, so they return the length of the string as a 'human' would see it. If a string holds, for example, two Chinese characters that each take two bytes of memory in Unicode, Length will return 2, since there are two characters in the string, but the string will occupy 4 bytes of memory (plus the memory for the reference count and the terminating #0, but that aside).

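What you cannot do anymore is this:

var
  p : pchar;   // assumed to be a one-byte character pointer, as in pre-Unicode code
  i : integer;
begin
  p := @s[1];  // s is the string from the example above
  for i := 0 to length(s)-1 do
    begin
    write(p^);
    inc(p);
    end;
end;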

In the two-Chinese-character example, this code will write the wrong two characters: namely the two bytes that make up the first 'real' character.

In short: Length() no longer returns the number of bytes allocated for the string, but the number of characters. (Before the switch to Unicode, those two values were equal to each other.)
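To make the difference between element count and byte count concrete, here is a minimal sketch, assuming Delphi 2009, where `string` is a UnicodeString and `SizeOf(Char)` is 2. The program name and variable names are made up for illustration. Note that in Delphi 2009 `Length` counts `Char` elements, so both strings report 4, while the payload sizes should come out as 8 and 4 bytes.

```pascal
program LengthVsBytes;
{$APPTYPE CONSOLE}

var
  S: string;       // UnicodeString in Delphi 2009 (assumption: not an older compiler)
  A: AnsiString;   // one byte per element

begin
  S := 'Test';
  A := 'Test';

  // Length counts elements (Char or AnsiChar), not bytes.
  Writeln('Length(S)  = ', Length(S));
  Writeln('Length(A)  = ', Length(A));

  // The byte size of the character data is the element count times the element size.
  Writeln('Bytes in S = ', Length(S) * SizeOf(Char));
  Writeln('Bytes in A = ', Length(A) * SizeOf(AnsiChar));
end.
```

Any code that passes `Length(S)` where a byte count is expected is the kind of spot that needs checking.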

Loesje
+3  A: 

People often implicitly convert from characters to bytes in old Delphi code without really thinking about it. For example, when writing to a stream. When you write a string to a stream, you have to specify the number of bytes you write, but people often pass the character count instead. See this post from Chris Bensen for another example.
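The stream case can be sketched as follows. This is only an illustration, assuming Delphi 2009 and a `TMemoryStream`; `WriteStringPayload` is a hypothetical helper, not something from the linked post. The point is that `TStream.WriteBuffer` takes a byte count, so the character count has to be multiplied by `SizeOf(Char)`.

```pascal
program StreamBytes;
{$APPTYPE CONSOLE}

uses
  Classes;

// Hypothetical helper: writes the raw character data of S to Stream.
procedure WriteStringPayload(Stream: TStream; const S: string);
begin
  if S <> '' then
    // Old code often passed Length(S) alone, which is only right when SizeOf(Char) = 1.
    Stream.WriteBuffer(S[1], Length(S) * SizeOf(Char));
end;

var
  M: TMemoryStream;

begin
  M := TMemoryStream.Create;
  try
    WriteStringPayload(M, 'Test');
    // With a two-byte Char this should report 8, not 4.
    Writeln('Bytes written: ', M.Size);
  finally
    M.Free;
  end;
end.
```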

Another way people often make this implicit conversion in older code is by using a "string" to store binary data. In this case, they actually want bytes, but the data type expects characters. D2009 has a better type for this.
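The answer does not name the type here. One candidate is `TBytes` from `SysUtils` (a dynamic array of `Byte`); treating it as the type meant is an assumption. A minimal sketch of holding binary data in it instead of in a string:

```pascal
program BinaryBuffer;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Data: TBytes;   // dynamic array of Byte, so one element is always one byte
  I: Integer;

begin
  SetLength(Data, 4);
  Data[0] := $DE;
  Data[1] := $AD;
  Data[2] := $BE;
  Data[3] := $EF;

  // Length is a byte count here, independent of SizeOf(Char).
  Writeln('Bytes stored: ', Length(Data));

  for I := 0 to High(Data) do
    Write(IntToHex(Data[I], 2), ' ');
  Writeln;
end.
```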

Craig Stuntz
A: 

Let's not forget that there are times when this conversion is not really desired, for instance when storing a GUID in a record. A GUID can only contain hexadecimal characters plus the - and brackets, so making each character take up twice the space can have quite an impact on existing code. Sure, the simple solution is to change such fields to AnsiString and deal with the compiler warnings if you do any string manipulation on them.

skamradt
+3  A: 

With Unicode, SizeOf(SomeChar) <> Length(SomeChar). Essentially, the length of a string is less than the sum of the sizes of its chars. As long as you don't assume SizeOf(Char) = 1 or SizeOf(SomeString[x]) = 1 (both are FALSE now), and don't try to interchange bytes with chars, you shouldn't have any trouble. Anywhere you are doing something creative, such as stuffing Bytes into Chars or Strings, you will need to use AnsiString.

(SizeOf(SomeString) is still 4 no matter the length since it is essentially a pointer with some compiler magic.)
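A quick way to check these sizes on a given compiler is a small console program like the one below. It is a sketch assuming Delphi 2009 on a 32-bit target, where the first two values should be 2 and the last two should be 4. The program and variable names are illustrative.

```pascal
program SizeOfDemo;
{$APPTYPE CONSOLE}

var
  S: string;

begin
  S := 'Test';

  // Size of one character element.
  Writeln('SizeOf(Char)    = ', SizeOf(Char));
  Writeln('SizeOf(S[1])    = ', SizeOf(S[1]));

  // A string variable is a reference, so its size matches a pointer,
  // regardless of how long the string is.
  Writeln('SizeOf(S)       = ', SizeOf(S));
  Writeln('SizeOf(Pointer) = ', SizeOf(Pointer));
end.
```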

Jim McKeeth