I want a string type that is Unicode and that stores the string directly at the address of the variable, as is the case with the (ANSI-only) ShortString type.

I mean, if I declare a S: ShortString and let S := 'My String', then, at @S, I will find the length of the string (as one byte, so the string cannot contain more than 255 characters) followed by the ANSI-encoded string itself.

What I would like is a Unicode variant of this. That is, I want a string type such that, at @S, I will find an unsigned 32-bit integer (or a single byte would be enough, actually) containing the length of the string in bytes (or in characters, which is half the number of bytes) followed by the Unicode representation of the string. I have tried WideString, UnicodeString, and RawByteString, but they all appear to store only an address at @S, and the actual string somewhere else (I guess this has to do with reference counting and such). Update: The most important reason for this is probably that it would be very problematic if sizeof(string) were variable.

I suspect that there is no built-in type to use, and that I have to come up with my own way of storing text the way I want (which actually is fun). Am I right?

Update: I will, among other things, need to use these strings in packed records. I also need to read and write these strings manually to and from files and the heap. I could live with fixed-size strings, such as <= 128 characters, and I could redesign the problem so that it works with null-terminated strings. But PChar will not work, for sizeof(PChar) = 4 - it's merely an address.

The approach I eventually settled for was to use a static array of bytes. I will post my implementation as a solution later today.

+1  A: 

PChar should work like this, right? AFAIK, it's an array of chars stored right where you put it. Zero terminated, not sure how that works with Unicode Chars.

Chris Thornton
Yes, this is not a bad idea. In Delphi 2009+, PChar = PWideChar, so it's Unicode. Now it is zero-terminated, and does not start with the length of the string, but perhaps I could live with that. Thanks!
Andreas Rejbrand
But then again, there is one subtle problem. The string will not be stored at @S, but rather at (@S)^, right?
Andreas Rejbrand
Maybe stored at S[1]?
Chris Thornton
Yes, but still not at @S!
Andreas Rejbrand
But maybe then use @S[2] instead of @S, and S[1] as the length?
Astronavigator
@Astronavigator: No, as I said above I need to use the string in a packed record, but if I use a PChar, then only a 32-bit unsigned integer (the address) will be included in the record's memory. Therefore I *really* need the string to be stored at @S, as is the case for (e.g.) ShortStrings.
Andreas Rejbrand
+1  A: 

You actually have this, in a way, with the new Unicode strings.
s, treated as a pointer, points to s[1], and the 4 bytes just before that address contain the length.
But why not simply use Length(s)?

And for direct reading of the length from memory:

procedure TForm9.Button1Click(Sender: TObject);
var
  s: string;
begin
  s := 'hlkk ljhk jhto';
  {$POINTERMATH ON}
  Assert(Length(s) = (PInteger(s)-1)^); 
  //if you don't want POINTERMATH, replace by PInteger(Cardinal(s)-SizeOf(Integer))^
  showmessage(IntToStr(length(s)));
end;
François
I need to write/read raw bytes to/from the heap and to/from files.
Andreas Rejbrand
Among other things, I need to include the string in a packed record, and then I cannot use `string`, for then only a 32-bit unsigned integer will be included in the memory of the record. A Unicode ShortString would be perfect... I think I have managed to write working code, and I will post it as an answer tomorrow.
Andreas Rejbrand
Then you greatly limit the size of the strings, to 127 in most cases. Direct writing is surely dangerous, but reading is fine. See my example. But I won't dispute your specific needs, as you know them better than I do... :-)
François
If you need to read/write raw bytes to/from the heap, then you should be using the TBytes type, not a string.
Nick Hodges
@Nick: This is the approach I eventually settled for. I will post my solution as an answer later today.
Andreas Rejbrand
+1 for the reference to the $POINTERMATH directive. I use pointer arithmetic a lot, and have always used casts of the form `pointer(cardinal(MyPtr) + 4)`, not knowing about this directive. This will save me a lot of work and make my code much prettier! I would do +2 if I could!
Andreas Rejbrand
+1  A: 

There's no Unicode version of ShortString. If you want to store unicode data inline inside an object instead of as a reference type, you can allocate a buffer:

var
  buffer: array[0..255] of WideChar;

This has two disadvantages. 1, the size is fixed, and 2, the compiler doesn't recognize it as a string type.

The main problem here is #1: The fixed size. If you're going to declare an array inside of a larger object or record, the compiler needs to know how large it is in order to calculate the size of the object or record itself. For ShortString this wasn't a big problem, since they could only go up to 256 bytes (1/4 of a K) total, which isn't all that much. But if you want to use long strings that are addressed by a 32-bit integer, that makes the max size 4 GB. You can't put that inside of an object!

This, not the reference counting, is why long strings are implemented as reference types, whose inline size is always a constant sizeof(pointer). Then the compiler can put the string data inside a dynamic array and resize it to fit the current needs.

Why do you need to put something like this into a packed array? If I were to guess, I'd say this probably has something to do with serialization. If so, you're better off using a TStream and a normal Unicode string, and writing an integer (size) to the stream, and then the contents of the string. That turns out to be a lot more flexible than trying to stuff everything into a packed array.
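
As a rough illustration of the stream approach described above, here is a minimal sketch, assuming Delphi 2009 or later (so that string is UnicodeString). The names WriteStr and ReadStr are hypothetical helpers, not RTL routines; the program round-trips one string through a TMemoryStream.

```delphi
program StreamStringDemo;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  SysUtils, Classes;

// Hypothetical helper: writes the character count as a 32-bit integer,
// followed by the UTF-16 code units of S.
procedure WriteStr(Stream: TStream; const S: string);
var
  Len: Integer;
begin
  Len := Length(S);
  Stream.WriteBuffer(Len, SizeOf(Len));
  if Len > 0 then
    Stream.WriteBuffer(Pointer(S)^, Len * SizeOf(Char));
end;

// Hypothetical helper: reads back what WriteStr wrote.
function ReadStr(Stream: TStream): string;
var
  Len: Integer;
begin
  Stream.ReadBuffer(Len, SizeOf(Len));
  SetLength(Result, Len);
  if Len > 0 then
    Stream.ReadBuffer(Pointer(Result)^, Len * SizeOf(Char));
end;

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteStr(MS, 'My String');
    MS.Position := 0;
    Assert(ReadStr(MS) = 'My String');
  finally
    MS.Free;
  end;
end.
```

The on-disk layout here (length first, then the characters) is the one the question asks for; only the in-memory variable remains an ordinary reference-counted string.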

Mason Wheeler
Yes, I realized that tonight: It would be very problematic if sizeof(string) were variable. I am constructing my own scripting language and need to store a lot of identifiers etc. on the heap in custom data structures. In my case, no string will be longer than (say) 64 characters, so allocating a fixed size to each string would not be a problem.
Andreas Rejbrand
Because fixed-size is not a major problem for me, this is very similar to the approach I finally settled for. I used a static array of bytes, though, but perhaps an array of WideChar would have been better. +1
Andreas Rejbrand
+4  A: 

You're right. There is no exact analogue to ShortString that holds Unicode characters. There are lots of things that come close, including WideString, UnicodeString, and arrays of WideChar, but if you're not willing to revisit the way you intend to use the data type (making byte-for-byte copies in memory and in files while still using them in all the contexts where a string is allowed), then none of Delphi's built-in types will work for you.

WideString fails because you insist that the string's length must exist at the address of the string variable, but WideString is a reference type; the only thing at its address is another address. Its length happens to be at the address held by the variable, minus four. That's subject to change, though, because all operations on that type are supposed to go through the API.

UnicodeString fails for that same reason, as well as because it's a reference-counted type; making a byte-for-byte copy of one breaks the reference counting, so you'll get memory leaks, invalid-pointer-operation exceptions, or more subtle heap corruption.

An array of WideChar can be copied without problems, but it doesn't keep track of its effective length, and it also doesn't act like a string very often. You can assign string literals to it and it will act like you called StrLCopy, but you can't assign string variables to it.

You could define a record that has a field for the length and another field for a character array. That would resolve the length issue, but it would still have all the rest of the shortcomings of an undecorated array.
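
A minimal sketch of such a record, assuming Delphi 2009 or later. The type name TWideShortString, the constant MaxChars, and the methods SetValue and GetValue are all hypothetical names chosen for illustration; nothing like this exists in the RTL.

```delphi
program WideShortStringDemo;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  SysUtils;

const
  MaxChars = 64; // hypothetical capacity, in UTF-16 code units

type
  // Hypothetical type: the length (in characters) sits at the address of
  // the record, and the UTF-16 data follows inline, as with ShortString.
  TWideShortString = packed record
    Len: Byte;
    Chars: array[0..MaxChars - 1] of WideChar;
    procedure SetValue(const S: string);
    function GetValue: string;
  end;

procedure TWideShortString.SetValue(const S: string);
begin
  if Length(S) > MaxChars then
    raise Exception.Create('String too long.');
  Len := Length(S);
  FillChar(Chars, SizeOf(Chars), 0);
  if Len > 0 then
    Move(Pointer(S)^, Chars, Len * SizeOf(WideChar));
end;

function TWideShortString.GetValue: string;
begin
  // Copy exactly Len characters; no terminating #0 is required.
  SetString(Result, PWideChar(@Chars[0]), Len);
end;

var
  W: TWideShortString;
begin
  W.SetValue('My String');
  Assert(SizeOf(W) = 1 + MaxChars * SizeOf(WideChar)); // fixed inline size
  Assert(PByte(@W)^ = 9);                              // length is at @W
  Assert(W.GetValue = 'My String');
end.
```

Because the record has a fixed size and holds no references, it can be copied byte for byte, but every use as a string still has to go through the two conversion methods.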

If I were you, I'd simply use a built-in string type. Then I'd write functions to help transfer it between files, blocks of memory, and native variables. It's not that hard; probably much easier than trying to get operator overloading to work just right with a custom record type. Consider how much code you will write to load and store your data versus how much code you're going to write that uses your data structure like an ordinary string. You're going to write the data-persistence code once, but for the rest of the project's lifetime, you're going to be using those strings, and you're going to want them to look and act just like real strings. So use real strings. "Suffer" the inconvenience of manually producing the on-disk format you want, and gain the advantage of being able to use all the existing string library functions.

Rob Kennedy
A: 

The solution I eventually settled for is this (real-world sample - the string is, of course, the third member called "Ident"):

TASStructMemHeader = packed record
  TotalSize: cardinal;
  MemType: TASStructMemType;
  Ident: packed array[0..63] of WideChar;
  DataSize: cardinal;
  procedure SetIdent(const AIdent: string);
  function ReadIdent: string;
end;

where

function TASStructMemHeader.ReadIdent: string;
begin
  result := WideCharLenToString(PWideChar(@(Ident[0])), length(Ident));
end;

procedure TASStructMemHeader.SetIdent(const AIdent: string);
begin
  // Allow at most 63 characters, so that Ident[63] always remains #0.
  if length(AIdent) > 63 then
    raise Exception.Create('Too long structure identifier.');
  FillChar(Ident[0], length(Ident) * sizeof(WideChar), 0);
  if AIdent <> '' then
    Move(AIdent[1], Ident[0], length(AIdent) * sizeof(WideChar));
end;

But then I realized that the compiler really can interpret array[0..63] of WideChar as a string, so I could simply write

  var
    MyStr: string;

  Ident := 'This is a sample string.';
  MyStr := Ident;

Hence, after all, the answer given by Mason Wheeler above is actually the answer.

Andreas Rejbrand
In the last code block, the assignment to `Ident` only works for string literals, which makes for rather uninteresting programs. The assignment to `MyStr` only works if `Ident` ends with a null character, which, according to your implementations of `ReadIdent` and `SetIdent`, is not necessarily the case.
Rob Kennedy
@Rob Kennedy: I see. I would have noticed that sooner or later! But then I will need my two small routines after all, right? I changed the code so that the string will always end with #0.
Andreas Rejbrand
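
For reference, one way the two routines could look with guaranteed null termination is sketched below. This is an assumption about the general shape, not the code actually used in the thread; TIdentBuf and the free-standing SetIdent and ReadIdent are hypothetical names, and Delphi 2009 or later is assumed. The reader stops at the first #0 instead of returning all 64 elements.

```delphi
program IdentDemo;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  SysUtils;

type
  // Hypothetical buffer type: 64 UTF-16 code units, the last one reserved
  // for the terminating #0.
  TIdentBuf = packed array[0..63] of WideChar;

procedure SetIdent(var Buf: TIdentBuf; const S: string);
begin
  // High(Buf) = 63, so at least one trailing #0 always survives.
  if Length(S) > High(Buf) then
    raise Exception.Create('Too long structure identifier.');
  FillChar(Buf, SizeOf(Buf), 0);
  if S <> '' then
    Move(Pointer(S)^, Buf, Length(S) * SizeOf(WideChar));
end;

function ReadIdent(const Buf: TIdentBuf): string;
begin
  // Assigning a PWideChar to a string copies up to the first #0.
  Result := PWideChar(@Buf[0]);
end;

var
  B: TIdentBuf;
begin
  SetIdent(B, 'Ident1');
  Assert(B[High(B)] = #0);
  Assert(ReadIdent(B) = 'Ident1');
  Assert(Length(ReadIdent(B)) = 6);
end.
```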