I have to convert a large legacy application to Delphi 2009. It uses strings, AnsiStrings, WideStrings and UTF8 data all over the place, and I am having a hard time understanding how the new string types work and how they should be used.

The application fully supported Unicode using TntUnicodeControls, and there are 3rd party DLLs which require strings in specific encodings, mostly UTF8 and UTF16, making the conversion task less trivial than one might expect.

I especially have problems with the C DLL calls and choosing the right type. I also get the impression that there are many implicit string conversions happening, because one of the DLLs seems to always receive UTF-8 encoded strings, no matter how the Delphi string is encoded.

Can someone please provide a short overview of the new Delphi 2009 string types UnicodeString and RawByteString, perhaps with some usage hints and possible pitfalls when converting a pre-2009 application?

+11  A: 

See Delphi and Unicode, a white paper written by Marco Cantù and I guess The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), written by Joel.

One pitfall is that each default Win32 API call is now mapped to the W (wide string) version instead of the A (ANSI) version; for example, ShellExecute now resolves to ShellExecuteW rather than ShellExecuteA. If your code does tricky pointer work that assumes the internal layout of AnsiString, it will break. A fallback is to substitute PChar with PAnsiChar, Char with AnsiChar, and string with AnsiString, and to append A to the Win32 API calls in that portion of the code. Once the code compiles and runs normally, you can refactor it to use string (UnicodeString).
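A minimal sketch of that fallback, next to the refactored version, might look like the following. The routine names OpenDocumentAnsi and OpenDocument are hypothetical and only serve as illustration; the file to open is taken from the first command-line argument.

```delphi
program AnsiFallbackSketch;

{$APPTYPE CONSOLE}

uses
  Windows, ShellAPI;

// Hypothetical pre-2009 routine, pinned to the ANSI types so that it keeps
// its old behaviour when compiled with Delphi 2009.
procedure OpenDocumentAnsi(const FileName: AnsiString);
begin
  // The explicit A suffix selects the ANSI entry point, which takes PAnsiChar.
  ShellExecuteA(0, 'open', PAnsiChar(FileName), nil, nil, SW_SHOWNORMAL);
end;

// The same routine after refactoring to the native string type.
procedure OpenDocument(const FileName: string);
begin
  // Plain ShellExecute is the wide version in Delphi 2009 and takes PWideChar.
  ShellExecute(0, 'open', PChar(FileName), nil, nil, SW_SHOWNORMAL);
end;

begin
  if ParamCount > 0 then
  begin
    // The explicit cast makes the Unicode-to-ANSI conversion visible.
    OpenDocumentAnsi(AnsiString(ParamStr(1)));
    OpenDocument(ParamStr(1));
  end;
end.
```

Keeping the ANSI variant around lets the old pointer code compile unchanged, while the cast at the call site marks the spot where characters outside the system code page can be lost.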

eed3si9n
+1 great links. Both are very interesting reads.
Smasher
I asked a similar question about [upgrading a C++Builder 2007 application](http://stackoverflow.com/questions/1392409/what-do-i-need-to-know-to-upgrade-a-complex-application-from-cbuilder-2007-to-2). Not all of it will be applicable, but some of the links and answers people gave might be useful for you.
David M
+5  A: 

Watch my CodeRage 4 talk on "Using Unicode and Other Encodings in your Programs" this Friday, or wait until the replay is available online.

I'm going to cover some encodings and explain about the string format.

The slides will be available shortly (I'll try to get them online today) and contain a lot of references to stuff you should read on the internet (but I must admit I forgot the link to Joel on Unicode that eed3si9n posted).

Will edit this answer today with the uploads and the links.


Edit:

If you have a small sample that shows your C/C++ DLL receiving UTF8 encoded strings when you expected a different encoding, please post it (mail me; almost anything at the pluimers dot com gets to me, especially if you use my first name before the at sign).

Session materials can be downloaded now, including the "Using Unicode and Other Encodings in your Programs" session.

These are links from that session:

Read these:

  1. Marco Cantu, Whitepaper “Delphi and Unicode”
  2. Marco Cantu, Presentation “Delphi and Unicode”
  3. Nick Hodges, Whitepaper “Delphi in a Unicode World”

Relevant on-line help topics:

  1. What's New in Delphi and C++Builder 2009
  2. String Types: Base: ShortString, AnsiString, WideString, UnicodeString
  3. String Types: Unicode (including internal memory layouts of the string types)
  4. String Types: Enabling for Unicode
  5. String Types: RawByteString (AnsiString with CodePage $ffff)
  6. String Types: UTF8String (AnsiString with CodePage 65001)
  7. String <-> PChar conversions: PChar fundamentals
  8. String <-> PChar conversions: Returning a PChar Local Variable
  9. String <-> PChar conversions: Passing a Local Variable as a PChar
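As a rough illustration of the string types in the help topics above, the sketch below inspects element size and code page at run time with StringElementSize and StringCodePage from the System unit. The variable names are made up for the example.

```delphi
program StringTypesSketch;

{$APPTYPE CONSOLE}

var
  U: UnicodeString;
  U8: UTF8String;
  R: RawByteString;
begin
  U := 'caf' + WideChar($00E9);  // "café" as UTF-16
  U8 := UTF8String(U);           // explicit cast transcodes to UTF-8
  R := U8;                       // RawByteString: no conversion, payload keeps its code page

  Writeln('UnicodeString element size: ', StringElementSize(U));
  Writeln('UTF8String element size:    ', StringElementSize(U8));
  Writeln('UTF8String code page:       ', StringCodePage(U8));
  Writeln('RawByteString code page:    ', StringCodePage(R));
  // Length counts elements, so the UTF-8 form is longer for non-ASCII text.
  Writeln('Length as UTF-16 / UTF-8:   ', Length(U), ' / ', Length(U8));
end.
```

The point of RawByteString here is that it accepts any AnsiString flavour without transcoding, which makes it a reasonable parameter type for routines that only pass bytes along.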

Hope this gets you going. If not, mail me and I'll try to extend the answer here.

Jeroen Pluimers
That's a strange title, considering that Unicode isn't an encoding, but (to quote Wikipedia): "Unicode can be implemented by different character encodings."
mghie
As a non-native English speaker, I couldn't come up with a short title that did cover the subject correctly. If you have one: please let me know. I'd be glad to change the title.
Jeroen Pluimers
I'm not a native speaker either, but I think the title as it stands contains a false statement, and that's unfortunate as there are too many misconceptions about Unicode anyway. "Using Unicode and choosing encodings ..." would be more correct. Since I don't know your talk I don't know whether it's a better title, though.
mghie
Thx! I'll try to rename everything into "Using Unicode and choosing text/string encodings in your programs".
Jeroen Pluimers
As a native English speaker, I think the title is fine. It's "(Using Unicode) and (Other Encodings)" not "Using (Unicode and Other Encodings)". It's not precise, but that's the nature of English, isn't it? :-)
Tim Sullivan
... or perhaps "Using (Unicode) and (Other Encodings)", which is also alright.
Tim Sullivan
Thank you for the list of good resources and your generous offer to help me. But if I still have questions, I'd rather ask them here, so other readers can profit from your knowledge, too. :)
DR
@Tim: How is "(Using Unicode) and (Other Encodings)" fine - isn't that like "(Eating Apples) and (Other Kinds of Wood)"? Is English really **that** imprecise?
mghie
The CodeRage 4 replays have come online. For this particular one, see http://www.delphifeeds.com/go/s/60421 For all sessions, see http://conferences.embarcadero.com/coderage/sessions
Jeroen Pluimers
A: 

Note that it does not only hit real string code. It also hits code where PChar is used to trawl through buffers, or to interface with APIs.

E.g. the initialization code of headers that load the DLL dynamically (GetProcAddress/LoadLibrary).
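A small sketch of both effects, assuming a Windows target: Char is now two bytes wide, so a PChar no longer steps through a buffer byte by byte, and the export name passed to GetProcAddress is still a single-byte string. kernel32.dll and GetTickCount are used only as a convenient example.

```delphi
program PointerStrideSketch;

{$APPTYPE CONSOLE}

uses
  Windows;

const
  // Export names inside a DLL are single-byte strings, so keep them ANSI.
  ProcName: AnsiString = 'GetTickCount';
var
  Lib: HMODULE;
  Proc: Pointer;
begin
  // Inc on a PChar advances by SizeOf(Char); use PAnsiChar or PByte
  // when the buffer has to be walked one byte at a time.
  Writeln('SizeOf(Char):     ', SizeOf(Char));
  Writeln('SizeOf(AnsiChar): ', SizeOf(AnsiChar));

  // LoadLibrary is the wide version in Delphi 2009; the library name is a
  // native string.
  Lib := LoadLibrary('kernel32.dll');
  if Lib <> 0 then
  try
    // The explicit PAnsiChar cast keeps the export name single-byte.
    Proc := GetProcAddress(Lib, PAnsiChar(ProcName));
    Writeln('Entry point found: ', Assigned(Proc));
  finally
    FreeLibrary(Lib);
  end;
end.
```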

Marco van de Voort
A: 

It seems almost all my problems come from the automatic conversion on assignments to UTF8String.

I already had old code using UTF8String just to help me think which type of string a variable should contain.

When starting to port my application, I replaced AnsiString with UTF8String for the same reason, but the code depended on UTF8String being just an alias for (classic) AnsiString.

Now with the automatic conversion that assumption is no longer true, which created many problems.

Be careful if you use UTF8String when porting from pre-2009 Delphi code!
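To make the pitfall concrete, here is a minimal sketch, assuming legacy code that stored UTF-8 bytes in a plain AnsiString. The variable names are hypothetical. Assigning such a string to a UTF8String transcodes it from the system ANSI code page, while a RawByteString plus SetCodePage with Convert set to False only re-labels the bytes.

```delphi
program Utf8AssignSketch;

{$APPTYPE CONSOLE}

const
  // The five bytes of "café" encoded as UTF-8.
  Utf8Bytes: array[0..4] of Byte = ($63, $61, $66, $C3, $A9);
var
  Legacy: AnsiString;
  Converted: UTF8String;
  Relabelled: RawByteString;
begin
  // Pre-2009 habit: UTF-8 payload carried in a plain AnsiString, which
  // Delphi 2009 labels with the system ANSI code page.
  SetString(Legacy, PAnsiChar(@Utf8Bytes[0]), Length(Utf8Bytes));

  // Automatic conversion: the bytes are read as ANSI text and re-encoded
  // as UTF-8, so the payload is typically mangled (double-encoded).
  Converted := Legacy;

  // No conversion on assignment to RawByteString; SetCodePage with
  // Convert = False only changes the code page label to UTF-8 (65001).
  Relabelled := Legacy;
  SetCodePage(Relabelled, 65001, False);

  Writeln('Legacy length:        ', Length(Legacy));
  Writeln('Converted length:     ', Length(Converted));   // depends on the system code page
  Writeln('Relabelled length:    ', Length(Relabelled));  // bytes untouched
  Writeln('Relabelled code page: ', StringCodePage(Relabelled));
end.
```

Whether re-labelling is the right fix depends on where the bytes came from; it is only safe when the payload really is UTF-8 already.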

DR