I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and limitations of the language in this area.

The types are given on the website as:

char;    // unsigned 8 bit UTF-8
wchar;   // unsigned 16 bit UTF-16
dchar;   // unsigned 32 bit UTF-32

Since most of the Unicode Transformation Format (UTF) encodings represent characters with a variable width, does this mean that a char in D can only contain values that fit in 8 bits, or does it expand in the machine's physical memory when you give it a multi-byte character? Or is there some other possibility, such as automatic promotion to the next larger type when a value no longer fits in the variable?

Let's say, for example, that I want to use the UTF-8 char in an editor and type in Chinese. Will it simply fall over, or is it able to deal with Unicode characters more 'correctly', as in C#? Would it still be necessary to provide glue code to allow working with any language supported by Unicode?

I'd appreciate any specific information you can offer on how these types work under the covers, and any general best practices advice on dealing with their limitations.

+10  A: 

A single char or wchar represents a UTF code unit. This means that, on its own, a char can either represent an ASCII symbol (0-127) or be part of a UTF-8 sequence representing a Unicode character (code point). Only the dchar type can represent every Unicode character in a single value, because there are more than 65536 code points in Unicode.
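To make the code unit / code point distinction concrete, here is a minimal sketch, assuming a current D compiler with Phobos. The helper name countCodePoints is made up for this illustration. The test text is the letter a, one Chinese character and one character from outside the 16-bit range, written as escapes so the example does not depend on the source file's encoding.

```d
import std.stdio : writefln;

// Hypothetical helper for this example: counts code points by letting
// foreach decode the UTF-8 code units into dchar values.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // 'a' (U+0061), a Chinese character (U+4E2D), an emoji (U+1F600)
    string  s = "a\u4E2D\U0001F600";
    wstring w = "a\u4E2D\U0001F600"w;
    dstring d = "a\u4E2D\U0001F600"d;

    // .length counts code units, not characters.
    assert(s.length == 8); // 1 + 3 + 4 UTF-8 code units
    assert(w.length == 4); // 1 + 1 + 2 UTF-16 code units (surrogate pair)
    assert(d.length == 3); // one UTF-32 code unit per code point

    assert(countCodePoints(s) == 3);

    // Iterating with a dchar loop variable yields whole code points.
    foreach (dchar c; s)
        writefln("U+%04X", cast(uint) c);
}
```

So a char never grows: the Chinese character simply occupies three consecutive char elements of the array, and indexing or slicing a string works on those elements rather than on characters.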

Casting one string type to another (string, wstring and dstring are simply dynamic arrays of the character types) will not automatically convert the contents to the respective UTF representation. To do this, you must use the functions toUTF8, toUTF16 and toUTF32 from std.utf (or toString / toString16 / toString32 from tango.text.convert.Utf if you use Tango).
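As a rough sketch of the Phobos route, assuming a current D compiler (the Tango functions are not shown here), the conversions look like this:

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    // 'a' (U+0061) followed by a Chinese character (U+4E2D)
    string s = "a\u4E2D";

    // Each call transcodes the contents into the target encoding.
    wstring w = toUTF16(s);
    dstring d = toUTF32(s);

    assert(s.length == 4); // 1 + 3 UTF-8 code units
    assert(w.length == 2); // both characters fit in one UTF-16 code unit
    assert(d.length == 2); // one UTF-32 code unit per code point

    // Converting back yields the original UTF-8 text.
    assert(toUTF8(w) == s);
    assert(toUTF8(d) == s);
}
```

Each conversion may allocate a new array, so for text that is indexed by character a lot it can be simpler to convert to dstring once and work on that.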

Users have implemented string classes which will automatically use the most memory-efficient representation that can map each character to a single code unit. This allows quick slicing and indexing with a minimal memory overhead. One such implementation is mtext by Christopher E. Miller.

Further reading:

CyberShadow
It seems my problem was in understanding the difference between code units and code points. The last link explains everything very nicely. Thanks.
IanGilham
Ditto on the compliment for Mr. Keep's write-up.
BCS
I'm glad to hear the article helped you. Still, curse you, CyberShadow, for answering this before I got here! :P
DK