Unicode strings in process memory

+2 Q:

What is the preferred format for Unicode strings in memory while they are being processed, and why?

I am implementing a programming language by producing an executable file image for it. Obviously a working programming language implementation requires a protocol for processing strings.

I've thought about using dynamic arrays as the basis for strings because they are very simple to implement and very efficient for short strings. I just have no idea about the best possible format for characters when using strings in this manner.

+5 A:

UTF-16 is the most widely used in-memory format; Windows, Java, .NET and JavaScript all use it for their native strings.

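The advantage of UTF-16 over UTF-8 is that, although it is less compact for mostly ASCII text, every character in the Basic Multilingual Plane has a constant size of 2 bytes (16 bits). That only holds as long as you don't use surrogates, which encode code points above U+FFFF as a pair of 16-bit units. When you stick to 2-byte characters only, the encoding is called UCS-2.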
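To make the surrogate caveat concrete, here is a minimal sketch of how a code point maps to 16-bit code units. The function name to_utf16_units is made up for illustration, and the sketch assumes the input is a valid code point that is not itself in the surrogate range.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit code units that encode one code point.

    Hypothetical helper: assumes 0 <= code_point <= 0x10FFFF and that the
    value is not itself a surrogate (0xD800..0xDFFF).
    """
    if code_point < 0x10000:
        # Basic Multilingual Plane: one unit, identical to UCS-2.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([hex(u) for u in to_utf16_units(0x00E9)])   # one unit
print([hex(u) for u in to_utf16_units(0x1F600)])  # two units (a surrogate pair)
```

So an index into an array of 16-bit units equals a character index only if no code point above U+FFFF ever appears in the string.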

In UTF-8 only the 128 ASCII characters are encoded in 1 byte; all others take 2 to 4 bytes. This makes character processing less direct and more error-prone.
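As a rough illustration of why UTF-8 processing is less direct, the sketch below computes the encoded length of a code point and finds the byte offset of the n-th character by scanning. Both function names are invented for this example, and the scan assumes the input bytes are well-formed UTF-8.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point (1 to 4)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def utf8_offset_of_char(data, index):
    """Byte offset of the index-th code point in well-formed UTF-8 bytes.

    There is no way to jump straight to the offset: the scan has to skip
    continuation bytes (those of the form 0b10xxxxxx) one by one.
    """
    seen = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # a lead byte starts a new code point
            seen += 1
            if seen == index:
                return offset
    raise IndexError("character index out of range")


sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print([utf8_length(ord(c)) for c in "a\u00e9\u20ac\U0001F600"])
print([utf8_offset_of_char(sample, i) for i in range(4)])
```

Finding the n-th character is linear in the length of the string here, whereas with a fixed-width encoding it is a single array lookup.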

Of course, using Unicode is preferred in any case, since it lets you handle international characters.

Think Before Coding 2008-12-24 12:23:00

+2 A:

The CPython 2.x series used either UTF-16 or UCS-4, depending on the platform and build options.

Here's an interesting discussion from python-dev on the requirements and trade-offs in choosing the Unicode internal representation for Python 3.0. While there's more content there than I can briefly describe, it includes:

Discussing the external interface (constant time slicing, efficient implementations of .lower, .islower, etc.)
External requirements (GTK takes UTF-8 strings, Qt takes UTF-16 and UCS-4 strings, Windows takes UTF-16 strings, etc.)
It points at other implementations of Unicode data (e.g. Qt's).
It discusses important use cases (which are closely related to the external interface).
etc.
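
The external requirements point means that whatever internal format you pick, strings get re-encoded at the boundary to each library. A minimal sketch, assuming a little-endian machine; the table, its keys and the function encode_for are hypothetical names for illustration, not part of any of these libraries:

```python
# Hypothetical table: which encoding each external interface expects.
# "utf-16-le" assumes a little-endian platform.
EXTERNAL_ENCODINGS = {
    "gtk": "utf-8",
    "qt": "utf-16-le",
    "windows": "utf-16-le",
}


def encode_for(text, target):
    """Encode an internal string into the bytes the target interface expects."""
    return text.encode(EXTERNAL_ENCODINGS[target])


message = "h\u00e9llo"
for target in ("gtk", "qt", "windows"):
    print(target, len(encode_for(message, target)), "bytes")
```

If most of your calls go to one of these interfaces, matching its encoding internally avoids a conversion on every call.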

Aaron Maenpaa 2008-12-24 13:01:06
