views:

273

answers:

6

Assume you are designing and implementing a new language from scratch, though you may freely borrow ideas from existing languages/implementations.

Question: If a programmer declares a string variable (assume strongly typed), how would you choose to store this variable in memory?

There are many use cases, but do you have a particular model that is superior in certain areas? Is your string mutable? Is it mutable, but only to a certain length that isn't the end of memory? Can I dynamically set the length, or can it only be done at compile time? Is it easy to access the 'nth' element? Does the string require a contiguous sector of memory? Could it be broken up into smaller strings?

Certain things to consider that programmers might like to do with your string: Calculating the length. Adding to the string. Extracting parts of the string (substrings). Applying Regex. Converting to a different value (number, boolean, etc)

EDIT: Clarifying what I mean.

If a user declares the following:

var Name : string

How would you choose, as the language designer, how to store this in RAM? What are the advantages and disadvantages of your method, etc.

+5  A: 

If I were writing a language from the ground up, I would want both mutable and immutable string types defined. Immutability makes string-handling operations much faster, but creates serious limitations, particularly when it comes to concatenation and the like.

The immutable string, I would store as a null-terminated array of unicode values. The mutable string, I would store as a linked list of unicode chars for easier reshuffling, slicing, etc.

Jekke
Wouldn't a LL of chars be quite a performance bottleneck?
Gary Willoughby
I think a linked list of arrays of unicode values would be more efficient, but the basic idea seems correct.
Darron
I would keep even the mutable string as an array of unicode chars in memory; however, I would pre-allocate bigger buffer, like 256 bytes. Otherwise your mutable instances become too costly on both memory and time consumed.
Franci Penov
+2  A: 

I'm assuming you mean as if you were designing a language? Then i guess i'd go with C's model and store it as a contiguous piece of memory, null terminated. That seems the most logical way to me.

Pros: No memory wasted if you discount the null.

Cons: Have to calculate the string length via a method, etc.

Gary Willoughby
A: 

I would dissemble .NET String class and build from that.

Matthew Ruston
Wouldn't that be illegal?
cdmckay
You could use Mono string and it would be pretty much the same.
Manuel Ferreria
+1  A: 

I believe you are asking of how would you implement a string object.

For performance reasons, you would want to keep the memory allocated for the characters in the string as a single block. This would speed up the operations performed on all the elements - case change, copy, length calculation, indexof and so on. It will also make it easier to implement operations that work on the start or the end of the string - trim, substring and so on.

There are certain operations where a data structure like linked list would make it easier to implement, like insert or delete of a character/substring. However, given the ratio of the memory overhead to maintain such data structure to the average string length, the cost outweighs any potential gains.

Whether a string should be immutable or not is dictated by two considerations:

  • do you give direct access to the memory storage or if all operations are encapsulated in a class?
  • is your memory allocation managed by the runtime or you have to manage it yourself?

The traditional C++ approach is to give direct access to the underlying memory to the code that uses the string object. This makes a lot of sense, since the memory is allocated and managed by the client code anyway, so giving direct access to it offers the best performance. The drawback is that any operation that changes the length of the string usually leads to reallocating the memory. There are smart string classes that impelemnt their own memory manager in order to deal with this problem, like the ATL's CString.

The C# approach is to encapsulate the underlying memory in the object and make the string immutable. This allows for the memory to be managed by the CLR and the objects are garbage collected by the same rules that apply to any other object. There's a small perf penalty because of that, but the benefits of the simplified usage and the ability to offer stable implementation of fairly complex operations outweigh the cost to the perf. Plus, there's the accompanying StringBuilder class that offers the some of the advantages of direct access to the memory by prealocating bigger buffer and mutating the instance in it until it's finalized to a String instance.

Franci Penov
+2  A: 

I would start by requiring an encoding attached to the string. If not specified in source, string literals would have the same encoding as the source file itself.

Of course I'm biased towards UTF-8, and would probably arrange for the standard library to operate in that mechanism

Also, I'd consider using a structural representation that is a little smarter than an array of bytes, because who wants to hassle with buffers!

SGI's template library comes with a 'rope' abstract type that does this pretty well. Iterators (but not iteration) are expensive, but in exchange, inserts, deletes, appends, sub-ranges and comparisons are all quite cheap.

There's another good implementation in the Lua Programming guide, which implements a 'tower of Hanoi' optimization, which is ideally suited to building strings from the front to the back iteratively, as is often done when reading a large file.

TCL has an indirect way of doing this by means of its text field widget. It even makes annotating the text generally useful. The only downside is this design works poorly for sequences which do not have a line-oriented distribution.

The major reason cited for using immutable strings is because a dynamic or interpreted language uses strings for itself. Really it uses atoms, which are arbitrary, but need to be convertible to and from strings. Lisp does this explicitly with symbol constants separate from strings. I like this, even though i'm not otherwise enamored of lisp.

TokenMacGuy
+2  A: 

I'd avoid C strings. Computing the length is O(n). Sharing substrings is nearly impossible. The contiguous memory requirement leads to fragmentation. Any problem with the terminator leads to bugs and security holes. If you store it as UCS-4, you waste a lot of space for ASCII strings (and lose C compatibility, the one benefit of C strings); if you store it as UTF-8, indexing is O(n). PDP-11's ASCIZ type only really made a lot of sense when you're writing a library for ASCII on a bare PDP-11.

Languages younger than the PDP-11 often use a different structure:

  • Pascal uses a length field instead of a terminator -- their strlen() is O(1).
  • Forth uses (address, length) doubles -- their strlen() is O(1), plus they can easily share substrings.
  • Many modern "managed" languages like Java store the length separately, as well.
  • In other languages (like Common Lisp), strings are simply a subtype of vector (whose elements are characters).
  • The Excel team used C, but implemented their own Pascal strings for performance.

I'd use something like ropes. Concatenation is constant-time. They don't require contiguous memory. Substring sharing is easy. All operations can be performed without locking in a multithreaded environment. Maybe allow both UCS-4 and ASCII nodes to make storage more compact in the common case, and/or have it automatically use a simpler structure internally for really short strings.

ASCIZ is great if you have little memory, short strings, 7-bit chars, trusted input, and your CPU is so slow that it's worth your programmer-time to be really careful. In the modern world of Unicode, multithreading, efficient GC, fast CPUs, and big (possibly untrusted) inputs, it's no longer a great choice.

Ken