tags:

views:

346

answers:

5

After reading the article "Back to Basics" by Mr.Spolsky I have thought about string structure in C, that converges most of advantages of Pascal-style string (with length byte) and classic ASCIIZ-strings in C and reduces most of their disadvantages. The main demand is to make this new string effective in machine commands. (for this task I assume, that every character is single byte. Excuse me. :) )

My idea is to store strings (that also can contain zero bytes) in byte array of such structure:

  • 0 - length in bytes of string length presentation (value is 1 or more);
  • 1..presLength+1 - string length presentation (1 or more bytes; may have fixed length (that processes simpler, but determines string length limit) or not);
  • presLength+2..arrrayLength - string content.

I expect this approach can solve most of the problems noticed by Mr.Spolsky, but suppose, that there are some lacks. I want to know you opinion. What do you think about it?

+4  A: 

From C++ experience, we know that a more effective string type optimizes for short strings. You do so by storing a string base as a 16 or 32 byte object. This replaces the 4 or 8 byte char* you'd traditionally use in C. In the string base, you would store the size. If the size is small, you use the rest of the struct to store the actual string contents. This avoids heap allocation for small strings, and improves the locality of reference. For big strings, you do allocate memory on the heap, and store that pointer after the string length.

One of the reasons this is effective is because cache lines are typically 16 bytes, and copied from and to main memory as a unit. Hence, if you read a 4 byte string length, the 12 bytes following it are cheap.

MSalters
+3  A: 

Your variable-length length representation is going to be a nightmare to work with, particularly in an efficient way.

Since no single object in C can be so large that its size can't be represented as a size_t, the length field will never need to be bigger than sizeof(size_t). Unless you have a huge number of very short strings (as in, a couple of characters), it seems unlikely that the juice is worth the squeeze - you could just have a fixed size size_t length value at the start of the string.

caf
...And end up with memory overhead that migth reach 100% if the string set is dominated by short strings.
AndreyT
If you have an application that keeps hundreds of millions of 4 character strings in memory... you might find it profitable to store them in another way :)
caf
@AndreyT: if you're storing a lot of strings in a single array, and want to minimise the overhead per string, then each string should not be storing its own size in any format. Sort them by length, and keep a separate bsearchable index that lets you look up the length given the string pointer. The questioner's scheme requires at least 2 bytes per non-empty string, which is already close to a 100% overhead if the string set is dominated by (very) short strings. I'd question how often the difference between 2 bytes per string and 4 bytes per string matters, but you can't improve another way.
Steve Jessop
... but the answer to how often is presumably "more often than never". So although it's wrong for caf to say for certain that you "might as well" use a fixed length-field, I also think he's right to suggest it as an alternative. The question is "what do you think of my design", not "does there exist a design which is strictly better in every respect", so I understand caf's answer to mean "I think this other way a better general approach".
Steve Jessop
With the "might as well", I was thinking about the other end - the way that the OPs designed allowed for strings up to 2^2040 bytes long (which seems somewhat like overkill).
caf
+6  A: 

By far too complicated. Even a very simple operation - appending a single character to the end of the string - requires the following steps (assuming mutable strings):

  1. Read byte 0 to determine length of length presentation
  2. Read byte 1..p, calculate length
  3. Write character at end of string
  4. Increase Length, check for overflow
  5. If overflow: increase byte 0, move whole string to make space for longer length presentation
  6. Write new length

This is hardly reasonable. It's much easier and also much faster to just have a pascal-like string, but with a word (32bit integer) instead of the length byte, and have the whole string word-aligned. The 4GB limit on the maximum size hardly matters for anything that is really a string.

ammoQ
Already done. See BSTR.
Alex
Alex: Sure, it's too obvious not to be invented in late 2009
ammoQ
Firstly, the main problem with 32-bit length is not the 4Gb limit, but potential gigantic memory overhead with short strings. Secondly, so what that it is too much work? What matters is that counter size increase will be a relatively rare event and that the "work" will be donr be the library implementation.
AndreyT
AndreyT: The work will be done by the processor. I'm talking about speed issues. Regarding the memory overhead: I think you overestimate that.
ammoQ
Memory overhead of the counter size? The solution relies on heap allocation, even for short strings. That probably adds another 8-16 bytes per string. Also, there is a very high correspondence between the length of the string and the length of the allocated memory, but this is not used in the memory allocation scheme. Hence this information is stored twice, which is obviously inefficient.
MSalters
Note for adding single character or concatenation. I preffer approach like using Java's classes StringBuilder/StringBuffer (with growing buffers) and avoiding simple concatenation in cases, when performance is important.
Alexander Pavloshchuk
Actually, the suggested algorithm for appending is a bit bad. It's easier to keep the length manipulation together. That means moving step 3 to the end. Still a pain, though.
MSalters
+1  A: 

As an exercise, I think you should go right ahead. For actual usage, there are already enough well tested implementations of strings, that you should really use one of those.

This might give you an idea of what is out there already. Although the list seems not to be complete.

Lucas
A: 

Before going full bore down this path yourself, you should look at other attempts to solve this problem. You'll learn what has worked (and maybe not worked) and get ideas for your own solution. Or possibly even better, find an already made solution.

There are quite a few libraries out there that try to improve on C's handling of strings. Here are some pointers to get you started:

There's a vstr project page that has a pretty comprehensive list of (and comparisons with) a number of other string libraries: http://www.and.org/vstr/comparison

Michael Burr