tags:

views:

305

answers:

9

What is it about the way strings are implemented that makes them so expensive to manipulate?

Is it impossible to make a "cheap" string implementation?

or am I completely wrong in my understanding?

Thanks

+21  A: 

Which language?

Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.

This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.

If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.

If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.

Robert Venables
+1 for Ropes. Interesting read.
Andy_Vulhop
the wikipedia link got broken, looks like a regex fail by SO
Will Bickford
Hint, use the link button.
Dykam
+1 for keeping things in perspective.
Steven Sudit
Thanks novatrust. I'm not particularly interested in a specific language. Joel often mentions strings in the podcast. and they've been mentioned other places. Your explanation of a string being an array underneath is what I was looking for. Thanks again
Greg B
A: 

If you want a universal string working in every condition, you have to sacrifice efficiency in some cases. This is a classic tradeoff between getting one thing fast and another. So... either you use a "standard" string working properly (but not in an optimal way), or a string implementation which is very fast in some cases and cumbersome in other.

Sometimes you need immutability, sometimes random access, sometimes quick insertions/deletions...

liori
+2  A: 

The problem with strings is that they are not primitive types. They are arrays. Therefore, they suffer the same speed and memory problems as arrays(with a few optimizations, perhaps).

Now, "cheap" implementations would require lots of stuff: concatenation, indexOf, etc. There are many ways to do this. You can improve the implementation, but there are some limits. Because strings are not "natural" for computers, they need more memory and are slower to manipulate... ALWAYS. You'll never get a string concatenation algorithm faster than any decent integer sum algorithm.

luiscubal
+1I whole heartily agree. Modern Strings are not Simple entities. They're built for easy manipulation. This functionality is desired but expensive.Compare the functionlity of a .NET String (or Python) vs old school C "string.h" handling (and memory handling as well)
CMB
+1  A: 

It depends entirely on what you're trying to do with it. Mostly it's that it usually requires at least 1 new array allocation unless it's replacing a single character in a direct seek. At the simplest level a string is an array of chars. So just about anything you want to do involves iterating, removing, or inserting new things into an array.

McAden
A: 

Changes and copying of strings tends to involve memory management.

Memory management is not good for performance since it tends to require some kind of global mutex that makes your code scale poorly to multiple cores.

Laserallan
+1  A: 

Since it creates new copy of the object every time in java its advisable to use StringBuffer

Syntax

StringBuffer strBuff=new StringBuffer();
strBuff.append("StringBuffer");
strBuff.append("is");
strBuff.append("more");
strBuff.append("economical");
strBuff.append("than");
strBuff.append("String");
String string=strBuff.tostring();
Harish
Does it automatically put whitespaces between appended words? :P
liori
@liori... LOL! :)
Greg B
A: 

You want to read this Joel Spolsky article:

http://www.joelonsoftware.com/articles/fog0000000319.html

Me, I'm disappointed .NET doesn't have a native type called F***edString.

JCCyC
Lol. Thanks for the link JCCyC. Joel's rants about strings on the podcast are the cause of my question.
Greg B
+1  A: 

Look into mutable strings, immutable strings, and ropes, and think about how you would implement common operations in a low-level language (say, C). Consider:

  1. Concatenation.
  2. Slicing.
  3. Getting a character at an index.
  4. Changing a character at an index.
  5. Locating the index of a character.
  6. Traversing the string.

Coming up with algorithms for these situations will give you a feel for when each type of storage would be appropriate.

Imagist
+2  A: 

Many of the points here are well taken. In isolated cases you may be able to cheat and do thing like using a 64bit int to compare 8 bytes at time in a string, but there are not a lot of generalized cases where you can optimize operations. If you have "pascal style" string with a numeric length field compares can be short circuited logic to only check the rest of the string if the length is not the same. Other operations typically require you to handle the characters a byte at time or completely copy them when you use them. i.e. concatenation => get length of string 1, get length of string 2, allocated memory, copy string 1, copy string 2. It would be possible to do operations like this using a DMA controller in a string libary, but the overhead of setting it up for small strings would outweigh the benefits.

Pete

NoMoreZealots
I was thinking about this while I was going to the garage. Generally the operation which cpu venders optimize are mathmatical, or a specific field of math. Examples are DSP, or 3D matrix ops for simulation and graphics. There is no reason that a CPU COULDN'T be optimized for string manipulation using a combination of cache manipulation and DMA schemes, there just hasn't been enough industry pressure to force venders to optimize string operations.
NoMoreZealots