views:

833

answers:

8

Hi,

I am wondering about the benefits of having the string type immutable, from the programmer's point of view.

Technical benefits (on the compiler/language side) can be summarized mostly as: it is easier to do optimisations if the type is immutable. See here for a related question.

Also, a mutable string type either has thread safety already built in (then again, optimisations are harder to do) or you have to take care of it yourself. Either way, you would have the choice of using a mutable string type with built-in thread safety, so that is not really an advantage of immutable string types. (Again, it is easier to do the handling and optimisations that ensure thread safety on an immutable type, but that is not the point here.)

But what are the benefits of immutable string-types in the usage? What is the point of having some types immutable and others not? That seems very inconsistent to me.

In C++, if I want some string to be immutable, I pass it to a function as a const reference (`const std::string&`). If I want a changeable copy of the original string, I pass it as `std::string`. If I want it mutable, I pass it as a reference (`std::string&`). So I can choose what I want to do, and I can do this with every possible type.

In Python or in Java, some types are immutable (mostly all primitive types and strings), others are not.
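In Python, for instance, string operations always return new objects while list operations mutate in place; a minimal illustration:

```python
s = "hello"
t = s.upper()      # returns a new string; s itself is untouched
assert s == "hello" and t == "HELLO"

# Lists, by contrast, are mutable: a method call in the same style
# changes the object in place.
xs = [3, 1, 2]
xs.sort()          # mutates xs, returns None
assert xs == [1, 2, 3]
```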

In pure functional languages like Haskell, everything is immutable.

Is there a good reason why it makes sense to have this inconsistency? Or is it purely for technical, lower-level reasons?

+1  A: 

There is no overarching, fundamental reason not to have strings mutable. The best explanation I have found for their immutability is that it promotes a more functional, less side-effect-heavy way of programming. This ends up being cleaner, more elegant, and more Pythonic.

Semantically, they should be immutable, no? The string "hello" should always represent "hello". You can't change it any more than you can change the number three!

katrielalex
The string `"hello"` in a language like C++ is constant, so it is not changeable.
Albert
@Albert: It is only constant as in "The effect of attempting to modify a string literal is undefined."
Rafał Dowgird
+1  A: 

Not sure if you would count this as a 'technical low level' benefit, but the fact that an immutable string is implicitly thread-safe saves you a lot of the effort of coding for thread safety.

A slightly toy example...

Thread A - Check user with login name FOO has permission to do something, return true

Thread B - Modify user string to login name BAR

Thread A - Perform some operation with login name BAR due to previous permission check passing against FOO.

The fact that the String can't change saves you the effort of guarding against this.
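The check-then-act hazard above can be sketched single-threaded, with a hypothetical `User` holder standing in for shared mutable state: because the `str` value captured at check time can never change, the later action uses exactly the name that was checked.

```python
class User:
    def __init__(self, name):
        self.name = name  # the *binding* is mutable; each str value is not

def check_permission(name):
    return name == "FOO"  # toy permission rule

user = User("FOO")
snapshot = user.name          # immutable str: this value can never change
assert check_permission(snapshot)

user.name = "BAR"             # another thread rebinds the attribute...
assert snapshot == "FOO"      # ...but the value we checked is still "FOO"
```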

Yes, but you could just have a thread-safe mutable string type and then just use it. Thread safety is not really a good reason, from the programmer's POV, to prefer the immutable type over a thread-safe mutable type.
Albert
It's not just about the thread safety of the String type. In the above example, you'd have to synchronise around the whole check-then-act block if the String were shared state, even if it were a thread-safe mutable type. Immutability means you don't have to do this. Again, a toy example, but a real one :)
In the above example, you would just use a copy of the string in the thread because you don't want it to be mutable from another thread. In C++, you would get the copy implicitly, depending on what your implementation looks like. If your point is that immutability is generally better, why is it that other types in Java/Python are mutable then?
Albert
OK, so yes, you could do a 'defensive copy', but again this is overhead for the programmer! Concede the point that they represent some benefit for the developer -- I agree, not a lot, but your question was very narrow :)
There is not really any overhead. If you do a `std::string a = b;` in C++, you already have a copy, not a reference. In fact, passing by reference is one character more to type, i.e. `std::string&`.
Albert
The overhead is trying to remember if you need to make copies or not.
Dennis Zickefoose
+1  A: 

If you want full consistency you can only make everything immutable, because mutable Bools or Ints would simply make no sense at all. Some functional languages do that in fact.

Python's philosophy is "Simple is better than complex." In C you need to be aware that strings can change and think about how that can affect you. Python assumes that the default use case for strings is "put text together" -- there is absolutely nothing you need to know about strings to do that. But if you want your strings to change, you just have to use a more appropriate type (i.e. lists, StringIO, templates, etc.).

THC4k
In C++, `int` or `bool` are also mutable. You can pass them by reference like everything else. So it is not a problem to just have everything mutable. But you are hitting my main question: is there any sense in having some primitive types immutable and the rest mutable?
Albert
@Albert: `int` *variables* are mutable, but `int` *values* are immutable. You can change `i`, but you cannot change `5`.
FredOverflow
In C++, an "int", "double", or "bool" is immutable. When a function is called with the integer value forty-two, the caller passes the bit pattern 0x0000002A which, as an integer, will //always// represent forty-two. This is different from FORTRAN, where the called function would receive a bit pattern pointing to an area of memory which would hopefully contain the number 42. If the called function changed the value stored there, some or all other places where the constant 42 was used would also be changed.
supercat
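supercat's point can be sketched in Python with a hypothetical `zap` function: rebinding the parameter has no effect on the caller, because the value `42` itself is immutable.

```python
def zap(i):
    i = 0          # rebinds the local name only
    return i

x = 42
zap(x)
assert x == 42     # the caller's value is untouched: 42 cannot be changed
```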
I think, technically, this depends on the compiler. Probably no C compiler will put an integer somewhere else and refer to it because it can just put it directly into the machine instruction. But on architectures where this is not the case, it might be possible to actually change it. Same for string pools for an executable if it manages somehow to overwrite that memory at runtime.
Albert
+1  A: 

In a language with reference semantics for user-defined types, having mutable strings would be a disaster, because every time you assigned a string variable, you would alias a mutable string object, and you would have to make defensive copies all over the place. That's why strings are immutable in Java and C# -- if the string object is immutable, it does not matter how many variables point to it.

Note that in C++, two string variables never share state (at least conceptually -- technically, there might be copy-on-write going on, but that is going out of fashion due to inefficiencies in multi-threading scenarios).
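Python has the same reference semantics for its objects, and a minimal sketch shows why aliasing is harmless when the shared object is immutable:

```python
a = "shared"
b = a                  # b aliases the very same object
assert a is b          # same object; no copy was made
b = b + "!"            # "mutation" actually rebinds b to a new object
assert a == "shared"   # a is unaffected; no defensive copy was needed
assert b == "shared!"
```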

FredOverflow
That is why you can decide if you want to pass a variable by reference or by value in languages like C++. So it is a mutable object and you can decide if you want to pass a mutable reference or a copy of the object or an immutable constant reference.
Albert
+1  A: 

If strings are mutable, then many consumers of a string will have to make copies of it. If strings are immutable, this is far less important (unless immutability is enforced by hardware interlocks, it might not be a bad idea for some security-conscious consumers of a string to make their own copies in case the strings they're given aren't as immutable as they should be).

The StringBuilder class is pretty good, though I think it would be nicer if it had a "Value" property (reading it would be equivalent to ToString, but it would show up in object inspectors; writing it would allow direct setting of the whole content) and a default widening conversion to a string. It would have been nice in theory to have a MutableString type descended from a common ancestor with String, so a mutable string could be passed to a function which didn't care whether a string was mutable, though I suspect that optimizations which rely on Strings having a certain fixed implementation would have been less effective.

supercat
@Albert: Java does not have call by reference or call by reference to const. (Is there any other language besides C++ that has a concept of "reference to const"?) Java leaves you no choice in parameter passing, so Java's string objects *must* be immutable.
FredOverflow
@Albert: Having to create a new copy of a string every time it is passed to a routine, even when the only thing that routine ever does with the string is pass it to other routines, creates substantial overhead.
supercat
@supercat: You mean performance overhead, right? And there we are again on the technical side. Btw., in C++, this is actually done in a performant way by some *copy-on-write* magic internally (so copying a `std::string` will not copy the internal data -- the internal data is only really copied once you try to modify one of the strings; before that, they keep a reference to the same underlying raw data). And besides that, in C++, you can also pass a const reference, i.e. an immutable reference to your string.
Albert
@Albert: I don't believe any `std::string` implementations utilize COW. And in a multithreaded environment, a const reference is hardly immutable. And as I commented elsewhere, there is non-technical overhead in needing to figure out which type of parameter passing is appropriate that immutable data simplifies.
Dennis Zickefoose
@Dennis: I don't know any STL implementation which does not. I just checked GCC's STL and it does that. Check out `std::basic_string::_Rep`, which holds the actual raw data and some refcounting information.
Albert
@Albert: so it does, I stand corrected. I am almost positive that STLPort is not COW, and Dinkumware stopped being COW a few years ago. There was also a proposal to forbid COW in C++0x, but I'm not sure if it made the cut. So hand waving away the cost of passing by value is still just handwaving.
Dennis Zickefoose
@Dennis: Interesting that you point that out. I just checked STLport's source and you are right. I always thought it was kind of a given fact that it does COW and uses this internal refcounting, making string copies a constant-time and very fast operation. I should be more careful. :) Though in most cases I just passed a const reference anyway. And of course, in a multithreading environment, you have to think about what you do, but that is just the same in Java for everything but the primitive types. And I think I never had a case where I shared a non-const string over several threads.
Albert
+13  A: 

What is the point of having some types immutable and others not?

Without some mutable types, you'd have to go the whole hog to pure functional programming -- a completely different paradigm than the OOP and procedural approaches which are currently most popular, and, while extremely powerful, apparently very challenging to a lot of programmers (what happens when you do need side effects in a language where nothing is mutable, and in real-world programming of course you inevitably do, is part of the challenge -- Haskell's Monads are a very elegant approach, for example, but how many programmers do you know that fully and confidently understand them and can use them as well as typical OOP constructs?-).

If you don't understand the enormous value of having multiple paradigms available (both FP one and ones crucially relying on mutable data), I recommend studying Haridi's and Van Roy's masterpiece, Concepts, Techniques, and Models of Computer Programming -- "a SICP for the 21st Century", as I once described it;-).

Most programmers, whether familiar with Haridi and Van Roy or not, will readily admit that having at least some mutable data types is important to them. Despite the sentence I've quoted above from your Q, which takes a completely different viewpoint, I believe that may also be the root of your perplexity: not "why some of each", but rather "why some immutables at all".

The "thoroughly mutable" approach was once (accidentally) obtained in a Fortran implementation. If you had, say,

  SUBROUTINE ZAP(I)
  I = 0
  RETURN
  END

then a program snippet doing, e.g.,

  PRINT *, 23
  CALL ZAP(23)
  PRINT *, 23

would print 23, then 0 -- the number 23 had been mutated, so all references to 23 in the rest of the program would in fact refer to 0. Not a bug in the compiler, technically: Fortran had subtle rules about what your program is and is not allowed to do in passing constants vs variables to procedures that assign to their arguments, and this snippet violates those little-known, non-compiler-enforceable rules, so it's a bug in the program, not in the compiler. In practice, of course, the number of bugs caused this way was unacceptably high, so typical compilers soon switched to less destructive behavior in such situations (putting constants in read-only segments to get a runtime error, if the OS supported that; or passing a fresh copy of the constant rather than the constant itself, despite the overhead; and so forth), even though technically they were program bugs and the compiler was allowed to display undefined behavior quite "correctly";-).

The alternative enforced in some other languages is to add the complication of multiple ways of parameter passing -- most notably perhaps in C++, what with by-value, by-reference, by constant reference, by pointer, by constant pointer, ... and then of course you see programmers baffled by declarations such as const foo* const bar (where the rightmost const is basically irrelevant if bar is an argument to some function... but crucial instead if bar is a local variable...!-).

Actually Algol-68 probably went farther along this direction (if you can have a value and a reference, why not a reference to a reference? or reference to reference to reference? &c -- Algol 68 put no limitations on this, and the rules to define what was going on are perhaps the subtlest, hardest mix ever found in an "intended for real use" programming language). Early C (which only had by-value and by-explicit-pointer -- no const, no references, no complications) was no doubt in part a reaction to it, as was the original Pascal. But const soon crept in, and complications started mounting again.

Java and Python (among other languages) cut through this thicket with a powerful machete of simplicity: all argument passing, and all assignment, is "by object reference" (never reference to a variable or other reference, never semantically implicit copies, &c). Defining (at least) numbers as semantically immutable preserves programmers' sanity (as well as this precious aspect of language simplicity) by avoiding "oopses" such as that exhibited by the Fortran code above.

Treating strings as primitives just like numbers is quite consistent with the languages' intended high semantic level, because in real life we do need strings that are just as simple to use as numbers; alternatives such as defining strings as lists of characters (Haskell) or as arrays of characters (C) pose challenges to both the compiler (keeping efficient performance under such semantics) and the programmer (effectively ignoring this arbitrary structuring to enable use of strings as simple primitives, as real-life programming often requires).

Python went a bit further by adding a simple immutable container (tuple) and tying hashing to "effective immutability" (which avoids certain surprises to the programmer that are found, e.g., in Perl, with its hashes allowing mutable strings as keys) -- and why not? Once you have immutability (a precious concept that saves the programmer from having to learn about N different semantics for assignment and argument passing, with N tending to increase with time;-), you might as well get full mileage out of it;-).
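A minimal illustration of that tuple/hashing point in Python -- immutable tuples are hashable and thus usable as dict keys, while mutable lists are rejected outright:

```python
d = {}
d[(1, 2)] = "point"        # tuples are immutable, hence hashable
assert d[(1, 2)] == "point"

try:
    d[[1, 2]] = "nope"     # lists are mutable, so Python refuses
except TypeError:
    pass
else:
    raise AssertionError("mutable keys should be rejected")
```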

Alex Martelli
Hey Alex, you are the one who initially raised the question for me. :) Thanks a lot for this long answer. I understand your Fortran argument but I don't really think it is valid. It actually depends on the compiler implementation. You can do the same in C++ and it might on some obscure compilers produce the same effect. (I think it is undefined behavior after you do a `(int`.) I think the main difference is, as you said, the different interpretation of what assignment means. In C++ it is "overwrite the object content", in Java/Py it is "assign var to a different obj".
Albert
+2  A: 

I am not sure if this qualifies as non-technical, nevertheless: if strings are mutable, then most(*) collections need to make private copies of their string keys.

Otherwise a "foo" key changed externally to "bar" would result in "bar" sitting in the internal structures of the collection where "foo" is expected. This way "foo" lookup would find "bar", which is less of a problem (return nothing, reindex the offending key) but "bar" lookup would find nothing, which is a bigger problem.

(*) A dumb collection that does a linear scan of all keys on each lookup would not have to do that, since it would naturally accommodate key changes.
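The corruption can be simulated in Python with a hypothetical `MutableKey` class (Python's built-in `dict` refuses mutable keys such as lists, but a custom hashable class lets it happen; a deterministic toy hash keeps the demo reproducible):

```python
class MutableKey:
    def __init__(self, s):
        self.s = s
    def __hash__(self):           # hash depends on mutable state: danger!
        return ord(self.s[0])     # deterministic toy hash for the demo
    def __eq__(self, other):
        return self.s == other.s

k = MutableKey("foo")
d = {k: 1}
k.s = "bar"                       # mutate the key after insertion
# The entry still sits in the bucket chosen for "foo", but its key now
# reads "bar", so neither spelling of the key can find it:
assert MutableKey("foo") not in d
assert MutableKey("bar") not in d
```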

Rafał Dowgird
Normally, in C++, you just always create copies if another object (in this case a hash table) needs to keep the string. In C++ STL implementations of `std::string` which use *copy-on-write* internally, the behavior is mostly the same as with immutable string types.
Albert
Also, beside that: your argument doesn't really explain why this is done only for string types and primitive types while all other types are mutable. By your argument, it would make sense to have everything immutable.
Albert
@Albert: Nope, only those things that are keys in collections. Some languages (Python) do exactly that.
Rafał Dowgird
+1  A: 

The main advantage for the programmer is that with immutable strings, you never need to worry about who might alter your string. Therefore, you never have to consciously decide "Should I copy this string here?".

Wouter Lievens
However, that is not really specific to strings or other primitive types. By this argument, you would end up with everything being immutable. And there you get to my key point: why have strings / other primitive types immutable and the rest not?
Albert
Albert: my default strategy is to make every object immutable unless it has a good reason not to be. For instance, in a geometry model I would make the Point (x, y) class immutable. That way, I won't have to do point.clone() everywhere I'm using a Point.
Wouter Lievens