tags:

views:

313

answers:

7

Since strings are immutable in .NET, why are they copied for simple operations such as Substring or Split? For example, by keeping a char[] value, int start and int length, a substring could be created to simply point to an existing string, and we could save the overhead of copying the string for many simple operations. So I wonder, why was the decision chosen to copy strings for such operations?

For example, was this done to support the current implementation of StringBuilder? Or to avoid keeping a reference to a large char[] when only a few characters are required? Or any other reason you can think of? Can you suggest pros and cons for such design?

As mentioned by @cletus and supported by @Jon Skeet, this is more like asking why .NET strings were built differently from Java in this aspect.

+5  A: 

If you were re-using the same string to return substrings, what would happen when the main string goes out of scope?

In the best scenario, it would need to stay in memory and could not be collected until all substrings were also released, so you'll end-up using actually more memory.

That's just one of the issues.

In effect, the garbage collector would have few choices:

  • keep the whole original string in memory, even through only a very short substring of it may be used.

  • Release the parts of the original string that are not referenced and only keep the substring where it is. This would create a lot of fragmentation, meaning that the garbage collector would probably have to relocate the strings at some point: we'll end-up making a copy anyway.

I'm sure it has its use cases and it could sometimes be more efficient when working on substrings (say when working on large XML documents).
However, as Jon said, Java strings objects require more space, so if you have lots of small strings, they could actually use more memory than the way .Net does.

It's a tradeoff.
I think that if you're in a case where it really matter how memory is managed and you need to have a completely predictable behaviour, neither Java or .Net would be the best tools.

We use Garbage collectors because they are optimised to work efficiently in the vast majority of cases.
Knowing how they work is important, but whether they re-use strings or not is more of an optimisation left to the underlying framework and it should not leak too much to the surface.
GCs are, after all, here to help us.

Renaud Bompuis
Objects don't go "out of scope" in .NET. They are always on the heap, and will not be garbage-collected until they are no longer referenced.
Hosam Aly
@Hosam that's the point, why would we want to keep the original string on the heap after it's gone out of scope, just to support derivations of it?
Rex M
Exactly, so you'll end-up with whole string being kept on the heap just because a substring of it is being referenced somewhere.
Renaud Bompuis
Yes, the object will stay in memory until all substrings are collected, but this might be a better usage of memory and processor. The substring method could have defined a threshold for whether the string should be copied. So why wasn't it done that way? (Or why would this be a bad design?)
Hosam Aly
It's a bad design because it's complicated. You're coming up with extra logic ("threshold for whether the string should be copied"????) to try to plug all the potential holes, when you could just say to your developers "Here is how it works." and be done with it.
Rex M
@Rex M: is complexity really relevant for the internal implementation of a framework that is to be used by billions of people? I don't think so. If it can be implemented internally, performance wins over complexity IMHO.
Hosam Aly
@Hosam Two things: (1) simplicity is important because we need to understand how string works in order to use it effectively. The .NET framework is not a black box. (2) complexity is a symptom of bad design.
Rex M
I agree, why do people question the way .NET works internally? Its built by incredibly smart people. Let them do what they do and let us concentrate on the problem at hand.
Sir Psycho
@Rex: Discussing framework design decisions does not necessarily mean we are trying to improve them, but rather that we want to understand *why* the smartest minds you're talking about think that way.
Hosam Aly
@Sir Psycho: Because understanding the decisions of those incredibly smart people is a good way to enhance our skills. We may be designing frameworks ourselves one day, you know. :)
Hosam Aly
@Rex: You think string is simple now? Are you aware that they're mutable within mscorlib? Do you really *need* to know that? How about the top bit of the "length" field, which (IIRC) indicates whether the string is entirely ASCII or not?
Jon Skeet
@Sir Psycho: That's a terrible attitude. It no one ever tried to understand and question implementation design decisions, we'd never come up with better ones.
cdmckay
@Jon no, "string" is not simple. We are not talking about "string" though, we are talking about whether strings are copied or not. I simply said that a single facet of a complex system should be simple. We as developers benefit from knowing that strings are always copied in .NET.
Rex M
@Rex: It's pretty easy to know that they *aren't* copied in Java. Again, you can say "Here is how it works" and be done with it. Java's success doesn't seem to have been affected by the way it implements strings vs how .NET does...
Jon Skeet
@Jon I never said that Java's implementation is bad, simply that Hosam's suggestion of an amalgam between the two could be bad.
Rex M
@Rex: I think Hosam is suggesting what Java currently does.
cdmckay
@cdmckay further up in the comments, my understanding of @Hosam's suggestion is using Java's method for one set of string sizes and .NET's for the rest.
Rex M
Well, that's a variant of the original suggestion. The original "in the question" suggestion was just what Java does.
Jon Skeet
@Jon my response to the "in the question" suggestion was based on (admittedly anecdotal evidence that) given "average" string use in an average app, .NET's string wins over Java's more times than not. So assuming naivete of usage, .NET's method is better. Though probably not by alot. Is this wrong?
Rex M
I don't think any of us can sensibly talk about what an "average" application is. I wouldn't be surprised if MS had benchmarked both approaches in a number of situations though.
Jon Skeet
+1  A: 

In your substring example, that would mean we re-execute the substring logic every time we make a reference to the "new" string. The overhead of that alone makes it pretty obvious why we copy strings.

Rex M
I'm not sure I understand what you mean...
Hosam Aly
@Hosam every time I need to get the value of the substring, we would be re-seeking to the starting index of the original string and traversing to the ending index. Or, we could do that operation one time, and put the result in memory. Which is how it actually works.
Rex M
Sorry, but I am still unable to get your point. Taking a substring can either be a matter of calculating subscripts (m_start + argStart) and new length, or it can be done with memcpy (or its equivalents). How would copying be inferior to subscript calculation by any means?
Hosam Aly
Copying is always superior, which is why it works that way.
Rex M
*Why* is copying always superior?
Hosam Aly
Er, I should clarify - copying is always superior except when you run into a situation in your specific application where it's not. In that case, write your own code to handle it.
Rex M
You still haven't explained why copying is superior. It seems to me that copying would be more expensive than simply pointing to the start and end of a substring.
cdmckay
Read Jon Skeet's answer, which says it much better than mine. I also tried to clarify; I should not have said copying is "always" better, given a sample set of many different sizes of strings and different use cases, I think copying will win.
Rex M
A: 

I guess the key is highlighting the difference between:

  1. immutable string
  2. string that is immutable exists for all eternity

What you are saying would work if strings were #2. However, while strings are immutable, they can be destroyed.

As you can see further they have their own costs:

  1. immutable string - always making copies, as you have mentioned
  2. string that is immutable exists for all eternity - keeping every string created forever

It is easy to see why #1 would be better :)

(But I don't mean that #2 is bad or dumb)

moogs
But my understanding is that immutable strings do not need to make copies (as they are "immutable"). This doesn't necessarily mean that strings are kept forever; just as long as other objects refer to them.
Hosam Aly
That means if we have a long string (say 1000 characters), and we get a substring (the first word, maybe. 10 characters). We need to keep the 1000character string just so we can have the 10character string.
moogs
Well, we could force a copy if newStringLength * 8 < originalLength, or if newStringLength <= 8 (or 16). Wouldn't this somewhat solve the problem?
Hosam Aly
@moogs: Yeah, but if we were still using that 1000 char string anyway, then pointing to a substring of length 10 would be less expensive than copying.
cdmckay
@cdmmkay - that "but" is the kicker. the choice would be what is ok for all of the applications. @hosam aly : it would solve the problem if you can find the factor (8 in your example), that works well for the majority of programs.
moogs
A: 

Trust me, you'd hate it if strings weren't immutable. To give you an example from Java: java.util.Date is mutable and it's a nightmare. Basiclaly it forces anyone who receives a data as a parameter or a function return must defensively copy it.

I can't speak for .Net strings but Java's substring operation actually does refer to the main string, meaning every string in Java has about a 16-20 byte overhead (pointer to the string, start index, end index, length and possibly something else). This has both advantages and disadvantages. It can be a real "gotcha" from a memory starvation point of view. On one project I worked on we had massive memory use. It turned out to be that we were receiving large messages (thousands of characters) and processing them with substrings. Because the substrings kept a reference to the original string, the original string was never cleaned up.

Now you can get around this by using the String constructor but that's not obvious and many people don't know that.

Basically substrings the way you're talking about are a real can of worms. Be careful what you wish for.

cletus
I am not asking for mutable strings. I am asking why .NET strings do not behave like Java strings do. Your project proves a point, but is this the most likely case for string usage?
Hosam Aly
Hosam isn't suggesting that strings shouldn't be immutable. He was basically suggesting the Java way of doing things - that has the disadvantage you've mentioned, but advantages too. I could easily come up with a situation which would make Java use hardly any space and .NET would come to its knees.
Jon Skeet
Unfortunately in these cases String is a concrete class and not an interface, which would allow you to roll your own version that would do that. String is also final/sealed (Java/.Net) so you can't extend it to do that (unfortunately).
cletus
+10  A: 

That's basically the way that Java works. There are a few benefits of the .NET way, IMO:

  • Locality of reference - the data and the length are in the same place
  • Fewer dereferences - the data is at a fixed point within the string object itself; no need to dereference another char array
  • Lack of aliasing when you've got a single character substring of an originally-large string, as mentioned by Renaud.
  • You end up with fewer objects and variables. In the case of a .NET string (assuming no wasted buffer space), the total size (on x86) is approximately 20+2*n bytes. In Java you've got the size of the array (12 + 2*n) bytes and the string itself (24 bytes: object overhead, reference, start and count; it also caches the hash if it's ever calculated it). So for an empty string, the .NET version takes about 20 bytes compared with Java's 36. Of course that's the worst case, and it'll only be that "constant difference" out - but if you use a lot of independent strings that could end up being significant. More for the garbage collector to look at, too.

Of course, the benefits are in terms of requiring less space when the aliasing above doesn't occur.

In the end it will depend on your usage - the compiler and runtime can't predict which usage pattern is more likely in your exact code.

There may also be interop benefits of the current string representation, but I don't know enough about that to say for sure.

EDIT: I'm not sure why your question has received so many somewhat-hostile answers. It's certainly not a "dumb" way of representing a string, and it clearly works. Fears about data loss and complexity are pretty much just FUD in this case, I believe - the Java string implementation is simple and robust. I personally suspect that the .NET way of doing things is more efficient in most programs, and I suspect MS did research to check that, but there will certainly be situations where the "shared" model works better.

Jon Skeet
I guess locality of reference is a feature of Java too, so it doesn't really count as a difference. Fewer dereferences is a good point, while I think the third point can be solved by having a threshold (e.g. copy if less than 8 (16?) chars, or if newStringLength * 8 < originalLength).
Hosam Aly
@Hosam: No, Java doesn't have locality of reference in this case. The length could be in one bit of virtual memory and the first character somewhere completely different. All the *data* have locality of reference, but not necessarily the data and the metadata, as it were.
Jon Skeet
I'm surprised you didn't mention String.Intern; otherwise, a nice answer.
Frank Krueger
@Frank: What would you have me say about String.Intern? It's certainly available for both .NET and Java, so I'm not sure how much this decision affects it.
Jon Skeet
@Jon: So how does does .NET have locality of reference for length in this case? All I see in Reflector is two integers and one char, which I assume is a pointer to the array of characters. Do you mean the array is allocated internally with the object itself?
Hosam Aly
@Hosam: That's exactly right. Strings are the only objects other than arrays which vary in size, because they contain their data directly.
Jon Skeet
I'm not sure if this was clear but .NET strings cache the hash value. Strings employ defensive copying for the reason that modifying any sub string from the original would otherwise modify the original. Memory wise, .NET strings are kept inside a string pool and is why the memory is found elsewhere.
John Leidegren
@John: Where do .NET strings cache their hash values? Looking through Reflector at .NET 3.5 SP1 shows that GetHashCode() recalculates its value every time it is called (which seems a bad idea to me when used in hash sets/maps!).
Hosam Aly
@Hosam: .NET strings don't cache them; Java does. Note that if one is stored as a key in a hashtable, the hashtable will store the hashcode - it won't compute the hash of every string every time!
Jon Skeet
@John: I agree with Hosam - .NET strings *don't* cache hash values. Also, strings are only kept in a string pool *if they're interned*. Otherwise they're reasonably normal objects.
Jon Skeet
@Jon: But the hash code is still calculated every time a string is inserted or used as a key into a map. If a map is frequently accessed, it would have bad performance.
Hosam Aly
@Hosam: It's calculated once for the key each time - if you reuse the same key again and again, that could be bad. But the cost would be 4 bytes for *every* string. I think .NET chose the better path here.
Jon Skeet
A: 

Is what your saying this?

String s = "hello"; String v = s.Substring(1,4); //This should point to the last 4 characters of "hello" instead of creating a copy

If things were like that, can you imagine the internal links that the runtime would need to manage and how unsafe that would be from a data integrity point of view?

Copies of strings are made so that are managed easily by the runtime.

Sir Psycho
Why would it be unsafe? You'd have a char array, and the new string would refer to the same char array as the old string. Very easy - and precisely what Java does.
Jon Skeet
A: 

If the string object would contain a reference to the character data, that would mean that most strings would be two objects instead of one.

Guffa
Yes, the number of objects (and space) is a downside - look at my fourth bullet (just added).
Jon Skeet