From a brief look using Reflector, it looks like String.Substring() allocates memory for each substring. Am I correct that this is the case? I thought that wouldn't be necessary since strings are immutable.

My underlying goal was to create an IEnumerable<string> Split(this String, Char) extension method that allocates no additional memory.

+2  A: 

Not possible without poking around inside the internals of .NET's String class. You would have to pass around references to a mutable array and make sure no one modified it.

.NET will create a new string every time you ask it to. The only exception is interned strings, which the compiler creates for literals (and which you can create yourself with String.Intern). These are placed into memory once, and every use then refers to that single copy, for memory and performance reasons.
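
As a rough illustration of the interning behaviour described above, here is a minimal sketch. The class and method names (InternDemo and its members) are made up for this example, and it assumes the usual runtime behaviour of pooling string literals within an assembly.

```csharp
using System;

// Hypothetical demo class; the names are illustrative only.
public static class InternDemo
{
    // Two identical literals are expected to refer to one pooled object.
    public static bool LiteralsShareInstance()
    {
        string a = "hello";
        string b = "hello";
        return object.ReferenceEquals(a, b);
    }

    // A string built at run time is a separate object, even with equal contents.
    public static bool RuntimeStringIsDistinct()
    {
        string literal = "hello";
        string built = new string(new[] { 'h', 'e', 'l', 'l', 'o' });
        return !object.ReferenceEquals(literal, built);
    }

    // String.Intern looks up the pooled instance that has the same contents.
    public static bool InternReturnsPooledInstance()
    {
        string literal = "hello";
        string built = new string(new[] { 'h', 'e', 'l', 'l', 'o' });
        return object.ReferenceEquals(literal, string.Intern(built));
    }
}
```

Note that interning only removes duplicate whole strings. It does nothing for substrings, which still get their own copy of the characters.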

Spence
+1  A: 

Because strings are immutable in .NET, every string operation that results in a new string object will allocate a new block of memory for the string contents.

In theory, it could be possible to reuse the memory when extracting a substring, but that would make garbage collection very complicated: what if the original string is garbage-collected? What would happen to the substring that shares a piece of it?

Of course, nothing prevents the .NET BCL team from changing this behavior in future versions of .NET. It wouldn't have any impact on existing code.

Philippe Leybaert
Java's String actually does it that way: Substrings are merely pointers into the original string. However, that also means that when you take a 200-character substring of a 200-MiB string, the 200-MiB string will always lie around in memory as long as the small substring isn't garbage-collected.
Joey
I think it could impact existing code given that it is designed around this behaviour. If people assume that interning their string will stop it from being duplicated and this behaviour was stopped it could cause working apps to stop with out of memory exceptions.
Spence
How can you design around this behavior? Because of the immutability of strings, there's really no way to create code that would break if the internal implementation of the string class changes.
Philippe Leybaert
.Net string operations indeed create new string objects, but it's not *because* strings are immutable. In fact, it's because strings are immutable that string operations *could* reuse current string objects instead of creating new ones.
Rob Kennedy
+1  A: 

Each string has to have its own string data, given the way that the String class is implemented.

You can make your own SubString structure that uses part of a string:

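using System;
using System.Text;

public struct SubString {

   private readonly string _str;
   private readonly int _offset, _len;

   public SubString(string str, int offset, int len) {
      _str = str;
      _offset = offset;
      _len = len;
   }

   public int Length { get { return _len; } }

   // Index is relative to the start of the segment, not the underlying string.
   public char this[int index] {
      get {
         if (index < 0 || index >= _len) throw new IndexOutOfRangeException();
         return _str[_offset + index];
      }
   }

   // Appends the segment's characters without creating an intermediate string.
   public void WriteToStringBuilder(StringBuilder s) {
      s.Append(_str, _offset, _len);
   }

   // Allocates a new string; call this only when a real string is needed.
   public override string ToString() {
      return _str.Substring(_offset, _len);
   }

}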

You can flesh it out with other methods, such as comparison, that can also be done without extracting the string.
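
Applied to the Split method from the question, the same idea might look like the sketch below. StringSegment and SplitSegments are hypothetical names chosen for this example, and the struct is repeated in reduced form so the snippet compiles on its own. The iterator itself is still a small heap object, so this avoids copying character data rather than avoiding allocation entirely.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names: StringSegment and SplitSegments are illustrative only.
public struct StringSegment
{
    public readonly string Source;
    public readonly int Offset;
    public readonly int Length;

    public StringSegment(string source, int offset, int length)
    {
        Source = source;
        Offset = offset;
        Length = length;
    }

    // Copies the characters only when a real string is requested.
    public override string ToString()
    {
        return Source.Substring(Offset, Length);
    }
}

public static class StringSegmentExtensions
{
    // Validates eagerly, then hands off to the lazy iterator.
    public static IEnumerable<StringSegment> SplitSegments(this string text, char separator)
    {
        if (text == null) throw new ArgumentNullException("text");
        return SplitIterator(text, separator);
    }

    private static IEnumerable<StringSegment> SplitIterator(string text, char separator)
    {
        int start = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == separator)
            {
                // Yield a view over [start, i) instead of a copied substring.
                yield return new StringSegment(text, start, i - start);
                start = i + 1;
            }
        }
        // The final segment runs to the end of the string (possibly empty).
        yield return new StringSegment(text, start, text.Length - start);
    }
}
```

Every segment holds a reference to the original string, so the original stays alive for as long as any segment does.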

Guffa
What about a substring into another substring?
Daniel Earwicker
Yes, it's easy for the SubString structure to create another that is part of itself.
Guffa
+18  A: 

One reason why most languages with immutable strings create new substrings rather than refer into existing strings is that sharing would interfere with garbage collecting those strings later.

What happens if a string is used for its substring, but then the larger string becomes unreachable (except through the substring)? The larger string will be uncollectable, because collecting it would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.

TokenMacGuy
I thought the main reason was in regards to algorithms over the strings. If you can safely assume that a string will never change you can pass references to it safely and it's also inherently threadsafe. I guess that ties in with garbage collection too.
Spence
@Spence - that is a reason for immutability. It's not a reason for avoiding shared buffers between strings. Once you have immutability and GC, you can easily implement shared buffers behind the scenes without breaking thread safety or existing algorithms.
Daniel Earwicker
A: 

Adding to the point that Strings are immutable, you should be aware that the following snippet will generate multiple String instances in memory.

String s1 = "Hello", s2 = ", ", s3 = "World!";
String res = s1 + s2 + s3;

s1+s2 => new string instance (temp1)

temp1 + s3 => new string instance (temp2)

res is a reference to temp2.
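
For anyone who wants to check this, here is a small sketch with the two forms side by side. ConcatDemo and its method names are hypothetical. Both methods return the same text. Whether the operator form really creates an intermediate string depends on how the compiler lowers the expression, which you can see by inspecting the IL generated for WithOperator.

```csharp
using System;

// Hypothetical demo class; the names are illustrative only.
public static class ConcatDemo
{
    // Chained + operators, as in the snippet above.
    public static string WithOperator(string s1, string s2, string s3)
    {
        return s1 + s2 + s3;
    }

    // One explicit call to the three-argument String.Concat overload.
    public static string WithConcat(string s1, string s2, string s3)
    {
        return string.Concat(s1, s2, s3);
    }
}
```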

Babak Naffas
This sounds like something that the compiler folks could optimize.
Ian Boyd
It's not an issue with the compiler, it's a choice made in designing the language. Java has the same rules for Strings. System.Text.StringBuilder is a good class to use that simulates the "mutable" strings.
Babak Naffas
Wrong - s1 + s2 + s3 gets turned into a single call to String.Concat. This is why it is NOT better to use String.Format or StringBuilder (which are both comparatively slow), for up to 4 strings. Look at the IL to see what the compiler does, and use a profiler to find out what performs well in your program. Otherwise you might as well be saying "Look, it is a shoe! He has removed his shoe and this is a sign that others who would follow him should do likewise!" Please post factual answers instead of mythical ones.
Daniel Earwicker
i.e. Ian Boyd's comment is right (except that the compiler folks already took care of it in version 1.)
Daniel Earwicker
As per the C# Language Reference, the + operator on a string is defined as: string operator +(string x, string y); string operator +(string x, object y); string operator +(object x, string y); While the implementation of the operator may use the Concat method, it doesn't change the fact that + is a binary operator; hence, s1 + s2 + s3 would be the equivalent of String.Concat(String.Concat(s1, s2), s3), with a new string object returned for each call to Concat().
Babak Naffas
A: 

One good reason is for memory usage.

string s1 = "astringwith1kchars.....";

string s2 = "astringwith1kchars.....";

s2 doesn't need to create a new string, just a reference.

Fujiy