views:

273

answers:

9

Assuming I have the following strings:

string str1 = "Hello World!";  
string str2 = str1.SubString(6, 5); // "World"

I am hoping that in the above example str2 does not copy "World", but simply ends up being a new string that points to the same memory space only that it starts with an offset of 6 and a length of 5.

In actuality I am dealing with some potentially very long strings and am interested in how this works behind the scenes for performance reasons. I am not familiar enaugh with IL to look into this.

+2  A: 

It references a brand new string.

RandomNoob
+9  A: 

It's a new string.

Strings, in .NET, are always immutable. Whenever you generate a new string via a method, including Substring, it will construct the new string in memory. The only time you share references to the same data in strings in .NET is if you explicitly assign a string variable to another string (in which its copying the reference), or if you work with string constants, which are typically interned. If you know your string is going to share a value with an interned string (constant/literal from your code), you can retrieve the "shared" copy via String.Intern.

This is a good thing, btw - In order to do what you were describing, every string would require a reference (to the string data), as well as an offset + length. Right now, they only require a reference to the string data.

This would dramatically increase the size of strings in general, throughout the framework.

Reed Copsey
What would be nice would be a class StringSegment or something that worked like ArraySegment. It could be implicitly cast to a string and "normal" strings would not have to incur the performance overhead.But in any case, the strings would still be immutable.
Josh Einstein
@Josh: This would be a trivial class to implement. You'd just need to make a constructor that accepts a string, offset, and length, and return original.Substring(...). However, if the original string were to go out of scope, and your class held it still, you might actually cause more perf. problems than you help.
Reed Copsey
does anybody (Eric?) actually know the .net implementation. It has nothing to do with string being immutable, in fact being immutable makes substring sharing simple to implement. YOu dont have to worry about somebody changing the original string
pm100
@pm100: I am not an expert on the internals of the CLR. I seem to recall that they use length-prefixed UTF16 buffers, but I am not certain of that. You should ask this of a CLR expert.
Eric Lippert
+1 for your answer if you complete it with mention of how equality testing works on strings
Hightechrider
@Reed, Thanks for response and agreed. Re: response to @Josh, a new class that references an original string and returns original.Substring() still makes a copy of the original string every time original.Substring() is called. So, we are back to the original question/issue at hand. With very large strings that require repeated large substrings, performance would suffer to a degree in such a special case where repeated copying of large substrings are required. This now of course calls for a different approach to solving the original problem at hand.
Elan
Regarding Eric's comment, not only is it length prefixed, it is also null terminated. An offset-driven implementation would prevent the ability to do interop marshalling without copying the string.
binarycoder
So can anyone explain why, in .NET 4 string bc = "bc"; string abc = "abc"; Console.WriteLine(string.IsInterned(abc.Substring(1))); shows that the SubString constructed string IsInterned? But if you take the first line away it's not. That would imply it's using the SAME STORAGE?? Immutable doesn't necessarily imply a copy in memory it seems.
Hightechrider
@Hightechrider: IsInterned just checks the interning table to see if a copy of that string, by value, exists in the interning. Since you put 'string bc = "bc";', the string "bc" is an interned string value, so String.IsInterned(abc.Substring(1)); is true, since abc.Substring(1) returns a string with the value "bc", which does exist in the interning table.
Reed Copsey
@Reed, thanks, bit of a misnomer then ... IsSomethingLikeItInterned() would be more correct :-)
Hightechrider
The hashcodes are equal and object ref comparison is equal. How else can you tell if they are in fact pointing to the same interned string then?
Richard Hein
@Richard: You have to use String.Intern to get back the interned copy, then do an object comparison. The object references will only be equal (Object.ReferenceEquals) if they are in fact the interned string...
Reed Copsey
Check my code sample, in my answer below. I have updated my Assert to use Object.ReferenceEquals, and it in fact returns true. So they are the same interned string. Try it out.
Richard Hein
@Reed and Hightechrider: Ok, you're right. I have updated my sample, thanks for the clarification guys.
Richard Hein
+1  A: 

SubString creates a new string. So new memory for the new strin will be allocated.

Beku
+2  A: 

It creates a new string but that's a very intelligent question and would not be inconceivable. However I think the performance losses in the majority of cases would far outweigh the memory savings for rare cases.

I recently heard of something called "ropes" which would work the way you suggest but I don't know of any implementation in .NET.

http://en.wikipedia.org/wiki/Rope_(computer_science)

Josh Einstein
+1  A: 

In the CLR strings are immutable meaning they cannot be changed. When manipulating large strings I would suggest looking the string builder class.

Anthony
+1  A: 

as Reed said, string are immutable. if you're dealing with long strings, consider using a StringBuilder, it might improve performance, depending of course on what you're trying to accomplish. if you can add some details to your question, you'll surely get suggestion on the best implementation.

Ami
+2  A: 

Ya know what, I don't know a thing about .NET.

But, I'd like to make an observation.

Most modern String packages have "copy on write" behaviors.

Specifically, that means if you allocate a substring, it will use the existing storage of the parent string, until the string has a need to change, at which point it will copy out the underlying data in to it's own new space for use.

Now, if you have immutable Strings, where the underlying data can not change, there is little reason to NOT do this. There's no way to "write" to an immutable string, so it doesn't even need copy on write functions, just sharing. C++ has mutable strings, so they do copy on write.

For example, Java does this.

Normally this is a good thing. There's little performance impact.

Where you DON'T want this to happen, though, is say in this example:

String big1MBString = readLongHonkinStringFromTheInterTubes();
static String ittyBitty = big1MBString.substring(1, 5);

You now have a "5 character" string that consumes 1MB of memory, because it shares the underlying 1MB string buffer of the large string, but it's manifested as only a 5 character string. Since you retain the reference to the larger String, internally, you'll "never" free up that original space.

Looking at the Mono sources, they do, in fact, allocate new memory. So, perhaps .NET is an exception to what seems to be common practice today. No doubt they have their valid and informed reasons (i.e. I'm not saying .NET did it wrong), just...different from what others are doing.

Will Hartung
The CLR performs string interning as well, see my answer.
Richard Hein
*"Most modern String packages have "copy on write" behaviours"*. What are you basing this claim on? Most have immutable strings, including Java, and hence do not need copy on write. The only notable exception is C++, and it's been a while since the COW experiment was judged to have failed. See: http://stackoverflow.com/questions/1661154/c-stdstring-in-a-multi-threaded-program
Daniel Earwicker
+9  A: 

As others have noted, the CLR makes copies when doing a substring operation.

As you note, it certainly would be possible for a string to be represented as an interior pointer with a length. This makes the substring operation extremely cheap.

There are also ways to make other operations cheap. For example, string concatenation can be made cheap by representing strings as a tree of substrings.

In both cases what is happening here is the result of the operation is not actually the "result" itself, per se, but rather, a cheap object which represents the ability to get at the results when needed.

The attentive reader will have just realized that this is how LINQ works. When we say

var results = from c in customers where c.City == "London" select c.Name;

"results" does not contain the results of the query. This code returns almost immediately; results contains an object which represents the query. Only when the query is iterated does the expensive mechanism of searching the collection spin up. We use the power of a monadic representation of sequence semantics to defer the calculations until later.

The question then becomes "is it a good idea to do the same thing on strings?" and the answer is a resounding "no". I have plenty of painful real-world experiments on this. I once spent a summer rewriting the VBScript compiler's string handling routines to store string concatenations as a tree of string concatenation operations; only when the result is actually being used as a string does the concatenation actually happen. It was disastrous; the additional time and memory needed to keep track of all the string pointers made the 99% case -- someone doing a few simple little string operations to render a web page -- about twice as slow, while massively speeding up the tiny, tiny minority of pages that were written using naive string concatenations.

The vast majority of realistic string operations in .NET programs are extremely fast; they compile down to memory moves that in normal circumstances stay well within the memory blocks that are cached by the processor, and are therefore blazingly fast.

Furthermore, using an "interior pointer" approach for strings complicates the garbage collector considerably; going with such an approach seems to make it likely that the GC would slow down overall, which benefits no one. You have to look at the total cost of the impact of the change, not just its impact on some narrow scenarios.

If you have specific performance needs due to your unusually large data then you should consider writing your own special-purpose string library that uses a "monadic" approach like LINQ does. You can represent your strings internally as arrays of char, and then substring operations simply become copying a reference to the array and changing the start and end positions.

Eric Lippert
Excellent comments and agreed. I can see for the general case that the current implementation is overall best. I do see the GC would also be affected. I do have a very special "narrow" scenario, where I could optimize on performance - I do dispose of all strings including the original when done. I would consider such an alternate special "SharedString" class useful for certain controlled scenarios.
Elan
A: 

Strings are immutable, so it will create a copy of the string. However if the substring matches the exact string of another string that was known at compile time, it will actually use the same memory as that substring. That's string interning.

From MSDN: "The common language runtime automatically maintains a table, called the "intern pool", which contains a single instance of each unique literal string constant declared in a program, as well as any unique instance of String you add programmatically.

The intern pool conserves string storage. If you assign a literal string constant to several variables, each variable is set to reference the same constant in the intern pool instead of referencing several different instances of String that have identical values."

The code sample is informative. You can prevent automatic interning using the [assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)] attribute to prevent automatic string interning. You would also have to use NGEN.exe to compile it to a native image, to prevent interning.

Note that if you use a StringBuilder it avoids interning. It's only for strings that can be matched up with other strings known at compile time.

This is a modified example of the MSDN article, notice that if I pass in part of "abcd" from the Console, it is still interned, even though that str3 is constructed at runtime. However StringBuilder avoids interning.

// Sample for String.IsInterned(String)
using System;
using System.Text;
using System.Runtime.CompilerServices;
using System.Diagnostics;

// In the .NET Framework 2.0 the following attribute declaration allows you to 
// avoid the use of the interning when you use NGEN.exe to compile an assembly 
// to the native image cache.
//[assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)]
class Sample
{
    public static void Main()
    {
        // String str1 is known at compile time, and is automatically interned.
        String str1 = "abcd";
        Console.WriteLine("Type cd and it will be ok, type anything else and Assert will fail.");
        string end = Console.ReadLine(); // Constructed, but still interned.
        string str3 = "ab" + end;

        // Constructed string, str2, is not explicitly or automatically interned.
        String str2 = new StringBuilder().Append("wx").Append("yz").ToString();
        Console.WriteLine();
        Test(1, str1);
        Test(2, str2);
        Test(3, str3);

        // Sanity checks. 
        // Debug.Assert(Object.ReferenceEquals(str3, str1)); // Assertion fails, as expected.
         Debug.Assert(Object.ReferenceEquals(string.Intern(str3), string.Intern(str1))); // Passes
         Debug.Assert(Object.ReferenceEquals(string.Intern(str3), (str1))); // Passes
         Debug.Assert(Object.ReferenceEquals((str3), string.Intern(str1))); // Fails
         Console.ReadKey();
    }

    public static void Test(int sequence, String str)
    {
        Console.Write("{0}) The string, '", sequence);
        String strInterned = String.IsInterned(str);
        if (strInterned == null)
            Console.WriteLine("{0}', is not interned.", str);
        else
            Console.WriteLine("{0}', is interned.", strInterned);
    }
}
Richard Hein
Note true: it won't intern it unless you tell it to. MSDN says "literal constant" and "programatically".
Hightechrider
@Hightechrider: That's not quite right, strings are interned automatically, if they are known at compile time. Using a StringBuilder or something would prevent that. I've updated my answer to clarify that.
Richard Hein
You need to edit: "However if the substring matches the exact string of another string that was KNOWN AT COMPILE TIME, it will actually use the same memory as that substring".
Hightechrider
Thanks ... I've updated it.
Richard Hein
Actually, based on Reid's comment above it's not true. IsInterned will report it as interned if it's the same as a compile-time constant string value but it's not the same string. In current implementations of .NET, Substring will return a new string, always it seems.
Hightechrider
I ran the test using .NET 4.0.
Richard Hein
@Richard: When str3 is run through Test, the IsInterned actually returns a DIFFERENT string. It returns the interned copy of the string, which has changed, and is not equal to the original. Try putting this (contents of next comment) after your last Console.WriteLine call, and running:
Reed Copsey
if (Object.ReferenceEquals(strInterned, str) != true) Console.WriteLine("Replaced string with interned value!");
Reed Copsey
Ok, thanks for the clarification, I tested it out and that's the case, it's evident when you invoke IsInterned on them separately, as I have updated in the Debug.Assert statements.
Richard Hein