views:

502

answers:

8

OK, so concatenations of constant string expressions are optimized by the compiler into one string. Great.

Now with string concatenation of strings only known at runtime, why does the compiler not optimize string concatenation in loops and concatenations of say more than 10 strings to use StringBuilder.Append instead? I mean, it's possible, right? Instantiate a StringBuilder and take each concatenation and turn it into an Append() call.

Is there any reason why this should or could not be optimized? What am I missing?

+6  A: 

Two reasons:

  • You can't programmatically identify places where it would be strictly higher performing.
  • The "optimization" will slow things down if performed incorrectly.

You can suggest people use the correct calls for their application, but at some point it's the developer's responsibility to get it right.

Edit: Regarding the cutoff, we have another couple of problems:

  • The only way to know for sure that the cutoff is reached is complicated flow analysis. The number of places where this would be able to find sections that could be converted is extremely small.
  • Flow analysis is expensive. If you do it at runtime, the whole program will run slower for the rare chance that one piece of poorly written code will be faster. If you do it at compile time, it's not an error according to language syntax but you can issue a warning - and that's exactly what FXCop does (a slow but available flow analysis tool). Just think if FXCop always had to run with the compiler; so many hours people would be just waiting to run code. And if it was at runtime, well welcome to JVM startup times...
280Z28
Well, you could have a simple cut-off as I suggested, say 10 concatenations for one result string, and you're likely better off using StringBuilder. The cut-off threshold could deliberately put higher, so you have a guarantee that it will yield performance benefits.
Wim Hollebrandse
@Wim Hollebrandse: See my edit. :)
280Z28
+2  A: 

I believe it would be a little too complex for the complier writers. And when you are referencing the intermediate strings inside the loops besides the concatenation (for example passing them to some other methods or so), this optimization would not be possible.

treaschf
Precisely which of the compiler writers do you think this would be too complex for? Me, Chris, Sam or Ian?
Eric Lippert
You are right on that one. :)
treaschf
+10  A: 

For a single concatenation of multiple strings (e.g. a + b + c + d + e + f + g + h + i + j) you really want to be using String.Concat IMO. It has the overhead of building an array for each call, but it has the benefit that the method can work out the exact length of the resulting string before it needs to allocate any memory. StringBuilder.Append(a).Append(b)... only gives a single value at a time, so the builder doesn't know how much memory to allocate.

As for doing it in loops - at that point you've added a new local variable, and you've got to add code to write back to the string variable at exactly the right time (calling StringBuilder.ToString()). What happens when you're running in the debugger? Wouldn't it be pretty confusing not to see the value building up, only becoming visible at the end of the loop? Oh, and of course you've got to perform appropriate validation that the value isn't used at any point before the end of the loop...

Jon Skeet
The C# compiler actually compiles `a+b+c+d+e` to `string.Concat(a, b, c, d, e)`. There's actually a bug in the version of `Concat` with 4 parameters which led to some crazy issues for me: https://connect.microsoft.com/VisualStudio/feedback/details/361125/
280Z28
My point about using string.Concat was precisely because of that translation - but I wasn't aware of the bug...
Jon Skeet
+20  A: 

The definite answer will have to come from the compiler design team. But let me take a stab here...

If your question is, why the compiler doesn't turn this:

string s = "";
for( int i = 0; i < 100; i ++ )
    s = string.Concat( s, i.ToString() );

into this:

StringBuilder sb = new StringBuilder();
for( int i = 0; i < 100; i++ )
    sb.Append( i.ToString() );
string s = sb.ToString();

The most likely answer is that this is not an optimization. This is a rewrite of the code that introduces new constructs based on knowledge and intent that the developer has - not the compiler.

This type of change would require the compiler to have more knowledge of the BCL than is appropriate. What if tomorrow, some more optimal string assembly service becomes available? Should the compiler use that?

What if your loop conditions were more complicated, should the compiler attempt to perform some static analysis to decide whether the result of such a rewrite would still be functionally equivalent? In many ways, this would be like solving the halting problem.

Finally, I'm not sure that in all cases this would result in faster performing code. There is a cost to instantiating a StringBuilder and resizing its internal buffer as text is appended. In fact, the cost of appending is strongly tied to the size of the string being concatenated, how many there are, what memory pressure looks like. These are things that the compiler cannot predict in advance.

It's your job as a developer to write well-performing code. The compiler can only help by making certain safe, invariant-preserving optimizations. Not rewriting your code for you.

LBushkin
All very good points, especially the one about tie-in to the BCL.
Wim Hollebrandse
You make it feel like this was an impossible think to achieve. In fact, the latest Java compilers have this "optimization". It's even referenced in the specs: http://java.sun.com/docs/books/jls/third_edition/html/expressions.html#15.18.1.2
bruno conde
I think there's a difference between something being *possible*, and something being *worthwhile*. In some narrow uses, it may be possible to substitute `StringBuilder` (or similar approaches) to optimize the concatenation of strings. But in practice, since it's not always possible and in some cases may introduce negative consequences - it's probably not a good idea as a general optimization. As an aside, I am curious if anyone knows whether any Java compiler **actually performs** the string optimization that the link above describes?
LBushkin
@LBushkin, yes it does. I already tried this a while ago and the compiler generated the SringBuilder version clearly seen in the Byte Code. I was amazed.
bruno conde
-1 for *"It's your job as a developer to write well-performing code,"* that responsibility should *always* be pushed onto the compiler when possible.
BlueRaja - Danny Pflughoeft
@BlueRaja: Many performance failures are the result of poor algorithm and data structure selection. Do you expect the compiler to rewrite your code in this manner as well? As I said in my answer, where the compiler can make *safe, invariant-preserving* optimizations it should. Beyond that, you as the developer have the responsibility to: 1) identify where performance matters, 2) design your system to behave optimally for those cases, and 3) profile and refine the implementation as performance issues arise.
LBushkin
I'm a bit in the middle there guys. Yes, *I* and many others know when to use StringBuilder, but not everyone does. Don't forget there's more hobbyists around than SO elitists. In that sense, I tend to agree with @BlueRaja.
Wim Hollebrandse
@BluRaja, I'd -5 you for that if I could. The compiler is responsible for *platform-specific micro-optimizations*. The developer is responsible for picking the correct algorithm. In this case, there's a clear memory / fragmentation / time tradeoff between String.Concat and StringBuilder - only the developer can make the call based on the data to be expected.
peterchen
A: 

Probably because it's complicated to match such a pattern in the code, and in case the compiler can't do the match for some reason, the performance of the code is suddenly terrible. Optimising code like that would encourage writing code like that, which would even further increase the negative impact in the cases where the compiler can no longer do the optimisation.

For concatenating a known set of strings, StringBuilder is not faster than String.Concat.

Guffa
A: 

A String is an immutable type, hence using concatenating the string is slower than using StringBuilder.Append.

Edit: To clarify my point a bit more, when you talk about why is String.Concat not optimized to StringBuilder.Append, a StringBuilder class is completely different semantics to the immutable type of String. Why should you expect the compiler to optimize that as they are clearly two different things? Furthermore, a StringBuilder is a mutable type that can change it's length dynamically, why should a compiler optimize a immutable type to a mutable type. That is the design and semantics engrained into the ECMA spec for the .NET framework, regardless of the language.

It's a bit like asking the compiler (and perhaps expecting too much) to compile a char and optimize it into a int because the int works on 32bits instead of 8bits and would be deemed faster!

Hope this helps, Best regards, Tom.

tommieb75
-1, I'm clearly aware of that, and that does not answer the question.
Wim Hollebrandse
-1: It is also wrong. `String.Concaternate` will allocate a single output of just the right length for all the strings, and then copy each one once. This is quicker than multiple calls to `StringBuilder.Append` with multiple allocations and copies.
Richard
@Richard: It depends on the size and frequency - if it's for short strings, then yes your statement would be true. But if you're talking about long strings, its the reverse, i.e. it would be slower!
tommieb75
@tommieb75: Appending lots of strings with `StringBUilder` ==> lots of reallocations (each requiring copying everything). Remember the `String.Concatenate` is a single method call, it sees all the component strings at once (this is why `StringBuilder` is better in a loop).
Richard
+1  A: 

Because it's the compiler's job to generate semantically-correct code. Changing invocations of String.Concat to invocations of StringBuilder.Append would be changing the semantics of the code.

Jason
As long as the resulting string is correct, the semantics are not affected; string.concat does not have observable side effects.
Eric Lippert
@Eric Lippert: Yes, the effect is the same but the emitted IL is different. Invoking `String.Concat` would be in accordance with the spec concerning function member invocation but replacing it with a call to `StringBuilder.Append` would not. What I mean is "I told the compiler to invoke some method and by the spec I expect it to emit code to invoke that method; I expect that it should not invoke another method." Can the compiler really do that? I know in some cases it will optimize and emit semantically-correct but different code but I'm surprised to learn that's legal for method invocations.
Jason
Ah, I see your point. If we did decide to do this optimization, we'd of course have to note it in the spec, as the spec does say that + on strings is a syntactic sugar for string.concat.
Eric Lippert
+9  A: 

LBuskin's answer is excellent; I have just a couple of things to add.

First, JScript.NET does do this optimization. JScript is frequently used by less-experienced programmers for tasks that involve construction of large strings in loops, like building up JSON objects, HTML data, and so on.

Since those programmers might not be aware of the n-squared cost of naive string allocation, might not be aware of the existence of string builders, and frequently write code using this pattern, we felt that it was reasonable to put this optimization into JScript.NET.

C# programmers tend to be more aware of the underlying costs of the code they write and more aware of the existence of off-the-shelf parts like StringBuilder, so they need this optimization less. And more fundamentally, the design philosophy of C# is that it is a "do what I said" language with a minimum of "magic"; JScript is a "do what I mean" language that does its best to figure out how to best serve you, even if that means sometimes guessing wrong. Both philosophies are valid and useful.

Sometimes it does "go the other way". Compare this choice to the choice we make for switches on strings. Switches on strings are actually compiled as a creation of a dictionary containing the strings, rather than as a series of string comparisons. That optimization could be bad; it might be faster to simply do the string comparisons. But here we make a guess that you "meant" the switch to be a table lookup rather than a series of "if" statements -- if you'd meant the series of if statements, you could easily write that yourself.

Eric Lippert