1572 views · 7 answers

The following C# code takes 5 minutes to run:

int i = 1;
string fraction = "";
while (fraction.Length < 1000000)
{
    fraction += i.ToString();
    i++;
}

"Optimising it" like this causes it to run in 1.5 seconds:

int i = 1;
string fraction = "";
while (fraction.Length < 1000000)
{
    // concatenating strings is much faster for small strings
    string tmp = "";
    for (int j = 0; j < 1000; j++)
    {
        tmp += i.ToString();
        i++;
    }
    fraction += tmp;
}

EDIT: Some people suggested using StringBuilder, which is also an excellent suggestion; that version comes out at 0.06s:

int i = 1;
StringBuilder fraction = new StringBuilder();
while (fraction.Length < 1000000)
{
    fraction.Append(i);
    i++;
}

Playing around to find the optimum value of j is a topic for another time, but why exactly does this non-obvious optimisation work so well? On a related note, I've heard it said that you should never use the + operator with strings and should prefer string.Format() instead. Is this true?

+8  A: 

Use StringBuilder for concatenating more than (approximately) 5 strings (results may vary slightly). Also, give the StringBuilder's constructor a hint on the expected maximum size.

[Update]: just commenting on your edit to the question. You can also increase StringBuilder's performance if you have an approximate (or exact) idea of the final size of the concatenated strings, because this will reduce the number of memory allocations it has to perform:

// e.g. initialise to 10 million characters
StringBuilder fraction = new StringBuilder(10000000);
Mitch Wheat
+3  A: 
Konrad Rudolph
the outer loop is on the length of the string, not i
BCS
The outer loop actually stops later in the second code snippet because of where its length check happens, so it generates a LONGER sequence in far less time.
Matthew Scharley
"Like all absolute statements it's nonsense." Hey, +1 for use of irony!
Joel Mueller
+3  A: 

I can't run tests right now, but try using StringBuilder:

int i = 1;
StringBuilder fraction = new StringBuilder();
while (fraction.Length < 1000000)
{
    fraction.Append(i);
    i++;
}
return fraction.ToString();
Zote
+6  A: 

You will probably see that the first 1000 chars take almost no time compared to the last 1000 chars.

I would assume that the time-consuming part is the copying of the large string into a new memory area every time you add a char; that copying is the hard work for your computer.

Your optimization can easily be compared to what you usually do with streams: you use a buffer. Larger chunks will usually give better performance until you hit the critical size where it no longer makes any difference, and buffering starts to be a downside when you're handling small amounts of data.

If, however, you had defined a char array of the appropriate size from the beginning, it would probably be blazing fast, because then nothing would have to be copied over and over again.
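A minimal sketch of that char-array idea (the names are mine, and the +10 slack for the final number's digits is an assumption, not part of the original suggestion):

```csharp
using System;

public class CharBufferDemo
{
    public static string BuildDigits(int targetLength)
    {
        // Preallocate once; +10 leaves room for the digits of the last int appended.
        char[] buffer = new char[targetLength + 10];
        int pos = 0;
        int i = 1;
        while (pos < targetLength)
        {
            string digits = i.ToString();          // still one small allocation per number
            digits.CopyTo(0, buffer, pos, digits.Length); // copy straight into the buffer
            pos += digits.Length;
            i++;
        }
        return new string(buffer, 0, pos);          // single final string allocation
    }

    public static void Main()
    {
        string s = BuildDigits(1000000);
        Console.WriteLine(s.Length >= 1000000);     // True
        Console.WriteLine(s.Substring(0, 9));       // 123456789
    }
}
```

The large intermediate copies disappear entirely; only the final `new string(...)` copies the whole buffer once.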

jishi
This involves a lot more code doing conversions between strings and character arrays, but I agree with your analysis of the situation. I just found it interesting exactly how much of a hindrance this can be.
Matthew Scharley
+10  A: 

I don't get your results at all. On my box StringBuilder wins hands down. Could you post your full test program? Here's mine, with three variants - your string concatenation optimisation, the "simple" StringBuilder one, and StringBuilder with an initial capacity. I've increased the limit as it was going too fast on my box to be usefully measurable.

using System;
using System.Diagnostics;
using System.Text;

public class Test
{
    const int Limit = 4000000;

    static void Main()
    {
        Time(Concatenation, "Concat");
        Time(SimpleStringBuilder, "StringBuilder as in post");
        Time(SimpleStringBuilderNoToString, "StringBuilder calling Append(i)");
        Time(CapacityStringBuilder, "StringBuilder with appropriate capacity");
    }

    static void Time(Action action, string name)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        Console.WriteLine("{0}: {1}ms", name, sw.ElapsedMilliseconds);
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }

    static void Concatenation()
    {
        int i = 1;
        string fraction = "";
        while (fraction.Length < Limit)
        {
            // concatenating strings is much faster for small strings
            string tmp = "";
            for (int j = 0; j < 1000; j++)
            {
                tmp += i.ToString();
                i++;
            }
            fraction += tmp;            
        }
    }

    static void SimpleStringBuilder()
    {
        int i = 1;
        StringBuilder fraction = new StringBuilder();
        while (fraction.Length < Limit)
        {
            fraction.Append(i.ToString());
            i++;
        }
    }

    static void SimpleStringBuilderNoToString()
    {
        int i = 1;
        StringBuilder fraction = new StringBuilder();
        while (fraction.Length < Limit)
        {
            fraction.Append(i);
            i++;
        }
    }

    static void CapacityStringBuilder()
    {
        int i = 1;
        StringBuilder fraction = new StringBuilder(Limit + 10);
        while (fraction.Length < Limit)
        {
            fraction.Append(i);
            i++;
        }
    }
}

And the results:

Concat: 5879ms
StringBuilder as in post: 206ms
StringBuilder calling Append(i): 196ms
StringBuilder with appropriate capacity: 184ms

The reason your concatenation is faster than the very first solution is simple though - you're doing several "cheap" concatenations (where relatively little data is being copied each time) and relatively few "large" concatenations (of the whole string so far). In the original, every step would copy all of the data obtained so far, which is obviously more expensive.

Jon Skeet
I'm using DateTime.Now for timing (I'm stuck with 2.0), but other than that, there's only a few multiplications at the end from digits pulled from the string, and that's a constant time across all the runs.
Matthew Scharley
Actually, no, now that I think of it, you're right, I had Console prints... Fixing up the times in the question now.
Matthew Scharley
Same here. For mine the original is still running, 1st one was about 700ms and the last one (StringBuilder) 63 ms.
Quibblesome
Stopwatch was introduced in .NET 2.0, so you're not stuck with DateTime.Now :)
Jon Skeet
:). I forgot to start the stop watch in the second run! ^^ corrected.
Quibblesome
You're right, StringBuilder wins hands down. It was just the Console.Write's that were slowing it down (there was only one in the external loop, so there were far fewer on the tmp string code.)
Matthew Scharley
Good explanation. And I too get similar results. Why did you pick Limit + 10 as your initial StringBuilder capacity though? The needed capacity is much larger isn't it?
C. Dragon 76
You're not adding single characters to the string, you could be adding a large number to the string. Adding a 10 character buffer to the limit makes sure there's enough room for any number that can fit into an int (some 2 billion)
Matthew Scharley
Why would the required capacity be much larger? It's only got to hold up to Limit characters + whatever is required to push it past the limit.
Jon Skeet
C. Dragon 76
@cdragon: Yes, it appends more than one character - which is why I've got an extra 10 as "spare". i is still going to be less than 10 characters long, so it will never overflow. As a check, I've just rerun the test and printed out the final capacity - it's the same as the start.
Jon Skeet
+1  A: 

Answer to the modified question ("why does this non-obvious optimization work so well" and "is it true you shouldn't use the + operator on strings"):

I'm not sure which non-obvious optimization you are talking about. But the answer to the second question, I think, covers all of the bases.

The way strings work in C# is that they are immutable: once allocated, their contents cannot be changed. This means that any time you try to change the length of a string, an entirely new string is created and the old contents are copied into it. This is obviously a slow process. When you use String.Format, it internally uses a StringBuilder to create the string.
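The immutability described here is easy to observe directly. A small illustration (variable names are mine):

```csharp
using System;

public class ImmutabilityDemo
{
    public static void Main()
    {
        string a = "abc";
        string b = a;           // b and a refer to the same string object
        a += "def";             // allocates a brand-new string "abcdef"

        Console.WriteLine(b);                              // abc (old object untouched)
        Console.WriteLine(object.ReferenceEquals(a, b));   // False (a points elsewhere now)

        // string.Format builds its result via a StringBuilder internally,
        // so one formatted call replaces a chain of concatenations:
        Console.WriteLine(string.Format("{0}-{1}-{2}", 1, 2, 3)); // 1-2-3
    }
}
```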

StringBuilders work by using a memory buffer which is more intelligently allocated than fixed-length strings, and thus performs significantly better in most situations. I'm not sure on the details of StringBuilder internally, so you'll have to ask a new question for that. I can speculate it either doesn't reallocate the old portions of the string (instead creating a linked list internally and only actually allocating the final output when needed by ToString) or it reallocates with exponential growth (when it runs out of memory, it allocates twice as much the next time, thus for a 2GB string it would only need to reallocate about 30 times).
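Without knowing the internals, one way to see that the buffer grows in large jumps rather than on every append is to record Capacity whenever it changes. This sketch assumes nothing beyond the public StringBuilder API, and the exact capacity values depend on the runtime; the point is only that growth steps are rare compared to appends:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public class GrowthDemo
{
    public static void Main()
    {
        var sb = new StringBuilder();
        var capacities = new List<int> { sb.Capacity };

        for (int i = 0; i < 100000; i++)
        {
            sb.Append('x');
            // Record the capacity only when a reallocation/growth actually happened.
            if (sb.Capacity != capacities[capacities.Count - 1])
                capacities.Add(sb.Capacity);
        }

        // 100000 appends, but only a handful of growth steps:
        Console.WriteLine(capacities.Count);
        Console.WriteLine(string.Join(", ", capacities));
    }
}
```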

Your example with the nested loops reduces this cost: it grows a small string up to about 1000 characters, then tacks those 1000 characters onto the larger string in one large operation. As the large string gets really long, the copy that results from creating a new string starts to take a long time. When you reduce the number of times this happens (by resizing a smaller string more often instead), you increase the speed. Of course, StringBuilder is even smarter about allocating memory, and is thus much faster.

SoapBox
+1  A: 

Adding a character to a string can have two consequences:

  • if there is still space for the character, it is just added at the end (as a commenter noticed, this cannot happen with C# strings, as they are immutable);
  • if there is no space at the end, a new block of memory is allocated for the new string, the contents of the old string are copied there, and the character is added.

To analyse your code, it is simpler to consider adding a single character 1000000 times. Your exact example is a bit more complex to explain, because for higher values of i you add more characters at a time.

Then, in the situation where no extra space is reserved, the first example has to do 1000000 allocations and copies of an average of 0.5 * 1000000 characters. The second one has to do 1000 allocations and copies of an average of 0.5 * 1000000 characters, plus 1000000 allocations and copies of an average of 0.5 * 1000 characters. If copying is linear in the size of the copy and allocation is free, the first situation takes 500 000 000 000 units of time and the second one 500 000 000 + 500 000 000 units.
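That arithmetic can be checked with a quick back-of-the-envelope model. This is only a sketch: it counts copied characters and ignores allocation cost entirely, as the paragraph above assumes:

```csharp
using System;

public class CopyCostModel
{
    public static void Main()
    {
        const long n = 1000000;   // final length
        const long chunk = 1000;  // inner-loop chunk size

        // One char at a time: the k-th append copies about k characters,
        // so the total is roughly n * n / 2.
        long naive = n * n / 2;

        // Chunked: 1000 chunk-appends copying an average of n/2 characters,
        // plus 1000000 single-char appends into a string of average length chunk/2.
        long chunked = (n / chunk) * (n / 2) + n * (chunk / 2);

        Console.WriteLine(naive);           // 500000000000
        Console.WriteLine(chunked);         // 1000000000
        Console.WriteLine(naive / chunked); // 500  (times fewer characters copied)
    }
}
```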

Stephan Eggermont
In C#, strings are immutable. It is not changing the strings in place. Every time a character is added, and entirely new string is created.
Neil Whitaker
Hmm, I knew that... Just ignore the first consequence
Stephan Eggermont