views:

1034

answers:

7

I know this question has been done but I have a slightly different twist to it. Several have pointed out that this is premature optimization, which is entirely true if I were asking for practicality's sake and practicality's sake only. My problem is rooted in a practical problem but I'm still curious nonetheless.


I'm creating a bunch of SQL statements to create a script (as in it will be saved to disk) to recreate a database schema (easily many many hundreds of tables, views, etc.). This means my string concatenation is append-only. StringBuilder, according to MSDN, works by keeping an internal buffer (surely a char[]) and copying string characters into it and reallocating the array as necessary.

However, my code has a lot of repeat strings ("CREATE TABLE [", "GO\n", etc.) which means I can take advantage of them being interned but not if I use StringBuilder since they would be copied each time. The only variables are essentially table names and such that already exist as strings in other objects that are already in memory.

So as far as I can tell that after my data is read in and my objects created that hold the schema information then all my string information can be reused by interning, yes?

Assuming that, then wouldn't a List or LinkedList of strings be faster because they retain pointers to interned strings? Then it's only one call to String.Concat() for a single memory allocation of the whole string that is exactly the correct length.

A List would have to reallocate string[] of interned pointers and a linked list would have to create nodes and modify pointers, so they aren't "free" to do but if I'm concatenating many thousands of interned strings then they would seem like they would be more efficient.

Now I suppose I could come up with some heuristic on character counts for each SQL statement & count each type and get a rough idea and pre-set my StringBuilder capacity to avoid reallocating its char[] but I would have to overshoot by a fair margin to reduce the probability of reallocating.

So for this case, which would be fastest to get a single concatenated string:

  • StringBuilder
  • List<string> of interned strings
  • LinkedList<string> of interned strings
  • StringBuilder with a capacity heuristic
  • Something else?

As a separate question (I may not always go to disk) to the above: would a single StreamWriter to an output file be faster yet? Alternatively, use a List or LinkedList then write them to a file from the list instead of first concatenating in memory.

EDIT: As requested, the reference (.NET 3.5) to MSDN. It says: "New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer." That to me means a char[] that is realloced to make it larger (which requires copying old data to the resized array) then appending.

+2  A: 

In my experience, I properly allocated StringBuilder outperforms most everything else for large amounts of string data. It's worth wasting some memory, even, by overshooting your estimate by 20% or 30% in order to prevent reallocation. I don't currently have hard numbers to back it up using my own data, but take a look at this page for more.

However, as Jeff is fond of pointing out, don't prematurely optimize!

EDIT: As @Colin Burnett pointed out, the tests that Jeff conducted don't agree with Brian's tests, but the point of linking Jeff's post was about premature optimization in general. Several commenters on Jeff's page noted issues with his tests.

Bob King
Both those links only compare String vs. StringBuilder. Not sure I can extrapolate performance for the other 4 solutions I suggested.I find it suspicious that Brian (first link) and Jeff (second link) get drastically different results for 100,000 iterations. 189s vs. 0.3 for Brian and 0.606 vs. .588 for Jeff. They aren't even comparable.
Colin Burnett
If you read some of the comments on Jeff's post, they take him to task for not testing the right thing. "Steve H" shows tests in line with Brian's results. But, Jeff's point still stands. Don't worry about it unless you *have* to.
Bob King
Touche, but can you add to your answer about the second question involving files?
Colin Burnett
+1  A: 

Actually StringBuilder uses an instance of String internally. String is in fact mutable within the System assembly, which is why StringBuilder can be build on top of it. You can make StringBuilder a wee bit more effective by assigning a reasonable length when you create the instance. That way you will eliminate/reduce the number of resize operations.

String interning works for strings that can be identified at compile time. Thus if you generate a lot of strings during the execution they will not be interned unless you do so yourself by calling the interning method on string.

Interning will only benefit you if your strings are identical. Almost identical strings doesn't benefit from interning, so "SOMESTRINGA" and "SOMESTRINGB" will be two different strings even if they are interned.

Brian Rasmussen
If I have a table name "Foo" (interned) in my object hierarchy then add it to a List<string> then that interned string has two pointers. They are identical. Likewise for all my "CREATE TABLE [" strings.
Colin Burnett
Reflector on mscorlib shows that String.AppendInPlace does a string.wstrcpy() from a char* into a char*. So StringBuilder is just a char[] wrapper. So it doesn't matter if string is mutable or not because it will still copy char values for each and every append and not take advantage of any interning of strings.
Colin Burnett
Perhaps I misunderstood you. I don't believe you will benefit a lot from interning. If all your strings are in code they will be interned per default. However, if they are not in code they will not be interned. Likewise, if you concatenate them at runtime the new strings will not be interned unless you do so yourself.
Brian Rasmussen
+1  A: 

If all (or most) of the strings being concatenated are interned, then your scheme MIGHT give you a performance boost, as it could potentally use less memory, and could save a few large string copies.

However, whether or not it actually improves perf depends on the volume of data you are processing, because the improvement is in constant factors, not in the order of magnitude of the algorithm.

The only way to really tell is to run your app using both ways and measure the results. However, unless you are under significant memory pressure, and need a way to save bytes, I wouldn't bother and would just use string builder.

Scott Wisniewski
+1  A: 

A StringBuilder doesn't use a char[] to store the data, it uses an internal mutable string. That means that there is no extra step to create the final string as it is when you concatenate a list of strings, the StringBuilder just returns the internal string buffer as a regular string.

The reallocations that the StringBuilder does to increase the capacity means that the data is by average copied an extra 1.33 times. If you can provide a good estimate on the size when you create the StringBuilder you can reduce that even furter.

However, to get a bit of perspective, you should look at what it is that you are trying to optimise. What will take most of the time in your program is to actually write the data to disk, so even if you can optimise your string handling to be twice as fast as using a StringBuilder (which is very unlikely), the overall difference will still only be a few percent.

Guffa
A: 

Have you considered C++ for this? Is there a library class that already builds T/SQL expressions, preferably written in C++.

Slowest thing about strings is malloc. It takes 4KB per string on 32-bit platforms. Consider optimizing number of string objects created.

If you must use C#, I'd recommend something like this:

string varString1 = tableName;
string varString2 = tableName;

StringBuilder sb1 = new StringBuilder("const expression");
sb1.Append(varString1);

StringBuilder sb2 = new StringBuilder("const expression");
sb2.Append(varString2);

string resultingString = sb1.ToString() + sb2.ToString();

I would even go as far as letting the computer evaluate the best path for object instantiation with dependency injection frameworks, if perf is THAT important.

GregC
+3  A: 

If I were implementing something like this, I would never build a StringBuilder (or any other in memory buffer of your script). I would just stream it out to your file instead, and make all strings inline.

Here's an example pseudo code (not syntactically correct or anything):

FileStream f = new FileStream("yourscript.sql");
foreach (Table t in myTables)
{
    f.write("CREATE TABLE [");
    f.write(t.ToString());
    f.write("]");
    ....
}

Then, you'll never need an in memory representation of your script, with all the copying of strings.

Opinions?

paquetp
That answers my second question. Now what if I'm taking this script and directly executing it? A round-trip to the disk would be unnecessary because disk is *way* slower than memory.
Colin Burnett
agreed - "Practically" speaking, how about using a memory file system? To implement the "directly executing it" part, point the file to a memory file system (I googled Ramdisk for windows, got some hits...linux has this for free), then execute that?
paquetp
+3  A: 

For your separate question, Win32 has a WriteFileGather function, which could efficiently write a list of (interned) strings to disk - but it would make a notable difference only when being called asynchronously, as the disk write will overshadow all but extremely large concatenations.

For your main question: unless you are reaching megabytes of script, or tens of thousands of scripts, don't worry.

You can expect StringBuilder to double the allocation size on each reallocation. That would mean growing a buffer from 256 bytes to 1MB is just 12 reallocations - quite good, given that your initial estimate was 3 orders of magnitude off the target.

Purely as an exercise, some estimates: building a buffer of 1MB will sweep roughly 3 MB memory (1MB source, 1MB target, 1MB due to copying during realloation).

A linked list implementation will sweep about 2MB, (and that's ignoring the 8 byte / object overhead per string reference). So you are saving 1 MB memory reads/writes, compared to a typical memory bandwidth of 10Gbit/s and 1MB L2 cache.)

Yes, a list implementation is potentially faster, and the difference would matter if your buffers are an order of magnitude larger.

For the much more common case of small strings, the algorithmic gain is negligible, and easily offset by other factors: the StringBuilder code is likely in the code cache already, and a viable target for microoptimizations. Also, using a string internally means no copy at all if the final string fits the initial buffer.

Using a linked list will also bring down the reallocation problem from O(number of characters) to O(number of segments) - your list of string references faces the same problem as a string of characters!


So, IMO the implementation of StringBuilder is the right choice, optimized for the common case, and degrades mostly for unexpectedly large target buffers. I'd expect a list implementation to degrade for very many small segments first, which is actually the extreme kind of scenario StringBuilder is trying to optimize for.

Still, it would be interesting to see a comparison of the two ideas, and when the list starts to be faster.

peterchen
Thank you for answering both questions.That is an excellent point about string lengths. Longer+interned strings would have a significantly higher advantage over short+interned due to the linked list overhead. The problem I've laid out uses lots of short strings.Perhaps an optimization to StringBuilder would be to do a string[] internally for strings of length longer than X but do string.Concat for strings less than X. These seems to be getting to the likes of a Rope needed for text editors.
Colin Burnett