ansaurus

Question

String concatenation in C# with interned strings

Answer 1

+2 A:

In my experience, I properly allocated StringBuilder outperforms most everything else for large amounts of string data. It's worth wasting some memory, even, by overshooting your estimate by 20% or 30% in order to prevent reallocation. I don't currently have hard numbers to back it up using my own data, but take a look at this page for more.

However, as Jeff is fond of pointing out, don't prematurely optimize!

EDIT: As @Colin Burnett pointed out, the tests that Jeff conducted don't agree with Brian's tests, but the point of linking Jeff's post was about premature optimization in general. Several commenters on Jeff's page noted issues with his tests.

Bob King 2009-05-01 18:19:55

Both those links only compare String vs. StringBuilder. Not sure I can extrapolate performance for the other 4 solutions I suggested.I find it suspicious that Brian (first link) and Jeff (second link) get drastically different results for 100,000 iterations. 189s vs. 0.3 for Brian and 0.606 vs. .588 for Jeff. They aren't even comparable.

Colin Burnett 2009-05-01 18:33:05

If you read some of the comments on Jeff's post, they take him to task for not testing the right thing. "Steve H" shows tests in line with Brian's results. But, Jeff's point still stands. Don't worry about it unless you *have* to.

Bob King 2009-05-01 18:39:32

Touche, but can you add to your answer about the second question involving files?

Colin Burnett 2009-05-01 18:44:50

Answer 2

+1 A:

Actually StringBuilder uses an instance of String internally. String is in fact mutable within the System assembly, which is why StringBuilder can be build on top of it. You can make StringBuilder a wee bit more effective by assigning a reasonable length when you create the instance. That way you will eliminate/reduce the number of resize operations.

String interning works for strings that can be identified at compile time. Thus if you generate a lot of strings during the execution they will not be interned unless you do so yourself by calling the interning method on string.

Interning will only benefit you if your strings are identical. Almost identical strings doesn't benefit from interning, so "SOMESTRINGA" and "SOMESTRINGB" will be two different strings even if they are interned.

Brian Rasmussen 2009-05-01 18:28:29

If I have a table name "Foo" (interned) in my object hierarchy then add it to a List<string> then that interned string has two pointers. They are identical. Likewise for all my "CREATE TABLE [" strings.

Colin Burnett 2009-05-01 18:34:36

Reflector on mscorlib shows that String.AppendInPlace does a string.wstrcpy() from a char* into a char*. So StringBuilder is just a char[] wrapper. So it doesn't matter if string is mutable or not because it will still copy char values for each and every append and not take advantage of any interning of strings.

Colin Burnett 2009-05-01 20:20:12

Perhaps I misunderstood you. I don't believe you will benefit a lot from interning. If all your strings are in code they will be interned per default. However, if they are not in code they will not be interned. Likewise, if you concatenate them at runtime the new strings will not be interned unless you do so yourself.

Brian Rasmussen 2009-05-01 21:16:17

Answer 3

+1 A:

If all (or most) of the strings being concatenated are interned, then your scheme MIGHT give you a performance boost, as it could potentally use less memory, and could save a few large string copies.

However, whether or not it actually improves perf depends on the volume of data you are processing, because the improvement is in constant factors, not in the order of magnitude of the algorithm.

The only way to really tell is to run your app using both ways and measure the results. However, unless you are under significant memory pressure, and need a way to save bytes, I wouldn't bother and would just use string builder.

Scott Wisniewski 2009-05-01 18:46:33

Answer 4

+1 A:

A StringBuilder doesn't use a char[] to store the data, it uses an internal mutable string. That means that there is no extra step to create the final string as it is when you concatenate a list of strings, the StringBuilder just returns the internal string buffer as a regular string.

The reallocations that the StringBuilder does to increase the capacity means that the data is by average copied an extra 1.33 times. If you can provide a good estimate on the size when you create the StringBuilder you can reduce that even furter.

However, to get a bit of perspective, you should look at what it is that you are trying to optimise. What will take most of the time in your program is to actually write the data to disk, so even if you can optimise your string handling to be twice as fast as using a StringBuilder (which is very unlikely), the overall difference will still only be a few percent.

Guffa 2009-05-01 18:52:52

Answer 5

A:

Have you considered C++ for this? Is there a library class that already builds T/SQL expressions, preferably written in C++.

Slowest thing about strings is malloc. It takes 4KB per string on 32-bit platforms. Consider optimizing number of string objects created.

If you must use C#, I'd recommend something like this:

string varString1 = tableName;
string varString2 = tableName;

StringBuilder sb1 = new StringBuilder("const expression");
sb1.Append(varString1);

StringBuilder sb2 = new StringBuilder("const expression");
sb2.Append(varString2);

string resultingString = sb1.ToString() + sb2.ToString();

I would even go as far as letting the computer evaluate the best path for object instantiation with dependency injection frameworks, if perf is THAT important.

GregC 2009-05-02 17:10:47

Answer 6

+3 A:

If I were implementing something like this, I would never build a StringBuilder (or any other in memory buffer of your script). I would just stream it out to your file instead, and make all strings inline.

Here's an example pseudo code (not syntactically correct or anything):

FileStream f = new FileStream("yourscript.sql");
foreach (Table t in myTables)
{
    f.write("CREATE TABLE [");
    f.write(t.ToString());
    f.write("]");
    ....
}

Then, you'll never need an in memory representation of your script, with all the copying of strings.

Opinions?

paquetp 2009-05-02 20:30:37

That answers my second question. Now what if I'm taking this script and directly executing it? A round-trip to the disk would be unnecessary because disk is *way* slower than memory.

Colin Burnett 2009-05-02 23:24:32

agreed - "Practically" speaking, how about using a memory file system? To implement the "directly executing it" part, point the file to a memory file system (I googled Ramdisk for windows, got some hits...linux has this for free), then execute that?

paquetp 2009-05-03 16:32:51

Answer 7

+3 A:

For your separate question, Win32 has a WriteFileGather function, which could efficiently write a list of (interned) strings to disk - but it would make a notable difference only when being called asynchronously, as the disk write will overshadow all but extremely large concatenations.

For your main question: unless you are reaching megabytes of script, or tens of thousands of scripts, don't worry.

You can expect StringBuilder to double the allocation size on each reallocation. That would mean growing a buffer from 256 bytes to 1MB is just 12 reallocations - quite good, given that your initial estimate was 3 orders of magnitude off the target.

Purely as an exercise, some estimates: building a buffer of 1MB will sweep roughly 3 MB memory (1MB source, 1MB target, 1MB due to copying during realloation).

A linked list implementation will sweep about 2MB, (and that's ignoring the 8 byte / object overhead per string reference). So you are saving 1 MB memory reads/writes, compared to a typical memory bandwidth of 10Gbit/s and 1MB L2 cache.)

Yes, a list implementation is potentially faster, and the difference would matter if your buffers are an order of magnitude larger.

For the much more common case of small strings, the algorithmic gain is negligible, and easily offset by other factors: the StringBuilder code is likely in the code cache already, and a viable target for microoptimizations. Also, using a string internally means no copy at all if the final string fits the initial buffer.

Using a linked list will also bring down the reallocation problem from O(number of characters) to O(number of segments) - your list of string references faces the same problem as a string of characters!

So, IMO the implementation of StringBuilder is the right choice, optimized for the common case, and degrades mostly for unexpectedly large target buffers. I'd expect a list implementation to degrade for very many small segments first, which is actually the extreme kind of scenario StringBuilder is trying to optimize for.

Still, it would be interesting to see a comparison of the two ideas, and when the list starts to be faster.

peterchen 2009-05-02 23:47:14

Thank you for answering both questions.That is an excellent point about string lengths. Longer+interned strings would have a significantly higher advantage over short+interned due to the linked list overhead. The problem I've laid out uses lots of short strings.Perhaps an optimization to StringBuilder would be to do a string[] internally for strings of length longer than X but do string.Concat for strings less than X. These seems to be getting to the likes of a Rope needed for text editors.

Colin Burnett 2009-05-03 00:09:49

ansaurus

tags:

views:

answers:

String concatenation in C# with interned strings

related questions