views:

445

answers:

4

Hi,

as part of the process of populating a search engine, I populate a Berekely-DB value-store as well. This process is repeated each night and at the moment +/- 60% of the total running time each night is caused by creating the values to be inserted into the value-store ( so excluding the actual insertion into Berekely-DB and the time caused by the Berekely client

These values are created by having a stringbuilder assigned to each key, and appending on average about 1000 strings to such a stringbuilder. The resulting values on average are about 10k. Im wondering if this can be done more effciently, given that: - the (on average) 1000 strings appended to each of the Stringbuilders are of fixed length: i.e: each String has the same length and this length is known up front) - all strings are appended to the end.

Would for example swapping out the stringbuilder for a char[] or characterStream / writer be more performant? That way I could keep and index where to write to in the char[].

Thanks, Geert-Jan

+5  A: 

You could create your stringbuilders with a higher initial capacity to reduce the amount of resizing, i.e. there's a constructor that allows you to say

int SIZE=10000;
StringBuilder b = new StringBuilder(SIZE);

I would expect that manually juggling char[] and indexes wouldn't improve much on this, as (I assume) that's what StringBuilder is already doing for you.

Steve B.
This is all you need. You won´t get that much of a speedup if you do it with char arrays, because that´s exactly what StringBuilder does. +´ed
glebm
yes setting initial size definitely helped as well. Thanks
Geert-Jan
A: 

Where are these 1000 Strings coming from? I have trouble believing that the creation time for those 1000 objects doesn't completely dwarf the time needed for amortized expansion of your StringBuilder.

Carl Smotricz
given 1000 strings, each call below constitutes 1/3th of these strings. 1. agg.sb.append(sDur); 2. agg.sb.append(sermap.get(key));3. agg.sb.append(doc.sbs[durIndexDCToZero]);1. created each innerloop based on a integer2. lookup in a static map3. lookup in a map created once per document. I measured the difference by only commenting-out these calls above, so the creation-time of the inserted strings themselves are not measured in the difference.
Geert-Jan
Fair enough, thanks for the info! But look here: You have 333 object creations for your `sDur`s, whereas it only takes 10 object creations and copy operations to expand a StringBuilder from 16 to over 10,000 bytes. So unless I'm missing something, sDur creation alone should take about 30x as much time as string building. In any case, I second Steve B's suggestion, it certainly won't hurt!
Carl Smotricz
thanks for your reasoning. I think it's a bit odd myself. I'll investigate further.
Geert-Jan
A: 

You should give ropes a try. The site is skimpy on details, but there's a great article here with a better description and some good benchmarks comparing append performance.

I haven't actually used the ropes package, haven't had a good enough excuse to. Looks promising, though.

Edit: Additional benchmark info

I downloaded the PerformanceTest class from the ropes article, and added tests for StringBuilder in addition to StringBuffer. The performance improvement of StringBuilder seems negligible.

I downloaded the test code from the ropes article and changed the test to include StringBuilder and StringBuffer.

Append plan length: 260
[StringBuilder] Average=     117,146,000 ns Median=     114,717,000ns
[StringBuffer]  Average=     117,624,400 ns Median=     115,552,000ns
[Rope]          Average=         484,600 ns Median=         483,000ns

Append plan length: 300
[StringBuilder] Average=     178,329,000 ns Median=     178,009,000ns
[StringBuffer]  Average=     217,147,800 ns Median=     216,819,000ns
[Rope]          Average=         252,800 ns Median=         253,000ns

Append plan length: 500
[StringBuilder] Average=     221,356,200 ns Median=     214,435,000ns
[StringBuffer]  Average=     227,432,200 ns Median=     219,650,000ns
[Rope]          Average=         510,000 ns Median=         507,000ns

The difference between StringBuilder and StringBuffer isn't that great. For the task at hand, Ropes seem like a clear win here.

Sam Barnum
Ropes performs well compared with String concatenation, which is no surprise at all; and better than StringBuffer, whose operations are synchronized. The IBM page doesn't show a comparison with StringBuilder, which was created specifically to overcome this problem.
Carl Smotricz
According to the article, the performance gain of Ropes over StringBuffer was due mostly to the overhead of resizing/copying the backing char[] array in the StringBuffer as it grows. This same overhead applies to StringBuilder.
Sam Barnum
@Carl Smotricz: If you follow Steve B's suggestion of setting a high initial capacity for your StringBuilders, I think StringBuilder would likely be faster than Ropes. If you have many buffer objects and aren't sure how large they may grow, Ropes sound like a perfect candidate for a job like this.
Sam Barnum
There's another IBM Developer article on Ropes, by Ropes' author. It's dated 2008 but *still* doesn't include a performance comparison with StringBuilder. I wonder if this was intentional? I would have been interested to see the result but was unable to leave a comment to that effect.
Carl Smotricz
A: 

Revision III:

If string concatenation in StringBuilders is taking inordinately long, perhaps your memory is very full. So our goal is to achieve string concatenation without chewing up a lot of memory. Hopefully the savings in CPU time will follow automatically.

My plan went like this: Instead of concatenating those substrings into a long StringBuilder, you could build a List of references to those (pre-existing) Strings. The list of references should be shorter than the sum of the substrings and thus consume less memory.

Only when it becomes time to store that big String do we concatenate the parts in one big StringBuilder, pull out the String, store the String, throw away the reference to the String, clear the StringBuilder, repeat. I felt this was a brilliant solution!

However, from this article from 2002, a String reference in an array, probably likewise in an ArrayList, takes a whopping 8 bytes! A recent StackOverflow post confirmed that this is still so. Thus, a list of references to 10-byte Strings saves only 2 bytes per String. Thus, I present my "solution" as a possibility for similar problems, but I don't see this particular problem being able to benefit from it.

Carl Smotricz
just noticed this, and you absolutely nailed my prolem.. GC all over the place was causing the problem...I actually already implemented the exact same thing you suggested, which in this case saves a lot, because the maps that hold the substrings already existed (and should as part of a different routine) Now only when it's time I'm building the String. Thanks
Geert-Jan
I put a lot of thought into this problem and I'm glad it was not for naught. (wipes a tear from the corner of eye) I love happy endings!
Carl Smotricz