views:

834

answers:

6

A while back a post by Jon Skeet planted the idea in my head of building a CompiledFormatter class, for using in a loop instead of String.Format().

The idea is that the portion of a call to String.Format() spent parsing the format string is overhead. We should be able to improve performance by moving that code outside of the loop. The trick, of course, is that the new code has to exactly match the String.Format() behavior.

This week I finally did it. I actually went through using the .Net framework source provided by Microsoft to do a direct adaption of their parser (it turns out String.Format() actually farms the work to StringBuilder.AppendFormat()). The code I came up with works, in that my results are accurate within my (admittedly limited) test data.

Unfortunately, I still have one problem: performance. In my initial tests the performance of my code closely matches that of the normal String.Format(). There's no improvement at all: it's even consistently a few milliseconds slower. At least it's still in the same order (ie: the amount slower doesn't increase; it stays within a few milliseconds even as the test set grows), but I was hoping for something better.

It's possible that the internal calls to StringBuilder.Append() are what actually drive the performance, but I'd like to see if the smart people here can improve on things any.

Here is the relevant portion:

private class FormatItem
{
    public int index; //index of item in the argument list. -1 means it's a literal from the original format string
    public char[] value; //literal data from original format string
    public string format; //simple format to use with supplied argument (ie: {0:X} for Hex

    // for fixed-width format (examples below) 
    public int width;    // {0,7} means it should be at least 7 characters   
    public bool justify; // {0,-7} would use opposite alignment
}

//this data is all populated by the constructor
private List<FormatItem> parts = new List<FormatItem>(); 
private int baseSize = 0;
private string format;
private IFormatProvider formatProvider = null;
private ICustomFormatter customFormatter = null;

// the code in here very closely matches the code in the String.Format/StringBuilder.AppendFormat methods.  
// Could it be faster?
public String Format(params Object[] args)
{
    if (format == null || args == null)
        throw new ArgumentNullException((format == null) ? "format" : "args");

    var sb = new StringBuilder(baseSize);
    foreach (FormatItem fi in parts)
    {
        if (fi.index < 0)
            sb.Append(fi.value);
        else
        {
            //if (fi.index >= args.Length) throw new FormatException(Environment.GetResourceString("Format_IndexOutOfRange"));
            if (fi.index >= args.Length) throw new FormatException("Format_IndexOutOfRange");

            object arg = args[fi.index];
            string s = null;
            if (customFormatter != null)
            {
                s = customFormatter.Format(fi.format, arg, formatProvider);
            }

            if (s == null)
            {
                if (arg is IFormattable)
                {
                    s = ((IFormattable)arg).ToString(fi.format, formatProvider);
                }
                else if (arg != null)
                {
                    s = arg.ToString();
                }
            }

            if (s == null) s = String.Empty;
            int pad = fi.width - s.Length;
            if (!fi.justify && pad > 0) sb.Append(' ', pad);
            sb.Append(s);
            if (fi.justify && pad > 0) sb.Append(' ', pad);
        }
    }
    return sb.ToString();
}

//alternate implementation (for comparative testing)
// my own test call String.Format() separately: I don't use this.  But it's useful to see
// how my format method fits.
public string OriginalFormat(params Object[] args)
{
    return String.Format(formatProvider, format, args);
}

Additional notes:
I'm wary of providing the source code for my constructor, because I'm not sure of the licensing implications from my reliance on the original .Net implementation. However, anyone who wants to test this can just make the relevant private data public and assign values that mimic a particular format string.

Also, I'm very open to changing the FormatInfo class and even the parts List if anyone has a suggestion that could improve the build time. Since my primary concern is iteration time from front to end maybe a LinkedList would fare better?

[Update]:
Hmm... something else I can try is adjusting my tests. My benchmarks were fairly simple: composing names to a "{lastname}, {firstname}" format and composing formatted phone numbers from the area code, prefix, number, and extension components. Neither of those have much in the way of string literals. As I think about how the original state machine parser worked, I think those string literals are exactly where my code has the best chance to do well because I no longer have to examine each character in the string.

Another thought:
This class is still useful, even if I can't make it go faster. As long as performance is no worse than the base String.Format(), I've still created a stronly-typed interface that allows a program to assemble it's own "format string" at run time. All I need to do is provide public access to the parts list.

A: 

Have you accounted for the time to do the JIT compile as well? After all, the framework will be ngen'd which could account for the differences?

Rowland Shaw
I don't think that's it, otherwise I'd see a greater disparity between formatting 10,000 items vs 1,000,000.
Joel Coehoorn
Hmm... that could account for those few milliseconds though, because it would happen once no matter how many items there are.
Joel Coehoorn
JIT compiling only happens once, so that would explain a constant difference in timing between yours and the framework regardless of input size.
Guvante
If you run through this one time, then do your timing on a second loop, it should take care of the JIT timing issues.
Reed Copsey
+1  A: 

It seems to me that in order to get actual performance improvement, you'd need to factor out any format analysis done by your customFormatter and formattable arguments into a function that returns some data structure that tells a later formatting call what to do. Then you pull those data structures in your constructor and store them for later use. Presumably this would involve extending ICustomFormatter and IFormattable. Seems kinda unlikely.

chaos
I would argue that most uses of String.Format() don't involve the custom formatter or format provider. Even if you use it in one part of the format string, you might not in another. My reliance on them is to mimic the original behavior.
Joel Coehoorn
Okay. All I'm really seeing then is tiny incremental things like pulling your fi.index < 0 parts out of the loop (setting up a 'template' StringBuilder in the constructor and using .Insert() to put positional elements into it instead .Append()).
chaos
Don't knock it: the template stringbuilder may end up the winner.
Joel Coehoorn
Well, let me know how it goes. :)
chaos
A: 

The framework provides explicit overrides to the format methods that take fixed-sized parameter lists instead of the params object[] approach to remove the overhead of allocating and collecting all of the temporary object arrays. You might want to consider that for your code as well. Also, providing strongly-typed overloads for common value types would reduce boxing overhead.

thestar
No: the fixed-size overloads just create arrays and call a generic overload. The reason the fixed-size overloads exists is to allow passing in a ProviderInfo and Format string parameter without having them wrapped into the params parameter.
Joel Coehoorn
+3  A: 

Here's the final result:

I changed the format string in a benchmark trial to something that should favor my code a little more:

The quick brown {0} jumped over the lazy {1}.

As I expected, this fares much better compared to the original; 2 million iterations in 5.3 seconds for this code vs 6.1 seconds for String.Format. This is an undeniable improvement. You might even be tempted to start using this as a no-brainer replacement for many String.Format situations. After all, you'll do no worse and you might even get a small performance boost: as much 14%, and that's nothing to sneeze at.

Except that it is. Keep in mind, we're still talking less than half a second difference for 2 million attempts, under a situation specifically designed to favor this code. Not even busy ASP.Net pages are likely to create that much load, unless your lucky enough to work on a Top100 web site.

Most of all, this omits one important alternative: you can just create a new StringBuilder each time and manually handle your own formatting using raw Append() calls. With that technique my benchmark finished in only 3.9 seconds. That's a much greater improvement.

So in the end, if you're in a situation where the performance matters there is a better alternative available. And if it doesn't matter, you probably want to stick with the clarity of using the simple built-in method.

Joel Coehoorn
A: 

I gotta believe that spending as much time optimizing data IO would earn exponentially bigger returns!

This is surely a kissin' cousin to YAGNI for this. Avoid Premature Optimization. APO.

rp
That was kinda the point of my "final result" post- this just isn't worth it, except as a fun experiment :)
Joel Coehoorn
+3  A: 

Don't stop now!

Your custom formatter might only be slightly more efficient than the built-in API, but you can add more features to your own implementation that would make it more useful.

I did a similar thing in Java, and here are some of the features I added (besides just pre-compiled format strings):

1) The format() method accepts either a varargs array or a Map (in .NET, it'd be a dictionary). So my format strings can look like this:

StringFormatter f = StringFormatter.parse(
   "the quick brown {animal} jumped over the {attitude} dog"
);

Then, if I already have my objects in a map (which is pretty common), I can call the format method like this:

String s = f.format(myMap);

2) I have a special syntax for performing regular expression replacements on strings during the formatting process:

// After calling obj.toString(), all space characters in the formatted
// object string are converted to underscores.
StringFormatter f = StringFormatter.parse(
   "blah blah blah {0:/\\s+/_/} blah blah blah"
);

3) I have a special syntax that allows the formatted to check the argument for null-ness, applying a different formatter depending on whether the object is null or non-null.

StringFormatter f = StringFormatter.parse(
   "blah blah blah {0:?'NULL'|'NOT NULL'} blah blah blah"
);

There are a zillion other things you can do. One of the tasks on my todo list is to add a new syntax where you can automatically format Lists, Sets, and other Collections by specifying a formatter to apply to each element as well as a string to insert between all elements. Something like this...

// Wraps each elements in single-quote charts, separating
// adjacent elements with a comma.
StringFormatter f = StringFormatter.parse(
   "blah blah blah {0:@['$'][,]} blah blah blah"
);

But the syntax is a little awkward and I'm not in love with it yet.

Anyhow, the point is that your existing class might not be much more efficient than the framework API, but if you extend it to satisfy all of your personal string-formatting needs, you might end up with a very convenient library in the end. Personally, I use my own version of this library for dynamically constructing all SQL strings, error messages, and localization strings. It's enormously useful.

benjismith