ansaurus

Question

String Benchmarks in C# - Refactoring for Speed/Maintainability

Answer 1

+3 A:

You might want to try instantiating a Regex object as a class member and using the RegexOptions.Compiled option when you create it.

Currently, you're using the static Split member of Regex, and that doesn't cache the regular expression. Using an instanced member object instead of the static method should improve your performance even more (over the long run).

Andrew Rollings 2009-01-23 14:43:22

Answer 2

+2 A:

Use a StringBuilder instead of concatenation. Each concatenation is creating a new string instance and throwing away the old.

HTH, Kent

Kent Boogaart 2009-01-23 14:43:24

Answer 3

+15 A:

1) Use a StringBuilder, preferrably set with a reasonable initial capacity (e.g. string length * 5/4, to allow one extra space per four characters).

2) Try using a foreach loop instead of a for loop - it may well be simpler

3) You don't need to convert the string into a char array first - foreach will work over a string already, or use the indexer.

4) Don't do extra string conversions everywhere - calling Convert.ToString(char) and then appending that string is pointless; there's no need for the single character string

5) For the second option, just build the regex once, outside the method. Try it with RegexOptions.Compiled as well.

EDIT: Okay, full benchmark results. I've tried a few more things, and also executed the code with rather more iterations to get a more accurate result. This is only running on an Eee PC, so no doubt it'll run faster on "real" PCs, but I suspect the broad results are appropriate. First the code:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Text.RegularExpressions;

class Benchmark
{
    const string TestData = "ThisIsAUpperCaseString";
    const string ValidResult = "This Is A Upper Case String";
    const int Iterations = 1000000;

    static void Main(string[] args)
    {
        Test(BenchmarkOverhead);
        Test(MakeNiceString);
        Test(ImprovedMakeNiceString);
        Test(RefactoredMakeNiceString);
        Test(MakeNiceStringWithStringIndexer);
        Test(MakeNiceStringWithForeach);
        Test(MakeNiceStringWithForeachAndLinqSkip);
        Test(MakeNiceStringWithForeachAndCustomSkip);
        Test(SplitCamelCase);
        Test(SplitCamelCaseCachedRegex);
        Test(SplitCamelCaseCompiledRegex);        
    }

    static void Test(Func<string,string> function)
    {
        Console.Write("{0}... ", function.Method.Name);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i=0; i < Iterations; i++)
        {
            string result = function(TestData);
            if (result.Length != ValidResult.Length)
            {
                throw new Exception("Bad result: " + result);
            }
        }
        sw.Stop();
        Console.WriteLine(" {0}ms", sw.ElapsedMilliseconds);
        GC.Collect();
    }

    private static string BenchmarkOverhead(string str)
    {
        return ValidResult;
    }

    private static string MakeNiceString(string str)
    {
        char[] ca = str.ToCharArray();
        string result = null;
        int i = 0;
        result += System.Convert.ToString(ca[0]);
        for (i = 1; i <= ca.Length - 1; i++)
        {
            if (!(char.IsLower(ca[i])))
            {
                result += " ";
            }
            result += System.Convert.ToString(ca[i]);
        }
        return result;
    }

    private static string ImprovedMakeNiceString(string str)
    { //Removed Convert.ToString()
        char[] ca = str.ToCharArray();
        string result = null;
        int i = 0;
        result += ca[0];
        for (i = 1; i <= ca.Length - 1; i++)
        {
            if (!(char.IsLower(ca[i])))
            {
                result += " ";
            }
            result += ca[i];
        }
        return result;
    }

    private static string RefactoredMakeNiceString(string str)
    {
        char[] ca = str.ToCharArray();
        StringBuilder sb = new StringBuilder((str.Length * 5 / 4));
        int i = 0;
        sb.Append(ca[0]);
        for (i = 1; i <= ca.Length - 1; i++)
        {
            if (!(char.IsLower(ca[i])))
            {
                sb.Append(" ");
            }
            sb.Append(ca[i]);
        }
        return sb.ToString();
    }

    private static string MakeNiceStringWithStringIndexer(string str)
    {
        StringBuilder sb = new StringBuilder((str.Length * 5 / 4));
        sb.Append(str[0]);
        for (int i = 1; i < str.Length; i++)
        {
            char c = str[i];
            if (!(char.IsLower(c)))
            {
                sb.Append(" ");
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    private static string MakeNiceStringWithForeach(string str)
    {
        StringBuilder sb = new StringBuilder(str.Length * 5 / 4);
        bool first = true;      
        foreach (char c in str)
        {
            if (!first && char.IsUpper(c))
            {
                sb.Append(" ");
            }
            sb.Append(c);
            first = false;
        }
        return sb.ToString();
    }

    private static string MakeNiceStringWithForeachAndLinqSkip(string str)
    {
        StringBuilder sb = new StringBuilder(str.Length * 5 / 4);
        sb.Append(str[0]);
        foreach (char c in str.Skip(1))
        {
            if (char.IsUpper(c))
            {
                sb.Append(" ");
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    private static string MakeNiceStringWithForeachAndCustomSkip(string str)
    {
        StringBuilder sb = new StringBuilder(str.Length * 5 / 4);
        sb.Append(str[0]);
        foreach (char c in new SkipEnumerable<char>(str, 1))
        {
            if (char.IsUpper(c))
            {
                sb.Append(" ");
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    private static string SplitCamelCase(string str)
    {
        string[] temp = Regex.Split(str, @"(?<!^)(?=[A-Z])");
        string result = String.Join(" ", temp);
        return result;
    }

    private static readonly Regex CachedRegex = new Regex("(?<!^)(?=[A-Z])");    
    private static string SplitCamelCaseCachedRegex(string str)
    {
        string[] temp = CachedRegex.Split(str);
        string result = String.Join(" ", temp);
        return result;
    }

    private static readonly Regex CompiledRegex =
        new Regex("(?<!^)(?=[A-Z])", RegexOptions.Compiled);    
    private static string SplitCamelCaseCompiledRegex(string str)
    {
        string[] temp = CompiledRegex.Split(str);
        string result = String.Join(" ", temp);
        return result;
    }

    private class SkipEnumerable<T> : IEnumerable<T>
    {
        private readonly IEnumerable<T> original;
        private readonly int skip;

        public SkipEnumerable(IEnumerable<T> original, int skip)
        {
            this.original = original;
            this.skip = skip;
        }

        public IEnumerator<T> GetEnumerator()
        {
            IEnumerator<T> ret = original.GetEnumerator();
            for (int i=0; i < skip; i++)
            {
                ret.MoveNext();
            }
            return ret;
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
}

Now the results:

BenchmarkOverhead...  22ms
MakeNiceString...  10062ms
ImprovedMakeNiceString...  12367ms
RefactoredMakeNiceString...  3489ms
MakeNiceStringWithStringIndexer...  3115ms
MakeNiceStringWithForeach...  3292ms
MakeNiceStringWithForeachAndLinqSkip...  5702ms
MakeNiceStringWithForeachAndCustomSkip...  4490ms
SplitCamelCase...  68267ms
SplitCamelCaseCachedRegex...  52529ms
SplitCamelCaseCompiledRegex...  26806ms

As you can see, the string indexer version is the winner - it's also pretty simple code.

Hope this helps... and don't forget, there are bound to be other options I haven't thought of!

Jon Skeet 2009-01-23 14:43:45

I will try just that. I was hoping you'd respond. Quick question: When I do that, should I update my question with benchmarks/results from the refactor? Or should I add it was an answer? I normally don't ask questions that I actually participate in like this.

George Stocker 2009-01-23 14:46:53

@Jon Skeet do you know of any resources that talk about complexity analysis of Regular Expressions?

Jason Punyon 2009-01-23 14:49:49

Aren't #1 and #2 slightly opposed? Using a SB with a set size improves perf, but I always thought foreach was pretty well known to be slower than a for loop since it has to use the iterator. foreach is often simpler, but if the goal is perf I'd continue to use indexing.

ctacke 2009-01-23 14:53:57

@Gortok: Update the question. @JPunyon: Nope, I'm afraid not. @ctacke: foreach can be slightly slower in some situations, but only with a constant multiplicative factor, e.g. "lose 10%" whereas with long strings StringBuilder will get *much* better. We've been asked for perf *and* maintenance.

Jon Skeet 2009-01-23 14:56:09

(Not that the string is *likely* to be hugely long in this case, I should say...)

Jon Skeet 2009-01-23 15:19:20

@Jon: I've written the new stuff except for the Regex change. That'll be done shortly. I had a meeting to hit.

George Stocker 2009-01-23 16:26:02

@Gortok: Righto. I'm about to commute home, and will then be eating etc. In other words, don't expect much for another 2 1/2 hours!

Jon Skeet 2009-01-23 16:33:19

@Jon Skeet: I've added the final compiled regex. I'm not sure I did it properly, and I didn't expect it to be as slow as it is. I've also updated my questions, if you want to take a stab at it now (in a 'completed' state).

George Stocker 2009-01-23 17:43:33

@Jon Skeet: Excellent! I knew there was a better way of setting up the benchmark (my way was 'quick and dirty' but not extensible. So thank you for that was well).

George Stocker 2009-01-23 17:52:20

@Jon Skeet: You have way too much time on your hands!

Andrew Rollings 2009-01-23 17:52:39

@Andrew: Commuting is really handy :) @Gortok: This is the first time I've done the "pass in a delegate and grab the target name" business. Quite handy. Clearly the delegate overhead is pretty low, too.

Jon Skeet 2009-01-23 17:56:49

@Jon Skeet: FWIW, I didn't write the original code, as I said in the question, I found it lying there and it looked odd. So I started monkeying with it.

George Stocker 2009-01-23 18:44:33

Jon--why are you making the call to GC.Collect() at the end of the test?

rp 2009-01-23 21:00:44

Answer 4

+2 A:

Try refactoring so that the regular expression that you're using to split the string in the second method is stored in a static method, and has been built using the RegexOptions.Compiled option. More info about this here: http://msdn.microsoft.com/en-us/library/8zbs0h2f.aspx.

I didn't test the theory, but I'd imagine that having to recreate the regex every time would be time consuming.

Joel Martinez 2009-01-23 14:43:50

Answer 5

+2 A:

This is in response to ctacke's comment on Jon Skeet's answer (It's not long for a comment)

I always thought foreach was pretty well known to be slower than a for loop since it has to use the iterator.

Actually, no, in this case foreach would be faster. Index access is bounds checked (ie. i is check to be in range three time in the loop: once in the for() and once each for the two ca[i]s), which makes a for loop slower than foreach.

If the C# compiler detects the specific syntax:

 for(i = 0; i < ca.Length; i++)

then it will perform a ad hoc optimization, removing the internal bound-checks, making the for() loop faster. However, since here we must treat ca[0] as a special case (to prevent a leading space on the output), we can't trigger that optimization.

James Curran 2009-01-23 15:45:52

Answer 6

+2 A:

I know what they say about RegEx, use it to solve a problem and now you have two problems, but I remain a fan, Just for grins, here is a RegEx version. RegEx, with a little initiation is easy to read, less code, and lets you easily snap in additional delimiters (as I did with the comma).

   s1 = MakeNiceString( "LookOut,Momma,There'sAWhiteBoatComingUpTheRiver" ) );

   private string MakeNiceString( string input )
   {
       StringBuilder sb = new StringBuilder( input );
       int Incrementer = 0;
       MatchCollection mc;
       const string SPACE = " ";

       mc = Regex.Matches( input, "[A-Z|,]" );

       foreach ( Match m in mc )
       {
           if ( m.Index > 0 )
           {
               sb.Insert( m.Index + Incrementer, SPACE );
               Incrementer++;
           }
       }

       return sb.ToString().TrimEnd();
   }

rp 2009-01-23 16:29:30

Answer 7

+2 A:

My first refactoring would be to change the name of the method to something more descriptive. MakeNiceString imo is not a name that would indicate to me what this method does.

How about PascalCaseToSentence? Not loving that name, but it's better than MakeNiceString.

steve_c 2009-01-23 16:41:27

@steve_c: That's what I came into when I found this code. That's initially what attracted me to it, it didn't make sense to have a method named that. That's why we're here, incidentally, because I got curious with a piece of code that wasn't named properly. :-)

George Stocker 2009-01-23 16:50:58

No worries, Gortok. That just would have been step #1 for me with this refactoring. But ya, thanks to the useless name it caught your attention.

steve_c 2009-01-23 20:30:45

Answer 8

A:

Jerry 2009-01-23 17:08:48

That explanation of StringBuilder's internals is entirely incorrect. It doesn't maintain a list of things to append together - it maintains a string which is the current value. This string has spare capacity, and memory is reallocated where necessary.

Jon Skeet 2009-01-23 17:14:46

Ah... my mistake. It doesn't do the re-allocation each time. That's what I was actually aiming at, but you are correct.

Jerry 2009-01-23 17:22:16

Answer 9

A:

The Regex versions of your solution are not equivalent in results to the original code. Perhaps the larger context of the code avoids the areas where they differ. The original code will add a space for anything that is not a lower case character. For example "This\tIsATest" would become "This \t Is A Test" in the original but "This\t Is A Test" with the Regex versions.

(?<!^)(?=[^a-z])

Is the pattern you want for a closer match, but even then it is still ignoring issues of i18n. The following pattern should take care of that:

(?<!^)(?=\P{Ll})

Jeremy Wilde 2009-01-23 19:52:15

The Regex was just included to have something to benchmark against. It was never a 'serious' contender to actually be used. When I started this refactoring, I knew that it'd come down to using String Builder, it was just a matter of refactoring to that point.

George Stocker 2009-01-23 20:12:58

I found it was an interesting illustration of how careful you need to be with refactoring even a small function. It is easy to overlook what the code is actually doing with what its perceived intention is.

Jeremy Wilde 2009-01-23 21:22:04

Answer 10

+1 A:

Here is a slightly more optimal version. I have taken suggestions from previous posters, but also appended to the string builder in a block-wise fashion. This may allow string builder to copy 4 bytes at a time, depending on the size of the word. I have also removed the string allocation and just replace it by str.length.

    static string RefactoredMakeNiceString2(string str)
    {
        char[] ca = str.ToCharArray();
        StringBuilder sb = new StringBuilder(str.Length);
        int start = 0;
        for (int i = 0; i < ca.Length; i++)
        {
            if (char.IsUpper(ca[i]) && i != 0)
            {
                sb.Append(ca, start, i - start);
                sb.Append(' ');
                start = i;
            }
        }
        sb.Append(ca, start, ca.Length - start);
        return sb.ToString();
    }

avid 2009-11-10 12:22:18

ansaurus

tags:

views:

answers:

String Benchmarks in C# - Refactoring for Speed/Maintainability

related questions