views:

192

answers:

8

After considerable measurement, I have identified a hotspot in one of our windows services that I'd like to optimize. We are processing strings that may have multiple consecutive spaces in it, and we'd like to reduce to only single spaces. We use a static compiled regex for this task:

private static readonly Regex 
    regex_select_all_multiple_whitespace_chars = 
        new Regex(@"\s+",RegexOptions.Compiled);

and then use it as follows:

var cleanString=
    regex_select_all_multiple_whitespace_chars.Replace(dirtyString.Trim(), " ");

This line is being invoked several million times, and is proving to be fairly intensive. I've tried to write something better, but I'm stumped. Given the fairly modest processing requirements of the regex, surely there's something faster. Could unsafe processing with pointers speed things further?

Edit:

Thanks for the amazing set of responses to this question... most unexpected!

+3  A: 

Just one suggestion, if your data don't have unicode whitespaces, instead of \s+ use [ \r\n]+ or [ \n]+ or just  + (if there is space only), basically limit it to minimum character set.

S.Mark
@" {2,}" performs almost (within 10%) as well as the best method suggested here over my data, so excellent answer!
spender
+6  A: 

Currently, you are replacing a single space with another single space. Try to match \s{2,} (or something similar, if you want to replace single newlines and other characters).

Kobi
+1 thats very big point.
S.Mark
+1 Nicely spotted.
Mitch Wheat
...but I think perhaps a regex is not the best option...
Mitch Wheat
I agree. For a critical point, a quick test shows we can do better.
Kobi
Probably because the code-change-to-performance ratio is very good. A 25% boost with just a few characters changed in the regular expression. :)
Guffa
+3  A: 

You could not use regular expressions. For example:

private static string NormalizeWhitespace(string test)
{
    string trimmed = test.Trim();

    var sb = new StringBuilder(trimmed.Length);

    int i = 0;
    while (i < trimmed.Length)
    {
        if (trimmed[i] == ' ')
        {
            sb.Append(trimmed[i]);

            do { i++; } while (i < trimmed.Length && trimmed[i] == ' ');
        }

        sb.Append(trimmed[i]);

        i++;
    }

    return sb.ToString();
}

With this method and the following test bed:

private static readonly Regex MultipleWhitespaceRegex = new Regex(
    @"\s+", 
    RegexOptions.Compiled);

static void Main(string[] args)
{
    string test = "regex  select    all multiple     whitespace   chars";

    const int Iterations = 15000;

    var sw = new Stopwatch();

    sw.Start();
    for (int i = 0; i < Iterations; i++)
    {
        NormalizeWhitespace(test);
    }
    sw.Stop();
    Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);

    sw.Reset();

    sw.Start();
    for (int i = 0; i < Iterations; i++)
    {
        MultipleWhitespaceRegex.Replace(test, " ");
    }
    sw.Stop();
    Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);
}

I got the following results:

// NormalizeWhitespace - 27ms
// Regex - 132ms

Note that this was only tested with a very simple example, could be further optimized by removing the call to String.Trim and is only provided to make a point of regular expressions sometimes not being the best answer.

João Angelo
apart from the facts that whitespace is more than just `' '`, and that you don't use the variable `trimemd`, this is exactly what I was going to suggest. +1
Rob Fonseca-Ensor
@Rob Fonseca-Ensor, thanks for spotting that one, it was a last minute incomplete change. :)
João Angelo
+8  A: 

This is about three times faster:

private static string RemoveDuplicateSpaces(string text) {
  StringBuilder b = new StringBuilder(text.Length);
  bool space = false;
  foreach (char c in text) {
    if (c == ' ') {
      if (!space) b.Append(c);
      space = true;
    } else {
      b.Append(c);
      space = false;
    }
  }
  return b.ToString();
}
Guffa
That is uncanny!!!
Mitch Wheat
+1. From Me! ...
Mitch Wheat
+1. this is how I would have done it.
João Portela
I went for this variant. A little more speed can be gained by using for(int i... instead foreach, with char c=text[i]. Thanks again everyone.
spender
+3  A: 

I'm curious how a straight forward implementation might perform:

    static string RemoveConsecutiveSpaces(string input)
    {
        bool whiteSpaceWritten = false;
        StringBuilder sbOutput = new StringBuilder(input.Length);

        foreach (Char c in input)
        {
            if (c == ' ')
            {
                if (!whiteSpaceWritten)
                {
                    whiteSpaceWritten = true;
                    sbOutput.Append(c);
                }
            }
            else
            {
                whiteSpaceWritten = false;
                sbOutput.Append(c);
            }
        }

        return sbOutput.ToString();
    }
Mitch Wheat
It's about three times as fast as the regular expression. Se my answer (that has basically the same code).
Guffa
A: 

As it is such a simple expression, replacing two or more spaces with a single space, get rid of the Regex object and hard code the replacement yourself (in C++/CLI):

String ^text = "Some   text  to process";
bool spaces = false;
// make the following static and just clear it rather than reallocating it every time
System::Text::StringBuilder ^output = gcnew System::Text::StringBuilder;
for (int i = 0, l = text->Length ; i < l ; ++i)
{
  if (spaces)
  {
    if (text [i] != ' ')
    {
      output->Append (text [i]);
      spaces = false;
    }
  }
  else
  {
    output->Append (text [i]);
    if (text [i] == ' ')
    {
      spaces = true;
    }
  }
}
text = output->ToString ();
Skizz
hmmm, not sure C++/CLI is justified!
Mitch Wheat
The syntax is slightly different but it isn't too hard to change the above to C#. I just had a C++/CLI project open to test the code in (yeah I know, but it's what I've got to use).
Skizz
+7  A: 

How about this...

public string RemoveMultiSpace(string test)
{
var words = test.Split(new char[] { ' ' }, 
    StringSplitOptions.RemoveEmptyEntries);
return string.Join(" ", words);
}

Test case run with NUnit:
Test time is in milliseconds.

Regex Test time: 338,8885
RemoveMultiSpace Test time: 78,9335
private static readonly Regex regex_select_all_multiple_whitespace_chars =
   new Regex(@"\s+", RegexOptions.Compiled);

[Test]
public void Test()
{
    string startString = "A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      ";
    string cleanString;
    Trace.WriteLine("Regex Test start");
    int count = 10000;
    Stopwatch timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < count; i++)
    {
        cleanString = regex_select_all_multiple_whitespace_chars.Replace(startString, " ");
    }
    var elapsed = timer.Elapsed;
    Trace.WriteLine("Regex Test end");
    Trace.WriteLine("Regex Test time: " + elapsed.TotalMilliseconds);

    Trace.WriteLine("RemoveMultiSpace Test start");
    timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < count; i++)
    {
        cleanString = RemoveMultiSpace(startString);
    }
    elapsed = timer.Elapsed;
    Trace.WriteLine("RemoveMultiSpace Test end");
    Trace.WriteLine("RemoveMultiSpace Test time: " + elapsed.TotalMilliseconds);
}

public string RemoveMultiSpace(string test)
{
    var words = test.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join(" ", words);
}

Edit:
Made some more tests and added Guffa´s method "RemoveDuplicateSpaces" based on StringBuilder.
So my conclusion is that the StringBuilder method is faster when there is a lot of spaces, but with less spaces the string split method is slightly faster.

Cleaning file with about 30000 lines, 10 iterations
RegEx time elapsed: 608,0623
RemoveMultiSpace time elapsed: 239,2049
RemoveDuplicateSpaces time elapsed: 307,2044

Cleaning string, 10000 iterations:
A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      
RegEx time elapsed: 590,3626
RemoveMultiSpace time elapsed: 159,4547
RemoveDuplicateSpaces time elapsed: 137,6816

Cleaning string, 10000 iterations:
A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      
RegEx time elapsed: 290,5666
RemoveMultiSpace time elapsed: 64,6776
RemoveDuplicateSpaces time elapsed: 52,4732

Jens Granlund
That's very nice. A quick shallow test shows this is faster than the StringBuilder approach.
Kobi
Updated: Sorry, at first I forgot the loop over RemoveMultiSpace method.
Jens Granlund
I thought that the test result looked waaaaaay too god. ;) It's a bit faster than using a StringBuilder, but it also creates a bunch of temporary strings. You would have to try them with some real data to see how that affects the performance.
Guffa
Does anyone have an idea as to *why* this is faster than the StringBuilder approach?
Heinzi
@Guffa, I made a test with a c# file with about 30000 lines(RReference.cs) not a lot of spaces but it´s still about 3 times faster.
Jens Granlund
@Heinzi, I made some more tests and updated my answer.
Jens Granlund
@Heinzi: It uses optimised (unsafe mode or unmanaged) methods in the framework which is faster than the managed code that uses a StringBuilder. Eventhough the Split-Join does some extra work, it's still faster in some cases.
Guffa
+1. didn't remember this solution :)
João Portela
A: 

Arrays always will be faster

        public static string RemoveMultiSpace(string input)
    {
        var value = input;

        if (!string.IsNullOrEmpty(input))
        {
            var isSpace = false;
            var index = 0;
            var length = input.Length;
            var tempArray = new char[length];
            for (int i = 0; i < length; i++)
            {
                var symbol = input[i];
                if (symbol == ' ')
                {
                    if (!isSpace)
                    {
                        tempArray[index++] = symbol;
                    }
                    isSpace = true;
                }
                else
                {
                    tempArray[index++] = symbol;
                    isSpace = false;
                }
            }
            value = new string(tempArray, 0, index);
        }

        return value;
    }
Petar Petrov
Overuse of the `var` keyword. Why are you copying the array to another array instead of just creating the string from tempArray?
Guffa