views:

661

answers:

6

I have a method to replace every character except those I specify. For example,

ReplaceNot("test. stop; or, not", ".;/\\".ToCharArray(), '*');

would return

"****.*****;***,****".

Now, this is not an instance of premature optimization. I call this method quite a few times during a network operation. I found that on longer strings, it is causing some latency, and removing it helped a bit. Any help to speed this up would be appreciated.

    public static string ReplaceNot(this string original, char[] pattern, char replacement)
    {           
        int index = 0;
        int old = -1;

        StringBuilder sb = new StringBuilder(original.Length);

        while ((index = original.IndexOfAny(pattern, index)) > -1)
        {
            sb.Append(new string(replacement, index - old - 1));
            sb.Append(original[index]);
            old = index++;
        }

        if (original.Length - old > 1)
        {
            sb.Append(new string(replacement, original.Length - (old + 1)));
        }

        return sb.ToString();
    }

Final #'s. I also added a test case for a 3K character string, ran at 100K times instead of 1M to see how well each of these scales. The only surprise was that the regular expression 'scaled better' than the others, but it is no help since it is very slow to begin with:

User            Short * 1M  Long * 100K     Scale
John            319             2125            6.66
Luke            360             2659            7.39
Guffa           409             2827            6.91
Mine            447             3372            7.54
DirkGently      1094            9134            8.35
Michael         1591            12785           8.04
Peter           21106           94386           4.47

Update: I made the creation of the regular expression for Peter's version a static variable, and set it to RegexOptions.Compiled to be fair:

User            Short * 1M      Long * 100K     Scale
Peter           8997            74715           8.30

Pastebin link to my testing code, please correct me if it is wrong: http://pastebin.com/f64f260ee

A: 

It's going to be O(n). You seem to be replacing all alphabets and whitespaces by *, why not just test if the current character is an alphabet/whitespace and replace it?

dirkgently
+8  A: 

Can't you use Regex.Replace like so:

Regex regex = new Regex(@"[^.;/\\]");
string s = regex.Replace("test. stop; or, not", "*");
Peter van der Heijden
+1, Beat me to it!
Adam Robinson
This screams regex to me as well. As soon as you have more than a couple calls to .IndexOf* or .Substring methods (or the possibly less efficient (?) "new string(replacement, index - old - 1)"), you can probably accomplish whatever you're trying to do with a relatively simple regex, in a couple lines of code.
gregmac
And regex will be faster than running over the string once? I always though regex-es were no good when performance was an issue!
dirkgently
Using a Regex and a simple test to execute it in a loop 1million times, my method above takes 4200 milliseconds, whereas the Regex takes 34000 milliseconds. I know it is cleaner, but I would love to see a speed improvement :)
esac
There you go! I knew that, thanks esac!
dirkgently
@esac: Can you try the method I described (performance comparison)?
dirkgently
dirkgently: if you mean iterating through the string as a char array and assigning that into a new char array and returning a string, that was my first pass through. it isnt always just 'alphabet and whitespace', so it has to be generic. i'll revert my code back to that version and post the same number for comparison.
esac
Oops, sorry about that. Didn't test performance
Peter van der Heijden
@esac: As long as the set of characters to be replaced is fixed, this approach works fine and is probably the best. Try not to create a new string, but modify in place. And if all else fails, you do have the unsafe option of accessing strings as C-style arrays and do the modification (though I doubt if this'll give much perf boost).
dirkgently
@dirkgently: I do not know if I messed it up the first time, but I just rewrote it using a one-pass replace and it definitely performs faster. Its possible that I used a for loop instead of LINQ the first time, but the revision for this was lost so I will never know. I'll edit my post to include this code, but it ran at 700 milliseconds.
esac
@esac: Do you still want a better algorithm? ;-)
dirkgently
RegEx will almost always be slower than an interative method, since it has to use pattern matching
MasterMax1313
+6  A: 

Alright, on a ~60KB string, this will perform about 40% faster than your version:

public static string ReplaceNot(this string original, char[] pattern, char replacement)
{
    int index = 0;

    StringBuilder sb = new StringBuilder(new string(replacement, original.Length));

    while ((index = original.IndexOfAny(pattern, index)) > -1)
    {
     sb[index] = original[index++];
    }

    return sb.ToString();
}

The trick is to initialize a new string with all replacement characters, since most of them will be replaced.

John Rasch
+1 - Nice. However, I guarantee in this case that since the result of the ++index or index++ expression isn't being used, there would be absolutely zero difference in performance (as far as how you increment index). I'm sure the JIT would generate the exact same code. I wouldn't be surprised if csc.exe generated the exact same IL.
Michael Burr
@Michael - it appears you are right, when I changed it back it seems to perform similarly, I don't know why it was different last time I tested it
John Rasch
To view IL there's ildasm which comes with the .NET Framework (or maybe the SDK), or you can do what all the cool kids do and use the free .NET Reflector: http://www.red-gate.com/products/reflector/
Michael Burr
Wow... I've been using Reflector for a while now and didn't know you could do that... awesome.
John Rasch
+4  A: 

I don't know if this will be any faster, but it avoids newing up strings just so they can be appended to the string builder, which may help:

    public static string ReplaceNot(this string original, char[] pattern, char replacement)
    {
        StringBuilder sb = new StringBuilder(original.Length);

        foreach (char ch in original) {
            if (Array.IndexOf( pattern, ch) >= 0) {
                sb.Append( ch);
            }
            else {
                sb.Append( replacement);
            }
        }

        return sb.ToString();
    }

If the number of chars in pattern will be of any size (which I'm guessing it generally won't), it might pay to sort it and perform an Array.BinarySearch() instead of the Array.indexOf().

For such a simple transformation, I'd bet that it'll have no problem being faster than a regex, too.

Also, since your set of characters in pattern are likely to usually come from a string anyway (at least that's been my general experience with this type of API), why don't you have the method signature be:

public static string ReplaceNot(this string original, string pattern, char replacement)

or better yet, have an overload where pattern can be a char[] or string?

Michael Burr
Well - after looking at your timing results I'm a little surprised at how poorly this fares compared to your original... I guess that using String.IndexOfAny() is far, far better than calling Array.IndexOf repeatedly on each character.
Michael Burr
+2  A: 

The StringBuilder has an overload that takes a character and a count, so you don't have to create intermediate strings to add to the StringBuilder. I get about 20% improvement by replacing this:

sb.Append(new string(replacement, index - old - 1));

with:

sb.Append(replacement, index - old - 1);

and this:

sb.Append(new string(replacement, original.Length - (old + 1)));

with:

sb.Append(replacement, original.Length - (old + 1));

(I tested the code that you said was about four times faster, and I find it about 15 times slower...)

Guffa
So what I think is the interesting question now is: is this faster than John Rasch's routine which optimistically loads the StringBuilder with the replacement char or Rasch's faster? I could see it going either way (and I suspect it's data dependant).
Michael Burr
@Michael: updated numbers coming in a few.
esac
@esac - thanks... I'm genuinely interested, and I'm glad there's someone less lazy than me around here to find out!
Michael Burr
+4  A: 

Here's another version for you. My tests suggest that its performance is pretty good.

public static string ReplaceNot(
    this string original, char[] pattern, char replacement)
{
    char[] buffer = new char[original.Length];

    for (int i = 0; i < buffer.Length; i++)
    {
        bool replace = true;

        for (int j = 0; j < pattern.Length; j++)
        {
            if (original[i] == pattern[j])
            {
                replace = false;
                break;
            }
        }

        buffer[i] = replace ? replacement : original[i];
    }

    return new string(buffer);
}
LukeH
very nice, although your scaling is off a little bit, you are very close to beating the top submission :) i've updated the main times.
esac
Interesting - algorithm-wise this is similar to my post. The main differences are an explicit loop instead of calling Array.IndexOf() and using a char[] to build the result instead of a StringBuffer(). But this implementation is 4-5x faster...
Michael Burr
@esac, It seems that the benchmarks vary significantly depending on the machine/framework they're running on: On my old PC (P4 2.6GHz, 1.5GB, XP32) my version is twice as fast as John's; On my newer PC (Core2Duo 2.66GHz, 4GB, Vista64) John's version is twice as fast as mine!
LukeH