views:

2890

answers:

6
 string str1 = "12345ABC...\\...ABC100000"; 
 // Hypothetically huge string of 100000 + Unicode Chars
 str1 = str1.Replace("1", string.Empty);
 str1 = str1.Replace("22", string.Empty);
 str1 = str1.Replace("656", string.Empty);
 str1 = str1.Replace("77ABC", string.Empty);

 // ...  this replace anti-pattern might happen with upto 50 consecutive lines of code.

 str1 = str1.Replace("ABCDEFGHIJD", string.Empty);

I have inherited some code that does the same as the snippet above. It takes a huge string and replaces (removes) constant smaller strings from large string.

I believe this is very memory intensive process given that new large immutable strings are being allocated in memory for each replace, awaitng death via the GC

1. What is the fastest way of replacing these values, ignoring memory concerns?

2. What is the most memory efficient way of achieving the same result?

I am hoping that these are the same answer!

Practical solutions that fit somewhere inbewteen these goals are also appreciated.

Assumptions:

  • All replacements are constant and known in advance
  • Underlying characters do contain some unicode [non-ascii] chars
+3  A: 

StringBuilder: http://msdn.microsoft.com/en-us/library/2839d5h5.aspx

The performance of the Replace operation itself should be roughly same as string.Replace and according to Microsoft no garbage should be produced.

Andreas Huber
Thanks Andreas, can you provide a simple explanation why a SB is a better option.
nick_alot
It is in effect a mutable string.
Ed Swangren
However, to remove the first character of a string, it still needs to copy the rest into place. That's where the regex option may be better, as it may be able to just build up the final string without shuffling things around for the multiple operations.
Jon Skeet
I will hopefully get some time to profile these two approaches and update this post. From experience the regex OR operator never performed that well, but maybe it's very memory efficient, which is my key concern.
nick_alot
+5  A: 

All characters in a .NET string are "unicode chars". Do you mean they're non-ascii? That shouldn't make any odds - unless you run into composition issues, e.g. an "e + acute accent" not being replaced when you try to replace an "e acute".

You could try using a regular expression with Regex.Replace, or StringBuilder.Replace. Here's sample code doing the same thing with both:

using System;
using System.Text;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string original = "abcdefghijkl";

        Regex regex = new Regex("a|c|e|g|i|k", RegexOptions.Compiled);

        string removedByRegex = regex.Replace(original, "");
        string removedByStringBuilder = new StringBuilder(original)
            .Replace("a", "")
            .Replace("c", "")
            .Replace("e", "")
            .Replace("g", "")
            .Replace("i", "")
            .Replace("k", "")
            .ToString();

        Console.WriteLine(removedByRegex);
        Console.WriteLine(removedByStringBuilder);
    }
}

I wouldn't like to guess which is more efficient - you'd have to benchmark with your specific application. The regex way may be able to do it all in one pass, but that pass will be relatively CPU-intensive compared with each of the many replaces in StringBuilder.

Jon Skeet
Thanks. To answer your question, they are non-ascii. I thought this may relevant if there was an efficient stream based solution.
nick_alot
or more accurately, some of them are non-ascii.
nick_alot
Shouldn't be relevant, beyond the possible composition issues I mentioned.
Jon Skeet
Regex (or any regural expressions) is not prefered with large strings
Ahmed Said
@Ahmed: I'd rather trust actual benchmarks than just received wisdom - I suspect it's *heavily* dependent on the data.
Jon Skeet
+1 for using `""` instead of `String.Empty` The compiler changes `""` to String.Empty anyway, and the result in this case is more readable.
Atømix
@Atømix: No, the compiler doesn't change "" to String.Empty. They *are* different things... I just happen to find "" more readable.
Jon Skeet
Interesting. I had learned that the compiler optimizes `""` and replaces it (an initialized empty String Object) with `String.Empty` I guess I should've checked the IL myself! This page clarified it for me: http://kossovsky.net/index.php/2009/06/string-empty-versus-empty-quotes/ I assume the guy who told me saw: "...advantages of using `""` is that it's in some cases it can be optimized by the compiler before the JIT. For example if you will run the same iteration while comparing `""==null` the whole block will be removed by the optimizer." Apparently, the compiler DOES do some optimization.
Atømix
+3  A: 
StringBuilder sb = new StringBuilder("Hello string");
sb.Replace("string", String.Empty);
Console.WriteLine(sb);

StringBuilder, a mutable string.

Ed Swangren
A: 

if you want a built in class in dotnet i think StringBuilder is the best. to make it manully you can use unsafe code with char* and iterate through your string and replace based on your criteria

Ahmed Said
A: 

Since you have multiple replaces on one string, I wolud recomend you to use RegEx over StringBuilder.

dmajkic
AFAIK, you can't pass StringBuilder to Regex.Replace().
Constantin
+3  A: 

If you want to be really fast, and I mean really fast you'll have to look beyond the StringBuilder and just write well optimized code.

One thing your computer doesn't like to do is branching, if you can write a replace method which operates on a fixed array (char *) and doesn't branch you have great performance.

What you'll be doing is that the replace operation is going to search for a sequence of characters and if it finds any such sub string it will replace it. In effect you'll copy the string and when doing so, preform the find and replace.

You'll rely on these functions for picking the index of some buffer to read/write. The goal is to preform the replace method such that when nothing has to change you write junk instead of branching.

You should be able to complete this without a single if statement and remember to use unsafe code. Otherwise you'll be paying for index checking for every element access.

unsafe
{
    fixed( char * p = myStringBuffer )
    {
        // Do fancy string manipulation here
    }
}

I've written code like this in C# for fun and seen significant performance improvements, almost 300% speed up for find and replace. While the .NET BCL (base class library) performs quite well it is riddled with branching constructs and exception handling this will slow down you code if you use the built-in stuff. Also these optimizations while perfectly sound are not preformed by the JIT-compiler and you'll have to run the code as a release build without any debugger attached to be able to observe the massive performance gain.

I could provide you with more complete code but it is a substantial amount of work. However, I can guarantee you that it will be faster than anything else suggested so far.

John Leidegren
I am very interested in your suggestion. Could you please share it when you have time?
Hosam Aly