tags:

views:

1318

answers:

8

Consider the requirement to strip invalid characters from a string. The characters just need to be removed and replace with blank or string.Empty.

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example

foreach (char bad in BAD_CHARS)
{
    if (someString.Contains(bad))
      someString = someString.Replace(bad.ToString(), string.Empty);
}

I'd have really liked to do this:

if (BAD_CHARS.Any(bc => someString.Contains(bc)))
    someString.Replace(bc,string.Empty); // bc is out of scope

Question: Do you have any suggestions on refactoring this algoritm, or any simpler, easier to read, performant, maintainable algorithms?

+2  A: 

Why would you have REALLY LIKED to do that? The code is absolutely no simpler, you're just forcing a query extension method into your code.

As an aside, the Contains check seems redundant, both conceptually and from a performance perspective. Contains has to run through the whole string anyway, you may as well just call Replace(bad.ToString(), string.Empty) for every character and forget about whether or not it's actually present.

Of course, a regular expression is always an option, and may be more performant (if not less clear) in a situation like this.

Adam Robinson
good point, thanks Adam!
p.campbell
+15  A: 

The string class is immutable (although a reference type), hence all its static methods are designed to return a new string variable. Calling someString.Replace without assigning it to anything will not have any effect in your program. - Seems like you fixed this problem.

The main issue with your suggested algorithm is that it repeatedly assigning many new string variables, potentially causing a big performance hit. LINQ doesn't really help things here. (I doesn't make the code significantly shorter and certainly not any more readable, in my opinion.)

Try the following extension method. The key is the use of StringBuilder, which means only one block of memory is assigned for the result during execution.

private static readonly HashSet<char> badChars = 
    new HashSet<char> { '!', '@', '#', '$', '%', '_' };

public static string CleanString(this string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!badChars.Contains(str[i]))
            result.Append(str[i]);
    }
    return result.ToString();
}

This algorithm also makes use of the .NET 3.5 'HashSet' class to give O(1) look up time for detecting a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it also is lot a better with memory usage, as explained above.

Noldorin
If you use a hashtable/dictionary to store the bad characters, the lookup would be O(1) instead of O(m). That could also allow for a custom replacement of the character (i.e. if he wanted to replace '@' with 'at' instead of '' in the future).
Erich Mirabal
@Erich: Good point. I'll edit my code to use that instead.
Noldorin
If you're going to go to the trouble to obfuscate the code in this manner, why not just use a RegEx?
Adam Robinson
@Adam: Obfuscated - how so? To me this code is both perfectly clear and very efficient. See my comment on the other post for why not to use RegEx.
Noldorin
Also. Just a small comment. You have to use str.Length in your loop - not the stringbuilder :) it doesn't work that way
CasperT
That an algorithm runs in O(nm) does _not_ state that it will always be slower than an algorithm running in O(1) it _only_ states that the execution time of one is affected by the number of two factors (in this case elements in to collections) wheras the other is unaffected by any factors. in this case the O(nm) will run faster than the O(1) for a low nm.
Rune FS
@Rune FS: I'd like to see your numbers that give you that result. I just tried it with both the list and hashset implementation, and the hashset was always faster (it took 1/2 to 2/3 of the time of the list). My `nm` was the base string he gave (i.e. both the input and badchar were the same size -- 6 characters long) so it was very low as you claim.
Erich Mirabal
(@Rune FS: as a side note, yes, I agree with you that O(1) can be slower than O(nm) for certain sizes of nm, just not in this case).
Erich Mirabal
I agree with Erich fully. Rune FS: it should be obvious to you that this algorithm is quicker than the `O(nm)` alternative. Any other reason for the down-vote? :P
Noldorin
@Noldorin and Erich. First of all any down votes was note from me I up voted cuz I like your solution for optimizing the hashing however the solution I've proposed below measured over 1000000 iterations ran in 27 units of time wheras yours ran in 43 on my machine (excluding the hashing initialisation which bloated it to a factor 10 for one iteration). The split solution is O(nm)
Rune FS
@myself the above numbers are incorrect the 43 is for the optimized hashing version by28OZ28 the solution in this answer ran in 85
Rune FS
@Rune FS: Fair enough. Not sure exactly *why* yours runs quicker, though. (Perhaps the internal optimisation of the static `string` methods?)
Noldorin
There seems no point adding a separate answer (especially since the accepted one is frankly much worse).Have you considered noting that the hash can be replaced with a simple switch based static method given the rules are compile time known...
ShuggyCoUk
+16  A: 

I don't know about the readability of it, but a regular expression could do what you need it to:

someString = Regex.Replace(someString, @"[!@#$%_]", "");
CAbbott
The regex should probably be either "[!@#$%_]" or "!|@|#|$|%|_".
Daniel Brückner
Is it faster than Noldorin's solution?
CasperT
@Daniel - good point, I modified the example.
CAbbott
@CaspterT: Afraid not - it's going to be `O(nm)` at best, since it uses a state machine. Even compiled regex will naturally be slower. Also, not so practical for a large number of chars.
Noldorin
@Noldorin - you're right. I was coming at it from a simplicity, elegance standpoint rather than pure performance one.
CAbbott
@Noldorin: I disagree. The assemblying of a new string will happen only once using the regex, and many times using a lot of Replace calls. Assemblying a new string implies allocating memory (= expensive), and Regex.Replace has better cache locality (= cheaper)
Massa
Now I would love to see some benchmarks. Replacing or removing chars in a string is common practice!
CasperT
+1 Yay simplicitly!
Strommy
@Casper, Noldorin: Regular expressions are a proven technology for string parsing and manipulation, and the technology has evolved into something VERY fast. While this solution isn't as fast as Noldorin's bare-metal solution, it is a great deal more elegant (IMO), and is considerably faster than the constant string reconstruction of the OP's question.
Adam Robinson
I could recommend to use static method Regex.Replace(input, @"[!@#$%_]", output, RegexOptions.Compiled) because it supports RegexOptions which increases the speed
abatishchev
@abatishchev: It is still slower than any of the List/Hash/bool[] solutions. I do agree it is more readable.
Erich Mirabal
@CAbbott: Fair enough then. I'm not sure I agree it's more elegant, though it's probably simpler from most perspectives.
Noldorin
@CasperT here are some bench marks (avg. over 1,000,000 iterations in milliseconds)Noldorin: 0,0072813432 (HashSet)RuneFS: 0,0030156636 (Split and concatenation)28OZ28: 0,0091563672 (Hashing)CAbbot: 0,01132827 (RegEX)
Rune FS
Here are my benchmarks. Running over 2 Million iterations, using Stopwatch to get millisecond timings: List<char> = 7337. HashSet<char> = 4892. bool[] = 2940. BitArray = 3105. RegEx (compiled) = 8700. Smaller is better (obviously), and bool[] was the fastest (if not the most memory hungry). BitArray gives a very nice trade-off between speed and size.
Erich Mirabal
Forgot about the original Replace implementation. That one, using the same conditions as the others, took 4851 milliseconds. So faster than the List, edges out the HashSet, but still slower than the bool/BitArray options.
Erich Mirabal
+3  A: 

Something to consider -- if this is for passwords (say), you want to scan for and keep good characters, and assume everything else is bad. Its easier to correctly filter or good things, then try to guess all bad things.

For Each Character If Character is Good -> Keep it (copy to out buffer, whatever.)

jeff

Jeff Mitchell
+3  A: 

if you still want to do it in a LINQy way:

public static string CleanUp(this string orig)
{
    var badchars = new List<char>() { '!', '@', '#', '$', '%', '_' };

    return new string(orig.ToCharArray().Where(c => !badchars.Contains(c)).ToArray());
}
Mike Jacobs
You don't need the ToCharArray at all :) Remove it and the code still works as it should.
CasperT
And you can use a HashSet instead of a List.
Robert Rossney
+4  A: 
char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
someString = string.Concat(someString.Split(BAD_CHARS,StringSplitOption.RemoveEmptyEntries));

should do the trick (sorry for any smaller syntax errors I'm on my phone)

Rune FS
This is a great suggestion!
p.campbell
+7  A: 

This one is faster than HashSet<T>. Also, if you have to perform this action often, please consider the foundations for this question I asked here.

private static readonly bool[] BadCharValues;

static StaticConstructor()
{
    BadCharValues = new bool[char.MaxValue+1];
    char[] badChars = { '!', '@', '#', '$', '%', '_' };
    foreach (char c in badChars)
        BadCharValues[c] = true;
}

public static string CleanString(string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!BadCharValues[str[i]])
            result.Append(str[i]);
    }
    return result.ToString();
}
280Z28
You basically created an expanded version of the hashset where you are guaranteed no collisions and no need to compute a hash value since the hash value is the character value. +1 since it is faster without paying a big memory penalty.
Erich Mirabal
(for the curious: this is still faster than using a BitArray, even though bool[] takes up a bit (ahem) more space)
Erich Mirabal
The memory penalty is significant, but indeed this uses an optimised version of the `HashSet`. Oh, and thanks for copying my code without attribute.
Noldorin
a 65K lookup table with only 6 entries of note will almost certainly perform worse than a hash based approach due to the memory pressure on your cache. If it's compile time known write it as a switch statement and be done with it...
ShuggyCoUk
A: 

This is pretty clean. Restricts it to valid characters instead of removing invalid ones. You should split it to constants probably:

string clean = new string(@"Sour!ce Str&*(@ing".Where(c => 
@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray()
uosɐſ
Why the down-vote??
uosɐſ