ansaurus

Question

C#: Removing common invalid characters from a string: improve this algorithm

Answer 1

+2 A:

Why would you have REALLY LIKED to do that? The code is absolutely no simpler, you're just forcing a query extension method into your code.

As an aside, the Contains check seems redundant, both conceptually and from a performance perspective. Contains has to run through the whole string anyway, you may as well just call Replace(bad.ToString(), string.Empty) for every character and forget about whether or not it's actually present.

Of course, a regular expression is always an option, and may be more performant (if not less clear) in a situation like this.

Adam Robinson 2009-08-25 18:14:37

good point, thanks Adam!

p.campbell 2009-08-25 19:39:09

Answer 2

+15 A:

The string class is immutable (although a reference type), hence all its static methods are designed to return a new string variable. Calling someString.Replace without assigning it to anything will not have any effect in your program. - Seems like you fixed this problem.

The main issue with your suggested algorithm is that it repeatedly assigning many new string variables, potentially causing a big performance hit. LINQ doesn't really help things here. (I doesn't make the code significantly shorter and certainly not any more readable, in my opinion.)

Try the following extension method. The key is the use of StringBuilder, which means only one block of memory is assigned for the result during execution.

private static readonly HashSet<char> badChars = 
    new HashSet<char> { '!', '@', '#', '$', '%', '_' };

public static string CleanString(this string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!badChars.Contains(str[i]))
            result.Append(str[i]);
    }
    return result.ToString();
}

This algorithm also makes use of the .NET 3.5 'HashSet' class to give O(1) look up time for detecting a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it also is lot a better with memory usage, as explained above.

Noldorin 2009-08-25 18:16:23

If you use a hashtable/dictionary to store the bad characters, the lookup would be O(1) instead of O(m). That could also allow for a custom replacement of the character (i.e. if he wanted to replace '@' with 'at' instead of '' in the future).

Erich Mirabal 2009-08-25 18:21:00

@Erich: Good point. I'll edit my code to use that instead.

Noldorin 2009-08-25 18:22:05

If you're going to go to the trouble to obfuscate the code in this manner, why not just use a RegEx?

Adam Robinson 2009-08-25 18:33:27

@Adam: Obfuscated - how so? To me this code is both perfectly clear and very efficient. See my comment on the other post for why not to use RegEx.

Noldorin 2009-08-25 18:38:24

Also. Just a small comment. You have to use str.Length in your loop - not the stringbuilder :) it doesn't work that way

CasperT 2009-08-25 19:06:34

That an algorithm runs in O(nm) does _not_ state that it will always be slower than an algorithm running in O(1) it _only_ states that the execution time of one is affected by the number of two factors (in this case elements in to collections) wheras the other is unaffected by any factors. in this case the O(nm) will run faster than the O(1) for a low nm.

Rune FS 2009-08-25 19:17:22

@Rune FS: I'd like to see your numbers that give you that result. I just tried it with both the list and hashset implementation, and the hashset was always faster (it took 1/2 to 2/3 of the time of the list). My `nm` was the base string he gave (i.e. both the input and badchar were the same size -- 6 characters long) so it was very low as you claim.

Erich Mirabal 2009-08-25 19:40:44

(@Rune FS: as a side note, yes, I agree with you that O(1) can be slower than O(nm) for certain sizes of nm, just not in this case).

Erich Mirabal 2009-08-25 19:43:05

I agree with Erich fully. Rune FS: it should be obvious to you that this algorithm is quicker than the `O(nm)` alternative. Any other reason for the down-vote? :P

Noldorin 2009-08-25 22:50:56

@Noldorin and Erich. First of all any down votes was note from me I up voted cuz I like your solution for optimizing the hashing however the solution I've proposed below measured over 1000000 iterations ran in 27 units of time wheras yours ran in 43 on my machine (excluding the hashing initialisation which bloated it to a factor 10 for one iteration). The split solution is O(nm)

Rune FS 2009-08-26 06:45:25

@myself the above numbers are incorrect the 43 is for the optimized hashing version by28OZ28 the solution in this answer ran in 85

Rune FS 2009-08-26 07:06:06

@Rune FS: Fair enough. Not sure exactly *why* yours runs quicker, though. (Perhaps the internal optimisation of the static `string` methods?)

Noldorin 2009-08-26 10:14:28

There seems no point adding a separate answer (especially since the accepted one is frankly much worse).Have you considered noting that the hash can be replaced with a simple switch based static method given the rules are compile time known...

ShuggyCoUk 2009-09-01 00:15:02

Answer 3

+16 A:

I don't know about the readability of it, but a regular expression could do what you need it to:

someString = Regex.Replace(someString, @"[!@#$%_]", "");

CAbbott 2009-08-25 18:23:20

The regex should probably be either "[!@#$%_]" or "!|@|#|$|%|_".

Daniel Brückner 2009-08-25 18:31:44

Is it faster than Noldorin's solution?

CasperT 2009-08-25 18:32:18

@Daniel - good point, I modified the example.

CAbbott 2009-08-25 18:33:55

@CaspterT: Afraid not - it's going to be `O(nm)` at best, since it uses a state machine. Even compiled regex will naturally be slower. Also, not so practical for a large number of chars.

Noldorin 2009-08-25 18:37:46

@Noldorin - you're right. I was coming at it from a simplicity, elegance standpoint rather than pure performance one.

CAbbott 2009-08-25 18:41:51

@Noldorin: I disagree. The assemblying of a new string will happen only once using the regex, and many times using a lot of Replace calls. Assemblying a new string implies allocating memory (= expensive), and Regex.Replace has better cache locality (= cheaper)

Massa 2009-08-25 18:51:21

Now I would love to see some benchmarks. Replacing or removing chars in a string is common practice!

CasperT 2009-08-25 19:03:47

+1 Yay simplicitly!

Strommy 2009-08-25 19:13:37

@Casper, Noldorin: Regular expressions are a proven technology for string parsing and manipulation, and the technology has evolved into something VERY fast. While this solution isn't as fast as Noldorin's bare-metal solution, it is a great deal more elegant (IMO), and is considerably faster than the constant string reconstruction of the OP's question.

Adam Robinson 2009-08-25 19:55:44

I could recommend to use static method Regex.Replace(input, @"[!@#$%_]", output, RegexOptions.Compiled) because it supports RegexOptions which increases the speed

abatishchev 2009-08-25 21:50:36

@abatishchev: It is still slower than any of the List/Hash/bool[] solutions. I do agree it is more readable.

Erich Mirabal 2009-08-25 21:55:56

@CAbbott: Fair enough then. I'm not sure I agree it's more elegant, though it's probably simpler from most perspectives.

Noldorin 2009-08-25 22:53:00

@CasperT here are some bench marks (avg. over 1,000,000 iterations in milliseconds)Noldorin: 0,0072813432 (HashSet)RuneFS: 0,0030156636 (Split and concatenation)28OZ28: 0,0091563672 (Hashing)CAbbot: 0,01132827 (RegEX)

Rune FS 2009-08-26 07:30:17

Here are my benchmarks. Running over 2 Million iterations, using Stopwatch to get millisecond timings: List<char> = 7337. HashSet<char> = 4892. bool[] = 2940. BitArray = 3105. RegEx (compiled) = 8700. Smaller is better (obviously), and bool[] was the fastest (if not the most memory hungry). BitArray gives a very nice trade-off between speed and size.

Erich Mirabal 2009-08-26 12:24:40

Forgot about the original Replace implementation. That one, using the same conditions as the others, took 4851 milliseconds. So faster than the List, edges out the HashSet, but still slower than the bool/BitArray options.

Erich Mirabal 2009-08-26 12:35:34

Answer 4

+3 A:

Something to consider -- if this is for passwords (say), you want to scan for and keep good characters, and assume everything else is bad. Its easier to correctly filter or good things, then try to guess all bad things.

For Each Character If Character is Good -> Keep it (copy to out buffer, whatever.)

jeff

Jeff Mitchell 2009-08-25 18:27:30

Answer 5

+3 A:

if you still want to do it in a LINQy way:

public static string CleanUp(this string orig)
{
    var badchars = new List<char>() { '!', '@', '#', '$', '%', '_' };

    return new string(orig.ToCharArray().Where(c => !badchars.Contains(c)).ToArray());
}

Mike Jacobs 2009-08-25 18:38:18

You don't need the ToCharArray at all :) Remove it and the code still works as it should.

CasperT 2009-08-25 19:10:10

And you can use a HashSet instead of a List.

Robert Rossney 2009-08-25 21:16:56

Answer 6

+4 A:

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
someString = string.Concat(someString.Split(BAD_CHARS,StringSplitOption.RemoveEmptyEntries));

should do the trick (sorry for any smaller syntax errors I'm on my phone)

Rune FS 2009-08-25 19:12:41

This is a great suggestion!

p.campbell 2009-08-26 16:35:46

Answer 7

+7 A:

This one is faster than HashSet<T>. Also, if you have to perform this action often, please consider the foundations for this question I asked here.

private static readonly bool[] BadCharValues;

static StaticConstructor()
{
    BadCharValues = new bool[char.MaxValue+1];
    char[] badChars = { '!', '@', '#', '$', '%', '_' };
    foreach (char c in badChars)
        BadCharValues[c] = true;
}

public static string CleanString(string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!BadCharValues[str[i]])
            result.Append(str[i]);
    }
    return result.ToString();
}

280Z28 2009-08-25 20:50:47

You basically created an expanded version of the hashset where you are guaranteed no collisions and no need to compute a hash value since the hash value is the character value. +1 since it is faster without paying a big memory penalty.

Erich Mirabal 2009-08-25 21:36:05

(for the curious: this is still faster than using a BitArray, even though bool[] takes up a bit (ahem) more space)

Erich Mirabal 2009-08-25 21:48:23

The memory penalty is significant, but indeed this uses an optimised version of the `HashSet`. Oh, and thanks for copying my code without attribute.

Noldorin 2009-08-26 10:12:24

a 65K lookup table with only 6 entries of note will almost certainly perform worse than a hash based approach due to the memory pressure on your cache. If it's compile time known write it as a switch statement and be done with it...

ShuggyCoUk 2009-09-01 00:12:46

Answer 8

A:

This is pretty clean. Restricts it to valid characters instead of removing invalid ones. You should split it to constants probably:

string clean = new string(@"Sour!ce Str&*(@ing".Where(c => 
@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray()

uosɐſ 2009-08-25 21:42:55

Why the down-vote??

uosɐſ 2009-09-18 18:55:26

ansaurus

tags:

views:

answers:

C#: Removing common invalid characters from a string: improve this algorithm

related questions