ansaurus

Question

Most efficient way to remove special characters from string

Answer 1

+2 A:

I would use a String Replace with a Regular Expression searching for "special characters", replacing all characters found with an empty string.

Stephen Wrighton 2009-07-13 15:38:48

+1 certainly less code and arguably more readable ignoring write-once Regex.

kenny 2009-07-13 16:38:02

Answer 2

A:

This doesn't seem inefficient at all. You may be able to improve on it but you may compromise readability / maintainability.

Regards

Howard May 2009-07-13 15:38:56

Answer 3

+6 A:

I suggest creating a simple lookup table, which you can initialize in the static constructor to set any combination of characters to valid. This lets you do a quick, single check.

edit

Also, for speed, you'll want to initialize the capacity of your StringBuilder to the length of your input string. This will avoid reallocations. These two methods together will give you both speed and flexibility.

another edit

I think the compiler might optimize it out, but as a matter of style as well as efficiency, I recommend foreach instead of for.

Steven Sudit 2009-07-13 15:39:11

+1 for efficiency

Blixt 2009-07-13 15:42:56

For arrays, `for` and `foreach` produce similar code. I don't know about strings though. I doubt that the JIT knows about the array-like nature of String.

SealedSun 2009-07-13 15:52:25

I bet the JIT knows more about the array-like nature of string than your [joke removed]. Anders etal did a lot of work optimizing everything about strings in .net

Will 2009-07-13 16:02:00

I've done this using HashSet<char> and it is about 2x slower than his method. Using bool[] is barely faster (0.0469ms/iter v. 0.0559ms/iter) than the version he has in OP...with the problem of being less readable.

sixlettervariables 2009-07-13 16:36:00

I used an int[] with 0 or 1, since alignment affects speed. Wasn't able to find anything faster.

Steven Sudit 2009-07-13 16:40:57

I couldn't see any performance difference between using a bool array and an int array. I would use a bool array, as it brings down the lookup table from 256 kb to 64 kb, but it's still a lot of data for such a trivial function... And it's only about 30% faster.

Guffa 2009-07-13 17:07:10

Guffa, I'm going to answer in parts. 1) I looked up my notes and it seems that I was incorrect. Boolean was about the same speed as Integer, so that's what I used.

Steven Sudit 2009-07-13 18:16:22

@Guffa 2) Given that we're only keeping alphanumerics and a few Basic Latin characters, we only need a table for the low byte, so size isn't really an issue. If we wanted to be general-purpose, then the standard Unicode technique is double-indirection. In other words, a table of 256 table references, many of which point to the same empty table.

Steven Sudit 2009-07-13 18:17:57

@Guffa 3) The speed boost depends very much on how complex the criteria are that we need to check for each character. Compared to, say, two checks for 0-9, the table approach isn't a huge win. But its speed remains constant even if the pattern is effectively random and would take hundreds of checks.

Steven Sudit 2009-07-13 18:19:24

Answer 4

+17 A:

Well, unless you really need to squeeze the performance out of your function, just go with what is easiest to maintain and understand. A regular expression would look like this:

For additional performance, you can either pre-compile it or just tell it to compile on first call (subsequent calls will be faster.)

public static string RemoveSpecialCharacters(string str)
{
    return Regex.Replace(str, "[^a-zA-Z0-9_.]+", "", RegexOptions.Compiled);
}

Blixt 2009-07-13 15:40:55

I'd guess that this is probably a complex enough query that it would be faster than the OP's approach, especially if pre-compiled. I have no evidence to back that up, however. It should be tested. Unless it's drastically slower, I'd choose this approach regardless, since it's way easier to read and maintain. +1

rmeador 2009-07-13 15:48:07

Its a very simple regex (no backtracking or any complex stuff in there) so it should be pretty damn fast.

Will 2009-07-13 16:00:04

Perhaps. But do you want to bet the table-driven approach will be faster?

Steven Sudit 2009-07-13 16:11:30

@rmeador: without it being compiled it is about 5x slower, compiled it is 3x slower than his method. Still 10x simpler though :-D

sixlettervariables 2009-07-13 16:15:53

I'd definitely recommend Steven's method if performance is critical. Regular expressions make text validation/modification simple, not efficient, as many are inclined to think.

Blixt 2009-07-13 16:21:51

Regular expressions are no magical hammers and never faster than hand optimized code.

SealedSun 2009-07-13 16:58:38

Answer 5

+3 A:

A regular expression will look like:

public string RemoveSpecialChars(string input)
{
    return Regex.Replace(input, @"[^0-9a-zA-Z\._]", string.Empty);
}

But if performance is highly important, I recommend you to do some benchmarks before selecting the "regex path"...

CMS 2009-07-13 15:42:06

That regex is about 20% slower than: [^0-9a-zA-Z\._]+ interestingly enough...

sixlettervariables 2009-07-13 16:16:45

Answer 6

+1 A:

It seems good to me. The only improvement I would make is to initialize the StringBuilder with the length of the string.

StringBuilder sb = new StringBuilder(str.Length);

bruno conde 2009-07-13 15:42:34

Answer 7

+3 A:

I'm not convinced your algorithm is anything but efficient. It's O(n) and only looks at each character once. You're not gonna get any better than that unless you magically know values before checking them.

I would however initialize the capacity of your StringBuilder to the initial size of the string. I'm guessing your perceived performance problem comes from memory reallocation.

Side note: Checking A-z is not safe. You're including [, \, ], ^, _, and `...

Side note 2: For that extra bit of efficiency, put the comparisons in an order to minimize the number of comparisons. (At worst, you're talking 8 comparisons tho, so don't think too hard.) This changes with your expected input, but one example could be:

if (str[i] >= '0' && str[i] <= 'z' && 
    (str[i] >= 'a' || str[i] <= '9' ||  (str[i] >= 'A' && str[i] <= 'Z') || 
    str[i] == '_') || str[i] == '.')

Side note 3: If for whatever reason you REALLY need this to be fast, a switch statement may be faster. The compiler should create a jump table for you, resulting in only a single comparison:

switch (str[i])
{
    case '0':
    case '1':
    .
    .
    .
    case '.':
        sb.Append(str[i]);
        break;
}

lc 2009-07-13 15:43:22

I agree that you can't beat O(n) on this one. However, there is a cost per comparison which can be lowered. A table lookup has a low, fixed cost, while a series of comparisons is going to increase in cost as you add more exceptions.

Steven Sudit 2009-07-13 15:47:03

About side note 3, do you really think the jump table would be faster than table lookup?

Steven Sudit 2009-07-13 16:12:55

I ran the quick performance test on the switch solution, and it performs the same as the comparison.

Guffa 2009-07-13 16:54:42

@Steven Sudit - I'd venture they're actually about the same. Care to run a test?

lc 2009-07-13 17:12:31

O(n) notation sometimes pisses me off. People will make stupid assumptions based on the fact the algorithm is already O(n). If we changed this routine to replace the str[i] calls with a function that retrieved the comparison value by constructing a one-time SSL connection with a server on the opposite side of the world... you damn sure would see a massive performance difference and the algorithm is STILL O(n). The cost of O(1) for each algorithm is significant and NOT equivalent!

darron 2009-07-13 17:47:12

@Ic: I'm not sure that straight tables would be much faster, but I doubt they'd be slower. Both approaches involve a lookup; the jump table one also involves an additional branch, and that's typically slow. I would also avoid the jump table solution in the first place because of the inflexibility.

Steven Sudit 2009-07-13 18:24:58

Answer 8

+18 A:

Why do you think that your method is not efficient? It's actually one of the most efficient ways that you can do it.

You should of course read the character into a local variable or use an enumerator to reduce the number of array accesses:

public static string RemoveSpecialCharacters(string str) {
   StringBuilder sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') | || (c >= 'a' && c <= 'z') | c == '.' || c == '_') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

One thing that makes a method like this efficient is that it scales well. The execution time will be relative to the length of the string. There is no nasty surprises if you would use it on a large string.

Edit:
I made a quick performance test, running each function a million times with a 24 character string. These are the results:

Original function: 54.5 ms.
My suggested change: 47.1 ms.
Mine with setting StringBuilder capacity: 43.3 ms.
Regular expression: 294.4 ms.

Edit 2: I added the distinction between A-Z and a-z in the code above. (I reran the performance test, and there is no noticable difference.)

Edit 3:
I tested the lookup+char[] solution, and it runs in about 13 ms.

The prize to pay is of course the initialisation of the huge lookup table and keeping it in memory. Well, it's not that much data, but it's much for such a trivial function...

private static bool[] _lookup;

static Program() {
   _lookup = new bool[65535];
   for (char c = '0'; c <= 9; c++) _lookup[c] = true;
   for (char c = 'A'; c <= 'Z'; c++) _lookup[c] = true;
   for (char c = 'a'; c <= 'z'; c++) _lookup[c] = true;
   _lookup['.'] = true;
   _lookup['_'] = true;
}

public static string RemoveSpecialCharacters(string str) {
   char[] buffer = new char[str.Length];
   int index = 0;
   foreach (char c in str) {
      if (_lookup[c]) {
         buffer[index] = c;
         index++;
      }
   }
   return new string(buffer, 0, index);
}

Guffa 2009-07-13 15:45:46

I agree. The only other change I would make is to add the initial capacity argument to the StringBuilder constructor, "= new StringBuilder(str.Length)".

David 2009-07-13 16:05:07

sixlettervariables 2009-07-13 16:19:06

@David: Yes setting the capacity would give a slight performance improvement.

Guffa 2009-07-13 16:22:00

@sixlettervariables: Yes, that might be what's really intended...

Guffa 2009-07-13 16:27:57

My answer, using a `char[]` buffer rather than `StringBuilder`, has a slight edge on this one according to my testing. (Mine's less readable though, so the small performance benefit probably isn't worth it.)

LukeH 2009-07-13 16:46:26

Luke, while it may well be faster to append to a char buffer than a StringBuffer, we still need to copy it into a string when we're done. The StringBuffer uses a string internally, so there's no additional copy.

Steven Sudit 2009-07-13 16:52:22

@Steven: That may well be the case, but the benchmarks speak for themselves! In my tests, using a `char[]` buffer performs (slightly) better than `StringBuilder`, even when scaling up to strings that are tens of thousands of characters in length.

LukeH 2009-07-13 17:02:11

@Guffa: yours 0.0416ms/string, int[]/bool[] lookup 0.0399ms/string. Yours is also 10x readable.

sixlettervariables 2009-07-13 17:10:54

@Luke: yours is 0.0294ms/string using @Guffa's with char[] v. 0.0416ms/string with StringBuffer. Quite the improvement good sir, couple it with a lookup table and you're at 0.0286ms/string.

sixlettervariables 2009-07-13 17:17:07

@Luke: Ok, then I'll have to benchmark char[]->StringBuffer against straight StringBuffer the next time I have a particular data set to optimize for.

Steven Sudit 2009-07-13 18:21:29

Answer 9

A:

I wonder if a Regex-based replacement (possibly compiled) is faster. Would have to test that.

Other than that, you should initialize the StringBuilder with an expected length, so that the intermediate string doesn't have to be copied around while it grows.

A good number is the length of the original string, or something slightly lower (depending on the nature of the functions inputs).

Finally, you can use a lookup table (in the range 0..127) to find out whether a character is to be accepted.

SealedSun 2009-07-13 15:50:15

Answer 10

A:

If you're worried about speed, use pointers to edit the existing string. You could pin the string and get a pointer to it, then run a for loop over each character, overwriting each invalid character with a replacement character. It would be extremely efficient and would not require allocating any new string memory. You would also need to compile your module with the unsafe option, and add the "unsafe" modifier to your method header in order to use pointers.

static void Main(string[] args)
{
    string str = "string!$%with^&*invalid!!characters";
    Console.WriteLine( str ); //print original string
    FixMyString( str, ' ' );
    Console.WriteLine( str ); //print string again to verify that it has been modified
    Console.ReadLine(); //pause to leave command prompt open
}


public static unsafe void FixMyString( string str, char replacement_char )
{
    fixed (char* p_str = str)
    {
        char* c = p_str; //temp pointer, since p_str is read-only
        for (int i = 0; i < str.Length; i++, c++) //loop through each character in string, advancing the character pointer as well
            if (!IsValidChar(*c)) //check whether the current character is invalid
                (*c) = replacement_char; //overwrite character in existing string with replacement character
    }
}

public static bool IsValidChar( char c )
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c == '.' || c == '_');
    //return char.IsLetterOrDigit( c ) || c == '.' || c == '_'; //this may work as well
}

Triynko 2009-07-13 15:57:36

Noooooooooo! Changing a string in .NET is BAAAAAAAAAAAAD! Everything in the framework relies on the rule that strings are immutable, and if you break that you can get very surprising side effects...

Guffa 2009-07-13 16:52:57

Answer 11

+1 A:

public static string RemoveSpecialCharacters(string str)
{
    char[] buffer = new char[str.Length];
    int idx = 0;

    foreach (char c in str)
    {
        if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z') || (c == '.') || (c == '_'))
        {
            buffer[idx] = c;
            idx++;
        }
    }

    return new string(buffer, 0, idx);
}

LukeH 2009-07-13 16:06:35

+1, tested and it is about 40% faster than StringBuilder. 0.0294ms/string v. 0.0399ms/string

sixlettervariables 2009-07-13 17:14:37

Just to be sure, do you mean StringBuilder with or without pre-allocation?

Steven Sudit 2009-07-13 19:28:54

With pre-allocation, it is still 40% slower than the char[] allocation and new string.

sixlettervariables 2009-07-14 02:28:40

Answer 12

A:

For S&G's, Linq-ified way:

var original = "(*^%foo)(@)&^@#><>?:\":';=-+_";
var valid = new char[] { 
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 
    'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 
    'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 
    'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '1', '2', '3', '4', '5', '6', '7', '8', 
    '9', '0', '.', '_' };
var result = string.Join("",
    (from x in original.ToCharArray() 
     where valid.Contains(x) select x.ToString())
        .ToArray());

I don't think this is going to be the most efficient way, however.

Will 2009-07-13 16:16:56

It's not, because it's a linear search.

Steven Sudit 2009-07-13 16:41:30

Answer 13

A:

StringBuilder sb = new StringBuilder();

for (int i = 0; i < fName.Length; i++)
{
   if (char.IsLetterOrDigit(fName[i]))
    {
       sb.Append(fName[i]);
    }
}

Chamika Somasiri 2010-03-27 19:32:53

Answer 14

A:

public string RemoveSpecial(string evalstr)
{
StringBuilder finalstr = new StringBuilder();
            foreach(char c in evalstr){
            int charassci = Convert.ToInt16(c);
            if (!(charassci >= 33 && charassci <= 47))// special char ???
             finalstr.append(c);
            }
return finalstr.ToString();
}

Shiko 2010-04-27 17:19:06

ansaurus

tags:

views:

answers:

Most efficient way to remove special characters from string

related questions