What is the best way to remove all the special characters from a string - like these:

!@#$%^&*(){}|:"?><,./;'[]\=-

The strings these characters are removed from will be rather short, so would it be better to use a regex on each or just use string manipulation?

Thx

Environment == C#/.NET

A: 

Use the "tr" command?

You don't say what environment you're in... shell? C program? Java? Each of those would have a different best solution.

+4  A: 

It's generally better to have a whitelist than a blacklist.

Regex has a convenient \w that effectively means alphanumeric plus underscore (some variants also add accented characters such as á, é, ô to the list; others don't).

You can invert that by using \W to mean everything that's not alphanumeric or underscore.

So replacing \W with an empty string will remove all 'special' characters.


Alternatively, if you need a different set of characters than the alphanumerics, you can use a negated character class: [^abc] will match everything that is not a, b or c, and [^a-z] will match everything that is not in the range a, b, c, d ... x, y, z.

The equivalent to \w is [A-Za-z0-9_] and thus \W is [^A-Za-z0-9_]
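
As a rough illustration of the patterns above, here is a small Python sketch (Python is used only for demonstration; the sample string and variable names are made up). It also shows the "some variants" caveat: Python 3's \w matches accented letters unless the re.ASCII flag is passed.

```python
import re

# Made-up sample input; \u00e9 is the accented letter e-acute
sample = "Hello, World! (caf\u00e9_42)"

# With re.ASCII, \W behaves like [^A-Za-z0-9_]
ascii_only = re.sub(r"\W", "", sample, flags=re.ASCII)

# Without re.ASCII, Python 3 treats accented letters as word characters
unicode_aware = re.sub(r"\W", "", sample)

# Negated character class: keep only ASCII letters, digits and spaces
custom = re.sub(r"[^A-Za-z0-9 ]", "", sample)

print(ascii_only)     # accented letter removed, underscore kept
print(unicode_aware)  # accented letter and underscore kept
print(custom)         # underscore removed, spaces kept
```

The negated class is the one to reach for when the underscore should go too, since \W leaves it in place.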

Peter Boughton
A: 

In what language are you doing the regex?

For example, in Perl you can do a translation which would translate any of the chars in your list into nothing:

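e.g. This will delete every 'a', 'b', 'c' or 'd' (the /d modifier is required; without it, tr with an empty replacement list only counts the matches):

$sentence =~ tr/abcd//d;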
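
For comparison, the same "translate characters to nothing" idea can be sketched without regex in Python using str.translate. This is only an illustrative sketch: the names SPECIAL, DELETE_TABLE and strip_special are made up here, and the character list is copied from the question.

```python
# Characters to delete, copied from the list in the question
SPECIAL = "!@#$%^&*(){}|:\"?><,./;'[]\\=-"

# The third argument of str.maketrans lists characters to map to None (delete)
DELETE_TABLE = str.maketrans("", "", SPECIAL)


def strip_special(text: str) -> str:
    """Return text with every character in SPECIAL removed."""
    return text.translate(DELETE_TABLE)


print(strip_special("a/b=c; (d) [e]!"))
```

Because the table is built once and reused, this suits the case in the question where many short strings are cleaned one after another.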
Assaf Lavie
+2  A: 

I prefer regex because the syntax is simpler to read and maintain:

# in Python
import re
re.sub("[abcdef]", "", text)

where abcdef are the properly escaped characters to be removed.
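
One way to get the "properly escaped" part right without escaping by hand is to let re.escape do it. The following is a minimal sketch under that assumption; SPECIAL, PATTERN and remove_special are hypothetical names, and the character list is the one from the question.

```python
import re

# The characters listed in the question
SPECIAL = "!@#$%^&*(){}|:\"?><,./;'[]\\=-"

# re.escape backslash-escapes regex metacharacters such as ], ^, - and \
# so that each one is treated literally inside the character class
PATTERN = re.compile("[" + re.escape(SPECIAL) + "]")


def remove_special(text: str) -> str:
    """Remove every character that appears in SPECIAL."""
    return PATTERN.sub("", text)


print(remove_special("price: $4.50 (approx.)"))
```

Compiling the pattern once keeps the per-string cost low when it is applied to many short inputs.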

Alternatively, if you want only alphanumeric characters (plus the underscore), you could use:

re.sub(r"\W", "", text)

where \W represents a non-word character, i.e. [^a-zA-Z_0-9].

Zach Scrivena
+1  A: 

When you just want to have alphanumeric characters, you could just express this by using an inverted character class:

[^A-Za-z0-9]+

This matches every run of one or more characters that are not alphanumeric.

Gumbo
not quite, you forgot A-Z I think :)
Robert
This can be simplified to \w
Unkwntech
\w stands for [A-Za-z0-9_] and I’m not sure if he wants the low line (underscore) as well.
Gumbo
+3  A: 

In PHP:

$tests = array(
     'hello, world!'
    ,'this is a test'
    ,'and so is this'
    ,'another test with /slashes/ & (parenthesis)'
    ,'l3375p34k stinks'
);

function strip_non_alphanumerics( $subject )
{
    return preg_replace( '/[^a-z0-9]/i', '', $subject );
}

foreach( $tests as $test )
{
    printf( "%s\n", strip_non_alphanumerics( $test ) );
}

output would be:

helloworld
thisisatest
andsoisthis
anothertestwithslashesparenthesis
l3375p34kstinks
Kris
I might add some test cases with capital letters.
jm
@jm: that's a good idea too; I was just too lazy to type any, and the "i" following the regex makes it case-insensitive ;)
Kris
P.S. Why is this the accepted answer if the question now states the environment to be .NET? (I don't think it did when I answered.) It wouldn't be too different conceptually in, for example, C#, but it would look nothing like this.
Kris
+1  A: 

Here's a simple regex:

[^\w]

This should catch all non-word characters. It will permit a-z, A-Z, 0-9, space and _. Neither space nor _ were in your list, so this works. If you wanted to catch these also, then I would do something like this:

/[a-z0-90/i

This is the PHP format for a-z and 0-9; the i makes it case-insensitive.

Unkwntech
This is wrong. \w does *not* include space. It is also overly complex to do "[^\w]" instead of just "\W". And your second expression will not work - it has a zero in place of closing bracket. This is also not a PHP-specific format, it works for many different forms.
Peter Boughton
A: 

You could instead validate them at the front end by checking the ASCII values of the characters as they are keyed in.
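
The character-code check itself might look roughly like the sketch below. It is written in Python purely to show the logic (a real front end would do the same comparison in its own language), and is_allowed and keep_allowed are made-up names. It assumes only ASCII letters and digits should be accepted.

```python
def is_allowed(ch: str) -> bool:
    """True if ch is an ASCII digit or letter, judged by its character code."""
    code = ord(ch)
    return (
        48 <= code <= 57       # '0'..'9'
        or 65 <= code <= 90    # 'A'..'Z'
        or 97 <= code <= 122   # 'a'..'z'
    )


def keep_allowed(text: str) -> str:
    """Drop every character whose code falls outside the allowed ranges."""
    return "".join(ch for ch in text if is_allowed(ch))


print(keep_allowed("R2-D2 & C-3PO!"))
```

Front-end checks like this are easy to bypass, so the same filtering is usually repeated wherever the data is finally consumed.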

Satish