ansaurus

Question

Removing nonnumeric and nonalpha characters from a string?

Answer 1

A:

Use the "tr" command?

You don't say what environment you're in... shell? C program? Java? Each of those would have a different best solution.

2009-02-09 14:53:05

Answer 2

+4 A:

It's generally better to have a whitelist than a blacklist.

Regex has a convenient \w that effectively means alphanumeric plus underscore (some variants also add accented chars (á, é, ô, etc.) to the list, others don't).

You can invert that by using \W to mean everything that's not alphanumeric.

So replacing \W with an empty string will remove all 'special' characters.

Alternatively, if you need a different set of characters than alphanumeric, you can use a negated character class: [^abc] will match everything that is not a or b or c, and [^a-z] will match everything that is not in the range a,b,c,d...x,y,z.

The equivalent to \w is [A-Za-z0-9_] and thus \W is [^A-Za-z0-9_]
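As a minimal sketch of both approaches in Python (the function names and the `allowed` parameter are hypothetical, chosen only for illustration): the first function removes everything \W matches, the second uses a negated character class as a whitelist. Python 3 is one of the variants where \w also covers accented letters, so the sketch passes `re.ASCII` to get the [A-Za-z0-9_] meaning described above.

```python
import re

def strip_non_word(text: str) -> str:
    """Remove every non-word character, using ASCII-only semantics."""
    # re.ASCII restricts \w to [A-Za-z0-9_]; without it, Python 3 also
    # treats accented letters and other Unicode word characters as \w.
    return re.sub(r"\W", "", text, flags=re.ASCII)

def strip_except(text: str, allowed: str = "A-Za-z0-9") -> str:
    """Remove every character outside a whitelisted character class."""
    # `allowed` is placed inside [^...] as-is, so it must already be
    # valid character-class syntax (hypothetical parameter).
    return re.sub(f"[^{allowed}]", "", text)

sample = "snake_case: caf\u00e9!"
print(strip_non_word(sample))  # snake_casecaf
print(strip_except(sample))    # snakecasecaf
```

The only difference between the two results is the underscore, which \w keeps and the explicit whitelist drops.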

Peter Boughton 2009-02-09 14:53:58

Answer 3

A:

In what language are you doing the regex?

For example, in Perl you can do a translation which would translate any of the chars in your list into nothing:

e.g. this will delete every 'a', 'b', 'c' or 'd' (the /d modifier tells tr to delete matched characters that have no replacement):

$sentence =~ tr/abcd//d;
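For comparison, the same delete-by-translation idea can be sketched in Python without any regex. This is only an illustration, and `delete_chars` is a hypothetical helper name; it relies on the standard `str.maketrans` and `str.translate` methods.

```python
def delete_chars(text: str, unwanted: str) -> str:
    """Delete every character listed in `unwanted` from `text`."""
    # The third argument of str.maketrans lists characters that map to
    # None, and str.translate drops those characters from the result.
    table = str.maketrans("", "", unwanted)
    return text.translate(table)

print(delete_chars("abracadabra", "abcd"))  # rr
```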

Assaf Lavie 2009-02-09 14:56:08

Answer 4

+2 A:

I prefer regex because the syntax is simpler to read and maintain:

# in Python
import re
re.sub("[abcdef]", "", text)

where abcdef are the properly escaped characters to be removed.

Alternatively, if you want only alphanumeric characters (plus the underscore), you could use:

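re.sub(r"\W", "", text)

where \W represents a non-word character, i.e. [^a-zA-Z_0-9] for ASCII matching (in Python 3, str patterns also treat Unicode letters and digits as word characters unless re.ASCII is passed).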

Zach Scrivena 2009-02-09 14:56:21

Answer 5

+1 A:

When you just want to have alphanumeric characters, you could just express this by using an inverted character class:

[^A-Za-z0-9]+

This means: every character that is not alphanumeric.
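The trailing + does not change the result of a replace-with-nothing; it only lets each run of unwanted characters be removed in one substitution. A small sketch in Python (the sample string is an assumption for illustration), using `re.subn` to report how many substitutions were made:

```python
import re

text = "hello, world!"

# With "+", each run of unwanted characters is one substitution.
cleaned_runs, count_runs = re.subn(r"[^A-Za-z0-9]+", "", text)

# Without "+", every unwanted character is replaced individually.
cleaned_single, count_single = re.subn(r"[^A-Za-z0-9]", "", text)

print(cleaned_runs, count_runs)      # helloworld 2
print(cleaned_single, count_single)  # helloworld 3
```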

Gumbo 2009-02-09 14:56:42

not quite, you forgot A-Z I think :)

Robert 2009-02-09 14:58:57

This can be simplified to \w

Unkwntech 2009-02-09 15:02:41

\w stands for [A-Za-z0-9_] and I’m not sure if he wants the low line (underscore) as well.

Gumbo 2009-02-09 15:08:06

Answer 6

+3 A:

In PHP:

$tests = array(
     'hello, world!'
    ,'this is a test'
    ,'and so is this'
    ,'another test with /slashes/ & (parenthesis)'
    ,'l3375p34k stinks'
);

function strip_non_alphanumerics( $subject )
{
    return preg_replace( '/[^a-z0-9]/i', '', $subject );
}

foreach( $tests as $test )
{
    printf( "%s\n", strip_non_alphanumerics( $test ) );
}

Output would be:

helloworld
thisisatest
andsoisthis
anothertestwithslashesparenthesis
l3375p34kstinks
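A rough Python counterpart of the function above, as a sketch rather than a drop-in port. The test strings with capital letters are made up for illustration, and `re.ASCII` is added as an assumption so that case-insensitive matching stays limited to ASCII letters.

```python
import re

# IGNORECASE plays the role of the trailing "i" in the PHP pattern;
# ASCII keeps the class equivalent to [^A-Za-z0-9].
NON_ALNUM = re.compile(r"[^a-z0-9]", flags=re.IGNORECASE | re.ASCII)

def strip_non_alphanumerics(subject: str) -> str:
    """Return `subject` with every non-alphanumeric character removed."""
    return NON_ALNUM.sub("", subject)

tests = ["Hello, World!", "MiXeD case & (Parentheses)", "l3375p34k stinks"]
for test in tests:
    print(strip_non_alphanumerics(test))
# HelloWorld
# MiXeDcaseParentheses
# l3375p34kstinks
```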

Kris 2009-02-09 14:59:02

I might add some test cases with capital letters.

jm 2010-01-27 00:07:12

@jm: that's a good thing too, I was just too lazy to type any, and the "i" following the regex makes the regex case-insensitive ;)

Kris 2010-01-27 13:23:01

P.S. Why is this the accepted answer if the question now states environment to be .NET? (I don't think it did when I answered). It wouldn't be too conceptually different in for example C#, but would look nothing like this.

Kris 2010-01-27 13:26:24

Answer 7

+1 A:

Here's a simple regex:

[^\w]

This should catch all non-word characters. It will permit a-z, A-Z, 0-9, space and _. Neither space nor _ were in your list, so this works. If you wanted to catch these also, then I would do something like this:

/[a-z0-90/i

This is the PHP format for a-z and 0-9; the i makes it case-insensitive.

Unkwntech 2009-02-09 15:04:42

This is wrong. \w does *not* include space. It is also overly complex to do "[^\w]" instead of just "\W". And your second expression will not work - it has a zero in place of closing bracket. This is also not a PHP-specific format, it works for many different forms.

Peter Boughton 2009-02-10 13:46:57

Answer 8

A:

Alternatively, you can validate them at the frontend by checking the ASCII values of the characters as they are keyed in.

Satish 2009-02-09 16:50:08
