ansaurus

Question

Regular expression for excluding special characters

Answer 1

+3 A:

It's usually better to whitelist the characters you allow than to blacklist the characters you don't allow, both from a security standpoint and from an ease-of-implementation standpoint.

If you do go down the blacklist route, here is an example. Be warned, though: the syntax is not simple.

http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07

If you want to whitelist all the accented characters, perhaps using Unicode ranges would help? Check out this link.

http://www.regular-expressions.info/unicode.html

Jason Coyne 2009-04-16 15:07:39

Thanks for your reply. We tried whitelisting first but it is not practical since we want to allow any accented characters. We started with this: ^[a-zA-Z0-9. '-]+$ then we had to add all the French characters manually. Now we need all the German ones and so on.

2009-04-16 15:15:48

Have a look at my pattern; it whitelists all characters, including all accented ones.

Lucero 2009-04-16 15:20:12

According to Gaijin's link, Lucero's pattern is too simplistic; check out the section labeled "Unicode Character Properties". (You need something like "\p{L}\p{M}*" to really catch all accented characters.) But I'm quite certain a whitelist is the way to go; a fully-populated blacklist will hurt.

BlairHippo 2009-04-16 15:56:13

Answer 2

+2 A:

Do you really want to blacklist specific characters, or would you rather whitelist the allowed characters?

I assume that you actually want the latter. This is pretty simple (add any additional symbols to whitelist into the [\-] group):

^(?:\p{L}\p{M}*|[\-])*$

Edit: Optimized the pattern with the input from the comments

Lucero 2009-04-16 15:09:43
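As a rough Java sketch of how the pattern from this answer might be applied: the class and method names below are made up for illustration, and the only extra symbol whitelisted is the hyphen from the answer. Note that the backslashes have to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the thread.
final class LetterWhitelist {
    // Any letter followed by optional combining marks, or a literal hyphen,
    // repeated zero or more times. Backslashes are doubled for the Java literal.
    private static final Pattern ALLOWED =
            Pattern.compile("^(?:\\p{L}\\p{M}*|[\\-])*$");

    static boolean isAllowed(String input) {
        // matches() only succeeds if the entire input fits the pattern.
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("M\u00fcller")); // u with umlaut: true
        System.out.println(isAllowed("Jean-Luc"));    // true
        System.out.println(isAllowed("a<b"));         // false
    }
}
```

Because the quantifier is `*`, an empty string is also accepted; switching to `+` would be one way to reject it if that matters for the input being checked.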

This is the right idea, but I don't think the capture group is needed, or in the right place. Wouldn't "[-\p{L}]*", used with the `matches()` method, do just fine?

erickson 2009-04-16 15:34:45

Yes, it should. However, I wasn't sure how the Java regex engine handles [-\p{L}] exactly; I'd at least escape the - character. Or you can make a non-capturing group (which makes the regex a little less easy to read): ^(?:\p{L}|[\-])*$

Lucero 2009-04-16 15:51:56

See the second of Gaijin's two links, under "Unicode Character Properties" -- this might not catch everything it needs to, depending on how the character is encoded. (That page suggests "\p{L}\p{M}*".) But it definitely feels like it's close to being the solution.

BlairHippo 2009-04-16 16:06:07

This depends mainly on whether the string is normalized or not, but yes, this is a valid point.

Lucero 2009-04-16 16:11:19
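To illustrate the normalization point in the comments above, here is a small Java sketch. It assumes the simpler pattern "[-\p{L}]*" from erickson's comment and uses java.text.Normalizer to compose the input to NFC before matching; the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the thread.
final class NormalizedWhitelist {
    // Letters and hyphens only; combining marks (\p{M}) are NOT allowed here.
    private static final Pattern LETTERS = Pattern.compile("[-\\p{L}]*");

    static boolean isAllowedRaw(String input) {
        return LETTERS.matcher(input).matches();
    }

    static boolean isAllowedNormalized(String input) {
        // NFC merges a base letter and its combining mark into a single
        // code point wherever a precomposed form exists.
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        return LETTERS.matcher(nfc).matches();
    }

    public static void main(String[] args) {
        // "Rene" followed by U+0301 (combining acute accent).
        String decomposed = "Rene\u0301";
        System.out.println(isAllowedRaw(decomposed));        // false
        System.out.println(isAllowedNormalized(decomposed)); // true
    }
}
```

Not every letter-plus-mark combination has a precomposed form, so normalizing first is not a complete substitute for allowing \p{M}* after each letter.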

Answer 3

+2 A:

I would just white list the characters.

^[a-zA-Z0-9äöüÄÖÜ]*$

Building a black list is equally simple with regex, but you might need to add many more characters - there are a lot of Chinese symbols in Unicode ... ;)

^[^<>%$]*$

The expression [^(many characters here)] just matches any character that is not listed.

Daniel Brückner 2009-04-16 15:11:30

Your whitelist pattern only includes the German umlauts, but no French or other characters, and there are many common ones, like ñëÿêâôîíì etc. Therefore, only a Unicode character group really makes whitelisting possible with the given requirement.

Lucero 2009-04-16 15:19:05

Of course ... only an example and the Umlaute were easiest to type on a German keyboard.

Daniel Brückner 2009-04-16 15:51:46

You didn't get the point I was trying to make. It's not about your choice of characters as sample, but about not really being able to whitelist all possible combinations.

Lucero 2009-04-16 16:15:19

Why not? There aren't that many accented letters. If you have to manage a separate list for each language, so be it.

Atømix 2009-09-30 16:57:38

@Atomiton, Vietnamese (for example) has 11 vowel nuclei, each of which can have one of 5 accents (ex: ệ) as well as the letter đ. Polish has Ł Ź Ś Ę... Turkish has the dotted I, İ. There are hundreds of different accented letters.

Jacob 2009-09-30 17:03:37

There are a few hundred he wants to include, but there are several thousand he wants to exclude.

Daniel Brückner 2009-09-30 18:11:44

Answer 4

A:

I guess it depends what language you are targeting. In general, something like this should work:

[^<>%$]

The "[]" construct defines a character class, which will match any of the listed characters. Putting "^" as the first character negates the match, i.e. it matches any character OTHER than one of those listed.

You may need to escape some of the characters within the "[]", depending on what language/regex engine you are using.

KarstenF 2009-04-16 15:11:37
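Assuming Java is the target, a negated character class like the one in this answer could be used roughly as follows. The `*` quantifier and the class and method names are additions for illustration; in Java none of these four characters needs escaping inside the brackets.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the thread.
final class NegatedClassCheck {
    // Zero or more characters, none of which is <, >, % or $.
    private static final Pattern NO_SPECIALS = Pattern.compile("[^<>%$]*");

    static boolean isClean(String input) {
        // matches() checks the whole string against the pattern.
        return NO_SPECIALS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isClean("hello world")); // true
        System.out.println(isClean("50% off"));     // false
    }
}
```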

Answer 5

A:

I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed vs. the ones that aren't -- and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".

BlairHippo 2009-04-16 15:14:02

Answer 6

+1 A:

To exclude certain characters (<, >, %, and $), you can make a regular expression like this:

[<>%\$]

This regular expression will match all inputs that have a blacklisted character in them. The brackets define a character class, and the \ is necessary before the dollar sign because dollar sign has a special meaning in regular expressions.

To add more characters to the black list, just insert them between the brackets; order does not matter.

According to some Java documentation for regular expressions, you could use the expression like this:

Pattern p = Pattern.compile("[<>%\$]");
Matcher m = p.matcher(unsafeInputString);
if (m.matches())
{
    // Invalid input: reject it, or remove/change the offending characters.
}
else
{
    // Valid input.
}

David Grayson 2009-04-16 15:15:29

matches() returns true iff the regex matches the whole string, as if it were anchored at both ends with '^' and '$'; you would need to use find() for this approach to work. But see the other answers for why a blacklist is a bad idea.

Alan Moore 2009-04-16 22:18:30

Also, most metacharacters lose their special meanings when they're in a character class, so there's no need to escape the '$'. But if you did need to escape it you would have to use two backslashes ("\\$") because it's in a Java String literal.

Alan Moore 2009-04-16 22:22:18
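Putting the two corrections from the comments together, the blacklist check from this answer might look something like the sketch below: find() instead of matches(), and no backslash before the dollar sign inside the character class. The wrapper class and method name are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the thread.
final class BlacklistCheck {
    // Inside a character class the $ is literal, so no escaping is needed.
    private static final Pattern FORBIDDEN = Pattern.compile("[<>%$]");

    static boolean containsForbidden(String unsafeInputString) {
        Matcher m = FORBIDDEN.matcher(unsafeInputString);
        // find() looks for the pattern anywhere in the input,
        // whereas matches() would require the whole input to match.
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(containsForbidden("a<b"));        // true
        System.out.println(containsForbidden("plain text")); // false
    }
}
```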

Answer 7

A:

Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.

DJClayworth 2009-04-16 18:58:06
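A loop-based check along those lines could be sketched in Java like this. The forbidden set is assumed to be the four example characters used in the other answers, and the class and method names are made up.

```java
// Hypothetical helper; the names are illustrative and not from the thread.
final class CharLoopCheck {
    // Assumed blacklist: the example characters from the other answers.
    private static final String FORBIDDEN = "<>%$";

    static boolean containsForbidden(String input) {
        for (int i = 0; i < input.length(); i++) {
            // indexOf returns -1 when the character is not in the list.
            if (FORBIDDEN.indexOf(input.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsForbidden("a<b"));        // true
        System.out.println(containsForbidden("plain text")); // false
    }
}
```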

Answer 8

A:

Here are all the French accented characters: àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ

I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.

For URLs, I replace accented letters with regular letters like so:

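// Assumes `cleaned` already holds the URL text to convert (declared elsewhere).
string beforeConversion = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
for (int i = 0; i < beforeConversion.Length; i++) {
    // Swap each accented character for the plain one at the same index.
    cleaned = Regex.Replace(cleaned, beforeConversion[i].ToString(), afterConversion[i].ToString());
}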

There's probably a more efficient way, mind you.

Atømix 2009-09-30 16:56:02
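One possible alternative, sketched here in Java rather than C#, is to decompose the text with Unicode normalization and then drop the combining marks. This is a different technique from the lookup-table loop above, not a translation of it, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper; the names are illustrative and not from the thread.
final class AccentStripper {
    static String stripAccents(String input) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // a-grave, A-grave, e-acute, c-cedilla, n-tilde
        System.out.println(stripAccents("\u00e0\u00c0\u00e9\u00e7\u00f1")); // aAecn
    }
}
```

Characters that have no decomposition, such as the typographic apostrophe ’ or letters like đ and Ł, pass through unchanged, so a small lookup table may still be needed for those.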
