Hi,

I am having trouble coming up with a regular expression that would essentially blacklist certain special characters.

I need to use this to validate data in input fields (in a Java web app). We want to allow users to enter any digit, any letter (including accented characters, e.g. French or German) and some special characters such as ' - . etc.

How do I blacklist characters such as <>%$ etc?

Any help would be greatly appreciated.

+3  A: 

It's usually better to whitelist the characters you allow than to blacklist the characters you don't allow, both from a security standpoint and from an ease-of-implementation standpoint.

If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.

http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07

If you want to whitelist all the accented characters, perhaps using Unicode ranges would help? Check out this link.

http://www.regular-expressions.info/unicode.html

Jason Coyne
Thanks for your reply. We tried whitelisting first, but it is not practical since we want to allow any accented characters. We started with this: ^[a-zA-Z0-9. '-]+$ then we had to add all the French characters manually. Now we need all the German ones, and so on.
Have a look at my pattern; it whitelists all letters, including all accented ones.
Lucero
According to Gaijin's link, Lucero's pattern is too simplistic; check out the section labeled "Unicode Character Properties". (You need something like "\p{L}\p{M}*" to really catch all accented characters.) But I'm quite certain a whitelist is the way to go; a fully-populated blacklist will hurt.
BlairHippo
+2  A: 

Do you really want to blacklist specific characters, or rather whitelist the allowed characters?

I assume that you actually want the latter. This is pretty simple (add any additional symbols you want to whitelist to the [\-] character class):

^(?:\p{L}\p{M}*|[\-])*$

Edit: Optimized the pattern with the input from the comments

Lucero
This is the right idea, but I don't think the capture group is needed, or in the right place. Wouldn't "[-\p{L}]*", used with the `matches()` method, do just fine?
erickson
Yes, it should. However, I wasn't sure exactly how the Java regex engine handles [-\p{L}]; I'd at least escape the - character. Or you can make a non-capturing group (which makes the regex a little less easy to read): ^(?:\p{L}|[\-])*$
Lucero
See the second of Gaijin's two links, under "Unicode Character Properties" -- this might not catch everything it needs to, depending on how the character is encoded. (That page suggests "\p{L}\p{M}*".) But it definitely feels like it's close to being the solution.
BlairHippo
This depends mainly on whether the string is normalized or not, but yes, this is a valid point.
Lucero
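For reference, the whitelist idea from this answer and its comments could look roughly like this in Java. This is only a sketch: the class name NameValidator is made up, and the allowed punctuation (digits, space, period, apostrophe, hyphen) is an assumption taken from the asker's comment above, so adjust it to your own rules.

```java
import java.util.regex.Pattern;

final class NameValidator {
    // A letter optionally followed by combining marks, or one of a few
    // explicitly allowed characters. The extra characters are an assumption
    // based on the asker's original pattern.
    private static final Pattern ALLOWED =
            Pattern.compile("(?:\\p{L}\\p{M}*|[0-9 .'\\-])*");

    static boolean isValid(String input) {
        // matches() has to consume the whole input, so ^ and $ are not needed.
        return ALLOWED.matcher(input).matches();
    }
}
```

Because of the \p{M}* part, the same name is accepted whether ë arrives as a single code point (U+00EB) or as e followed by a combining diaeresis (U+0308). Note that the trailing * also accepts the empty string; use + instead if empty input should be rejected.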
+2  A: 

I would just white list the characters.

^[a-zA-Z0-9äöüÄÖÜ]*$

Building a blacklist is equally simple with regex, but you might need to add many more characters - there are a lot of Chinese symbols in Unicode ... ;)

^[^<>%$]*$

The expression [^(many characters here)] just matches any character that is not listed.

Daniel Brückner
Your whitelist pattern only includes the German umlauts, but no French or other characters - and there are many common ones, like ñëÿêâôîíì etc. So basically, only a Unicode character category makes whitelisting possible with the given requirement.
Lucero
Of course ... only an example and the Umlaute were easiest to type on a German keyboard.
Daniel Brückner
You didn't get the point I was trying to make. It's not about your choice of characters as a sample, but about not really being able to whitelist all possible combinations.
Lucero
Why not? There aren't that many accented letters. If you have to manage a separate list for each language, so be it.
Atømix
@Atomiton, Vietnamese (for example) has 11 vowel nuclei, each of which can have one of 5 accents (ex: ệ) as well as the letter đ. Polish has Ł Ź Ś Ę... Turkish has the dotted I, İ. There are hundreds of different accented letters.
Jacob
There are a few hundred he wants to include, but there are several thousand he wants to exclude.
Daniel Brückner
A: 

I guess it depends on what language you are targeting. In general, something like this should work:

[^<>%$]

The "[]" construct defines a character class, which will match any one of the listed characters. Putting "^" as the first character negates the class, i.e. it matches any character OTHER than the ones listed.

You may need to escape some of the characters within the "[]", depending on what language/regex engine you are using.

KarstenF
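As a rough Java sketch of the negated character class described above (the class and method names are made up for illustration), the whole input can be checked with matches():

```java
import java.util.regex.Pattern;

final class NegatedClassCheck {
    // Zero or more characters, none of which is <, >, % or $.
    // Inside a character class the $ is a literal, so it is not escaped.
    private static final Pattern NO_FORBIDDEN = Pattern.compile("[^<>%$]*");

    static boolean isClean(String input) {
        // matches() only succeeds if the pattern covers the entire input.
        return NO_FORBIDDEN.matcher(input).matches();
    }
}
```

Keep in mind that a negated class accepts everything that is not listed, including control characters and line breaks.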
A: 

I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed vs. the ones that aren't -- and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".

BlairHippo
+1  A: 

To exclude certain characters (<, >, %, and $), you can make a regular expression like this:

[<>%\$]

This regular expression will match any blacklisted character in the input. The brackets define a character class, and the \ is necessary before the dollar sign because the dollar sign has a special meaning in regular expressions.

To add more characters to the black list, just insert them between the brackets; order does not matter.

According to some Java documentation for regular expressions, you could use the expression like this:

Pattern p = Pattern.compile("[<>%\$]");
Matcher m = p.matcher(unsafeInputString);
if (m.matches())
{
    // Invalid input: reject it, or remove/change the offending characters.
}
else
{
    // Valid input.
}
David Grayson
matches() returns true iff the regex matches the whole string, as if it were anchored at both ends with '^' and '$'; you would need to use find() for this approach to work. But see the other answers for why a blacklist is a bad idea.
Alan Moore
Also, most metacharacters lose their special meanings when they're in a character class, so there's no need to escape the '$'. But if you did need to escape it you would have to use two backslashes ("\\$") because it's in a Java String literal.
Alan Moore
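Putting the two comments above together, a find()-based version of the blacklist check might look like the sketch below. The class and method names are made up; the pattern is the one from the answer with the unneeded escape dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlacklistFinder {
    // No escaping needed: $ has no special meaning inside a character class.
    private static final Pattern FORBIDDEN = Pattern.compile("[<>%$]");

    // True if any blacklisted character occurs anywhere in the input.
    static boolean containsForbidden(String input) {
        return FORBIDDEN.matcher(input).find();
    }

    // Index of the first blacklisted character, or -1 if there is none.
    static int firstForbiddenIndex(String input) {
        Matcher m = FORBIDDEN.matcher(input);
        return m.find() ? m.start() : -1;
    }
}
```

With matches() instead of find(), the same pattern would only report inputs that consist of exactly one blacklisted character, which is why the original snippet lets almost everything through.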
A: 

Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.

DJClayworth
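A minimal sketch of the loop approach suggested above, assuming the same four blacklisted characters as in the question (the names are hypothetical):

```java
final class CharLoopCheck {
    // Characters to reject; taken from the question, extend as needed.
    private static final String FORBIDDEN = "<>%$";

    static boolean containsForbidden(String input) {
        for (int i = 0; i < input.length(); i++) {
            // indexOf returns -1 when the character is not in FORBIDDEN.
            if (FORBIDDEN.indexOf(input.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

This stops at the first offending character and needs no regex at all, but it is still a blacklist, so the caveats from the other answers apply.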
A: 

Here are all the French accented characters: àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ

I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.

For URLs, I replace accented letters with regular letters like so:

string beforeConversion = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
// "input" is a placeholder for the URL text to clean.
string cleaned = input;
for (int i = 0; i < beforeConversion.Length; i++) {
     // Swap each accented character for the plain character at the same index.
     cleaned = cleaned.Replace(beforeConversion[i], afterConversion[i]);
}

There's probably a more efficient way, mind you.

Atømix
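Since the question is about a Java web app, one possibly more general way to do the same accent stripping in Java is to decompose the string with java.text.Normalizer and then drop the combining marks. This is a different technique from the lookup table above, not a translation of it, and the class name is made up:

```java
import java.text.Normalizer;

final class AccentStripper {
    static String stripAccents(String input) {
        // NFD splits characters such as é into a base letter plus a combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

This handles accented letters without a hand-maintained list, but letters that have no canonical decomposition (Ł, đ, ø and so on) are left unchanged, and it does not map the typographic apostrophe ’ to ' the way the table above does.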