ansaurus

Question

Regular expression to allow a set of characters and disallow others

Answer 1

+1 A:

You do not mention what "flavor" of regex you are using. Does the following work?

\A[^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” * @]+\z
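Whether \A and \z are available depends on the flavor. As a minimal sketch, assuming Python's re module (the question does not name a language), re.fullmatch gives the same whole-string anchoring without them. The names FORBIDDEN and is_valid are hypothetical, and the spaces from the pattern above are left out on the assumption that they are only separators.

```python
import re

# Disallowed characters from the pattern above; the spaces are omitted
# here on the assumption that they were only visual separators.
FORBIDDEN = "œçşÇŞğĞščřŠŘŇĚŽĎŤČňěžůŮİťı—¿„”*@"

# Inside a character class, * and @ are literal, so no escaping is needed.
VALID = re.compile("[^" + FORBIDDEN + "]+")

def is_valid(text: str) -> bool:
    """Return True if text is non-empty and has no forbidden character."""
    return VALID.fullmatch(text) is not None
```

Because of the + quantifier, an empty string is rejected; use * instead if empty input should pass.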

Lieven 2009-04-01 14:09:46

Answer 2

+1 A:

A regular expression can be built to match the incorrect characters, e.g.:

[œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı]

(I didn't include all the characters; you get the idea!).

If any character matches, it's a fail.
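A rough sketch of that "search for a bad character" check, assuming Python's re module (the thread does not say which language is in use). BAD_CHARS and first_bad_char are hypothetical names, and the list is as partial as the one above.

```python
import re

# Partial list of disallowed characters, as in the answer above.
BAD_CHARS = re.compile("[œçşÇŞğĞščřŠŘŇĚŽĎŤČňěžůŮİťı]")

def first_bad_char(text: str):
    """Return the first disallowed character found, or None if there is none."""
    match = BAD_CHARS.search(text)
    return match.group(0) if match else None
```

Returning the offending character rather than a plain boolean makes it easier to tell the user what to fix.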

Or, if you need a regular expression that matches valid input, add a caret just inside the opening bracket and make sure the expression is applied to the whole string, like so:

[^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı]*

Jason Cohen 2009-04-01 14:10:42

Answer 3

+1 A:

You COULD use a regular expression for this, but why not just check whether any of the disallowed characters are in your string with a built-in method? For example, in the .NET world you could use .Contains().

Personally, I would create a list of allowed characters, then just check that your string doesn't have any characters that aren't in your list. Using a whitelist will ensure that you haven't forgotten any "bad" characters as well.
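The whitelist idea could look something like this without any regex, sketched in Python as an assumption about the language. The contents of ALLOWED are purely illustrative, and disallowed_chars is a hypothetical helper name.

```python
import string

# Hypothetical whitelist: ASCII letters, digits, space, and some punctuation.
ALLOWED = set(string.ascii_letters + string.digits + " .,;:'\"!?()-")

def disallowed_chars(text: str) -> set:
    """Return the set of characters in text that are not whitelisted."""
    return set(text) - ALLOWED
```

An empty result means the input is acceptable; anything else lists exactly which characters were rejected.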

Alex Fort 2009-04-01 14:11:37

I won't down-vote you, but here a regex really is smarter because (a) after compilation it's faster than checking against a list of characters and (b) it's more flexible if requirements change in future.

Jason Cohen 2009-04-01 14:18:00

Answer 4

A:

A few more will be added to this list but I will have the complete restricted list eventually.

And I do not have the complete list of allowed characters (it would be too long even if I tried to compile it, and would include all chars like ~`!#$%^&()[]{};':",.<> along with certain foreign chars)

You will eventually have the list of disallowed characters and probably not the list of allowed characters? You must have either the list of all allowed characters or the list of all disallowed characters. Otherwise you cannot tell whether the input is legal. Furthermore, if you have one of the lists, you have the other implicitly, provided the character set is known. Then just implement the shorter one.

Just guessing, but if you use Unicode, there will probably be many more characters you want to disallow than to allow; think of all the Chinese and Japanese symbols. So I think you should really build a list of allowed characters and use ranges like a-z where possible.

If you really want to build the list of disallowed characters, you will have to build a regular expression like [^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” * @]*. Do not forget to escape the characters if required and use ranges if possible.
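The escaping step can be automated in some environments. As a sketch, assuming Python's re module (not necessarily the engine used in the question), re.escape can protect characters that would otherwise have a special meaning inside the brackets. The function name and the sample characters are hypothetical.

```python
import re

def forbidden_free_pattern(forbidden: str) -> re.Pattern:
    """Compile [^...]* from a plain string of disallowed characters."""
    # re.escape backslash-escapes characters such as ], ^, - and \ that
    # would otherwise change the meaning of the character class.
    return re.compile("[^" + re.escape(forbidden) + "]*")

# Sample list that includes characters needing an escape inside brackets.
pattern = forbidden_free_pattern("]^-\\*@—¿„”")
```

The compiled pattern should be applied with fullmatch, since an unanchored [^...]* also matches the empty string at the start of any input.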

Adding so many chars in the not allowed list like [^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” *@]+ does not seem to work.

There are spaces in your list. Are they in your code, too? I am not sure, but this may be the problem.

Daniel Brückner 2009-04-01 14:19:10

Answer 5

A:

It would be best to try and match any character that is not allowed by negating the allowed set. For example, if you only wanted to allow 'a' through 'z', you might do the following.

[^a-z]

You cannot possibly know all of the characters that are not allowed, but you presumably know the ones that are allowed. So, build a regular expression like the one above that matches only one character that is not in the allowed set. If you get a match, you'll know that the string contains an invalid character.
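Here is a small sketch of that approach, assuming Python's re module and the same a-z example. INVALID and invalid_chars are hypothetical names; finditer is used so that every offending character is reported, not just the first.

```python
import re

# Matches any single character outside the allowed set a-z.
INVALID = re.compile(r"[^a-z]")

def invalid_chars(text: str) -> list:
    """Return (position, character) pairs for every invalid character."""
    return [(m.start(), m.group()) for m in INVALID.finditer(text)]
```

For example, invalid_chars("ab C1") returns [(2, ' '), (3, 'C'), (4, '1')], and an empty list means the string is valid.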

If you can, try to use built-in character class escape codes if they're available.

The Perl regular expression documentation lists them under "Character Classes and other Special Escapes". They may allow you to have a shorter expression like this one.

[^\w\d  ..other individual chars..  ]
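What \w covers varies by flavor, so it is worth checking before relying on it. As an illustration, assuming Python's re module: for str patterns \w matches Unicode letters, digits and underscore by default, so it would accept characters such as ğ, while the re.ASCII flag narrows it to [a-zA-Z0-9_]. The extra characters in the class below are hypothetical.

```python
import re

# \w already covers digits, so \d is not repeated here. The extra
# characters (space, dot, comma, hyphen) are hypothetical.
UNICODE_INVALID = re.compile(r"[^\w .,-]")
ASCII_INVALID = re.compile(r"[^\w .,-]", re.ASCII)

sample = "ağaç 1"
unicode_hit = UNICODE_INVALID.search(sample)  # no match: ğ and ç count as \w
ascii_hit = ASCII_INVALID.search(sample)      # matches ğ, the first non-ASCII letter
```

If the goal is to reject the accented letters listed in the question, the ASCII-restricted form is the one that behaves as intended here.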

Harvey 2009-04-01 14:41:29

Answer 6

A:

Thanks all,

What I meant by

Adding so many chars in the not allowed list like [^œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İ ť ı — ¿ „ ” *@]+ does not seem to work.

is that I get the error below when the expression is [^@*–’”“\r\nœçşÇŞğĞščřŠŘŇĚŽĎŤČňěžůŮİťı—¿„”]+

java.lang.ArrayIndexOutOfBoundsException
at org.apache.regexp.RECompiler$RERange.delete(RECompiler.java:1326)
at org.apache.regexp.RECompiler$RERange.remove(RECompiler.java:1417)
at org.apache.regexp.RECompiler$RERange.include(RECompiler.java:1459)
at org.apache.regexp.RECompiler$RERange.include(RECompiler.java:1470)
at org.apache.regexp.RECompiler.characterClass(RECompiler.java:699)
at org.apache.regexp.RECompiler.terminal(RECompiler.java:863)
at org.apache.regexp.RECompiler.closure(RECompiler.java:942)
at org.apache.regexp.RECompiler.branch(RECompiler.java:1151)
at org.apache.regexp.RECompiler.expr(RECompiler.java:1203)
at org.apache.regexp.RECompiler.compile(RECompiler.java:1281)
at org.apache.regexp.RE.<init>(RE.java:495)
at org.apache.regexp.RE.<init>(RE.java:480)

but this expression works perfectly fine
[^@*–’”“\r\nœçşÇŞğĞščřŠŘŇ]+

Also,
[^@*–’”“\r\nœçşÇŞğĞščřŠŘ„”]+ works
but
[^@*–’”“\r\nœçşÇŞğĞščřŠŘŇĚ]+ does not and gives the above error.

Is there a limit to the number of characters that can be disallowed in a character class this way?

2009-04-01 15:10:11
