Is there a special regex statement like \w that denotes all printable characters? I'd like to validate that a string contains only characters that can be printed, i.e. that it does not contain ASCII control characters like \a (bell), or null, etc. Anything on the keyboard is fine, and so are Unicode characters.

If there isn't a special statement, how can I specify this in a regex?

+3  A: 

Well, if you Google it you'll find some answers right away. There is a POSIX character class designation [:print:] that should match printable characters, and [:cntrl:] for control characters. (In POSIX syntax these go inside a bracket expression, e.g. [[:print:]].) Note that these classes are defined in terms of the ASCII table, so they might not be suitable for matching other encodings.

Failing that, the expression [\x00-\x1f] will match the ASCII control characters (add \x7f if you also want to catch DEL), although again, these byte values could be printable in other encodings.
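
For what it's worth, here is a minimal sketch of that range check in Java. Java is only my pick for illustration; the class and method names are made up, and I've added \x7f (DEL) to the range, which the expression above leaves out.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from any library: flags strings that contain
// an ASCII control character.
final class AsciiControlCheck {
    // 0x00-0x1F are the ASCII control characters; 0x7F (DEL) is added here
    // as an assumption about what the asker wants rejected.
    private static final Pattern ASCII_CONTROL =
            Pattern.compile("[\\x00-\\x1f\\x7f]");

    static boolean hasAsciiControl(String s) {
        // find() succeeds as soon as one control character appears anywhere.
        return ASCII_CONTROL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasAsciiControl("hello world")); // false
        System.out.println(hasAsciiControl("bell\u0007"));  // true
    }
}
```

Note that tab, carriage return and line feed are in this range too, so multi-line input will be flagged unless you carve those out.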

zombat
+1  A: 

It depends wildly on what regex package you are using. This is one of those situations about which some wag said that the great thing about standards is that there are so many to choose from.

If you happen to be using C, the isprint(3) function/macro is your friend.

Norman Ramsey
A: 

In Java, the \p{Print} option specifies the printable character class.

hashable
A: 

If your regex flavor supports Unicode properties, this is probably the best way:

\P{Cc}

That matches any character that's not a control character, whether it be ASCII -- [\x00-\x1F\x7F] -- or Latin1 -- [\x80-\x9F] (also known as the C1 control characters).
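
As a rough sketch of how that might be used for whole-string validation in Java (the class and method names are just placeholders I picked):

```java
import java.util.regex.Pattern;

// Hypothetical helper: accepts a string only if no character in it
// belongs to the Unicode general category Cc (control).
final class NoControlChars {
    // \P{Cc} is the negation of \p{Cc}; the * also lets the empty string pass.
    private static final Pattern NO_CONTROLS = Pattern.compile("\\P{Cc}*");

    static boolean isFreeOfControls(String s) {
        // matches() requires the entire string to fit the pattern.
        return NO_CONTROLS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFreeOfControls("h\u00e9llo w\u00f6rld")); // true
        System.out.println(isFreeOfControls("tab\there"));             // false
    }
}
```

Keep in mind that tab and newline are in category Cc as well, so this rejects them; if you need to allow them, something like [\P{Cc}\t\r\n]* should do it.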

The problem with POSIX classes like [:print:] or \p{Print} is that they can match different things depending on the regex flavor and, possibly, the locale settings of the underlying platform. In Java, they're strictly ASCII-oriented. That means \p{Print} matches only the ASCII printing characters -- [\x20-\x7E] -- while \P{Cntrl} (note the capital 'P') matches everything that's not an ASCII control character -- [^\x00-\x1F\x7F]. That is, it matches any ASCII character that isn't a control character, or any non-ASCII character--including C1 control characters.
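
If you want to see the difference for yourself, here is a small sketch that runs the three classes against the same inputs. It assumes Java's default matching mode (no UNICODE_CHARACTER_CLASS flag), and the class and method names are made up for the demo.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing three ways to say "printable" in Java.
final class PrintableClassComparison {
    static final Pattern PRINT = Pattern.compile("\\p{Print}*");     // ASCII printing chars only
    static final Pattern NOT_CNTRL = Pattern.compile("\\P{Cntrl}*"); // anything but ASCII controls
    static final Pattern NOT_CC = Pattern.compile("\\P{Cc}*");       // anything but Unicode controls

    static boolean all(Pattern p, String s) {
        // True when every character of s belongs to the class.
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9";                // contains a non-ASCII letter
        String c1 = String.valueOf((char) 0x85);      // NEL, a C1 control character

        System.out.println(all(PRINT, accented));     // false
        System.out.println(all(NOT_CNTRL, accented)); // true
        System.out.println(all(NOT_CC, accented));    // true

        System.out.println(all(PRINT, c1));           // false
        System.out.println(all(NOT_CNTRL, c1));       // true
        System.out.println(all(NOT_CC, c1));          // false
    }
}
```

Only \P{Cc} both accepts the accented letter and rejects the C1 control character.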

Alan Moore