^[^\x00-\x1F\x7F-\xFF]+$

This regex correctly rejects a string that contains non-printing characters (hex 00-1F) or extended ASCII characters (hex 80-FF) but, unlike in PHP, it lets non-ASCII UTF-8 characters pass (e.g. 日本واستقرارهहिन्दीދިވެހިބަސްગુજરાતી한).

According to the Wikipedia page on UTF-8, all of those characters should encode to bytes in the 80-FF range. Does anyone know what I'm missing?

Also, if you could explain how to ignore quoted text, you would be my hero forever.

+1  A: 

Hmm... instead of rejecting byte ranges, try matching actual Unicode characters, e.g.:

^[\u0020-\u007e]+$
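
To illustrate why the original pattern lets those characters through, here is a minimal sketch in Python (the thread never says which language is in use, so this is an assumption; the names REJECT_RANGES, PRINTABLE_ASCII and BYTE_PATTERN are made up for the example). On a decoded string, \xFF means the code point U+00FF rather than the byte 0xFF, so characters above U+00FF fall outside the negated class. The same pattern applied to the raw UTF-8 bytes does reject them, which is probably closer to what a byte-oriented engine does.

```python
import re

# The question's pattern, applied to a decoded (Unicode) string.
# Here \x7F-\xFF covers code points U+007F to U+00FF only.
REJECT_RANGES = re.compile(r'^[^\x00-\x1F\x7F-\xFF]+$')

# The answer's pattern: accepts only printable ASCII, U+0020 to U+007E.
PRINTABLE_ASCII = re.compile(r'^[\u0020-\u007e]+$')

# The question's pattern again, applied to raw bytes instead of characters.
BYTE_PATTERN = re.compile(rb'^[^\x00-\x1F\x7F-\xFF]+$')

sample = "日本"                    # both code points are far above U+00FF
encoded = sample.encode("utf-8")   # every resulting byte is 0x80 or higher

# True: the character-level pattern does not exclude these code points.
passes_original = REJECT_RANGES.fullmatch(sample) is not None

# False: the sample contains no printable ASCII at all.
passes_ascii_only = PRINTABLE_ASCII.fullmatch(sample) is not None

# False: at the byte level, every byte lands in the excluded 0x80-0xFF range.
passes_bytewise = BYTE_PATTERN.fullmatch(encoded) is not None

print(passes_original, passes_ascii_only, passes_bytewise)
```

So the Wikipedia table describes the encoded bytes, while the regex is most likely being matched against decoded characters.
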
Delan Azabani
Thank you kindly!
Greg