I want to disallow certain UTF-8 input on the server side, e.g. East Asian scripts, where example input might be " 伊 ".

However, I do want to keep supporting other Latin or "Latin-like" characters, such as the Welsh ŵ and ŷ. Those fall outside Latin-1, so checking against Latin-1 is not an option.

What are my options? (If the answer is language-specific, PHP is preferred.)

Thanks very much.

+6  A: 

Just do

preg_match('/[^\\p{Common}\\p{Latin}]/u', $string)

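where $string is a UTF-8 string. This returns int(1) if the string contains any character outside the Latin and Common scripts, and int(0) otherwise. (It returns false on error, for example if $string is not valid UTF-8.)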

Example:

var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷaás??'));  //int(0)
var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷݤaás??')); //int(1)
Artefacto
Looks useful! +1
alex
Works great, thanks v. much!
HoboBen
Is there a list of named subpatterns anywhere?
HoboBen
@Hobo See this page: http://www.regular-expressions.info/unicode.html
Artefacto
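
For anyone who wants to use the check above as a server-side validator, here is a minimal sketch that wraps it in a function. The name isLatinOnly is hypothetical and not from the answer, and rejecting input when preg_match fails is an assumption about what a validator should do.

```php
<?php
// Hypothetical helper (name is illustrative): true when $input contains
// only characters from the Latin and Common Unicode scripts.
function isLatinOnly(string $input): bool
{
    // preg_match returns 1 on a match, 0 on no match, and false on error
    // (e.g. $input is not valid UTF-8 while the /u modifier is in use).
    $result = preg_match('/[^\p{Common}\p{Latin}]/u', $input);

    if ($result === false) {
        // Assumption: malformed input is rejected rather than accepted.
        return false;
    }

    // No disallowed character was found.
    return $result === 0;
}

var_dump(isLatinOnly('ŵ and ŷ'));  // Welsh letters and spaces
var_dump(isLatinOnly(' 伊 '));      // contains a Han character
var_dump(isLatinOnly("\xC3\x28")); // invalid UTF-8 byte sequence
```

Note that this works on precomposed characters such as ŵ. If the input might arrive in decomposed form (a base letter followed by a combining mark), it may be worth normalizing it first, since combining marks are likely not covered by \p{Latin} or \p{Common}.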