Hello,

I am creating a service that could "go international" to non-English-speaking markets. I do not want to restrict usernames to the ASCII range of characters; I would like to let users specify their "natural" username. OK, so use Unicode (with, say, UTF-8 as my username text encoding).

But! I don't want users to create "non-name" usernames that contain "symbol" code points. For instance, I don't want to allow a username like √√√√√√øøøøø.

Is there a list of "symbol" code points in Unicode that I can check against (perhaps with a regex) to accept or reject a given username?

Thanks!

+2  A: 

Unicode assigns every code point a general category (letter, number, symbol, punctuation and so on), so you can easily exclude symbols. How exactly to do that depends on the language you are using. Some regex engines have category matching built in, some don't.
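As a rough illustration of the category check, here is a minimal Python sketch. Python's built-in `re` module has no category escapes, so this assumes the standard `unicodedata` module is used instead; the function name `contains_symbol` is made up for the example.

```python
import unicodedata


def contains_symbol(username: str) -> bool:
    """Return True if any code point is in a Symbol general category.

    Symbol categories all start with "S": Sm (math), Sc (currency),
    Sk (modifier) and So (other).
    """
    return any(unicodedata.category(ch).startswith("S") for ch in username)


# The username from the question is caught because of the square-root
# signs (category Sm); the "ø" characters are ordinary lowercase letters.
print(contains_symbol("√√√√√√øøøøø"))
print(contains_symbol("Lukáš"))
```

This is a deny-list: anything that is not a symbol (digits, punctuation, spaces) still passes, so it may be too permissive on its own.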

Lukáš Lalinský
Ah, I had no idea about this! That's perfect. Thanks.
z8000
I suppose for my purposes I'll allow code points in any of these categories: [Ll] Letter, Lowercase; [Lm] Letter, Modifier; [Lo] Letter, Other; [Lt] Letter, Titlecase; [Lu] Letter, Uppercase.
z8000
Well, for example, Perl supports a pseudo-category for regular expressions called *IsWord*, which is defined as: Ll+Lu+Lt+Lo+Nd
Lukáš Lalinský
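
The allow-list idea from the comments above could be sketched in Python roughly as follows. The names `LETTER_CATEGORIES`, `WORD_CATEGORIES` and `is_allowed` are hypothetical, and the NFC normalization step is an added assumption (it is not mentioned in the thread) so that a letter typed as base character plus combining accent is treated like its precomposed form.

```python
import unicodedata

# The five letter categories listed in the comment above.
LETTER_CATEGORIES = frozenset({"Ll", "Lm", "Lo", "Lt", "Lu"})

# The IsWord-style set quoted above: Ll+Lu+Lt+Lo+Nd.
WORD_CATEGORIES = frozenset({"Ll", "Lu", "Lt", "Lo", "Nd"})


def is_allowed(username: str, allowed=LETTER_CATEGORIES) -> bool:
    """Return True if the username is non-empty and every code point
    (after NFC normalization) falls in one of the allowed categories."""
    normalized = unicodedata.normalize("NFC", username)
    return bool(normalized) and all(
        unicodedata.category(ch) in allowed for ch in normalized
    )


print(is_allowed("Lukáš"))                   # letters only
print(is_allowed("z8000"))                   # digits are not letters
print(is_allowed("z8000", WORD_CATEGORIES))  # digits allowed via Nd
```

One caveat with a letters-only list: scripts that rely on combining marks (categories Mn and Mc) will have perfectly natural names rejected, so the allowed set may need to grow depending on the target markets.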