Why do most (all?) websites only support usernames in ASCII? Are there any security considerations if an admin decides to start accepting Unicode usernames?
I would say a big reason is the lack of Unicode support in most PHP installations. It isn't easy to work with, so why allow it when ASCII is sufficient to cover your entire user base?
Plain ASCII is rare, I'd say. Often it's just that no one thinks of it, since Latin-1 suffices in Western Europe and in the US as well. Some databases make distinctions between text in legacy character sets and Unicode (varchar vs. nvarchar), or for other databases a special character set has to be set.
Especially in the US many people never even notice that ASCII won't be enough. Some try to find excuses with »Users have to enter it« or similar which are mostly bogus, though.
To answer your question, I doubt there are security considerations, except maybe for spoofing other people's names using different scripts (a and а look identical, but one is Latin, one is Cyrillic – this has been done with URLs before). Generally I see it as an oversight by developers who probably should know better.
Homoglyph attacks. User 'cat' and 'сat' are different Unicode strings although they look the same. The first letter in the second 'сat' is Russian 'с' - "CYRILLIC SMALL LETTER ES" to be exact. The system can't easily tell that you're spoofing another user's name - to the computer the nicks are different.
Edit: Preventing mixed scripts does not solve the problem. For example 'сосо' is pure Cyrillic and can be used to spoof ASCII 'coco'.
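To make the homoglyph point concrete, here is a minimal Python sketch. The `scripts` and `skeleton` helpers and the three-entry `CONFUSABLES` table are hypothetical and for illustration only; a real system would need a far more complete look-alike table (the Unicode consortium publishes confusables data for this purpose).

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table (Cyrillic -> Latin).
CONFUSABLES = {"\u0441": "c", "\u043e": "o", "\u0430": "a"}

def scripts(name):
    """Rough script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in name}

def skeleton(name):
    """Replace known look-alikes so visually identical names compare equal."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in name)

latin = "cat"
spoof = "\u0441at"                      # starts with CYRILLIC SMALL LETTER ES
pure_cyrillic = "\u0441\u043e\u0441\u043e"  # looks like "coco"

print(latin == spoof)                     # False
print(sorted(scripts(spoof)))             # ['CYRILLIC', 'LATIN']
print(sorted(scripts(pure_cyrillic)))     # ['CYRILLIC']
print(skeleton(pure_cyrillic) == "coco")  # True
```

The mixed-script check catches 'сat' but not 'сосо', which is why comparing skeletons against existing usernames at registration time is probably the more useful test.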
Also, left-to-right override (and friends.) Leave them unsanitized and they'll mess up your whole page.
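One possible way to catch these, sketched in Python: reject any name containing characters in the Unicode "format" (Cf) or "control" (Cc) categories. The function name is made up for this example, and the rule is an assumption rather than a standard; it is deliberately blunt and would also reject joiners that some scripts legitimately need.

```python
import unicodedata

def has_invisible_controls(name):
    """True if the name contains format (Cf) or control (Cc) characters.

    Cf includes the bidi overrides such as U+202E RIGHT-TO-LEFT OVERRIDE
    as well as zero-width characters.
    """
    return any(unicodedata.category(ch) in ("Cf", "Cc") for ch in name)

print(has_invisible_controls("alice"))          # False
print(has_invisible_controls("ali\u202ece"))    # True
```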
HTTP authentication? There could be some problems with sending a Unicode username (and/or password) over existing protocols. One case that I have run into before is Basic authentication: there is no well-defined way to send Unicode usernames/passwords in the Basic auth headers.
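A small Python sketch of why this bites: the Basic header is just base64 of "user:password", so the bytes on the wire depend on which encoding the client happened to pick. The helper name and the sample credentials are made up for the example.

```python
import base64

def basic_auth_header(user, password, encoding):
    """Build a Basic auth header value using the given text encoding."""
    raw = f"{user}:{password}".encode(encoding)
    return "Basic " + base64.b64encode(raw).decode("ascii")

utf8_header = basic_auth_header("j\u00f6rg", "secret", "utf-8")
latin1_header = basic_auth_header("j\u00f6rg", "secret", "latin-1")

# Same credentials, two different headers; the server has to guess.
print(utf8_header == latin1_header)  # False

# Pure ASCII credentials are unaffected by the choice of encoding.
print(basic_auth_header("bob", "secret", "utf-8")
      == basic_auth_header("bob", "secret", "latin-1"))  # True
```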
While you can go ahead and allow unicode, understand that some usernames will not work as expected thanks to different cultures applying different rules to the same characters.
Consider the basic case for breaking case sensitivity: In Turkish, the usernames "Id1" and "id1" are different (in Turkish there are two different Is, one with a dot and one without, resulting in 2 capital and 2 small letters that do not follow the same capitalization rules as English). So while any Turkish person can enter their name in their own language, the program will not treat their name as they expect - instead it will undergo a strange transformation into mutant English.
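Here is how that plays out in Python, whose `str.lower()` and `str.upper()` use the locale-independent Unicode default mappings. The `turkish_lower` helper is a hypothetical sketch of what a Turkish user would expect instead; it is not a library function.

```python
DOTLESS_SMALL_I = "\u0131"    # ı
DOTTED_CAPITAL_I = "\u0130"   # İ

def turkish_lower(s):
    """Lowercase using Turkish rules: I -> ı and İ -> i (illustrative only)."""
    return s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()

# Default rules: capital I becomes dotted i, as in English.
print("ID1".lower())                         # id1
# Turkish rules: capital I becomes dotless ı, a different username.
print(turkish_lower("ID1") == "\u0131d1")    # True
print("ID1".lower() == turkish_lower("ID1")) # False

# Round trips are lossy too: dotless ı uppercases to plain I.
print(DOTLESS_SMALL_I.upper())               # I
```

Whichever rule a site picks for its case-insensitive comparison, it has to apply the same rule everywhere (registration, login, lookup), otherwise two accounts can collide or a user can be locked out.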
Special Latin characters in European languages have similar overlaps, making it seemingly random which language they are treated as belonging to. Other regions of the world have similar shared characters where the rules of use differ - in some cases national and cultural hatreds could result in some very angry people when the characters making up their username are treated as if they were written in the language of their hated enemy (due to that being the operating system's default setting for those foreign characters).
Your observation is not always true. Also, the choice of ASCII is largely down to human factors rather than technical or security issues.
In most cases, it is just for ease of programming. A programmer can never know whether all the software, libraries, and utilities behind a website will break on some characters. Why risk the website's development when ASCII works well? Also, some packaged web software hinders the use of Unicode in usernames. This contributes to the fact that many websites only support ASCII usernames.
Theoretically, all current software can handle 8-bit data well. There is no problem with storage or transmission nowadays. Even where some protocols cannot, the data can be converted to UTF-7 or another transformation scheme.
There are some issues with Unicode, but they are more on the data-processing side: display, fonts, readiness of software and libraries for non-BMP characters, collation, comparison, input methods, and writing direction. Administrators might not be knowledgeable enough to handle them. Depending on the nature of the website this could be a problem, but mostly it is not.
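Two of those processing issues, comparison and non-BMP characters, are easy to show in Python. The `canonical` helper is a hypothetical name, and normalizing with NFKC plus case folding is one reasonable choice for comparing usernames, not the only one.

```python
import unicodedata

def canonical(name):
    """One possible comparison key: NFKC normalization plus case folding."""
    return unicodedata.normalize("NFKC", name).casefold()

composed = "Ren\u00e9"       # é as a single code point
decomposed = "Rene\u0301"    # e followed by COMBINING ACUTE ACCENT

# Both render the same, but the raw strings differ.
print(composed == decomposed)                        # False
print(canonical(composed) == canonical(decomposed))  # True

# A non-BMP character is one code point but two UTF-16 code units,
# which trips up software that counts or truncates by code unit.
emoji = "\U0001F600"
print(len(emoji))                           # 1
print(len(emoji.encode("utf-16-le")) // 2)  # 2
```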
For administration, it is not easy to type some exotic characters, which makes it hard for an admin to search for users. It is also hard for an admin to keep offensive usernames in foreign languages off the website.
However, it is not uncommon for Chinese usernames to be used on Chinese websites, so usernames are not always ASCII. The same goes for other cultures and languages. Some global projects accept nearly all kinds of Unicode characters; Wikipedia is an example.
Or, we could just stop giving a crap about what a username looks like, and whether WE can pronounce or remember it. That should be the USER'S concern. If no one remembers you, that's your loss. And, as for name spoofing, that is almost unavoidable in any case. And yet, rarely do you ever hear of username spoofs.
Imagine a forum, imagine someone posts with an account that LOOKS identical to yours. You get in trouble, say you didn't do it, post a link to your history, see the post isn't there. Click the profile of the guy who ACTUALLY posted it, and bam, you have his profile. He's now bannable.
Having the same name doesn't mean you have the same user data. Any application that doesn't make it easy for you to differentiate two similar users is piss poor anyway and needs to be rewritten.
While it is questionable why there should be a username at all, rather than just a 'password' to identify a user, I think there's no reason to disallow Unicode usernames.
What's more important is that the password be validated in a language-agnostic way: it should treat keystrokes the same regardless of the user's keyboard layout. This means "שלום" and "akuo" would be the same password. This is important because users often can't see the password characters they are typing, and they get severely pissed off if Caps Lock is on.
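A rough Python sketch of that idea: map each character back to the physical key that produces it before comparing. The table below is a partial mapping of the standard Hebrew layout onto QWERTY letter keys, written out for this example; the names `ROWS`, `KEY_OF` and `to_keys` are made up, and a real implementation would need tables for every layout it wants to support, plus some handling of Caps Lock.

```python
# (Hebrew characters, QWERTY keys) for each keyboard row, in key order.
ROWS = [
    ("קראטוןםפ", "ertyuiop"),
    ("שדגכעיחלך", "asdfghjkl"),
    ("זסבהנמצ", "zxcvbnm"),
]

# Character -> physical key, built pairwise from the rows above.
KEY_OF = {heb: key for hebrew, keys in ROWS for heb, key in zip(hebrew, keys)}

def to_keys(typed):
    """Translate typed text to the QWERTY keys that would have produced it."""
    return "".join(KEY_OF.get(ch, ch) for ch in typed)

print(to_keys("שלום"))                     # akuo
print(to_keys("שלום") == to_keys("akuo"))  # True
```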