views:

1691

answers:

8

Why do most (all?) websites only support usernames in ASCII? Are there any security considerations if an admin decides to start accepting Unicode usernames?

A: 

I would say a big reason is the lack of support for unicode in most PHP installations. It isn't easy to work with, so why allow it when the possibilities in ASCII are sufficient to cover your entire user base?

Scott M.
The question is not about PHP so infirmity of that language shouldn't be an argument.
Crozin
@Crozin: Many web applications are written in PHP, so it may be an argument for those. That particular language has a long, sad history of the crappiest support for Unicode next to only LaTeX.
Joey
@Scott_M. @Johannes_Rössel: Following this argument, should the web only be populated with Latin characters? To follow up on your answers: even though you say PHP lacks Unicode support, you find many websites with Unicode content, **except** that they force their users to choose ASCII usernames and passwords.
banx
banx: It's just that PHP can't natively or easily work with Unicode (and MySQL, which it's often paired with, defaults to Latin 1), which is far from ideal. If you don't pay attention as a developer you'll end up with a site that just doesn't support it, whereas I'd consider Unicode by default a much more sensible choice.
Joey
I didn't say it's impossible, I said it's difficult. Many developers are lazy and say "if they want to use my site, they have to use ASCII user names." Lucky for us, when PHP 6 comes out it will have native support for unicode and this difficulty will go away (slowly)
Scott M.
A: 

Plain ASCII is rare, I'd say. Often it's just that no one thinks of it since in Western Europe Latin 1 suffices and for the US as well. Some databases make distinctions between text in legacy character sets and Unicode (varchar vs. nvarchar) or for other databases a special character set has to be set.
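To illustrate the gap between ASCII, Latin 1 and Unicode storage, here is a minimal Python sketch. The helper `fits` and the sample names are made up for illustration; it only checks whether a name can be encoded in a given character set, which is roughly what a legacy-charset column would require.

```python
def fits(name: str, charset: str) -> bool:
    """Return True if every character of `name` exists in `charset`."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Hypothetical sample usernames: ASCII, Western European, Polish, Cyrillic.
names = ["Joey", "R\u00f6ssel", "Rafa\u0142", "\u041f\u0430\u0432\u0435\u043b"]

for name in names:
    report = {cs: fits(name, cs) for cs in ("ascii", "latin-1", "utf-8")}
    print(name, report)
```

Latin 1 covers "ö" but not "ł" or Cyrillic, so a site that "works fine" for Western European names can still reject everyone else.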

Especially in the US many people never even notice that ASCII won't be enough. Some try to find excuses with »Users have to enter it« or similar which are mostly bogus, though.

To answer your question, I doubt there are security considerations, except maybe for spoofing other people's names using different scripts (a and а look identical, but one is Latin, one is Cyrillic – this has been done with URLs before). Generally I see it as an oversight by developers who probably should know better.

Joey
+28  A: 

Homoglyph attacks. User 'cat' and 'сat' are different unicode strings although they look the same. The first letter in the second 'сat' is Russian 'с' - "CYRILLIC SMALL LETTER ES" to be exact. The system can't easily tell that you're spoofing another user's name - to the computer the nicks are different.

Edit: Preventing mixed scripts does not solve the problem. For example 'сосо' is pure Cyrillic and can be used to spoof ASCII 'coco'.

Also, left-to-right override (and friends.) Leave them unsanitized and they'll mess up your whole page.
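A rough Python sketch of both points, using only the standard `unicodedata` module. The helpers `scripts` and `has_bidi_control` are hypothetical names, and guessing the script from the first word of the character name is a crude approximation, not a real Unicode script lookup.

```python
import unicodedata

# Bidi classes of the explicit embedding/override/isolate control characters.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def scripts(name: str) -> set:
    """Crude script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in name if ch.isalpha()}

def has_bidi_control(name: str) -> bool:
    """True if the name contains a directional override or similar control."""
    return any(unicodedata.bidirectional(ch) in BIDI_CONTROLS for ch in name)

latin_cat = "cat"
spoof_cat = "\u0441at"                      # CYRILLIC SMALL LETTER ES + Latin "at"
cyrillic_coco = "\u0441\u043e\u0441\u043e"  # Cyrillic es, o, es, o

print(latin_cat == spoof_cat)          # False: different code points
print(sorted(scripts(spoof_cat)))      # ['CYRILLIC', 'LATIN'] -> mixed, can be rejected
print(sorted(scripts(cyrillic_coco)))  # ['CYRILLIC'] -> passes a single-script rule
print(has_bidi_control("abc\u202edef"))  # True: contains RIGHT-TO-LEFT OVERRIDE
```

The mixed-script check catches 'сat' but lets the all-Cyrillic 'сосо' through, which is why a single-script rule alone is not enough.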

Rafał Dowgird
Well, it *can* easily tell if you're mixing scripts and disallow those. Web browsers follow a similar rule to revert IDNs to Punycode display.
Joey
You don't always need to *mix* scripts. Some all-ascii words can be recreated using cyrillic-only, for example 'coco'. So you need to deal with that too.
Rafał Dowgird
Homoglyph attacks are possible in ASCII as well; "0" and "O" are indistinguishable in many fonts, as are "|", "I", "l", and "1"; ".com", ".corn" among others.
Dour High Arch
If homoglyph attacks need to be minimized, you may then restrict a username to use only one type of script per username (along with numeric characters). For instance, a user cannot mix Chinese and Arabic characters in one single username, or English and Cyrillic and Ethiopic, etc.
banx
Doesn't OpenID (as used here on SO) make spoofing another user's name quite easy? I got the impression that I could make my name appear as "Jon Skeet" if I wanted. Yet, I don't see that sort of thing being a problem on SO.
Craig McQueen
How about allowing single-script names, but putting a little logo next to each username identifying the script? That way Latin "coco" and Cyrillic "coco" are distinguishable.
Paul Johnson
+4  A: 

HTTP authentication? There could be some problems with sending the unicode username (and/or password) over existing protocols. One case that I have run into before is with Basic authentication. There is no well defined way to handle sending these unicode usernames/passwords in the basic auth headers.
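A small Python sketch of the ambiguity: the same username produces different Basic auth header values depending on which encoding the client happens to pick. The function `basic_auth_header` and the credentials are made up for illustration; this is not taken from any particular client or server.

```python
import base64

def basic_auth_header(user: str, password: str, charset: str) -> str:
    """Build a Basic Authorization header value using the given charset."""
    raw = f"{user}:{password}".encode(charset)
    return "Basic " + base64.b64encode(raw).decode("ascii")

# Same credentials, two plausible client-side encodings.
as_latin1 = basic_auth_header("R\u00f6ssel", "secret", "latin-1")
as_utf8 = basic_auth_header("R\u00f6ssel", "secret", "utf-8")

print(as_latin1)
print(as_utf8)
print(as_latin1 == as_utf8)  # False: the server has to guess which one it got
```

For pure-ASCII credentials both encodings give the same bytes, which is why the problem only shows up once non-ASCII usernames are allowed.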

Mike
[UTF-7](http://en.wikipedia.org/wiki/UTF-7) allows you to transmit Unicode code-points as ASCII.
dreamlax
But with utf-7, or any other encoding, you need to own the client and the server code to make sure that they will properly decode the data.
Mike
+1  A: 

While you can go ahead and allow unicode, understand that some usernames will not work as expected thanks to different cultures applying different rules to the same characters.

Consider the basic case of breaking case sensitivity: in Turkish, the usernames "Id1" and "id1" are different (Turkish has two different Is, one with a dot and one without, resulting in two capital and two small letters that do not follow the same capitalization rules as English). So while any Turkish person can enter their name in their own language, the program will not treat their name as they expect; instead it will undergo a strange transformation into mutant English.
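A quick Python illustration of that mismatch. Python's built-in `str` case methods are locale-independent and follow the default (non-Turkic) Unicode rules, so this shows what an application gets if it does nothing special for Turkish; the variable names are just for this sketch.

```python
dotted_capital = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Default rules pair "I" with "i", as in English.
print("Id1".lower() == "id1")            # True

# A Turkish user expects "I" <-> dotless i and dotted capital I <-> "i".
print(dotless_small.upper() == "I")      # True: upper-casing works...
print("I".lower() == dotless_small)      # False: ...but does not round-trip
print(len(dotted_capital.lower()))       # 2: "i" plus a combining dot above

# Case-insensitive comparison therefore disagrees with Turkish expectations.
print("ID1".casefold() == "\u0131d1".casefold())  # False
```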

Special Latin characters in European languages have similar overlaps, making it seemingly random which language they are treated as belonging to. Other regions of the world have similar shared characters where the rules of use differ. In some cases national and cultural hatreds could result in some very angry people when the characters making up their username are treated as if they were written in the language of their hated enemy (because that is the operating system's default setting for those foreign characters).

David
So, we need PSP (politics sensitive programming). Shame on the Unicode consortium for not sorting all that out for us. ☺
Craig McQueen
+2  A: 

Your observation is not always true. Also, the choice of ASCII is largely a matter of human factors rather than technical or security issues.

In most cases it is simply for ease of programming. A programmer can never know whether all the software, libraries, and utilities behind a website will break on some characters. Why risk the website's development when ASCII works well? Also, some packaged web software hinders the use of Unicode in usernames. This contributes to many websites supporting only ASCII usernames.

Theoretically, all current software can handle 8-bit data well. There is no problem with storage or transmission nowadays. Even where some protocols cannot, the data can be converted to UTF-7 or another transformation scheme.

There are some issues with Unicode, mostly on the data-processing side: display, fonts, readiness of software and libraries for non-BMP characters, collation, comparison, input methods, and writing direction. Administrators might not be knowledgeable enough to handle them. Depending on the nature of the website this could be a problem, but mostly it is not.
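As one example of the comparison issue, the same visible name can be stored as different code point sequences. A minimal Python sketch follows; the function name `canonical_username` and the choice of NFKC plus case folding are assumptions for illustration, not a recommendation from any standard quoted here.

```python
import unicodedata

composed = "Jos\u00e9"      # "é" as one precomposed code point
decomposed = "Jose\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)          # False, although both render the same
print(len(composed), len(decomposed))  # 4 5

def canonical_username(name: str) -> str:
    """Hypothetical canonical form used only for uniqueness checks."""
    return unicodedata.normalize("NFKC", name).casefold()

print(canonical_username(composed) == canonical_username(decomposed))  # True
```

Comparing the canonical form at registration time keeps two users from registering names that differ only in how the accents are encoded.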

For admin purposes, it is not easy to type some exotic characters, which makes it hard for an admin to search for users. It is also hard for an admin to keep offensive usernames in foreign languages off the website.

However, it is not uncommon for Chinese usernames to be used on Chinese websites; they are not always in ASCII. The same goes for other cultures and languages. Some global projects accept nearly all kinds of Unicode characters; Wikipedia is an example.

OmniBus
A: 

Or, we could just stop giving a crap about what a username looks like, and whether WE can pronounce/remember it. That should be the USER'S concern. If no one remembers you, that's your loss. And, as for name spoofing, that is almost unavoidable in any case. And yet, rarely do you ever hear of username spoofs.

Imagine a forum, imagine someone posts with an account that LOOKS identical to yours. You get in trouble, say you didn't do it, post a link to your history, see the post isn't there. Click the profile of the guy who ACTUALLY posted it, and bam, you have his profile. He's now bannable.

Having the same name doesn't mean you have the same user data. Any application that doesn't make it easy for you to differentiate two similar users is piss poor anyway and needs to be rewritten.

Clint
This does not answer the question. It would be better as a comment under one of the other answers.
Skip Head
A crap, I do not give it.
Clint
+2  A: 

While it is questionable whether there should be a username at all, rather than just a 'password', to identify a user, I think there's no reason to disallow Unicode usernames.

What's more important is that the password be validated in a language-agnostic way: it should treat keystrokes the same regardless of the user's keyboard layout setting. This means "שלום" and "akuo" would be the same password. This is important because users often can't see the password characters they are typing, and they get severely pissed if CAPS LOCK is on.
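A minimal sketch of that idea in Python: map each typed character back to the physical key it came from before comparing. The table below is deliberately partial and contains only the four key pairs implied by the example above; `HEBREW_TO_QWERTY` and `physical_keys` are hypothetical names, and a real implementation would need complete layout tables and some way of knowing which layouts to consider.

```python
# Partial, illustrative mapping: Hebrew-layout character -> QWERTY key.
HEBREW_TO_QWERTY = {
    "\u05e9": "a",  # shin
    "\u05dc": "k",  # lamed
    "\u05d5": "u",  # vav
    "\u05dd": "o",  # final mem
}
_TABLE = str.maketrans(HEBREW_TO_QWERTY)

def physical_keys(typed: str) -> str:
    """Translate typed characters to the QWERTY keys that produced them."""
    return typed.translate(_TABLE)

typed_in_hebrew = "\u05e9\u05dc\u05d5\u05dd"
print(physical_keys(typed_in_hebrew))                           # akuo
print(physical_keys(typed_in_hebrew) == physical_keys("akuo"))  # True
```

Characters missing from the table pass through unchanged, so ASCII input is unaffected.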

Pavel Radzivilovsky