Should Unicode be allowed in usernames?

views:

1691

+19 Q:

Should Unicode be allowed in usernames?

Why do most (all?) websites only support usernames in ASCII? Are there any security considerations if an admin decides to start accepting Unicode usernames?

I would say a big reason is the lack of Unicode support in most PHP installations. It isn't easy to work with, so why allow it when ASCII is sufficient to cover your entire user base?

Scott M. 2010-08-12 18:16:43

The question is not about PHP, so the infirmity of that language shouldn't be an argument.

Crozin 2010-08-12 18:18:29

@Crozin: Many web applications are written in PHP, so it may be an argument for those. That particular language has a long, sad history of the crappiest support for Unicode next to only LaTeX.

Joey 2010-08-12 18:20:54

@Scott_M. @Johannes_Rössel: Following this argument, should the web only be populated with Latin characters? To follow up on your answers: even though you say PHP lacks support for Unicode, you find many websites with Unicode content, **except** when they force their users to choose ASCII usernames and passwords.

banx 2010-08-12 18:24:29

banx: It's just that PHP can't natively or easily work with Unicode (and MySQL, which it's often paired with, defaults to Latin 1), which is far from ideal. If you don't pay attention as a developer you'll end up with a site that just doesn't support it, whereas I'd consider defaulting to Unicode a much more sensible choice.

Joey 2010-08-12 18:29:29

I didn't say it's impossible, I said it's difficult. Many developers are lazy and say "if they want to use my site, they have to use ASCII user names." Lucky for us, when PHP 6 comes out it will have native support for Unicode and this difficulty will go away (slowly).

Scott M. 2010-08-12 18:37:25

Plain ASCII is rare, I'd say. Often it's just that no one thinks of it since in Western Europe Latin 1 suffices and for the US as well. Some databases make distinctions between text in legacy character sets and Unicode (varchar vs. nvarchar) or for other databases a special character set has to be set.

Especially in the US many people never even notice that ASCII won't be enough. Some try to find excuses with »Users have to enter it« or similar which are mostly bogus, though.

To answer your question, I doubt there are security considerations, except maybe for spoofing other people's names using different scripts (a and а look identical, but one is Latin, one is Cyrillic – this has been done with URLs before). Generally I see it as an oversight by developers who probably should know better.

Joey 2010-08-12 18:18:14

+28 A:

Homoglyph attacks. Users 'cat' and 'сat' are different Unicode strings although they look the same. The first letter in the second 'сat' is Russian 'с' - "CYRILLIC SMALL LETTER ES" to be exact. The system can't easily tell that you're spoofing another user's name - to the computer the nicks are different.

Edit: Preventing mixed scripts does not solve the problem. For example 'сосо' is pure Cyrillic and can be used to spoof ASCII 'coco'.
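
A minimal Python sketch of the point above, using only the standard `unicodedata` module. The `CONFUSABLES` table and the `skeleton` helper are hypothetical and hand-picked for illustration; a real system would need a far larger mapping.

```python
import unicodedata

latin = "cat"
spoof = "\u0441at"  # first letter is CYRILLIC SMALL LETTER ES, not Latin c


def describe(name: str) -> list[str]:
    """Return the Unicode character name of every character in `name`."""
    return [unicodedata.name(ch) for ch in name]


# Hypothetical, deliberately tiny table of Cyrillic letters that look Latin.
CONFUSABLES = {
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def skeleton(name: str) -> str:
    """Replace each known look-alike with its Latin twin."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in name)


print(latin == spoof)      # the two nicks are different strings
print(describe(latin)[0])  # name of the Latin first letter
print(describe(spoof)[0])  # name of the Cyrillic first letter
print(skeleton(spoof) == skeleton(latin))
```

Comparing skeletons instead of raw strings at registration time would reject 'сat' when 'cat' already exists, and it also catches the pure-Cyrillic 'сосо' case.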

Also, left-to-right override (and friends.) Leave them unsanitized and they'll mess up your whole page.
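
One possible way to catch the override characters is to look at each character's bidirectional class via `unicodedata.bidirectional`. This is a sketch, not a complete sanitizer; the helper names are made up here, and the set of classes covers only the explicit embedding, override and isolate controls.

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters
# (embeddings, overrides, isolates and their terminators).
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF",
    "LRI", "RLI", "FSI", "PDI",
}


def has_bidi_controls(name: str) -> bool:
    """True if `name` contains an explicit directional control character."""
    return any(unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES for ch in name)


def strip_bidi_controls(name: str) -> str:
    """Return `name` with explicit directional control characters removed."""
    return "".join(
        ch for ch in name
        if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES
    )


tricky = "abc\u202edef"  # contains RIGHT-TO-LEFT OVERRIDE
print(has_bidi_controls(tricky))
print(strip_bidi_controls(tricky))
```

Ordinary right-to-left letters (Hebrew, Arabic) have other bidi classes and pass through untouched, so rejecting or stripping these controls does not block legitimate RTL names.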

Rafał Dowgird 2010-08-12 18:24:24

Well, it *can* easily tell if you're mixing scripts and disallow those. Web browsers follow a similar rule to revert IDNs to Punycode display.

Joey 2010-08-12 18:26:58

You don't always need to *mix* scripts. Some all-ASCII words can be recreated using Cyrillic only, for example 'coco'. So you need to deal with that too.

Rafał Dowgird 2010-08-12 18:42:06

Homoglyph attacks are possible in ASCII as well; "0" and "O" are indistinguishable in many fonts, as are "|", "I", "l", and "1"; ".com", ".corn" among others.

Dour High Arch 2010-08-12 18:42:06

If homoglyph attacks need to be minimized, you may then restrict a username to use only one type of script per username (along with numeric characters). For instance, a user cannot mix Chinese and Arabic characters in one single username, or English and Cyrillic and Ethiopic, etc.
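
A rough sketch of such a single-script rule in Python. The standard library does not expose the Unicode Script property, so this assumes the first word of each character's Unicode name ("LATIN", "CYRILLIC", "ARABIC", "CJK", ...) is a good enough stand-in; that is a heuristic, and the function names are invented for the example.

```python
import unicodedata


def scripts_used(name: str) -> set[str]:
    """Approximate the set of scripts in `name`, ignoring digits and punctuation."""
    scripts = set()
    for ch in name:
        if not ch.isalpha():
            continue  # digits and symbols are allowed alongside any script
        # Heuristic: the first word of the character name stands in for the script.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(name: str) -> bool:
    """True if all letters in `name` appear to come from one script."""
    return len(scripts_used(name)) <= 1


print(is_single_script("coco42"))              # Latin letters plus digits
print(is_single_script("\u0441at"))            # Cyrillic es + Latin "at"
print(is_single_script("\u0441\u043e\u0441\u043e"))  # all-Cyrillic look-alike of "coco"
```

The last call shows the limit of the rule: an all-Cyrillic name passes the check even though it looks like a Latin one.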

banx 2010-08-12 18:58:30

Doesn't OpenID (as used here on SO) make spoofing another user's name quite easy? I got the impression that I could make my name appear as "Jon Skeet" if I wanted. Yet, I don't see that sort of thing being a problem on SO.

Craig McQueen 2010-08-13 02:38:02

How about allowing single-script names, but putting a little logo next to each username identifying the script? That way Latin "coco" and Cyrillic "сосо" are distinguishable.

Paul Johnson 2010-08-13 11:33:36

+4 A:

HTTP authentication? There could be some problems with sending a Unicode username (and/or password) over existing protocols. One case that I have run into before is Basic authentication: there is no well-defined way to send Unicode usernames/passwords in the Basic auth headers.
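
To illustrate the ambiguity, here is a small Python sketch that builds a Basic `Authorization` value by hand. The `basic_auth_header` helper and the sample credentials are invented for the example; the point is only that the bytes on the wire depend on which charset the client happens to pick.

```python
import base64


def basic_auth_header(user: str, password: str, charset: str) -> str:
    """Build a Basic auth header value, encoding credentials with `charset`."""
    raw = f"{user}:{password}".encode(charset)
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Same credentials, two plausible client-side encodings.
as_utf8 = basic_auth_header("J\u00f6rg", "secret", "utf-8")
as_latin1 = basic_auth_header("J\u00f6rg", "secret", "latin-1")

print(as_utf8)
print(as_latin1)
print(as_utf8 == as_latin1)  # the server sees two different byte strings
```

A server that decodes with the other charset would look up a different username, so the login fails or, worse, matches the wrong account. Pure-ASCII credentials produce the same header under both encodings.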

Mike 2010-08-12 18:46:13

[UTF-7](http://en.wikipedia.org/wiki/UTF-7) allows you to transmit Unicode code-points as ASCII.
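
For what it's worth, a quick Python sketch of that idea using the built-in `utf-7` codec; the sample name is just an illustration.

```python
name = "Rafa\u0142"  # contains LATIN SMALL LETTER L WITH STROKE

wire = name.encode("utf-7")             # bytes that are safe for ASCII-only channels
is_ascii = all(b < 128 for b in wire)   # every byte fits in 7 bits
restored = wire.decode("utf-7")         # the receiver must know to decode as UTF-7

print(wire)
print(is_ascii)
print(restored == name)
```

The round trip only works if both ends agree that the field is UTF-7.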

dreamlax 2010-08-15 22:01:59

But with utf-7, or any other encoding, you need to own the client and the server code to make sure that they will properly decode the data.

Mike 2010-08-16 22:04:44

+1 A:

While you can go ahead and allow unicode, understand that some usernames will not work as expected thanks to different cultures applying different rules to the same characters.

Consider the basic case for breaking case sensitivity: in Turkish, the usernames "Id1" and "id1" are different (Turkish has two different Is, one with a dot and one without, resulting in 2 capital and 2 small letters that do not follow the same capitalization rules as English). So while any Turkish person can enter their name in their own language, the program will not treat their name as they expect - instead it will undergo a strange transformation into mutant English.
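
A small Python sketch of the mismatch. Python's `str.lower()` and `str.upper()` are locale-independent, so they apply the default (English-like) rule; `turkish_lower` is a hypothetical helper written for this example to show what a Turkish user would expect instead.

```python
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_lower(s: str) -> str:
    """Lowercase `s` with the Turkish I rules: I -> dotless i, dotted capital I -> i."""
    return s.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i").lower()


# Default rule: capital I lowercases to dotted i.
print("Id1".lower() == "id1")

# Turkish rule: capital I lowercases to dotless i, so "Id1" and "id1" stay distinct.
print(turkish_lower("Id1") == "id1")

# The default rule also maps dotless i up to plain I, merging two Turkish letters.
print(DOTLESS_SMALL_I.upper() == "I")
```

A site that compares usernames case-insensitively with the default rule would therefore treat two names as equal that a Turkish speaker reads as different.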

Special Latin characters in European languages have similar overlaps, making it seemingly random which language's rules get applied to them. Other regions of the world have similar shared characters where the rules of use differ - in some cases national and cultural hatreds could result in some very angry people when the characters making up their username are treated as if it was written in the language of their hated enemy (due to that being the operating system's default setting for those foreign characters).

David 2010-08-12 18:52:01

So, we need PSP (politics sensitive programming). Shame on the Unicode consortium for not sorting all that out for us. ☺

Craig McQueen 2010-08-13 02:41:10

+2 A:

Your observation is not always true. And, the choice of ASCII is largely human factors rather than technical or security issues.

In most cases, it is just for ease of programming. A programmer never knows whether all the software, libraries and utilities behind a website will break on some characters. Why risk the website's development when ASCII works well? Also, some packaged web software hinders the use of Unicode in usernames. This contributes to many websites only supporting usernames in ASCII.

Theoretically, all current software can handle 8-bit data well. There is no problem with storage or transmission nowadays. Even where a protocol cannot, the data can be translated to UTF-7 or another transformation scheme.

There are some issues with Unicode, mostly on the data processing side: display, fonts, readiness of software and libraries for non-BMP characters, collation, comparison, input methods, writing directions. Administrators might not be knowledgeable enough to handle them. Depending on the nature of the website, it could be a problem, but mostly not.
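
Two of these processing issues, comparison and non-BMP characters, can be shown in a few lines of Python. This is only a sketch; `canonical_username` is a made-up helper, and NFC is one reasonable choice of normalization form, not the only one.

```python
import unicodedata

composed = "Jos\u00e9"     # e with acute as a single code point
decomposed = "Jose\u0301"  # plain e followed by COMBINING ACUTE ACCENT


def canonical_username(name: str) -> str:
    """Normalize to NFC so visually identical names compare equal."""
    return unicodedata.normalize("NFC", name)


print(composed == decomposed)  # raw comparison sees two different strings
print(canonical_username(composed) == canonical_username(decomposed))

# A non-BMP character is one code point but two UTF-16 code units,
# which trips up software that counts or truncates by code unit.
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2
print(len(emoji), utf16_units)
```

Without normalization, two users could register names that render identically but are stored as different byte sequences.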

For admin purposes, it is not easy to type some exotic characters, which makes it hard for an admin to search for users. It is also hard for an admin to keep offensive usernames in foreign languages off the website.

However, it is not uncommon for Chinese usernames to be used on Chinese websites, so usernames are not always in ASCII. The same goes for other cultures and languages. Some global projects accept nearly all kinds of Unicode characters. Wikipedia is an example.

OmniBus 2010-08-13 10:28:23

Or, we could just stop giving a crap about what a username looks like, and whether WE can pronounce/remember it. That should be the USER'S concern. If no one remembers you, that's your loss. And, as for name spoofing, that is almost unavoidable in any case. And yet, rarely do you ever hear of username spoofs.

Imagine a forum, imagine someone posts with an account that LOOKS identical to yours. You get in trouble, say you didn't do it, post a link to your history, see the post isn't there. Click the profile of the guy who ACTUALLY posted it, and bam, you have his profile. He's now bannable.

Having the same name doesn't mean you have the same user data. Any application that doesn't make it easy for you to differentiate two similar users is piss poor anyway and needs to be rewritten.

Clint 2010-08-13 12:06:01

This does not answer the question. It would be better as a comment under one of the other answers.

Skip Head 2010-08-15 14:37:30

A crap, I do not give it.

Clint 2010-08-16 12:02:33

+2 A:

While it is questionable why there should be a username at all, and not just a 'password' to identify a user, I think there's no reason to disallow Unicode usernames.

What's more important is that the password be validated in a language-agnostic way: it should treat keystrokes the same regardless of the user's keyboard layout. This means "שלום" and "akuo" would be the same password. This is important because users often don't see the password characters they are typing, and they get severely pissed if CAPS LOCK is on.
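
A sketch of that idea in Python, assuming the standard Hebrew keyboard layout. The `HEBREW_TO_QWERTY` table is a hypothetical, deliberately partial mapping (only a handful of keys) and `to_key_positions` is an invented helper; a real implementation would need full tables for every layout it wants to support.

```python
# Partial, illustrative map from Hebrew letters to the QWERTY key they share.
HEBREW_TO_QWERTY = {
    "\u05e9": "a",  # shin
    "\u05dc": "k",  # lamed
    "\u05d5": "u",  # vav
    "\u05dd": "o",  # final mem
    "\u05d3": "s",  # dalet
    "\u05d2": "d",  # gimel
}


def to_key_positions(typed: str) -> str:
    """Translate typed characters to the QWERTY keys that produced them.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(HEBREW_TO_QWERTY.get(ch, ch) for ch in typed)


hebrew_typed = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word from the example
print(to_key_positions(hebrew_typed))
print(to_key_positions(hebrew_typed) == to_key_positions("akuo"))
```

The password would then be hashed and compared on the key-position form, so the same physical keystrokes match whichever layout was active.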

Pavel Radzivilovsky 2010-08-23 09:04:31
