In 2010, would you serve URLs containing UTF-8 characters in a large web portal?

Unicode characters are forbidden as per the RFC on URLs (RFC 3986). They would have to be percent-encoded to be standards-compliant.

My main point, though, is serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.

All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:

  • URLs getting copy+pasted into text files, emails, even web sites with a different encoding
  • HTTP Client libraries
  • Exotic browsers, RSS readers

Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?

Is there some magic way of serving nice-looking URLs in HTML

http://www.example.com/düsseldorf?neighbourhood=Lörick

that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?

+2  A: 

Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:

http://stackoverflow.com/questions/2742852/unicode-characters-in-urls

However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:

http://stackoverflow.com/questions/2742852/これは、これを日本人のテキストです

So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...
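As a sketch of how such slug-insensitive routing might work (this is a hypothetical illustration, not Stack Overflow's actual implementation), only the numeric identifier is used for the lookup, so garbling in the trailing part is harmless:

```python
import re

def resolve_question(path):
    """Return the question ID from a path like /questions/<id>/<slug>.

    The slug after the ID is ignored entirely, so it can be garbled,
    translated, or missing without breaking the link.
    """
    m = re.match(r"/questions/(\d+)(?:/.*)?$", path)
    return int(m.group(1)) if m else None

# Both of these resolve to the same question:
assert resolve_question("/questions/2742852/unicode-characters-in-urls") == 2742852
assert resolve_question("/questions/2742852/any-garbled-slug-at-all") == 2742852
```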

Dean Harding
Hmmm, *very* clever thinking! It could still be that some clients choke on the characters no matter where they are located in the string, but it *would* eliminate all the problems with ordinary garbling when copy+pasting a URL, which I think is the most important part. Hadn't looked at SO's URL that way yet. Thanks!
Pekka
Well, this still leaves the word "questions" untranslated, plus there can be stuff after the hash #, which follows the entire URL. Very nice trick, though!
Evgeny
+7  A: 

Use percent encoding. Modern browsers will take care of display and paste issues and make it human-readable, e.g. http://ko.wikipedia.org/wiki/위키백과:대문

Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only part of it, it will remain unencoded.

Tgr
Wow, actually you're right! If you cut'n'paste a %-encoded URL, Firefox will turn it into the correct thing for display.
Dean Harding
Wow, I wasn't aware of this. Chances are this is the best solution!
Pekka
+6  A: 

What Tgr said. Background:

http://www.example.com/düsseldorf?neighbourhood=Lörick

That's not a URI. But it is an IRI.

You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.

To encode an IRI into a URI, take the path and query parts, UTF-8-encode them, then percent-encode the non-ASCII bytes:

http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick
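This conversion can be sketched in Python, where `urllib.parse.quote` performs the UTF-8 encoding and percent-encoding in one step (the choice of `safe` characters here is an assumption; adjust it for your own URL scheme):

```python
from urllib.parse import quote

# quote() UTF-8-encodes the string and escapes each non-ASCII byte as %XX.
# Keep "/" unescaped in the path; escape everything non-ASCII in the query value.
path = quote("/düsseldorf", safe="/")
query = "neighbourhood=" + quote("Lörick", safe="")

url = "http://www.example.com" + path + "?" + query
print(url)
# http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick
```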

If there are non-ASCII characters in the hostname part of the IRI, e.g. http://例え.テスト/, they have to be encoded using Punycode instead.
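As a sketch, Python's built-in "idna" codec performs this hostname conversion label by label (it implements the older IDNA 2003 rules, which is an assumption worth checking against your requirements; newer IDNA 2008 rules need a third-party library):

```python
# Convert a Unicode hostname to its ASCII (Punycode) form.
# Each non-ASCII label is encoded separately and prefixed with "xn--".
host = "例え.テスト"
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)
```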

Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia has been using this for years, e.g.:

http://en.wikipedia.org/wiki/ɸ

The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...

...well, you know.

bobince
I know. One day, somebody has to take a big club and smack those Lynx developers on the head. Thanks for the excellent background info.
Pekka
A: 

While all of these comments are true, you should note that since ICANN approved Arabic (Persian) and Chinese characters for registration as domain names, all of the browser-making companies (Microsoft, Mozilla, Apple, etc.) have to support Unicode in URLs (without any encoding), and those should be searchable by Google etc.

So this issue will be resolved ASAP.

Nasser Hadjloo
@Nasser: True - we have special characters in german domains now, too - but those are encoded into ASCII characters using [Punycode](http://en.wikipedia.org/wiki/Punycode). While they are sure to work in major browsers, it will be a long time before every HTTP client library and exotic application will be able to deal with unencoded Unicode characters.
Pekka
@Pekka, I'm not sure, but as I heard, all browsers have to support Unicode URLs by the 4th quarter of 2010. (I'm not sure.)
Nasser Hadjloo