How should I format URLs with special/international characters?

Currently I try to make URLs "look good", so that:

www.myhost.com/this is a test, do you know how?

is converted to:

www.myhost.com/this_is_a_test_do_you_know_how

I know some international letters can be converted (ü = ue, æ = ae, å = aa), and some characters can simply be removed. In general I try to make the URL look "good", but is that stupid?
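A minimal Python sketch of the conversion described above, for illustration only. The `TRANSLITERATIONS` table and the `slugify` name are hypothetical, and the table covers just a few lowercase letters; a real one would need to be extended per language.

```python
import re
import unicodedata

# Hypothetical, deliberately incomplete mapping (lowercase letters only).
TRANSLITERATIONS = {
    "ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss",
    "æ": "ae", "ø": "oe", "å": "aa",
}


def slugify(title):
    """Turn a free-text title into an ASCII, underscore-separated URL segment."""
    # 1. Apply the explicit multi-letter transliterations (ü -> ue, ...).
    text = "".join(TRANSLITERATIONS.get(ch, ch) for ch in title)
    # 2. Strip remaining accents (é -> e) and drop anything still non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # 3. Remove punctuation, keeping letters, digits, whitespace, "_" and "-".
    text = re.sub(r"[^A-Za-z0-9\s_-]", "", text)
    # 4. Collapse each run of whitespace into a single underscore.
    return re.sub(r"\s+", "_", text.strip())


print(slugify("this is a test, do you know how?"))
# this_is_a_test_do_you_know_how
```

Text in scripts with no ASCII equivalent (Chinese, Japanese, Arabic) comes out as an empty string with this approach, which is exactly the problem raised next.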

But what do I do with Chinese, Japanese, or Arabic characters, which have nothing to do with our Western ASCII character set?

I really don't like the idea of rewriting the URL with hex codes, so right now I just use my internal unique ID if the URL contains too many "non-convertible" characters.
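The fallback to an internal ID could look roughly like the Python sketch below. Everything here is an assumption for illustration: the `url_segment` name, the `max_loss` threshold of 30%, and the rule that a character counts as "non-convertible" when Unicode decomposition leaves no ASCII behind (so letters like æ would need an explicit mapping first).

```python
import re
import unicodedata


def _to_ascii(text):
    # Decompose accented letters (é -> e + accent), then drop non-ASCII parts.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def url_segment(title, internal_id, max_loss=0.3):
    """Return a readable slug, or the internal ID if too much would be lost."""
    meaningful = [ch for ch in title if ch.isalnum()]
    lost = [ch for ch in meaningful if not _to_ascii(ch)]
    # Hypothetical rule: fall back when more than max_loss of the letters
    # and digits have no ASCII form at all.
    if not meaningful or len(lost) / len(meaningful) > max_loss:
        return str(internal_id)
    cleaned = re.sub(r"[^A-Za-z0-9\s_-]", "", _to_ascii(title))
    return re.sub(r"\s+", "_", cleaned.strip())


print(url_segment("this is a test", 42))   # this_is_a_test
print(url_segment("日本語の記事", 42))        # 42
```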

+1  A: 

What language are you using? PHP includes a function filter_var() that seems to do most of what you want. See http://us.php.net/manual/en/function.filter-var.php.

In general, the cost of making human-readable ASCII strings from arbitrary string input is probably too great to be worth it. If the user gives you a Chinese hanzi, what are you going to do? Look it up in a dictionary and output the result in pinyin?

The best, most general solution is simply to take the input, format it as UTF-8, then url-encode the result. This will make non-Latin text unreadable, but there is no good, general solution for those languages anyway. The language you're using almost certainly has library functions that can make this easy.
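As an illustration of "format it as UTF-8, then url-encode", here is a small Python sketch using the standard library's `urllib.parse.quote`. The `encode_path_segment` wrapper is a hypothetical name; other languages have equivalent library functions.

```python
from urllib.parse import quote


def encode_path_segment(text):
    """Percent-encode one URL path segment.

    quote() encodes the string as UTF-8 by default and then writes every
    byte outside the unreserved set as %XX. safe="" means "/" is encoded
    too, so the result is always a single path segment.
    """
    return quote(text, safe="")


print(encode_path_segment("blåbær"))  # bl%C3%A5b%C3%A6r
print(encode_path_segment("中文"))     # %E4%B8%AD%E6%96%87
```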

JSBangs
Well, I'm using ASP.NET.
A: 

But doesn't Google take advantage of the URL? If some of the text from a given article is in the URL, won't Google's search engine use it? And if there really is no good way of handling non-ASCII letters, are those languages then given lower priority on the "Google internet"?

A: 

Have a look at, say, http://ja.wikipedia.org/ . If you mouse over the links, they show up in the status bar as Japanese characters. It doesn't look so Japanese in the location bar when you follow the link, but that probably can't be helped. I haven't checked, but I assume it's all UTF-8, hex-encoded.
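One way to check that assumption is to decode a percent-encoded path as UTF-8, which is what a browser does when it shows the readable form. A short Python sketch; the sample string is just the UTF-8 percent-encoding of 日本 and is not taken from a specific page, and `display_form` is a hypothetical helper name.

```python
from urllib.parse import unquote


def display_form(encoded_segment):
    """Decode %XX escapes as UTF-8 to recover the human-readable text."""
    return unquote(encoded_segment, encoding="utf-8")


print(display_form("%E6%97%A5%E6%9C%AC"))  # 日本
```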

d__
Yes, this is possible, but in my opinion a very *BAD* idea. I encountered it sometimes and you know what? I had problems because I didn't have the "right" keyboard on the computer I was using at that time!
Davide
Not sure if I understand. What problems did you have, and what effect did the keyboard have? My understanding is that the HTML text is written entirely in ASCII characters, and the browser interprets and renders the encoded non-ASCII characters if it can, so the keyboard shouldn't enter into it.
d__
Yeah, the ja. site just works with the Japanese letters, without worrying about the ASCII letters.
A: 

If you're using .NET, why not use:

Server.UrlEncode( myURL );

But if you want to use Scandinavian characters, or any other characters you like, you just need to set up the rule in your URL rewriting component. The DynamicWeb CMS software, for example, uses all available characters and only replaces spaces with underscores ('_').

Like this URL:

http://www.gynækologen.dk/Undersøgelser_og_behandlinger.aspx

You can see the æ in the domain as well as the ø in the page name.
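For tools that only accept ASCII, such a URL has to be converted in two different ways: the host name via IDNA (the "xn--" Punycode form) and the path via UTF-8 percent-encoding. A rough Python sketch, assuming a simple URL without port or user info; `to_ascii_url` is a hypothetical helper, and Python's built-in "idna" codec implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only form of a URL with non-ASCII host and path."""
    parts = urlsplit(url)
    # Host names are not percent-encoded; each label is converted to its
    # "xn--" Punycode form instead.
    host = parts.hostname.encode("idna").decode("ascii")
    # The path is encoded as UTF-8 and percent-escaped; "/" is kept as-is.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


ascii_url = to_ascii_url(
    "http://www.gynækologen.dk/Undersøgelser_og_behandlinger.aspx"
)
print(ascii_url)
```

Running the path through an encoder a second time turns each `%` into `%25`, so the encoding step should be applied exactly once.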

balexandre
Yeah, I thought about it, but again, you can't just paste the URL into a site that doesn't support it. Like: http://validator.w3.org/check?uri=http%3A%2F%2Fwww.gyn%C3%A6kologen.dk%2FUnders%25C3%25B8gelser_og_behandlinger.aspx
And hello to you balexandre, I am Danish as well :)
Ahh, seems like it's just the host name.
Normal name: http://www.gynaekologen.dk ;-)
balexandre