For a web application, we need to link to some user-generated content. A user types in a title, e.g. for a product, and we generate an SEO-friendly URL for that product, like this:

title: a nice product

www.user.com/product/a-nice-product

title: أبجد هوز

www.user.com/product/أبجد هوز

The problem is that those foreign-language URLs aren't supported, and a browser refuses to open those links. I've seen WordPress setups support that kind of URL, so I guess it's possible to do this.

Does anyone know how we should support this in PHP?

Wikipedia handles this just fine: http://ar.wikipedia.org

A: 

You might need to use IDNA encoding on the non-ASCII portion of the URL.

http://en.wikipedia.org/wiki/Internationalized_domain_name

Mark Ransom
IDNA is only for the actual domain name. The remainder of the URL is encoded using percent-encoding (`urlencode()` in PHP).
TRiG
Can we use urlencode() to encode all non-ASCII characters?
Jorre
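To illustrate the split described in the comments above (IDNA for the host name, percent-encoding for everything after it), here is a minimal sketch. The function name buildProductUrl and the host bücher.example are made up for illustration, and the code assumes UTF-8 input and that PHP's intl extension is installed, since idn_to_ascii() comes from it.

```php
<?php
// Hypothetical helper, not from the thread. Assumes UTF-8 input and the intl extension.
function buildProductUrl(string $host, string $title): string
{
    // Host name: converted to its ASCII (Punycode) form with IDNA.
    $asciiHost = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($asciiHost === false) {
        throw new InvalidArgumentException('Host name could not be converted with IDNA');
    }

    // Path segment: percent-encoded. rawurlencode() writes a space as %20,
    // whereas urlencode() writes it as "+", which is the form-data convention.
    return 'http://' . $asciiHost . '/product/' . rawurlencode($title);
}
```

On the urlencode() question: both functions percent-encode non-ASCII bytes, so either will produce a working link, but rawurlencode() is usually the better fit for a path segment because of how it treats spaces.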
A: 

You're in trouble, I'm afraid. The encoding of the URL is at the discretion of the browser. I've encountered the same problem when trying to support URLs with Norwegian special characters, and it's simply not possible to do consistently.

You may be able to redirect a browser to the UTF-8 URL, but it might reply to you in ISO. It gets even worse in some cases, where browsers (Firefox, for instance) will mix ISO and UTF-8 encoding in the same URL (this happens particularly with GET parameters).

My suggestion is simple: don't do it. Use either English (better SEO too!) or spell it phonetically.

BjornS
For me that's not a problem, but the clients of my web app are international and will use titles in their own language.
Jorre
Check out http://ar.wikipedia.org, they are handling this just fine, so it must be possible?
Jorre
@Jorre they have it in the same ugly urlencoded way. It's the browser that makes it look nice. Better to transliterate it.
Col. Shrapnel
A: 

Although the URL itself only allows US-ASCII characters, you can use Unicode characters in the URI path if you encode them as UTF-8 and then convert the resulting bytes into US-ASCII characters using percent-encoding:

A system that internally provides identifiers in the form of a different character encoding, such as EBCDIC, will generally perform character translation of textual identifiers to UTF-8 [STD63] (or some other superset of the US-ASCII character encoding) at an internal interface, thereby providing more meaningful identifiers than those resulting from simply percent-encoding the original octets.

So you can do something like this (assuming UTF-8):

$title = 'أبجد هوز';
$path = '/product/'.rawurlencode($title);
echo $path;  // "/product/%D8%A3%D8%A8%D8%AC%D8%AF%20%D9%87%D9%88%D8%B2"

Although the URI path is actually percent-encoded, most modern browsers will display the characters this sequence represents in Unicode when UTF-8 is used.
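For the receiving side, a minimal sketch of the round trip may help: the percent-encoded segment is decoded back to the original UTF-8 title with rawurldecode(). This assumes the source file and the request are both UTF-8. In a real application the path would come from the request (for example $_SERVER['REQUEST_URI']) rather than from a local variable.

```php
<?php
// Assumes this source file is saved as UTF-8.
$title = 'أبجد هوز';

// Outgoing link: percent-encode only the title segment, not the slashes.
$path = '/product/' . rawurlencode($title);

// Incoming request: take the last path segment and decode it.
// Plain string functions are used here because basename() is locale-sensitive.
$segment = substr($path, strrpos($path, '/') + 1);
$decoded = rawurldecode($segment);

var_dump($decoded === $title); // bool(true)
```

The decoded string can then be used to look up the product, for instance by a stored title or slug column.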

Gumbo
That sounds interesting, but isn't it better for e.g. /product/españa to make this /product/espana instead of /product/espa%F1a ?
Jorre
Great stuff mate, thanks for your help already!
Jorre
Make sure you check these URLs for copy-paste-ability and bookmark-ability. In my experience they will often work fine when you serve them to the browser as an anchor tag or as a redirect URL, but once the browser has to store and retrieve the URL itself, all bets are off. Hope you find a solution and good luck!
BjornS
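The transliteration idea raised in the comments (turning /product/españa into /product/espana) could be sketched as follows. This is only one possible approach, not something any of the posters supplied: the function name slugify is hypothetical, and the code assumes UTF-8 input and that PHP's intl extension is installed, since transliterator_transliterate() comes from it.

```php
<?php
// Hypothetical helper, not from the thread. Assumes UTF-8 input and the intl extension.
function slugify(string $title): string
{
    // Any-Latin romanizes non-Latin scripts, Latin-ASCII removes diacritics,
    // Lower() lowercases the result.
    $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $title);
    if ($ascii === false) {
        throw new RuntimeException('Transliteration failed');
    }

    // Collapse every run of characters outside a-z and 0-9 into a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);

    // Drop hyphens left over at either end.
    return trim($slug, '-');
}
```

With intl available, slugify('España') should give espana and slugify('A nice product') should give a-nice-product. The romanization of Arabic or other non-Latin titles depends on the ICU version behind intl, and a title made up only of symbols can produce an empty slug, so a fallback such as the percent-encoded title or a numeric ID is worth keeping.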