I'm developing an international site which uses UTF-8 to display non-English characters. I'm also using friendly URLs which contain the item name. Obviously I can't use the non-English characters in the URL.

Is there some sort of common practice for this conversion? I'm not sure which English characters I should be replacing them with. Some are quite obvious (like è to e), but there are other characters I am not familiar with (such as ß).

+2  A: 

Obviously I can't use the non-English characters in the URL.

In fact, you can. The Wikipedia software (built in PHP) supports this, e.g. en.wikipedia.org/wiki/☃.

Notice that you need to encode the URL appropriately, as shown in the other answers.

Konrad Rudolph
Cool. Is this documented somewhere? I always thought allowed URL chars were limited to this weird us-ascii subset.
Martin Wickman
@wic RFC 1738 specifies URLs and how they are encoded/decoded.
Håvard S
You *must* encode it with percent-encoding. It’s only due to the tolerance of browsers that they do it for you if you didn’t. But don’t rely on that.
Gumbo
RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt) states that the only allowed characters in a URI are English letters, digits and certain symbols. Everything else should be percent-encoded. As Gumbo said, the only thing that makes the URL Konrad specified work is the fact that most major browsers do the proper encoding before sending the HTTP request. However, any link you generate should conform to RFC 3986, especially if your URLs will be consumed by HTTP agents other than the major browsers.
Franci Penov
@Gumbo: thanks, I’ve looked it up in the RFC and corrected the text.
Konrad Rudolph
… and yet, despite the correction, this answer continues to garner downvotes. Please, people, leave a reason when doing this, like @Franci has.
Konrad Rudolph
+1  A: 

Use rawurlencode to encode your name for the URL, and rawurldecode to convert the name in the URL back to the original string. These two functions convert strings to and from their percent-encoded form; rawurlencode follows RFC 3986 as of PHP 5.3 (RFC 1738 in earlier versions).
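
A minimal sketch of that round trip, assuming the item name is a UTF-8 string and the source file is saved as UTF-8. The item name and the example.com base URL are made up for illustration:

```php
<?php
// Hypothetical item name; any UTF-8 string is handled the same way.
$name = "Straße último año";

// Percent-encode every byte that is not an unreserved URI character.
$segment = rawurlencode($name);

// Hypothetical base URL, for illustration only.
$url = "http://example.com/items/" . $segment;

// Decoding the path segment restores the original bytes.
$decoded = rawurldecode($segment);

echo $url, "\n";                   // .../items/Stra%C3%9Fe%20%C3%BAltimo%20a%C3%B1o
var_dump($decoded === $name);      // bool(true)
```

Note that the space becomes %20 here; urlencode would produce + instead, which is meant for query strings rather than path segments.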

Håvard S
+2  A: 

I normally use iconv() with the 'ASCII//TRANSLIT' option. This takes input like:

último año

and produces output like:

'ultimo a~no

Then I use preg_replace() to replace white spaces with dashes:

'ultimo-a~no

... and remove unwanted chars, e.g.

[^a-z0-9-]

It's probably useless with Arabic or Chinese but it works fine with Spanish, French or German.
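
Put together, the steps above might look roughly like the sketch below. The function name slugify is made up for illustration, and the final dash clean-up is an extra step not described above. What iconv() returns for 'ASCII//TRANSLIT' depends on the iconv implementation and the current locale, so the result for accented input can differ between systems:

```php
<?php
// Hypothetical helper; the name is not part of PHP.
function slugify(string $text): string
{
    // Transliterate to ASCII. Output varies by iconv implementation
    // and locale, so treat this step as best effort.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        // Fall back to the raw input; non-ASCII bytes are stripped below.
        $ascii = $text;
    }

    $slug = strtolower($ascii);
    // Replace runs of white space with a single dash.
    $slug = preg_replace('/\s+/', '-', $slug);
    // Remove everything except lowercase letters, digits and dashes.
    $slug = preg_replace('/[^a-z0-9-]/', '', $slug);
    // Extra step (an assumption): collapse repeated dashes and trim the ends.
    $slug = preg_replace('/-+/', '-', $slug);
    return trim($slug, '-');
}

echo slugify("último año"), "\n";
echo slugify("Hello  World 2010!"), "\n";
```

If several items can end up with the same slug, you will still need something unique in the URL, such as the item's ID.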

Álvaro G. Vicario
+3  A: 

You can use UTF-8 encoded data in URL paths. You just need to additionally encode it with percent-encoding (see rawurlencode):

// ß (U+00DF) = 0xC39F (UTF-8)
$str = "\xC3\x9F";
echo '<a href="http://en.wikipedia.org/wiki/'.rawurlencode($str).'">'.$str.'</a>';

This will echo a link to http://en.wikipedia.org/wiki/ß. Modern browsers will display the character ß itself in the location bar instead of the percent-encoded representation of that character in UTF-8 (%C3%9F).

If you don’t want to use UTF-8 but only ASCII characters, I suggest using transliteration, as Álvaro G. Vicario suggested.

Gumbo
I was always planning on using ASCII characters, but I didn't realise some browsers could interpret UTF-8 in the URL. Which "modern browsers" can handle this?
Alex
A: 

Last time I tried (about a week ago), UTF-8 (specifically Japanese) characters worked fine in URLs without any additional encoding. They even looked right in the address bar of every browser I tested with (Safari, Chrome and Firefox, all on Mac); I have no idea what browser my girlfriend was using on Windows. Aside from most Windows installations I've run across just showing squares for Japanese characters because they lack the required fonts, it seems to work fine there as well.

The URL I tried is: http://www.webghoul.de.private-void.net/cache/black-f-with-あい-50.png (WMD does not seem to like it)

Proof by screenshot

So it might not actually be allowed by the spec, but from what I've seen it works well across the board, except maybe in editors that like the spec a lot ;-)

I wouldn't actually recommend using these types of characters in URLs, but I also wouldn't make it a first priority to "fix".

Kris
See the comments on Konrad's response. The fact that this happens to work is just a side effect of particular HTTP agent implementations.
Franci Penov
@Franci Penov, you did read that I expressly state that I am not recommending it, right? I'm just saying it doesn't have to be a show-stopping bug if there are more important issues to fix.
Kris
Yes, I read your answer carefully. The extremely wrong part is "for what i've seen it works well across the board, except maybe in editors that like the spec a lot" - what you see in the browser's address bar has nothing to do with the actual URLs exchanged over the wire. However, the OP is not on the browser side; he's on the server side and should be concerned with what goes over the wire, not what a particular user might see in a particular HTTP agent.
Franci Penov
Does the term "sense of humour" mean anything to you? I think you should consider growing one.
Kris