I rewrite URLs to include the title of user-generated travel blogs.

I do this for both readability of URLs and SEO purposes.

 http://www.example.com/gallery/280-Gorges_du_Todra/

The first integer is the id, the rest is for us humans (but is irrelevant for requesting the resource).

Now people can write titles containing any UTF-8 character, but most are not allowed in the URL. My audience is generally English speaking, but since they travel, they like to include names like

 Aït Ben Haddou

What is the proper way to transliterate this for display in a URL, using PHP on Linux?

So far I've seen several solutions:

  1. Just strip all disallowed characters and replace spaces. This has strange results:
    'Aït Ben Haddou' → /gallery/280-At_Ben_Haddou/
    Not really helpful.

  2. Just strip all disallowed characters, replace spaces, and leave the character code behind (as stackoverflow.com does, most likely because of the 'regex hammer' used).
    This gives strange results: 'tést tést' → /questions/0000/t233st-t233st

  3. Translate to the 'nearest equivalent':
    'Aït Ben Haddou' → /gallery/280-Ait_Ben_Haddou/
    But this goes wrong for German; for example, 'ü' should be transliterated as 'ue'.

For me, as a Dutch person, the 3rd result 'looks' the best.
I'm quite sure, however, that (1) many people will have a different opinion and (2) it is just plain wrong in the German example.

Another problem with the 3rd option is: how do I find all the characters that can be converted to a 7-bit ASCII equivalent?

So the question is:

  1. What, in your opinion, is the most desirable result (within technical limits)?

  2. How can it be solved technically (to reach the desired result) with PHP?

A: 

IDNA

Doug Currie
I'm aware of internationalized domain names, but they do not solve the problem for the resource path, nor are they very readable.
Jacco
+3  A: 

To me the third is most readable.

You could use a little dictionary, e.g. ï -> i and ü -> ue, to specify how you'd like various characters to be translated.
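A minimal sketch of such a dictionary, assuming the titles and the source file are both UTF-8. The table contents and the name transliterate_with_map are only illustrations, not a complete mapping:

```php
<?php
// Hypothetical mapping table; extend it with the characters your titles actually contain.
$translitMap = array(
    'ï' => 'i',
    'é' => 'e',
    'ä' => 'ae',
    'ö' => 'oe',
    'ü' => 'ue',
    'ß' => 'ss',
);

// strtr() with an array replaces every key it finds and never re-scans
// text it has already replaced, so the entries cannot interfere with each other.
function transliterate_with_map($title, array $map)
{
    return strtr($title, $map);
}

echo transliterate_with_map('Aït Ben Haddou', $translitMap), "\n"; // prints: Ait Ben Haddou
```

Characters that are not in the table pass through unchanged, so a later step still has to strip or encode them.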

ChrisW
The problem is that it translates differently for French and German, let alone for Turkish or Vietnamese.
Jacco
Maybe use different dictionaries (if you know which language it is), or always use the simplest dictionary e.g. `ü -> u`.
ChrisW
+8  A: 

Ultimately, you're going to have to give up on the idea of "correct", for this problem. Translating the string, no matter how you do it, destroys accuracy in the name of compatibility and readability. All three options are equally compatible, but #1 and #2 suffer in terms of readability. So just run with it and go for whatever looks best — option #3.

Yes, the translations are wrong for German, but unless you start requiring your users to specify what language their titles are in (and restricting them to only one), you're not going to solve that problem without far more effort than it's worth. (For example, running each word in the title through dictionaries for each known language and translating that word's diacritics according to the rules of its language would work, but it's excessive.)

Alternatively, if German is a higher concern than other languages, make your translation always use the German version when one exists: ä→ae, ë→e, ï→i, ö→oe, ü→ue.

Edit:

Oh, and as for the actual method, I'd translate the special cases, if any, via str_replace, then use iconv for the rest:

$text = str_replace(array("ä", "ö", "ü", "ß"), array("ae", "oe", "ue", "ss"), $text);
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
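
Wrapped into a complete slug function, this could look roughly like the sketch below. The name make_slug and the underscore separator (taken from the gallery URLs in the question) are assumptions. What //TRANSLIT produces for a given character depends on the iconv implementation and the LC_CTYPE locale, so treat the output for anything outside the special cases as unpredictable:

```php
<?php
// Hypothetical helper: builds the human-readable part of the URL from a UTF-8 title.
function make_slug($title)
{
    // 1. Handle the special cases first (German-style replacements).
    $text = str_replace(array('ä', 'ö', 'ü', 'ß'), array('ae', 'oe', 'ue', 'ss'), $title);

    // 2. Let iconv approximate the rest. iconv() returns false on failure,
    //    in which case we fall back to the text from step 1.
    $ascii = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        $ascii = $text;
    }

    // 3. Collapse every run of characters other than ASCII letters and digits
    //    into a single underscore, then trim underscores from both ends.
    $slug = preg_replace('/[^A-Za-z0-9]+/', '_', $ascii);

    return trim($slug, '_');
}

echo make_slug('Gorges du Todra'), "\n"; // prints: Gorges_du_Todra
```

Step 3 also removes any '?' or punctuation that iconv may emit for characters it cannot approximate.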
Ben Blank
iconv //TRANSLIT looks useful, thanks
Jacco
+1  A: 

As an interesting side note, on SO nothing seems to really matter after the ID -- this is a link to this page:

http://stackoverflow.com/questions/465990/why-does-this-link-go-to-the-same-place

Obviously the motivation is to allow title changes without breaking links, and you may want to consider that feature as well.

wowest
I'm using a similar technique in a project, except that if the "slug" provided doesn't match, I automatically 301 to the correct "slug" (e.g. /questions/465990/why-does-this-link-go-to-the-same-place would redirect to /questions/465990/how-to-handle-diacritics-accents-when-rewriting-pretty-urls ). I think this is generally desirable from an SEO and general web standards standpoint: ideally, a resource should be accessible from one, or at most a few, URLs -- not the millions that SO would currently allow.
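A rough sketch of that redirect check for the /gallery/ URLs from the question. The function name and the $lookupSlug callback (which stands in for whatever storage maps an id to its current slug) are hypothetical:

```php
<?php
// Hypothetical helper: returns the canonical path to redirect to,
// or null when the request is already canonical or is not a gallery URL.
function canonical_redirect_target($requestPath, callable $lookupSlug)
{
    // Accept /gallery/280, /gallery/280/ and /gallery/280-Any_Title/
    if (!preg_match('#^/gallery/(\d+)(?:-[^/]*)?/?$#', $requestPath, $m)) {
        return null;
    }

    // Only the id identifies the resource; the slug is looked up from it.
    $slug = $lookupSlug((int) $m[1]);
    if ($slug === null) {
        return null; // unknown id; the caller would send a 404 instead
    }

    $canonical = '/gallery/' . $m[1] . '-' . $slug . '/';

    return ($requestPath === $canonical) ? null : $canonical;
}

// Usage sketch (strip the query string from REQUEST_URI first):
// $path   = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
// $target = canonical_redirect_target($path, $lookupSlug);
// if ($target !== null) { header('Location: ' . $target, true, 301); exit; }
```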
Frank Farmer
A: 

Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.

On the contrary, most are allowed. See for example Wikipedia's URLs - things like http://en.wikipedia.org/wiki/Café (aka http://en.wikipedia.org/wiki/Caf%C3%A9) display nicely - even if StackOverflow's highlighter doesn't pick them out correctly :-)

The trick is reading them reliably across any hosting environment; there are problems with CGI and Windows servers, particularly IIS, for example.
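
For illustration, assuming the title is a UTF-8 string, PHP's rawurlencode() produces the percent-encoded form shown above and rawurldecode() reverses it. The helper name title_to_path_segment is made up for this sketch:

```php
<?php
// Hypothetical helper: percent-encodes a UTF-8 title for use as one path segment.
function title_to_path_segment($title)
{
    // rawurlencode() follows RFC 3986: every byte outside A-Z a-z 0-9 - _ . ~
    // becomes %XX, so a multi-byte UTF-8 character turns into several %XX groups.
    return rawurlencode($title);
}

echo 'http://en.wikipedia.org/wiki/' . title_to_path_segment('Café'), "\n";
// prints: http://en.wikipedia.org/wiki/Caf%C3%A9
```

Browsers that display such URLs nicely decode the %XX groups again in the address bar; the request on the wire still carries the encoded form.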

bobince
Of course, support is spotty enough that your answer even provides an example of why you generally don't want to do this — SO's URL-recognizer cuts off the "é" in your Café example.
Ben Blank
My browser (Safari) does this for me automatically. Every URL I enter in the address field is sent in UTF-8.
Gumbo
I have to keep it compatible with the ancient computers found in internet cafés around the world. But maybe I should have said: old-school URLs :)
Jacco
A: 

Check this wonderful UTF-8 to ASCII library, which should correctly handle (and convert) most of the characters: php utf8 class

dusoft
A: 

Nice topic, I had the same problem a while ago.
Here's how I fixed it:

function title2url($string=null){
 // return false if empty
 if(empty($string)) return false;

 // replace spaces by "-"
 // convert accents to html entities (the input is expected to be UTF-8)
 $string=htmlentities(str_replace(' ', '-', $string), ENT_QUOTES, 'UTF-8');

 // remove the accent from the letter: keep the 1-2 letters in front of the
 // entity suffix (&eacute; -> e, &aelig; -> ae) and turn &euro; into E
 $string=preg_replace(array('@&([a-zA-Z]{1,2})(acute|grave|circ|tilde|uml|ring|elig|zlig|slash|cedil|strok|lig);@', '@&euro;@'), array('${1}', 'E'), $string);

 // now, everything but alphanumeric and -_ can be removed
 // also remove double dashes
 $string=preg_replace(array('@[^a-zA-Z0-9\-_]@', '@[\-]{2,}@'), array('', '-'), html_entity_decode($string, ENT_QUOTES, 'UTF-8'));

 // hand the finished slug back to the caller
 return $string;
}

Here's how my function works:

  1. Convert it to html entities
  2. Strip the accents
  3. Remove all remaining weird chars
Daan