ansaurus

Question

Route-problem regarding Url-encoded Umlauts (using the Zend-framework)

Answer 1

+1 A:

The u modifier makes the regexp expect utf-8 input. This would suggest that ZF expects utf-8 encoded input, and not ISO-8859-1 (I'm not too familiar with ZF, so I'm just guessing here).

If that's the case, you'll have to utf-8 encode the ü before using it in a URL. It would then become: mysite.local/citytest/M%C3%BCnchen

Note that since the rest of your application probably speaks ISO-8859-1 (Which is default for PHP <= 5), you will have to explicitly decode the variable with utf8_decode, before you can use it.

troelskn 2008-09-24 15:19:03

Answer 2

+1 A:

The problem is the following:

Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won’t return portions of it). From the pcrepattern man page;

In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available.

From Handling UTF-8 with PHP. Therefore it's actually irrelevant if your URL is ISO-8859-1 encoded (mysite.local/citytest/M%FCnchen) or UTF-8 encoded (mysite.local/citytest/M%C3%BCnchen), the default regex won't match.

I also made experiments with umlauts in URLs in Zend Framework and came to the conclusion that you wouldn't really want umlauts in your URLs. The problem is, that you cannot rely on the encoding used by the browser for the URL. Firefox (prior to 3.0) for example does not UTF-8 encode URLs entered into the address textbox (if not specified in about:config) and IE does have a checkbox within its options to choose between regular and UTF-8 encoding for its URLs. But if you click on links within a page both browsers use the URL in the given encoding (UTF-8 on an UTF-8 page). Therefore you cannot be sure in which encoding the URLs are sent to your application - and detecting the encoding used is not that trivial to do.

Perhaps it's better to use transliterated parameters in your URLs (e.g. change Ä to Ae and so on). There is a really simple way to this (I don't know if this works with every language but I'm using it with German strings and it works quite well):

function createUrlFriendlyName($name) // $name must be an UTF-8 encoded string
{
    $name=mb_convert_encoding(trim($name), 'HTML-ENTITIES', 'UTF-8');
    $name=preg_replace(
     array('/&szlig;/', '/&(..)lig;/', '/&([aouAOU])uml;/', '/&(.)[^;]*;/', '/\W/'),
     array('ss', '$1', '$1e', '$1', '-'),
     $name);
    $name=preg_replace('/-{2,}/', '-', $name);
    return trim($name, '-');
}

Stefan Gehrig 2008-09-27 18:25:27

ansaurus

tags:

views:

answers:

Route-problem regarding Url-encoded Umlauts (using the Zend-framework)

related questions