Hi all.

Today I stumbled upon a problem that looks like a bug in the Zend Framework. Given the following route:

<test>
    <route>citytest/:city</route>
    <defaults>
     <controller>result</controller>
     <action>test</action>
    </defaults>
    <reqs>
     <city>.+</city>
    </reqs>
</test>

and three URLs:

  • mysite.local/citytest/Berlin
  • mysite.local/citytest/Hamburg
  • mysite.local/citytest/M%FCnchen

The last URL does not match, so the correct controller is not called. Does anybody have a clue why?

FYI, we are using Zend Framework 1.0 (yes, I know that's ancient, but I am not in a position to change that :-/).

Edit: From what I hear, we are going to upgrade to Zend Framework 1.5.6 soon, but I don't know when, so a patch would be great.

Edit: I've tracked it down to the following line (Zend/Controller/Router/Route.php:170):

$regex = $this->_regexDelimiter . '^' . 
  $part['regex'] . '$' . 
  $this->_regexDelimiter . 'iu';

If I change the last line of that statement to

  $this->_regexDelimiter . 'i';

it works. From what I understand, the u modifier is for working with Asian characters. As I don't use them, I'm fine with that patch for now. Thanks for reading.

+1  A: 

The u modifier makes the regexp expect UTF-8 input. This suggests that ZF expects UTF-8 encoded input rather than ISO-8859-1 (I'm not too familiar with ZF, so I'm just guessing here).

If that's the case, you'll have to UTF-8 encode the ü before using it in a URL. It would then become: mysite.local/citytest/M%C3%BCnchen
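A minimal sketch of the difference, outside of ZF. It assumes the route's delimiter is `#` and reuses the `.+` requirement from the question; the variable names are made up for illustration. On current PHP versions, preg_match() returns false (not 0) when the u modifier is set and the subject is not valid UTF-8.

```php
<?php
// Raw bytes as PHP sees them after URL-decoding the path segment.
$latin1 = rawurldecode('M%FCnchen');    // ISO-8859-1: ü is the single byte 0xFC
$utf8   = rawurldecode('M%C3%BCnchen'); // UTF-8: ü is the two bytes 0xC3 0xBC

// Same shape as the pattern built in Route.php: delimiter, ^, requirement, $, flags.
// The '#' delimiter is an assumption about the route's configuration.
$withU    = '#^.+$#iu';
$withoutU = '#^.+$#i';

$latin1WithU    = preg_match($withU, $latin1);    // fails: subject is not valid UTF-8
$badUtf8        = (preg_last_error() === PREG_BAD_UTF8_ERROR);
$utf8WithU      = preg_match($withU, $utf8);      // matches
$latin1WithoutU = preg_match($withoutU, $latin1); // matches byte-wise

var_dump($latin1WithU, $badUtf8, $utf8WithU, $latin1WithoutU);
```

This is consistent with the observation in the question that dropping the u modifier makes the ISO-8859-1 URL match.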

Note that since the rest of your application probably speaks ISO-8859-1 (which is the default for PHP <= 5), you will have to explicitly decode the variable with utf8_decode before you can use it.
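A short sketch of that decoding step. It uses mb_convert_encoding() instead of utf8_decode(), since utf8_decode() is deprecated as of PHP 8.2; this assumes the mbstring extension is available. The variable names are only illustrative.

```php
<?php
// The UTF-8 encoded route parameter, as it would arrive from the URL.
$city = rawurldecode('M%C3%BCnchen');

// Convert to ISO-8859-1 for code that expects single-byte strings.
$cityLatin1 = mb_convert_encoding($city, 'ISO-8859-1', 'UTF-8');

// strlen() counts bytes: the ü shrinks from two bytes to one.
echo strlen($city), ' -> ', strlen($cityLatin1), "\n";
echo bin2hex($cityLatin1), "\n";
```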

troelskn
+1  A: 

The problem is the following:

Using the /u pattern modifier prevents words from being mangled but instead PCRE skips strings of characters with code values greater than 127. Therefore, \w will not match a multibyte (non-lower ascii) word at all (but also won’t return portions of it). From the pcrepattern man page;

In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available.

From Handling UTF-8 with PHP. Therefore it's actually irrelevant if your URL is ISO-8859-1 encoded (mysite.local/citytest/M%FCnchen) or UTF-8 encoded (mysite.local/citytest/M%C3%BCnchen), the default regex won't match.
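One way to avoid depending on how \w behaves in a particular PCRE build is to name the wanted characters through Unicode properties. The following is only a sketch: the pattern is a hypothetical requirement for a city name, it needs PCRE compiled with Unicode property support, and it still requires the input to be valid UTF-8.

```php
<?php
// Hypothetical requirement: letters and digits of any script, spaces and hyphens.
$pattern = '#^[\p{L}\p{N} -]+$#u';

$utf8City   = "M\xC3\xBCnchen"; // UTF-8 encoded
$latin1City = "M\xFCnchen";     // ISO-8859-1 encoded, invalid as UTF-8

$utf8Result   = preg_match($pattern, $utf8City);            // matches
$plainResult  = preg_match($pattern, 'Frankfurt am Main');  // matches
$latin1Result = preg_match($pattern, $latin1City);          // fails on invalid UTF-8

var_dump($utf8Result, $plainResult, $latin1Result);
```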

I also experimented with umlauts in URLs in Zend Framework and came to the conclusion that you don't really want umlauts in your URLs. The problem is that you cannot rely on the encoding the browser uses for the URL. Firefox (prior to 3.0), for example, does not UTF-8 encode URLs entered into the address bar (unless configured in about:config), and IE has a checkbox in its options to choose between regular and UTF-8 encoding for URLs. But if you click on a link within a page, both browsers send the URL in the encoding of that page (UTF-8 on a UTF-8 page). Therefore you cannot be sure in which encoding the URLs are sent to your application, and detecting the encoding is not trivial.
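If you have to accept both encodings anyway, a common heuristic is to check whether the bytes are valid UTF-8 and otherwise treat them as ISO-8859-1. The sketch below assumes those are the only two encodings that can arrive and that mbstring is available; toUtf8() is a made-up helper name. It is a guess, not a reliable detection: some ISO-8859-1 byte sequences also happen to be valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: normalise a URL-decoded parameter to UTF-8,
 * assuming it is either UTF-8 or ISO-8859-1.
 */
function toUtf8(string $raw): string
{
    // Valid UTF-8 (including plain ASCII) is passed through unchanged.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise assume ISO-8859-1 and convert.
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

// Both spellings of the URL end up as the same UTF-8 string.
$fromLatin1 = toUtf8(rawurldecode('M%FCnchen'));
$fromUtf8   = toUtf8(rawurldecode('M%C3%BCnchen'));
var_dump($fromLatin1 === $fromUtf8);
```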

Perhaps it's better to use transliterated parameters in your URLs (e.g. change Ä to Ae and so on). There is a really simple way to do this (I don't know if it works with every language, but I'm using it with German strings and it works quite well):

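function createUrlFriendlyName($name) // $name must be a UTF-8 encoded string
{
    // Turn every non-ASCII character into its HTML entity (ü becomes &uuml;).
    $name = mb_convert_encoding(trim($name), 'HTML-ENTITIES', 'UTF-8');
    // In order: ß to ss, ligatures to their two letters, umlauts to vowel + e,
    // any other entity to its first letter, remaining non-word characters to '-'.
    $name = preg_replace(
        array('/&szlig;/', '/&(..)lig;/', '/&([aouAOU])uml;/', '/&(.)[^;]*;/', '/\W/'),
        array('ss', '$1', '$1e', '$1', '-'),
        $name
    );
    // Collapse runs of hyphens and strip them from both ends.
    $name = preg_replace('/-{2,}/', '-', $name);
    return trim($name, '-');
}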
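On PHP 8.2 and later, the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated. A rough alternative for German-only input is a fixed replacement table with strtr(). This is a different technique from the function above, not a drop-in replacement: slugifyGerman() is a made-up name, the table covers only the German umlauts and ß, and any other non-ASCII character is simply turned into a hyphen.

```php
<?php
/**
 * Hypothetical helper: build a URL-friendly name from a UTF-8 German string.
 */
function slugifyGerman(string $name): string
{
    // UTF-8 byte sequences are written as escapes so the file encoding does not matter.
    $map = array(
        "\xC3\xA4" => 'ae', "\xC3\xB6" => 'oe', "\xC3\xBC" => 'ue', // ä ö ü
        "\xC3\x84" => 'Ae', "\xC3\x96" => 'Oe', "\xC3\x9C" => 'Ue', // Ä Ö Ü
        "\xC3\x9F" => 'ss',                                         // ß
    );
    $name = strtr(trim($name), $map);
    // Replace each run of anything that is not an ASCII letter or digit with one hyphen.
    $name = preg_replace('/[^A-Za-z0-9]+/', '-', $name);
    return trim($name, '-');
}

echo slugifyGerman("M\xC3\xBCnchen"), "\n";
echo slugifyGerman(' Frankfurt (Oder) '), "\n";
```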
Stefan Gehrig