Possible Duplicate:
How to handle diacritics (accents) when rewriting 'pretty URLs'

I want to replace special characters, such as Å Ä Ö Ü é, with "normal" characters (those between a-z and 0-9). And spaces should certainly be replaced with dashes, but that's not really a problem.

In other words, I want to turn this:

en räksmörgås

into this:

en-raksmorgas

What's the best way to do this?

Thank you in advance.

+1  A: 

Check out http://php.net/manual/en/function.strtr.php

<?php
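// Use the array form: the three-argument string form works byte by byte
// and breaks on multi-byte UTF-8 characters such as ä, å and ö.
$addr = strtr($addr, array('ä' => 'a', 'å' => 'a', 'ö' => 'o'));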
?>
fcingolani
But in this case I have to list all those special characters - I want to avoid that.
Ivarska
Yep. In that case, I think @ircmaxell's answer fits better. :)
fcingolani
A: 

I'd say use a regular expression.

dave
Regular expressions are not the best option in this case, as this can be done much faster and more easily with `str_replace()` and similar functions.
Frxstrem
+7  A: 

You can use iconv for the string replacement...

$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);

Basically, it'll transliterate the characters it can, and drop those it can't (that are not in the ASCII character set)...

Then, just replace the spaces with str_replace:

$string = str_replace(' ', '-', $string);

Or, if you want to get fancy, you can replace all consecutive white-space characters with a single dash using a simple regex:

$string = preg_replace('/\\s+/', '-', $string);

Edit: As @Robert Ros points out, you may need to set the locale before calling iconv (depending on your system's defaults). Just execute this line before the iconv line:

setlocale(LC_CTYPE, 'en_US.UTF8');
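
Putting the steps above together, a minimal sketch might look like the following. The function name `slugify_iconv` is hypothetical, the locale name is assumed to be installed on the system, and the extra clean-up step (removing anything that is not a letter, digit, whitespace, or dash) is an addition to cope with iconv builds that emit stray marks such as `"` when transliterating.

```php
<?php
// Hypothetical helper name; the steps mirror the ones described above.
function slugify_iconv($text)
{
    // iconv transliteration depends on LC_CTYPE. The locale name is an
    // assumption and must be installed for this call to have any effect.
    setlocale(LC_CTYPE, 'en_US.UTF8');

    // Transliterate what can be transliterated and drop the rest.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    if ($ascii === false) {
        // Some iconv builds reject the conversion; fall back to the input.
        $ascii = $text;
    }

    $ascii = strtolower($ascii);
    // Remove anything that is not a letter, digit, whitespace, or dash.
    // This also removes stray marks that some iconv builds emit.
    $ascii = preg_replace('/[^a-z0-9\s-]/', '', $ascii);
    // Collapse runs of whitespace and dashes into a single dash.
    $ascii = preg_replace('/[\s-]+/', '-', trim($ascii));

    return trim($ascii, '-');
}

echo slugify_iconv('en räksmörgås'), "\n";
```

The exact result for accented input depends on the iconv implementation and on whether the locale is available, so it is worth checking the output on the target system.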
ircmaxell
+1 Also see: http://stackoverflow.com/questions/1284535/php-transliteration Btw, it's important that your locale is set correctly for iconv transliteration to work properly.
Robert Ros
@Robert Ros: Thanks, I've added that to the answer...
ircmaxell
Wonderful! But instead of 'ä' I get 'a"'. Not a big problem, I just have to run a preg_replace to remove everything but the letters. But is it meant to be so? I'm just curious.
Ivarska
@Robert Thanks, I always wondered why iconv transliteration sometimes worked and sometimes didn't.
Artefacto
Huh? You get a quote character? I tested it on my machine, and it worked fine (I got your expected output). Did you run `setlocale` first?
ircmaxell
Yeah, it's kinda weird. I'm running the setlocale function above the iconv line, but my output is: r"aksm"orgas
Ivarska
+2  A: 
// Map of accented characters to ASCII replacements.
// The source file must be saved as UTF-8 for the keys to match.
function diacritics() {
    return array(
        'À'=>'A','Á'=>'A','Â'=>'A','Ã'=>'A','Å'=>'A','Ä'=>'AE','Æ'=>'AE',
        'à'=>'a','á'=>'a','â'=>'a','ã'=>'a','å'=>'a','ä'=>'ae','æ'=>'ae',
        'Þ'=>'B','þ'=>'b','Č'=>'C','Ć'=>'C','Ç'=>'C','č'=>'c','ć'=>'c',
        'ç'=>'c','ð'=>'d','Đ'=>'Dj','đ'=>'dj','È'=>'E','É'=>'E','Ê'=>'E',
        'Ë'=>'E','è'=>'e','é'=>'e','ê'=>'e','ë'=>'e','Ì'=>'I','Í'=>'I',
        'Î'=>'I','Ï'=>'I','ì'=>'i','í'=>'i','î'=>'i','ï'=>'i','Ñ'=>'N',
        'ñ'=>'n','Ò'=>'O','Ó'=>'O','Ô'=>'O','Õ'=>'O','Ø'=>'O','Ö'=>'OE',
        'Œ'=>'OE','ò'=>'o','ó'=>'o','ô'=>'o','õ'=>'o','ö'=>'oe',
        'œ'=>'oe','ø'=>'o','Ŕ'=>'R','ŕ'=>'r','Š'=>'S','š'=>'s','ß'=>'ss',
        'Ù'=>'U','Ú'=>'U','Û'=>'U','Ü'=>'UE','ù'=>'u','ú'=>'u','û'=>'u',
        'ü'=>'ue','Ý'=>'Y','ý'=>'y','ÿ'=>'yu','Ž'=>'Z','ž'=>'z'
    );
}

// Replace diacritics, then turn every character outside the allowed set into a dash.
function slug($text) {
    return preg_replace(
        '/[^\w\.!~*\'"(),]/','-',
        trim(strtr($text,diacritics()))
    );
}
stillstanding
@ircmaxell: Nope. Those are the only valid characters specified in RFC 1738. Using \W will allow non-alphanumeric characters other than those in the regex.
stillstanding
Ahh, I see. The negation of `\w` matches those characters, but then because they are explicitly negated already, they don't get matched... My mistake...
ircmaxell
A: 

A clever hack often used for this is calling htmlentities, then running

preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '\1', $str);

to get rid of the diacritics. A more complete (but often unnecessarily complicated) solution is using a Unicode decomposition algorithm to split diacritics, then dropping everything that is not an ASCII letter or digit.
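
As a rough sketch of the decomposition approach, assuming the intl extension is installed (it provides the `Normalizer` class) and that the input is UTF-8: the function name `strip_diacritics_nfd` is hypothetical, and spaces are kept here so they can be turned into dashes afterwards.

```php
<?php
// Hypothetical helper; requires the intl extension for the Normalizer class.
function strip_diacritics_nfd($text)
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // NFD splits 'ä' into 'a' followed by a combining diaeresis (U+0308).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed, e.g. the input is not valid UTF-8
    }

    // Drop the combining marks (Unicode category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    // Drop everything that is not an ASCII letter, digit, or space.
    return preg_replace('/[^A-Za-z0-9 ]+/', '', $stripped);
}
```

Letters that have no decomposition, such as ø, æ or ß, are simply dropped by the last step, so a small replacement map may still be needed for them.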

Tgr