I am making a Swedish website, and the Swedish alphabet includes the letters å, ä, and ö.

I need to make a string entered by a user URL-safe with PHP.

Basically, I need to convert all characters to underscores EXCEPT these:

 A-Z, a-z, 1-9

and all Swedish letters should be converted like this:

'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).

The rest should become underscores as I said.

I'm not good at regular expressions, so I would appreciate the help, guys!

Thanks

NOTE: NOT URLENCODE... I need to store it in a database, etc., so urlencode won't work for me.

+2  A: 

If you're just interested in making things URL safe, then you want urlencode.

Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
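
To make the quoted description concrete, here is a small sketch of what the two functions do to a Swedish string. The sample text and variable names are only for illustration, and the input is assumed to be UTF-8:

```php
<?php
$name = 'räksmörgås och köttbullar'; // assumed to be UTF-8 encoded

$form = urlencode($name);    // spaces become "+"
$raw  = rawurlencode($name); // spaces become "%20"

echo $form, "\n"; // r%C3%A4ksm%C3%B6rg%C3%A5s+och+k%C3%B6ttbullar
echo $raw, "\n";  // r%C3%A4ksm%C3%B6rg%C3%A5s%20och%20k%C3%B6ttbullar
```

Both forms are reversible with urldecode() and rawurldecode(), which is the main difference from replacing characters with underscores.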

If you really want to strip all non A-Z, a-z, 1-9 (what's wrong with 0, by the way?), then you want:

$mynewstring = preg_replace('/[^A-Za-z1-9]/', '', $str);
Dominic Rodger
Sorry, forgot to mention I don't need urlencode.
Camran
If you want to make it safe, then you do want urlencode. The fact you want to store it in a database is beside the point (other than that you will want to escape it for your SQL insertion query in addition to making it URL-safe).
David Dorward
You just don't understand. He wants it to be safe to use as a URL, but not THAT safe. He would prefer it fails on a space or ampersand.
JohnFx
+4  A: 

and all Swedish letters should be converted like this:

'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).

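Use normalizer_normalize() with Normalizer::FORM_D to split each accented letter into its base letter plus a combining mark, then strip the marks (\p{Mn}). The default form (NFC) leaves the diacritics in place.

The rest should become underscores as I said.

Use preg_replace() with a pattern of \W (in other words, any character which isn't a letter, digit or underscore) to replace them with underscores.

Final result should look like:

$data = preg_replace('/\W/', '_', preg_replace('/\p{Mn}/u', '', normalizer_normalize($data, Normalizer::FORM_D)));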
BalusC
+1  A: 

One simple solution is to use str_replace function with search and replace letter arrays.
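
A minimal sketch of that idea, assuming the source file and the input are both UTF-8. The arrays and variable names are made up for illustration and cover only the Swedish letters:

```php
<?php
// Hypothetical lookup tables: each Swedish letter and its ASCII stand-in.
$search  = array('å', 'ä', 'ö', 'Å', 'Ä', 'Ö');
$replace = array('a', 'a', 'o', 'A', 'A', 'O');

$title = 'Östra Ågatan';
$ascii = str_replace($search, $replace, $title);

echo $ascii; // Ostra Agatan
```

Any other characters would still need a second pass (for example with preg_replace) to become underscores.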

Mihail Dimitrov
+6  A: 

// normalize data (decompose accented letters, then remove the accent marks)

$data = preg_replace('/\p{Mn}/u', '', normalizer_normalize($data, Normalizer::FORM_D));

// replace everything NOT in the sets you specified with an underscore

$data = preg_replace("#[^A-Za-z1-9]#","_", $data);

Nerdling
+1  A: 

You don't need fancy regexps to filter the Swedish chars; just use the strtr function to "translate" them, like:

$your_URL = "www.mäåö.com";
$good_URL = strtr($your_URL, "äåöë etc...", "aaoe etc...");
echo $good_URL;

->output: www.maao.com :)

danii
It is a maintenance nightmare to cover the thousands of such characters that exist in the world's languages.
BalusC
strtr with string arguments won't work if extended chars are multibyte-encoded (e.g. UTF-8)
stereofrog
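
For what it's worth, the multibyte problem only affects the three-argument string form of strtr. With a single array argument it replaces whole substrings, so UTF-8 letters are fine. A small sketch, assuming UTF-8 source and input, with an illustrative map that covers only the Swedish letters:

```php
<?php
// With an array argument, strtr() replaces whole substrings rather than
// single bytes, so multibyte UTF-8 letters are handled correctly.
$map = array('å' => 'a', 'ä' => 'a', 'ö' => 'o'); // illustrative map, extend as needed

$host  = 'www.mäåö.com';
$clean = strtr($host, $map);

echo $clean; // www.maao.com
```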
+1  A: 

as simple as

 $str = str_replace(array('å', 'ä', 'ö'), array('a', 'a', 'o'), $str); 
 $str = preg_replace('/[^a-z0-9]+/', '_', strtolower($str));

assuming you use the same encoding for your data and your code.

stereofrog
'/[^a-z0-9]+/i' or '/[^A-Za-z0-9]+/' to ignore case
Salman A
strtr is more convenient to "translate" sets of characters, like: $str = strtr($str,"aëïöü","aeiou"); it doesn't use arrays
danii
Arrays are cumbersome to maintain for the thousand or so characters with diacritical marks that exist in the world's languages. Just use `normalizer`.
BalusC
+1  A: 

Use iconv to convert strings from a given encoding to ASCII, then replace non-alphanumeric characters using preg_replace:

$input = 'räksmörgås och köttbullar'; // UTF-8 encoded
// Note: the //TRANSLIT output depends on the iconv implementation and the current locale.
$input = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
$input = preg_replace('/[^a-zA-Z0-9]/', '_', $input);
echo $input;

Result:

raksmorgas_och_kottbullar
Pär Wieslander
A: 
function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}

Somehow I can't use the normalizer_normalize() function so I use this one instead.
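
A rough sketch of how the same entity trick could be extended to produce the underscore form asked for in the question. The helper name slug_via_entities is made up for this example, and the input is assumed to be UTF-8:

```php
<?php
// Hypothetical helper, not part of the function above.
function slug_via_entities($string)
{
    // 1. Turn accented letters into named entities, e.g. "å" -> "&aring;".
    $entities = htmlentities($string, ENT_QUOTES, 'UTF-8');

    // 2. Keep only the base letter of each accent entity.
    $ascii = preg_replace(
        '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
        '$1',
        $entities
    );

    // 3. Decode the entities that are left over (such as "&amp;").
    $ascii = html_entity_decode($ascii, ENT_QUOTES, 'UTF-8');

    // 4. Collapse every run of other characters into a single underscore.
    return preg_replace('/[^A-Za-z0-9]+/', '_', $ascii);
}

echo slug_via_entities('räksmörgås & köttbullar'); // raksmorgas_kottbullar
```

Step 3 matters because htmlentities() also escapes characters such as & and quotes, and those entities would otherwise end up in the result as stray letters.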

Alix Axel