ansaurus

Question

PHP code to generate safe URL?

Answer 1

A:

Replace special characters with white spaces, then replace the white spaces with "-". str_replace?

bswietochowski 2010-10-21 06:57:00

Please explain how you define special characters.

fabrik 2010-10-21 07:03:15

Answer 2

+2 A:

You can use a simple regular expression for this purpose:

<?php
    function safeurl( $v )
    {
        $v = strtolower( $v );
        $v = preg_replace( "/[^a-z0-9]+/", "-", $v );
        $v = trim( $v, "-" );
        return $v;
    }
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Great Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Greatest !@#$ Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "  Funny title  " );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "!!Even Funnier title!!" );
?>

Salman A 2010-10-21 06:57:39

Sorry, Salman. I've tried your script with a Hungarian sentence which contains all of our vowels and it fails: http://ideone.com/WDcV8

fabrik 2010-10-21 06:59:52

@fabrik: no one said anything about hungarian. i'd -1 your comment if i could.

Mark 2010-10-21 07:01:58

Does the question mention hungarian?

Salman A 2010-10-21 07:03:26

From the question: "where the title can contain any character".

fabrik 2010-10-21 07:04:55

This fails for leading or trailing invalid characters except whitespace.

Gumbo 2010-10-21 07:30:38

Okie, I've fixed it now.

Salman A 2010-10-21 08:12:03

@Salman A: Now it’s exactly what I’ve suggested. Congratulations.

Gumbo 2010-10-21 08:14:02

**** Grins ****

Salman A 2010-10-21 08:16:39

Answer 3

A:

Basically I would create an array with the characters I don't want to use. Then, with a loop or a regex, I would replace each of them with an empty string using str_replace (or preg_replace in the regex case).

Finally, replace the white space with hyphens (str_replace).
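
A minimal sketch of that approach might look like the following. The function name `strip_and_hyphenate` and the list of unwanted characters are assumptions made for illustration only; a real list depends on what you consider invalid.

```php
<?php
// Hypothetical helper illustrating the blacklist approach described above.
function strip_and_hyphenate($title)
{
    // Characters we do not want in the URL (illustrative, not exhaustive).
    $unwanted = array('!', '@', '#', '$', '%', '&', '*', '?', "'", '"', ',', '.');

    // str_replace accepts an array of needles; each one is replaced by ''.
    $title = str_replace($unwanted, '', $title);

    // Collapse runs of whitespace into one space, then trim both ends.
    $title = trim(preg_replace('/\s+/', ' ', $title));

    // Finally substitute hyphens for the remaining spaces.
    return str_replace(' ', '-', strtolower($title));
}

echo strip_and_hyphenate('The Greatest !@#$ Book of PHP'), "\n"; // the-greatest-book-of-php
```

Note that a blacklist only removes what it lists, so any character missing from the array passes through unchanged.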

netadictos 2010-10-21 06:58:07

Answer 4

+1 A:

If you want to allow only letters, digits and underscore (usual word characters) you can do:

$str = strtolower(preg_replace(array('/\W/','/-+/','/^-|-$/'),array('-','-',''),$str));

It first replaces any non-word character (\W) with a -.
Next it replaces any run of consecutive - with a single -.
Finally it deletes any leading or trailing -.

Working link
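
The three steps described above can also be spelled out one call at a time, which may be easier to follow than the array form. This is only a sketch; the name `word_slug` is made up for the example.

```php
<?php
// Hypothetical step-by-step equivalent of the one-liner above.
function word_slug($str)
{
    $str = preg_replace('/\W/', '-', $str);    // 1. every non-word character becomes "-"
    $str = preg_replace('/-+/', '-', $str);    // 2. collapse runs of "-" into one
    $str = preg_replace('/^-|-$/', '', $str);  // 3. drop a leading or trailing "-"
    return strtolower($str);
}

echo word_slug('!!Even Funnier_title!!'), "\n"; // even-funnier_title
```

Because \W treats the underscore as a word character, underscores survive in the result.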

codaddict 2010-10-21 06:58:11

Your script fails too with accented vowels. http://www.ideone.com/QdAEm

fabrik 2010-10-21 07:04:02

Go ahead and downvote Gumbo too. I bet you're having a bad day.

Salman A 2010-10-21 07:06:08

@Salman: Please understand it's not an easy preg_replace: http://core.trac.wordpress.org/browser/tags/3.0.1/wp-includes/formatting.php

fabrik 2010-10-21 07:11:24

Answer 5

+5 A:

If “invalid” means non-alphanumeric, you can do this:

function foo($str) {
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($str)), '-');
}

This will turn $str into lowercase, replace any sequence of one or more non-alphanumeric characters by one hyphen, and then remove leading and trailing hyphens.

var_dump(foo("The Great Book of PHP") === 'the-great-book-of-php');
var_dump(foo("The Greatest !@#$ Book of PHP") === 'the-greatest-book-of-php');
var_dump(foo("Funny title     ") === 'funny-title');

Gumbo 2010-10-21 06:58:45

+1. i like to strip single quotes first though.

Mark 2010-10-21 07:05:43

Fails too. Sorry. Please read the question: "the title can contain any character"

fabrik 2010-10-21 07:07:10

@fabrik: So what’s wrong? Didn’t you test the examples? They all yield true.

Gumbo 2010-10-21 07:08:27

http://www.ideone.com/ZtzDl

fabrik 2010-10-21 07:12:14

@fabrik: “If ‘invalid’ means non-alphanumeric […]” – matt_tm didn’t say anything about what invalid means. I just assumed that he means non-alphanumeric.

Gumbo 2010-10-21 07:29:25

@Gumbo: Thank you for at least trying to understand what I'm talking about. It's not only Hungarian characters: take a book about Citroën and there you go, accented characters in an international brand's name. Yes, the OP didn't specify what is invalid and what is not, but he stated "the title can contain **any** character". (And because we're talking about books, there's a chance of accented characters.)

fabrik 2010-10-21 07:34:04

Hi - sorry to barge in on your conversation, and yes, non-English characters should be accounted for as well... It's not a hard requirement that the 'visible' title be absolutely the same as the actual title, but it MUST be a valid URL...

matt_tm 2010-10-22 13:47:47

Answer 6

A:

Use a regex replace to turn every run of non-letter characters into a hyphen. For example:

preg_replace('/[^a-zA-Z]+/', '-', $input)

WardB 2010-10-21 06:58:48

Answer 7

A:

<?php
$input = "  The Great Book's of PHP  ";
$output = trim(preg_replace(array("`'`", "`[^a-z]+`"),  array("", "-"), strtolower($input)), "-");
echo $output; // the-great-books-of-php

This trims trailing dashes and doesn't do things like "it's raining" -> "it-s-raining" as most solutions tend to do.

Mark 2010-10-21 06:58:50

And turning *it’s* into *its* is right?

Gumbo 2010-10-21 07:05:47

And turning it's into it-s is right?

Mez 2010-10-21 16:51:00

@Gumbo: I find it preferable. Easier to read, no? Otherwise you read it like "it ess raining" and that's just weird.

Mark 2010-10-21 17:16:32

“It’s” and “its” have a different meaning. The preferable variant would be to use its expanded (unambiguous) variant, so “it is” or “it has”.

Gumbo 2010-10-21 17:44:55

@Gumbo: It's a URL. It's supposed to be short and concise.. if anything I'd strip out words like "is" and "has" too. No one is going to be looking for grammatical errors in a URL. And if they can't figure out "its-raining" actually means "it is raining" because there's no apostrophe....then... they need to go back to school.

Mark 2010-10-21 19:09:41

@Mark: What about constructs with words that are ambiguous like `its-meaning`?

Gumbo 2010-10-21 19:22:59

@Gumbo: When do you ever say "it is meaning"? And who cares? They can visit the website and read the actual title on the actual page in all its unicode glory.

Mark 2010-10-21 19:55:07

Answer 8

A:

Sanitizing special characters is not an easy task, IMHO. Take a look at WordPress's awesome sanitize_title function, and also look at its source.

Update: Sorry guys, i should downvote every answer which isn't dealing with accented characters. Do you understand what "the title can contain any character" means?

Update 2: Go, guys, go! Please downvote me as many as you can!

Note: and please don't be surprised when you meet a special character. Just eliminate it with str_replace!

fabrik 2010-10-21 07:01:46

Is it legal to use those functions directly in our code?

matt_tm 2010-10-21 07:29:32

WordPress is released under the GPL: http://wordpress.org/about/gpl/ If you're not sure how you can use WP's sources, please take a look at http://stackoverflow.com/questions/2668854/php-sanitizing-strings-to-make-them-url-and-filename-safe There are some interesting approaches there that deal with **special** characters too. I suggested WordPress as the best option because it carefully deals with _any_ special character, even Arabic IIRC.

fabrik 2010-10-21 07:38:03

-1 For the same reason: Since you don’t know what the valid/invalid characters are, even pointing at WordPress’ `sanitize_title` is wrong as it also may behave falsely.

Gumbo 2010-10-21 07:55:54

Yeah... I don't think that will work for us... "2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: b. You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License."

matt_tm 2010-10-21 07:56:07

@matt_tm: Why don’t you just tell us what characters you are dealing with and what characters you assume to be invalid? That would make the whole thing more clear and you may even get satisfying answers.

Gumbo 2010-10-21 08:13:04

@Gumbo: I'm assuming the app will be a bookstore. Can you guess which characters will occur? There's a chance you cannot tell at this point. That's why I'm trying to suggest a solution that simply **works**.

fabrik 2010-10-21 08:17:36

@fabrik: You can always name at least the characters you *expect*. Even if it’s just “characters of the Unicode character set”. And from that character set you can also specify the characters you assume to be valid. That is the minimum information we need to have the chance to give an appropriate solution.

Gumbo 2010-10-21 08:28:38

@Gumbo: Refining my previous comment: you simply don't know what characters you should expect. Don't. Know. When your software is limiting your possibilities, something is simply wrong.

fabrik 2010-10-21 08:40:21

@fabrik: If you have some data and want to process it, you certainly need to know what the data means. And in case of a string, it’s the character encoding and underlying character set that tells you what the byte sequence means.

Gumbo 2010-10-21 08:45:39

@Gumbo: Assuming the site is in UTF-8 (there's a big chance) and given that Unicode contains about 109,000 characters (http://en.wikipedia.org/wiki/Unicode), basically you should take care of every single character.

fabrik 2010-10-21 08:57:15

@fabrik: And all those suggestions you voted down do not do that? What if all those characters of the Unicode character set that matt_tm assumes to be valid are only the alphabetic or alphanumeric characters? Then all these suggestions were correct. And if it’s not just the alphabetic/alphanumeric characters that are valid characters, then only the assumption of the set of valid characters was wrong. But that could be fixed easily.

Gumbo 2010-10-21 09:15:42

@Gumbo: No offense, but it seems you don't get the point yet. It cannot be fixed easily if it fails. All of the downvoted answers failed my simple test: every solution converted my words (Árvíztűrő tükörfúrógép) to rv-zt-r-t-k-rf-r-g-p instead of arvizturo-tukorfurogep. I think that's definitely not the result anybody wants.
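
The test described above can be reproduced with a short sketch. It uses the same ASCII-only pattern as the answers under discussion, wrapped in a function named `ascii_slug` purely for this example, and assumes the file is saved as UTF-8. Without the /u modifier the pattern works byte by byte, so every byte of an accented letter counts as a non-matching character and the whole letter collapses into a hyphen.

```php
<?php
// Same ASCII-only pattern as the answers under discussion (name is hypothetical).
function ascii_slug($str)
{
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($str)), '-');
}

// Assumes this source file is saved as UTF-8.
echo ascii_slug('Árvíztűrő tükörfúrógép'), "\n"; // rv-zt-r-t-k-rf-r-g-p
```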

fabrik 2010-10-21 09:27:41

@fabrik: No one said anything about transliteration. matt_tm just said that invalid characters should be replaced. And since he used hyphens to replace sequences of invalid characters (assuming `!`, `@`, `#`, `$`, and the space character are invalid), I conclude that if only alphabetic/alphanumeric characters are valid, `rv-zt-r-t-k-rf-r-g-p` could be a solution for the input `Árvíztűrő tükörfúrógép`. But again, as long as we don’t know what characters he assumes to be valid and how invalid characters should be replaced, this debate is rather meaningless.

Gumbo 2010-10-21 09:40:00

@Gumbo: +1 everything is a solution when we don't know what is the problem. Btw i like to solve problems not to hold over them.

fabrik 2010-10-21 09:56:33

@fabrik: Yes, there is no justified judging whether some suggestion is right or wrong until we know every detail of the problem.

Gumbo 2010-10-21 09:59:28

@fabrik - check out my answer :D

Mez 2010-10-21 13:41:41

Answer 9

+1 A:

This code comes from CodeIgniter's url helper. It should do the trick.

function url_title($str, $separator = 'dash', $lowercase = FALSE)
    {
        if ($separator == 'dash')
        {
            $search     = '_';
            $replace    = '-';
        }
        else
        {
            $search     = '-';
            $replace    = '_';
        }

        $trans = array(
                        '&\#\d+?;'              => '',
                        '&\S+?;'                => '',
                        '\s+'                   => $replace,
                        '[^a-z0-9\-\._]'        => '',
                        $replace.'+'            => $replace,
                        $replace.'$'            => $replace,
                        '^'.$replace            => $replace,
                        '\.+$'                  => ''
                      );

        $str = strip_tags($str);

        foreach ($trans as $key => $val)
        {
            $str = preg_replace("#".$key."#i", $val, $str);
        }

        if ($lowercase === TRUE)
        {
            $str = strtolower($str);
        }

        return trim(stripslashes($str));
    }

Anzeo 2010-10-21 07:04:27

Answer 10

+3 A:

Ah, slugification

function slugify($text)
{
    // Swap out Non "Letters" with a -
    $text = preg_replace('/[^\\pL\d]+/u', '-', $text); 

    // Trim out extra -'s
    $text = trim($text, '-');

    // Convert letters that we have left to the closest ASCII representation
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

    // Make text lowercase
    $text = strtolower($text);

    // Strip out anything we haven't been able to convert
    $text = preg_replace('/[^-\w]+/', '', $text);

    return $text;
}

This works fairly well: it first uses the Unicode properties of each character to determine whether it is a letter (\pL) or a digit (\d) and converts runs of anything else to -'s, then it trims them, transliterates to ASCII, lowercases, and finally strips out anything it could not convert. (Fabrik's test returns "arvizturo-tukorfurogep")

I also tend to add in a list of stop words - so that those are removed from the slug. "the" "of" "or" "a", etc (but don't do it on length, or you strip out stuff like "php")
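
A stop-word pass could be sketched roughly like this. The function name `remove_stop_words` and the word list are assumptions for illustration; the list would need to be tuned per language and per site.

```php
<?php
// Hypothetical helper; the default stop-word list is an illustrative assumption.
function remove_stop_words($slug, array $stopWords = array('the', 'of', 'or', 'a', 'an', 'and'))
{
    // Split the finished slug on hyphens and drop exact stop-word matches.
    $words = array_diff(explode('-', $slug), $stopWords);

    // If every word was a stop word, keep the original slug rather than return ''.
    return count($words) > 0 ? implode('-', $words) : $slug;
}

echo remove_stop_words('the-great-book-of-php'), "\n"; // great-book-php
```

Matching whole words against a list, rather than filtering by length, is what keeps short but meaningful words like "php" in the slug.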

Mez 2010-10-21 13:33:00

Simple yet brilliant! +++ ;) (Now wondering what's that hocus-pocus inside WP source :o)

fabrik 2010-10-21 14:07:47

The Unicode matching only works on PHP 5.1+ and iconv might not be installed on some servers - WordPress has to cater for everyone.
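
For servers like that, one possible approach is to probe for the features at runtime and fall back to a hand-written table. The sketch below is an assumption-laden illustration, not what WordPress does: `has_unicode_regex`, `to_ascii`, and the small character map are all made up for this example.

```php
<?php
// Hypothetical helpers; the names and the fallback table are assumptions.

// Returns true when PCRE accepts Unicode properties with the /u modifier.
function has_unicode_regex()
{
    return @preg_match('/\pL/u', 'a') === 1;
}

// Transliterates UTF-8 text to ASCII, using iconv only when it is available.
function to_ascii($text)
{
    if (function_exists('iconv')) {
        // The result of //TRANSLIT depends on the platform and locale.
        $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
        if ($converted !== false) {
            return $converted;
        }
    }

    // Deliberately tiny fallback table; extend it for the characters you expect.
    $map = array(
        'Á' => 'A', 'á' => 'a', 'É' => 'E', 'é' => 'e', 'Í' => 'I', 'í' => 'i',
        'Ó' => 'O', 'ó' => 'o', 'Ö' => 'O', 'ö' => 'o', 'Ő' => 'O', 'ő' => 'o',
        'Ú' => 'U', 'ú' => 'u', 'Ü' => 'U', 'ü' => 'u', 'Ű' => 'U', 'ű' => 'u',
    );
    return strtr($text, $map);
}

var_dump(has_unicode_regex());
echo to_ascii('plain title'), "\n";
```

A slug function could call `has_unicode_regex()` once and choose between a \pL-based pattern and an ASCII-only one accordingly.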

Mez 2010-10-21 17:02:50
