We need to generate a unique URL from the title of a book, where the title can contain any character. How can we search and replace all the 'invalid' characters so that a valid and neat-looking URL is generated?

For instance:

"The Great Book of PHP"

www.mysite.com/book/12345/the-great-book-of-php

"The Greatest !@#$ Book of PHP"

www.mysite.com/book/12345/the-greatest-book-of-php

"Funny title     "

www.mysite.com/book/12345/funny-title
A: 

Replace special characters with white space, and then replace the white space with "-". str_replace?

bswietochowski
Please explain how you define special characters.
fabrik
+2  A: 

You can use a simple regular expression for this purpose:

<?php
    function safeurl( $v )
    {
        $v = strtolower( $v );
        $v = preg_replace( "/[^a-z0-9]+/", "-", $v );
        $v = trim( $v, "-" );
        return $v;
    }
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Great Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Greatest !@#$ Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "  Funny title  " );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "!!Even Funnier title!!" );
?>
Salman A
Sorry, Salman. I've tried your script with a Hungarian sentence which contains all of our vowels, and it fails: http://ideone.com/WDcV8
fabrik
@fabrik: No one said anything about Hungarian. I'd -1 your comment if I could.
Mark
Does the question mention Hungarian?
Salman A
From the question: "where the title can contain any character".
fabrik
This fails for leading or trailing invalid characters except whitespace.
Gumbo
Okie, I've fixed it now.
Salman A
@Salman A: Now it’s exactly what I’ve suggested. Congratulations.
Gumbo
**** Grins ****
Salman A
A: 

Basically, I would create an array of the characters I don't want to use. Then, with a loop or a regex, replace each of them with an empty string (str_replace, or preg_replace in the case of a regex).

Finally, replace the white space with hyphens (str_replace).
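
The steps above can be sketched as follows. This is only a minimal illustration: the helper name `slug_from_blacklist` and the list of unwanted characters are assumptions made for the example, and the list is deliberately incomplete.

```php
<?php
// Hypothetical helper: the name and the character list are illustrative assumptions.
function slug_from_blacklist($title)
{
    // Characters to drop entirely; extend this list as needed.
    $unwanted = array('!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '?', ',', '.', "'", '"');

    // Remove every unwanted character.
    $title = str_replace($unwanted, '', $title);

    // Collapse runs of whitespace into a single space and trim both ends.
    $title = trim(preg_replace('/\s+/', ' ', $title));

    // Replace the remaining spaces with hyphens and lowercase the result.
    return strtolower(str_replace(' ', '-', $title));
}

echo slug_from_blacklist("The Greatest !@#$ Book of PHP"); // the-greatest-book-of-php
```

Note that a blacklist like this only catches the characters you thought of in advance; anything not in the list passes through unchanged.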

netadictos
+1  A: 

If you want to allow only letters, digits and underscore (usual word characters) you can do:

$str = strtolower(preg_replace(array('/\W/','/-+/','/^-|-$/'),array('-','-',''),$str));

It first replaces any non-word character(\W) with a -.
Next it replaces any consecutive - with a single -
Next it deletes any leading or trailing -.
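
For illustration, the same three replacements can be written out one at a time. This is just an unrolled sketch of the one-liner; the variable names `$step1` to `$step3` are assumptions made for the example.

```php
<?php
$str = "The Greatest !@#$ Book of PHP";

// Step 1: replace every non-word character (anything except letters, digits, underscore) with a hyphen.
$step1 = preg_replace('/\W/', '-', $str);

// Step 2: collapse each run of consecutive hyphens into a single hyphen.
$step2 = preg_replace('/-+/', '-', $step1);

// Step 3: delete a leading or trailing hyphen.
$step3 = preg_replace('/^-|-$/', '', $step2);

echo strtolower($step3); // the-greatest-book-of-php
```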


codaddict
Your script fails too with accented vowels. http://www.ideone.com/QdAEm
fabrik
Go ahead and downvote Gumbo too. I bet you're having a bad day.
Salman A
@Salman: Please understand it's not an easy preg_replace: http://core.trac.wordpress.org/browser/tags/3.0.1/wp-includes/formatting.php
fabrik
+5  A: 

If “invalid” means non-alphanumeric, you can do this:

function foo($str) {
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($str)), '-');
}

This will turn $str into lowercase, replace any sequence of one or more non-alphanumeric characters by one hyphen, and then remove leading and trailing hyphens.

var_dump(foo("The Great Book of PHP") === 'the-great-book-of-php');
var_dump(foo("The Greatest !@#$ Book of PHP") === 'the-greatest-book-of-php');
var_dump(foo("Funny title     ") === 'funny-title');
Gumbo
+1. i like to strip single quotes first though.
Mark
Fails too. Sorry. Please read the question: "the title can contain any character"
fabrik
@fabrik: So what’s wrong? Didn’t you test the examples? They all yield true.
Gumbo
http://www.ideone.com/ZtzDl
fabrik
@fabrik: “If ‘invalid’ means non-alphanumeric […]” – matt_tm didn’t say anything about what invalid means. I just assumed that he means non-alphanumeric.
Gumbo
@Gumbo: Thank you for at least trying to understand what I'm talking about. It's not only Hungarian characters: take a book about Citroën and there you go, accented characters in an international brand's name. Yes, the OP didn't specify what is invalid and what is not, but he stated "the title can contain **any** character". (And, because we are talking about books, there's a chance of accented characters.)
fabrik
Hi - sorry to barge in on your conversation, and yes, non-English characters should be accounted for as well... It's not a terrible requirement that the 'visible' title be absolutely the same as the actual title, but it MUST be a valid URL...
matt_tm
A: 

Use a regex replace to turn every run of non-letter characters into a hyphen. For example:

preg_replace('/[^a-zA-Z]+/', '-', $input)

WardB
A: 
<?php
$input = "  The Great Book's of PHP  ";
$output = trim(preg_replace(array("`'`", "`[^a-z]+`"),  array("", "-"), strtolower($input)), "-");
echo $output; // the-great-books-of-php

This trims leading and trailing dashes and doesn't do things like "it's raining" -> "it-s-raining" as most solutions tend to do.

Mark
And turning *it’s* into *its* is right?
Gumbo
And turning it's into it-s is right?
Mez
@Gumbo: I find it preferable. Easier to read, no? Otherwise you read it like "it ess raining" and that's just weird.
Mark
“It’s” and “its” have a different meaning. The preferable variant would be to use its expanded (unambiguous) variant, so “it is” or “it has”.
Gumbo
@Gumbo: It's a URL. It's supposed to be short and concise.. if anything I'd strip out words like "is" and "has" too. No one is going to be looking for grammatical errors in a URL. And if they can't figure out "its-raining" actually means "it is raining" because there's no apostrophe....then... they need to go back to school.
Mark
@Mark: What about constructs with words that are ambiguous like `its-meaning`?
Gumbo
@Gumbo: When do you ever say "it is meaning"? And who cares? They can visit the website and read the actual title on the actual page in all its unicode glory.
Mark
A: 

Sanitizing special characters is not an easy task, IMHO. Take a look at WordPress' awesome sanitize_title function, and also look at its source.

Update: Sorry guys, I should downvote every answer which isn't dealing with accented characters. Do you understand what "the title can contain any character" means?

Update 2: Go, guys, go! Please downvote me as many times as you can!

Note: and please don't get surprised when you meet a special character. Just eliminate it with str_replace!

fabrik
Is it legal to use those functions directly in our code?
matt_tm
WordPress is released under the GPL: http://wordpress.org/about/gpl/ If you're not sure how you can use WP's sources, please take a look at http://stackoverflow.com/questions/2668854/php-sanitizing-strings-to-make-them-url-and-filename-safe where there are some interesting approaches that deal with **special** characters too. I suggested WordPress as the best option because it carefully deals with _any_ special character, even Arabic IIRC.
fabrik
-1 For the same reason: Since you don’t know what the valid/invalid characters are, even pointing at WordPress’ `sanitize_title` is wrong as it also may behave incorrectly.
Gumbo
Yeah... I don't think that will work for us... "2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: [...] b. You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License."
matt_tm
@matt_tm: Why don’t you just tell us what characters you are dealing with and what characters you assume to be invalid? That would make the whole thing more clear and you may even get satisfying answers.
Gumbo
@Gumbo: I'm assuming the app will be a bookstore. Can you guess what characters will occur? There's a chance you cannot tell at this point. That's why I'm trying to suggest a solution that simply **works**.
fabrik
@fabrik: You can always name at least the characters you *expect*. Even if it’s just “characters of the Unicode character set”. And from that character set you can also specify the characters you assume to be valid. That is the minimum information we need to have the chance to give an appropriate solution.
Gumbo
@Gumbo: Refining my previous comment: You simply don't know what characters you should expect. Don't. Know. When your software is limiting your possibilities, there's something simply wrong.
fabrik
@fabrik: If you have some data and want to process it, you certainly need know what the data means. And in case of a string, it’s the character encoding and underlying character set that tells you what the byte sequence means.
Gumbo
@Gumbo: Assuming the site is in UTF-8 (there's a big chance) and given the fact that Unicode can contain 109,000 characters (http://en.wikipedia.org/wiki/Unicode), basically you should take care of every single character.
fabrik
@fabrik: And all those suggestions you voted down do not do that? What if all those characters of the Unicode character set that matt_tm assumes to be valid are only the alphabetic or alphanumeric characters? Then all these suggestions were correct. And if it’s not just the alphabetic/alphanumeric characters that are valid characters, then only the assumption of the set of valid characters was wrong. But that could be fixed easily.
Gumbo
@Gumbo: No offense, but it seems you don't get the point yet. It cannot be fixed easily if it fails. All of the downvoted answers failed my simple test: every solution converted my words (Árvíztűrő tükörfúrógép) to rv-zt-r-t-k-rf-r-g-p instead of arvizturo-tukorfurogep. I think that's definitely not the result anybody wants.
fabrik
@fabrik: No one said anything about transliteration. matt_tm just said that invalid characters should be replaced. And since he used hyphens to replace sequences of invalid characters (assuming `!`, `@`, `#`, `$`, and the space character are invalid), I conclude that if only alphabetic/alphanumeric characters are valid, `rv-zt-r-t-k-rf-r-g-p` could be a solution for the input `Árvíztűrő tükörfúrógép`. But again, as long as we don’t know what characters he assumes to be valid and how invalid characters should be replaced, this debate is rather meaningless.
Gumbo
@Gumbo: +1, everything is a solution when we don't know what the problem is. Btw, I like to solve problems, not to dwell on them.
fabrik
@fabrik: Yes, there is no justified judging whether some suggestion is right or wrong until we know every detail of the problem.
Gumbo
@fabrik - check out my answer :D
Mez
+1  A: 

This code comes from CodeIgniter's url helper. It should do the trick.

function url_title($str, $separator = 'dash', $lowercase = FALSE)
    {
        if ($separator == 'dash')
        {
            $search     = '_';
            $replace    = '-';
        }
        else
        {
            $search     = '-';
            $replace    = '_';
        }

        $trans = array(
                        '&\#\d+?;'              => '',
                        '&\S+?;'                => '',
                        '\s+'                   => $replace,
                        '[^a-z0-9\-\._]'        => '',
                        $replace.'+'            => $replace,
                        $replace.'$'            => $replace,
                        '^'.$replace            => $replace,
                        '\.+$'                  => ''
                      );

        $str = strip_tags($str);

        foreach ($trans as $key => $val)
        {
            $str = preg_replace("#".$key."#i", $val, $str);
        }

        if ($lowercase === TRUE)
        {
            $str = strtolower($str);
        }

        return trim(stripslashes($str));
    }
Anzeo
+3  A: 

Ah, slugification

function slugify($text)
{
    // Swap out Non "Letters" with a -
    $text = preg_replace('/[^\\pL\d]+/u', '-', $text); 

    // Trim out extra -'s
    $text = trim($text, '-');

    // Convert letters that we have left to the closest ASCII representation
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

    // Make text lowercase
    $text = strtolower($text);

    // Strip out anything we haven't been able to convert
    $text = preg_replace('/[^-\w]+/', '', $text);

    return $text;
}

This works fairly well: it first uses the Unicode properties of each character to determine whether it's a letter (\pL) or a digit (\d), and converts everything else to -'s. It then trims the hyphens, transliterates to ASCII, lowercases the text, and strips anything that could not be converted. (Fabrik's test returns "arvizturo-tukorfurogep")

I also tend to add in a list of stop words - so that those are removed from the slug. "the" "of" "or" "a", etc (but don't do it on length, or you strip out stuff like "php")
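
A rough sketch of that stop-word step, applied to a slug that has already been generated. The helper name `remove_stop_words` and the default word list are assumptions made for this example, not part of the function above.

```php
<?php
// Hypothetical helper: the name and the default stop-word list are illustrative assumptions.
function remove_stop_words($slug, array $stopWords = array('the', 'of', 'or', 'a'))
{
    // Split the slug into words, dropping empty pieces.
    $words = array_filter(explode('-', $slug), 'strlen');

    // Keep only the words that are not in the stop list.
    // An explicit list is used instead of a length test, so short words like "php" survive.
    $kept = array_diff($words, $stopWords);

    // If every word was a stop word, fall back to the original slug instead of returning an empty string.
    return $kept ? implode('-', $kept) : $slug;
}

echo remove_stop_words('the-great-book-of-php'); // great-book-php
```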

Mez
Simple yet brilliant! +++ ;) (Now wondering what's that hocus-pocus inside WP source :o)
fabrik
The Unicode matching only works on PHP 5.1+ and iconv might not be installed on some servers - they have to cater for everyone.
Mez