I'm trying to write a simple PHP function that can take a string like

Topic: Some stuff, Maybe some more, it's my stuff?

and return

topic-some-stuff-maybe-some-more-its-my-stuff

As such:

  • lowercase
  • remove all non-alphanumeric non-space characters
  • replace all spaces (or groups of spaces) with hyphens

Can I do this with a single regex?

+1  A: 

Why are regular expressions considered the universal panacea for all of life's problems (just because a lowly backtrack in a preg_match has discovered the cure for cancer)? Here's a solution without recourse to regexp:

$str = "Topic: Some stuff, Maybe some more, it's my stuff?";
$str = implode('-',str_word_count(strtolower($str),2));
echo $str;

Without going the whole UTF-8 route:

$str = "Topic: Some stuff, Maybe some more, it's my Iñtërnâtiônàlizætiøn stuff?";
$str = implode('-',str_word_count(strtolower(str_replace("'","",$str)),2,'Þßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'));
echo $str;

gives

topic-some-stuff-maybe-some-more-its-my-iñtërnâtiônàlizætiøn-stuff
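
The two snippets above can be folded into one helper. This is only a minimal sketch, assuming plain ASCII input; the function name words_slug is made up for illustration. It also passes the digits as extra word characters, because str_word_count() only treats letters, apostrophes and hyphens as part of a word by default, so numbers would otherwise be dropped:

```php
<?php
// Hypothetical helper name; assumes plain ASCII input.
function words_slug(string $str): string
{
    // Drop apostrophes first so "it's" becomes "its" instead of staying "it's".
    $str = str_replace("'", '', strtolower($str));

    // Format 2 returns the words keyed by their position in the string;
    // the third argument adds the digits to the set of word characters.
    return implode('-', str_word_count($str, 2, '0123456789'));
}

echo words_slug("Topic: Some stuff, Maybe some more, it's my stuff?"), "\n";
// topic-some-stuff-maybe-some-more-its-my-stuff
```

Hyphens that already sit inside a word are kept, since str_word_count() counts them as word characters.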

Mark Baker
could run iconv on that in case it contains foreign language characters
Gordon
@Mark: People love regular expressions because one feels smart to be able to write stuff that makes even Perl seem readable. =)
Jens
@Jens Except many of the people who think like that ask somebody here to actually write the regexp for them
Mark Baker
@Jens I don't think that's the case. People write regular expressions because they require less memory than all of PHP's string functions and their argument order.
Artefacto
@Mark - Thank you. I just figured regex was the way to go because all I hear is regex this and regex that for non-trivial string manipulation. I am much happier with your answer
Mala
`$str = 'Iñtërnâtiônàlizætiøn';` Output = `i�-t�-rn�-ti�-n�-liz�-ti�-n`.
Alix Axel
By the way, you're not replacing the single quote: `topic-some-stuff-maybe-some-more-it's-my-stuff`.
Alix Axel
@Mark: Your last update is not suitable for URLs, `Iñtërnâtiônàlizætiøn` should become `internationalization` or `internationalizaetion`.
Alix Axel
@Mark Worse than that, PHP's string functions (almost all) are not suitable for UTF-8 data, e.g. `strtolower` could corrupt the string.
Artefacto
Yes, it gets rather kludgy when trying to internationalise it. That's when Alix Axel's slug method comes into its own.
Mark Baker
+2  A: 

You can do it with one preg_replace:

preg_replace(array("/[A-Z]/e", "/\\p{P}/", "/\\s+/"),
    array('strtolower("$0")', '', '-'), $str);

Technically, you could do it with one regex, but this is simpler.

Preemptive response: yes, it unnecessarily uses regular expressions (though very simple ones), makes an unnecessarily large number of calls to strtolower, and doesn't consider non-English characters (the OP doesn't even specify an encoding); I'm just satisfying the OP's requirements.
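
Note that the /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so the call only works as written on older versions. A minimal sketch of the same three steps for current PHP, assuming ASCII input; the function name regex_slug is made up for illustration:

```php
<?php
// Hypothetical helper name; same three steps as the preg_replace() call above.
function regex_slug(string $str): string
{
    // Step 1: lowercase A-Z with a callback instead of the removed /e modifier.
    $str = preg_replace_callback('/[A-Z]/', function (array $m): string {
        return strtolower($m[0]);
    }, $str);

    // Steps 2 and 3: drop punctuation, then turn each run of whitespace into one hyphen.
    return preg_replace(array('/\p{P}/', '/\s+/'), array('', '-'), $str);
}

echo regex_slug("Topic: Some stuff, Maybe some more, it's my stuff?"), "\n";
// topic-some-stuff-maybe-some-more-its-my-stuff
```

\p{P} only covers punctuation, so symbols such as $ or + would survive; a class like [^a-z0-9\s] applied after lowercasing would be stricter.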

Artefacto
+1 for actually answering my question, but I ended up going with the non-regex solution
Mala
@Mala Good call. Both solutions are readable (well, once you know `str_word_count` doesn't actually give a word count in this case), but Mark's is probably much more efficient.
Artefacto
Quite nice indeed.
Alix Axel
+2  A: 

Many frameworks provide functions for this:

CodeIgniter: http://bitbucket.org/ellislab/codeigniter/src/c39315f13a76/system/helpers/url_helper.php#cl-472

WordPress (has many more in the code): http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L814

DrColossos
Thanks! I was using CodeIgniter, so this helps a lot. For reference to everyone else: http://codeigniter.com/user_guide/helpers/url_helper.html
Mala
+3  A: 
function Slug($string)
{
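    // Encode accented characters as named entities (e.g. ñ becomes &ntilde;).
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    // Keep only the base letter(s) of each accent entity (&ntilde; gives n, &aelig; gives ae).
    $string = preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    // Decode the remaining entities, then collapse every non-alphanumeric run into one hyphen.
    $string = preg_replace('~[^0-9a-z]+~i', '-', html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
    return strtolower(trim($string, '-'));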
}

$topic = 'Iñtërnâtiônàlizætiøn';
echo Slug($topic); // internationalizaetion

$topic = 'Topic: Some stuff, Maybe some more, it\'s my stuff?';
echo Slug($topic); // topic-some-stuff-maybe-some-more-it-s-my-stuff

$topic = 'here عربي‎ Arabi';
echo Slug($topic); // here-arabi

$topic = 'here 日本語 Japanese';
echo Slug($topic); // here-japanese
Alix Axel
If you go with internationalization, why not go all the way and also consider e.g. Arabic characters? +1 for cleverness, though.
Artefacto
@Artefacto: I want to, but I've no knowledge of the language. I don't even know if those chars can be *romanized*.
Alix Axel
@Alix Well, good point, but I was thinking more of the case where only a few characters are Arabic (so they could be discarded).
Artefacto
@Artefacto: Already did that (I had posted an outdated version). =)
Alix Axel