



I'm trying to write a simple PHP function that can take a string like

Topic: Some stuff, Maybe some more, it's my stuff?

and return

topic-some-stuff-maybe-some-more-its-my-stuff

As such:

  • lowercase
  • remove all non-alphanumeric non-space characters
  • replace all spaces (or groups of spaces) with hyphens

Can I do this with a single regex?

Why are regular expressions considered the universal panacea for all of life's problems (as if a lowly backtrace in a preg_match had discovered the cure for cancer)? Here's a solution without recourse to regexp:

$str = "Topic: Some stuff, Maybe some more, it's my stuff?";
$str = implode('-',str_word_count(strtolower($str),2));
echo $str;
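
For anyone puzzled by that call: with its second argument set to 2, `str_word_count` does not return a count. It returns an array of the words it found, keyed by each word's byte offset in the input. A minimal sketch of what that looks like (the sample string and variable name are illustrative only):

```php
<?php
// With format 2, str_word_count() returns the words themselves,
// keyed by their starting byte offset in the input string.
$words = str_word_count(strtolower("Some stuff, it's mine"), 2);

print_r($words); // keys are 0, 5, 12, 17

// Apostrophes and hyphens inside a word count as part of the word by default,
// which is why "it's" comes through with its quote intact.
echo implode('-', $words), "\n"; // some-stuff-it's-mine
```

Stripping the apostrophe before the call, as the second snippet in this answer does with `str_replace`, gets rid of the stray quote.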

Without going the whole UTF-8 route:

$str = "Topic: Some stuff, Maybe some more, it's my Iñtërnâtiônàlizætiøn stuff?";
$str = implode('-',str_word_count(strtolower(str_replace("'","",$str)),2,'Þßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'));
echo $str;



Mark Baker
You could run iconv on that in case it contains foreign-language characters.
@Mark: People love regular expressions because one feels smart being able to write stuff that makes even Perl seem readable. =)
@Jens Except many of the people who think like that ask somebody here to actually write the regexp for them
Mark Baker
@Jens I don't think that's the case. People write regular expressions because remembering them takes less memory than remembering all of PHP's string functions and their argument order.
@Mark - Thank you. I just figured regex was the way to go because all I hear is regex this and regex that for non-trivial string manipulation. I am much happier with your answer
`$str = 'Iñtërnâtiônàlizætiøn';` Output = `i�-t�-rn�-ti�-n�-liz�-ti�-n`.
Alix Axel
By the way, you're not replacing the single quote: `topic-some-stuff-maybe-some-more-it's-my-stuff`.
Alix Axel
@Mark: Your last update is not suitable for URLs, `Iñtërnâtiônàlizætiøn` should become `internationalization` or `internationalizaetion`.
Alix Axel
@Mark Worse than that, PHP's string functions (almost all) are not suitable for UTF-8 data, e.g. `strtolower` could corrupt the string.
Yes, it gets rather kludgy when trying to internationalise it. That's when Alix Axel's slug method comes into its own.
Mark Baker
You can do it with one preg_replace:

preg_replace(array("/[A-Z]/e", "/\\p{P}/", "/\\s+/"),
    array('strtolower("$0")', '', '-'), $str);

Technically, you could do it with one regex, but this is simpler.
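
Note that the `/e` modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so that call no longer runs on current PHP. Below is a minimal sketch of an equivalent that avoids it, assuming ASCII input. `slugify_ascii` is a hypothetical name, and the character class `[^a-z0-9\s]` is a substitution for `\p{P}` chosen to match the question's "non-alphanumeric, non-space" wording:

```php
<?php
// Hypothetical helper; not part of the original answer.
function slugify_ascii(string $str): string
{
    // Lowercase first, so no callback (and no /e modifier) is needed.
    $str = strtolower($str);

    // Still a single preg_replace call; the patterns are applied in order:
    //   1. drop anything that is not a lowercase letter, a digit, or whitespace
    //   2. collapse each run of whitespace into one hyphen
    return preg_replace(array('/[^a-z0-9\s]+/', '/\s+/'), array('', '-'), $str);
}

echo slugify_ascii("Topic: Some stuff, Maybe some more, it's my stuff?"), "\n";
// topic-some-stuff-maybe-some-more-its-my-stuff
```

Leading or trailing whitespace in the input would turn into a leading or trailing hyphen here, so wrap the input in `trim()` if that matters.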

Preemptive response: yes, it unnecessarily uses regular expressions (though very simple ones), makes an unnecessarily large number of calls to strtolower, and doesn't consider non-English characters (he doesn't even give an encoding); I'm just satisfying the OP's requirements.

+1 for actually answering my question, but I ended up going with the non-regex solution
@Mala Good call, though both solutions are readable (well, once you know `str_word_count` doesn't actually give a word count in this case), Mark's is probably much more efficient.
Quite nice indeed.
Alix Axel
Many frameworks provide functions for this


WordPress (has many more in the code):

Thanks! I was using CodeIgniter, so this helps a lot. For reference to everyone else:
function Slug($string)
{
    return strtolower(trim(preg_replace('~[^0-9a-z]+~i', '-', html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8')), ENT_QUOTES, 'UTF-8')), '-'));
}

$topic = 'Iñtërnâtiônàlizætiøn';
echo Slug($topic); // internationalizaetion

$topic = 'Topic: Some stuff, Maybe some more, it\'s my stuff?';
echo Slug($topic); // topic-some-stuff-maybe-some-more-it-s-my-stuff

$topic = 'here عربي‎ Arabi';
echo Slug($topic); // here-arabi

$topic = 'here 日本語 Japanese';
echo Slug($topic); // here-japanese
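
The one-liner is dense, so here is the same pipeline unpacked into named steps. This is a sketch that should behave the same as `Slug()` above; the function name `slug_steps` and the intermediate variable names are illustrative assumptions, and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical name; same steps as the one-line Slug(), spelled out.
function slug_steps(string $string): string
{
    // 1. Encode accented characters as named entities, e.g. "ñ" becomes "&ntilde;".
    $encoded = htmlentities($string, ENT_QUOTES, 'UTF-8');

    // 2. Keep only the base letter(s) of those entities:
    //    "&ntilde;" -> "n", "&aelig;" -> "ae".
    $base = preg_replace(
        '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
        '$1',
        $encoded
    );

    // 3. Decode whatever entities remain (quotes, ampersands, and so on).
    $decoded = html_entity_decode($base, ENT_QUOTES, 'UTF-8');

    // 4. Replace every run of characters outside 0-9 and a-z with one hyphen.
    //    This is also where Arabic, Japanese, etc. characters get discarded.
    $hyphenated = preg_replace('~[^0-9a-z]+~i', '-', $decoded);

    // 5. Trim stray hyphens from both ends, then lowercase.
    return strtolower(trim($hyphenated, '-'));
}

echo slug_steps("it's my stuff?"), "\n"; // it-s-my-stuff
```

The apostrophe survives steps 1 to 3 as a plain quote and only becomes a hyphen in step 4, which is why the result is `it-s` and not `its`.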
Alix Axel
If you go with internationalization, why not go all the way and also consider e.g. Arabic characters? +1 for cleverness, though.
@Artefacto: I want to, but I've no knowledge of the language. I don't even know if those chars can be *romanized*.
Alix Axel
@Alix Well, good point, but I was thinking more of the case where only a few characters are Arabic (so they could be discarded).
@Artefacto: Already did that (I had posted an outdated version). =)
Alix Axel