I've delved into regular expressions for one of the first times in order to parse a URL. Without going into too much depth, I basically want friendly URLs, and I'm saving each permalink in the database. Because the same permalink can appear under different languages and page numbers, I only want to save one permalink and parse the page and language out of the URL. So if I'm getting something like this:

http://domain.com/lang/fr/category/9/category_title/page/3.html

All I really want is this bit, "category/9/category_title", to know what page I'm on. I've come up with this function:

$return = array();

$string = 'http://domain.com/lang/fr/category/9/category_title/page/3.html';

//Remove domain and http
$string = preg_replace('@^(?:http://)?([^/]+)@i','',$string);

if(preg_match('/^\/lang\/([a-z]{2})/',$string,$langMatches)) {
 $return['lang'] = $langMatches[1];
 //Remove lang
 $string = preg_replace('/^\/lang\/[a-z]{2}/','',$string);
} else {
 $return['lang'] = 'en';
}

//Get extension
$bits = explode(".", strtolower($string));
$return['extension'] = end($bits);

//Remove extension
$string = preg_replace('/\.[^.]+$/','',$string);

if(preg_match('/page\/([0-9]+)$/',$string,$pageMatches)) {
 $return['page'] = $pageMatches[1];
 //Remove page
 $string = preg_replace('/page\/[0-9]+$/','',$string);
} else {
 $return['page'] = 1;
}

//Remove additional slashes from beginning and end
$string = preg_replace('#^(/?)|(/?)$#', '', $string);

$return['permalink'] = $string;

print_r($return);

Which returns this from the above example:

Array
(
    [lang] => fr
    [extension] => html
    [page] => 3
    [permalink] => category/9/category_title
)

This is perfect and exactly what I want. However my question is, have I gone about using regular expressions correctly? Is there a better way I could do this, for instance could I strip the domain, the extension and the additional slashes at the beginning and end with just one kick ass expression?

+1  A: 

You should use parse_url to split the URL into its components. And when having the URL path, you can use explode to split the path into its segments, array_slice to get specific segments and pathinfo to get the extension.
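As a rough sketch of that approach, the steps could be combined as follows. The function name `parsePermalinkUrl` and the shape of the returned array are assumptions chosen to mirror the question's output, and the handling of the `lang/xx` and `page/N` segments is one possible reading of the URL scheme rather than a definitive implementation.

```php
<?php
// Hypothetical helper: the name and return shape are illustrative assumptions.
function parsePermalinkUrl(string $url): array
{
    // Defaults match the question: English, first page.
    $result = ['lang' => 'en', 'extension' => '', 'page' => 1, 'permalink' => ''];

    // parse_url() isolates the path, so scheme and host need no regex.
    $path = (string) parse_url($url, PHP_URL_PATH);

    // pathinfo() reports the extension of the last path segment, if any.
    $extension = pathinfo($path, PATHINFO_EXTENSION);
    if ($extension !== '') {
        $result['extension'] = strtolower($extension);
        // Drop ".ext" from the end of the path.
        $path = substr($path, 0, -(strlen($extension) + 1));
    }

    // Split into segments and discard the empty ones caused by extra slashes.
    $segments = array_values(array_filter(explode('/', $path), fn($s) => $s !== ''));

    // Optional leading "lang/xx" pair (assumed: two lowercase letters).
    if (count($segments) >= 2 && $segments[0] === 'lang'
        && strlen($segments[1]) === 2 && ctype_lower($segments[1])) {
        $result['lang'] = $segments[1];
        $segments = array_slice($segments, 2);
    }

    // Optional trailing "page/N" pair.
    $n = count($segments);
    if ($n >= 2 && $segments[$n - 2] === 'page' && ctype_digit($segments[$n - 1])) {
        $result['page'] = (int) $segments[$n - 1];
        $segments = array_slice($segments, 0, -2);
    }

    // Whatever remains is the permalink.
    $result['permalink'] = implode('/', $segments);
    return $result;
}

print_r(parsePermalinkUrl('http://domain.com/lang/fr/category/9/category_title/page/3.html'));
```

For the example URL this should give lang `fr`, extension `html`, page `3` and permalink `category/9/category_title`, the same values the question's version produces.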

Gumbo
Indeed, with the possible addition of an `explode('/',$pathstring)` to easily get to the right path-segments.
Wrikken
Would this be less resource intensive than regular expressions?
Rob
@Rob: I don’t have any information about that. But it is probably more comprehensive, less error-prone and more flexible than using regular expressions.
Gumbo
A: 

PHP has the parse_url function.

This method is highly recommended, especially over regular expressions.

injekt
A: 

The expression below is, hopefully, programming-language agnostic.

^.*?\.[^/]+/[^/]+/([^/]+)/([^/]+/[^/]+/[^/]+)/.*?(\d+)\.(\w+).*$

Let me explain what this does.

I consume the whole line (anchored by ^ and $) and work initially toward the last '.' character in the domain. From there I consume the last element of the domain and the first path element, together with the '/' separator characters that follow each element. Then I use capturing groups to grab the language field and the next three path segments. After that I discard everything up to the start of the file name and use two more groups to capture the file name and the extension, discarding any trailing whitespace up to the end of the line.

A word of caution: I have done minimal testing of the expression above, but I believe it can handle most URLs composed of characters in the ASCII range. It is also very specific to the structure of the URL and won't handle URLs that span more than one line.
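As an illustration, a pattern of this shape could be applied in PHP roughly as follows. This is a sketch under some assumptions: the escapes are written with single backslashes, `#` is used as the delimiter so that `/` needs no escaping, and a lazy `.*?` precedes the file name so that a multi-digit page number is captured whole. The `$parts` array and its keys are hypothetical names chosen to mirror the question's output.

```php
<?php
// Assumed variant of the pattern: '#' delimiters, single-backslash escapes,
// and a lazy '.*?' before the file name (see the note above).
$pattern = '#^.*?\.[^/]+/[^/]+/([^/]+)/([^/]+/[^/]+/[^/]+)/.*?(\d+)\.(\w+).*$#';
$url = 'http://domain.com/lang/fr/category/9/category_title/page/13.html';

$parts = null;
if (preg_match($pattern, $url, $m)) {
    // Groups: 1 = language, 2 = three-segment permalink, 3 = file name, 4 = extension.
    $parts = [
        'lang'      => $m[1],
        'permalink' => $m[2],
        'page'      => (int) $m[3],
        'extension' => $m[4],
    ];
    print_r($parts);
} else {
    // The pattern is tied to this exact URL layout, so a mismatch is expected for other shapes.
    echo "URL does not have the expected structure\n";
}
```

Because the pattern requires the `lang/xx` prefix and exactly three permalink segments, URLs without a language segment will not match, so a fallback such as the default `en` from the question would still need separate handling.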

Don Mackenzie