I have been trying to find an effective URL parser; PHP's own parser does not return the subdomain or extension. On php.net a number of users contributed to this function:

function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/]*/(?P<file>\w+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";                                                // Delimiters

    preg_match ( $r, $url, $out );

    return $out;
}

Unfortunately it fails on paths containing a '-', and I can't for the life of me work out how to amend it to accept '-' in the path name.

Thanks

A: 

It's much easier to use the existing parse_url function and then extract the subdomain from the 'host' index.

Example:

$url = 'http://username:[email protected]/path?arg=value#anchor';
$urlInfo = parse_url($url);
$host = $urlInfo['host'];
$subdomain = substr($host, 0, strpos($host, '.'));
$tld = substr($host, strrpos($host, '.') + 1);
Marko
Can you suggest how I might go about that?
Mark
@Mark - to do what? What are you trying to achieve?
Dominic Rodger
I am looking to get the subdomain, tld, domain, path and arguments of a url. parse_url does not allow for subdomain or tld.
Mark
I've added an example that shows how to get the subdomain and the TLD from the host. It's simple string manipulation.
Marko
That will fail for TLDs such as .co.uk.
Galen
Thanks Marko, but I have already considered something like that. The problem is that it won't work for domains whose TLDs contain multiple dots, like .co.uk.
Mark
Well, technically .co.uk is not a TLD; only .uk is. If I understand it right, you will have to keep a manual list for those cases anyway, since there are lots of other "quasi-TLDs" like .co.at, .co.in and so on.
Pekka
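The manual-list approach raised in these comments could look roughly like the sketch below. The helper name splitHost and the three-entry suffix list are illustrative assumptions, not part of PHP; a usable list would have to be far longer (the Public Suffix List is one maintained source). The example URL is also made up for illustration.

```php
<?php
// Hypothetical helper: split a host name into subdomain, domain and suffix.
// $multiPartSuffixes is a tiny hand-maintained sample, not a complete list.
function splitHost(string $host, array $multiPartSuffixes = ['co.uk', 'co.at', 'co.in']): array
{
    $labels = explode('.', strtolower($host));

    // Assume a one-label suffix (e.g. "com") unless the last two labels
    // match an entry in the manual list (e.g. "co.uk").
    $suffixLength = 1;
    if (count($labels) >= 2
        && in_array(implode('.', array_slice($labels, -2)), $multiPartSuffixes, true)) {
        $suffixLength = 2;
    }

    $suffix = implode('.', array_slice($labels, -$suffixLength));
    $rest   = array_slice($labels, 0, -$suffixLength);

    // The label directly before the suffix is the registered name;
    // whatever remains in front of it is the subdomain.
    $name      = array_pop($rest);
    $domain    = $name === null ? '' : $name . '.' . $suffix;
    $subdomain = implode('.', $rest);

    return ['subdomain' => $subdomain, 'domain' => $domain, 'tld' => $suffix];
}

// parse_url supplies host, path and query; splitHost refines the host.
$parts    = parse_url('http://www.example.co.uk/some-path/file.html?a=1&b=2');
$hostInfo = splitHost($parts['host']);
parse_str($parts['query'], $args);   // query string -> associative array
```

Here $hostInfo holds the subdomain, domain and suffix, $parts['path'] holds the path, and $args holds the query arguments, which covers the pieces asked for in the comments. Hosts whose suffix is missing from the list fall back to the single-label rule.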
+1  A: 

Try this:

function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";

    preg_match ( $r, $url, $out );

    return $out;
}

I added dashes to the character classes for the path and the file.

Galen
It works exactly as I wanted... thank you for actually answering the question :)
Mark
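To see why the change works, the path portion of both patterns can be tested in isolation. The sketch below is a simplified check, not the full function: the two patterns are copied from the path line of each version, and the surrounding ^ and $ anchors plus the sample path are assumptions added so the difference is visible.

```php
<?php
// \w matches only letters, digits and underscore, so a hyphen has to be
// listed in the character class explicitly. Placing it last in the class
// keeps it literal instead of being read as a range.
$original = '!^(?P<path>[\w/]*/(?P<file>\w+(?:\.\w+)?)?)?$!';
$amended  = '!^(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?$!';

$path = '/my-dir/some-file.html';   // illustrative path containing hyphens

$matchesOriginal = preg_match($original, $path);       // cannot get past the hyphen
$matchesAmended  = preg_match($amended, $path, $out);  // consumes the whole path

var_dump($matchesOriginal, $matchesAmended, $out['path'], $out['file']);
```

With the anchors in place, the original pattern reports no match for the hyphenated path, while the amended one captures both the path and the file name.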
A: 

The internal PHP function "parse_url" is not always sufficient to parse URLs or URIs correctly into their components.

A standard-compliant, robust and high-performance PHP class for handling and parsing URLs, URIs, URNs and IRIs according to RFC 3986 and RFC 3987 is available to download for free:

http://andreas-hahn.com/en/parse-url

Andreas M. Hahn
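For reference, RFC 3986 itself gives a generic regular expression (Appendix B) that splits any URI reference into scheme, authority, path, query and fragment. The sketch below wraps that expression in a hypothetical helper named splitUri; it is not the class linked above, it performs no validation, and it leaves the authority (user info, host, port) unsplit.

```php
<?php
// Hypothetical helper built on the generic URI regex from RFC 3986, Appendix B.
// Capture groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
function splitUri(string $uri): array
{
    $pattern = '!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!';
    preg_match($pattern, $uri, $m);

    // preg_match omits trailing groups that did not take part in the match,
    // so missing components default to an empty string.
    return [
        'scheme'    => $m[2] ?? '',
        'authority' => $m[4] ?? '',
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? '',
        'fragment'  => $m[9] ?? '',
    ];
}

// Illustrative URL with hyphens in the path.
$uri = splitUri('http://www.example.com/my-dir/some-file.html?arg=value#anchor');
print_r($uri);
```

Because the path group accepts every character except ? and #, hyphens in the path need no special handling here.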