This function is great, but its main flaw is that it doesn't handle domains ending with .co.uk or .com.au. How can it be modified to handle this?

function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";

    preg_match ( $r, $url, $out );

    return $out;
}

To clarify, my reason for looking for something other than parse_url() is that I want to strip out (possibly multiple) subdomains as well.

Judging by the leading answer so far, there seems to be some confusion about what parse_url does.

print_r(parse_url('http://sub1.sub2.test.co.uk'));

Results in:

Array(
[scheme] => http
[host] => sub1.sub2.test.co.uk
)

What I want to extract is "test.co.uk" (sans subdomains), so first using parse_url is a pointless extra step where the output is the same as the input.

+1  A: 

Replace this bit:

(?P<extension>\w+)

With:

(?P<extension>\w+(?:\.\w+)?)

Where the (?:...) part is a non-capturing group, with the ? making it optional.


I'd probably go a step further and change that bit to this:

(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)

Since extensions don't contain numbers or underscores, and are usually just 2 or 3 letters (I think .museum is the longest, at 6, so 10 is probably a safe maximum).

If you do that, you might want to add a case-insensitive flag (or put A-Z in the character classes as well).
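
As a rough sketch of the case-sensitivity point (the anchoring around the extension group and the sample hostnames are my own assumptions, not part of the original function), the stricter pattern only accepts upper-case input once the i flag is added:

```php
<?php
// Stricter extension group, anchored to a bare two- or three-part hostname.
// The anchoring and the sample hostnames are illustrative assumptions.
$body = '^[-\w]+\.(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)$';

$strict = "~$body~";   // case-sensitive
$loose  = "~$body~i";  // case-insensitive

preg_match($loose, 'test.co.uk', $m);
echo $m['extension'], "\n"; // co.uk

preg_match($loose, 'Test.CO.UK', $m);
echo $m['extension'], "\n"; // CO.UK

// Without the flag, [a-z] rejects the upper-case labels.
var_dump(preg_match($strict, 'Test.CO.UK')); // int(0)
```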


Based on your comment, you want to make the subdomain part of the match 'lazy' (only match if it has to), and thus allow the extension to capture both parts.

To do that, simply add a ? to the end of the quantifier, changing:

(?P<subdomain>[-\w\.]+)

to

(?P<subdomain>[-\w\.]+?)

And (in theory - haven't got PHP handy to test) that will only make the subdomain longer if it has to, so should allow the extension group to match appropriately.


Update:
Ok, assuming you've extracted the full hostname already (using parse_url as suggested in other Q/comments), try this for matching subdomain, domain, and extension parts:

^(?P<subdomains>(?:[\w-]+\.)*?)(?P<domain>[\w-]+(?P<extension>(?:\.[a-z]{2,10}){1,2}))$

This will leave a . on the end of the subdomains (and on the start of the extension), but you can use a substr($string,0,-1) or similar to remove that.

Expanded form for readability:

^
(?P<subdomains>
  (?:[\w-]+\.)*?
)
(?P<domain>
  [\w-]+
  (?P<extension>
     (?:\.[a-z]{2,10}){1,2}
   )
)$

(can add comments to explain any of that, if necessary?)
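
A minimal sketch of how that could be wired up, assuming the hostname comes from parse_url() first. The helper name splitHost() and the sample URL are hypothetical, and I've added the i flag and trimmed the stray dots with rtrim()/ltrim() rather than substr(). It is only a heuristic: a host such as www.example.com also fits the "two short alphabetic labels" shape, so it comes back as one domain with no subdomains.

```php
<?php
// Hypothetical helper: split a bare hostname into subdomains, domain and
// extension using the pattern above. Returns null if the host doesn't match.
function splitHost($host)
{
    $pattern = '~^(?P<subdomains>(?:[\w-]+\.)*?)'
             . '(?P<domain>[\w-]+(?P<extension>(?:\.[a-z]{2,10}){1,2}))$~i';

    if (!preg_match($pattern, $host, $m)) {
        return null;
    }

    return array(
        'subdomains' => rtrim($m['subdomains'], '.'), // drop the trailing dot
        'domain'     => $m['domain'],
        'extension'  => ltrim($m['extension'], '.'),  // drop the leading dot
    );
}

// Let parse_url() find the host, then split it.
$host = parse_url('http://sub1.sub2.test.co.uk/some/path', PHP_URL_HOST);
print_r(splitHost($host));
// subdomains => sub1.sub2, domain => test.co.uk, extension => co.uk
```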

Peter Boughton
I'm afraid this is the result I'm still getting: Array( [0] => http://test.co.uk [scheme] => http [1] => http [login] => [2] => [pass] => [3] => [host] => test.co.uk [4] => test.co.uk [subdomain] => test [5] => test [domain] => co.uk [6] => co.uk [extension] => uk [7] => uk ) It should also work with something like subdomain.subdomain2.test.co.uk
Fo
I think you can fix that by making the subdomain part lazy.
Peter Boughton
Hmm.. I'm still not quite getting a clean domain from this.
Fo
+7  A: 

What's wrong with the built-in parse_url?
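
For reference, a small sketch of what parse_url() gives back (the sample URLs are made up for illustration). Note that it only fills in host when the input carries a scheme; a bare hostname comes back as path:

```php
<?php
// Full URL: every component gets its own key.
$parts = parse_url('http://user:secret@sub1.sub2.test.co.uk:8080/dir/file.php?a=1#top');
print_r($parts);
// Contains scheme, host, port, user, pass, path, query and fragment keys.

// Ask for a single component with the second argument.
echo parse_url('http://sub1.sub2.test.co.uk/dir/', PHP_URL_HOST), "\n"; // sub1.sub2.test.co.uk

// Without a scheme, the whole string is treated as a path.
print_r(parse_url('sub1.sub2.test.co.uk')); // only a [path] key
```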

timdev
`parse_url` is too lenient with malformed URLs, which the OP might not want.
Will Vousden
Actually, the reason is I want to strip out all subdomains as well.
Fo
Hmmm, since parse_url gives you the hostname, why not then write a (simpler) expression to split the subdomains and extension from that?
Peter Boughton
@Fo Why don't you use parse_url for the initial parsing and perform further parsing on the hostname it returns?
George Marian
+3  A: 

This may or may not be of interest, but here's a (somewhat monstrous) regex I wrote that mostly conforms to RFC3986 (it's actually slightly stricter, as it disallows some of the more unusual URI syntaxes):

~^(?:(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)?(?P<authority>(?:(?P<userinfo>(?P<username>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?:(?P<password>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?|(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:)*?)@)?(?P<host>(?P<domain>(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?\.)+(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?))|(?P<ip>(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)))(?::(?P<port>\d+))?(?=/|$)))?(?P<path>/?(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/)*(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/?)?)(?:\?(?P<query>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*?))?(?:#(?P<fragment>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*))?$~i

The named components are:

scheme
authority
  userinfo
    username
    password
  domain
  ip
path
query
fragment

And here's the code that generates it (along with variants defined by some options):

public static function validateUri($uri, &$components = false, $flags = 0)
{
    if (func_num_args() > 3)
    {
        $flags = array_slice(func_get_args(), 2);
    }

    if (is_array($flags))
    {
        $flagsArray = $flags;
        $flags = 0; // Start from an empty bitmask so the |= below works.
        foreach ($flagsArray as $flag)
        {
            if (is_int($flag))
            {
                $flags |= $flag;
            }
        }
    }

    // Set options.
    $requireScheme = !($flags & self::URI_ALLOW_NO_SCHEME);
    $requireAuthority = !($flags & self::URI_ALLOW_NO_AUTHORITY);
    $isRelative = (bool)($flags & self::URI_IS_RELATIVE);
    $requireMultiPartDomain = (bool)($flags & self::URI_REQUIRE_MULTI_PART_DOMAIN);

    // And we're away…

    // Some character types (taken from RFC 3986: http://tools.ietf.org/html/rfc3986).
    $hex = '[\da-f]'; // Hexadecimal digit.
    $pct = "(?:%$hex{2})"; // "Percent-encoded" value.
    $gen = '[\[\]:/?#@]'; // Generic delimiters.
    $sub = '[!$&\'()*+,;=]'; // Sub-delimiters.
    $reserved = "(?:$gen|$sub)"; // Reserved characters.
    $unreserved = '[\w.\~-]'; // Unreserved characters.
    $pChar = "(?:$unreserved|$pct|$sub|:|@)"; // Path characters.
    $qfChar = "(?:$pChar|/|\?)"; // Query/fragment characters.

    // Other entities.
    $octet = '(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)';
    $label = '[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?';

    $scheme = '(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)';

    // Authority components.
    $userInfo = "(?:(?P<userinfo>(?P<username>(?:$unreserved|$pct|$sub)*)?:(?P<password>(?:$unreserved|$pct|$sub)*)?|(?:$unreserved|$pct|$sub|:)*?)@)?";
    $ip = "(?P<ip>$octet\.$octet\.$octet\.$octet)"; // Escape the dots so they only match a literal '.'.
    if ($requireMultiPartDomain)
    {
        $domain = "(?P<domain>(?:$label\.)+(?:$label))";
    }
    else
    {
        $domain = "(?P<domain>(?:$label\.)*(?:$label))";
    }
    $host = "(?P<host>$domain|$ip)";
    $port = '(?::(?P<port>\d+))?';

    // Primary hierarchical URI components.
    $authority = "(?P<authority>$userInfo$host$port(?=/|$))";
    $path = "(?P<path>/?(?:$pChar+/)*(?:$pChar+/?)?)";

    // Final bits.
    $query = "(?:\?(?P<query>$qfChar*?))?";
    $fragment = "(?:#(?P<fragment>$qfChar*))?";

    // Construct the final pattern.
    $pattern = '~^';

    // Only include scheme and authority if the path is not relative.
    if (!$isRelative)
    {
        if ($requireScheme)
        {
            // If the scheme is required, then the authority must also be there.
            $pattern .= $scheme . $authority;
        }
        else if ($requireAuthority)
        {
            $pattern .= "$scheme?$authority";
        }
        else
        {
            $pattern .= "(?:$scheme?$authority)?";
        }
    }
    else
    {
        // Disallow that optional slash we put in $path.
        $pattern .= '(?!/)';
    }

    // Now add standard elements and terminate the pattern.
    $pattern .= $path . $query . $fragment . '$~i';

    // Finally, validate that sucker!
    $components = array();
    $result = (bool)preg_match($pattern, $uri, $matches);
    if ($result)
    {
        // Filter out all of the useless numerical matches.
        foreach ($matches as $key => $value)
        {
            if (!is_int($key))
            {
                $components[$key] = $value;
            }
        }

        return true;
    }
    else
    {
        return false;
    }
}
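
As a smaller illustration of the same fragment-composition approach, here is a sketch that builds only the IPv4 part (the variable names $strictIp and $looseIp and the sample inputs are mine, not part of the class above). One thing worth checking when composing patterns inside double-quoted strings: a bare . between fragments is a regex wildcard, so the separators need to be written as \. to match literal dots.

```php
<?php
// One octet: 0-255, built the same way as in validateUri().
$octet = '(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)';

// Escaped dots: only a literal '.' may separate the octets.
$strictIp = "~^(?P<ip>$octet\.$octet\.$octet\.$octet)$~";

// Unescaped dots: any character is accepted as a separator.
$looseIp = "~^(?P<ip>$octet.$octet.$octet.$octet)$~";

var_dump(preg_match($strictIp, '192.168.0.1')); // int(1)
var_dump(preg_match($strictIp, '192x168y0z1')); // int(0)
var_dump(preg_match($looseIp, '192x168y0z1'));  // int(1), probably not intended
var_dump(preg_match($strictIp, '999.1.1.1'));   // int(0), 999 is not a valid octet
```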
Will Vousden