ansaurus

Question

How to make this PHP URL parsing function nearly perfect?

Answer 1

+1 A:

Replace this bit:

(?P<extension>\w+)

With:

(?P<extension>\w+(?:\.\w+)?)

Where the (?:...) part is a non-capturing group, with the ? making it optional.

I'd probably go a step further and change that bit to this:

(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)

Since extensions don't contain numbers or underscores, and are usually just 2 or 3 letters (I think .museum is the longest, at 6, so 10 is probably a safe maximum).

If you do that, you might want to add a case-insensitive flag (or include A-Z in the character class as well).
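
As a rough illustration of why the flag matters, here is the extension group tested on its own. The ^ and $ anchors and the sample strings are assumptions added for the test and are not part of the suggested replacement:

```php
<?php
// The anchors (^ and $) are added here only so the group can be tested on
// its own; they are not part of the suggested replacement.
$withFlag    = '/^(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)$/i';
$withoutFlag = '/^(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)$/';

var_dump(preg_match($withFlag, 'co.uk'));    // int(1)
var_dump(preg_match($withFlag, 'COM'));      // int(1), thanks to the i flag
var_dump(preg_match($withoutFlag, 'COM'));   // int(0), upper case is rejected
var_dump(preg_match($withFlag, 'c0m'));      // int(0), digits are not allowed
```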

Based on your comment, you want to make the subdomain part of the match 'lazy' (only match if it has to), and thus allow the extension to capture both parts.

To do that, simply add a ? to the end of the quantifier, changing:

(?P<subdomain>[-\w\.]+)

to

(?P<subdomain>[-\w\.]+?)

And (in theory - haven't got PHP handy to test) that will only make the subdomain longer if it has to, so should allow the extension group to match appropriately.
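
A minimal sketch of the greedy versus lazy difference. The patterns are cut down for illustration: only the two named groups come from the text above, while the anchors, the literal dot between the groups, and the i flag are assumptions, so this is not the full pattern from the question:

```php
<?php
// Cut-down, hypothetical patterns: only the two named groups come from the
// answer; the anchors and the literal dot between them are assumptions.
$greedy = '~^(?P<subdomain>[-\w\.]+)\.(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)$~i';
$lazy   = '~^(?P<subdomain>[-\w\.]+?)\.(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)$~i';

preg_match($greedy, 'test.co.uk', $g);
preg_match($lazy, 'test.co.uk', $l);

// Greedy: the subdomain group grabs as much as it can, leaving only "uk".
echo $g['subdomain'], ' | ', $g['extension'], "\n"; // test.co | uk

// Lazy: the subdomain group stops as early as possible, so the extension
// group is free to take both parts.
echo $l['subdomain'], ' | ', $l['extension'], "\n"; // test | co.uk
```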

Update:
Ok, assuming you've extracted the full hostname already (using parse_url as suggested in other Q/comments), try this for matching subdomain, domain, and extension parts:

^(?P<subdomains>(?:[\w-]+\.)*?)(?P<domain>[\w-]+(?P<extension>(?:\.[a-z]{2,10}){1,2}))$

This will leave a . on the end of the subdomains group (and on the start of the extension), but you can use substr($string, 0, -1) or similar to remove the trailing one, and substr($string, 1) for the leading one.

Expanded form for readability:

^
(?P<subdomains>
  (?:[\w-]+\.)*?
)
(?P<domain>
  [\w-]+
  (?P<extension>
     (?:\.[a-z]{2,10}){1,2}
   )
)$

(can add comments to explain any of that, if necessary?)
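
A sketch of how the pattern above might be used once the hostname has been extracted. The function name splitHost, the ~ delimiters, the i flag, and the use of rtrim/ltrim for the stray dots are all assumptions made for this example:

```php
<?php
// Hypothetical helper: splits an already-extracted hostname using the
// pattern from the update above and trims the leftover dots.
function splitHost(string $host): ?array
{
    $pattern = '~^(?P<subdomains>(?:[\w-]+\.)*?)'
             . '(?P<domain>[\w-]+(?P<extension>(?:\.[a-z]{2,10}){1,2}))$~i';

    if (!preg_match($pattern, $host, $m)) {
        return null; // Not something this pattern recognises as a hostname.
    }

    return [
        'subdomains' => rtrim($m['subdomains'], '.'), // drop the trailing dot
        'domain'     => $m['domain'],
        'extension'  => ltrim($m['extension'], '.'),  // drop the leading dot
    ];
}

print_r(splitHost('subdomain.subdomain2.test.co.uk'));
// subdomains => subdomain.subdomain2, domain => test.co.uk, extension => co.uk
```

One limitation worth knowing: the pattern cannot tell a two-part extension from a domain followed by a one-part extension, so www.example.com comes back with domain www.example.com and extension example.com. Telling those apart reliably needs a list of known suffixes (the Public Suffix List is one such list) rather than a pattern alone.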

Peter Boughton 2010-07-18 22:12:17

I'm afraid this is the result I'm getting still:

Array
(
    [0] => http://test.co.uk
    [scheme] => http
    [1] => http
    [login] =>
    [2] =>
    [pass] =>
    [3] =>
    [host] => test.co.uk
    [4] => test.co.uk
    [subdomain] => test
    [5] => test
    [domain] => co.uk
    [6] => co.uk
    [extension] => uk
    [7] => uk
)

It should also work with something like subdomain.subdomain2.test.co.uk

Fo 2010-07-18 22:17:21

I think you can fix that by making the subdomain part lazy.

Peter Boughton 2010-07-18 22:19:10

Hmm.. I'm still not quite getting a clean domain from this.

Fo 2010-07-18 23:27:10

Answer 2

+7 A:

What's wrong with the built-in parse_url?

timdev 2010-07-18 22:15:39

`parse_url` is too lenient with malformed URLs, which the OP might not want.

Will Vousden 2010-07-18 22:16:52

Actually, the reason is I want to strip out all subdomains as well.

Fo 2010-07-18 22:22:16

Hmmm, since parse_url gives you the hostname, why not then write a (simpler) expression to split the subdomains and extension from that?

Peter Boughton 2010-07-18 22:29:04

@Fo Why don't you use parse_url for the initial parsing and perform further parsing on the hostname it returns?

George Marian 2010-07-18 23:01:36
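
A minimal sketch of the two-step approach suggested in the comments above: parse_url does the initial parsing and only the host is processed further. The sample URL and the variable names are made up for the example:

```php
<?php
$url = 'http://user:secret@subdomain.test.co.uk:8080/path/page.php?x=1#top';

// parse_url does the initial split into components.
$parts = parse_url($url);

// Only the host needs further work if the goal is to strip subdomains.
$host = $parts['host'] ?? null;

echo $host, "\n"; // subdomain.test.co.uk

// The labels of the host, ready for whatever splitting rule is chosen.
$labels = explode('.', $host);
print_r($labels);
```

Keep in mind that parse_url is not meant to validate a URL, so a separate check may still be needed where strictness matters.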

Answer 3

+3 A:

This may or may not be of interest, but here's a (somewhat monstrous) regex I wrote that mostly conforms to RFC3986 (it's actually slightly stricter, as it disallows some of the more unusual URI syntaxes):

~^(?:(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)?(?P<authority>(?:(?P<userinfo>(?P<username>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?:(?P<password>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?|(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:)*?)@)?(?P<host>(?P<domain>(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?\.)+(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?))|(?P<ip>(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)))(?::(?P<port>\d+))?(?=/|$)))?(?P<path>/?(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/)*(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/?)?)(?:\?(?P<query>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*?))?(?:#(?P<fragment>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*))?$~i

The named components are:

scheme
authority
  userinfo
    username
    password
  host
    domain
    ip
  port
path
query
fragment

And here's the code that generates it (along with variants defined by some options):

public static function validateUri($uri, &$components = false, $flags = 0)
{
    if (func_num_args() > 3)
    {
        $flags = array_slice(func_get_args(), 2);
    }

    if (is_array($flags))
    {
        $flagsArray = $flags;
        $flags = 0; // Start from an empty bitmask so that |= operates on an integer.
        foreach ($flagsArray as $flag)
        {
            if (is_int($flag))
            {
                $flags |= $flag;
            }
        }
    }

    // Set options.
    $requireScheme = !($flags & self::URI_ALLOW_NO_SCHEME);
    $requireAuthority = !($flags & self::URI_ALLOW_NO_AUTHORITY);
    $isRelative = (bool)($flags & self::URI_IS_RELATIVE);
    $requireMultiPartDomain = (bool)($flags & self::URI_REQUIRE_MULTI_PART_DOMAIN);

    // And we're away…

    // Some character types (taken from RFC 3986: http://tools.ietf.org/html/rfc3986).
    $hex = '[\da-f]'; // Hexadecimal digit.
    $pct = "(?:%$hex{2})"; // "Percent-encoded" value.
    $gen = '[\[\]:/?#@]'; // Generic delimiters.
    $sub = '[!$&\'()*+,;=]'; // Sub-delimiters.
    $reserved = "(?:$gen|$sub)"; // Reserved characters.
    $unreserved = '[\w.\~-]'; // Unreserved characters.
    $pChar = "(?:$unreserved|$pct|$sub|:|@)"; // Path characters.
    $qfChar = "(?:$pChar|/|\?)"; // Query/fragment characters.

    // Other entities.
    $octet = '(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)';
    $label = '[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?';

    $scheme = '(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)';

    // Authority components.
    $userInfo = "(?:(?P<userinfo>(?P<username>(?:$unreserved|$pct|$sub)*)?:(?P<password>(?:$unreserved|$pct|$sub)*)?|(?:$unreserved|$pct|$sub|:)*?)@)?";
    $ip = "(?P<ip>$octet\.$octet\.$octet\.$octet)"; // Dots escaped so they only match a literal ".".
    if ($requireMultiPartDomain)
    {
        $domain = "(?P<domain>(?:$label\.)+(?:$label))";
    }
    else
    {
        $domain = "(?P<domain>(?:$label\.)*(?:$label))";
    }
    $host = "(?P<host>$domain|$ip)";
    $port = '(?::(?P<port>\d+))?';

    // Primary hierarchical URI components.
    $authority = "(?P<authority>$userInfo$host$port(?=/|$))";
    $path = "(?P<path>/?(?:$pChar+/)*(?:$pChar+/?)?)";

    // Final bits.
    $query = "(?:\?(?P<query>$qfChar*?))?";
    $fragment = "(?:#(?P<fragment>$qfChar*))?";

    // Construct the final pattern.
    $pattern = '~^';

    // Only include scheme and authority if the path is not relative.
    if (!$isRelative)
    {
        if ($requireScheme)
        {
            // If the scheme is required, then the authority must also be there.
            $pattern .= $scheme . $authority;
        }
        else if ($requireAuthority)
        {
            $pattern .= "$scheme?$authority";
        }
        else
        {
            $pattern .= "(?:$scheme?$authority)?";
        }
    }
    else
    {
        // Disallow that optional slash we put in $path.
        $pattern .= '(?!/)';
    }

    // Now add standard elements and terminate the pattern.
    $pattern .= $path . $query . $fragment . '$~i';

    // Finally, validate that sucker!
    $components = array();
    $result = (bool)preg_match($pattern, $uri, $matches);
    if ($result)
    {
        // Filter out all of the useless numerical matches.
        foreach ($matches as $key => $value)
        {
            if (!is_int($key))
            {
                $components[$key] = $value;
            }
        }

        return true;
    }
    else
    {
        return false;
    }
}

Will Vousden 2010-07-18 22:31:24
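
The last step of the function, dropping the numeric keys that preg_match adds alongside each named group, can also be sketched with array_filter. The short pattern and the sample URL below are illustrative assumptions, not the full URI pattern from the answer:

```php
<?php
// Illustrative pattern only; not the full URI pattern from the answer.
$pattern = '~^(?P<scheme>[a-z][0-9a-z.+-]*)://(?P<host>[^/:]+)(?::(?P<port>\d+))?~i';

preg_match($pattern, 'http://example.com:8080/index.php', $matches);

// preg_match stores each named group twice: once under its name and once
// under its numeric index. Keep only the entries with string keys.
$components = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);

print_r($components);
// scheme => http, host => example.com, port => 8080
```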
