I'm looking for a method (or function) to extract the domain.ext part of any URL that's fed into the function. The domain extension can be anything (.com, .co.uk, .nl, .whatever), and the URL that's fed into it can be anything from http://www.domain.com to www.domain.com/path/script.php?=whatever

What's the best way to go about doing this?

+5  A: 

You can use parse_url() to do this:

$url = 'http://www.example.com';
$domain = parse_url($url, PHP_URL_HOST);

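In this example $domain will contain www.example.com: parse_url() returns the full host name, including any www. prefix, so that prefix has to be removed separately if it is not wanted.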
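The question also mentions input without a scheme, such as www.domain.com/path/script.php, and parse_url() does not report a host for that form. A minimal sketch of one way around it, assuming it is acceptable to treat scheme-less input as http; the helper name host_from_url is made up for illustration:

```php
<?php
// Hypothetical helper (not part of the original answer): return the host of a URL,
// tolerating input without a scheme such as "www.example.com/path".
function host_from_url($url)
{
    $url = trim($url);
    if (strpos($url, '//') === 0) {
        // Protocol-relative input ("//example.com/..."): assume http.
        $url = 'http:' . $url;
    } elseif (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        // No "scheme://" prefix at all: assume http so parse_url() can see a host.
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() gives false for seriously malformed URLs and null when no host is found.
    return is_string($host) ? strtolower($host) : null;
}

echo host_from_url('http://www.example.com'), "\n";                     // www.example.com
echo host_from_url('www.example.com/path/script.php?x=whatever'), "\n"; // www.example.com
```

Note that this still returns the whole host; trimming it down to domain.ext is a separate step.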

DavidM
Shouldn't that be parse_url() instead of url_parse()?
Darryl Hein
Note: the second argument for parse_url is a PHP5 invention. Anyone on PHP4 (upgrade, please, for the love of God...) will need to use Robert Elwell's way.
ceejayoz
Anyone on PHP4 ... will have to upgrade.
Kent Fredric
+13  A: 

parse_url turns a URL into an associative array:

php > $foo = "http://www.example.com/foo/bar?hat=bowler&accessory=cane";
php > $blah = parse_url($foo);
php > print_r($blah);
Array
(
    [scheme] => http
    [host] => www.example.com
    [path] => /foo/bar
    [query] => hat=bowler&accessory=cane
)
Robert Elwell
What would be the best way to strip out the www. portion if it's present in the domain? I'm not good with regex. The messy way I can think of is $www_check = substr($domain,0,4); if ($www_check == "www.") { echo substr($domain, 4); } else { echo $domain; }
Yegor
@Yegor: $domain = preg_replace('/^www./','',$domain);
Kent Fredric
Er, make that \. and not . (so the pattern is '/^www\./'; an unescaped . matches any character).
Kent Fredric
I like explode on "www." and then use the first instance in the array myself. It generally works just fine.
Robert Elwell
Careful, Robert, as a lot of URLs don't have www in front of them, e.g. images.google.com
gradbot
Yeah, generally for my purposes, that's the goal, as a non-www subdomain is pretty informative about the content being displayed in that part of the site.
Robert Elwell
Slight problem with your suggestion, Robert. It won't find the host if there is no http:// in the URL.
Yegor
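For the www. question raised in these comments, here is a minimal sketch of a non-regex version, with the corrected regex shown alongside for comparison. The function name strip_www is made up for illustration, and the sketch assumes only a single leading www. label should be removed:

```php
<?php
// Hypothetical helper: drop one leading "www." label from a host name, if present.
function strip_www($host)
{
    // Compare the first four characters case-insensitively; no regex needed.
    if (strncasecmp($host, 'www.', 4) === 0) {
        return substr($host, 4);
    }
    return $host;
}

// The regex equivalent, with the dot escaped so it only matches a literal ".":
// $host = preg_replace('/^www\./i', '', $host);

echo strip_www('www.example.com'), "\n";   // example.com
echo strip_www('images.google.com'), "\n"; // images.google.com
echo strip_www('wwwexample.com'), "\n";    // wwwexample.com
```

Other subdomains such as images. are left alone, which matches the concern raised above about hosts that do not start with www.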
+2  A: 

You can also write a regular expression to get exactly what you want.

Here is my attempt at it:

$pattern = '/\w+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
$url = 'http://www.example.com/foo/bar?hat=bowler&accessory=cane';
if (preg_match($pattern, $url, $matches) === 1) {
    echo $matches[0];
}

The output is:

example.com

This pattern also takes into consideration domains such as 'example.com.au'.

Note: I have not consulted the relevant RFC.

firstresponder
A: 

I spent some time thinking about whether it makes sense to use a regular expression for this, but in the end I think not.

firstresponder's regexp came close to convincing me it was the best way, but it didn't work on anything missing a trailing slash (so http://example.com, for instance). I fixed that with the following: '/\w+\..{2,3}(?:\..{2,3})?(?=[\/\W])/i', but then I realized that it matches twice for URLs like 'http://example.com/index.htm'. Oops. That wouldn't be so bad (just use the first one), but it also matches twice on something like this: 'http://abc.ed.fg.hij.kl.mn/', and the first match isn't the right one. :(

A co-worker suggested just getting the host (via parse_url()), splitting it into labels (explode() on '.'; the old split() function is deprecated and was removed in PHP 7), and then taking the last two or three array elements. Whether it is two or three would be based on a list of multi-part suffixes, like 'co.uk', etc. Making up that list becomes the hard part.
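A rough sketch of that idea follows. The function name registrable_domain and the three-entry suffix list are assumptions for illustration only; the list is deliberately tiny and nowhere near complete. A maintained list, such as the Public Suffix List, would be needed for real use.

```php
<?php
// Hypothetical helper. $multiPartSuffixes is a tiny illustrative list, NOT a complete one.
function registrable_domain($host, array $multiPartSuffixes = array('co.uk', 'com.au', 'co.nz'))
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);
    if ($count <= 2) {
        return implode('.', $labels); // already "example.com" or a bare name
    }
    // Keep three labels when the last two form a known suffix, otherwise keep two.
    $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
    $keep = in_array($lastTwo, $multiPartSuffixes, true) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}

echo registrable_domain('www.example.com'), "\n";     // example.com
echo registrable_domain('shop.example.co.uk'), "\n";  // example.co.uk
echo registrable_domain('abc.ed.fg.hij.kl.mn'), "\n"; // kl.mn
```

The input here is a host name, not a full URL, so it would be run on the result of parse_url() rather than on the raw string.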

livingtech