I need to write a function to parse variables which contain domain names. It's best I explain this with an example, the variable could contain any of these things:

here.example.com
example.com
example.org
here.example.org

But when passed through my function all of these must return either example.com or example.co.uk, the root domain name basically. I'm sure I've done this before but I've been searching Google for about 20 minutes and can't find anything. Any help would be appreciated.

EDIT: Ignore the .co.uk, presume that all domains going through this function have a 3 letter TLD.

+5  A: 

Stackoverflow Question Archive:


print get_domain("http://somedomain.co.uk"); // outputs 'somedomain.co.uk'

function get_domain($url)
{
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
  }
  return false;
}
Jonathan Sampson
Doesn't seem to be working.
zuk1
Should work. I use it as-is all over the place. How are you using it?
Jonathan Sampson
+1 for the related question link.
TSomKes
Sorry, it didn't work as-is with the 'example.domain.com' format, but I took your regex and made it work. Thanks a bunch!
zuk1
This doesn't work on domains that are 3 chars long when the tld is 2 chars long. www.exg.ie returns www.exg.ie as the domain. Any ideas?
Ryaner
A: 
$full_domain = $_SERVER['SERVER_NAME'];
$just_domain = preg_replace("/^(.*\.)?([^.]*\..*)$/", "$2", $_SERVER['HTTP_HOST']);
Wbdvlpr
Please read the question before responding.
zuk1
I'm trying to get the domain from a variable which could contain any domain on the internet, not the one the script is residing on.
zuk1
OK, I'll update the answer then.
Wbdvlpr
Sorry, I just realised it was bleedingly obvious I could change your code to do what I need. It certainly works with example.domain.com but not with example.domain.co.uk. I'll hold off in case I get a better answer, but so far this does accomplish what I need; I'd just prefer to have more TLD compatibility.
zuk1
A: 

Regex could help you out there. Try something like this:

([^.]+(\.com|\.co\.uk))$

Zachery Delafosse
A: 

I think your problem is that you haven't clearly defined what exactly you want the function to do. From your examples, you certainly don't want it to just blindly return the last two, or last three, components of the name, but just knowing what it shouldn't do isn't enough.

Here's my guess at what you really want: there are certain second-level domain names, like co.uk., that you'd like to be treated as a single TLD (top-level domain) for purposes of this function. In that case I'd suggest enumerating all such cases and putting them as keys into an associative array with dummy values, along with all the normal top-level domains like com., net., info., etc. Then whenever you get a new domain name, extract the last two components and see if the resulting string is in your array as a key. If not, extract just the last component and make sure that's in your array. (If even that isn't, it's not a valid domain name) Either way, whatever key you do find in the array, take that plus one more component off the end of the domain name, and you'll have your base domain.
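As a minimal sketch of the lookup described above, assuming a deliberately tiny suffix table (the `$suffixes` array and the `base_domain` name are made up for illustration; a real table would list every suffix you care about):

```php
<?php
// Hypothetical, incomplete suffix table: keys are the effective TLDs,
// values are dummies so that isset() can be used for the lookup.
$suffixes = array(
    'com'   => true,
    'org'   => true,
    'net'   => true,
    'uk'    => true,
    'co.uk' => true,
);

// Returns the matched suffix plus one more label, or false if no suffix is known.
function base_domain($domain, array $suffixes)
{
    $parts = explode('.', strtolower(rtrim($domain, '.')));
    $count = count($parts);

    // Try the last two components first, then fall back to the last one.
    foreach (array(2, 1) as $len) {
        if ($count > $len) {
            $suffix = implode('.', array_slice($parts, -$len));
            if (isset($suffixes[$suffix])) {
                return implode('.', array_slice($parts, -($len + 1)));
            }
        }
    }
    return false;
}

echo base_domain('here.example.co.uk', $suffixes), "\n"; // example.co.uk
echo base_domain('here.example.com', $suffixes), "\n";   // example.com
```

A name whose last component is not in the table (for example `example.invalid` here) returns false, which matches the "not a valid domain name" case.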

You could, perhaps, make things a bit simpler by writing a function, instead of using an associative array, to tell whether the last two components should be treated as a single "effective TLD." The function would probably look at the next-to-last component and, if it's shorter than 3 characters, decide that it should be treated as part of the TLD.
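A rough sketch of that length-based heuristic might look like the following. The function names are invented for illustration, and the rule is only a guess: it misfires on suffixes with a three-letter second level (`com.au`) and on two-letter registered names (`ab.com`).

```php
<?php
// Heuristic: a next-to-last label shorter than 3 characters ("co", "ac", "me")
// is assumed to belong to the suffix rather than to the registered name.
function has_two_part_suffix(array $parts)
{
    $count = count($parts);
    return $count >= 3 && strlen($parts[$count - 2]) < 3;
}

// Keeps the last three labels when the suffix looks two-part, else the last two.
function base_domain_guess($domain)
{
    $parts = explode('.', strtolower($domain));
    $keep = has_two_part_suffix($parts) ? 3 : 2;
    return implode('.', array_slice($parts, -$keep));
}

echo base_domain_guess('here.example.co.uk'), "\n"; // example.co.uk
echo base_domain_guess('here.example.org'), "\n";   // example.org
echo base_domain_guess('example.com.au'), "\n";     // com.au (a misfire)
```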

David Zaslavsky
You're right. Presume for this example that all TLDs going through this function have 3 letters (org, net, com). I basically want to strip the subdomain if there is one and be left with 'domain.com/org/net'.
zuk1
+3  A: 

I would do something like the following:

// hierarchical array of top level domains
$tlds = array(
    'com' => true,
    'uk' => array(
        'co' => true,
        // …
    ),
    // …
);
$domain = 'here.example.co.uk';
// split domain
$parts = explode('.', $domain);
$tmp = $tlds;
// traverse the tree in reverse order, from right to left
foreach (array_reverse($parts) as $key => $part) {
    if (isset($tmp[$part])) {
        $tmp = $tmp[$part];
    } else {
        break;
    }
}
// build the result
var_dump(implode('.', array_slice($parts, - $key - 1)));
Gumbo
To me this seems the most comprehensive way of doing this test - you can make it as elaborate and checking as many TLDs/SLDs as you have time to add.
Richy C.
True, but I'd like it to be as lightweight as possible because it's going to be doing potentially thousands of these in one loop, so if there's a more efficient option I'll go with that.
zuk1
An array lookup costs only O(1). And with a maximum depth of two for any top-level domain I know of (https://wiki.mozilla.org/TLD_List), you will always get your result within at most two steps. I don't know of any other way that is more efficient.
Gumbo
The key is how many records are there going to be in `$tlds`?
@user198729 - http://publicsuffix.org/list/ has superseded the list @Gumbo linked to. By my count (`cat effective_tld_names.dat | grep -v "^//" | grep -v "^$" | wc -l`) it's currently 3692 entries, so not too bad.
therefromhere
A: 

To do it well, you'll need a list of the second-level and top-level domains, and you'll need to build an appropriate list of regular expressions from it. A good list of second-level domains is available at https://wiki.mozilla.org/TLD_List. Another test case, apart from the aforementioned CentralNic .uk.com variants, is the Vatican: their website is technically at http://va, and that's a difficult one to match on!
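One way to turn such a list into a pattern is sketched below. The `$suffixes` subset and both function names are assumptions for illustration only; a real list would be far longer and would come from the source linked above.

```php
<?php
// Illustrative subset only, including a CentralNic-style uk.com entry.
$suffixes = array('com', 'org', 'net', 'uk', 'co.uk', 'org.uk', 'uk.com');

// Builds one case-insensitive pattern: a single label followed by a listed suffix.
function build_domain_regex(array $suffixes)
{
    $alternatives = array();
    foreach ($suffixes as $suffix) {
        $alternatives[] = preg_quote($suffix, '/'); // escapes the dots
    }
    return '/(?:^|\.)([a-z0-9][a-z0-9\-]*\.(?:' . implode('|', $alternatives) . '))$/i';
}

// Returns the label plus suffix, or false when nothing in the list matches.
function domain_from_list($host, $regex)
{
    return preg_match($regex, $host, $m) ? $m[1] : false;
}

$regex = build_domain_regex($suffixes);
echo domain_from_list('here.example.co.uk', $regex), "\n";  // example.co.uk
echo domain_from_list('shop.example.uk.com', $regex), "\n"; // example.uk.com
```

Because the pattern is anchored at the end and the regex engine returns the leftmost match, the longest listed suffix wins without any sorting. A bare TLD used as a hostname, like the `va` case, has no label in front of the suffix and so returns false here.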

Richy C.
A: 

Ah - if you just want to handle three character top level domains - then this code works:

<?php
// let's test the code works: these should all return
// example.com, example.net or example.org
$domains = array(
    'here.example.com',
    'example.com',
    'example.org',
    'here.example.org',
    'example.com/ignorethis',
    'example.net/',
    'http://here.example.org/longtest?string=here',
);
foreach ($domains as $domain) {
    testdomain($domain);
}

function testdomain($url) {
    if (preg_match('/^((.+)\.)?([A-Za-z][0-9A-Za-z\-]{1,63})\.([A-Za-z]{3})(\/.*)?$/', $url, $matches)) {
        print 'Domain is: '.$matches[3].'.'.$matches[4].'<br>'."\n";
    } else {
        print 'Domain not found in '.$url.'<br>'."\n";
    }
}
?>

$matches[1]/$matches[2] will contain any subdomain and/or protocol, $matches[3] contains the domain name, $matches[4] the top level domain and $matches[5] contains any other URL path information.

To match most common top level domains you could try changing it to:

if (preg_match('/^((.+)\.)?([A-Za-z][0-9A-Za-z\-]{1,63})\.([A-Za-z]{2,6})(\/.*)?$/',$url,$matches)) {

Or, to cope with multi-part suffixes such as co.uk, list every suffix you need explicitly:

if (preg_match('/^((.+)\.)?([A-Za-z][0-9A-Za-z\-]{1,63})\.(co\.uk|me\.uk|org\.uk|com|org|net|int|eu)(\/.*)?$/',$url,$matches)) {

and so on for any other suffixes you need.

Richy C.
A: 

This is built into PHP.

$url = 'http://username:password@hostname/path?arg=value#anchor';

print_r(parse_url($url));

The above example will output:

Array
(
    [scheme] => http
    [host] => hostname
    [user] => username
    [pass] => password
    [path] => /path
    [query] => arg=value
    [fragment] => anchor
)
Lance Kidwell
That returns the hostname - not necessarily the domain name (i.e. it'll return www.example.com or here.example.com - not example.com as the original poster wanted)
Richy C.
This might be more useful then: http://www.dkim-reputation.org/regdom-libs/
Lance Kidwell
A: 

Building on Jonathan's answer:

function main_domain($domain) {
  if (preg_match('/([a-z0-9][a-z0-9\-]{1,63})\.([a-z]{3}|[a-z]{2}\.[a-z]{2})$/i', $domain, $regs)) {
    return $regs;
  }

  return false;
}

His expression might be a bit better, but this interface seems more like what you're describing.

eswald