I'm trying to write (or find an existing) PHP function that takes a link and extracts the domain from it. The catch is that it needs to cope with unusual-looking domains like:

www.champa.kku.ac.th 

Even looking at this one with my own eyes, I guessed wrong: I thought the domain would be kku.ac.th, but that gives a DNS error when I visit it.

Does anyone know of a good way to reliably extract the domain from URLs such as these?

http://site.com/hello.php
http://site.com.uk/hello.php
http://subdomain.site.com/hello.php
http://subdomain.site.com.uk/hello.php
http://www.champa.kku.ac.th/hello.php // and even the one I couldn't tell
+2  A: 

Maybe the parse_url function could help here?

In your case, with those URLs, the following code:

echo parse_url('http://site.com/hello.php', PHP_URL_HOST) . '<br />';
echo parse_url('http://site.com.uk/hello.php', PHP_URL_HOST) . '<br />';
echo parse_url('http://subdomain.site.com/hello.php', PHP_URL_HOST) . '<br />';
echo parse_url('http://subdomain.site.com.uk/hello.php', PHP_URL_HOST) . '<br />';
echo parse_url('http://www.champa.kku.ac.th/hello.php', PHP_URL_HOST) . '<br />';

gives this output:

site.com
site.com.uk
subdomain.site.com
subdomain.site.com.uk
www.champa.kku.ac.th
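
One caveat worth sketching: parse_url() only reports a host when the input starts with a scheme (or a leading //), so a bare link such as www.champa.kku.ac.th from the question comes back with no host at all. The helper below is a minimal sketch of one way around that. The name extract_host and the choice to assume http:// when no scheme is present are illustrative assumptions, not part of PHP itself.

```php
<?php
// Hypothetical helper (name and defaults are assumptions for illustration):
// return the lower-cased host of a link, tolerating input without a scheme.
function extract_host(string $link): ?string
{
    $link = trim($link);

    // parse_url() needs "scheme://" or a leading "//" to recognise a host,
    // so prepend a default scheme when neither is present.
    if (!preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $link)) {
        $link = 'http://' . $link;
    }

    // With a component argument, parse_url() returns a string, null or false.
    $host = parse_url($link, PHP_URL_HOST);

    return is_string($host) ? strtolower($host) : null;
}

echo extract_host('www.champa.kku.ac.th'), "\n";
echo extract_host('http://Subdomain.Site.com/hello.php'), "\n";
```

This still returns the full host name; it does not decide which part of it is the registered domain.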
Pascal MARTIN
Thanks a lot, Pascal. That solves part of my problem, but it's not quite what I was concerned about. Pekka pins it down more precisely, so I'll likely choose his answer for future readers.
karl
+1  A: 

PHP has the parse_url() function, which will help you do the basic splitting into scheme (protocol), host, port, and so on.

Extracting the "right" domain in uncertain cases is extremely hard, because "two-part TLDs" are sometimes set up by the TLD authority itself (e.g. in the UK) and sometimes run by private enterprises (e.g. .uk.com). I don't think you can avoid maintaining a list of top-level domains that have two parts, like

  • .co.uk
  • .ac.uk
  • .ac.th

Those endings would be treated like TLDs (top-level domains), swallowing the second part.

This is the only way to reliably tell "two-part TLDs" like .co.uk, as in server1.ibm.co.uk (where the two-part .co.uk needs to be removed to determine the domain itself), apart from regular sub-domains like server1.ibm.com (where only .com needs to be removed).
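
The list-based idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name registrable_domain and the constant TWO_PART_SUFFIXES are made up for the example, and the list contains just the three endings mentioned above, so any real use would need a much longer, maintained list.

```php
<?php
// Hand-maintained sample list (assumption: only the endings named above).
const TWO_PART_SUFFIXES = ['co.uk', 'ac.uk', 'ac.th'];

// Hypothetical helper: reduce a host name to "domain + TLD", treating any
// ending found in TWO_PART_SUFFIXES as if it were a single TLD.
function registrable_domain(string $host): string
{
    // Normalise: drop a trailing dot, lower-case, split into labels.
    $labels = explode('.', strtolower(rtrim($host, '.')));

    // Nothing to strip from hosts with one or two labels.
    if (count($labels) <= 2) {
        return implode('.', $labels);
    }

    // If the last two labels form a known two-part ending, keep three
    // labels (name + two-part ending); otherwise keep two (name + TLD).
    $lastTwo = implode('.', array_slice($labels, -2));
    $keep = in_array($lastTwo, TWO_PART_SUFFIXES, true) ? 3 : 2;

    return implode('.', array_slice($labels, -$keep));
}

echo registrable_domain('server1.ibm.co.uk'), "\n";    // ibm.co.uk
echo registrable_domain('server1.ibm.com'), "\n";      // ibm.com
echo registrable_domain('www.champa.kku.ac.th'), "\n"; // kku.ac.th
```

An ending that is missing from the list is handled as an ordinary TLD, so the result is only as good as the list behind it.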

A good starting point to get a list of many important "two-part TLDs" is the domain search at speednames.com (select "all" in countries).

Pekka
I was thinking the same thing regarding "I think you won't get around maintaining lists of top level domains that have two parts". Is there such a list? I tried Wikipedia and could only find the normal list: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
karl
@karl I don't think there's an official list, because many of those are private enterprises. Check out speednames, they have a lot of "two-part TLDs" in their portfolio. It's a good start I think.
Pekka
A: 

With Ruby you can use the Domainatrix library (gem):

http://www.pauldix.net/2009/12/parse-domains-from-urls-easily-with-domainatrix.html

require 'rubygems'
require 'domainatrix'
s = 'http://www.champa.kku.ac.th/dir1/dir2/file?option1&option2'
url = Domainatrix.parse(s)
url.domain
=> "kku"

great tool! :-)

Tilo