How can I fetch a domain name from a URL String?

Examples:

+----------------------+------------+
| input                | output     |
+----------------------+------------+
| www.google.com       | google     |
| www.mail.yahoo.com   | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk        | abc        |
+----------------------+------------+

+2  A: 

I don't know of any libraries, but the string manipulation of domain names is easy enough.

The hard part is knowing whether the name is at the second or third level. For this you will need a data file that you maintain (e.g. for .uk it is not always the third level; some organisations, such as bl.uk and jet.uk, exist at the second level).

The source of Firefox from Mozilla has such a data file, check the Mozilla licensing to see if you could reuse that.
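Assuming you have such a data file, the lookup itself is short. Here is a rough Python sketch in which a tiny hand-picked set of suffixes stands in for the real data file (the set contents and the function name are illustrative only):

```python
# Sketch only: PUBLIC_SUFFIXES is a tiny hand-picked stand-in for the
# real data file (e.g. Mozilla's effective TLD list).
PUBLIC_SUFFIXES = {'com', 'co.in', 'au.uk', 'co.uk', 'uk', 'in'}

def registered_name(hostname):
    """Return the labels to the left of the longest matching suffix."""
    labels = hostname.lower().split('.')
    if labels and labels[0] == 'www':
        labels = labels[1:]
    # Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        if '.'.join(labels[i:]) in PUBLIC_SUFFIXES:
            return '.'.join(labels[:i])
    return '.'.join(labels)

print(registered_name('www.mail.yahoo.co.in'))  # mail.yahoo
```

The point is that all of the cleverness lives in the data, not the code: keeping the suffix set up to date is the maintenance burden Richard describes.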

Richard
A: 

In JavaScript, using window.location:
window.location.hostname => the host name (without the port number or square brackets), e.g. www.google.com

oops.... just reread your question and noticed that it probably doesn't apply to your case...

vitorsilva
A: 

1. Remove www.

2. Find the first /.

3. Find the first . before it.

4. The string between them is the first-level domain (the TLD).

5. Everything to the left of that . is the domain name.

6. Remove additional compound suffixes like co.uk using a dictionary: you need to strip .co.uk, but not .msk.ru or .spb.ru, which only look like common suffixes and are actually private domains.
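A minimal Python sketch of the steps above (SECOND_LEVEL is a tiny made-up sample of the dictionary mentioned in the last step):

```python
# Made-up sample of the compound-suffix dictionary from the last step.
SECOND_LEVEL = {'co.uk', 'org.uk', 'co.in', 'au.uk'}

def domain_from_url(url):
    host = url.split('//')[-1]        # drop the scheme, if any
    host = host.split('/')[0]         # everything before the first /
    if host.startswith('www.'):       # step 1: remove www
        host = host[4:]
    parts = host.split('.')
    # Strip a compound suffix like 'co.uk' plus the TLD, or just the TLD.
    if '.'.join(parts[-2:]) in SECOND_LEVEL:
        return '.'.join(parts[:-2])
    return '.'.join(parts[:-1])

print(domain_from_url('http://www.guardian.co.uk/news'))  # guardian
```

As the comments below point out, the dictionary is the weak point: without entries like co.uk, names such as guardian.co.uk come out wrong.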

abatishchev
Won't work with (for example) www.guardian.co.uk
pi
Nor will it work for other prefixes, e.g. blog.stackoverflow.com!
Richard
+4  A: 

The list at the Mozilla website gives you the effective top-level domain list.

chburd
A: 

You need a list of what domain prefixes and suffixes can be removed. For example:

Prefixes:

  • www.

Suffixes:

  • .com
  • .co.in
  • .au.uk
Gumbo
works only for the samples and maintaining such lists does not scale
George Jempty
So what else. Guessing?
Gumbo
+5  A: 
/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/
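A quick way to exercise that pattern (Python's `re` here; note it only covers the three suffixes in the alternation, matching the question's samples):

```python
import re

# The pattern from the answer above, limited to the question's suffixes.
PATTERN = re.compile(r'^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$')

m = PATTERN.match('www.mail.yahoo.co.in')
print(m.group(1))  # mail.yahoo
```

The lazy `(.*?)` is what keeps the captured group from swallowing part of the suffix.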
J.F. Sebastian
+1 - talk about succinct - no English in the response. Love it.
Daniel Paull
no english in the question....
ck
A: 

Look in System.Uri

Where in the System.Uri space do you see anything that would answer the question?
Renaud Bompuis
+5  A: 

I once had to write such a regex for a company I worked for. The solution was this:

  • Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk for example, so it is not really usable for this.
  • Join the list as in the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.

Example regex:

.*?([^\.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$

This worked really well and also matched weird, unofficial top-levels like de.com and friends.

The upside:

  • Very fast if regex is optimally ordered

The downside of this solution is of course:

  • Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
  • Very large regex so not very readable.
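One way to soften the first downside is to generate the regex from the suffix list rather than writing it by hand. A rough sketch (sample list hardcoded; real code would read it from the data file, and sorting longest-first preserves the ordering requirement, so org.uk is tried before uk):

```python
import re

def build_tld_regex(suffixes):
    """Build a regex matching name + suffix; longest suffixes first."""
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = '|'.join(re.escape(s) for s in ordered)
    return re.compile(r'([^.]+)\.(%s)$' % alternation)

# Hardcoded sample; the real input would be the full ccTLD/gTLD list.
rx = build_tld_regex(['uk', 'org.uk', 'ac.uk', 'com', 'net'])
print(rx.search('example.org.uk').group(1))  # example
```

Regenerating after every list update turns the tedious manual edit into a one-step rebuild.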
pi
RE: tedious to update - Write a little code generator program to generate the regex based on the input data files.
Evan
True. With a good test harness this should be possible. We of course did no testing then...
pi
The list from Mozilla seems pretty good actually -- it has *.uk to match .ac.uk . You just have to figure out the format and interpret the rules correctly.
Josh Gibson
Worth noting that if you parse the mozilla list for ALL the possible tlds, the regex compilation fails. (on PHP at least)
buggedcom
A: 
#!/usr/bin/perl -w
use strict;

my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
  print $3;
}
If you used characters other than a forward slash for the match operator, you wouldn't need so many escape characters and could make the regex more readable, e.g. `$url =~ m{([^:]*://)?([^/]*\.)*([^/\.]+)\.[^/]+}`. Not sure you want the looping operator (/g) either?
plusplus
True, although the big problem with my response is that it won't work for foreign domains, since they don't follow the standard US format "xxx.(com|edu|org|etc)". So telegraph.co.uk won't match. Makes me think that you really do need to explicitly list out all of the various country codes in order to match something like that.
Or, since other people have already figured this stuff out, just use a module to do it, such as URI::Find - http://search.cpan.org/perldoc?URI::Find - or, if you just want a regex, http://search.cpan.org/perldoc?Regexp::Common::URI
plusplus
Of course, but when someone asks for a regex, it's always fun to work it out :)
A: 
import urlparse

GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs', 
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
    ]

def get_domain(url):
    hostname = urlparse.urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse.urlparse('http://' + url).netloc

    # Remove the 'user:passwd@' and ':port' parts, then a leading 'www.'
    # (lstrip('www.') would be wrong here: it strips a character set,
    # not a prefix, and would mangle hosts like 'wwf.org')
    hostname = hostname.split('@')[-1].split(':')[0]
    if hostname.startswith('www.'):
        hostname = hostname[len('www.'):]
    hostname = hostname.split('.')

    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    if num_parts >= 3:
        return '.'.join(hostname[:-2])

This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.

However it'll do the job in most cases.

Juan-Pablo Scaletti