How can I fetch a domain name from a URL String?

Examples:

+----------------------+------------+
| input                | output     |
+----------------------+------------+
| www.google.com       | google     |
| www.mail.yahoo.com   | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk        | abc        |
+----------------------+------------+

+2  A: 

I don't know of any libraries, but the string manipulation of domain names is easy enough.

The hard part is knowing whether the name is at the second or third level. For this you will need a data file that you maintain (e.g. for .uk it is not always the third level; some organisations, such as bl.uk and jet.uk, exist at the second level).

The source of Firefox from Mozilla has such a data file, check the Mozilla licensing to see if you could reuse that.
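Assuming you have such a data file, the lookup itself is short. Here is a rough Python sketch in which a tiny hand-picked set of suffixes stands in for the real data file (the set contents and the function name are illustrative only):

```python
# Sketch only: PUBLIC_SUFFIXES is a tiny hand-picked stand-in for the
# real data file (e.g. Mozilla's effective TLD list).
PUBLIC_SUFFIXES = {'com', 'co.in', 'au.uk', 'co.uk', 'uk', 'in'}

def registered_name(hostname):
    """Return the labels to the left of the longest matching suffix."""
    labels = hostname.lower().split('.')
    if labels and labels[0] == 'www':
        labels = labels[1:]
    # Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        if '.'.join(labels[i:]) in PUBLIC_SUFFIXES:
            return '.'.join(labels[:i])
    return '.'.join(labels)

print(registered_name('www.mail.yahoo.co.in'))  # mail.yahoo
```

The point is that all of the cleverness lives in the data, not the code: keeping the suffix set up to date is the maintenance burden Richard describes.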

Richard
A: 

In JavaScript, using window.location:
window.location.hostname => the host name (without the port number or square brackets), e.g. www.google.com

oops.... just reread your question and noticed that it probably doesn't apply to your case...

vitorsilva
A: 

1. Remove www.

2. Find the first /.

3. Find the first . before it.

4. The string between them is the first-level domain (the TLD).

5. Everything to the left of that . is the domain name.

6. Remove additional compound suffixes like co.uk using a dictionary: you need to strip .co.uk, but not .msk.ru or .spb.ru, which only look like common suffixes and are actually private domains.
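A minimal Python sketch of the steps above (SECOND_LEVEL is a tiny made-up sample of the dictionary mentioned in the last step):

```python
# Made-up sample of the compound-suffix dictionary from the last step.
SECOND_LEVEL = {'co.uk', 'org.uk', 'co.in', 'au.uk'}

def domain_from_url(url):
    host = url.split('//')[-1]        # drop the scheme, if any
    host = host.split('/')[0]         # everything before the first /
    if host.startswith('www.'):       # step 1: remove www
        host = host[4:]
    parts = host.split('.')
    # Strip a compound suffix like 'co.uk' plus the TLD, or just the TLD.
    if '.'.join(parts[-2:]) in SECOND_LEVEL:
        return '.'.join(parts[:-2])
    return '.'.join(parts[:-1])

print(domain_from_url('http://www.guardian.co.uk/news'))  # guardian
```

As the comments below point out, the dictionary is the weak point: without entries like co.uk, names such as guardian.co.uk come out wrong.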

abatishchev
Won't work with (for example) www.guardian.co.uk
pi
Nor will it work for other prefixes, e.g. blog.stackoverflow.com!
Richard
+4  A: 

The list at the Mozilla website gives you the effective top-level domain list.

chburd
A: 

You need a list of what domain prefixes and suffixes can be removed. For example:

Prefixes:

  • www.

Suffixes:

  • .com
  • .co.in
  • .au.uk
Gumbo
works only for the samples and maintaining such lists does not scale
George Jempty
So what else. Guessing?
Gumbo
+5  A: 
/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/
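A quick way to exercise that pattern (Python's `re` here; note it only covers the three suffixes in the alternation, matching the question's samples):

```python
import re

# The pattern from the answer above, limited to the question's suffixes.
PATTERN = re.compile(r'^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$')

m = PATTERN.match('www.mail.yahoo.co.in')
print(m.group(1))  # mail.yahoo
```

The lazy `(.*?)` is what keeps the captured group from swallowing part of the suffix.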
J.F. Sebastian
+1 - talk about succinct - no English in the response. Love it.
Daniel Paull
no english in the question....
ck
A: 

Look in System.Uri

Where in the System.Uri space do you see anything that would answer the question?
Renaud Bompuis
+5  A: 

I once had to write such a regex for a company I worked for. The solution was this:

  • Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk for example, so it is not really usable for this.
  • Join the list as in the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.

Example regex:

.*?([^\.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$

This worked really well and also matched weird, unofficial top-levels like de.com and friends.

The upside:

  • Very fast if regex is optimally ordered

The downside of this solution is of course:

  • Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
  • Very large regex so not very readable.
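One way to soften the first downside is to generate the regex from the suffix list rather than writing it by hand. A rough sketch (sample list hardcoded; real code would read it from the data file, and sorting longest-first preserves the ordering requirement, so org.uk is tried before uk):

```python
import re

def build_tld_regex(suffixes):
    """Build a regex matching name + suffix; longest suffixes first."""
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = '|'.join(re.escape(s) for s in ordered)
    return re.compile(r'([^.]+)\.(%s)$' % alternation)

# Hardcoded sample; the real input would be the full ccTLD/gTLD list.
rx = build_tld_regex(['uk', 'org.uk', 'ac.uk', 'com', 'net'])
print(rx.search('example.org.uk').group(1))  # example
```

Regenerating after every list update turns the tedious manual edit into a one-step rebuild.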
pi
RE: tedious to update - Write a little code generator program to generate the regex based on the input data files.
Evan
True. With a good test harness this should be possible. We of course did no testing then...
pi
The list from Mozilla seems pretty good actually -- it has *.uk to match .ac.uk . You just have to figure out the format and interpret the rules correctly.
Josh Gibson
Worth noting that if you parse the mozilla list for ALL the possible tlds, the regex compilation fails. (on PHP at least)
buggedcom
A: 
#!/usr/bin/perl -w
use strict;

my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
  print $3;
}
If you used characters other than a forward slash for the match operator, you wouldn't need so many escape characters and could make the regex more readable, e.g. `$url =~ m{([^:]*://)?([^/]*\.)*([^/\.]+)\.[^/]+}`. Not sure you want the looping operator (/g) either?
plusplus
True, although the big problem with my response is that it won't work for foreign domains, since they don't follow the standard US format "xxx.(com|edu|org|etc)". So telegraph.co.uk won't match. Makes me think that you really do need to explicitly list out all of the various country codes in order to match something like that.
Or, since other people have already figured this stuff out, just use a module to do it, such as URI::Find - http://search.cpan.org/perldoc?URI::Find - or, if you just want a regex, http://search.cpan.org/perldoc?Regexp::Common::URI
plusplus
Of course, but when someone asks for a regex, it's always fun to work it out :)
A: 
import urlparse

GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs', 
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
    ]

def get_domain(url):
    hostname = urlparse.urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse.urlparse('http://' + url).netloc

    # Remove the 'user:passwd@' and ':port' parts, then a leading 'www.'
    # (lstrip('www.') would be wrong here: it strips a character set,
    # not a prefix, and would mangle hosts like 'wwf.org')
    hostname = hostname.split('@')[-1].split(':')[0]
    if hostname.startswith('www.'):
        hostname = hostname[len('www.'):]
    hostname = hostname.split('.')

    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    if num_parts >= 3:
        return '.'.join(hostname[:-2])

This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.

However it'll do the job in most cases.

Juan-Pablo Scaletti