views: 2207

answers: 3

Hello,

how would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

import urlparse  # Python 2 stdlib; on Python 3 this is urllib.parse
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs or country codes (because they change)?
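
For example, with the one-liner above (Python 2):

import urlparse

print '.'.join(urlparse.urlparse("http://www.foo.com").netloc.split('.')[-2:])
# foo.com -- right
print '.'.join(urlparse.urlparse("http://www.foo.com.au").netloc.split('.')[-2:])
# com.au -- wrong; I want foo.com.au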

thanks

+6  A: 

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only domains like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly, like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
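
For example, once such a table is loaded into a set, the lookup is a longest-suffix-first scan. A minimal sketch (the names suffixes and registered_domain are illustrative, and it ignores the wildcard and exception rules a real list contains -- see the answer below for those):

from urlparse import urlparse

# a tiny hand-made sample; in practice, load these from an auxiliary
# source such as Mozilla's effective TLD list
suffixes = set(["com", "au", "com.au", "uk", "co.uk"])

def registered_domain(url):
    parts = urlparse(url).netloc.split('.')
    # try the longest candidate suffix first: foo.com.au, then com.au, then au
    for i in range(1, len(parts)):
        if '.'.join(parts[i:]) in suffixes:
            # the registered domain is the matched suffix plus one more label
            return '.'.join(parts[i-1:])
    return None

print registered_domain("http://www.foo.com.au")   # foo.com.au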

Alex Martelli
+2  A: 

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

And another:

http://www.iana.org/domains/root/db/

S.Lott
That doesn't help, because it doesn't tell you which ones have an "extra level", like co.uk.
Lennart Regebro
Lennart: It helps; you can wrap them as optional groups within a regex.
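For instance (a rough sketch of that idea with a hand-made pattern; a real pattern would be generated from the full suffix list):

import re
# hand-made sample alternation covering just three suffixes
pattern = re.compile(r'([^.]+\.(?:com|co\.uk|com\.au))$')
m = pattern.search('www.foo.com.au')
if m:
    print m.group(1)  # foo.com.au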
Lakshman Prasad
+5  A: 

Using this file of effective TLDs, which someone else found on Mozilla's website:

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

def getDomain(url, tlds):
    urlElements = urlparse(url).netloc.split('.')
    # urlElements = ["abcde", "co", "uk"]

    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

        # match tlds: 
        if (exceptionCandidate in tlds):
            return ".".join(urlElements[i:]) 
        if (candidate in tlds or wildcardCandidate in tlds):
            return ".".join(urlElements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print getDomain("http://abcde.co.uk",tlds)

results in:

abcde.co.uk

I'd appreciate it if someone could let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the lastIElements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?

Markus
This doesn't treat exceptions to wildcard TLDs correctly yet.
Markus
Edit: now it does.
Markus
If you need to call getDomain() often in practice, such as extracting domains from a large log file, I would recommend making tlds a set, e.g. tlds = set([line.strip() for line in tldFile if line[0] not in "/\n"]). This gives you constant-time lookup for each check of whether some item is in tlds. I saw a speedup of about 1500 times for the lookups (set vs. list) and, for my entire operation of extracting domains from a ~20-million-line log file, about a 60 times speedup (6 minutes, down from 6 hours).
Bryce Thomas
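
As a code block, Bryce's change is a single line at load time; everything else in getDomain() stays the same:

# build a set instead of a list for constant-time membership tests
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = set(line.strip() for line in tldFile if line[0] not in "/\n")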