Using this file of effective TLDs, which someone else found on Mozilla's website:
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

def getDomain(url, tlds):
    urlElements = urlparse(url)[1].split('.')
    # urlElements = ["abcde","co","uk"]

    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

        # match tlds:
        if (exceptionCandidate in tlds):
            return ".".join(urlElements[i:])
        if (candidate in tlds or wildcardCandidate in tlds):
            return ".".join(urlElements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print getDomain("http://abcde.co.uk",tlds)
results in:
abcde.co.uk
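
That run only exercises a plain rule (co.uk). As a rough sketch of how the wildcard and exception branches behave, here is the same matching logic in Python 3, using a small hand-written rule list in the file's format instead of the real file. The names `get_domain` and `SAMPLE_TLDS` and the example hostnames are made up for this illustration, and the sample rules are only assumed to mirror the kind of entries the file contains.

```python
from urllib.parse import urlparse

# Hypothetical sample rules in the file's format: plain rules,
# one wildcard rule, and one exception to that wildcard.
SAMPLE_TLDS = ["uk", "co.uk", "jp", "*.kawasaki.jp", "!city.kawasaki.jp"]


def get_domain(url, tlds):
    """Same matching logic as getDomain, written for Python 3."""
    labels = urlparse(url).netloc.split(".")
    # Try the longest suffix first, then progressively shorter ones.
    for i in range(-len(labels), 0):
        last = labels[i:]
        candidate = ".".join(last)
        wildcard_candidate = ".".join(["*"] + last[1:])
        exception_candidate = "!" + candidate
        if exception_candidate in tlds:
            # An exception rule names a registrable domain directly.
            return candidate
        if candidate in tlds or wildcard_candidate in tlds:
            # The match is a public suffix; keep one more label to its left.
            return ".".join(labels[i - 1:])
    raise ValueError("Domain not in global list of TLDs")


print(get_domain("http://abcde.co.uk", SAMPLE_TLDS))              # abcde.co.uk
print(get_domain("http://www.city.kawasaki.jp", SAMPLE_TLDS))     # city.kawasaki.jp
print(get_domain("http://www.foo.bar.kawasaki.jp", SAMPLE_TLDS))  # foo.bar.kawasaki.jp
```

The exception check has to come before the wildcard check: for www.city.kawasaki.jp both `!city.kawasaki.jp` and `*.kawasaki.jp` match on the same iteration, and only the exception gives the right answer.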
I'd appreciate it if someone could tell me which parts of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the lastIElements list, but I couldn't think of one. I also don't know whether ValueError is the best exception to raise. Comments?