I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of it. How do I go about this in Python?
Thanks. I knew about this library but for some reason it didn't cross my mind when thinking about this problem.
Gavin Schulz
2009-10-05 18:34:48
A:
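Getting the hostname is easy enough using urlparse (the module is named urllib.parse in Python 3):
import urlparse  # Python 2; on Python 3 use "import urllib.parse as urlparse"
hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname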
Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
For the most common cases, however, you can probably handle the former specially and ignore the latter, but be aware that it won't be 100% accurate.
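parts = urlparse.urlparse(url).hostname.split(".")
# Keep three labels when the next-to-last one is short (e.g. "co" in "co.uk"), otherwise two.
hostname = ".".join(parts[-3:] if len(parts) > 2 and len(parts[-2]) < 4 else parts[-2:])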
This uses the last three parts if the next-to-last part is shorter than four characters (e.g. ".com.au", ".co.uk") and the last two parts otherwise.
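For Python 3, where the module is urllib.parse, the heuristic above might be wrapped up roughly as follows. This is only a sketch: the function name root_domain is hypothetical, and returning None for a URL with no hostname is an assumption that the original snippet does not make.

```python
from urllib.parse import urlparse


def root_domain(url):
    """Guess the "root domain" of a URL with a length heuristic (not exact)."""
    hostname = urlparse(url).hostname
    if hostname is None:
        # Assumption: a URL without a network location has no domain to report.
        return None
    parts = hostname.split(".")
    # Keep three labels when the next-to-last one is short ("co" in "co.uk"),
    # otherwise keep two. Single-label hosts fall through unchanged.
    if len(parts) > 2 and len(parts[-2]) < 4:
        return ".".join(parts[-3:])
    return ".".join(parts[-2:])


print(root_domain("http://www.techcrunch.com/"))     # techcrunch.com
print(root_domain("http://www.theregister.co.uk/"))  # theregister.co.uk
print(root_domain("http://devbox12/"))               # devbox12
```

Short second-level names still fool it: for http://www.sun.com/ it returns www.sun.com, which is why this should be treated as a best guess.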
Ben Blank
2009-10-05 18:35:45
A:
This worked for my purposes. I figured I'd share it.
".".join("www.sun.google.com".split(".")[-2:])
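Wrapped in a function for reuse, that one-liner could look like the sketch below. The name last_two_labels is hypothetical, and it expects a bare hostname rather than a full URL. It has no special case for suffixes such as .co.uk, as the second call shows.

```python
def last_two_labels(hostname):
    """Return the last two dot-separated labels of a hostname."""
    return ".".join(hostname.split(".")[-2:])


print(last_two_labels("www.sun.google.com"))     # google.com
print(last_two_labels("www.theregister.co.uk"))  # co.uk
```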
Joe J
2010-07-30 06:53:24