I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of it. How do I go about this in Python?

+5  A: 

This might do the trick.

http://docs.python.org/library/urlparse.html

Eli
Thanks. I knew about this library but for some reason it didn't cross my mind when thinking about this problem.
Gavin Schulz
A: 

Getting the hostname is easy enough using urlparse:

import urlparse  # Python 2; on Python 3 this module is urllib.parse
hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname

Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
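To make the ambiguity concrete, here is a small sketch that splits a few hostnames into their dot-separated labels. It assumes Python 3, where the old urlparse module lives in urllib.parse; the `urls` list and `labels` dict are just illustrative names.

```python
from urllib.parse import urlparse  # Python 3 home of the old urlparse module

urls = [
    "http://www.techcrunch.com/",
    "http://www.theregister.co.uk/",
    "http://devbox12/",
]

# Map each URL to its hostname split into dot-separated labels.
labels = {url: urlparse(url).hostname.split(".") for url in urls}

for url, parts in labels.items():
    print(url, "->", parts)
```

Nothing in the labels themselves says that the registered domain is the last two labels in the first case and the last three in the second, or that the third host has no domain part at all.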

For the most common cases, however, you can probably handle the former specially and ignore the latter, but be aware that it won't be 100% accurate.

parts = urlparse.urlparse(url).hostname.split(".")
hostname = ".".join(parts[-3:] if len(parts[-2]) < 4 else parts[-2:])

This uses the last three parts if the next-to-last part is less than four characters (e.g. ".com.au", ".co.uk") and the last two parts otherwise.
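If you want the same heuristic as a reusable function, something like the following might work. This is only a sketch: it assumes Python 3 (urllib.parse), `guess_root_domain` is a made-up name, and the guards for a missing hostname and for hosts with fewer than three labels are my own additions, not part of the snippet above.

```python
from urllib.parse import urlparse


def guess_root_domain(url):
    """Heuristic guess at the "root domain" of a URL (hypothetical helper)."""
    hostname = urlparse(url).hostname
    if hostname is None:
        # No network location was parsed, e.g. the scheme is missing.
        return None
    parts = hostname.split(".")
    if len(parts) < 3:
        # Bare hosts like "devbox12" or already-short names like "example.com".
        return hostname
    # Short next-to-last label (".co.uk", ".com.au"): keep three labels.
    keep = 3 if len(parts[-2]) < 4 else 2
    return ".".join(parts[-keep:])
```

Keep in mind the heuristic misfires whenever the registered name itself is shorter than four characters: guess_root_domain("http://www.bbc.com/") gives back "www.bbc.com", not "bbc.com".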

Ben Blank
A: 

This worked for my purposes. I figured I'd share it.

".".join("www.sun.google.com".split(".")[-2:])
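Wrapped in a function, the same one-liner could look like the sketch below (`last_two_labels` is just a name I picked). It takes a bare hostname, not a full URL, and it only fits domains whose suffix is a single label.

```python
def last_two_labels(hostname):
    """Return the last two dot-separated labels of a hostname (hypothetical helper)."""
    return ".".join(hostname.split(".")[-2:])


print(last_two_labels("www.sun.google.com"))     # google.com
print(last_two_labels("www.theregister.co.uk"))  # co.uk
```

The second call shows the limit: for two-part suffixes like .co.uk you get the suffix back instead of the site's domain.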
Joe J