Getting the subdomain from a URL sounds easy at first.

http://www.domain.example

Scan for the first period, then return whatever came between it and the "http://" ...

Then you remember

http://super.duper.domain.example

Oh. So then you think, okay, find the last period, go back a word and get everything before!

Then you remember

http://super.duper.domain.co.uk

And you're back to square one. Anyone have any great ideas besides storing a list of all TLDs?

John

A: 

This question has already been asked here: Getting Parts of a URL

Edit: A similar question has been asked here : )

jb
I don't think it's the same question - this one seems to be more about the administrative cut points in the domain name, which can't be worked out just by looking at the string
Alnitak
+13  A: 

Anyone have any great ideas besides storing a list of all TLDs?

No, because each TLD differs on what counts as a subdomain, second-level domain, etc.

Keep in mind that there are top-level domains, second-level domains, and subdomains. Technically speaking, everything except the TLD is a subdomain.

In the domain.co.uk example, "domain" is a subdomain, "co" is a second-level domain, and "uk" is the TLD.

So the problem is more complex than it appears at first blush, and the answer depends on how each TLD is managed. You'll need a database of all the TLDs that includes their particular partitioning, i.e., what counts as a second-level domain versus a subdomain. There aren't too many TLDs, so the list is reasonably manageable, but collecting all that information isn't trivial. There may already be such a list available.

Looks like http://publicsuffix.org/ is one such list - all the common suffixes (.com, .co.uk, etc.) in a list suitable for searching. Parsing it still won't be easy, but at least you don't have to maintain the list yourself.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

The Public Suffix List is an initiative of the Mozilla Foundation. It is available for use in any software, but was originally created to meet the needs of browser manufacturers. It allows browsers to, for example:

  • Avoid privacy-damaging "supercookies" being set for high-level domain name suffixes
  • Highlight the most important part of a domain name in the user interface
  • Accurately sort history entries by site

Looking through the list, you can see it's not a trivial problem. I think a list is the only correct way to accomplish this...
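
As an illustration of what "searching" the list involves, here is a minimal sketch in Python (the thread names no language; the filename public_suffix_list.dat and the deliberately simplified handling of wildcard and exception rules are my assumptions, not a complete implementation of the list's matching algorithm):

```python
# Minimal sketch: split a hostname using the Public Suffix List.
# Assumes the list has been downloaded from publicsuffix.org and saved as
# "public_suffix_list.dat". Wildcard ("*.foo") and exception ("!bar.foo")
# rules get simplified handling here, so treat this as illustrative only.

def load_rules(path="public_suffix_list.dat"):
    rules = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("//"):  # skip comments and blanks
                rules.add(line)
    return rules

def split_domain(hostname, rules):
    """Return (subdomain, registrable domain); the second value is None
    when the whole hostname is itself a public suffix."""
    labels = hostname.lower().split(".")
    suffix_len = 1  # if no rule matches, the last label is treated as the suffix
    for i in range(len(labels)):  # most specific candidate first = longest match
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            suffix_len = len(labels) - i - 1  # exception: suffix is one label shorter
            break
        if candidate in rules or wildcard in rules:
            suffix_len = len(labels) - i
            break
    if suffix_len >= len(labels):
        return hostname, None
    cut = len(labels) - (suffix_len + 1)  # registrable = suffix plus one more label
    return ".".join(labels[:cut]), ".".join(labels[cut:])

rules = load_rules()
print(split_domain("super.duper.domain.co.uk", rules))  # ('super.duper', 'domain.co.uk')
print(split_domain("www.domain.example", rules))        # ('www', 'domain.example'), via the default rule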

Adam Davis
great - I mention publicsuffix.org first and everyone else gets the rep points! I recommend anyone reading this read the IETF draft that I referred to - it's written by a senior Opera programmer who's trying to solve this problem for real.
Alnitak
Mozilla has code that uses this list. The project was spun off because the original cookie spec had linked TLDs to trust in cookies, but it never worked. The "Cookie Monster" bug was the first problem, and the architecture was never fixed or replaced.
benc
The preferred language to solve this in isn't listed, but there is an open-source project that uses this list in C# code here: http://code.google.com/p/domainname-parser/
Dan Esparza
+4  A: 

As Adam says, it's not easy, and currently the only practical way is to use a list.

Even then there are exceptions - for example, in .uk there are a handful of domains that are valid immediately at that level but aren't under .co.uk, so those have to be added as exceptions (see the excerpt below).

This is currently how mainstream browsers do it - it's necessary to ensure that example.co.uk can't set a cookie for .co.uk, which would then be sent to every other website under .co.uk.

The good news is that there's already a list available at http://publicsuffix.org/.

There's also some work in the IETF to create some sort of standard to allow TLDs to declare what their domain structure looks like. This is slightly complicated though by the likes of .uk.com, which is operated as if it were a public suffix, but isn't sold by the .com registry.
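
For illustration, the list encodes exactly this kind of case with wildcard and exception rules; an excerpt using its real syntax (the well-known .ck entries, as published in the list at the time of writing) looks like:

```
// every name directly under ck is a public suffix ...
*.ck
// ... except this one, which is a registrable domain
!www.ck
```

The handful of .uk names mentioned above are handled with the same exception syntax.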

Alnitak
A: 

It's not working it out exactly, but you could maybe get a useful answer by trying to fetch the domain piece by piece and checking the response, i.e., fetch 'http://uk', then 'http://co.uk', then 'http://domain.co.uk'. When you get a non-error response, you've got the domain and the rest is subdomain.

Sometimes you just gotta try it :)

Edit:

Tom Leys points out in the comments that some domains are set up only on the www subdomain, which would give us an incorrect answer in the above test. Good point! Maybe the best approach would be to check each part with 'http://www.' as well as 'http://', counting a hit on either as a hit for that section of the domain name. We'd still miss some 'alternative' arrangements such as 'web.domain.com', but I haven't run into one of those for a while :)
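
For concreteness, a rough sketch of this probing heuristic in Python - with the assumption that "a non-error response" can be approximated by a successful DNS lookup rather than a full HTTP fetch, which would work similarly but more slowly:

```python
# Toy sketch of the probing heuristic: walk the suffixes from shortest to
# longest and declare the first one that "responds" to be the domain.
# As the comments in this thread point out, this is unreliable -- names can
# resolve (or fail to) for reasons unrelated to registry structure.
import socket

def guess_domain(hostname):
    labels = hostname.split(".")
    for i in range(len(labels) - 1, -1, -1):  # 'uk', then 'co.uk', then ...
        candidate = ".".join(labels[i:])
        for probe in (candidate, "www." + candidate):  # per the edit, try www. too
            try:
                socket.gethostbyname(probe)  # stand-in for "non-error response"
                return candidate
            except OSError:
                pass
    return None

# Everything left of the guessed domain would be treated as the subdomain:
print(guess_domain("super.duper.domain.co.uk"))
```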

jTresidder
There is no guarantee that x.com points to a webserver at port 80 even if www.x.com does. www is a valid subdomain in this case. Perhaps an automated whois would help here.
Tom Leys
Good point! A whois would clear it up, although maintaining a list of which whois servers to use for which TLD/2nd-level domains would mean solving the same problem for edge cases.
jTresidder
no - whois is not the answer here
Alnitak
-1 that is a lame answer
Harry
A: 

Use a list of common suffixes (.co.uk, .com, et cetera), strip those out along with the http://, and then you'll only have "sub.domain" to work with instead of "http://sub.domain.suffix" - or at least that's what I'd probably do.

The biggest problem is the list of possible suffixes. There are a lot of them, after all.
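
A toy version of this idea in Python (the hard-coded suffix list here is hypothetical and deliberately tiny, which is exactly the problem noted above):

```python
# Strip a known suffix and the scheme, leaving "sub.domain".
KNOWN_SUFFIXES = (".co.uk", ".com", ".org")  # hopelessly incomplete, as noted

def strip_suffix(url):
    host = url.split("://", 1)[-1].split("/")[0]  # drop scheme and any path
    for suffix in KNOWN_SUFFIXES:
        if host.endswith(suffix):
            return host[: -len(suffix)]
    return host  # unknown suffix: give up and return the whole host

print(strip_suffix("http://sub.domain.co.uk"))  # 'sub.domain'
```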

Peter C.
A: 

Having taken a quick look at the publicsuffix.org list, it appears that you could make a reasonable approximation by removing the final three segments ("segment" here meaning a section between two dots) from domains where the final segment is two characters long, on the assumption that it's a country code and will be further subdivided. If the final segment is "us" and the second-to-last segment is also two characters, remove the last four segments. In all other cases, remove the final two segments. e.g.:

"example" is not two characters, so remove "domain.example", leaving "www"

"example" is not two characters, so remove "domain.example", leaving "super.duper"

"uk" is two characters (but not "us"), so remove "domain.co.uk", leaving "super.duper"

"us" is two characters and is "us", plus "wy" is also two characters, so remove "pvt.k12.wy.us", leaving "foo".

Note that, although this works for all examples that I've seen in the responses so far, it remains only a reasonable approximation. It is not completely correct, although I suspect it's about as close as you're likely to get without making/obtaining an actual list to use for reference.
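
A direct transcription of this approximation in Python (it encodes only the three rules above and inherits all of their caveats):

```python
# Segment-count approximation: decide how many trailing segments form the
# "domain" part purely from the shape of the last one or two segments.
def split_by_heuristic(hostname):
    parts = hostname.split(".")
    if len(parts) > 1 and parts[-1] == "us" and len(parts[-2]) == 2:
        keep = 4  # e.g. pvt.k12.wy.us: a state code under .us
    elif len(parts[-1]) == 2:
        keep = 3  # a country code, assumed to be subdivided like .co.uk
    else:
        keep = 2  # a generic TLD such as .com
    return ".".join(parts[:-keep]), ".".join(parts[-keep:])

print(split_by_heuristic("super.duper.domain.co.uk"))  # ('super.duper', 'domain.co.uk')
print(split_by_heuristic("www.domain.example"))        # ('www', 'domain.example')
# Fails on kibibu's example below: www.bit.ly and tla.com.au have the same
# shape, but only the second actually has a second-level registry suffix.
```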

Dave Sherohman
One trivial fail case: compare `http://www.bit.ly` to `http://tla.com.au`
kibibu
+3  A: 

Publicsuffix.org seems the way to go. There are plenty of implementations out there that can parse the contents of the publicsuffix data file easily.
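
For instance, a hypothetical usage sketch assuming Python's tldextract package (one library built on this list; it is not one of the implementations linked in this thread):

```python
# tldextract bundles and queries the publicsuffix.org data, so the split
# comes down to a single call.
import tldextract

ext = tldextract.extract("http://super.duper.domain.co.uk")
print(ext.subdomain, ext.domain, ext.suffix)  # super.duper domain co.uk
```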

jtes0111
But remember it is not just a matter of parsing! The list at publicsuffix.org is an unofficial project, which is incomplete (eu.org is missing, for instance), does NOT automatically reflect the policies of TLDs, and may become unmaintained at any time.
bortzmeyer
Ruby: http://github.com/pauldix/domainatrix
ssspiochld