Accurately extracting the effective top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to set its own policies on how domains are issued and which subdomains are defined. There doesn't appear to be any body coordinating these policies, so determining the actual TLD is not straightforward.

Since web browsers assign cookies only to registered domains, and for security reasons must ensure that cookies cannot be set at a broader level, they typically ship with a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:

http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat

I have two specific questions:

  • Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.

  • Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.

If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.

Here's my take on converting this file into a usable Regexp in Ruby:

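# Matches the registered domain: one label followed by a known suffix.
TLD_SPEC = Regexp.new(
  '[^\.]+\.(' + %q[
// ***** BEGIN LICENSE BLOCK *****
// ... (Rest of file)
  ].split(/\n/).collect do |line|
    # Strip // comments and surrounding whitespace
    line.sub(%r[//.*], '').strip
  end.reject(&:empty?).collect do |s|
    # empty? (above) avoids needing ActiveSupport's blank?.
    # Escape each rule; a leading "*." wildcard matches any single label.
    Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')
  end.join('|') + ')$'
)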
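
For comparison, here is a minimal sketch of the same lookup done with sets instead of one large Regexp. It assumes the rule syntax I can see in the file ("//" comments, a leading "*." as a one-label wildcard, a leading "!" marking an exception to a wildcard), which the Regexp above does not fully handle since "!" lines end up as literal alternatives. The names parse_suffix_rules and registered_domain are hypothetical helpers, not part of any gem, and SAMPLE_RULES is hand-written in the same syntax rather than copied from the file.

```ruby
require 'set'

# Hypothetical helper: split the rule text into exact, wildcard and
# exception sets. Assumes "//" comments, "*." wildcards and "!" exceptions.
def parse_suffix_rules(text)
  rules = { exact: Set.new, wildcard: Set.new, exception: Set.new }
  text.each_line do |line|
    rule = line.sub(%r{//.*}, '').strip
    next if rule.empty?
    if rule.start_with?('!')
      rules[:exception] << rule[1..-1]
    elsif rule.start_with?('*.')
      rules[:wildcard] << rule[2..-1]
    else
      rules[:exact] << rule
    end
  end
  rules
end

# Hypothetical helper: return the registered domain (suffix plus one label),
# or nil when the host is itself a suffix or matches no rule at all.
def registered_domain(host, rules)
  labels = host.downcase.chomp('.').split('.')
  # Walk from the longest candidate to the shortest so "co.uk" beats "uk".
  labels.each_index do |i|
    candidate = labels[i..-1].join('.')
    parent = labels[(i + 1)..-1].join('.')
    # An exception rule makes the candidate itself registrable.
    return candidate if rules[:exception].include?(candidate)
    if rules[:exact].include?(candidate) || rules[:wildcard].include?(parent)
      return i.zero? ? nil : labels[(i - 1)..-1].join('.')
    end
  end
  nil
end

# Hand-written sample rules (an assumption, not an excerpt from the file).
SAMPLE_RULES = <<~RULES
  // sample only
  com
  uk
  co.uk
  *.kawasaki.jp
  !city.kawasaki.jp
RULES

rules = parse_suffix_rules(SAMPLE_RULES)
puts registered_domain('www.example.co.uk', rules)    # example.co.uk
puts registered_domain('www.city.kawasaki.jp', rules) # city.kawasaki.jp
```

Set lookups avoid building and backtracking through a very large alternation, and keeping exceptions in their own set makes the "!" rules explicit. Hosts whose TLD is not listed return nil here; whether they should instead fall back to treating the last label as the suffix is a design choice this sketch leaves open.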