I have been working on this domain name regular expression. So far, it seems to match domain names with SLDs and TLDs (with an optional ccTLD), but the TLD list is duplicated. Can this be refactored any further?

params[:domain_name].downcase.strip.match(/^[a-z0-9\-]{2,63}
\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|me|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])
(\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|me|mil|mobi|museum)|
(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))?$/)
A: 

I probably don't know enough about domain names, but why are domains like "foo.info.com" matched? It seems that the domain name is "info.com" in that particular case.

And you might want to make sure the name starts with [a-z\d]. I don't think you can register a domain that starts with a dash?
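A minimal sketch of that check for a single label (the part between dots), assuming the hostname-style rule of letters, digits and hyphens with no hyphen at either end. The names `LABEL_PATTERN` and `valid_label?` are hypothetical, and the 1 to 63 character range is an assumption that differs from the `{2,63}` in the question.

```ruby
# Hypothetical helper: checks one DNS label, not a whole domain name.
# Assumption: 1 to 63 characters, letters/digits/hyphens only,
# and the first and last characters must be a letter or digit.
LABEL_PATTERN = /\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/

def valid_label?(label)
  LABEL_PATTERN.match?(label.downcase)
end

puts valid_label?("stackoverflow")  # plain label
puts valid_label?("-dash")          # leading hyphen
```

Anchoring with `\A` and `\z` rather than `^` and `$` keeps a multi-line input from slipping through, since `^` and `$` match at line boundaries in Ruby.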

PEZ
Not all domain names are two-part. A single-part example: "ck" is the domain for the Cook Islands (try http://ck or http://www.ck); my own domain is three-part (nichesoftware.co.nz) due to a structure within the .nz TLD.
Bevan
A: 

Well, as you have it written, the TLD part is equivalent to, but longer than, (\.<tldpart>){1,2}, but I'm sure it could be fixed for duplication...

Edit: yech, no. It would be possible, but I think handling the duplication would essentially mean a very slow brute-force list. It is simpler and faster to put the possible TLDs and SLD+country pairs in a big hashmap and check the substring against that.
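A rough sketch of the lookup idea, assuming a `Set` stands in for the hashmap. The entries in `KNOWN_SUFFIXES` are a tiny illustrative sample, not a complete list, and `known_suffix` is a hypothetical helper name.

```ruby
require "set"

# Illustrative sample only; a real list would come from a maintained source.
KNOWN_SUFFIXES = Set.new(%w[com org net uk co.uk nz co.nz])

# Returns the longest known suffix of the domain, or nil if none matches.
def known_suffix(domain)
  labels = domain.downcase.strip.split(".")
  # Try the longest candidate first ("b.c" before "c"). Starting at index 1
  # keeps at least one label in front of the suffix.
  (1...labels.length).each do |i|
    candidate = labels[i..-1].join(".")
    return candidate if KNOWN_SUFFIXES.include?(candidate)
  end
  nil
end

puts known_suffix("nichesoftware.co.nz").inspect
puts known_suffix("omnom.nom").inspect
```

Each lookup is a constant-time set membership test, so the cost grows with the number of labels in the name rather than with the size of the list.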

annakata
A: 

You can build up the regex as a string and then do Regexp.new(string).
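For instance, something along these lines would state the TLD list once and reuse it. This is only a sketch: `TLDS` is an abbreviated sample standing in for the full list from the question, the constant and method names are hypothetical, and the `{1,2}` repetition follows the observation made in another answer.

```ruby
# Abbreviated sample for illustration; the question's full list would go here.
TLDS = %w[com net org info co uk nz].freeze

# Build the alternation once, escaping each entry so it is matched literally.
TLD_PART = TLDS.map { |tld| Regexp.escape(tld) }.join("|")

# One label, then one or two dot-separated entries from the list.
DOMAIN_PATTERN = Regexp.new("\\A[a-z0-9\\-]{2,63}(?:\\.(?:#{TLD_PART})){1,2}\\z")

def listed_domain?(name)
  DOMAIN_PATTERN.match?(name.downcase.strip)
end

puts listed_domain?("example.com")
puts listed_domain?("example.co.uk")
puts listed_domain?("example.nom")
```

Because the pattern is assembled from an array, the list could just as well be read from a file at startup instead of being hard-coded.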

Jules
+10  A: 

Please, please, please don't use a fixed and horribly complicated regex like this to match for known domain names.

The list of TLDs is not static, particularly with ICANN looking at a streamlined process for new gTLDs. Even the list of ccTLDs changes sometimes!

Have a look at the list available from http://publicsuffix.org/ and write some code that's able to download and parse that list instead.
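As a rough illustration of what parsing that list involves, here is a sketch that works on already-downloaded text and handles the three rule kinds the list uses: plain rules, `*.` wildcards and `!` exceptions. The helper names and the small `SAMPLE` text are hypothetical, and downloading, caching and IDN handling are deliberately left out.

```ruby
require "set"

# Sorts rule lines into plain, wildcard ("*.") and exception ("!") sets.
# Assumes "//" starts a comment and only the first token on a line counts.
def parse_rules(text)
  rules = { plain: Set.new, wildcard: Set.new, exception: Set.new }
  text.each_line do |line|
    rule = line.strip.split(/\s+/).first
    next if rule.nil? || rule.start_with?("//")
    if rule.start_with?("!")
      rules[:exception] << rule[1..-1]
    elsif rule.start_with?("*.")
      rules[:wildcard] << rule[2..-1]
    else
      rules[:plain] << rule
    end
  end
  rules
end

# Returns the public suffix, falling back to the last label when no rule
# matches (the list's implicit "*" rule).
def public_suffix(domain, rules)
  labels = domain.downcase.split(".")
  # Every trailing part of the name, longest first: "a.b.c", "b.c", "c".
  candidates = (0...labels.length).map { |i| labels[i..-1].join(".") }

  # Exception rules take priority: the suffix is the rule minus its first label.
  exception = candidates.find { |c| rules[:exception].include?(c) }
  return exception.split(".", 2).last if exception

  candidates.each do |candidate|
    parent = candidate.split(".", 2)[1]
    return candidate if parent && rules[:wildcard].include?(parent)
    return candidate if rules[:plain].include?(candidate)
  end
  labels.last
end

# Tiny hand-written sample in the list's format, for illustration only.
SAMPLE = "// sample rules\ncom\nuk\nco.uk\n*.ck\n!www.ck\n"
rules = parse_rules(SAMPLE)
puts public_suffix("awesomedomain.co.uk", rules)
puts public_suffix("www.ck", rules)
```

The registrable domain is then the public suffix plus one more label to its left, which is the part a registration or transfer check would care about.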

Alnitak
+1 That regex made my eyes bleed... and I feel sorry for whoever has to maintain it.
Jon Tackabury
Concerning regular expressions and eye bleeding: http://www.codinghorror.com/blog/archives/001016.html
Gavin Miller
+1, and added code
TheSoftwareJedi
removed the code again - any noob can read a file from the net, and without the ! etc handling it's not useful.
Alnitak
I guess I agree. There are better ways to do it, but I need something that is incredibly to do registrations/transfers. Any other recommendations?
Josh
There is an opensource C# library that uses publicsuffix.org to parse domains, here: http://code.google.com/p/domainname-parser/
Dan Esparza
A: 

I'd recommend starting with the rules laid out in RFC 1035, and then working backwards -- but only if you really, really, really need to do this from scratch. A domain regex pattern has got to be (arguably second only to email address regex patterns) the most common thing out there. I would check out the site regexlib.com and browse through what other folks have done.

sammich
The RFC technically does not allow all-numeric domain parts, but in practice registrars and nameservers have been allowing them for years now.
Andrew Medico
+2  A: 

Download this: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Example usage (in Python):

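import re

def validate(domain):
    # domains.txt is the downloaded IANA list: one TLD per line,
    # with a leading '#' comment line.
    with open('domains.txt') as f:
        valid_domains = [re.escape(line.strip().upper())
                         for line in f
                         if line.strip() and not line.startswith('#')]
    # One label of 2-63 characters followed by exactly one listed TLD.
    r = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(valid_domains))
    return bool(r.match(domain.upper()))


print(validate('stackoverflow.com'))
print(validate('omnom.nom'))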

You can factor the domain-list-building out of the validate function to help performance.

Steve Losh
Results aren't as expected for domains like awesomedomain.co.uk: the effective TLD there is .co.uk, not .uk. It's better to use something like http://publicsuffix.org/
Dan Esparza
A: 

Removed for the sake of anyone who stumbles upon this answer.

Dana
There are TLDs with more than four letters (such as .TRAVEL).
bortzmeyer
Not to mention IDNs, and the fact that DOMAIN names allow many more characters than HOST names (for instance, the underscore).
bortzmeyer
Clearly, I did not consider IDNs whatsoever or read up on valid TLDs. In retrospect, I also agree that complicated regular expressions should not be used to solve this problem. I'll add another vote in favor of using an accurately maintained list of TLDs for verification.
Dana