ansaurus

Question

Can I improve this regex check for valid domain names?

Answer 1

A:

I probably don't know enough about domain names, but why are domains like "foo.info.com" matched? It seems that the domain name is "info.com" in that particular case.

And you might want to make sure the name starts with [a-z\d]. I don't think you can register a domain that starts with a dash?

PEZ 2008-12-30 10:34:45

Not all domain names have two parts. A single-part example: "ck" is the domain for the Cook Islands (try http://ck or http://www.ck); my own domain has three parts (nichesoftware.co.nz) due to a structure within the .nz TLD.

Bevan 2008-12-30 21:10:31

Answer 2

A:

Well, as you have it written, the TLD part is equivalent to, but longer than, (\.<tldpart>){1,2}, but I'm sure it could be fixed for duplication...

edit: yech, no, it would be possible but essentially a very slow brute force list to handle the duplications I think. Simpler and faster to put the possible TLD and SLD+country pairs in a big hashmap and check the substring against that.
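
A minimal sketch of that lookup idea, in Python. The suffix set is a tiny hand-written sample for illustration (not a real list), and `split_domain` is a hypothetical helper name:

```python
# Hypothetical sample; a real table would be loaded from a maintained list.
KNOWN_SUFFIXES = {"com", "info", "nz", "co.nz", "uk", "co.uk"}


def split_domain(name):
    """Return (registered_part, suffix), or None if no known suffix matches."""
    labels = name.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so "co.nz" wins over "nz".
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[:i]), suffix
    return None


print(split_domain("nichesoftware.co.nz"))
print(split_domain("foo.info.com"))
print(split_domain("omnom.nom"))
```

Each membership test is a constant-time set lookup, so the cost grows with the number of labels in the name rather than with the size of the suffix table.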

annakata 2008-12-30 10:36:42

Answer 3

A:

You can build up the regex as a string and then do Regexp.new(string).

Jules 2008-12-30 10:38:54

Answer 4

+10 A:

Please, please, please don't use a fixed and horribly complicated regex like this to match for known domain names.

The list of TLDs is not static, particularly with ICANN looking at a streamlined process for new gTLDs. Even the list of ccTLDs changes sometimes!

Have a look at the list available from http://publicsuffix.org/ and write some code that's able to download and parse that list instead.
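
As a rough sketch of what the parsing step might involve, here is some Python that reads rules in the list's syntax, including `*` wildcard and `!` exception rules. The rules are inlined so it runs offline, the function names are made up for this example, and the matching is a simplified reading of the list's algorithm rather than a complete implementation:

```python
# A few rules in Public Suffix List syntax, inlined so the sketch runs offline.
SAMPLE_RULES = """
// comment lines start with two slashes
com
uk
co.uk
*.ck
!www.ck
"""


def parse_rules(text):
    """Split list text into normal/wildcard rules and '!' exception rules."""
    rules, exceptions = set(), set()
    for line in text.splitlines():
        line = line.strip().lower()
        if not line or line.startswith("//"):
            continue
        if line.startswith("!"):
            exceptions.add(line[1:])
        else:
            rules.add(line)
    return rules, exceptions


def public_suffix(name, rules, exceptions):
    """Return the public suffix of name under the given rules."""
    labels = name.lower().rstrip(".").split(".")
    # Walk from the longest candidate to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        # An exception rule wins: the suffix is the rule minus its first label.
        if candidate in exceptions:
            return ".".join(labels[i + 1:])
        # Exact rule, or a wildcard rule covering the leftmost label.
        wildcard = ".".join(["*"] + labels[i + 1:])
        if candidate in rules or wildcard in rules:
            return candidate
    # Nothing matched: fall back to treating the last label as the suffix.
    return labels[-1]


rules, exceptions = parse_rules(SAMPLE_RULES)
print(public_suffix("awesomedomain.co.uk", rules, exceptions))
print(public_suffix("shop.example.ck", rules, exceptions))
print(public_suffix("www.ck", rules, exceptions))
```

Downloading and caching the real file is left out on purpose; `parse_rules` only needs the list's text.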

Alnitak 2008-12-30 19:08:53

+1 That regex made my eyes bleed... and I feel sorry for whoever has to maintain it.

Jon Tackabury 2008-12-30 19:20:34

Concerning regular expressions and eye bleeding: http://www.codinghorror.com/blog/archives/001016.html

Gavin Miller 2008-12-30 20:04:18

+1, and added code

TheSoftwareJedi 2008-12-30 21:18:25

removed the code again - any noob can read a file from the net, and without the ! etc handling it's not useful.

Alnitak 2008-12-30 21:21:33

I guess I agree. There are better ways to do it, but I need something that is incredibly to do registrations/transfers. Any other recommendations?

Josh 2009-01-06 21:44:45

There is an opensource C# library that uses publicsuffix.org to parse domains, here: http://code.google.com/p/domainname-parser/

Dan Esparza 2009-05-18 05:28:53

Answer 5

A:

I'd recommend starting with the rules laid out in RFC 1035, and then working backwards -- but only if you really really really need to do this from scratch. A domain regex pattern has got to be (arguably second only to email address regex patterns) the most common thing out there. I would check out the site regexlib.com and browse through what other folks have done.

sammich 2008-12-30 19:55:15

The RFC technically does not allow all-numeric domain parts, but in practice registrars and nameservers have been allowing them for years now.
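
To illustrate the difference, here is a hedged Python sketch of the RFC 1035 label grammar next to a relaxed variant that also lets a label start with a digit. The names `STRICT_LABEL`, `RELAXED_LABEL`, and `is_valid_name` are invented for this example, and it checks syntax only, not whether the TLD exists:

```python
import re

# RFC 1035 section 2.3.1: <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
STRICT_LABEL = re.compile(r"[A-Za-z]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
# Relaxed form: the first character may also be a digit.
RELAXED_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_name(name, strict=False):
    """Check label syntax and length limits for a dotted name."""
    # 255 octets on the wire leaves 253 characters in dotted text form.
    if not name or len(name) > 253:
        return False
    pattern = STRICT_LABEL if strict else RELAXED_LABEL
    # Every label must match in full; an empty label (as in "a..b") fails.
    return all(pattern.fullmatch(label) for label in name.split("."))


print(is_valid_name("123.com"))               # relaxed: digits may lead
print(is_valid_name("123.com", strict=True))  # strict: a letter must lead
print(is_valid_name("-foo.com"))              # leading hyphen is rejected
```

Both patterns cap a label at 63 characters and reject a leading or trailing hyphen.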

Andrew Medico 2008-12-30 21:52:59

Answer 6

+2 A:

Download this: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Example usage (in Python):

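import re

def validate(domain):
    # domains.txt is the saved copy of the IANA list linked above.
    with open('domains.txt') as f:
        # Skip comment and blank lines; escape each entry for use in the regex.
        valid_domains = [re.escape(line.strip().upper())
                         for line in f
                         if line.strip() and not line.startswith('#')]
    # One label of 2 to 63 characters, a dot, then a TLD from the list.
    r = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(valid_domains))
    return bool(r.match(domain.upper()))


print(validate('stackoverflow.com'))
print(validate('omnom.nom'))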

You can factor the domain-list-building out of the validate function to help performance.
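
One way that factoring might look, as a sketch: compile the pattern once and hand back a function that reuses it. `build_validator` is a hypothetical name, and the list passed in below is an inline sample standing in for the lines of the downloaded file:

```python
import re


def build_validator(tld_lines):
    """Compile the TLD pattern once and return a reusable validate function."""
    tlds = [re.escape(line.strip().upper())
            for line in tld_lines
            if line.strip() and not line.startswith("#")]
    pattern = re.compile(r"[A-Z0-9\-]{2,63}\.(%s)" % "|".join(tlds))

    def validate(domain):
        # The compiled pattern is captured by the closure and never rebuilt.
        return bool(pattern.fullmatch(domain.upper()))

    return validate


# Inline sample standing in for the downloaded file's lines.
validate = build_validator(["# sample header", "COM", "NET", "ORG"])
print(validate("stackoverflow.com"))
print(validate("omnom.nom"))
```

With the real file, an open file object can be passed in place of the sample list, since the function only iterates over lines.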

Steve Losh 2008-12-30 21:02:27

Results aren't as expected for domains like awesomedomain.co.uk: the TLD isn't considered .uk, it's .co.uk. It's better to use something like http://publicsuffix.org/

Dan Esparza 2009-05-11 22:36:46

Answer 7

A:

Removed for the sake of anyone who stumbles upon this answer.

Dana 2009-09-18 15:57:34

There are TLDs with more than four letters (such as .TRAVEL).

bortzmeyer 2009-09-21 07:15:02

Not to mention IDN and the fact that DOMAIN names allow many more characters than HOST names (for instance the underscore).
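
A small Python illustration of both points, assuming a typical ASCII-only label pattern (`ASCII_LABEL` is a made-up name). Note that Python's built-in "idna" codec implements the older IDNA 2003 rules:

```python
import re

# A typical ASCII-only label pattern of the kind used in hostname regexes.
ASCII_LABEL = re.compile(r"[a-z0-9-]+")

# "b\u00fccher" is "bücher", written with an escape to keep this source ASCII.
name = "b\u00fccher.example"
ascii_form = name.encode("idna")

print(ascii_form)                                    # b'xn--bcher-kva.example'
print(bool(ASCII_LABEL.fullmatch("b\u00fccher")))    # Unicode form is rejected
print(bool(ASCII_LABEL.fullmatch("xn--bcher-kva")))  # encoded form is accepted
print(bool(ASCII_LABEL.fullmatch("_dmarc")))         # underscore is rejected
```

So an IDN needs converting to its `xn--` form before any ASCII check, and names with underscores fail a hostname-style pattern.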

bortzmeyer 2009-09-21 07:15:58

Clearly, I did not consider IDNs whatsoever or read up on valid TLDs. In retrospect, I also agree that complicated regular expressions should not be used to solve this problem. I'll add another vote in favor of using an accurately maintained list of TLDs for verification.

Dana 2010-05-17 20:02:18
