Following up on "Regular expression to match hostname or IP Address?" and using "Restrictions on valid host names" as a reference, what is the most readable, concise way to match/validate a hostname/FQDN (fully qualified domain name) in Python? I've answered with my attempt below; improvements welcome.

A: 

Process each DNS label individually by excluding invalid characters and ensuring nonzero length.


import re

def isValidHostname(hostname):
    # Raw string keeps \d and \- as regex escapes; anything else is invalid.
    disallowed = re.compile(r"[^a-zA-Z\d\-]")
    return all(map(lambda x: len(x) and not disallowed.search(x), hostname.split(".")))
kostmo
`return all(x and not disallowed.search(x) for x in hostname.split("."))`
Roger Pate
A trailing `.` on the end of a hostname is valid. Oh, and much more work to do if you want to support IDN, of course...
bobince
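To illustrate the IDN point above, here is a minimal sketch, not a definitive implementation. It assumes Python 3 and the built-in `idna` codec (which implements the older IDNA 2003 rules from RFC 3490), and the helper name `is_valid_idn_hostname` is made up for this example. The idea is to convert the name to its ASCII form first and only then apply the per-label checks discussed in this thread.

```python
import re

# Label rule taken from this thread: letters, digits, hyphens, 1 to 63
# characters, no leading or trailing hyphen.
LABEL = re.compile(r"(?!-)[A-Z\d-]{1,63}(?<!-)\Z", re.IGNORECASE)

def is_valid_idn_hostname(hostname):
    """Hypothetical helper: validate a hostname that may contain non-ASCII labels."""
    try:
        # The "idna" codec converts each label to its ASCII ("xn--") form.
        ascii_name = hostname.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or over-long labels and for characters IDNA rejects.
        return False
    if len(ascii_name) > 255:
        return False
    if ascii_name.endswith("."):
        ascii_name = ascii_name[:-1]  # a single trailing dot is legal
    return all(LABEL.match(label) for label in ascii_name.split("."))

print("bücher.example".encode("idna").decode("ascii"))  # xn--bcher-kva.example
print(is_valid_idn_hostname("bücher.example"))          # True
print(is_valid_idn_hostname("-bad.example"))            # False
```

If IDNA 2008 behaviour matters, a different library would be needed; that is outside what this sketch covers.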
+5  A: 
import re

def isValidHostname(hostname):
    if len(hostname) > 255:
        return False
    if hostname[-1:] == ".":
        hostname = hostname[:-1] # strip exactly one dot from the right, if present
    # Raw string so that \d reaches the regex engine as a digit class
    allowed = re.compile(r"(?!-)[A-Z\d-]{1,63}(?<!-)$", re.IGNORECASE)
    return all(allowed.match(x) for x in hostname.split("."))

ensures that each segment

  • contains at least one character and a maximum of 63 characters
  • consists only of allowed characters
  • doesn't begin or end with a hyphen.

It also avoids double negatives (`not disallowed`), and if hostname ends in a single `.`, that's OK, too. It will (and should) fail if hostname ends in more than one dot.
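A quick way to check the rules listed above is to run the function against a few edge cases. The sketch below repeats the function so it runs on its own; the sample hostnames and the `cases` table are made up for illustration.

```python
import re

def isValidHostname(hostname):
    # Copy of the function from this answer, repeated so the example is runnable.
    if len(hostname) > 255:
        return False
    if hostname[-1:] == ".":
        hostname = hostname[:-1]
    allowed = re.compile(r"(?!-)[A-Z\d-]{1,63}(?<!-)$", re.IGNORECASE)
    return all(allowed.match(x) for x in hostname.split("."))

# (hostname, expected result) pairs covering each rule
cases = [
    ("example.com", True),
    ("example.com.", True),      # one trailing dot is fine
    ("example.com..", False),    # two trailing dots leave an empty label
    ("-example.com", False),     # label starts with a hyphen
    ("example-.com", False),     # label ends with a hyphen
    ("exa_mple.com", False),     # underscore is not an allowed character
    ("a" * 63 + ".com", True),   # longest legal label
    ("a" * 64 + ".com", False),  # one character too long
    ("", False),                 # empty string has one empty label
]

results = {name: bool(isValidHostname(name)) for name, _ in cases}
for name, expected in cases:
    print(repr(name[:20]), results[name], "(expected %s)" % expected)
```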

Tim Pietzcker
Hostname labels should also not end with a hyphen.
bobince
Right, thanks. Edited my answer.
Tim Pietzcker
You're using `re.match` incorrectly - mind that `re.match("a+", "ab")` is a match whereas `re.match("a+$", "ab")` isn't. Your function also does not allow for a single dot at the end of the hostname.
AndiDog
I had been under the impression that `re.match` needs to match the entire string, therefore making the end-of-string anchor unnecessary. But as I now found out (thanks!) it only binds the match to the start of the string. I corrected my regex accordingly. I don't get your second point, however. Is it legal to end a hostname in a dot? The Wikipedia article linked in the question appears to say no.
Tim Pietzcker
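For anyone else tripped up by the anchoring behaviour discussed in the two comments above, a small demonstration (Python 3.4+ for `re.fullmatch`; the variable names are only for illustration):

```python
import re

# re.match anchors only at the start, so a prefix match is enough.
prefix_only = re.match(r"a+", "ab")

# Adding "$" forces the match to reach the end of the string.
anchored = re.match(r"a+$", "ab")

# re.fullmatch requires the whole string to match without an explicit anchor.
full_fail = re.fullmatch(r"a+", "ab")
full_ok = re.fullmatch(r"a+", "aa")

# "$" also matches just before a trailing newline, whereas "\Z" does not.
dollar_newline = re.match(r"a+$", "aa\n")
z_newline = re.match(r"a+\Z", "aa\n")

print(bool(prefix_only), bool(anchored))     # True False
print(bool(full_fail), bool(full_ok))        # False True
print(bool(dollar_newline), bool(z_newline)) # True False
```

The last pair suggests that `\Z` or `re.fullmatch` may be the safer choice when input could carry a trailing newline.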
@Tim Pietzcker Yes, a single dot at the end is legal. It marks the name as a fully-qualified domain name, which lets the DNS system know that it shouldn't try appending the local domain to it.
Daniel Stutzbach
Note that there's also a 63-character limit for each segment, and a global 255-character limit for the whole hostname.
Romuald Brunet
Aw shucks. Another edit :)
Tim Pietzcker
A: 

If you're looking to validate the name of an existing host, the best way is to try to resolve it. You'll never write a regular expression to provide that level of validation.
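A minimal sketch of that approach, assuming a network lookup is acceptable in your setting. The function name `resolves` is invented for this example; `socket.getaddrinfo` is the standard-library call doing the work.

```python
import socket

def resolves(hostname):
    """Hypothetical helper: True if the system resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; UnicodeError can be raised
        # when the name cannot be encoded for the lookup.
        return False
    return True

print(resolves("localhost"))
```

Note that the answer depends on the resolver and on network state at the moment of the call, so a `False` here does not show that the name is syntactically invalid.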

Donal Fellows
And what if he wants to find out if a hostname that does not yet exist will be a legal one? The RFC appears to be quite straightforward, so I don't see why a regex wouldn't work.
Tim Pietzcker
Depends on what you're trying to show. If the name doesn't resolve then who knows what it “means”; the true means of validation require information that a regular expression cannot have (i.e., access to DNS). It's easier to just try it and handle the failure. And when thinking about names that are potentially legal but not yet, the only people who actually need to care about that are the registrars. Everyone else should leave these things to the code that is designed to have genuine expertise in the area. As JWZ notes, applying an RE turns a problem into two problems. (Well, mostly…)
Donal Fellows
i do not agree. there are two separate concerns, and both are valid concerns: (1) argue whether a given string can serve, technically and plausibly, as, say, a valid email address, hostname, such things; (2) demonstrate that a given name is taken, or likely free. (1) is purely a syntactical consideration. since (2) happens over the network, there is a modicum of doubt: a host that is up now can be down in a second, a domain i order now can be taken when my mail arrives.
flow
This approach has been proposed in a similar question (http://stackoverflow.com/questions/399932/can-i-improve-this-regex-check-for-valid-domain-names/401132#401132), and there is even a Python project to facilitate this (http://code.google.com/p/python-public-suffix-list/). I've modified the question title slightly, since I'm not interested in a solution that requires network lookups.
kostmo
A: 

I like the thoroughness of Tim Pietzcker's answer, but I prefer to offload some of the logic from regular expressions for readability. Honestly, I had to look up the meaning of those `(?...)` "extension notation" parts. Additionally, I feel the "double-negative" approach is more obvious in that it limits the responsibility of the regular expression to just finding any invalid character. I do like that `re.IGNORECASE` allows the regex to be shortened.

So here's another shot; it's longer but it reads kind of like prose. I suppose "readable" is somewhat at odds with "concise". I believe all of the validation constraints mentioned in the thread so far are covered:


import re

def isValidHostname(hostname):
    if len(hostname) > 255:
        return False
    if hostname.endswith("."): # A single trailing dot is legal
        hostname = hostname[:-1] # strip exactly one dot from the right, if present
    disallowed = re.compile(r"[^A-Z\d-]", re.IGNORECASE) # raw string keeps \d intact
    return all( # Split by labels and verify individually
        (label and len(label) <= 63 # length is within proper range
         and not label.startswith("-") and not label.endswith("-") # no bordering hyphens
         and not disallowed.search(label)) # contains only legal characters
        for label in hostname.split("."))
kostmo
You don't need the backslashes as line continuators - they are implicit in the enclosing parentheses.
Tim Pietzcker
good to know. i've removed them.
kostmo