Firstly, you should clarify whether you mean:
- individual domain name labels
- entire domain names (i.e. multiple dot-separated labels)
- host names
The reason the distinction is necessary is that a label can technically include any characters, including the NUL, '@' and '.' characters. DNS is 8-bit capable and it's perfectly possible to have a zone file containing an entry reading "an\0odd\.l@bel". It's not recommended of course, not least because people would have difficulty telling a dot inside a label from those separating labels, but it is legal.
However, URLs require a host name, and host names are governed by RFCs 952 and 1123. Valid host names are a subset of domain names: each label may contain only letters, digits and hyphens, and its first and last characters cannot be a hyphen. RFC 952 didn't permit a digit as the first character, but RFC 1123 subsequently relaxed that.
Hence:
- "a" - valid
- "0" - valid
- "a-" - invalid
- "a-b" - valid
- "xn--dasdkhfsd" - valid (punycode encoding of an IDN)
The "a-" example can be rejected with a single regexp by requiring the last character to be a letter or digit, but the pattern becomes harder to read. A simple way to check a single _host_ label is:
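# The D modifier makes "$" match only at the very end of the string,
# so a label with a trailing newline is rejected.
if (preg_match('/^[a-z\d][a-z\d-]{0,62}$/iD', $label) &&
    !preg_match('/-$/D', $label))
{
    # label is legal within a hostname
}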
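As a rough sketch, the trailing-hyphen rule can also be folded into one pattern by making everything after the first character an optional group that must end in a letter or digit. The function names is_host_label() and is_host_name() are hypothetical. The 253-character overall limit and the refusal of a trailing dot are assumptions of this sketch, not something stated above.

```php
<?php
// Sketch only: is_host_label() and is_host_name() are hypothetical helper names.

// One host label: starts and ends with a letter or digit, hyphens allowed
// in between, 1 to 63 characters in total. The D modifier stops "$" from
// matching before a trailing newline.
function is_host_label(string $label): bool
{
    return preg_match('/^[a-z\d](?:[a-z\d-]{0,61}[a-z\d])?$/iD', $label) === 1;
}

// A whole host name: dot-separated host labels.
// Assumptions: at most 253 characters, and no trailing dot is accepted.
function is_host_name(string $name): bool
{
    if ($name === '' || strlen($name) > 253) {
        return false;
    }
    foreach (explode('.', $name) as $label) {
        if (!is_host_label($label)) {
            return false;
        }
    }
    return true;
}

// Run the example labels from the list above through the check.
foreach (['a', '0', 'a-', 'a-b', 'xn--dasdkhfsd'] as $label) {
    echo $label, ' - ', is_host_label($label) ? 'valid' : 'invalid', "\n";
}
```

The two-regexp version is arguably easier to read. The single pattern is mainly useful where only one expression can be supplied.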
To further complicate matters, some domain name entries (typically SRV records) use labels prefixed with an underscore, e.g. _sip._udp.example.com. These are not host names, but are legal domain names.
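If such names need to pass validation too, one possible approach is to accept a host-style label with a single optional leading underscore. This is only a sketch under that assumption: is_underscore_or_host_label() is a hypothetical name, and real domain name labels may legally contain far more than this pattern allows.

```php
<?php
// Hypothetical helper: accepts a host-style label, optionally prefixed with
// exactly one underscore (as in "_sip" or "_udp"). Assumption: the 63-octet
// label limit includes the underscore.
function is_underscore_or_host_label(string $label): bool
{
    if (strlen($label) > 63) {
        return false;
    }
    return preg_match('/^_?[a-z\d](?:[a-z\d-]*[a-z\d])?$/iD', $label) === 1;
}

// Check every label of the SRV-style name from the text.
foreach (explode('.', '_sip._udp.example.com') as $label) {
    echo $label, ' - ', is_underscore_or_host_label($label) ? 'ok' : 'rejected', "\n";
}
```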