Regexes are really not a great way to validate IP addresses, I want to make that clear right up front. It is far, far easier to parse the addresses and do some simple arithmetic to compare them. A couple of less than's and greater than's and you're there.
That said, it seemed like it would be a really fun exercise to write this regex generator. And you know what? It was! I came up with a big mess of Python code to generate these regexes. Before I show the code, here's a sample of the regexes it produces for a couple of IP ranges:
1.2.3.4 to 1.2.3.4 1\.2\.3\.4
147.63.23.156 to 147.63.23.159 147\.63\.23\.15[6-9]
10.7.7.10 to 10.7.7.88 10\.7\.7\.([1-7]\d|8[0-8])
127.0.0.0 to 127.0.1.255 127\.0\.[0-1]\.(\d|[1-9]\d|1\d\d|2([0-4]\d|5[0-5]))
I'll show the code in two parts. First, the part that generates regexes for simple integer ranges. Second, the part that handles full IP addresses.
Matching number ranges
The first step is to figure out how to generate a regex that matches an arbitrary integer range, say 12-28 or 0-255.
@Robert Harvey already posted a link to some code that does this, but feh to that! I was well on my way writing code before I saw his link. Besides, it's about the journey not the destination, right?
So I wrote my own implementation of the number range regex generator. Here's an example of the regexes it comes up with:
156 to 159 15[6-9]
1 to 100 [1-9]|[1-9]\d|100
0 to 255 \d|[1-9]\d|1\d\d|2([0-4]\d|5[0-5])
Here's the code. There are numerous comments inline explaining the logic behind it. Overall it relies on a lot of recursion and special casing to try to keep the regexes lean and mean.
import sys, re
def range_regex(lower, upper):
lower, upper = str(lower), str(upper)
# Different lengths, for instance 1-100. Combine regex(1-9) and
# regex(10-100).
if len(lower) != len(upper):
return '%s|%s' % (
range_regex(lower, '9' * len(lower)),
range_regex(10 ** (len(lower)), upper)
)
ll, lr = lower[0], lower[1:]
ul, ur = upper[0], upper[1:]
# One digit numbers.
if lr == '':
if ll == '0' and ul == '9':
return '\\d'
else:
return '[%s-%s]' % (ll, ul)
# Same first digit, for instance 12-14. Concatenate "1" and regex(2-4).
elif ll == ul:
return ll + sub_range_regex(lr, ur)
# All zeros to all nines, for instance 100-399. Concatenate regex(1-3)
# and the appropriate number of \d's.
elif lr == '0' * len(lr) and ur == '9' * len(ur):
return range_regex(ll, ul) + '\\d' * len(lr)
# All zeros on left, for instance 200-649. Combine regex(200-599) and
# regex(600-649).
elif lr == '0' * len(lr):
return '%s|%s' % (
range_regex(lower, str(int(ul[0]) - 1) + '9' * len(ur)),
range_regex(ul + '0' * len(ur), upper)
)
# All nines on right, for instance 167-499. Combine regex(167-199) and
# regex(200-499).
elif ur == '9' * len(ur):
return '%s|%s' % (
range_regex(lower, ll + '9' * len(lr)),
range_regex(str(int(ll[0]) + 1) + '0' * len(lr), upper)
)
# First digits are one apart, for instance 12-24. Combine regex(12-19)
# and regex(20-24).
elif ord(ul[0]) - ord(ll[0]) == 1:
return '%s%s|%s%s' % (
ll, sub_range_regex(lr, '9' * len(lr)),
ul, sub_range_regex('0' * len(ur), ur)
)
# Far apart, uneven numbers, for instance 15-73. Combine regex(15-19),
# regex(20-69), and regex(70-73).
else:
return '%s|%s|%s' % (
range_regex(lower, ll + '9' * len(lr)),
range_regex(str(int(ll[0]) + 1) + '0' * len(lr),
str(int(ul[0]) - 1) + '9' * len(ur)),
range_regex(ul + '0' * len(ur), upper)
)
# Helper function which adds parentheses when needed to sub-regexes.
# Sub-regexes need parentheses if they have pipes that aren't already
# contained within parentheses. For example, "6|8" needs parentheses
# but "1(6|8)" doesn't.
def sub_range_regex(lower, upper):
orig_regex = range_regex(lower, upper)
old_regex = orig_regex
while True:
new_regex = re.sub(r'\([^()]*\)', '', old_regex)
if new_regex == old_regex:
break
else:
old_regex = new_regex
continue
if '|' in new_regex:
return '(' + orig_regex + ')'
else:
return orig_regex
Matching IP address ranges
With that capability in place, I then wrpte a very similar-looking IP range function to work with full IP addresses. The code is very similar to the code above except that we're working in base 256 instead of base 10, and the code throws around lists instead of strings.
import sys, re, socket
def ip_range_regex(lower, upper):
lower = [ord(c) for c in socket.inet_aton(lower)]
upper = [ord(c) for c in socket.inet_aton(upper)]
return ip_array_regex(lower, upper)
def ip_array_regex(lower, upper):
# One octet left.
if len(lower) == 1:
return range_regex(lower[0], upper[0])
# Same first octet.
if lower[0] == upper[0]:
return '%s\.%s' % (lower[0], sub_regex(ip_array_regex(lower[1:], upper[1:])))
# Full subnet.
elif lower[1:] == [0] * len(lower[1:]) and upper[1:] == [255] * len(upper[1:]):
return '%s\.%s' % (
range_regex(lower[0], upper[0]),
sub_regex(ip_array_regex(lower[1:], upper[1:]))
)
# Partial lower subnet.
elif lower[1:] == [0] * len(lower[1:]):
return '%s|%s' % (
ip_array_regex(lower, [upper[0] - 1] + [255] * len(upper[1:])),
ip_array_regex([upper[0]] + [0] * len(upper[1:]), upper)
)
# Partial upper subnet.
elif upper[1:] == [255] * len(upper[1:]):
return '%s|%s' % (
ip_array_regex(lower, [lower[0]] + [255] * len(lower[1:])),
ip_array_regex([lower[0] + 1] + [0] * len(lower[1:]), upper)
)
# First octets just 1 apart.
elif upper[0] - lower[0] == 1:
return '%s|%s' % (
ip_array_regex(lower, [lower[0]] + [255] * len(lower[1:])),
ip_array_regex([upper[0]] + [0] * len(upper[1:]), upper)
)
# First octets more than 1 apart.
else:
return '%s|%s|%s' % (
ip_array_regex(lower, [lower[0]] + [255] * len(lower[1:])),
ip_array_regex([lower[0] + 1] + [0] * len(lower[1:]),
[upper[0] - 1] + [255] * len(upper[1:])),
ip_array_regex([upper[0]] + [0] * len(upper[1:]), upper)
)