Anna's second solution is the inspiration for this one.
First, load all the words into memory and divide the dictionary into sections based on word length.
For each length, make n copies of an array of pointers to the words. Sort each array so that the strings appear in order when rotated by a certain number of letters. For example, suppose the original list of 5-letter words is [plane, apple, space, train, happy, stack, hacks]. Then your five arrays of pointers will be:
rotated by 0 letters: [apple, hacks, happy, plane, space, stack, train]
rotated by 1 letter: [hacks, happy, plane, space, apple, train, stack]
rotated by 2 letters: [space, stack, train, plane, hacks, apple, happy]
rotated by 3 letters: [space, stack, train, hacks, apple, plane, happy]
rotated by 4 letters: [apple, plane, space, stack, train, hacks, happy]
(Instead of pointers, you can use integers identifying the words, if that saves space on your platform.)
To search, just ask how much you would have to rotate the pattern so that the question marks appear at the end. Then you can binary search in the appropriate list.
If you need to find matches for ??ppy, you would have to rotate that by 2 to make ppy??. So look in the array that is in order when rotated by 2 letters. A quick binary search finds that "happy" is the only match.
If you need to find matches for th??g, you would have to rotate that by 4 to make gth??. So look in array 4, where a binary search finds that there are no matches.
This works no matter how many question marks there are, as long as they all appear together.
Space required in addition to the dictionary itself: For words of length N, this requires space for (N times the number of words of length N) pointers or integers.
Time per lookup: O(log n) where n is the number of words of the appropriate length.
Implementation in Python:
import bisect
class Matcher:
def __init__(self, words):
# Sort the words into bins by length.
bins = []
for w in words:
while len(bins) <= len(w):
bins.append([])
bins[len(w)].append(w)
# Make n copies of each list, sorted by rotations.
for n in range(len(bins)):
bins[n] = [sorted(bins[n], key=lambda w: w[i:]+w[:i]) for i in range(n)]
self.bins = bins
def find(self, pattern):
bins = self.bins
if len(pattern) >= len(bins):
return []
# Figure out which array to search.
r = (pattern.rindex('?') + 1) % len(pattern)
rpat = (pattern[r:] + pattern[:r]).rstrip('?')
if '?' in rpat:
raise ValueError("non-adjacent wildcards in pattern: " + repr(pattern))
a = bins[len(pattern)][r]
# Binary-search the array.
class RotatedArray:
def __len__(self):
return len(a)
def __getitem__(self, i):
word = a[i]
return word[r:] + word[:r]
ra = RotatedArray()
start = bisect.bisect(ra, rpat)
stop = bisect.bisect(ra, rpat[:-1] + chr(ord(rpat[-1]) + 1))
# Return the matches.
return a[start:stop]
words = open('/usr/share/dict/words', 'r').read().split()
print "Building matcher..."
m = Matcher(words) # takes 1-2 seconds, for me
print "Done."
print m.find("st??k")
print m.find("ov???low")
On my computer, the system dictionary is 909KB big and this program uses about 3.2MB of memory in addition to what it takes just to store the words (pointers are 4 bytes). For this dictionary, you could cut that in half by using 2-byte integers instead of pointers, because there are fewer than 216 words of each length.
Measurements: On my machine, m.find("st??k")
runs in 0.000032 seconds, m.find("ov???low")
in 0.000034 seconds, and m.find("????????????????e")
in 0.000023 seconds.
By writing out the binary search instead of using class RotatedArray
and the bisect
library, I got those first two numbers down to 0.000016 seconds: twice as fast. Implementing this in C++ would make it faster still.