views:

2389

answers:

20

Hi, I need to write an efficient algorithm for looking up words with missing letters in a dictionary, and I want it to return the set of possible words.

For example, if I have th??e, I might get back these, those, theme, there, etc.

I was wondering if anyone can suggest some data structures or algorithms I should use.

Thanks!

EDIT: A Trie is too space-inefficient and would make it too slow. Any other ideas or modifications?

UPDATE: There will be up to TWO question marks and when two question marks do occur, they will occur in sequence.

Currently I am using 3 hash tables: one for exact matches, one for patterns with 1 question mark, and one for patterns with 2 question marks. Given a dictionary, I hash all the possible patterns for each word. For example, for the word WORD I hash WORD, ?ORD, W?RD, WO?D, WOR?, ??RD, W??D, WO?? into the dictionary. Then I use a linked list to chain the collisions together. So say hash(W?RD) = hash(STR?NG) = 17. hashtab(17) points to WORD and WORD points to STRING because it is a linked list.

The average lookup time for one word is about 2e-6 s. I am looking to do better, preferably on the order of 1e-9 s.

EDIT: I haven't looked at this problem again, but it took 0.5 seconds to insert 3M entries and 4 seconds to look up 3M entries.

Thanks!

+2  A: 

You can take a look at how it's done in aspell. It suggests corrections for misspelled words.

Andreas Bonini
How do I see the source code for that?
SuperString
download the latest version
Matt Ellen
Wow, it is a huge archive with >100 files in it; where do I even start?
SuperString
Start with the manual: http://aspell.net/0.61/man-html/Through-the-C-API.html#Through-the-C-API
John Bellone
+3  A: 

If you generate all the possible words that match the pattern (arate, arbte, arcte ... zryte, zrzte) and then look them up in a binary tree representation of the dictionary, that will have the average performance characteristics of O(e^N1 * log(N2)) where N1 is the number of question marks and N2 is the size of the dictionary. Seems good enough for me but I'm sure it's possible to figure out a better algorithm.
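
To illustrate the idea, here is a rough Python sketch (not the answerer's code; it uses a sorted list with binary search via bisect instead of an explicit binary tree, as Darius suggests in the comments, and all names are illustrative):

import bisect
import itertools
import string

def query(pattern, sorted_words):
    # Enumerate every way of filling the '?' positions, then binary-search
    # each candidate in the sorted word list (O(26^N1 * log N2) overall).
    positions = [i for i, c in enumerate(pattern) if c == '?']
    matches = []
    for letters in itertools.product(string.ascii_lowercase, repeat=len(positions)):
        candidate = list(pattern)
        for pos, letter in zip(positions, letters):
            candidate[pos] = letter
        word = ''.join(candidate)
        i = bisect.bisect_left(sorted_words, word)
        if i < len(sorted_words) and sorted_words[i] == word:
            matches.append(word)
    return matches

sorted_words = sorted(["these", "those", "theme", "there", "think"])
print(query("th??e", sorted_words))  # ['theme', 'there', 'these', 'those']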

EDIT: If you will have more than say, three question marks, have a look at Phil H's answer and his letter indexing approach.

DrJokepu
Have you done this before? How do you know that it's O((N1^2)*log(N2))?
SuperString
You should not generate them all at once but only when you need them, starting with the first ? from the left.
Debilski
Can you elaborate on what your algorithm is?
SuperString
Danny: A binary tree lookup is typically O(log N), and if you're using the English alphabet and you have two question marks you have 26^2 possibilities to look up, hence the e^N1; you look up each generated word, hence O(e^N1 * log(N2)).
DrJokepu
Debilski: Yes this seems like a good idea to effectively halve the execution time on average (not exactly halving of course)
DrJokepu
So you mean... if I have ?r?te... you are testing whether arate, arbte, arcte, ..., brate, brbte, brcte, ..., zrzte are words? Testing words would take some time too, no?
SuperString
What do you mean by not generating them all at once?
SuperString
Yeah, but testing itself is not a problem as a binary tree lookup is O(log N), so it's quick. The problem is if you have more than, say, three question marks, as in that case the e^N part will get very big very quickly. If you expect more than three question marks, you will need some other approach. Let me know if that's the case and I will give you another data structure that's better suited for it.
DrJokepu
I can have any number of question marks and I am just trying to figure out which approach is better.
SuperString
If you can have any number of question marks, consider Phil H's approach.
DrJokepu
This is still not fast enough. Can we make it < log(n)? We can probably do it by using more memory, but obviously we do not want to use a hashmap of all the possible words, so how?
SuperString
Good idea, but instead of a tree I'd make it a sorted array with binary search optimized for the case of successive search keys close to each other. That is, searching for 'hello' right after you searched for 'helko' can be done faster than the general case. The array takes less space than the tree and makes this successive-search optimization easier.
Darius Bacon
+4  A: 

First we need a way to compare the query string with a given entry. Let's assume a function using regexes: matches(query,trialstr).

An O(n) algorithm would be to simply run through every list item (your dictionary would be represented as a list in the program), comparing each to your query string.

With a bit of pre-calculation, you could improve on this for large numbers of queries by building an additional list of words for each letter, so your dictionary might look like:

wordsbyletter = { 'a' : ['aardvark', 'abacus', ... ],
                  'b' : ['bat', 'bar', ...],
                  .... }

However, this would be of limited use, particularly if your query string starts with an unknown character. So we can do even better by noting where in a given word a particular letter lies, generating:

wordsmap = { 'a':{ 0:['aardvark', 'abacus'],
                   1:['bat','bar'],
                   2:['abacus']},
             'b':{ 0:['bat','bar'],
                   1:['abacus']},
             ....
           }

As you can see, without using indices, you will end up hugely increasing the amount of required storage space - specifically, a dictionary of n words of average length m will require n·m² of storage. However, you can now do your lookup very quickly, getting all the words from each set that can match.

The final optimisation (which you could use off the bat on the naive approach) is to also separate all the words of the same length into separate stores, since you always know how long the word is.

This version would be O(kx) where k is the number of known letters in the query word, and x = x(n) is the time to look up a single item in a dictionary of length n in your implementation (usually log(n)).

So with a final dictionary like:

allmap = { 
           3 : { 
                  'a' : {
                          0 : ['ant','all'],
                          1 : ['bar','pat']
                         },
                  'b' : {
                          0 : ['bar','boy'],
                      ...
                },
           4 : {
                  'a' : {
                          0 : ['ante'],
                      ....

Then our algorithm is just:

possiblewords = set()
firsttime = True
wordlen = len(query)
for idx, letter in enumerate(query):
    if letter != '?':
        matchesthisletter = set(allmap[wordlen][letter][idx])
        if firsttime:
            possiblewords = matchesthisletter
            firsttime = False
        else:
            possiblewords &= matchesthisletter

At the end, the set possiblewords will contain all the matching words.
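
For reference, here is one possible way to build such an allmap index in Python; this is a sketch with illustrative names and is not part of the original answer:

from collections import defaultdict

def build_allmap(words):
    # length -> letter -> position -> list of words with that letter there
    allmap = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for word in words:
        for idx, letter in enumerate(word):
            allmap[len(word)][letter][idx].append(word)
    return allmap

allmap = build_allmap(['ant', 'all', 'bar', 'boy', 'pat', 'ante'])
print(allmap[3]['a'][0])  # ['ant', 'all']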

Phil H
Do you really think this is practical? If you need to find the word ‘aardvark’, you’ll now need to find the intersection of the sets {word | word[0] == 'a'}, {word | word[1] == 'a'}, {word | word[2] == 'r'}, … and so on. You can optimise a little by starting the calculation with the smallest subsets but if your subsets turn out to be quite large…?
Debilski
The algorithm as it is by the end is very efficient in computational effort, but not in storage requirements. It really depends how big the original 'dictionary' is.
Phil H
This is essentially the same thought I had - it'll certainly be more efficient than the O(N) of a regular expression matcher and will do quite well if the dictionary is from a natural language like English.
James Kolpack
This sounds like a good approach, but what if the word starts with ???xxx? I also don't understand the relationship of runtime to memory space. Would the huge memory space that it requires slow down the program?
SuperString
Time vs memory is a classic trade-off for algorithm design. Some good discussion is available here: http://stackoverflow.com/questions/1898161/memory-vs-performance
James Kolpack
@Danny: If the query was ???ing, then all 6 letter words with 'i' at pos 3, n at pos 4 and g at pos 5 would be selected, and the intersection of those three sets would yield the possible words - say it had 'pumice, asking, baking' from i, 'asking, baking, canine' for the n and 'asking, baking, piglog' for the g. The intersection gives you 'asking, baking', et voila.
Phil H
have you benchmarked this? To be honest I believe this will be very slow.
martinus
Yea this is not going to work for me, I need < 20 lookups.
SuperString
Can I suggest you put that in the question, then? For 1 spelling correction this is not optimal. For several million, it will be faster than Tries or any other approach I've seen here. Note that having more unknown characters speeds this algorithm up, whereas it slows others down.
Phil H
So you are saying that this will be faster than a trie on average? Interesting.
SuperString
Space could be reduced to an extent by using pointers/indices to single list of words.
Joe Soul-bringer
Size could also be reduced by limiting the size of "branches" of the "tree". Essentially, if you can do, say, 3 branched lookups, you've reduced the set by a factor of about 1000, and then you can do a list look-up.
Joe Soul-bringer
+1  A: 

A regex-based solution will consider every possible value in your dictionary. If performance is your largest constraint, an index could be built to speed it up considerably.

You could start with an index for each word length, containing, for each position=character pair, the set of matching words. For length-5 words, for example, 2=r : {write, wrote, drate, arete, arite}, 3=o : {wrote, float, group}, etc. To get the possible matches for the original query, say '?ro??', those word sets would be intersected, resulting in {wrote, group} in this case.

This is assuming that the only wildcard will be a single character and that the word length is known up front. If these are not valid assumptions, I can recommend n-gram based text matching, such as discussed in this paper.
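
A rough Python sketch of this length-plus-position index and the set intersection (illustrative names, 0-based positions, assuming single-character wildcards and a known word length):

from collections import defaultdict

def build_index(words):
    # word length -> (position, character) -> set of matching words
    index = defaultdict(lambda: defaultdict(set))
    for w in words:
        for i, ch in enumerate(w):
            index[len(w)][(i, ch)].add(w)
    return index

def lookup(index, pattern):
    # Intersect the word sets for every known (position, character) pair.
    sets = [index[len(pattern)][(i, ch)]
            for i, ch in enumerate(pattern) if ch != '?']
    if not sets:  # pattern is all wildcards
        return set().union(*index[len(pattern)].values())
    return set.intersection(*sets)

index = build_index(["write", "wrote", "drate", "arete", "arite", "float", "group"])
print(lookup(index, "?ro??"))  # {'wrote', 'group'}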

James Kolpack
+20  A: 

To me this problem sounds like a good fit for a Trie data structure. Enter the entire dictionary into your trie, and then look up the word. For a missing letter you would have to try all sub-tries, which should be relatively easy to do with a recursive approach.

EDIT: I wrote a simple implementation of this in Ruby just now: http://gist.github.com/262667.
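
For readers who prefer Python over the Ruby gist, here is a minimal sketch of the same idea (illustrative, not the gist's code): a trie plus a recursive lookup that descends into every sub-trie when it hits a '?'.

class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def find(node, pattern, prefix="", out=None):
    if out is None:
        out = []
    if not pattern:
        if node.is_word:
            out.append(prefix)
        return out
    ch, rest = pattern[0], pattern[1:]
    if ch == '?':
        # Missing letter: try all sub-tries at this level.
        for c, child in node.children.items():
            find(child, rest, prefix + c, out)
    elif ch in node.children:
        find(node.children[ch], rest, prefix + ch, out)
    return out

root = TrieNode()
for w in ["these", "those", "theme", "there", "the"]:
    insert(root, w)
print(find(root, "th??e"))  # order may vary: ['these', 'theme', 'there', 'those']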

DataWraith
If you have a bunch of question marks in a row, that algorithm will degrade quickly.
Winston Smith
+1 for the Trie
fa.
Hmm, so what would be the best solution?
SuperString
If you have lots of ? then you will have lots of answers anyway (unless your dictionary is very sparse, which would mean you won't have many sub-tries anyway). It's not clear to me that this performs badly with multiple ?; am I missing something?
jk
From what I am seeing here, Trie is good for lookup, but would this really be faster than a HashMap?
SuperString
A HashMap would probably be faster, but use (much?) more memory. Depending on the size of your dictionary and RAM that could lead to problems. Unless you need this for production code (as opposed to homework, etc.) you can probably choose either.
DataWraith
Hmm, how would you write this in C++? I found some code online but it is not well documented, so I can't understand it.
SuperString
Thanks for the trie implementation! It is really very fast, when you do not count the tree building stage. I wonder how to best deal with worst case query patterns like `????????????????e`
martinus
Nice - even in the absolute worst case scenario, it still degrades into a linear search - you never have to search the same string more than once.
Eclipse
I have created a fork of your code that creates a separate Trie for each word length, which probably uses a bit more memory but is probably much faster in the worst case. http://gist.github.com/262815
martinus
Can we make it so that it is < log(n) where n is the length of the dictionary?
SuperString
@Danny: The code makes heavy use of Ruby idioms... it basically uses a Ruby hash (associative array), indexed by the next character (the ones written on the edges in Wikipedia's diagram), in every TrieNode to keep track of children. The words are assembled in .map { ... } as we ascend the tree. In C++ maybe you could use `std::hash_map` -- I'm not sure how idiomatic that would be though. http://www.cs.bu.edu/teaching/c/tree/trie/ has an outline of how to write a trie in C, which may or may not be a little closer to C++ than Ruby is. @martinus: Cool, my first fork! :-)
DataWraith
If you use a threshold (about 30ish) after which instead of building out the trie you just maintain a list of words to scan over, you can keep the memory usage down without degrading speed very much.
Rob Van Dam
@SuperString It wouldn't be faster than a hash map, but the hash map consumes more space: For every group of words you need 2*word.length entries.
Dave
@SuperString: Dear SuperString, I highly support this answer and hope you decide to go with the trie. It offers the best compromise between speed and space. It is a general, well-known data structure, which makes implementation very easy (either there are already good implementations or there is sufficient documentation). Plus: maintainability and readability increase, as one doesn't have to read through custom data structure designs! Good luck, DataWraith (+1)
Dave
+3  A: 

Assuming you have enough memory, you could build a giant hashmap to provide the answer in constant time. Here is a quick example in Python:

from array import array
all_words = open("english-words").read().split()
big_map = {}

def populate_map(word):
  # Enumerate all 2^len(word) bit masks; each set bit marks a position to
  # replace with '?', so every wildcard pattern of the word maps back to it.
  for i in range(pow(2, len(word))):
    bin = _bin(i, len(word))
    candidate = array('c', word)
    for j in range(len(word)):
      if bin[j] == "1":
        candidate[j] = "?"
    if candidate.tostring() in big_map:
      big_map[candidate.tostring()].add(word)
    else:
      big_map[candidate.tostring()] = set([word])

def _bin(x, width):
    return ''.join(str((x>>i)&1) for i in xrange(width-1,-1,-1))

def run():
  for word in all_words:
    populate_map(word)

run()

>>> big_map["y??r"]
set(['your', 'year'])
>>> big_map["yo?r"]
set(['your'])
>>> big_map["?o?r"]
set(['four', 'poor', 'door', 'your', 'hour'])
Jiayao Yu
would this really be in constant time and faster than the other approaches?
SuperString
and this map is REALLY big, haha
SuperString
so I calculate this to be about 1 billion keys, I can take away the impossible words, but that will still leave me with a lot. How much space would this take up?
SuperString
It's definitely a lot faster than regex search, try it out in an experiment. It's constant if no collision occurs, and that depends on your hash function. You can read up on "hash table" and "perfect hash".
Jiayao Yu
In terms of memory space, would this be practical?
SuperString
Depends on your original dictionary size; if it's the OED, no.
jk
Also depends on your infrastructure. A couple of ways to make this more practical: 1. You can store the map on disk. 2. you can spread it on multiple machines.
Jiayao Yu
Would we be able to use a little less memory but still run fast like this? Perhaps in O(log(log(n)))?
SuperString
How big is your dictionary? I just tried it with a list of 5000 words, and heapy reports the memory usage to be around 1GB: Partition of a set of 6404350 objects. Total size = 1120854216 bytes. Index Count % Size % Cumulative % Kind (class / dict of class) 0 3188422 50 739713904 66 739713904 66 __builtin__.set 1 341 0 201543416 18 941257320 84 dict (no owner) 2 3203590 50 177720584 16 1118977904 100 str 3 5172 0 419736 0 1119397640 100 tuple
Jiayao Yu
The dictionary is a normal English dictionary, so probably like 100k - 1M words.
SuperString
One possible improvement to this approach is to save a BST in the hash entries instead of a linked list. This way, when you search inside one entry, the search is made in O(log(es)), where es is the number of objects in that hash entry. This way, you could use a smaller hash table size, with more values per entry on average. Also, the entries could point to the words themselves instead of copying them - this would also save a lot of space.
Anna
A: 

If you only have ? wildcards, no * wildcards that match a variable number of characters, you could try this: For each character index, build a dictionary from characters to sets of words. i.e. if the words are write, wrote, drate, arete, arite, your dictionary structure would look like this:

Character Index 0:
  'a' -> {"arete", "arite"}
  'd' -> {"drate"}
  'w' -> {"write", "wrote"}
Character Index 1:
  'r' -> {"write", "wrote", "drate", "arete", "arite"}
Character Index 2:
  'a' -> {"drate"}
  'e' -> {"arete"}
  'i' -> {"write", "arite"}
  'o' -> {"wrote"}
...

If you want to look up a?i??, you would take the set that corresponds to character index 0 => 'a', {"arete", "arite"}, and the set that corresponds to character index 2 => 'i', {"write", "arite"}, and take the set intersection.

nikie
+1  A: 

The data structure you want is called a trie - see the wikipedia article for a short summary.

A trie is a tree structure where the paths through the tree form the set of all the words you wish to encode - each node can have up to 26 children, one for each possible letter at the next character position. See the diagram in the wikipedia article to see what I mean.

Dave Kirby
+45  A: 

I believe in this case it is best to just use a flat file where each word is on its own line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.

Solution #1: Using Regex

This is working Ruby code for this problem:

def query(str, data)    
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      idx += str.size + 1
    end
  end while idx
end

start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time

The file wordlist.txt contains 45425 words (downloadable here). The program's output for query ?r?te is:

brute
crate
Crete
grate
irate
prate
write
wrote
0.013689

So it takes under 14 milliseconds to both read the whole file and find all matches in it. And it scales very well for all kinds of query patterns, even where a Trie is very slow:

query ????????????????e

counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681

query ?h?a?r?c?l?

theatricals
0.013608

This looks fast enough for me.

Solution #2: Regex with Prepared Data

If you want to go even faster, you can split the wordlist into strings that contain words of equal lengths and just search the correct one based on your query length. Replace the last 5 lines with this code:

def query_split(str, data)
  query(str, data[str.length]) do |w|
    yield w
  end
end

# prepare data    
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length-1] += w
end

# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time

Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words with that length):

  • ?r?te 0.001112 sec
  • ?h?a?r?c?l? 0.000852 sec
  • ????????????????e 0.000169 sec

Solution #3: One Big Hashtable (Updated Requirements)

Since you have changed your requirements, you can easily expand on your idea to use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself you could rely on the performance of a properly implemented hashtable.

Here I create one big hashtable, where each possible query maps to a list of its results:

def create_big_hash(data)
  h = Hash.new do |h,k|
    h[k] = Array.new
  end    
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two ??
    (w.length-1).times do |i|
      q = String.new(w)      
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end

# prepare data    
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"

# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts (Time.new - t)

Output is

4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05

The query performance is O(1); it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. When running it 1000 times, I get an average of 1.958e-6 seconds per query. To get it faster, I would switch to C++ and use the Google Sparse Hash, which is extremely memory-efficient and fast.

Solution #4: Get Really Serious

All above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:

martinus
can we make it so it runs at 1/1000 that speed?
SuperString
maybe, but only when you can specialize on some use cases. Now is the time where you should benchmark and record how this is really used, then tune it for your specific workload.
martinus
for example you can cache recent results, so if the same query is used twice just look it up in O(1).
martinus
For speed, use C and an extremely fast regular expression engine (http://re2c.org/ ?).
Dave Jarvis
you can probably stash the whole file in memory so you don't have to re-read it all the time
Claudiu
If you want it 1000 times faster I suggest you parallelise the solution. Split your dataset into 1000 chunks spread across 1000 cores. Replace or modify the first step of the algorithm to send the task to the right chunk/core. Sit back and wait (not very long). Whaddya mean you don't have 1000 cores?
High Performance Mark
@SuperString: making it run at 1/1000 the speed means it takes 1000 times longer..
Dave
@dave, at least this can be done easily ;-)
martinus
I was looking at the Google Sparse Hash and I am kind of new to C++; how do I actually use the code there? Do I have to put the src into some folder?
SuperString
Have a look at the README or README.windows
martinus
+1 for approximate grep
Chinmay Kanchi
oops would give you the bounty, but wasn't online, sorry
SuperString
no problem. thanks anyway :)
martinus
+15  A: 

A Directed Acyclic Word Graph would be the perfect data structure for this problem. It combines the efficiency of a trie (a trie can be seen as a special case of a DAWG), but is much more space-efficient. A typical DAWG will take a fraction of the size that a plain text file with the words would take.

Enumerating words that meet specific conditions is simple and the same as in a trie - you have to traverse the graph in a depth-first fashion.

el.pescado
Would this be faster than a Trie?
SuperString
+1. Was surprised to see no one else had mentioned this optimization...
int3
@Danny: A trie would be pretty fast already, but a DAWG will use less memory (and consequently be more localized in memory), and so searches on it might have better cache performance. A DAWG is built from a trie though, so you'll have to build that first.
int3
Note that you can also combine this with the approach of interval trees. You can store the length of the longest possible string with each vertex, since you know the length of your resulting string up front. For instance the words "abc" and "abfg" could be stored in a graph like this: a: 4, b: 4, c: 3, f: 4, g: 4, with the edges a -> b, b -> c, b -> f, f -> g. When searching for a??g you know that you do not have to search for anything beyond "abc" and only in the direction of "abfg". This example does not illustrate it very well but I hope you get the idea.
Joa Ebert
+1  A: 

Here's how I'd do it:

  1. Concatenate the words of the dictionary into one long String separated by a non-word character.
  2. Put all words into a TreeMap, where the key is the word and the value is the offset of the start of the word in the big String.
  3. Find the base of the search string; i.e. the largest leading substring that doesn't include a '?'.
  4. Use TreeMap.higherKey(base) and TreeMap.lowerKey(next(base)) to find the range within the String between which matches will be found. (The next method needs to calculate the next larger word to the base string with the same number or fewer characters; e.g. next("aa") is "ab", next("az") is "b".)
  5. Create a regex for the search string and use Matcher.find() to search the substring corresponding to the range.

Steps 1 and 2 are done beforehand, giving a data structure using O(N log N) space where N is the number of words.

This approach degenerates to a brute-force regex search of the entire dictionary when the '?' appears in the first position, but the further to the right it is, the less matching needs to be done.
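
A minimal Python sketch of steps 3-5, using a sorted list with bisect in place of a TreeMap, and realising next(base) by bumping the last character of the base (names are illustrative):

import bisect
import re

def lookup(sorted_words, pattern):
    base = pattern.split('?', 1)[0]          # largest '?'-free prefix
    lo = bisect.bisect_left(sorted_words, base)
    # next(base): a string greater than every word that starts with base
    upper = base[:-1] + chr(ord(base[-1]) + 1) if base else None
    hi = bisect.bisect_left(sorted_words, upper) if upper else len(sorted_words)
    rx = re.compile(pattern.replace('?', '.') + '$')
    # Regex-scan only the narrowed range; degenerates to a full scan
    # when the pattern starts with '?'.
    return [w for w in sorted_words[lo:hi] if rx.match(w)]

words = sorted(["heal", "heel", "hell", "hello", "helve", "trick"])
print(lookup(words, "he?l"))  # ['heal', 'heel', 'hell']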

EDIT:

To improve the performance in the case where '?' is the first character, create a secondary lookup table that records the start/end offsets of runs of words whose second character is 'a', 'b', and so on. This can be used in the case where the first non-'?' is the second character. You can use a similar approach for cases where the first non-'?' is the third character, fourth character and so on, but you end up with larger and larger numbers of smaller and smaller runs, and eventually this "optimization" becomes ineffective.

An alternative approach which requires significantly more space, but which is faster in most cases, is to prepare the dictionary data structure as above for all rotations of the words in the dictionary. For instance, the first rotation would consist of all words 2 characters or more with the first character of the word moved to the end of the word. The second rotation would be words of 3 characters or more with the first two characters moved to the end, and so on. Then to do the search, look for the longest sequence of non-'?' characters in the search string. If the index of the first character of this substring is N, use the Nth rotation to find the ranges, and search in the Nth rotation word list.

Stephen C
+1  A: 

My first post had an error that Jason found: it did not work well when ?? was at the beginning. I have now borrowed the cyclic shifts from Anna.

My solution: Introduce an end-of-word character (@) and store all cyclically shifted words in sorted arrays! Use one sorted array for each word length. When looking for "th??e@", shift the string to move the ?-marks to the end (obtaining e@th??), pick the array containing words of length 5, and do a binary search for the first word occurring after the string "e@th". All remaining words in the array match, i.e., we will find "e@thes" (these), "e@thos" (those), etc.

The solution has time complexity O(log N), where N is the size of the dictionary, and it expands the size of the data by a factor of 6 or so (the average word length).

ragnarius
Not all the words in between will match. For example, "thing" does not match "th??e", but it is between "these" and "those".
Jason Orendorff
Yes, you are correct, one would need an additional filter. I will up-vote your comment.
ragnarius
+1  A: 

Have you considered using a Ternary Search Tree? The lookup speed is comparable to that of a trie, but it is more space-efficient.

I have implemented this data structure several times, and it is a quite straightforward task in most languages.
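
For illustration, here is a small Python sketch of a ternary search tree with the classic partial-match lookup (at a '?' all three children are explored); names are illustrative and this is not a reference implementation:

class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False

def insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i < len(word) - 1:
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def match(node, pattern, i=0, prefix="", out=None):
    # Partial-match search: a '?' follows the lo, eq and hi children.
    if out is None:
        out = []
    if node is None:
        return out
    ch = pattern[i]
    if ch == '?' or ch < node.ch:
        match(node.lo, pattern, i, prefix, out)
    if ch == '?' or ch == node.ch:
        if i == len(pattern) - 1:
            if node.is_word:
                out.append(prefix + node.ch)
        else:
            match(node.eq, pattern, i + 1, prefix + node.ch, out)
    if ch == '?' or ch > node.ch:
        match(node.hi, pattern, i, prefix, out)
    return out

root = None
for w in ["these", "those", "theme", "there"]:
    root = insert(root, w)
print(match(root, "th??e"))  # order may vary: ['theme', 'there', 'these', 'those']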

fishlips
+13  A: 
Anna
this is a very nice answer
Dave
Great answer. Just to document - the trie method is known as the permuterm index method, and is very well described in the Intro to IR book at: http://nlp.stanford.edu/IR-book/html/htmledition/wildcard-queries-1.html
viksit
@viksit, thanks for the comment!
Anna
+7  A: 

Anna's second solution is the inspiration for this one.

First, load all the words into memory and divide the dictionary into sections based on word length.

For each length, make n copies of an array of pointers to the words. Sort each array so that the strings appear in order when rotated by a certain number of letters. For example, suppose the original list of 5-letter words is [plane, apple, space, train, happy, stack, hacks]. Then your five arrays of pointers will be:

rotated by 0 letters: [apple, hacks, happy, plane, space, stack, train]
rotated by 1 letter:  [hacks, happy, plane, space, apple, train, stack]
rotated by 2 letters: [space, stack, train, plane, hacks, apple, happy]
rotated by 3 letters: [space, stack, train, hacks, apple, plane, happy]
rotated by 4 letters: [apple, plane, space, stack, train, hacks, happy]

(Instead of pointers, you can use integers identifying the words, if that saves space on your platform.)

To search, just ask how much you would have to rotate the pattern so that the question marks appear at the end. Then you can binary search in the appropriate list.

If you need to find matches for ??ppy, you would have to rotate that by 2 to make ppy??. So look in the array that is in order when rotated by 2 letters. A quick binary search finds that "happy" is the only match.

If you need to find matches for th??g, you would have to rotate that by 4 to make gth??. So look in array 4, where a binary search finds that there are no matches.

This works no matter how many question marks there are, as long as they all appear together.

Space required in addition to the dictionary itself: For words of length N, this requires space for (N times the number of words of length N) pointers or integers.

Time per lookup: O(log n) where n is the number of words of the appropriate length.

Implementation in Python:

import bisect

class Matcher:
    def __init__(self, words):
        # Sort the words into bins by length.
        bins = []
        for w in words:
            while len(bins) <= len(w):
                bins.append([])
            bins[len(w)].append(w)

        # Make n copies of each list, sorted by rotations.
        for n in range(len(bins)):
            bins[n] = [sorted(bins[n], key=lambda w: w[i:]+w[:i]) for i in range(n)]
        self.bins = bins

    def find(self, pattern):
        bins = self.bins
        if len(pattern) >= len(bins):
            return []

        # Figure out which array to search.
        r = (pattern.rindex('?') + 1) % len(pattern)
        rpat = (pattern[r:] + pattern[:r]).rstrip('?')
        if '?' in rpat:
            raise ValueError("non-adjacent wildcards in pattern: " + repr(pattern))
        a = bins[len(pattern)][r]

        # Binary-search the array.
        class RotatedArray:
            def __len__(self):
                return len(a)
            def __getitem__(self, i):
                word = a[i]
                return word[r:] + word[:r]
        ra = RotatedArray()
        start = bisect.bisect(ra, rpat)
        stop = bisect.bisect(ra, rpat[:-1] + chr(ord(rpat[-1]) + 1))

        # Return the matches.
        return a[start:stop]

words = open('/usr/share/dict/words', 'r').read().split()
print "Building matcher..."
m = Matcher(words)  # takes 1-2 seconds, for me
print "Done."

print m.find("st??k")
print m.find("ov???low")

On my computer, the system dictionary is 909KB big and this program uses about 3.2MB of memory in addition to what it takes just to store the words (pointers are 4 bytes). For this dictionary, you could cut that in half by using 2-byte integers instead of pointers, because there are fewer than 2^16 words of each length.

Measurements: On my machine, m.find("st??k") runs in 0.000032 seconds, m.find("ov???low") in 0.000034 seconds, and m.find("????????????????e") in 0.000023 seconds.

By writing out the binary search instead of using class RotatedArray and the bisect library, I got those first two numbers down to 0.000016 seconds: twice as fast. Implementing this in C++ would make it faster still.

Jason Orendorff
Wouldn't log(n) be too slow? Cool, you saw that we can use indexing instead of the words to save space.
SuperString
No, O(log *n*) is super fast. The current top-voted answer is O(*n*). All the answers I see that claim to be any faster than O(log *n*) involve calculating the answers to all possible queries ahead of time.
Jason Orendorff
Note that for this dictionary, *log2(n)* is 14 or less.
Jason Orendorff
nice idea! very fast and memory efficient. the only disadvantage I can see are queries like `?h?a?r?c?l?`.
martinus
+2  A: 

Build a hash set of all the words. To find matches, replace the question marks in the pattern with each possible combination of letters. If there are two question marks, a query consists of 26^2 = 676 quick, constant-expected-time hash table lookups.

import itertools

words = set(open("/usr/share/dict/words").read().split())

def query(pattern):
    i = pattern.index('?')
    j = pattern.rindex('?') + 1
    for combo in itertools.product('abcdefghijklmnopqrstuvwxyz', repeat=j-i):
        attempt = pattern[:i] + ''.join(combo) + pattern[j:]
        if attempt in words:
            print attempt

This uses less memory than my other answer, but it gets exponentially slower as you add more question marks.

Jason Orendorff
+1  A: 

Stephen C. posted a good idea: search an ordered dictionary to find the range where the pattern can appear. This doesn't help, though, when the pattern starts with a wildcard. You might also index by word-length, but here's another idea: add an ordered index on the reversed dictionary words; then a pattern always yields a small range in either the forward index or the reversed-word index (since we're told there are no patterns like ?ABCD?). The words themselves need be stored only once, with the entries of both structures pointing to the same words, and the lookup procedure viewing them either forwards or in reverse; but to use Python's built-in binary-search function I've made two separate string arrays instead, wasting some space. (I'm using a sorted array instead of a tree as others have suggested, as it saves space and goes at least as fast.)

import bisect, re

def forward(string): return string
def reverse(string): return string[::-1]

index_forward = sorted(line.rstrip('\n')
                       for line in open('/usr/share/dict/words'))
index_reverse = sorted(map(reverse, index_forward))

def lookup(pattern):
    "Return a list of the dictionary words that match pattern."
    if reverse(pattern).find('?') <= pattern.find('?'):
        key, index, fixup = pattern, index_forward, forward
    else:
        key, index, fixup = reverse(pattern), index_reverse, reverse
    assert all(c.isalpha() or c == '?' for c in pattern)
    lo = bisect.bisect_left(index, key.replace('?', 'A'))
    hi = bisect.bisect_right(index, key.replace('?', 'z'))
    r = re.compile(pattern.replace('?', '.') + '$')
    return filter(r.match, (fixup(index[i]) for i in range(lo, hi)))

Tests: (The code also works for patterns like ?AB?D?, though without the speed guarantee.)

>>> lookup('hello')
['hello']
>>> lookup('??llo')
['callo', 'cello', 'hello', 'uhllo', 'Rollo', 'hollo', 'nullo']
>>> lookup('hel??')
['helio', 'helix', 'hello', 'helly', 'heloe', 'helve']
>>> lookup('he?l')
['heal', 'heel', 'hell', 'heml', 'herl']
>>> lookup('hx?l')
[]

This needs 2N pointers plus the space needed to store the dictionary-word text (in the tuned version). The worst-case time comes on the pattern '??e' which looks at 44062 candidates in my 235k-word /usr/share/dict/words; but almost all queries are much faster, like 'h??lo' looking at 190, and indexing first on word-length would reduce '??e' almost to nothing if we need to. Each candidate-check goes faster than the hashtable lookups others have suggested.

This resembles the rotations-index solution, which avoids all false match candidates at the cost of needing about 10N pointers instead of 2N (supposing an average word-length of about 10, as in my /usr/share/dict/words).

You could do a single binary search per lookup, instead of two, using a custom search function that searches for both low-bound and high-bound together (so the shared part of the search isn't repeated).

Darius Bacon
+2  A: 

If 80-90% accuracy is acceptable, you could manage with Peter Norvig's spell checker. The implementation is small and elegant.

duffymo
This came to mind (multiple times, I think) when I saw this question.
RCIX
A: 

If you seriously want something on the order of a billion searches per second (though I can't dream why anyone outside of someone making the next grand-master Scrabble AI or something for a huge web service would want something that fast), I recommend utilizing threading to spawn [number of cores on your machine] threads plus a master thread that delegates work to all of those threads. Then apply the best solution you have found so far and hope you don't run out of memory.

One idea I had is that you can speed up some cases by preparing sliced-down dictionaries by letter; then, if you know the first letter of the selection, you can resort to looking in a much smaller haystack.

Another thought I had was that you were trying to brute-force something -- perhaps build a DB or list or something for Scrabble?

RCIX
+1  A: 

A lazy solution is to let SQLite or another DBMS do the job for you.

Just create an in-memory database, load your words, and run a SELECT using the LIKE operator.
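
A minimal sketch of that idea with Python's sqlite3 module (illustrative; note that LIKE's single-character wildcard is '_', so '?' has to be translated, and LIKE is case-insensitive for ASCII by default):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
with open("/usr/share/dict/words") as f:
    conn.executemany("INSERT INTO words VALUES (?)",
                     ((line.strip(),) for line in f))

def lookup(pattern):
    like = pattern.replace("?", "_")  # '?' -> SQL single-character wildcard
    return [row[0] for row in
            conn.execute("SELECT word FROM words WHERE word LIKE ?", (like,))]

print(lookup("th??e"))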

Benoit Vidis