views: 888
answers: 5

I am trying to code a word descrambler like this one here and was wondering what algorithms I should use to implement this. Also, if anyone can find existing code for this, that would be great as well. Basically the functionality is going to be like a Boggle solver, but without the matrix: just searching for all word possibilities from a string of characters. I already have adequate dictionaries.

I was planning to do this in either Python or Ruby. Thanks in advance for your help, guys!

+3  A: 

I'd use a Trie. Here's an implementation in Python: http://jtauber.com/2005/02/trie.py (credit to James Tauber)
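
For illustration, here is a minimal dict-of-dicts trie sketch in Python (my own sketch, not the linked implementation):

# A minimal dict-of-dicts trie; a sentinel key marks complete words.
END = object()

def trie_insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ("cat", "cats", "car"):
    trie_insert(trie, w)
print(trie_contains(trie, "cats"))  # True
print(trie_contains(trie, "ca"))    # False -- a prefix, not a stored word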

perimosocordiae
+1 for Trie, the right tool for the job.
Michael Foukarakis
+2  A: 

I may be missing some understanding of the game, but barring complications in the rules, such as the introduction of "joker" (wildcard) letters, missing or additional letters, multiple words, etc., I think the following ideas would help turn the problem into something relatively uninteresting. :-(

Main idea: index words by the ordered sequence of their letters.
For example, "computer" gets keyed as "cemoprtu". Whatever the random drawing provides is sorted in kind and used as a key to find possible matches. Using trie structures, as suggested by perimosocordiae, as the underlying storage for these sorted keys and the associated word(s)/wordIds in the "leaf" nodes, word lookup can be done in O(n) time, where n is the number of letters (or better, on average, thanks to non-existent words).
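
A rough sketch of that keying scheme, with a plain dict standing in for the trie (the word list here is purely illustrative):

# Index each word under its sorted letters.
index = {}
for word in ("computer", "anagram", "nagaram"):
    key = "".join(sorted(word))
    index.setdefault(key, []).append(word)

drawn = "retupmoc"                            # letters from a random drawing
print(index.get("".join(sorted(drawn)), []))  # ['computer']
print(index["".join(sorted("anagram"))])      # ['anagram', 'nagaram']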

To further help with indexing, we can have several tables/dictionaries, one per number of letters. Also, depending on statistics, the vowels and consonants could be handled separately. Another trick would be to use a custom sort order, placing the most selective letters first.

Additional twists to the game (such as finding words made from a subset of the letters) are mostly a matter of iterating over the power set of these letters and checking the dictionary for each combination (see the sketch below).

A few heuristics can be introduced to help prune some of the combinations (for example, combinations without vowels [and of a given length] are not possible solutions, etc.). These heuristics should be managed carefully, though, since the lookup cost is relatively small.
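
Putting the last two paragraphs together, a hedged sketch of the power-set search (it assumes the sorted-key index from the sketch above; min_len is an illustrative cutoff, not part of this answer):

from itertools import combinations

def words_from(letters, index, min_len=3):
    # Probe the sorted-key index with every combination of the drawn letters.
    # combinations() of a sorted pool yields sorted tuples, so each joined
    # tuple is already a valid index key.
    pool = sorted(letters)
    found = set()
    for r in range(min_len, len(pool) + 1):
        for combo in set(combinations(pool, r)):  # set() dedupes repeated letters
            found.update(index.get("".join(combo), []))
    return found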

mjv
+2  A: 

For your dictionary index, build a map (Map[Bag[Char], List[String]]). It should be a hash map so you can get O(1) word lookup. A Bag[Char] is an identifier for a word that is unique up to character order: it is basically a hash map from Char to Int, where the Char is a given character in the word and the Int is the number of times that character appears in the word.

Example:

{'a'=>3, 'n'=>1, 'g'=>1, 'r'=>1, 'm'=>1} => ["anagram"]
{'s'=>3, 't'=>1, 'r'=>1, 'e'=>2, 'd'=>1} => ["stressed", "desserts"]

To find words, take every combination of characters from the input string and look it up in this map. The complexity of this algorithm is O(2^n) in the length of the input string. Notably, the complexity does not depend on the size of the dictionary.
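
In Python, one way to realize the Bag[Char] key is a frozenset of (char, count) pairs, since Counter itself is not hashable (this representation is my assumption, not part of the answer):

from collections import Counter

def bag(chars):
    # Hashable Bag[Char]: frozen set of (character, count) pairs.
    return frozenset(Counter(chars).items())

index = {}
for w in ("anagram", "stressed", "desserts"):
    index.setdefault(bag(w), []).append(w)

print(index[bag("desserts")])  # ['stressed', 'desserts'] -- O(1) lookup

Enumerating every character combination of the input and probing this map is the same power-set walk sketched in the earlier answer.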

Apocalisp
+1  A: 

This sounds like a problem for which Rabin-Karp string search would be a good choice. If you use a rolling hash function, then at each position you need one hash value update and one dictionary lookup. You also need a good way to cope with different word lengths, such as truncating all words to the shortest word in the set and rechecking possible matches. Splitting the word set into separate length ranges will reduce the number of false positives at the expense of increasing the hashing work.
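
A minimal sketch of that idea with a polynomial rolling hash, restricted to one fixed word length k (the constants and the collision recheck are my choices, not from the answer):

BASE, MOD = 256, 1000003

def poly_hash(s):
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % MOD
    return h

def rabin_karp(text, words, k):
    # Yield (start, word) for each length-k dictionary word found in text.
    targets = {w for w in words if len(w) == k}
    if len(text) < k or not targets:
        return
    word_hashes = {poly_hash(w) for w in targets}
    high = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    h = poly_hash(text[:k])
    for i in range(len(text) - k + 1):
        if h in word_hashes and text[i:i + k] in targets:  # recheck: hashes can collide
            yield i, text[i:i + k]
        if i + k < len(text):  # roll the window: drop text[i], add text[i+k]
            h = ((h - ord(text[i]) * high) * BASE + ord(text[i + k])) % MOD

print(list(rabin_karp("descrambler", {"scram", "amble"}, 5)))  # [(2, 'scram'), (5, 'amble')]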

Ants Aasma
+1  A: 

There are two ways to do this. One is to check every candidate permutation of letters in the word to see if the candidate is in your dictionary of words. That's an O(N!) operation, where N is the length of the word.

The other way is to check every candidate word in your dictionary to see if it's contained within the word. This can be sped up by aggregating the dictionary; instead of every candidate word, you check all words that are anagrams of each other at once, since if any one of them is contained in your word, all of them are.

So start by building a dictionary whose key is a sorted string of letters and whose value is a list of the words that are anagrams of the key:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> with open(r"c:\temp\words.txt", "r") as f:
        for line in f:
            if line[0].isupper(): continue  # skip proper nouns
            word = line.strip()
            key = "".join(sorted(word.lower()))
            d[key].append(word)

Now we need a function to see if a word contains a candidate. This function assumes that the word and candidate are both sorted, so that it can go through them both letter by letter and give up quickly when it finds that they don't match.

>>> def contains(sorted_word, sorted_candidate):
        wchars = iter(sorted_word)
        for cc in sorted_candidate:
            # advance through the word's letters until we match or pass cc
            while True:
                try:
                    wc = next(wchars)
                except StopIteration:
                    return False        # ran out of word letters
                if wc < cc: continue    # skip letters smaller than cc
                if wc == cc: break      # matched; move to the next candidate letter
                return False            # passed cc without a match
        return True

Now find all the candidate keys in the dictionary that are contained by the word, and aggregate all of their values into a single list:

>>> w = sorted("mythopoetic")
>>> result = []
>>> for k in d.keys():
        if contains(w, k): result.extend(d[k])
>>> len(result)
429
>>> sorted(result)[:20]
['c', 'ce', 'cep', 'ceti', 'che', 'chetty', 'chi', 'chime', 'chip', 'chit', 'chitty', 'cho', 'chomp', 'choop', 'chop', 'chott', 'chyme', 'cipo', 'cit', 'cite']

That last step takes about a quarter second on my laptop; there are 195K keys in my dictionary (I'm using the BSD Unix words file).

Robert Rossney