tags:

views:

834

answers:

4

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.

For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contain the words "the", "there", "her", "here", and "ere". So that example could have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).

A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.

The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.

I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.

For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.

I am considering trying a path finding algorithm. I believe I could start with a random letter from the list as a starting point. Then each remaining letter would be used to create a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not a even a good idea? I don't know which path finding algorithm might be best.

Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remain letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.

Any suggestions, advice, or algorithms will be greatly appreciated. Thanks!

+2  A: 

Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.

Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the new word is also is or contains a new word. Start with a few different starting words for each input set and see where it leads.

Just a few idle thoughts.

Rodyland
I think the word you were looking for is "crossover".
Nathan Kitchen
Indeed. Many thanks.
Rodyland
A: 

It might be useful to check how others solved this: http://sourceforge.net/search/?type_of_search=soft&words=anagram

On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight. http://wordsmith.org/anagram/advanced.html

Wouter van Nifterick
huh? why the downvote?
Wouter van Nifterick
This problem is a _lot_ more difficult than anagram solving.
v3
Yes, it involves more than solving anagrams, but doing it is a major part of the algorithm.
Wouter van Nifterick
+1. At any point in the main algorithm when the first n characters have been decided on and m characters remain, finding anagrams with those m characters is a useful way to find a lower bound on the score that could be added. This would be useful as the heuristic for A* search.
j_random_hacker
+1  A: 

You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.

Nathan Kitchen
I had heard of simulated annealing before, but never actually knew what it was for. It seems like a good idea, I'm going to give it a try.
Imbue
+3  A: 

Here's an idea, inspired by Markov Chains:

  1. Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
  2. Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
  3. You can experiment by increasing the "memory" of your transition table - don't look only one letter back, but say 2 or 3. This increases the probability table, but gives you more chance of creating a valid word.
Yevgeny Doctor