ansaurus

Question

Ordering a dictionary to maximize common letters between adjacent words

Answer 1

A:

This can be done with a recursive approach. Pseudo-code:

Start with one of the words, call it w
FindNext(w, l) // l = list of words without w
  Get a list l of the words near to w
  If only one word in list
    Return that word
  Else
    For every word w' in l do FindNext(w', l') //l' = l without w'

You can add some score to count common pairs and to prefer "better" lists.

schnaader 2008-11-26 08:33:51

Shouldn't that be "l = list of words without *w*"?

Aaron Digulla 2008-11-26 08:38:41

And the last line looks like a cut'n'paste error :/

Aaron Digulla 2008-11-26 08:39:30

Are the single quotes in the last line correct? They look weird.

Aaron Digulla 2008-11-27 09:13:26

These are meant to be part of the (new) variable names, so perhaps it's easier to read like this: "For every word w2 in l do FindNext(w2, l2) //l2 = l without w2"

schnaader 2008-11-27 09:19:30

Answer 2

+1 A:

My pseudo code:

Create a graph of nodes where each node represents a word
Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
Drop connections where the "value" is 0.
Walk the graph by preferring connections with the highest values. If you have two connections with the same value, try both recursively.
Store the output of a walk in a list along with the sum of the distance between the words in this particular result. I'm not 100% sure ATM if you can simply sum the connections you used. See for yourself.
From all outputs, chose the one with the highest value.

This problem is probably NP complete which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: Cut the graph into several smaller graphs, run the code on each and then join the lists. The result won't be as perfect as when you try every permutation but the runtime will be much better and the final result might be "good enough".

[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum. Say, you have a pair with a value of 7 but if you chose this pair, all other values drop to 1; if you didn't take this pair, most other values would be 2, giving a much better overall final result.

This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.

If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.

Another solution is to mix the two. Use the greedy algorithm to find "islands" which are probably pretty good and then use the "complete search" to sort the small islands.

Aaron Digulla 2008-11-26 08:50:25

I don't have a good example, but are there cases where this greedy algorithm will miss a better overall sequence?

2008-11-26 15:21:44

Answer 3

A:

You may want to take a look at BK-Trees, which make finding words with a given distance to each other efficient. Not a total solution, but possibly a component of one.

Nick Johnson 2008-11-26 10:15:52

Answer 4

A:

This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.

John D. Cook 2008-11-26 11:45:48

This seems promising, although wiki doesn't seem to have anything on converting from n-ary gray code into ordinals -- Wouldn't I need that to do a sort?

2008-11-26 14:08:47

Not quite. The Gray code is a *particular sequence* of words with a special property; it is not a way to sort a *given sequence* of words into a good order.

ShreevatsaR 2008-11-27 00:33:52

I just saw this question, and was about to post that very link!

warren 2008-11-27 22:35:22

Answer 5

+3 A:

What you're trying to do, is calculate the shortest hamiltonian path in a complete weighted graph, where each word is a vertex, and the weight of each edge is the number of letters that are differenct between those two words.

For your example, the graph would have edges weighted as so:

     DEVI KALI SHRI VACH
DEVI   X    3    3    4
KALI   3    X    3    3
SHRI   3    3    X    4
VACH   4    3    4    X

Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.

Eclipse 2008-11-28 17:17:01

ansaurus

tags:

views:

answers:

Ordering a dictionary to maximize common letters between adjacent words

related questions