views:

351

answers:

5

This is intended to be a more concrete, easily expressable form of my earlier question.

Take a list of words from a dictionary with common letter length.
How to reorder this list tto keep as many letters as possible common between adjacent words?

Example 1:

AGNI, CIVA, DEVA, DEWA, KAMA, RAMA, SIVA, VAYU
reorders to:  
AGNI, CIVA, SIVA, DEVA, DEWA, KAMA, RAMA, VAYU

Example 2:

DEVI, KALI, SHRI, VACH
reorders to:
DEVI, SHRI, KALI, VACH

The simplest algorithm seems to be: Pick anything, then search for the shortest distance?
However, DEVI->KALI (1 common) is equivalent to DEVI->SHRI (1 common)
Choosing the first match would result in fewer common pairs in the entire list (4 versus 5).

This seems that it should be simpler than full TSP?

A: 

This can be done with a recursive approach. Pseudo-code:

Start with one of the words, call it w
FindNext(w, l) // l = list of words without w
  Get a list l of the words near to w
  If only one word in list
    Return that word
  Else
    For every word w' in l do FindNext(w', l') //l' = l without w'

You can add some score to count common pairs and to prefer "better" lists.

schnaader
Shouldn't that be "l = list of words without *w*"?
Aaron Digulla
And the last line looks like a cut'n'paste error :/
Aaron Digulla
Are the single quotes in the last line correct? They look weird.
Aaron Digulla
These are meant to be part of the (new) variable names, so perhaps it's easier to read like this: "For every word w2 in l do FindNext(w2, l2) //l2 = l without w2"
schnaader
+1  A: 

My pseudo code:

  • Create a graph of nodes where each node represents a word
  • Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
  • Drop connections where the "value" is 0.
  • Walk the graph by preferring connections with the highest values. If you have two connections with the same value, try both recursively.
  • Store the output of a walk in a list along with the sum of the distance between the words in this particular result. I'm not 100% sure ATM if you can simply sum the connections you used. See for yourself.
  • From all outputs, chose the one with the highest value.

This problem is probably NP complete which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: Cut the graph into several smaller graphs, run the code on each and then join the lists. The result won't be as perfect as when you try every permutation but the runtime will be much better and the final result might be "good enough".

[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum. Say, you have a pair with a value of 7 but if you chose this pair, all other values drop to 1; if you didn't take this pair, most other values would be 2, giving a much better overall final result.

This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.

If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.

Another solution is to mix the two. Use the greedy algorithm to find "islands" which are probably pretty good and then use the "complete search" to sort the small islands.

Aaron Digulla
I don't have a good example, but are there cases where this greedy algorithm will miss a better overall sequence?
A: 

You may want to take a look at BK-Trees, which make finding words with a given distance to each other efficient. Not a total solution, but possibly a component of one.

Nick Johnson
A: 

This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.

John D. Cook
This seems promising, although wiki doesn't seem to have anything on converting from n-ary gray code into ordinals -- Wouldn't I need that to do a sort?
Not quite. The Gray code is a *particular sequence* of words with a special property; it is not a way to sort a *given sequence* of words into a good order.
ShreevatsaR
I just saw this question, and was about to post that very link!
warren
+3  A: 

What you're trying to do, is calculate the shortest hamiltonian path in a complete weighted graph, where each word is a vertex, and the weight of each edge is the number of letters that are differenct between those two words.

For your example, the graph would have edges weighted as so:

     DEVI KALI SHRI VACH
DEVI   X    3    3    4
KALI   3    X    3    3
SHRI   3    3    X    4
VACH   4    3    4    X

Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.

Eclipse