views:

1930

answers:

4

Lets say my alphabet contains X letters and my language supports only Y letter words (Y < X ofcourse). I need to generate all the words possible in random order.

E.g. Alphabet=a,b,c,d,e,f,g Y=3

So the words would be: aaa aab aac aba .. bbb ccc .. (the above should be generated in random order)

The trivial way to do it would be to generate the words and then randomize the list. I DONT want to do that. I want to generate the words in random order.

rondom(n)=letter[x].random(n-1) will not work because then you'll have a list of words starting with letter[x].. which will make the list not so random.

Any code/pseudocode appreciated.

A: 

I think you can do something pretty straightforward by generating a random array of characters based on the alphabet you have (in c#):

        char[] alphabet = {'a', 'b', 'c', 'd'};
        int wordLength = 3;

        Random rand = new Random();

        for (int i = 0; i < 5; i++)
        {
            char[] word = new char[wordLength];
            for (int j = 0; j < wordLength; j++)
            {
                word[j] = alphabet[rand.Next(alphabet.Length)];
            }
            Console.WriteLine(new string(word));
        }

Obviously this might generate duplicates but you could maybe store results in a hashmap or something to check for duplicates if you need to.

Jennifer
I want to generate all X<sup>Y</sup> entries. If I use Zach's answer, then there is no guarantee that it'll terminate. If I wanted to use hashmap.. I could as well dump everything to hashmap and print it out as the hashmap will randomize the entries for me anyway.
Sridhar Iyer
OK misunderstood sorry - it looks like you have to do the brute force approach and generate all possible combinations and randomise I suppose. Maybe a middle solution would be to throw it all into a file and read lines out from there!
Jennifer
A: 

If you use a set that maintains its order, you can keep adding a random permutation to the set until the set has a size of XY.

Java code based off of Jennifer's solution:

public static Set<String> permutations(char[] alphabet, int length) {
 Set<String> set = new TreeSet<String>();
 Random rand = new Random();

 while (set.size() < Math.pow(alphabet.length, length)) {
  char[] word = new char[length];
  for (int j = 0; j < length; j++)
   word[j] = alphabet[rand.nextInt(alphabet.length)];

  set.add(new String(word));
 }
 return set;
}
Zach Langley
I did think of this, but this algorithm has no guarantee that it'll terminate.
Sridhar Iyer
I'm aware that this theoretically might never terminate, but I don't know of any other solution without shuffling the set.
Zach Langley
A: 

So I take it what you want is to produce a permutation of the set using as little memory as possible.

First off, it can't be done using no memory. For your first string, you want a function that could produce any of the strings with equal likelihood. Say that function is called nextString(). If you call nextString() again without changing anything in the state, of course it will once again be able to produce any of the strings.

So you need to store something. The question is, what do you need to store, and how much space will it take?

The strings can be seen as numbers 0 - X^Y. (aaa=0, aab=1,aac=2...aba=X...) So to store a single string as efficiently as possible, you'd need lg(X^Y) bits. Let's say X = 16 and Y=2. Then you'd need 1 byte of storage to uniquely specify a string.

Of course the most naive algorithm is to mark each string as it is produced, which takes X^Y bits, which in my example is 256 bits (32 bytes). This is what you've said you don't want to do. You can use a shuffle algorithm as discussed in this question: http://stackoverflow.com/questions/168037/creating-a-random-ordered-list-from-an-ordered-list (you won't need to store the strings as you produce them through the shuffle algorithm, but you still need to mark them).

Ok, now the question is, can we do better than that? How much do we need to store, total?

Well, on the first call, we don't need any storage. On the second call, we need to know which one was produced before. On the last call, we only need to know which one is the last one left. So the worst case is when we're halfway through. When we're halfway through, there have been 128 strings produced, and there are 128 to go. We need to know which are left to produce. Assuming the process is truly random, any split is possible. There are (256 choose 128) possibilities. In order to potentially be able to store any of these, we need lg(256 choose 128) bits, which according to google calculator is 251.67. So if you were really clever you could squeeze the information into 4 fewer bits than the naive algorithm. Probably not worth it.

If you just want it to look randomish with very little storage, see this question: http://stackoverflow.com/questions/732700/looking-for-an-algorithm-to-spit-out-a-sequence-of-numbers-in-a-pseudo-random-o

Erika
+1  A: 

As other answers have implied, there's two main approaches: 1) track what you've already generated (the proposed solutions in this category suffer from possibly never terminating), or 2) track what permutations have yet to be produced (which implies that the permutations must be pre-generated which was specifically disallowed in the requirements). Here's another solution that is guaranteed to terminate and does not require pre-generation, but may not meet your randomization requirements (which are vague at this point).

General overview: generate a tree to track what's been generated or what's remaining. "select" new permutations by traversing random links in the tree, pruning the tree at the leafs after generation of that permutation to prevent it from being generated again.

Without a whiteboard to diagram this, I hope this description is good enough to describe what I mean: Create a "node" that has links to other nodes for every letter in the alphabet. This could be implemented using a generic map of alphabet letters to nodes or if your alphabet is fixed, you could create specific references. The node represents the available letters in the alphabet that can be "produced" next for generating a permutation. Start generating permutations by visiting the root node, selecting a random letter from the available letters in that node, then traversing that reference to the next node. With each traversal, a letter is produced for the permutation. When a leaf is reached (i.e. a permutation is fully constructed), you'd backtrack up the tree to see if the parent nodes have any available permutations remaining; if not, the parent node can be pruned.

As an implementation detail, the node could store the set of letters that are not available to be produced at that point or the set of letters that are still available to be produced at that point. In order to possibly reduce storage requirements, you could also allow the node to store either with a flag indicating which it's doing so that when the node allows more than half the alphabet it stores the letters produced so far and switch to using the letters remaining when there's less than half the alphabet available.

Using such a tree structure limits what can be produced without having to pre-generate all combinations since you don't need to pre-construct the entire tree (it can be constructed as the permutations are generated) and you're guaranteed to complete because of the purging of the nodes (i.e. you're only traversing links to nodes when that's an allowed combination for an unproduced permutation).

I believe the randomization of the technique is a little odd, however, and I don't think each combination is equally likely to be generated at any given time, though I haven't really thought through this. It's also probably worth noting that even though the full tree isn't necessarily generated up front, the overhead involved will likely be enough such that you may be better off pre-generating all permutations.

iammichael
I like this answer, but I think you're right that it won't give you a uniformly random distribution. To get uniform probabilities, I think you need to keep track of how many remaining leaves are available on each branch of a node. Then use those counts to weight which branch you randomly choose from that node.
Nate Kohl