After reading this question I started to wonder: is it possible to have a shuffling algorithm which does not modify or copy the original list?

To make it clear:

Imagine you are given a list of objects. The list size can be arbitrary, but assume it's pretty large (say, 10,000,000 items). You need to print out the items of the list in random order, and you need to do it as fast as possible. However, you should not:

  • Copy the original list, because it's very large and copying would waste a LOT of memory (probably hitting the limits of available RAM);
  • Modify the original list, because it's sorted in some way and some other part later on depends on it being sorted.
  • Create an index list, because, again, the list is very large and copying takes far too much time and memory. (Clarification: this means any other list that has the same number of elements as the original list.)

Is this possible?

Added: More clarifications.

  1. I want the list to be shuffled in true random way with all permutations equally likely (of course, assuming we have a proper Rand() function to start with).
  2. Suggestions that I make a list of pointers, or a list of indices, or any other list that would have the same number of elements as the original list, are explicitly deemed inefficient by the original question. You can create additional lists if you want, but they should be several orders of magnitude smaller than the original list.
  3. The original list is like an array, and you can retrieve any item from it by its index in O(1). (So no doubly-linked list stuff, where you have to iterate through the list to get to your desired item.)

Added 2: OK, let's put it this way: You have a 1TB HDD filled with data items, each 512 bytes large (a single sector). You want to copy all this data to another 1TB HDD while shuffling all the items. You want to do this as fast as possible (single pass over data, etc). You have 512MB of RAM available, and don't count on swap. (This is a theoretical scenario, I don't have anything like this in practice. I just want to find the perfect algorithm.)

+6  A: 

Well it depends a bit on what kind of randomness you except for the shuffling, i.e. should all shufflings be as probable, or can the distribution be skewed.

There are mathematical ways to produce "random-looking" permutations of N integers, so if P is such a permutation from 0..N-1 to 0..N-1, you can just iterate x from 0 to N-1 and output list item L(P(x)) instead of L(x) and you have obtained a shuffling. Such permutations can be obtained e.g. using modular arithmetic. For example, if N is prime, P(x)=(x * k) mod N is a permutation for any 0 < k < N (but maps 0 to 0). Similarly, for a prime N, P(x)=(x^3) mod N is a permutation whenever 3 does not divide N-1 (but maps 0 to 0 and 1 to 1). This solution can easily be extended to non-prime N by selecting the least prime above N (call it M), permuting up to M, and discarding the permuted indices that are N or greater (similarly below).
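
The multiplicative variant above can be sketched in a few lines of Python. This is only an illustration: the names `prime_shuffle_indices` and `_is_prime` are made up for this example, the multiplier `k` is whatever the caller picks, and trial division is used to find the prime only because it keeps the sketch short.

```python
def _is_prime(m: int) -> bool:
    """Trial division; fine for a sketch, slow for very large m."""
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def prime_shuffle_indices(n: int, k: int):
    """Yield every index in 0..n-1 exactly once, in the order (x * k) mod M."""
    # M is the smallest prime >= n (hypothetical helper logic, not from the answer).
    m = max(n, 2)
    while not _is_prime(m):
        m += 1
    if not 0 < k < m:
        raise ValueError("k must satisfy 0 < k < M")
    for x in range(m):
        p = (x * k) % m
        if p < n:  # discard permuted indices that fall outside the list
            yield p


if __name__ == "__main__":
    items = ["a", "b", "c", "d", "e", "f"]
    print([items[i] for i in prime_shuffle_indices(len(items), 3)])
```

Only a couple of integers are kept in memory, but the order is a very regular stride through the list, which is why it only "looks" random.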

It should be noted that modular exponentiation is the basis for many cryptographic algorithms (e.g. RSA, Diffie-Hellman) and is considered a strongly pseudorandom operation by the experts in the field.

Another easy way (not requiring prime numbers) is first to expand the domain so that instead of N you consider M where M is the least power of two above N. So e.g. if N=12 you set M=16. Then you use bijective bit operations, e.g.

P(x) = ((x ^ 0xf) ^ (x << 2) + 3) & 0xf

Then when you output your list, you iterate x from 0 to M-1 and output L(P(x)) only if P(x) is actually < N.
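
As a quick check of the bit-operation example, the following Python sketch evaluates P over the whole 4-bit domain and then applies the "skip values that are not < N" rule. The function names are hypothetical; the formula is copied from above and relies on `+` binding tighter than `^`, which holds in Python as it does in C.

```python
def p(x: int) -> int:
    """The 4-bit mapping from the text; '+' is evaluated before the second '^'."""
    return ((x ^ 0xF) ^ (x << 2) + 3) & 0xF


def pow2_shuffle_indices(n: int, m: int = 16):
    """Iterate x over 0..m-1 and keep P(x) only when it is a valid index."""
    if n > m:
        raise ValueError("n must not exceed the power-of-two domain m")
    for x in range(m):
        y = p(x)
        if y < n:
            yield y


if __name__ == "__main__":
    # P is a bijection on 0..15 exactly when all 16 outputs are distinct.
    print(len({p(x) for x in range(16)}) == 16)
    print(list(pow2_shuffle_indices(12)))
```

Note that this particular P leaves the two lowest bits of x unchanged, so a real use would want a more thorough mixing function.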

A "true, unbiased random" solution can be constructed by fixing a cryptographically strong block cipher (e.g. AES) and a random key (k) and then iterating the sequence

AES(k, 0), AES(k, 1), ...

and outputting the corresponding item from the sequence iff AES(k,i) < N. This can be done in constant space (the internal memory required by the cipher) and is indistinguishable from a random permutation (due to the cryptographic properties of the cipher) but is obviously very slow. In the case of AES, you would need to iterate until i = 2^128.
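
Because AES has a fixed 128-bit block, a sketch of the same iterate-and-discard idea is easier to run with a small keyed permutation. The Python below swaps in a balanced Feistel network with a SHA-256-based round function as a stand-in for the block cipher. That substitution, the number of rounds, and all function names are assumptions made for illustration; this toy construction is not claimed to be cryptographically strong.

```python
import hashlib


def _round_value(value: int, key: bytes, rnd: int, half_bits: int) -> int:
    """Keyed round function: hash (key, round number, half-block) to half_bits bits."""
    data = key + bytes([rnd]) + value.to_bytes(8, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def feistel_permute(x: int, key: bytes, half_bits: int, rounds: int = 4) -> int:
    """A bijection on 0 .. 2**(2*half_bits) - 1 (any Feistel network is invertible)."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for rnd in range(rounds):
        left, right = right, left ^ _round_value(right, key, rnd, half_bits)
    return (left << half_bits) | right


def feistel_shuffle_indices(n: int, key: bytes):
    """Yield each index in 0..n-1 once by permuting a slightly larger domain."""
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    for x in range(1 << (2 * half_bits)):
        y = feistel_permute(x, key, half_bits)
        if y < n:  # same rule as "output iff AES(k, i) < N"
            yield y


if __name__ == "__main__":
    print(list(feistel_shuffle_indices(10, b"example key")))
```

Sizing the block to the list means the domain is less than four times N, so far fewer values are discarded than with a 128-bit block; the only state is the key and the loop counter.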

antti.huima
Well, I'd like true unbiased random. I'm aware of the prime solution, but that's not really random.
Vilx-
The problem with a strong PRNG is that you will get repeats. That requires a bitmap to prevent "picking" some element of the original list more than once.
Stephen C
Well, as I point out above, there are non-repeating sequences which as permutations are considered strongly random, e.g. modular exponentiation or modern block ciphers.
antti.huima
I think this answer is the closest you can get. Any truly random algorithm is going to need to keep track of which item has already been selected since it will, by definition, not be predictable. This means either modifying the original list or creating a separate data structure.
Dave Kirby
What do you mean by "excepting randomness"?
Svante
+3  A: 

You're not allowed to make a copy, modify it, or keep track of which elements you've visited? I'm gonna say it's not possible. Unless I'm misunderstanding your third criterion.

I take it to mean you're not allowed to say, make an array of 10,000,000 corresponding booleans, set to true when you've printed the corresponding element. And you're not allowed to make a list of the 10,000,000 indices, shuffle the list, and print out the elements in that order.

Ross
Yes, you're understanding me correctly. Well, you can keep track of which items you have visited, if you figure out a way to do it without making another list the same size as the input. If you make a list of size something like log(N), then I'd be satisfied.
Vilx-
+2  A: 

Those 10,000,000 items are only references (or pointers) to the actual items, so your list will not be that large: only ~40MB on a 32-bit architecture for all the references, plus the size of the list's internal variables. If your items are smaller than a reference, you can just copy the whole list.

MBO
+1  A: 

It sounds impossible.

But 10,000,000 64-bit pointers is only about 76MB.

Jonas Elfström
OK, I upped the stakes a bit. :)
Vilx-
@Vilx and you only stated that the items weigh 1TB, not that there are 1T of pointers. It all depends on the number of items, not the total size of those items.
MBO
In my example above there are 1G of pointers, far too much for 512MB of RAM anyway.
Vilx-
+4  A: 

The basic idea is to create an array of N integers, fill with the numbers 0 to N - 1, shuffle the array, use them as indexes into the original list. That requires space for an array of N ints (or longs), and assumes that you can efficiently index the element list.

To improve on the space usage, replace the array of integers with a linear congruential PRNG that cycles through all values in the range 0 .. 2**X - 1, discarding any values that are > N - 1. The disadvantage of this approach is that a given PRNG will always give the same shuffle, and if N is small compared with 2**X you generate lots of numbers to no good effect.
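
A minimal Python sketch of that idea follows. It assumes a modulus of 2**X with X chosen as the smallest width that covers N, and it relies on the Hull-Dobell conditions for a full period (increment odd, multiplier congruent to 1 modulo 4). The specific constants and the function name are illustrative choices, not something the answer prescribes.

```python
def lcg_shuffle_indices(n: int, seed: int = 0,
                        a: int = 1664525, c: int = 1013904223):
    """Yield each index in 0..n-1 once using a full-period LCG modulo 2**bits."""
    # Smallest power of two >= n, but at least 4 so the "a % 4 == 1" rule applies.
    bits = max(2, (n - 1).bit_length())
    m = 1 << bits
    if a % 4 != 1 or c % 2 != 1:
        raise ValueError("need a % 4 == 1 and c odd for a full period")
    state = seed % m
    for _ in range(m):  # one full cycle visits every value in 0..m-1 once
        if state < n:  # discard values that are > N - 1
            yield state
        state = (a * state + c) % m


if __name__ == "__main__":
    print(list(lcg_shuffle_indices(10, seed=7)))
```

Changing `seed` only changes where the cycle starts, so a given multiplier and increment always produce the same cyclic order. With a power-of-two modulus the low bits are also weak (successive values alternate between even and odd), which is one reason this is better seen as a space-saving trick than as a good shuffle.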

Finally, note that you cannot replace the cyclic PRNG with a true RNG (or a better PRNG) unless you are prepared to set aside (at least) a bit map for recording the numbers that you've already generated / used. And even if you do that, you have the problem that your rate of generating viable indexes will drop off as an inverse exponential. So you need to switch to a different way of generating the last x% of the indexes.

Stephen C
+1 You're too fast
Nescio
...which I explicitly forbid in my post. :)
Vilx-
@Vilx - the wording of your question is ... umm .... unclear.
Stephen C
A: 

If there's enough space, you could store the nodes' pointers in an array, create a bitmap, and draw random ints that point to the next chosen item. If that item has already been chosen (you store that in your bitmap), take the closest unchosen one (left or right, you can randomize that), until no items are left.

If there's not enough space, you could do the same without storing the nodes' pointers, but time will suffer (that's the time-space tradeoff ☺).

Ariel
+2  A: 

It's not possible to do this with a truly random number generator since you either have to:

  • remember which numbers have already been chosen and skip them (which requires an O(n) list of booleans and progressively worsening run-times as you skip more and more numbers); or
  • reduce the pool after each selection (which requires either modifications to the original list or a separate O(n) list to modify).

Neither of those is allowed by your question, so I'm going to have to say "no, you can't do it".

What I would tend to go for in this case is a bit mask of used values but not with skipping since, as mentioned, the run-times get worse as the used values accumulate.

A bit mask will be substantially smaller than the original list of 39Gb (10 million bits is only about 1.2M), many orders of magnitude less as you requested, even though it's still O(n).

In order to get around the run-time problem, only generate one random number each time and, if the relevant "used" bit is already set, scan forward through the bit mask until you find one that's not set.

That means you won't be hanging around, desperate for the random number generator to give you a number that hasn't been used yet. The run times will only ever get as bad as the time taken to scan 1.2M of data.

Of course this means that the specific number chosen at any time is skewed based on the numbers that have already been chosen but, since those numbers were random anyway, the skewing is random (and if the numbers weren't truly random to begin with, then the skewing won't matter).

And you could even alternate the search direction (scanning up or down) if you want a bit more variety.
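
The bit-mask-with-forward-scan approach described above might look like this in Python. The function name and the choice of a `bytearray` as the bit mask are assumptions for the sketch; the scan wraps around at the end of the list, which the description above leaves implicit.

```python
import random


def bitmask_shuffle_indices(n: int, rng=None):
    """Yield each index in 0..n-1 once, using one bit of memory per item."""
    rng = rng or random.Random()
    used = bytearray((n + 7) // 8)  # n bits, all initially clear
    for _ in range(n):
        i = rng.randrange(n)  # exactly one random draw per output
        # If the slot is taken, scan forward (wrapping) to the next free one.
        while used[i >> 3] & (1 << (i & 7)):
            i = (i + 1) % n
        used[i >> 3] |= 1 << (i & 7)
        yield i


if __name__ == "__main__":
    print(list(bitmask_shuffle_indices(10, random.Random(1))))
```

For 10,000,000 items the mask is 1,250,000 bytes, in line with the "about 1.2M" figure. The forward scan makes an index that follows a run of used slots more likely to be picked next, so the result is not uniform over all permutations, as the skewing caveat above already concedes.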

Bottom line: I don't believe what you're asking for is doable, but keep in mind I've been wrong before, as my wife will attest to, quickly and frequently :-) But, as with all things, there are usually ways to get around such issues.

paxdiablo
Well, I do agree about your point about the truly random number generator. I was thinking something about a weird PRNG function which would generate a non-repeating list of integers that can serve as indices into the original list. Well, non-repeating only as large as the size of the array. After that they start to repeat, naturally, perhaps even in the same sequence.
Vilx-
Of course, every PRNG needs a seed, some original entropy to base on. In case I have 1,000,000,000 items there are 1,000,000,000! possible permutations which is a LOT. I don't even know how much, but the PRNG seed would need to have at least that many variations, or the distribution would be seriously skewed. Would a binary number of such size be smaller than a list of 32-bit integer indices?
Vilx-
If yes, then this seed could be generated by the truly-random number generator, and then the PRNG would take care of the rest.
Vilx-
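
The size question raised in these comments can be estimated directly: a seed that can select any of N! permutations needs at least log2(N!) bits. The short Python sketch below computes that with `math.lgamma`; the function name is made up for the example, and the comparison assumes 32-bit indices as in the comment.

```python
import math


def seed_bits_for_all_permutations(n: int) -> float:
    """log2(n!): the minimum number of seed bits that can address every permutation."""
    # lgamma(n + 1) is ln(n!), so dividing by ln(2) converts to bits.
    return math.lgamma(n + 1) / math.log(2)


if __name__ == "__main__":
    n = 1_000_000_000
    seed_bytes = seed_bits_for_all_permutations(n) / 8
    index_bytes = 4 * n  # one 32-bit index per item
    print(f"seed: {seed_bytes / 1e9:.2f} GB, index list: {index_bytes / 1e9:.2f} GB")
```

For a billion items this comes to roughly 28.5 billion bits, so such a seed would be somewhat smaller than the index list but of the same order of magnitude.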
A: 

You can create a pseudorandom, 'secure' permutation using a block cipher - see here. The key insight is that, given a block cipher of n bits length, you can use 'folding' to shorten it to m < n bits, then the trick antti.huima already mentioned to generate a smaller permutation from it without spending huge amounts of time discarding out-of-range values.

Nick Johnson
A: 

A linear-feedback shift register can do pretty much what you want -- generate a list of numbers up to some limit, but in a (reasonably) random order. The patterns it produces are statistically similar to what you'd expect from true randomness, but it's not even close to cryptographically secure. The Berlekamp-Massey algorithm allows you to reverse engineer an equivalent LFSR based on an output sequence.

Given your requirement for a list of ~10,000,000 items, you'd want a 24-bit maximal-length LFSR, and simply discard outputs larger than the size of your list.

For what it's worth, an LFSR is generally quite fast compared to a typical linear congruential PRNG of the same period. In hardware, an LFSR is extremely simple, consisting of an N-bit register and M 2-input XORs (where M is the number of taps -- sometimes only a couple, and rarely more than a half dozen or so).
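
To show the mechanics, here is a small Python sketch using a 16-bit Fibonacci LFSR with the commonly cited maximal-length taps 16, 14, 13, 11, so it only handles lists of up to 65,535 items. The 24-bit register suggested above would need its own maximal-length tap set, which is not given here. The function name and default seed are illustrative.

```python
def lfsr16_shuffle_indices(n: int, seed: int = 0xACE1):
    """Yield each index in 0..n-1 once using a 16-bit maximal-length LFSR."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("a 16-bit LFSR covers at most 65535 items")
    if not 0 < seed <= 0xFFFF:
        raise ValueError("seed must be a non-zero 16-bit value")
    state = seed
    for _ in range(0xFFFF):  # the period is 2**16 - 1; state 0 never occurs
        if state - 1 < n:  # states run over 1..65535, so shift down by one
            yield state - 1
        # Feedback taps at bits 16, 14, 13 and 11 of the register.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)


if __name__ == "__main__":
    print(list(lfsr16_shuffle_indices(10)))
```

As with the cyclic PRNG approach, the seed only selects the starting point in one fixed cycle, so different seeds give rotations of the same order rather than different shuffles.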

Jerry Coffin
+1  A: 
Jason Orendorff
Wow, cool. Actually, I don't follow you through the second half, but it seems serious enough that I don't doubt you.
Vilx-
It is true that if all permutations need to be as probable you can't use any standard, off-the-shelf PRNG because when N grows large, N! >> 2**K where K = size in bits of the PRNG's internal state. However, I think it's misguided to interpret "true random" from the original post in this strict fashion because it would make even a solution that could use an arbitrary amount of space very difficult. I thought the original question was more about space usage and "true random" meant "true pseudorandom" as would be standard in computer science.
antti.huima
Well, the asker said several times what he or she meant by "true random". And he or she check-marked this answer. So.
Jason Orendorff
@antti.huima: Also note that the second part of this answer tries to argue that the task as specified is impossible *even if we take for granted some perfect source of randomness* (i.e., "the algorithm receives new random bits from the environment as it goes").
Jason Orendorff
A: 

Essentially what you need is a random number generator that produces the numbers 0..n-1 exactly once each.

Here's a half-baked idea: You could do pretty well by picking a prime p slightly larger than n, then picking a random x between 1 and p-1 whose order in the multiplicative group mod p is p-1 (pick random xs and test which ones satisfy x^i != 1 for i < p-1, you will only need to test a few before you find one). Since x then generates the group, just compute x^i mod p for 1 <= i <= p-2 and that will give you p-2 distinct random(ish) numbers between 2 and p-1. Subtract 2 and throw out the ones >= n and that gives you a sequence of indexes to print.

This isn't terribly random, but you can use the same technique multiple times, taking the indexes above (+1) and using them as the exponents of another generator x2 modulo another prime p2 (you'll need n < p2 < p), and so on. A dozen repetitions should make things pretty random.
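
A sketch of the first step (a single generator, without the repeated composition) in Python. Instead of testing x^i != 1 for every i < p-1, it uses the standard shortcut of checking x^((p-1)/q) != 1 for each prime factor q of p-1. That shortcut, the trial-division helpers, and all names are assumptions added for the example.

```python
import random


def _is_prime(m: int) -> bool:
    """Trial division; adequate for a sketch only."""
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def _prime_factors(m: int) -> set:
    """Distinct prime factors of m by trial division."""
    factors, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def generator_shuffle_indices(n: int, rng=None):
    """Yield each index in 0..n-1 once by walking the powers of a generator mod p."""
    if n < 1:
        return
    rng = rng or random.Random()
    p = n + 2  # need p - 2 >= n so that every index is reachable
    while not _is_prime(p):
        p += 1
    factors = _prime_factors(p - 1)
    while True:  # pick random candidates until one has order p - 1
        x = rng.randrange(2, p)
        if all(pow(x, (p - 1) // q, p) != 1 for q in factors):
            break
    value = 1
    for _ in range(p - 2):  # x^1 .. x^(p-2) covers 2..p-1 exactly once
        value = value * x % p
        if value - 2 < n:  # subtract 2 and throw out the ones >= n
            yield value - 2


if __name__ == "__main__":
    print(list(generator_shuffle_indices(10, random.Random(5))))
```

Each run with a different generator gives a different order, but the state is still just a few integers, so only a small fraction of all N! permutations can ever be produced.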

Keith Randall