How would one implement shuffle for the "Celestial Jukebox"?

More precisely: at each time t, return a uniform random number in 0..n(t), such that there are no repeats over the entire sequence, with n() increasing over time.

For the concrete example, assume a flat-rate music service which allows playing any song in the catalog by a 0-based index number. Every so often, new songs are added, which increases the range of index numbers. The goal is to play a new song each time (assuming no duplicates in the catalog).

An ideal solution would be feasible on existing hardware: how would I shoehorn a list of six million songs into 8MB of DRAM? Similarly, the high song count exacerbates O(n) selection times.

-- For an LCG generator, given a partially exhausted LCG on 0..N0, can it be translated to a different LCG on 0..N1 (where N1 > N0) that doesn't repeat the already-exhausted sequence?
-- Checking whether a particular song has already been played seems to rapidly grow out of hand, although this might be the only way? Is there an efficient data structure for this?

+3  A: 

The way that I like to do this kind of non-repeating random selection is to keep a list, and each time I select an item at random from [0, N), I remove it from that list. In your case, as new items get added to the catalog, they would also be added to the not-yet-selected list. Once you get to the end, simply reload all the songs back into the list.

EDIT:

If you take v3's suggestion into account, this can be done in basically O(1) time per selection after the O(N) initialization step. It guarantees non-repeating random selection.

Here is the recap:

  1. Add the initial items to a list
  2. Pick index i at random (from set of [0,N))
  3. Remove item at index i
  4. Fill the hole at i with the last item (or null if i is the last) and decrement N
  5. For new items, simply append to the end of the list and increment N as necessary
  6. If you ever get to playing through all the songs (which I doubt if you have 6M songs), then add all the songs back to the list, lather, rinse, and repeat.
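The steps above can be sketched as follows (a minimal illustration; the names `Playlist`, `pick_next`, and `add` are mine, not from the answer):

```cpp
#include <cstdlib>
#include <vector>

struct Playlist {
    std::vector<int> ids;   // song indices not yet played

    // Steps 2-4: uniform O(1) selection; returns -1 when exhausted.
    int pick_next() {
        if (ids.empty()) return -1;
        size_t i = std::rand() % ids.size();  // random index in [0, N)
        int song = ids[i];
        ids[i] = ids.back();   // fill the hole with the last item
        ids.pop_back();        // decrement N
        return song;
    }

    // Step 5: new songs simply append to the end.
    void add(int id) { ids.push_back(id); }
};
```

Since each selection only does one swap-and-shrink, the order in which songs were added never biases which unplayed song comes out next.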

Since you are trying to deal with rather large sets, I would recommend the use of a DB. A simple table with basically two fields: id and "pointer" (where "pointer" is what tells you the song to play which could be a GUID, FileName, etc, depending on how you want to do it). Have an index on id and you should get very decent performance with persistence between application runs.

EDIT for 8MB limit:

Umm, this does make it a bit harder... In 8 MB, you can store a maximum of ~2M entries using 32-bit keys.

So what I would recommend is to pre-select the next 2M entries. If the user plays through 2M songs in a lifetime, damn! To pre-select them, do a pre-init step using the above algorithm. The one change I would make is that as you add new songs, roll the dice and see if you want to randomly add that song to the mix. If yes, then pick a random index and replace it with the new song's index.
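That "roll the dice" replacement might look something like this (the 50% probability and the function name are illustrative assumptions, not from the answer):

```cpp
#include <cstdlib>
#include <vector>

// queue holds the pre-selected upcoming song indices
// (roughly 2M 32-bit entries in the 8MB budget).
void maybe_add_new_song(std::vector<int>& queue, int new_id) {
    // Give the new song a chance of joining the mix; the 50%
    // probability here is an assumption for illustration.
    if (std::rand() % 2 == 0) {
        size_t i = std::rand() % queue.size();  // random slot to displace
        queue[i] = new_id;
    }
}
```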

Erich Mirabal
Assuming a linked list, this involves random selection from a linked list (O(n)); for an array, deleting the entry is again O(n), per item. Today typical providers have catalogs of 6 million songs, so perhaps more elegance is needed?
caffiend
That would be my solution too. Solve the problem by making it easier: don't worry about non-repeating numbers; just keep another copy of the data structure and move the already-played songs into it.
Simonw
I'm leaving now, but let me think about how to optimize for large datasets. My gut feeling is that a dictionary/hashset like Alex mentions is probably in order.
Erich Mirabal
@caffiend: if you use an array, deletion doesn't have to be O(N) ;) To delete an item, swap it with the last element and pretend it isn't there. This won't affect the randomness of the next selection.
v3
@v3: +1. oh yeah, that's a great point. I think if you just added that suggestion, this is as simple as you can get.
Erich Mirabal
A: 

While Erich's solution is probably better for your specific use case, checking if a song has already been played is very fast (amortized O(1)) with a hash-based structure, such as a set in Python or a std::unordered_set<int> in C++.
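A minimal sketch of this re-roll-until-unplayed approach, using the C++ standard library's hash set (the function name is illustrative):

```cpp
#include <cstdlib>
#include <unordered_set>

// Pick a uniform random song index in [0, n) that is not yet in `played`.
// Assumes played.size() < n, otherwise the loop never terminates.
int pick_unplayed(int n, std::unordered_set<int>& played) {
    int song;
    do {
        song = std::rand() % n;       // uniform over the whole catalog
    } while (played.count(song));     // amortized O(1) membership test
    played.insert(song);
    return song;
}
```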

Alex Martelli
yeah, but the problem is that you have to keep repeating the selection process until you don't have a repeat. This will take longer and longer each time.
Erich Mirabal
yep, that's why, as I said, your solution is probably better for the OP's use case (assuming the "non-repeating samples" ever get very close to the maximum allowable size; where samples stay relatively small, say <= half the maximum allowable, the trade-offs change).
Alex Martelli
A: 

You could simply generate the sequence of numbers from 1 to n and then shuffle it using a Fisher-Yates shuffle. That way you can guarantee that the sequence won't repeat, regardless of n.
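A sketch of that approach (0-based here, to match the question's indexing; the function name is mine):

```cpp
#include <cstdlib>
#include <numeric>
#include <vector>

// Build the sequence 0..n-1 and Fisher-Yates shuffle it in place.
// Playing the result front to back guarantees no repeats in one pass.
std::vector<int> shuffled_playlist(int n) {
    std::vector<int> seq(n);
    std::iota(seq.begin(), seq.end(), 0);   // 0, 1, ..., n-1
    for (int i = n - 1; i > 0; --i) {
        int j = std::rand() % (i + 1);      // uniform in [0, i]
        std::swap(seq[i], seq[j]);
    }
    return seq;
}
```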

Joey
How would you handle new songs? n is not static, it changes over time.
Erich Mirabal
When a new song comes in, pick a random index and add it there. Shift all the other songs down one. O(n) (for the shift).
kenj0418
@kenj0418: why not just replace the item and move the replaced item to the end?
Erich Mirabal
A: 

You could use a linked list inside an array. To build the initial playlist, use an array of nodes something like this:

 struct playlistNode {
   songLocator *song;
   struct playlistNode *next;
 };
 struct playlistNode arr[N];

Also keep 'head' and 'freelist' pointers.

Populate it in 2 passes:
1. fill in arr with all the songs in the catalog, in index order 0..N-1.
2. randomly iterate through all the indices, filling in the next pointers.

Deletion of the song just played is O(1):

cur = head;
head = cur->next;
cur->song = NULL;       /* mark the slot as played */
cur->next = freelist;   /* push the node onto the free list */
freelist = cur;

Insertion of new songs is O(1) on average while most slots are still unplayed: pick an unplayed array index at random, and patch in a new node.

node = freelist;
freelist = freelist->next;
node->song = newSong;     /* attach the new song's locator */
do {
  i = rand() % N;
} while (!arr[i].song);   /* re-roll if we hit an already-played node */
node->next = arr[i].next;
arr[i].next = node;
AShelly
N is not constant. It grows over time. Also, I don't think insert is O(1) since you are looping to skip any previously played ones.
Erich Mirabal
If you play songs faster than you add them, you don't need to grow the array. If the opposite is true, either periodically realloc, or maintain a list of arrays and let the linked list thread through them.
AShelly
Insert can be worse than O(1), if you are not adding new songs fast enough. In that case I'd probably do a consolidation once the array reached an "emptiness" threshold.
AShelly
+1  A: 

With a limit of 8MB for 6 million songs, there's plainly not room to store even a single 32-bit integer per song, unless you're prepared to store the list on disk (in which case, see below).

If you're prepared to drop the requirement that new items be immediately added to the shuffle, you can generate an LCG over the current set of songs, then when that is exhausted, generate a new LCG over only the songs that were added since you began. Rinse and repeat until you no longer have any new songs. You can also use this rather cool algorithm that generates an unguessable permutation over an arbitrary range without storing it.
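One way a single LCG pass of that kind might look (the multiplier and increment are standard textbook constants chosen to satisfy the full-period conditions for a power-of-two modulus; everything else here is an illustrative assumption):

```cpp
#include <cstdint>

// Visits every value in 0..n-1 exactly once per period, in a scrambled
// order, using O(1) state: a full-period LCG over the smallest power of
// two >= n, with out-of-range states skipped.
struct LcgShuffle {
    uint32_t m;   // smallest power of two >= n
    uint32_t n;   // catalog size
    uint32_t x;   // current LCG state

    LcgShuffle(uint32_t n_, uint32_t seed) {
        n = n_;
        m = 1;
        while (m < n) m <<= 1;
        x = seed & (m - 1);
    }

    uint32_t next() {
        do {
            // Full period mod 2^k: increment odd, multiplier == 1 mod 4.
            x = (x * 1664525u + 1013904223u) & (m - 1);
        } while (x >= n);   // reject states outside the catalog
        return x;
    }
};
```

Since m < 2n, the rejection loop discards fewer than half the states on average, so each call is O(1) expected time.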

If you're prepared to relax the requirement of 8MB RAM for 6 million songs, or to go to disk (for example, by memory mapping), you could generate the sequence from 1..n at the beginning, shuffle it with Fisher-Yates, and whenever a new song is added, pick a random element from the so-far-unplayed section, insert the new ID there, and append the displaced ID to the end of the list.

If you don't care much about computational efficiency, you could store a bitmap of all songs, and repeatedly pick IDs uniformly at random until you find one you haven't played yet. This would take 6 million tries to find the last song (on average), which is still damn fast on a modern CPU.
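A sketch of that bitmap approach (one bit per song, so 6 million songs fit in ~750KB; names are illustrative):

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

struct BitmapShuffle {
    std::vector<uint8_t> played;   // one bit per song
    int n, remaining;

    BitmapShuffle(int n_) : played((n_ + 7) / 8, 0), n(n_), remaining(n_) {}

    bool is_played(int i) const { return played[i >> 3] & (1 << (i & 7)); }

    // Uniform over unplayed songs; returns -1 once everything is played.
    int pick() {
        if (remaining == 0) return -1;
        int i;
        do {
            i = std::rand() % n;   // re-roll until an unplayed song turns up
        } while (is_played(i));
        played[i >> 3] |= 1 << (i & 7);
        --remaining;
        return i;
    }
};
```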

Nick Johnson