How would one implement shuffle for the "Celestial Jukebox"?

More precisely: at each time t, return a uniform random number in 0..n(t), such that there are no repeats over the entire sequence, with n() increasing over time.

For the concrete example, assume a flat-rate music service which allows playing any song in the catalog by a 0-based index number. Every so often, new songs are added, which increases the range of index numbers. The goal is to play a new song each time (assuming no duplicates in the catalog).

An ideal solution would be feasible on existing hardware: how would I shoehorn a list of six million songs into 8MB of DRAM? Similarly, the high song count exacerbates O(n) selection times.

-- For an LCG generator, given a partially exhausted LCG on 0..N0, can it be translated to a different LCG on 0..N1 (where N1 > N0) that doesn't repeat the already-exhausted sequence?
-- Checking whether a particular song has already been played seems to rapidly grow out of hand, although this might be the only way? Is there an efficient data structure for this?

+3  A: 

The way that I like to do this kind of non-repeating random selection is to keep a list, and each time I select an item at random from [0, N), I remove it from that list. In your case, as new items get added to the catalog, they would also be added to the not-yet-selected list. Once you get to the end, simply reload all the songs back into the list.

EDIT:

If you take v3's suggestion into account, this can be done in basically O(1) time per selection after the O(N) initialization step. It guarantees non-repeating random selection.

Here is the recap:

  1. Add the initial items to a list
  2. Pick index i at random (from set of [0,N))
  3. Remove item at index i
  4. Fill the hole at i with the last item (or null if i is the last) and decrement N
  5. For new items, simply append to the end of the list and increment N as necessary
  6. If you ever get to playing through all the songs (which I doubt if you have 6M songs), then add all the songs back to the list, lather, rinse, and repeat.
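The steps above can be sketched as follows (a minimal illustration; the names `Playlist`, `pick_next`, and `add` are mine, not from the answer):

```cpp
#include <cstdlib>
#include <vector>

struct Playlist {
    std::vector<int> ids;   // song indices not yet played

    // Steps 2-4: uniform O(1) selection; returns -1 when exhausted.
    int pick_next() {
        if (ids.empty()) return -1;
        size_t i = std::rand() % ids.size();  // random index in [0, N)
        int song = ids[i];
        ids[i] = ids.back();   // fill the hole with the last item
        ids.pop_back();        // decrement N
        return song;
    }

    // Step 5: new songs simply append to the end.
    void add(int id) { ids.push_back(id); }
};
```

Since each selection only does one swap-and-shrink, the order in which songs were added never biases which unplayed song comes out next.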

Since you are trying to deal with rather large sets, I would recommend the use of a DB. A simple table with basically two fields: id and "pointer" (where "pointer" is what tells you the song to play which could be a GUID, FileName, etc, depending on how you want to do it). Have an index on id and you should get very decent performance with persistence between application runs.

EDIT for 8MB limit:

Umm, this does make it a bit harder... In 8 MB, you can store a maximum of ~2M entries using 32-bit keys.

So what I would recommend is to pre-select the next 2M entries. If the user plays through 2M songs in a lifetime, damn! To pre-select them, do a pre-init step using the above algorithm. The one change I would make is that as you add new songs, roll the dice and see if you want to randomly add that song to the mix. If yes, then pick a random index and replace it with the new song's index.
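That "roll the dice" replacement might look something like this (the 50% probability and the function name are illustrative assumptions, not from the answer):

```cpp
#include <cstdlib>
#include <vector>

// queue holds the pre-selected upcoming song indices
// (roughly 2M 32-bit entries in the 8MB budget).
void maybe_add_new_song(std::vector<int>& queue, int new_id) {
    // Give the new song a chance of joining the mix; the 50%
    // probability here is an assumption for illustration.
    if (std::rand() % 2 == 0) {
        size_t i = std::rand() % queue.size();  // random slot to displace
        queue[i] = new_id;
    }
}
```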

Erich Mirabal
Assuming a linked list, this involves random selection from a linked list (O(n)); for an array, deleting the entry is again O(n), per item. Today typical providers have catalogs of 6 million songs, so perhaps more elegance is needed?
caffiend
That would be my solution too. Solve the problem by making it easier: don't worry about non-repeating numbers; just keep another copy of the data structure and move the already-played songs into it.
Simonw
I'm leaving now, but let me think about how to optimize for large datasets. My gut feeling is that a dictionary/hashset like Alex mentions is probably in order.
Erich Mirabal
@caffiend: if you use an array, deletion doesn't have to be O(N) ;) To delete an item, swap it with the last element and pretend it isn't there. This won't affect the randomness of the next selection.
v3
@v3: +1. oh yeah, that's a great point. I think if you just added that suggestion, this is as simple as you can get.
Erich Mirabal
A: 

While Erich's solution is probably better for your specific use case, checking if a song has already been played is very fast (amortized O(1)) with a hash-based structure, such as a set in Python or a std::unordered_set<int> in C++.
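A minimal sketch of this re-roll-until-unplayed approach, using the C++ standard library's hash set (the function name is illustrative):

```cpp
#include <cstdlib>
#include <unordered_set>

// Pick a uniform random song index in [0, n) that is not yet in `played`.
// Assumes played.size() < n, otherwise the loop never terminates.
int pick_unplayed(int n, std::unordered_set<int>& played) {
    int song;
    do {
        song = std::rand() % n;       // uniform over the whole catalog
    } while (played.count(song));     // amortized O(1) membership test
    played.insert(song);
    return song;
}
```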

Alex Martelli
yeah, but the problem is that you have to keep repeating the selection process until you don't have a repeat. This will take longer and longer each time.
Erich Mirabal
yep, that's why, as I said, your solution is probably better for the OP's use case (assuming the "non-repeating samples" ever get very close to the maximum allowable size; where samples stay relatively small, say <= half the maximum allowable, the trade-offs change).
Alex Martelli
A: 

You could simply generate the sequence of numbers from 1 to n and then shuffle it using a Fisher-Yates shuffle. That way you can guarantee that the sequence won't repeat, regardless of n.
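A sketch of that approach (0-based here, to match the question's indexing; the function name is mine):

```cpp
#include <cstdlib>
#include <numeric>
#include <vector>

// Build the sequence 0..n-1 and Fisher-Yates shuffle it in place.
// Playing the result front to back guarantees no repeats in one pass.
std::vector<int> shuffled_playlist(int n) {
    std::vector<int> seq(n);
    std::iota(seq.begin(), seq.end(), 0);   // 0, 1, ..., n-1
    for (int i = n - 1; i > 0; --i) {
        int j = std::rand() % (i + 1);      // uniform in [0, i]
        std::swap(seq[i], seq[j]);
    }
    return seq;
}
```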

Joey
How would you handle new songs? n is not static, it changes over time.
Erich Mirabal
When a new song comes in, pick a random index and add it there. Shift all the other songs down one. O(n) (for the shift).
kenj0418
@kenj0418: why not just replace the item and move the replaced item to the end?
Erich Mirabal
A: 

You could use a linked list inside an array. To build the initial playlist, use an array of nodes something like this:

 struct playlistNode {
   songLocator *song;
   struct playlistNode *next;
 };
 struct playlistNode arr[N];

Also keep 'head' and 'freelist' pointers.

Populate it in 2 passes:
1. fill in arr with all the songs in the catalog, in index order 0..N-1.
2. randomly iterate through all the indices, filling in the next pointers.

Deletion of the song just played is O(1):

cur = head;
head = cur->next;
cur->song = NULL;       /* mark the slot as played */
cur->next = freelist;   /* push the node onto the free list */
freelist = cur;

Insertion of new songs is O(1) on average while most slots are still unplayed: pick an unplayed array index at random, and patch in a new node.

node = freelist;
freelist = freelist->next;
node->song = newSong;     /* attach the new song's locator */
do {
  i = rand() % N;
} while (!arr[i].song);   /* re-roll if we hit an already-played node */
node->next = arr[i].next;
arr[i].next = node;
AShelly
N is not constant. It grows over time. Also, I don't think insert is O(1) since you are looping to skip any previously played ones.
Erich Mirabal
If you play songs faster than you add them, you don't need to grow the array. If the opposite is true, either periodically realloc, or maintain a list of arrays and let the linked list thread through them.
AShelly
Insert can be worse than O(1), if you are not adding new songs fast enough. In that case I'd probably do a consolidation once the array reached an "emptiness" threshold.
AShelly
+1  A: 

With a limit of 8MB for 6 million songs, there's plainly not room to store even a single 32-bit integer per song, unless you're prepared to store the list on disk (in which case, see below).

If you're prepared to drop the requirement that new items be immediately added to the shuffle, you can generate an LCG over the current set of songs, then when that is exhausted, generate a new LCG over only the songs that were added since you began. Rinse and repeat until you no longer have any new songs. You can also use this rather cool algorithm that generates an unguessable permutation over an arbitrary range without storing it.
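One way a single LCG pass of that kind might look (the multiplier and increment are standard textbook constants chosen to satisfy the full-period conditions for a power-of-two modulus; everything else here is an illustrative assumption):

```cpp
#include <cstdint>

// Visits every value in 0..n-1 exactly once per period, in a scrambled
// order, using O(1) state: a full-period LCG over the smallest power of
// two >= n, with out-of-range states skipped.
struct LcgShuffle {
    uint32_t m;   // smallest power of two >= n
    uint32_t n;   // catalog size
    uint32_t x;   // current LCG state

    LcgShuffle(uint32_t n_, uint32_t seed) {
        n = n_;
        m = 1;
        while (m < n) m <<= 1;
        x = seed & (m - 1);
    }

    uint32_t next() {
        do {
            // Full period mod 2^k: increment odd, multiplier == 1 mod 4.
            x = (x * 1664525u + 1013904223u) & (m - 1);
        } while (x >= n);   // reject states outside the catalog
        return x;
    }
};
```

Since m < 2n, the rejection loop discards fewer than half the states on average, so each call is O(1) expected time.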

If you're prepared to relax the requirement of 8MB RAM for 6 million songs, or to go to disk (for example, by memory mapping), you could generate the sequence from 1..n at the beginning, shuffle it with Fisher-Yates, and whenever a new song is added, pick a random element from the so-far-unplayed section, insert the new ID there, and append the displaced ID to the end of the list.

If you don't care much about computational efficiency, you could store a bitmap of all songs, and repeatedly pick IDs uniformly at random until you find one you haven't played yet. This would take 6 million tries to find the last song (on average), which is still damn fast on a modern CPU.
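A sketch of that bitmap approach (one bit per song, so 6 million songs fit in ~750KB; names are illustrative):

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

struct BitmapShuffle {
    std::vector<uint8_t> played;   // one bit per song
    int n, remaining;

    BitmapShuffle(int n_) : played((n_ + 7) / 8, 0), n(n_), remaining(n_) {}

    bool is_played(int i) const { return played[i >> 3] & (1 << (i & 7)); }

    // Uniform over unplayed songs; returns -1 once everything is played.
    int pick() {
        if (remaining == 0) return -1;
        int i;
        do {
            i = std::rand() % n;   // re-roll until an unplayed song turns up
        } while (is_played(i));
        played[i >> 3] |= 1 << (i & 7);
        --remaining;
        return i;
    }
};
```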

Nick Johnson