First of all, I do know about the Fisher-Yates shuffle. But let's say, for argument's sake, that I want to allow the user to pick a sort option from a drop-down list. This list would include a "Random" option. Based on their selection, I just want to substitute an IComparer instance into my sort. What would that IComparer look like?

Google brings up a plethora of flawed results that all take this form:

public class NaiveRandomizer<T> : IComparer<T>
{
    private static Random rand = new Random();

    public int Compare(T x, T y)
    {
        return x.Equals(y) ? 0 : rand.Next(-1, 2); // Next(-1, 2) yields -1, 0, or 1
    }
}

However, that implementation is biased and will even throw an exception in some circumstances. The bias can be demonstrated with the following code:

void Test()
{
    Console.WriteLine("NaiveRandomizer Test:");
    var data = new List<int>() {1,2,3};
    var sortCounts = new Dictionary<string, int>(6); // 3 items, so 3! = 6 possible orderings
    var randomly = new NaiveRandomizer<int>();

    for (int i = 0; i < 10000; i++)
    {   // always start with the same list, in _the same order_
        var dataCopy = new List<int>(data); 
        dataCopy.Sort(randomly);

        var key = WriteList(dataCopy);
        if (sortCounts.ContainsKey(key))
            sortCounts[key]++;
        else
            sortCounts.Add(key, 1);
    }

    foreach (KeyValuePair<string, int> item in sortCounts)
        Console.WriteLine(item.Key + "\t" + item.Value);
}

string WriteList<T>(List<T> list)
{
   string delim = "";
   string result = "";
   foreach(T item in list)
   {
       result += delim + item.ToString();
       delim = ", ";
   }
   return result;
}

So how could you implement a random IComparer<T> that solved those issues? It is allowed to require each call to .Sort() to use a separate IComparer instance, as I don't see any other way to do this: items must be compared using some other, truly random value, but that value must also be consistent for an item within a given sort operation.

I have a start here, but it was posted in haste, is extremely slow, and doesn't even return all possible sorts (testing shows that it does at least eliminate bias, if you don't count the missing options). I don't expect O(n) performance like Fisher-Yates, but I do want something reasonable (O(n log n) for a smallish n), and I do expect it to show all possible sorts. Unfortunately, that link is the current accepted answer for its question, so I'm hoping to replace it with something a little better.

If nothing else, I want this question to be a magnet for all those Google queries looking for an IComparer solution, so that they end up here instead of somewhere else telling them to use the incorrect version.

A: 

How 'bout sorting based on a hidden field, which is pre-assigned a random value?

James Curran
I'd like this to work for _any_ T: no constraints, and no projection.
Joel Coehoorn
+6  A: 

I was somewhat surprised by how many wrong answers were posted in this thread. Just for the sake of others who come up with a solution similar to the OP's, the following code looks correct:

int[] nums = new int[1000];
for (int i = 0; i < nums.Length; i++)
{
    nums[i] = i;
}

Random r = new Random();
Array.Sort<int>(nums, (x, y) => r.Next(-1, 2));

foreach(var num in nums)
{
    Console.Write("{0} ", num);
}

However, the code will throw an exception occasionally, but not always. That's what makes it fun to debug :) If you run it enough times, or execute the sort procedure in a loop 50 or so times, you'll get an error stating:

IComparer (or the IComparable methods it relies upon) did not return zero when Array.Sort called x.CompareTo(x). x: '0' x's type: 'Int32' The IComparer: ''.

In other words, the quicksort compared some number x to itself and got a non-zero result. The obvious fix would be to write:

Array.Sort<int>(nums, (x, y) =>
    {
        if (x == y) return 0;
        else return r.NextDouble() < 0.5 ? 1 : -1;
    });

But even this doesn't work, because there are occasions where .NET compares three numbers against one another and gets inconsistent results, such as A > B, B > C, and C > A (oops!). No matter whether you use a Guid, GetHashCode, or any other randomly generated input, a solution like the one shown above is still wrong.


With that being said, Fisher-Yates is the standard way of shuffling arrays, so there's no real reason to use an IComparer in the first place. Fisher-Yates is O(n), whereas any implementation using an IComparer runs a quicksort behind the scenes, which has a time complexity of O(n log n). There's just no good reason not to use the well-known, efficient, standard algorithm for this kind of problem.
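
For reference, the whole algorithm is only a few lines. (A minimal sketch; the ShuffleExtensions class and Shuffle method are my own names, not BCL members.)

using System;
using System.Collections.Generic;

static class ShuffleExtensions
{
    private static readonly Random _rand = new Random();

    // Fisher-Yates: walk the list backwards, swapping each element
    // with a randomly chosen element at or before it. O(n), unbiased.
    public static void Shuffle<T>(this IList<T> list)
    {
        for (int i = list.Count - 1; i > 0; i--)
        {
            int j = _rand.Next(i + 1); // 0 <= j <= i
            T temp = list[i];
            list[i] = list[j];
            list[j] = temp;
        }
    }
}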

However, if you really insist on using an IComparer and a Random, then apply your random data before you sort. This requires projecting the data onto another object so that you don't lose the random values:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication1
{
    class Pair<T, U>
    {
        public T Item1 { get; private set; }
        public U Item2 { get; private set; }
        public Pair(T item1, U item2)
        {
            this.Item1 = item1;
            this.Item2 = item2;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            Pair<int, double>[] nums = new Pair<int, double>[1000];
            Random r = new Random();
            for (int i = 0; i < nums.Length; i++)
            {
                nums[i] = new Pair<int, double>(i, r.NextDouble());
            }

            Array.Sort<Pair<int, double>>(nums, (x, y) => x.Item2.CompareTo(y.Item2));

            foreach (var item in nums)
            {
                Console.Write("{0} ", item.Item1);
            }

            Console.ReadKey(true);
        }
    }
}

Or get LINQy with your bad self (this works because OrderBy evaluates each element's key only once and caches it, so the comparisons stay consistent within the sort):

Random r = new Random();
var nums = from x in Enumerable.Range(0, 1000)
           orderby r.NextDouble()
           select x;
Juliet
I think I put together a pretty good case where you might want to embed correct logic into an IComparer. I know it's not going to be possible to get the same performance as Fisher-Yates, but it ought to be at least possible to get correct logic and reasonable performance.
Joel Coehoorn
+1  A: 

IComparer's requirement to return zero at some point (for equal instances of T) makes it mathematically impossible to create a generic IComparer that statistically mimics a Fisher-Yates shuffle. There will always be a bias. For a real shuffle, you'd never want to force it to return any particular value.

Jason Punyon
A: 

To follow up on James Curran's idea: let the IComparer maintain the "sorted" values as a list; if a new value occurs, insert it into the list at a random position; compare by list index. Optimize by maintaining the list as a balanced tree or similar. Every instance of such an IComparer maintains a consistent, random sort order, so you can choose between letting your random sort be consistently the same random ordering or a different one each time. A minor modification will even allow equal elements to be "sorted" into different positions, if you prefer to read "random" that way. A sketch follows below.
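
A rough sketch of that idea (my own illustration, using a plain List<T> rather than a balanced tree, so each compare costs O(n)):

using System;
using System.Collections.Generic;

public class InsertionOrderComparer<T> : IComparer<T>
{
    private readonly Random _rand = new Random();
    private readonly List<T> _order = new List<T>();

    public int Compare(T x, T y)
    {
        // give both values a position first, then compare by index;
        // positions are stable for the lifetime of this instance
        EnsurePlaced(x);
        EnsurePlaced(y);
        return _order.IndexOf(x).CompareTo(_order.IndexOf(y));
    }

    private void EnsurePlaced(T item)
    {
        if (!_order.Contains(item))
            _order.Insert(_rand.Next(_order.Count + 1), item);
    }
}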

reinierpost
+2  A: 

As others have pointed out, it's impossible to adhere to the IComparer contract and simultaneously achieve true, unbiased random distribution. However, I think we can implement a "good enough" solution.

The IComparer contract requires that Compare(x, y) return the same value every time for the same x and y. If a comparer doesn't do this, and the runtime gets a whiff of it, an exception is thrown. This is one of the problems with the naive randomizer in the question.

Joel's implementation works around this restriction by making each separate instance of the comparer adhere to the IComparer contract. That is, Compare(x, y) will consistently return the same value for the same x and y if, and only if, the same instance of the comparer is used.

As Joel mentions in his question, this pattern of per-instance consistency seems to be the only way that we can achieve something close to randomness without breaking the IComparer contract and hitting exceptions.

Whereas Joel's comparer uses a one-way cryptographic hash to ensure consistency, my version uses plain old System.Random and achieves its consistency by storing the generated random numbers in a dictionary, to be re-used over the lifetime of each instance. (In a nutshell, I'm trading some memory for speed.)

My comparer returns all possible sorts and there isn't any bias that I've noticed. The performance is pretty good too: in my tests it's about 3x slower than a Fisher-Yates loop. It appears to be O(n log n) as far as I can tell. (I'm not certain about this last point though, someone please correct me if I'm wrong!)

public class FasterRandomizer<T> : IComparer<T>
{
    private static Random _random = new Random();
    private Dictionary<T, int> _dictionary = new Dictionary<T, int>();

    public int Compare(T x, T y)
    {
        int xrnd, yrnd;

        if (_dictionary.ContainsKey(x))
            xrnd = _dictionary[x];
        else
        {
            xrnd = _random.Next();
            _dictionary[x] = xrnd;
        }

        if (_dictionary.ContainsKey(y))
            yrnd = _dictionary[y];
        else
        {
            yrnd = _random.Next();
            _dictionary[y] = yrnd;
        }

        return xrnd.CompareTo(yrnd);
    }

    // if you prefer readability and/or brevity over raw speed
    // then you can use this version which is approx 3% slower
    //public int Compare(T x, T y)
    //{
    //    if (!_dictionary.ContainsKey(x))
    //        _dictionary[x] = _random.Next();
    //
    //    if (!_dictionary.ContainsKey(y))
    //        _dictionary[y] = _random.Next();
    //
    //    return _dictionary[x].CompareTo(_dictionary[y]);
    //}
}
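
Usage would look something like this (note the fresh comparer instance for each sort, per the per-instance consistency described above):

var data = new List<int> { 1, 2, 3, 4, 5 };
data.Sort(new FasterRandomizer<int>());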
LukeH
I played with using a dictionary to cache my hashes. In my case it didn't help until there were at least 500 items in the list, at which point the memory used by the dictionary could start to be a concern. It just 'feels' wrong. Oh well, there may not be anything better.
Joel Coehoorn
@Joel, it does give you the benefit of using true(ish) random numbers, independent of the values being compared, rather than hashes, which means that it'll return all possible sorts.
LukeH
A: 

An interesting endeavor. Most likely a misuse/abuse of IComparer.

You're attempting to do a random weighted sort by using a mechanism that wasn't built for that purpose.

Why not implement your own sorting routine and your own comparer? I have a feeling that even that would be insufficient.

A: 

One suggestion I got elsewhere was to create a separate interface that describes any operation to rearrange a list. Something like IArrangeable.

Then I could implement a Shuffle for that interface using a proper Fisher-Yates algorithm, and also have implementations that wrap each List<T>.Sort()/IComparer variety that I care about.

For a final step, I add support for this to any IEnumerable via an extension method. Then you still get simple run-time algorithm swapping, you get a better implementation of the shuffle algorithm, and the code that uses it feels natural. A sketch of the idea is below.
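
A rough sketch of what that might look like (the IArrangeable<T> interface and the class names here are hypothetical, taken from the suggestion above):

using System;
using System.Collections.Generic;

public interface IArrangeable<T>
{
    void Arrange(IList<T> list);
}

// a proper Fisher-Yates shuffle behind the common interface
public class ShuffleArranger<T> : IArrangeable<T>
{
    private readonly Random _rand = new Random();

    public void Arrange(IList<T> list)
    {
        for (int i = list.Count - 1; i > 0; i--)
        {
            int j = _rand.Next(i + 1);
            T temp = list[i];
            list[i] = list[j];
            list[j] = temp;
        }
    }
}

// an ordinary comparer-based sort behind the same interface
public class SortArranger<T> : IArrangeable<T>
{
    private readonly IComparer<T> _comparer;

    public SortArranger(IComparer<T> comparer) { _comparer = comparer; }

    public void Arrange(IList<T> list)
    {
        var copy = new List<T>(list);
        copy.Sort(_comparer);
        for (int i = 0; i < list.Count; i++)
            list[i] = copy[i];
    }
}

The drop-down selection then just maps to an IArrangeable<T> instance, and the calling code runs selected.Arrange(myList) no matter which option was picked.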

Joel Coehoorn
A: 

Don't do it.

All of the algorithms proposed thus far introduce some sort of bias into the output (some bigger than others).

@Princess and @Luke propose storing a random number alongside the data. However, because any two of these random numbers could end up with the same value, the sort order between those two items will be deterministically biased.

The worst case would be if the sorting routine is "stable" (that is, objects that compare equal are always output in the same order they were input). Array.Sort doesn't happen to be stable (it uses QuickSort internally), but there is still a bias whenever two items have the same value, depending on where they fall in the input (and specifically where they are relative to QuickSort's pivot).

As the keyspace for this random number increases, the probability of a collision goes down (given a good source of randomness), but keep in mind that as the number of values you are sorting goes up, the birthday paradox dictates that the likelihood of at least one pair among them colliding goes up very quickly.

For an integer key, there are 2^32 unique values, and even assuming a perfectly even distribution of random values, with 75,000 rows there is a 50% probability that there will be a collision (see the birthday problem article on Wikipedia).
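
(As a quick sanity check of my own: the standard birthday approximation p ≈ 1 - e^(-n^2 / 2N), with n = 75,000 and N = 2^32, gives 1 - e^(-0.65) ≈ 0.48, i.e. roughly 50%.)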

The cryptographic hash approach that you proposed potentially has a large enough keyspace (160 bits) to make the chance of a collision negligible, but your algorithm decomposes all of that randomness back down to a single int before actually doing the compare, which negates the benefit of the larger keyspace.

Your best approach is to associate a distinct "sortOrder" value with each of your data items, shuffle those values using a proven algorithm, and then order the results by that value.

If you are using Array.Sort, there is an overload that takes an array of "keys" and an array of "values". The keys array is sorted normally, but whenever a value in the keys array is moved, the corresponding entry in the values array is also moved.

Something like:


Something[] data; // populated somewhere
int[] keys = new int[data.Length]; // or long, if you might have lots of data
for (int i = 0; i < keys.Length; ++i)
{
    keys[i] = i;
}

Shuffle(keys); // e.g. the Fisher-Yates shuffle sketched earlier in this thread

Array.Sort(keys, data); // reorders 'data' to match the shuffled 'keys'
While accurate, this misses the spirit of the problem: 1) someone who just wants to swap in an IComparer implementation based on a user selection, without a lot of extra code for the 'special' random case; 2) an alternative that works and still uses IComparer, for all the 'bad' Google results.
Joel Coehoorn