views: 370
answers: 3

The .NET Framework has an Array.Sort overload that lets one specify the starting index and length of the range to sort. However, these parameters are only 32-bit, so I don't see a way to sort part of a large array when the indices that describe the sort range can only be expressed as 64-bit numbers. I suppose I could copy and modify the framework's sort implementation, but that is not ideal.

Update:

I've created two classes to help me work around these and other large-array issues. One such issue was that long before I reached my memory limit, I started getting OutOfMemoryExceptions. I'm assuming this is because the requested memory may be available but not contiguous. So I created the class BigArray, a generic, dynamically sizable list of fixed-size arrays. It has a smaller memory footprint than the framework's generic list class and does not require that the entire array be contiguous. I haven't measured the performance hit, but I'm sure it's there.

  using System;
  using System.Collections.Generic;
  using System.Diagnostics;

  public class BigArray<T> : IEnumerable<T>
  {
    private long capacity;
    private int itemsPerBlock;
    private int shift;
    private List<T[]> blocks = new List<T[]>();

    public BigArray(int itemsPerBlock)
    {
      // Round the block size up to a power of two so the indexer can
      // use a shift and a mask instead of division and modulo.
      shift = (int)Math.Ceiling(Math.Log(itemsPerBlock) / Math.Log(2));
      this.itemsPerBlock = 1 << shift;
    }

    public long Capacity
    {
      get
      {
        return capacity;
      }
      set
      {
        // Grow or shrink the block list to cover the requested capacity.
        var requiredBlockCount = (value + itemsPerBlock - 1) / itemsPerBlock;
        while (blocks.Count > requiredBlockCount)
        {
          blocks.RemoveAt(blocks.Count - 1);
        }
        while (blocks.Count < requiredBlockCount)
        {
          blocks.Add(new T[itemsPerBlock]);
        }
        capacity = (long)itemsPerBlock * blocks.Count;
      }
    }

    public T this[long index]
    {
      get
      {
        Debug.Assert(index >= 0 && index < capacity);
        var blockNumber = (int)(index >> shift);
        var itemNumber = (int)(index & (itemsPerBlock - 1));
        return blocks[blockNumber][itemNumber];
      }
      set
      {
        Debug.Assert(index >= 0 && index < capacity);
        var blockNumber = (int)(index >> shift);
        var itemNumber = (int)(index & (itemsPerBlock - 1));
        blocks[blockNumber][itemNumber] = value;
      }
    }

    public IEnumerator<T> GetEnumerator()
    {
      for (long i = 0; i < capacity; i++)
      {
        yield return this[i];
      }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
      return this.GetEnumerator();
    }

  }
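
For illustration, here's a hypothetical usage sketch (not part of the class above; the sizes are invented). Note that Capacity must be set before indexing, since the indexer does not grow the storage on its own:

  // Hypothetical usage: ~5 billion floats (about 20 GB, illustrative only),
  // split into 2^20-element blocks.
  var big = new BigArray<float>(1 << 20);
  big.Capacity = 5000000000;
  big[4999999999] = 3.14f;
  Console.WriteLine(big[4999999999]);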

And getting back to the original issue of sorting: what I really needed was a way to act on each element of an array in sorted order. But with such large arrays, it is prohibitive to copy the data, sort it, act on it, and then discard the sorted copy (the original order must be maintained). So I created the static class OrderedOperation, which lets you perform an arbitrary operation on each element of an unsorted array in sorted order, with a low memory footprint (trading execution time for memory here).

  using System;
  using System.Collections.Generic;
  using System.Diagnostics;

  public static class OrderedOperation
  {
    public delegate void WorkerDelegate(int index, float progress);

    public static void Process(WorkerDelegate worker, IEnumerable<int> items, int count, int maxItem, int maxChunkSize)
    {
      // Build a histogram of the item values, doubling the bin count until
      // no single bin holds more than one chunk's worth of items.
      // Note: items is enumerated several times, so it must be repeatable.
      // Caveat: if one value occurs more than maxChunkSize times, its bin
      // can never shrink below maxChunkSize and this loop will not terminate.
      int binCount = 1000;
      int[] bins;
      double binScale;
      bool ok;
      do
      {
        ok = true;
        bins = new int[binCount];
        binScale = (double)(binCount - 1) / maxItem;
        int i = 0;
        foreach (int item in items)
        {
          bins[(int)(binScale * item)]++;
          if (++i == count)
          {
            break;
          }
        }
        for (int b = 0; b < binCount; b++)
        {
          if (bins[b] > maxChunkSize)
          {
            ok = false;
            binCount *= 2;
            break;
          }
        }
      } while (!ok);

      var chunkData = new int[maxChunkSize];
      var chunkIndex = new int[maxChunkSize];
      var done = new System.Collections.BitArray(count);
      var processed = 0;
      var binsCompleted = 0;
      while (binsCompleted < binCount)
      {
        // Greedily take as many consecutive bins as will fit in one chunk.
        var sum = 0;
        do
        {
          sum += bins[binsCompleted];
          binsCompleted++;
        } while (binsCompleted < binCount - 1 && sum + bins[binsCompleted] <= maxChunkSize);
        Debug.Assert(sum <= maxChunkSize);

        // Every item strictly below chunkMax falls in a completed bin.
        var chunkMax = (int)Math.Ceiling((double)binsCompleted / binScale);

        // Gather the not-yet-processed items belonging to this chunk,
        // remembering each item's original index.
        var chunkCount = 0;
        int i = 0;
        foreach (int item in items)
        {
          if (item < chunkMax && !done[i])
          {
            chunkData[chunkCount] = item;
            chunkIndex[chunkCount] = i;
            chunkCount++;
            done[i] = true;
          }
          if (++i == count)
          {
            break;
          }
        }
        Debug.Assert(sum == chunkCount);

        // Sort just this chunk, keeping the original indices in step, then
        // hand each original index to the worker in ascending value order.
        Array.Sort(chunkData, chunkIndex, 0, chunkCount);
        for (i = 0; i < chunkCount; i++)
        {
          worker(chunkIndex[i], (float)processed / count);
          processed++;
        }
      }
      Debug.Assert(processed == count);
    }
  }
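
A hypothetical usage sketch (the data and parameters are invented for illustration): the worker receives original indices in ascending value order, while the source data is never reordered. This prints the values 3, 7, 7, 19, 42, 88 along with their original positions:

  var data = new[] { 42, 7, 19, 3, 88, 7 };
  OrderedOperation.Process(
    (index, progress) => Console.WriteLine("data[{0}] = {1}", index, data[index]),
    data,        // items: must be enumerable repeatedly
    data.Length, // count
    88,          // maxItem: the largest value present
    3);          // maxChunkSize: sort at most 3 items at a time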

The two classes can work together (that's how I use them), but they don't have to. I hope someone else finds them useful. But I'll admit, they are fringe-case classes. Questions welcome. And if my code sucks, I'd like to hear tips, too.

One final thought: As you can see in OrderedOperation, I'm using ints and not longs. Currently that is sufficient for me despite the original question I had (the application is in flux, in case you can't tell). But the class should be able to handle longs as well, should the need arise.

A: 

Seems like if you are sorting more than 2^32 elements, it would be best to write your own, more efficient sort algorithm anyway.

Annath
He has (potentially) many elements, but wants to sort just a subset, not the entire array. Array.Sort would work for the whole array.
Reed Copsey
@Annath: If something is efficient for 2^32 elements (quicksort), then it is probably also efficient for more than 2^32 elements. I doubt that I can do better than quicksort. @Reed: I suppose I will have to branch on the size of the indices. If they are small enough, use the Sort overload that lets me specify the bounds. If they are not, sort the whole array and accept the performance hit of sorting more than I needed to.
Fantius
+1  A: 

Since Array.Copy takes Int64 params, you could pull out the section you need to sort, sort it, then put it back. Assuming you're sorting less than 2^32 elements, of course.
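
A minimal sketch of this copy-sort-copy-back approach (my illustration, not code from the answer), assuming the segment itself fits in an ordinary array even though its starting offset may not fit in an Int32:

static void SortSegment<T>(T[] huge, long start, int length)
{
    var segment = new T[length];
    Array.Copy(huge, start, segment, 0, length);   // Int64 overload
    Array.Sort(segment);
    Array.Copy(segment, 0, huge, start, length);
}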

Robert
I'm surprised that they have array.Get(long) but not array.Sort(long,long).
Jared Updike
A reasonable solution if you can spare the memory for the copy. In my case, I cannot.
Fantius
Also, your suggestion need not be restricted to sorting less than 2^32 elements.
Fantius
+5  A: 

You'll find that even on the 64-bit framework, the maximum number of elements in an array is int.MaxValue.

The existing methods that take or return Int64 just cast the long values to Int32 internally and, in the case of parameters, will throw an ArgumentOutOfRangeException if a long parameter isn't between int.MinValue and int.MaxValue.

For example the LongLength property, which returns an Int64, just casts and returns the value of the Length property:

public long LongLength
{
    get { return (long)this.Length; }    // Length is an Int32
}

So my suggestion would be to cast your Int64 indices to Int32 and then call one of the existing Sort overloads.
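
For example (my sketch, not code from the answer), a checked cast makes the narrowing explicit before calling the existing overload:

static void SortRange(int[] array, long index, long length)
{
    // checked() throws OverflowException if the values don't fit in Int32.
    Array.Sort(array, checked((int)index), checked((int)length));
}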

LukeH
Hmm. I did not expect that. I figured the presence of LongLength implied that arrays with more than Int32.MaxValue elements could exist. Can you provide a reference for your claim?
Fantius
@fantius, I haven't been able to find any references confirming this behaviour, so I suspect it's an implementation detail that could change at some point. You can check it yourself by examining the Array class using Reflector: http://www.red-gate.com/products/reflector/
LukeH