tags:

views:

176

answers:

4

I've got a large set of data for which computing the sort key is fairly expensive. What I'd like to do is use the DSU pattern where I take the rows and compute a sort key. An example:

        Qty   Name      Supplier   
Row 1:   50   Widgets   IBM
Row 2:   48   Thingies  Dell
Row 3:   99   Googaws   IBM

To sort by Quantity and Supplier I could have the sort keys: 0050 IBM, 0048 Dell, 0099 IBM. The numbers are right-aligned and the text is left-aligned, everything is padded as needed.

If I need to sort by the Quanty in descending order I can just subtract the value from a constant (say, 10000) to build the sort keys: 9950 IBM, 9952 Dell, 9901 IBM.

How do I quickly/cheaply build a descending key for the alphabetic fields in C#?

[My data is all 8-bit ASCII w/ISO 8859 extension characters.]

Note: In Perl, this could be done by bit-complementing the strings:

 $subkey = $string ^ ( "\xFF" x length $string );

Porting this solution straight into C# doesn't work:

 subkey = encoding.GetString(encoding.GetBytes(stringval).
                      Select(x => (byte)(x ^ 0xff)).ToArray());

I suspect because of the differences in the way that strings are handled in C#/Perl. Maybe Perl is sorting in ASCII order and C# is trying to be smart?

Here's a sample piece of code that tries to accomplish this:

        System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
        List<List<string>> sample = new List<List<string>>() {
            new List<string>() { "", "apple", "table" },
            new List<string>() { "", "apple", "chair" },
            new List<string>() { "", "apple", "davenport" },
            new List<string>() { "", "orange", "sofa" },
            new List<string>() { "", "peach", "bed" },
        };
        foreach(List<string> line in sample)
        {
            StringBuilder sb = new StringBuilder();

            string key1 = line[1].PadRight(10, ' ');
            string key2 = line[2].PadRight(10, ' ');

            // Comment the next line to sort desc, desc
            key2 = encoding.GetString(encoding.GetBytes(key2).
                  Select(x => (byte)(x ^ 0xff)).ToArray());

            sb.Append(key2);
            sb.Append(key1);
            line[0] = sb.ToString();
        }

        List<List<string>> output = sample.OrderBy(p => p[0]).ToList();

        return;
A: 

Just write an IComparer that would work as a chain of comparators. In case of equality on each stage, it should pass eveluation to the next key part. If it's less then, or greater then, just return.

You need something like this:

int comparision = 0;

foreach(i = 0; i < n; i++)
{
 comparision = a[i].CompareTo(b[i]) * comparisionSign[i];

 if( comparision != 0 )
  return comparision;
}
return comparision;

Or even simpler, you can go with:

list.OrderBy(i=>i.ID).ThenBy(i=>i.Name).ThenByDescending(i=>i.Supplier);

The first call return IOrderedEnumerable<>, the which can sort by additional fields.

George Polevoy
I think this gets me back in the business of writing loops within the sort -- breaks the DSU pattern pretty badly. I'm beginning to think my real problem is that the OrderBy (in my example) is using C#'s sort order and I need something more like StringComparison.Ordinal.
clintp
You can swap string comparision with IComparer. Not sure what exactly DSU pattern gives you. But i'm sure it takes cpu cycles to decorate and undecorate. Also DSU internally would rely on a comparision cycle inside the characters of the string.
George Polevoy
The "even simpler" misses the point of the problem entirely. :(
clintp
@George: The DSU pattern allows you to compute the basis for the comparisons exactly *once* for every item in the list. In my case, those were expensive to come by and so much, much cheaper. The entire sort is reduced to string comparisons.
clintp
clintp: Doesn't Enumerable.OrderBy(key) only call `key` once per item and cache it? That is, doesn't OrderBy implement "the DSU pattern" internally already? http://stackoverflow.com/questions/254844/random-array-using-linq-and-c/254919#254919
Ken
A: 

Answering my own question (but not satisfactorily). To construct a descending alphabetic key I used this code and then appended this subkey to the search key for the object:

   if ( reverse )
        subkey = encoding.GetString(encoding.GetBytes(subkey)
                 .Select(x => (byte)(0x80 - x)).ToArray());
   rowobj.sortKey.Append(subkey);

Once I had the keys built, I couldn't just do this:

   rowobjList.Sort();

Because the default comparator isn't in ASCII order (which my 0x80 - x trick relies on). So then I had to write an IComparable<RowObject> that used the Ordinal sorting:

    public int CompareTo(RowObject other)
    {
        return String.Compare(this.sortKey, other.sortKey, 
                                 StringComparison.Ordinal);
    }

This seems to work. I'm a little dissatisfied because it feels clunky in C# with the encoding/decoding of the string.

clintp
+2  A: 

You can get to where you want, although I'll admit I don't know whether there's a better overall way.

The problem you have with the straight translation of the Perl method is that .NET simply will not allow you to be so laissez-faire with encoding. However, if as you say your data is all printable ASCII (ie consists of characters with Unicode codepoints in the range 32..127) - note that there is no such thing as '8-bit ASCII' - then you can do this:

            key2 = encoding.GetString(encoding.GetBytes(key2).
                Select(x => (byte)(32+95-(x-32))).ToArray());

In this expression I have been explicit about what I'm doing:

  • Take x (which I assume to be in 32..127)
  • Map the range to 0..95 to make it zero-based
  • Reverse by subtracting from 95
  • Add 32 to map back to the printable range

It's not very nice but it does work.

AakashM
I'll upvote even though you came to nearly the same answer I did, and a little bit later. At least you're looking at the right problem.
clintp
A: 

If a key computation is expensive, why compute a key at all? String comparision by itself is not free, it's actually expensive loop through the characters and is not going to perform any better then a custom comparision loop.

In this test custom comparision sort performs about 3 times better then DSU.

Note that DSU key computation is not measured in this test, it's precomputed.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace DSUPatternTest
{
 [TestClass]
 public class DSUPatternPerformanceTest
 {
  public class Row
  {
   public int Qty;
   public string Name;
   public string Supplier;
   public string PrecomputedKey;

   public void ComputeKey()
   {
    // Do not need StringBuilder here, String.Concat does better job internally.
    PrecomputedKey =
     Qty.ToString().PadLeft(4, '0') + " "
     + Name.PadRight(12, ' ') + " "
     + Supplier.PadRight(12, ' ');
   }

   public bool Equals(Row other)
   {
    if (ReferenceEquals(null, other)) return false;
    if (ReferenceEquals(this, other)) return true;
    return other.Qty == Qty && Equals(other.Name, Name) && Equals(other.Supplier, Supplier);
   }

   public override bool Equals(object obj)
   {
    if (ReferenceEquals(null, obj)) return false;
    if (ReferenceEquals(this, obj)) return true;
    if (obj.GetType() != typeof (Row)) return false;
    return Equals((Row) obj);
   }

   public override int GetHashCode()
   {
    unchecked
    {
     int result = Qty;
     result = (result*397) ^ (Name != null ? Name.GetHashCode() : 0);
     result = (result*397) ^ (Supplier != null ? Supplier.GetHashCode() : 0);
     return result;
    }
   }
  }

  public class RowComparer : IComparer<Row>
  {
   public int Compare(Row x, Row y)
   {
    int comparision;

    comparision = x.Qty.CompareTo(y.Qty);
                if (comparision != 0) return comparision;

    comparision = x.Name.CompareTo(y.Name);
                if (comparision != 0) return comparision;

    comparision = x.Supplier.CompareTo(y.Supplier);

    return comparision;
   }
  }

  [TestMethod]
  public void CustomLoopIsFaster()
  {
   var random = new Random();
   var rows = Enumerable.Range(0, 5000).Select(i =>
             new Row
              {
               Qty = (int) (random.NextDouble()*9999),
               Name = random.Next().ToString(),
     Supplier = random.Next().ToString()

              }).ToList();

   foreach (var row in rows)
   {
    row.ComputeKey();
   }

   var dsuSw = Stopwatch.StartNew();
   var sortedByDSU = rows.OrderBy(i => i.PrecomputedKey).ToList();
   var dsuTime = dsuSw.ElapsedMilliseconds;

   var customSw = Stopwatch.StartNew();
   var sortedByCustom = rows.OrderBy(i => i, new RowComparer()).ToList();
   var customTime = customSw.ElapsedMilliseconds;

   Trace.WriteLine(dsuTime);
   Trace.WriteLine(customTime);

   CollectionAssert.AreEqual(sortedByDSU, sortedByCustom);

   Assert.IsTrue(dsuTime > customTime * 2.5);
  }
 }
}

If you need to build a sorter dynamically you can use something like this:

var comparerChain = new ComparerChain<Row>()
.By(r => r.Qty, false)
.By(r => r.Name, false)
.By(r => r.Supplier, false);

var sortedByCustom = rows.OrderBy(i => i, comparerChain).ToList();

Here is a sample implementation of comparer chain builder:

public class ComparerChain<T> : IComparer<T>
    {
        private List<PropComparer<T>> Comparers = new List<PropComparer<T>>();

        public int Compare(T x, T y)
        {
            foreach (var comparer in Comparers)
            {
                var result = comparer._f(x, y);
                if (result != 0)
                    return result;
            }
            return 0;
        }
        public ComparerChain<T> By<Tp>(Func<T,Tp> property, bool descending) where Tp:IComparable<Tp>
        {
            Comparers.Add(PropComparer<T>.By(property, descending));
            return this;
        }
    }

    public class PropComparer<T>
    {
        public Func<T, T, int> _f;

        public static PropComparer<T> By<Tp>(Func<T,Tp> property, bool descending) where Tp:IComparable<Tp>
        {
            Func<T, T, int> ascendingCompare = (a, b) => property(a).CompareTo(property(b));
            Func<T, T, int> descendingCompare = (a, b) => property(b).CompareTo(property(a));
            return new PropComparer<T>(descending ?  descendingCompare : ascendingCompare);
        }

        public PropComparer(Func<T, T, int> f)
        {
            _f = f;
        }
    }

It works a little bit slower, maybe because of property binging delegate calls.

George Polevoy
Again, you're assuming that I have an object that I can query with .Qty and .Supplier that have methods on them, and that they're known at compile-time. This was only an example, as stated clearly. These could be pointers that reference tertiary offsite storage, data too large for memory that needs to be swapped in and out, data on magnetic tape, etc... And worse, the structure of the data is unknown at compile-time. Which way it's to be sorted is a *runtime* decision, the structure is determined at runtime also.
clintp
Not sure i understand how would you compute string representation of something that resides on a magnetic tape in your example.
George Polevoy