views:

69

answers:

2

I am trying to develop a simple application in C# to count the number of duplicates in a listbox. I need to count all the number of duplicates and display a rank suffix to the top 3 elements most duplicated. For example, suppose a list has 7 elements called 'apple', 6 elements called 'pear', 4 elements called 'peach' and 3 elements called 'orange', after the process, it should display the list as:

apple (7)
pear (6)
peach (4)
orange
+2  A: 

Since we do not know the data source you are using, here is a generic LINQ example that could get you started.

string[] items = { "apple", "pear", "peach", "apple", "orange", "peach", "apple" };

var ranking = (from item in items
               group item by item into r
               orderby r.Count() descending
               select new { name = r.Key, rank = r.Count() }).Take(3);

This will return a collection of objects containing the name and rank of the top 3 items.

Of course you would replace the items array here with what every data source you are using to fill the ListBox, and if the items are not just simple strings but more complex items you would adjust the LINQ query appropiately.

Here is an example of the above which will fill a listbox with the data as in the form you showed.

  string[] items = { "apple", "pear", "peach", "apple", "orange", "peach", "apple" };

  var ranking = (from item in items
                 group item by item into r
                 orderby r.Count() descending
                 select new { name = r.Key, rank = r.Count() }).ToArray();

  for (int i = 0; i < ranking.Length; ++i)
  {
    var item = ranking[i];
    if (i < 3)
    {
      listBox1.Items.Add(string.Format("{0} ({1})", item.name, item.rank));
    }
    else
    {
      listBox1.Items.Add(item.name);
    }
  }

This does the same as the first example, but the transforms the results to an array and populates a listbox with the items with the first 3 items showing there rank.

Chris Taylor
+1 for solving the problem. I want to know what is it with you Linq guys and the compulsive need to call ToArray when it's completely and utterly unnecessary. You do realize that this allocates a new block of memory, performs an enumeration of the list, and copies each value--right? Linq is piss-poor slow enough, why needlessly make it even slower?
Tergiver
@Tergiver, :) well thank you for calling me a Linq guy, since I am specifically targetting questions that allow me to learn more about Linq, I do not naturally tend to use Linq. Even for this simple example I had to reference the 101 samples, but I learned more about Linq. Of course the ToArray is allocating an extra bock of memory, no arguing that and while I did consider that I just decided, right or wrong, that for this example it would not be an issue.
Chris Taylor
@Chris: Sorry to pick on you, I hadn't had my daily Linq-bashing rant yet today ;)
Tergiver
A: 

Here is an alternative method to using Linq, presented as a timed test to see which performs faster. These are the results I obtained with 1000 iterations:

Total words: 1324
Min        Max        Mean       Method
5305       22889      5739.182   LinkMethodToArray
5053       11973      5418.355   LinkMethod
3112       6726       3252.457   HashMethod

The LinkMethod is only about 1.6 times slower in this case. Not as bad as a lot of Linq code that I have performance tested, but it was only 1324 words.

Edit #1

That was before adding the sort. With the sort, you can see that it is comparible with the Linq method. Of course, copying the hash to a list and then sorting the list isn't the most efficient way to do this. We could improve on this. There are a couple of ways that come to mind, but none of them are simple and would require writing a lot of custom code.

Since we want to use what's already available and we want the code to be clear, I have to say that Linq is in fact a very good choice. This has changed my opinion of Linq.. a little. I've seen far too many other comparisons where Linq ends up disastrously slower (on the order of 1,000s of times slower) to give a green light to using Linq anywhere and everywhere, but certainly in this one place it shines very well.

I guess the moral is, as it always has been, test, test, test.

Here are the values with the sort added to HashMethod.

Total words: 1324
Min        Max        Mean       Method
5284       21030      5667.808   LinkMethodToArray
5081       36339      5425.626   LinkMethod
5017       27583      5288.602   HashMethod

Edit #2

A couple of simple optimizations (pre-initializing both the dictionary and the list) make HashMethod a bit noticably faster.

Total words: 1324
Min        Max        Mean       Method
5287       16299      5686.429   LinkMethodToArray
5081       21813      5440.758   LinkMethod
4588       8420       4710.659   HashMethod

Edit #3

With a larger word set, they become much more even. In fact, the Linq method seems to edge out every time. Here is the United States Constitution (All seven articles and signatures). This may be due to the fact that the declaration contains a lot of repeat words ("He has ...").

Total words: 4545
Min        Max        Mean       Method
13363      36133      14086.875  LinkMethodToArray
12917      26532      13668.914  LinkMethod
13601      19435      13836.955  HashMethod

Code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread.CurrentThread.Priority = ThreadPriority.Highest;

        // Declaration.txt is a copy of the Declaration of Independence
        // which can be found here: http://en.wikisource.org/wiki/United_States_Declaration_of_Independence
        string declaration = File.ReadAllText("Declaration.txt");
        string[] items = declaration.ToLower().Split(new char[] { ',', '.', ':', ';', '-', '\r', '\n', '\t', ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // pre-execute outside timing loop
        LinqMethodToArray(items);
        LinqMethod(items);
        HashMethod(items);

        int iterations = 1000;
        long min1 = long.MaxValue, max1 = long.MinValue, sum1 = 0;
        long min2 = long.MaxValue, max2 = long.MinValue, sum2 = 0;
        long min3 = long.MaxValue, max3 = long.MinValue, sum3 = 0;

        Console.WriteLine("Iterations: {0}", iterations);
        Console.WriteLine("Total words: {0}", items.Length);

        Stopwatch sw = new Stopwatch();

        for (int n = 0; n < iterations; n++)
        {
            sw.Reset();
            sw.Start();
            LinqMethodToArray(items);
            sw.Stop();
            sum1 += sw.ElapsedTicks;
            if (sw.ElapsedTicks < min1)
                min1 = sw.ElapsedTicks;
            if (sw.ElapsedTicks > max1)
                max1 = sw.ElapsedTicks;

            sw.Reset();
            sw.Start();
            LinqMethod(items);
            sw.Stop();
            sum2 += sw.ElapsedTicks;
            if (sw.ElapsedTicks < min2)
                min2 = sw.ElapsedTicks;
            if (sw.ElapsedTicks > max2)
                max2 = sw.ElapsedTicks;

            sw.Reset();
            sw.Start();
            HashMethod(items);
            sw.Stop();
            sum3 += sw.ElapsedTicks;
            if (sw.ElapsedTicks < min3)
                min3 = sw.ElapsedTicks;
            if (sw.ElapsedTicks > max3)
                max3 = sw.ElapsedTicks;
        }

        Console.WriteLine("{0,-10} {1,-10} {2,-10} Method", "Min", "Max", "Mean");
        Console.WriteLine("{0,-10} {1,-10} {2,-10} LinkMethodToArray", min1, max1, (double)sum1 / iterations);
        Console.WriteLine("{0,-10} {1,-10} {2,-10} LinkMethod", min2, max2, (double)sum2 / iterations);
        Console.WriteLine("{0,-10} {1,-10} {2,-10} HashMethod", min3, max3, (double)sum3 / iterations);
    }

    static void LinqMethodToArray(string[] items)
    {
        var ranking = (from item in items
                       group item by item into r
                       orderby r.Count() descending
                       select new { Name = r.Key, Rank = r.Count() }).ToArray();
        for (int n = 0; n < ranking.Length; n++)
        {
            var item = ranking[n];
            DoSomethingWithItem(item);
        }
    }

    static void LinqMethod(string[] items)
    {
        var ranking = (from item in items
                       group item by item into r
                       orderby r.Count() descending
                       select new { Name = r.Key, Rank = r.Count() });
        foreach (var item in ranking)
            DoSomethingWithItem(item);
    }

    static void HashMethod(string[] items)
    {
        var ranking = new Dictionary<string, int>(items.Length / 2);
        foreach (string item in items)
        {
            if (!ranking.ContainsKey(item))
                ranking[item] = 1;
            else
                ranking[item]++;
        }
        var list = new List<KeyValuePair<string, int>>(ranking);
        list.Sort((a, b) => b.Value.CompareTo(a.Value));
        foreach (KeyValuePair<string, int> pair in list)
            DoSomethingWithItem(pair);

    }

    static volatile object hold;
    static void DoSomethingWithItem(object item)
    {
        // This method exists solely to prevent the compiler from
        // optimizing use of the item away so that this program
        // can be executed in Release build, outside the debugger.
        hold = item;
    }
}
Tergiver
I don't want to seem to nit pick, especially because I totally agree with you here. But your HashMethod is not delivering the same result as the LINQ methods from the point of view that the Linq methods exlicitly order the results by rank, where the hash method skips that step.
Chris Taylor
Crap, you're right. I totally missed the boat there. I'll fix it to have it sort. Could this turn out to be a case where Linq excels? Tune in later to find out...
Tergiver
There it is. Linq works very well in this situation.
Tergiver