views:

314

answers:

11

I can't figure out a discrepancy between the time it takes for the Contains method to find an element in an ArrayList and the time it takes for a small function that I wrote to do the same thing. The documentation states that Contains performs a linear search, so it's supposed to be in O(n) and not any other faster method. However, while the exact values may not be relevant, the Contains method returns in 00:00:00.1087087 seconds while my function takes 00:00:00.1876165. It might not be much, but this difference becomes more evident when dealing with even larger arrays. What am I missing and how should I write my function to match Contains's performances?

I'm using C# on .NET 3.5.

public partial class Window1 : Window
{
    public bool DoesContain(ArrayList list, object element)
    {
        for (int i = 0; i < list.Count; i++)
            if (list[i].Equals(element)) return true;

        return false;
    }

    public Window1()
    {
        InitializeComponent();

        ArrayList list = new ArrayList();
        for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);

        Stopwatch sw = new Stopwatch();
        sw.Start();

        //Console.Out.WriteLine(list.Contains("zzz 9000000") + " " + sw.Elapsed);
        Console.Out.WriteLine(DoesContain(list, "zzz 9000000") + " " + sw.Elapsed);
    }
}

EDIT:

Okay, now, lads, look:

public partial class Window1 : Window
{
    public bool DoesContain(ArrayList list, object element)
    {
        int count = list.Count;
        for (int i = count - 1; i >= 0; i--)
            if (element.Equals(list[i])) return true;

        return false;
    }


    public bool DoesContain1(ArrayList list, object element)
    {
        int count = list.Count;
        for (int i = 0; i < count; i++)
            if (element.Equals(list[i])) return true;

        return false;
    }

    public Window1()
    {
        InitializeComponent();

        ArrayList list = new ArrayList();
        for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);

        Stopwatch sw = new Stopwatch();
        long total = 0;
        int nr = 100;

        for (int i = 0; i < nr; i++)
        {
            sw.Reset();
            sw.Start();
            DoesContain(list,"zzz");
            total += sw.ElapsedMilliseconds;
        }
        Console.Out.WriteLine(total / nr);


        total = 0;
        for (int i = 0; i < nr; i++)
        {
            sw.Reset();
            sw.Start();
            DoesContain1(list, "zzz");
            total += sw.ElapsedMilliseconds;
        }
        Console.Out.WriteLine(total / nr);


        total = 0;
        for (int i = 0; i < nr; i++)
        {
            sw.Reset();
            sw.Start();
            list.Contains("zzz");
            total += sw.ElapsedMilliseconds;
        }
        Console.Out.WriteLine(total / nr);
    }
  }

I made an average of 100 running times for two versions of my function(forward and backward loop) and for the default Contains function. The times I've got are 136 and 133 milliseconds for my functions and a distant winner of 87 for the Contains version. Well now, if before you could argue that the data was scarce and I based my conclusions on a first, isolated run, what do you say about this test? Not does only on average Contains perform better, but it achieves consistently better results in each run. So, is there some kind of disadvantage in here for 3rd party functions, or what?

+5  A: 

As you're using .NET 3.5, why are you using ArrayList to start with, rather than List<string>?

A few things to try:

  • You could see whether using foreach instead of a for loop helps
  • You could cache the count:

    public bool DoesContain(ArrayList list, object element)
    {
        int count = list.Count;
        for (int i = 0; i < count; i++)
        {
            if (list[i].Equals(element))
            {
                return true;
            }
            return false;
        }
    }
    
  • You could reverse the comparison:

    if (element.Equals(list[i]))
    

While I don't expect any of these to make a significant (positive) difference, they're the next things I'd try.

Do you need to do this containment test more than once? If so, you might want to build a HashSet<T> and use that repeatedly.

Jon Skeet
Or just `Equals(element, list[i])` :)
leppie
Isn’t caching `Count` rather counter-productive vis-à-vis bounds checks? I was under the impression that the JIT would elide these if it recognizes a loop over a range. And if it *also* starts inlining `Count`, hey, we’ve got a deal.
Konrad Rudolph
I suspect that using a `foreach` will have the biggest impact, since you then you never get the count at all, and avoid the cost of calculating `list[i]` every time through the loop.
JSBangs
Using a foreach does have the biggest impact. Negative impact, that is. It's twice as slow. However, caching the Count and reversing the comparison does help a bit, but still it doesn't account but for a small part of the difference.
luvieere
Foreach is fast for raw arrays. For arraylists, the extra layer of indirection is a killer.
Brian
Caching the count will be increasingly helpful as the arrays increase in size. Also, the code as written will return false unless `element` is the first item in the list (`return false` should be outside the second closing brace, so copying this code to try it will fail).
ehdv
@luvieere/Brian: Yes, I'd expect foreach to be slower. However, performance expectations are often incorrect, which is why I'd *try* it. @Konrad: I believe the JIT optimises for Count on arrays, but possibly not for `List<T>`. Again, I have no particular expectation, but I think each of these is at least worth trying.
Jon Skeet
@Jon I did try it, trust me I'm quite curious as to the reasons of my conundrum and I'm testing every idea that I find in these answers.
luvieere
A: 

Revised after reading comments:

It does not use some Hash-alogorithm to enable fast lookup.

Arve
No, it doesn't - as indicated by "The documentation states that Contains performs a linear search"
Jon Skeet
How do you figure that out?
Konrad Rudolph
According to the documentation, it is performing a linear search using the `Equals` and `CompareTo` methods of each element in the `ArrayList`.
William Brendel
+2  A: 

I'm not sure if you're allowed to post Reflector code, but if you open the method using Reflector, you can see that's it's essentially the same (there are some optimizations for null values, but your test harness doesn't include nulls).

The only difference that I can see is that calling list[i] does bounds checking on i whereas the Contains method does not.

Greg
A: 

My guess would be that ArrayList is written in C++ and could be taking advantage of some micro-optimizations (note: this is a guess).

For instance, in C++ you can use pointer arithmetic (specifically incrementing a pointer to iterate an array) to be faster than using an index.

tster
You can use [reflector](http://www.red-gate.com/products/reflector/) to decompile ArrayList and see all it's source code. It's not written in C++, it's written in a regular CLR based language. My guess is an early edition of C#.
Bevan
+1  A: 

With a really good optimizer there should not be difference at all, because the semantics seems to be the same. However the existing optimizer can optimize your function not so good as the hardcoded Contains is optimized. Some of the points for optimization:

  1. comparing to a property each time can be slower than counting downwards and comparing against 0
  2. function call itself has its performance penalty
  3. using iterators instead of explicit indexing can be faster (foreach loop instead of plain for)
Vlad
It looks like counting downwards and comparing against 0 did the trick. The time is down to 00:00:00.1416727 for my function. Thanks!
luvieere
Niiice! Frankly speaking, I didn't expect this old-school optimization to be still working.
Vlad
+11  A: 

First, you're not running it many times and comparing averages.

Second, your method isn't being jitted until it actually runs. So the just in time compile time is added into its execution time.

A true test would run each multiple times and average the results (any number of things could cause one or the other to be slower for run X out of a total of Y), and your assemblies should be pre-jitted using ngen.exe.

Will
Actually, I would say this is the right answer. The actual implementation is hardly different.
leppie
+1 for running it many times. If it were the difference between 187 and 108 seconds, *maybe* fewer runs would be fine; but with a difference of 79 milliseconds, you need more data points to make a solid conclusion.
Austin Salonen
A: 

Use SortedList<TKey,TValue>, Dictionary<TKey, TValue> or System.Collections.ObjectModel.KeyedCollection<TKey, TValue> for fast access based on a key.

var list = new List<myObject>(); // Search is sequential
var dictionary = new Dictionary<myObject, myObject>(); // key based lookup, but no sequential lookup, Contains fast
var sortedList = new SortedList<myObject, myObject>(); // key based and sequential lookup, Contains fast

KeyedCollection<TKey, TValue> is also fast and allows indexed lookup, however, it needs to be inherited as it is abstract. Therefore, you need a specific collection. However, with the following you can create a generic KeyedCollection.

public class GenericKeyedCollection<TKey, TValue> : KeyedCollection<TKey, TValue> {
   public GenericKeyedCollection(Func<TValue, TKey> keyExtractor) {
      this.keyExtractor = keyExtractor;
   }

   private Func<TValue, TKey> keyExtractor;

   protected override TKey GetKeyForItem(TValue value) {
      return this.keyExtractor(value);
   }
}

The advantage of using the KeyedCollection is that the Add method does not require that a key is specified.

Obalix
A: 

Actually Big-O notation is supposed to take the worst case scenario for time, not the best. So therefore I cannot see what the issue is in regards to this information. As far as I can see it works as anticipated since it did not take longer than your function :)

Woot4Moo
+1  A: 

First, if you are using types you know ahead of time, I'd suggest using generics. So List instead of ArrayList. Underneath the hood, ArrayList.Contains actually does a bit more than what you are doing. The following is from reflector:

public virtual bool Contains(object item)
{
    if (item == null)
    {
        for (int j = 0; j < this._size; j++)
        {
            if (this._items[j] == null)
            {
                return true;
            }
        }
        return false;
    }
    for (int i = 0; i < this._size; i++)
    {
        if ((this._items[i] != null) && this._items[i].Equals(item))
        {
            return true;
        }
    }
    return false;
}

Notice that it forks itself on being passed a null value for item. However, since all the values in your example are not null, the additional check on null at the beginning and in the second loop should in theory take longer.

Are you positive you are dealing with fully compiled code? I.e., when your code runs the first time it gets JIT compiled where as the framework is obviously already compiled.

Thomas
He's also calling ArrayList.Item[Int32] though, which does two integer checks `if ((index < 0) || (index >= this._size))`
Greg
A: 

using an array structure, you can't search faster than O(n) whithout any additional information. if you know that the array is sorted, then you can use binary search algorithm and spent only o(log(n)) otherwise you should use a set.

antony
sorry, my post is not applicable to the issue
antony
+1  A: 

Using the code below I was able to get the following timings relatively consitently (within a few ms):
1: 190ms DoesContainRev
2: 198ms DoesContainRev1
3: 188ms DoesContainFwd
4: 203ms DoesContainFwd1
5: 199ms Contains

Several things to notice here.

  1. This is run with release compiled code from the commandline. Many people make the mistake of benchmarking code inside the Visual Studio debugging environment, not to say anyone here did but something to be careful of.

  2. The list[i].Equals(element) appears to be just a bit slower than element.Equals(list[i]).

    using System;
    using System.Diagnostics;
    using System.Collections;
    
    
    namespace ArrayListBenchmark
    {
    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch sw = new Stopwatch();
            const int arrayCount = 10000000;
            ArrayList list = new ArrayList(arrayCount);
            for (int i = 0; i < arrayCount; i++) list.Add("zzz " + i);
        sw.Start();
        DoesContainRev(list, "zzz");
        sw.Stop();
        Console.WriteLine(String.Format("1: {0}", sw.ElapsedMilliseconds));
        sw.Reset();
    
    
    
    sw.Start();
    DoesContainRev1(list, "zzz");
    sw.Stop();
    Console.WriteLine(String.Format("2: {0}", sw.ElapsedMilliseconds));
    sw.Reset();
    
    
    sw.Start();
    DoesContainFwd(list, "zzz");
    sw.Stop();
    Console.WriteLine(String.Format("3: {0}", sw.ElapsedMilliseconds));
    sw.Reset();
    
    
    sw.Start();
    DoesContainFwd1(list, "zzz");
    sw.Stop();
    Console.WriteLine(String.Format("4: {0}", sw.ElapsedMilliseconds));
    sw.Reset();
    
    
    sw.Start();
    list.Contains("zzz");
    sw.Stop();
    Console.WriteLine(String.Format("5: {0}", sw.ElapsedMilliseconds));
    sw.Reset();
    
    
    Console.ReadKey();
    
    } public static bool DoesContainRev(ArrayList list, object element) { int count = list.Count; for (int i = count - 1; i >= 0; i--) if (element.Equals(list[i])) return true;
    return false;
    
    } public static bool DoesContainFwd(ArrayList list, object element) { int count = list.Count; for (int i = 0; i < count; i++) if (element.Equals(list[i])) return true;
    return false;
    
    } public static bool DoesContainRev1(ArrayList list, object element) { int count = list.Count; for (int i = count - 1; i >= 0; i--) if (list[i].Equals(element)) return true;
    return false;
    
    } public static bool DoesContainFwd1(ArrayList list, object element) { int count = list.Count; for (int i = 0; i < count; i++) if (list[i].Equals(element)) return true;
    return false;
    
    } } }
Firestrand
The loop for(int i = 0; i = 0; i--) wouldn't even compile?
Fadrian Sudaman
Yup, it was the debugging environment. I still get better times for `Count` but the difference is negligible.
luvieere
I'm not sure where there is a `for(int i = 0; i = 0; i--)` I believe all the code in my example is correct with `(int i = count - 1; i >= 0; i--)` sorry if I missed something.
Firestrand