I have a line like the following in my code:

    potentialCollisionsX.Intersect(potentialCollisionsY).Distinct().ToList();

Through profiling, I have determined that it eats approximately 56 percent of my time. I need to figure out how to provide an efficient implementation. I tried

        List<Extent> probableCollisions = new List<Extent>();
        for (int j = 0; j < potentialCollisionsX.Count; j++)
        {
            if (potentialCollisionsY.Contains(potentialCollisionsX[j]) && !probableCollisions.Contains(potentialCollisionsX[j]))
            {
                probableCollisions.Add(potentialCollisionsX[j]);
            }
        }

but that only drops it to 42 percent. Optimizations or alternative ideas would be much appreciated.

Edit: Someone requested information about the Extent class, and I can't think of a better way to provide it than the class definition itself.

    private enum ExtentType { Start, End }
    private sealed class Extent
    {
        private ExtentType _type;
        public ExtentType Type
        {
            get
            {
                return _type;
            }
            set
            {
                _type = value;
                _hashcode = 23;
                _hashcode *= 17 + Nucleus.GetHashCode();
            }
        }
        private Nucleus _nucleus; //Nucleus is the main body class in my engine
        public Nucleus Nucleus
        {
            get
            {
                return _nucleus;
            }
            set
            {
                _nucleus = value;
                _hashcode = 23;
                _hashcode *= 17 + Nucleus.GetHashCode();
            }
        }

        private int _hashcode;

        public Extent(Nucleus nucleus, ExtentType type)
        {
            Nucleus = nucleus;
            Type = type;
            _hashcode = 23;
            _hashcode *= 17 + Nucleus.GetHashCode();
        }

        public override bool Equals(object obj)
        {
            return Equals(obj as Extent);
        }
        public bool Equals(Extent extent)
        {
            //nucleus.Equals does an int comparison
            return extent != null && this.Nucleus == extent.Nucleus;
        }
        public override int GetHashCode()
        {
            return _hashcode;
        }
    }

Edit2: It would seem that using hash sets makes this part of my code as performant as I need, so thanks for your help, guys!

+1  A: 

If you can't come up with a better solution, consider using unmanaged code as a last resort.

Konamiman
+2  A: 

Intersect returns distinct elements anyway, making the call to Distinct() unnecessary. That will be eating up at least some of your time.
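With that call removed, the line from the question simply becomes:

    potentialCollisionsX.Intersect(potentialCollisionsY).ToList();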

Also, do you actually need to call ToList? What are you then doing with the result?

Does the order matter? If not, you should consider using a HashSet<T> instead of a List<T> for your "manual" code. (And probably create a HashSet<T> for potentialCollisionsY as well.) This will make the Contains call faster, at least if the collections are large enough...

By the way, don't believe the documentation for Intersect - it's wrong about the order of operations (at least in .NET 3.5).

Jon Skeet
After testing a version without Distinct, the time consumption of this code dropped to 14 percent. More tips would be appreciated, but optimizing this code is no longer paramount. As for using a hash set, order is important, so I don't think I can use one.
RCIX
As for why I need ToList: I call RemoveAll directly afterwards, which IEnumerable doesn't seem to have.
RCIX
@RCIX: Just do a Where with the opposite condition to filter out what you don't want.
Jon Skeet
e.g. if you were doing `.RemoveAll(x => x.IsFoo)` you'd use `.Where(x => !x.IsFoo)`
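Applied to the question's pipeline, that might look like this sketch (IsFoo is a hypothetical stand-in for the real RemoveAll condition):

    var collisions = potentialCollisionsX.Intersect(potentialCollisionsY)
                                         .Where(x => !x.IsFoo); // hypothetical predicate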
Jon Skeet
+1  A: 

Try this:

HashSet<Extent> result = new HashSet<Extent>();
HashSet<Extent> potentialSetY = new HashSet<Extent>(potentialCollisionsY);
foreach (Extent ex in potentialCollisionsX)
    if (potentialSetY.Contains(ex))
        result.Add(ex);

Hash sets are good at doing Contains quickly, but don't preserve order.


If you need to preserve order, here's something a little more complicated: an ordered hash set. It uses normal hash set semantics (well, a dictionary, but it's the same thing), but before enumeration it reorders the items according to insertion order.

// Unchecked code

using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class OrderedHashSet<T> : IEnumerable<T> {
    int currentIndex = 0;
    // Maps each item to the index at which it was inserted.
    Dictionary<T, int> items = new Dictionary<T, int>();

    public bool Add(T item) {
        if (Contains(item))
            return false;
        items[item] = currentIndex++;
        return true;
    }

    public bool Contains(T item) {
        return items.ContainsKey(item);
    }

    public IEnumerator<T> GetEnumerator() {
        // Reorder by insertion index before enumerating.
        return items.Keys.OrderBy(key => items[key]).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() {
        return GetEnumerator();
    }
}

Now simply change HashSet to OrderedHashSet in the above sample and it should work.
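For example, a minimal sketch of that substitution, reusing the names from the earlier snippet (the class above has no collection constructor, so potentialSetY is filled with a loop):

    OrderedHashSet<Extent> potentialSetY = new OrderedHashSet<Extent>();
    foreach (Extent ex in potentialCollisionsY)
        potentialSetY.Add(ex);

    OrderedHashSet<Extent> result = new OrderedHashSet<Extent>();
    foreach (Extent ex in potentialCollisionsX)
        if (potentialSetY.Contains(ex))
            result.Add(ex); // enumerating result later yields insertion order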

configurator
Assuming the Extent class implements a good GetHashCode() and doesn't break the rule that if Extent1.Equals(Extent2)==true, then Extent1.GetHashCode()==Extent2.GetHashCode().
Vilx-
I need order preserved in those lists unfortunately.
RCIX
You could use a `List<T>` for `result` in the above code - but to make sure you don't get duplicates, you may need to *also* have a `HashSet<T>` of results you've already seen (that you can quickly look up).
Jon Skeet
Wait a minute, I might just not need the order preserved in this case... I think I just need an unordered list. I'll try this and see if it helps.
RCIX
Or you can use a new invention - the `OrderedHashSet<T>`.
configurator
+2  A: 

OK, I see the definition of the Extent class. First of all, it violates the rule that if obj1.Equals(obj2)==true then obj1.GetHashCode()==obj2.GetHashCode(). But that's beside the point and can be fixed (if you don't fix it, algorithms that depend on hashing, like HashSet, will fail).

Now, if the only operation you can perform on Extent objects is to compare them for equality, the worst-case performance cannot be brought below O(N*M) (where N is the size of the first collection and M is the size of the second). That's because you ultimately have to compare every element of one collection with every element of the other.

This can be made better by the use of GetHashCode() and the fact that objects with different hash codes must themselves be different. Other people have suggested using the HashSet class, which is exactly such a solution. The best-case performance would then be O(N+M), and the worst case O(N+N*M). On average, though, you should win, unless GetHashCode() is very poorly implemented and returns the same hash code for many objects.

I myself prefer a more stable solution. If the Extent class can be sorted reliably (that is, if you can compare two Extent objects to see which is bigger and which is smaller), then you can sort both lists, and the performance can be brought down to O(sorting+M+N). The idea is that once the lists are sorted, you can walk through them both simultaneously and look for equal elements, as the sketch below shows.
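Here is a minimal sketch of that simultaneous walk, assuming Extent implements IComparable<Extent> consistently with its Equals (the question's class only defines equality, so that interface is an assumption):

    // Both lists must already be sorted by the same comparison.
    static List<Extent> SortedIntersect(List<Extent> xs, List<Extent> ys)
    {
        List<Extent> result = new List<Extent>();
        int i = 0, j = 0;
        while (i < xs.Count && j < ys.Count)
        {
            int cmp = xs[i].CompareTo(ys[j]);
            if (cmp < 0)
                i++;            // xs[i] is smaller; advance the left list
            else if (cmp > 0)
                j++;            // ys[j] is smaller; advance the right list
            else
            {
                Extent match = xs[i];
                result.Add(match);
                // Skip equal runs on both sides so the result stays distinct.
                while (i < xs.Count && xs[i].CompareTo(match) == 0) i++;
                while (j < ys.Count && ys[j].CompareTo(match) == 0) j++;
            }
        }
        return result;
    }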

Now, the sorting performance is the tricky part. If you only implement the comparison operation (the IComparable interface), you can sort both lists in O(N*logN+M*logM) time; the standard List.Sort() method will do that for you. All in all, the total cost would be O(N*logN+M*logM+N+M). Note, however, that this uses the QuickSort algorithm, which performs poorly on nearly-sorted input; its worst case, a completely sorted list, degrades to O(N*N). If your lists are close to sorted already, you should consider another sorting algorithm (and implement it yourself).

The ultimate in reliable speed would be to convert each Extent to an integer (or, more generally, some string) with the property that equal strings mean equal Extents, and unequal strings mean unequal Extents. The thing with strings is that they can be sorted in linear time with algorithms like radix sort, or stored in a radix tree. Then the sorting would take only O(N+M). In fact, if you constructed a radix tree you would only have to insert the first list, and you could then look strings up in it directly (with every lookup taking time proportional to the key length). All in all, the total performance would be O(N+M), which is the best available.

One thing you should always keep in mind, though: big algorithms have big constants. The radix approach might look the best on paper, but it is quite tricky to implement and generally slower than the simpler approaches for small amounts of data. Only if your lists have elements in the thousands or tens of thousands should you start to think about it. Also, these algorithms require creating a lot of new objects, and the cost of each new() operation becomes significant, so you should think carefully about minimizing the number of allocations.

Vilx-
+1 Hmm I should have read this post carefully before posting my own :)
Will
I already implemented insertion sort for my list.
RCIX
A: 

Two approaches:

Put the items in a hashmap if they are not there already; if an item is already there, mark it in the hashmap as duplicated. This is O(n). You then iterate over all items in the hashmap and check whether each one is marked as duplicated or not - O(n) again (sketched below).
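A minimal sketch of that idea, using the question's Extent type (the exact shape here is an assumption; the dictionary value records whether an item turned up in both lists):

    Dictionary<Extent, bool> seen = new Dictionary<Extent, bool>();
    foreach (Extent ex in potentialCollisionsX)
        seen[ex] = false;                // present in X only, so far
    foreach (Extent ex in potentialCollisionsY)
        if (seen.ContainsKey(ex))
            seen[ex] = true;             // present in both lists

    List<Extent> intersection = new List<Extent>();
    foreach (KeyValuePair<Extent, bool> pair in seen)
        if (pair.Value)
            intersection.Add(pair.Key);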

Another approach:

Sort the two lists. This is an O(n lg n) operation, but crucially you might be able to keep the two lists sorted at all times, in which case the cost is not incurred at the moment you look for the intersection.

Then go through the two lists in order, finding the distinct and duplicated entries. This is O(n).

Will
I think this is what I do now...
RCIX
No, you sort only one list (make a HashSet) and then search in it.
Vilx-
Vilx, correct me if I'm wrong, but building the HashSet is O(N) and searching it for N items is O(1) each, so it's O(N) overall.
Will
Yes... did I say otherwise?
Vilx-