views: 95

answers: 6

When an IEnumerable needs both to be sorted and to have elements removed, are there advantages/drawbacks to performing the stages in a particular order? My performance tests appear to indicate that it's irrelevant.

A simplified (and somewhat contrived) example of what I mean is shown below:

public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    // Sort and RemoveAll are List<T> methods, so the items are materialised into a list first.
    List<DataItem> result = this.GetDataItems().ToList();

    result.Sort(sortOrder);
    result.RemoveAll(item => !item.Display);

    return result.Take(maximum);
}
+5  A: 

If your tests indicate it's irrelevant, then why worry about it? Don't optimize before you need to, only when it becomes a problem. If you find a problem with performance, have used a profiler, and have found that this method is the hotspot, then you can worry more about it.

On second thought, have you considered using LINQ? Those calls could be replaced with calls to Where and OrderBy, both of which are deferred, followed by Take, like you have in your example. The LINQ libraries should find the best way of doing this for you, and if your data size expands to the point where it takes a noticeable amount of time to process, you can use PLINQ with a simple call to AsParallel.
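As a rough sketch (assuming the DataItem type and the parameterless GetDataItems() overload from the question), the whole pipeline could be written as deferred LINQ with an optional AsParallel step:

public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder)
{
    // Requires using System.Linq. Where and OrderBy are deferred, so nothing runs
    // until the result is enumerated; AsParallel() turns the pipeline into a PLINQ query.
    return this.GetDataItems()
               .AsParallel()
               .Where(item => item.Display)
               .OrderBy(item => item, sortOrder)
               .Take(maximum);
}

For small collections the AsParallel() call can simply be dropped and the rest of the pipeline stays the same.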

nasufara
I don't want to suggest that this isn't good *general* advice (it is), but I really don't like it when people respond in this way to specific questions where the OP is demonstrating a curiosity and there is a simple and straightforward answer (see ironic's response, for example). If the OP is curious to know something and you reply by saying "don't worry about it," you're really not helping him/her to learn or understand anything new.
Dan Tao
Thanks, I agree with every word. I just have a nagging thought that someone once said to me that it should be done in a particular order.
Jibberish
@Jibberish See my edits about LINQ.
nasufara
Even with LINQ, though, there can be a performance difference depending on whether one does Where or OrderBy first; the calls may be reordered for you (particularly on an IQueryable), but they may not.
Jon Hanna
+5  A: 

You might as well RemoveAll before sorting so that you'll have fewer elements to sort.

SLaks
+2  A: 

I think that the Sort() method would usually have complexity of O(n*log(n)), and RemoveAll() just O(n), so in general it is probably better to remove items first.

ironic
Time complexity is irrelevant, as O(n*log(n)) followed by O(n) will take the same as O(n) followed by O(n*log(n)). The reason RemoveAll first can be more performant is that it means it is O(n) followed by O(m*log(m)) where m <= n, because it has had some items removed.
Jon Hanna
This is exactly what I meant, just assuming n changing between first O and second O ;)
ironic
+1  A: 

It would probably be better to do the RemoveAll first, although it would only make much of a difference if your sorting comparison was expensive to calculate.

tKe
+2  A: 

You'd want something like this:

public IEnumerable<DataItem> GetDataItems(int maximum, IComparer<DataItem> sortOrder) 
{ 
    IEnumerable<DataItem> result = this.GetDataItems(); 
    return result
           .Where(item => item.Display)
           .OrderBy(item => item, sortOrder)  // OrderBy takes a key selector; the comparer goes in the second argument
           .Take(maximum); 
} 
Gabe
+2  A: 

There are two answers that are correct, but won't teach you anything:

  1. It doesn't matter.
  2. You should probably do RemoveAll first.

The first is correct because you said your performance tests showed it's irrelevant. The second is correct because it will have an effect on larger datasets.

There's a third answer that also isn't very useful: Sometimes it's faster to do removals afterwards.

Again, it doesn't actually tell you anything, but "sometimes" always means there is more to learn.

There's also only so much value in saying "profile first". What if profiling shows that 90% of the time is spent in x.Foo(), which is called in a loop? Is the problem with Foo(), with the loop, or with both? Obviously if we can make both more efficient we should, but how do we reason about that without knowledge beyond what a profiler tells us?

When something happens over multiple items (which is true of both RemoveAll and Sort) there are five things (I'm sure there are more I'm not thinking of now) that will affect the performance impact:

  1. The per-set constant costs (both time and memory). How much it costs to do things like calling the function that we pass a collection to, etc. These are almost always negligible, but there could be some nasty high cost hidden there (often because of a mistake).
  2. The per-item constant costs (both time and memory). How much it costs to do something that we do on some or all of the items. Because this happens multiple times, there can be an appreciable win in improving them.
  3. The number of items. As a rule the more items, the more the performance impact. There are exceptions (next item), but unless those exceptions apply (and we need to consider the next item to know when that is the case), this will be important.
  4. The complexity of the operation. Again, this is a matter of both time-complexity and memory-complexity, but here there's a chance that we might choose to improve one at the cost of the other. I'll talk about this more below.
  5. The number of simultaneous operations. This can be a big difference between "works on my machine" and "works on the live system". If a super time-efficient approach that uses .5GB of memory is tested on a machine with 2GB of memory available, it'll work wonderfully, but when you move it to a machine with 8GB of memory available and multiple concurrent users, it'll hit a bottleneck at 16 simultaneous operations, and suddenly what was beating other approaches in your performance measurements becomes the application's hotspot.

To talk about complexity a bit more. Time complexity is a measure of how the time taken to do something relates to the number of items it is done with, while memory complexity is a measure of how the memory used relates to that same number of items. Obtaining an item from a dictionary is O(1) or constant because it takes the same amount of time however large the dictionary is (not strictly true; strictly it "approaches" O(1), but it's close enough for most thinking). Finding something in an already sorted list can be O(log₂ n) or logarithmic. Filtering through a list will be linear or O(n). Sorting something using a quicksort (which is what Sort uses) tends to be linearithmic or O(n log₂ n), but in its worst case - against a list already sorted - will be quadratic O(n²).

Considering these, with a set of 8 items, an O(1) operation will take 1k seconds to do something, where k is a constant amount of time; O(log₂ n) means 3k seconds, O(n) means 8k, O(n log₂ n) means 24k, and O(n²) means 64k. These are the most commonly found, though there are plenty of others like O(nm), which is affected by two different sizes, or O(n!), which would be 40320k.

Obviously, we want as low a complexity as possible, though since k will be different in each case, sometimes the best solution for a small set has a higher complexity (but a low k constant), while a lower-complexity approach will beat it with larger input.

So. Let's go back to the cases you are considering, viz filtering followed by sorting vs. sorting followed by filtering.

  1. Per-set constants. Since we are moving two operations around but still doing both, this will be the same either way.
  2. Per-item constants. Again, we're still doing the same things per item in either case, so no effect.
  3. Number of items. Filtering reduces the number of items, so the sooner we filter, the more efficient the rest of the operation is. Doing RemoveAll first wins in this regard.
  4. Complexity of the operation. It's either an O(n) followed by an average-case-O(n log₂ n)-worst-case-O(n²) sort, or it's an average-case-O(n log₂ n)-worst-case-O(n²) sort followed by an O(n). The complexity classes are the same either way (there's a rough cost sketch just after this list).
  5. Number of simultaneous cases. Total memory pressure will be relieved the sooner we remove some items (a slight win for RemoveAll first).
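The rough cost sketch mentioned in point 4 (with made-up per-item constants, not measured figures): say filtering costs c1 per item, each comparison in the sort costs c2, and m of the n items survive the filter. Then, approximately:

filter first: c1*n + c2*m*log(m)
sort first:   c2*n*log(n) + c1*n

Since m <= n, the filter-first total can never be the larger of the two, assuming the same constants apply in either order.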

So, we've got two reasons to consider RemoveAll first as likely to be more efficient and none to consider it likely to be less efficient.

We should not assume that we're 100% guaranteed to be correct here. For a start, we could simply have made a mistake in our reasoning. For another, there could be other factors we've dismissed as irrelevant that were actually pertinent. It is still true that we should profile before optimising, but reasoning about the sort of things I've mentioned above will both make us more likely to write performant code in the first place (not the same as optimising, but a matter of picking between options when readability, clarity and correctness are equal either way) and make it easier to find likely ways to improve those things that profiling has found to be troublesome.
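As a minimal sketch of the kind of side-by-side measurement being discussed (it assumes a List<DataItem> source, the Display flag, and an IComparer<DataItem> sortOrder as in the question; none of these names come from the original code):

var sw = System.Diagnostics.Stopwatch.StartNew();

var filterFirst = new List<DataItem>(source);   // copy so both runs see the same data
filterFirst.RemoveAll(item => !item.Display);   // one O(n) pass over all items
filterFirst.Sort(sortOrder);                    // sorts only the items that survived
var filterFirstTime = sw.Elapsed;

sw.Restart();
var sortFirst = new List<DataItem>(source);
sortFirst.Sort(sortOrder);                      // sorts every item, removed or not
sortFirst.RemoveAll(item => !item.Display);     // one O(n) pass over all items
var sortFirstTime = sw.Elapsed;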

For a slightly different but relevant case, consider what happens if the criterion sorted on matches the one removed on, e.g. if we were to sort by date and remove all items after a given date.

In this case, if the list deallocates on every removal, it'll still be O(n), but with a much smaller constant. Alternatively, if it just moved the "last-item" pointer*, it becomes O(1). Finding that pointer is O(log₂ n), so here there are both reasons to consider that filtering first will be faster (the reasons given above) and reasons to consider that sorting first will be faster (the removal can be made a much faster operation than it was before). With this sort of case it becomes possible to tell only by extending our profiling.

It is also true that the performance will be affected by the type of data sent, so we need to profile with realistic data rather than artificial test data, and we may even find that what was the more performant choice becomes the less performant choice months later when the dataset it is used on changes. Here the ability to reason becomes even more important, because we should note the possibility that changes in real-world use may flip which order is better, and know that it is something we need to keep an eye on throughout the project's life.

(*Note, List<T> does not just move a last-item pointer for a RemoveRange that covers the last item, but another collection could.)
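To illustrate that sort-first variant, here is a sketch; it assumes a DataItem.Date property and a DateTime cutoffDate, neither of which appears in the original question:

List<DataItem> items = this.GetDataItems().ToList();
items.Sort((a, b) => a.Date.CompareTo(b.Date));   // sort on the same key we remove on

// Binary search for the first index whose Date falls after the cutoff:
// O(log n) on the already-sorted list.
int lo = 0, hi = items.Count;
while (lo < hi)
{
    int mid = (lo + hi) / 2;
    if (items[mid].Date <= cutoffDate) lo = mid + 1;
    else hi = mid;
}

// A single RemoveRange call at the tail instead of removing item by item. As the note
// above says, List<T> still clears the removed slots rather than just moving a
// last-item pointer, but another collection could make this step O(1).
items.RemoveRange(lo, items.Count - lo);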

Jon Hanna
+1 for length...
SLaks
@SLaks, well I would hope for detail rather than length. The length is a bad point.
Jon Hanna
Thanks Jon, that's a lot of useful detail; will read again later in a quiet room so I can understand it better.
Jibberish