views:

1614

answers:

4

I have 2 IEnumerable<int>

IEnumerable<int> x;
IEnumerable<int> y;

What is the best best way to determine if any int in y is present in the x?
Currently I'm using:

return x.Intersect<int>(y).Count() > 0;

Would it be significantly faster to loop through and test each individually?

foreach (int i in x)
{
    foreach (int j in y)
    {
        if (i == j) return true;
    }
}
return false;

The lists are relatively light, with not more than 50 ints in x and 4 in y if that matters in the consideration.

+8  A: 

It would be fastest to use Any instead of Count:

return x.Intersect<int>(y).Any();

This assumes that the IEnumerable<int> doesn't implement ICollection<int>. In that case, Count (in the case where IEnumerable<T> implements ICollection<T>) is an O(N) operation while Any is always an O(1) operation. However, the behavior of Count is an implementation detail, and you shouldn't count on that.

Coincidentally, I've just posted about this on my website:

http://www.caspershouse.com/post/Anything-Counts.aspx

casperOne
+3  A: 

EDIT: The original answer below is really dealing in terms of complexity. If the sequences are short enough, all the calls through GetNext(), building a HashSet etc will actually be more expensive than using Intersect(y).Any(). However, in that case both calls will be really pretty quick anyway.

In other words, Intersect(y).Any() definitely scales better as the two sequences get longer, but for if you're absolutely sure that the sequences are short, the nested loop solution will be faster.

Original answer

No, Intersect() will be faster than the two-loop solution - but as CasperOne wrote, Any() will be faster than Count() because it will exit as soon as it sees an element.

Assuming sequences of length N and M, Intersect will be O(N + M) whereas the two-loop solution is O(N * M).

Intersect builds up a HashSet from the "inner" sequence (this takes O(M) complexity) and then loops through the outer sequence (which takes O(N) complexity), yielding any element which is in the set. These results are streamed - the inner sequence will be evaluated when the first element is requested from Intersect(), but this only goes as far as finding the first match (if any). Using Any() you'll have an "early out" if there are any matches, so we don't need to take the overall number of matches into consideration when working out the complexity.

Streaming results from LINQ rocks - it's well worth getting your head round (as well as deferred execution).

Jon Skeet
Without me starting my profiler, could you explain why? Seems the two loops would cost less than the count?
chris
See my edit for the explanation :)
Jon Skeet
+1  A: 

You're better of doing this:

return x.Intersect(y).Any();

This will abort as soon as it finds a single match, and stop enumerating through the collections.

Reed Copsey
Wouldn't it do the intersect first, then check if there was any element?
JoshBerke
@Josh: No, due to streaming the results.
Jon Skeet
Interesting, thanks.
JoshBerke
Josh: Look in MSDN for "Deferred Execution" in LINQ. It's worth taking the time to understand how and why it works, and how you can use it to your advantage.
Reed Copsey
@Reed: It's not deferred execution as such - it's streaming the results. They're somewhat different. For instance, OrderBy is deferred but as soon as it's asked for the first result it has to process *all* of it. It then yields the results one by one, of course. Intersect is half stream half buffer.
Jon Skeet
@Jon Skeet: Thanks for the clarification.
Reed Copsey
+2  A: 

Intersect is okay, but as others have said I wouldn't call .Count() on the result.

The reason is that Intersect does not create the intersection of the two lists. It creates an IEnumerable that is capable of enumerating that intersection, but it doesn't actually enumerate those results yet. Most of the work is deferred until the time when you finally iterate over this enumeration.

The problem with Count is that it does iterate over the entire enumeration. So not only does it always count all the results, but it causes all of the work involved in computing those results to run as well.

Calling Any instead will be very fast by comparison, because you will compute at most one intersection result before returning. Of course, in the case where there are no matches it will still need to iterate the entire list. However, that's no worse off than you were before. In fact, it's still faster because as Jon Skeet mentioned, the Intersect function uses a HashSet to compute the results rather than iterating over everything. Your best and average cases are improved tremendously.

It's like the difference between these two snippets:

int count = 0;
foreach (int i in x)
{
   foreach (int j in y)
   {
      if (i==j) count++;
   }
}
return (count > 0);

.

// this one should look familiar
foreach (int i in x)
{
    foreach (int j in y)
    {
       if (i==j) return true;
    }
}
return false;

Obviously the 2nd is much faster on average. The performance of Any() would be analogous (not the same as, thanks to the HashSet) the 2nd snippet, while Count() would be analogous to the first.

Joel Coehoorn