tags:
views: 143
answers: 2

Can anyone please explain why the third query below is orders of magnitude slower than the others when it oughtn't to take any longer than doing the first two in sequence?

var data = Enumerable.Range(0, 10000).Select(x => new { Index = x, Value = x + " is the magic number"}).ToList();
var test1 = data.Select(x => new { Original = x, Match = data.Single(y => y.Value == x.Value) }).Take(1).Dump();
var test2 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == x.Index) }).Take(1).Dump();
var test3 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1).Dump();

EDIT: I've added a .ToList() to the original data generation because I don't want any repeated generation of the data clouding the issue.

I'm just trying to understand why this code is so slow, by the way, not looking for a faster alternative, unless it sheds some light on the matter. I would have thought that if LINQ is lazily evaluated and I'm only looking for the first item (Take(1)), then test3's:

data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1);

could reduce to:

data.Select(x => new { Original = x, Match = data.Single(z => z.Index == 0) }).Take(1)

in O(N) as the first item in data is successfully matched after one full scan of the data by the inner Single(), leaving one more sweep of the data by the remaining Single(). So still all O(N).

It's evidently being processed in a more long-winded way, but I don't really understand how or why.

Test3 takes a couple of seconds to run by the way, so I think we can safely assume that if your answer features the number 10^16 you've made a mistake somewhere along the line.
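To put numbers on the laziness question, here is a rough sketch in Python rather than C#, with a hand-rolled single() standing in for LINQ's Single (the helper, variable names, and counts are illustrative only, not the real operator):

```python
# Count how many elements are inspected for the FIRST outer element
# only (mimicking Take(1)). N is kept small; the shape of the growth
# is what matters, not the absolute numbers.

N = 1000
data = [(i, f"{i} is the magic number") for i in range(N)]  # (Index, Value)

inspections = 0

def single(seq, pred):
    """Stand-in for LINQ's Single: it must scan the WHOLE sequence
    to verify that exactly one element matches."""
    global inspections
    matches = []
    for item in seq:
        inspections += 1
        if pred(item):
            matches.append(item)
    if len(matches) != 1:
        raise ValueError("not exactly one match")
    return matches[0]

x = data[0]  # Take(1): only the first outer element is ever produced

# test1-style: one single() per outer element -> N inspections
inspections = 0
single(data, lambda y: y[1] == x[1])
test1_cost = inspections  # N

# test3-style: the inner single() is re-evaluated for EVERY element
# the outer single() visits -> N * (N + 1) inspections
inspections = 0
single(data, lambda z: z[0] == single(data, lambda y: y[1] == x[1])[0])
test3_cost = inspections  # N * (N + 1): quadratic even with Take(1)

print(test1_cost, test3_cost)
```

Even with Take(1) forcing only one outer element, test3 still pays a full scan of data for every element the outer single visits, which is where the quadratic factor comes from.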

+7  A: 

The first two "tests" are identical, and both slow. The third adds another entire level of slowness.

The first two LINQ statements here are quadratic in nature. Since your "Match" element potentially requires iterating through the entire "data" sequence in order to find the match, as you progress through the range, the length of time for that element will get progressively longer. The 10000th element, for example, will force the engine to iterate through all 10000 elements of the original sequence to find the match, making this an O(N^2) operation.

The "test3" operation takes this to an entirely new level of pain, since it's "squaring" the O(N^2) operation in the second single - forcing it to do another quadratic operation on top of the first one - which is going to be a huge number of operations.

Each time you do data.Single(...) with the match, you're doing an O(N^2) operation - the third test basically becomes O(N^4), which will be orders of magnitude slower.

Reed Copsey
+1 I like your answer better than mine :)
Kelsey
How is iterating the entire sequence quadratic? And why is the third test squaring it when, working from the inside out, it should be the same as doing test1 and plugging the result into test2?
@stovroz: When you call data.Single, you're (potentially) doing an entire enumeration of "data" for every single element of data. This means each time you return one new item from Select(), you potentially have to iterate through ALL of data in order to find the "match". In the third case, you're doing this nested inside of the (already) quadratic element, which basically does an N^2 operation inside the N^2 operation...
Reed Copsey
+1 Excellent explanation of `O` notation. :)
jrista
Hi Reed Copsey. I assume that data.Single is _definitely_ doing an entire enumeration of data, as it needs to make sure there's only one. Anyway, so test3's data.Single(y => y.Value == x.Value).Index gets us a value for Index in O(N), reducing the requirement to test2's data.Single(z => z.Index == Index), which is also O(N). And I'm only asking to Take(1) of the N, so what's going wrong that the whole thing isn't O(N) as it should be?
@stovroz: Take test1: When you iterate, it's O(N). However, each element must do an additional data.Single [O(N)] call, which means you're O(N^2). When you do test3, you're doing another O(N) inside of the inner nest, which makes it "squared" again, which causes 10^16 total iterations - hence, it's slow as hell.
Reed Copsey
What do you mean each element must do an additional data.Single call? I'm only taking the first one and it's supposed to be lazily evaluated.
@stovroz: But, in test3, you're not doing the take(1), so you end up going quadratic on the first element...
Reed Copsey
I am doing Take(1) in test3.
@stovroz: But not in the inner query. You're doing a query internal to the query (since you've got a nested Single()), and that one happens EVERY time the first (outer) single is processed.
Reed Copsey
Test1 and Test2 are definitely not O(N^2), so there's no way test3 is O(N^4). It must be that test1 and test2 are O(N) and test3 is O(N^2), but I still don't see why Linq can't make a better job of it.
@Reed Copsey: stovroz is absolutely right. A simple test can show that test1 and test2 are O(N) and test3 is O(N^2). The problem is somewhere else.
Yury Tarabanko
@Yury: That's why test 3 is running so much slower. It would be O(N^4) (without the .Take(1), which makes only the first element iterate). However, it's O(N^2) on 10k elements, which makes it MUCH slower.
Reed Copsey
@Reed Copsey: I see. You are right. Have just compared it with "mannual" code. The results are approximately the same. :) +1
Yury Tarabanko
A: 

Fixed.

var data = Enumerable.Range(0, 10000)
  .Select(x => new { Index = x, Value = x + " is the magic number"})
  .ToList();

var forward = data.ToLookup(x => x.Index); 
var backward = data.ToLookup(x => x.Value);

var test1 = data.Select(x => new { Original = x,
  Match = backward[x.Value].Single()
} ).Take(1).Dump();
var test2 = data.Select(x => new { Original = x,
  Match = forward[x.Index].Single()
} ).Take(1).Dump();
var test3 = data.Select(x => new { Original = x,
  Match = forward[backward[x.Value].Single().Index].Single()
} ).Take(1).Dump(); 
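For readers outside .NET, the same indexing idea can be sketched in Python, with plain dicts standing in for ToLookup (a loose sketch: dicts assume the keys are unique, whereas ToLookup also tolerates duplicates):

```python
# Build each index once (O(N)); every match is then an O(1) lookup
# instead of an O(N) scan with Single. Dicts assume unique keys,
# where ToLookup would also handle duplicate keys.

data = [{"Index": i, "Value": f"{i} is the magic number"}
        for i in range(10_000)]

forward = {d["Index"]: d for d in data}   # ~ data.ToLookup(x => x.Index)
backward = {d["Value"]: d for d in data}  # ~ data.ToLookup(x => x.Value)

test1 = [{"Original": x, "Match": backward[x["Value"]]} for x in data[:1]]
test2 = [{"Original": x, "Match": forward[x["Index"]]} for x in data[:1]]
test3 = [{"Original": x,
          "Match": forward[backward[x["Value"]]["Index"]]}
         for x in data[:1]]
```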

In the original code,

  • data.ToList() generates 10,000 instances (10^4).
  • data.Select( data.Single() ).ToList() generates 100,000,000 instances (10^8).
  • data.Select( data.Single( data.Single() ) ).ToList() generates 100,000,000,000,000,000 instances (10^16).

Single and First are different: Single throws if more than one match is found, so it must fully enumerate its source to rule out a second match.
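That full enumeration is easy to demonstrate by counting; below is a Python sketch with hand-rolled single and first helpers standing in for the LINQ operators (names and counters are illustrative only):

```python
# single() must scan everything to rule out a second match;
# first() can stop at the first hit.

data = list(range(10_000))
scans = 0

def single(seq, pred):
    """Like LINQ's Single: full scan, error unless exactly one match."""
    global scans
    matches = []
    for item in seq:
        scans += 1
        if pred(item):
            matches.append(item)
    if len(matches) != 1:
        raise ValueError("not exactly one match")
    return matches[0]

def first(seq, pred):
    """Like LINQ's First: returns as soon as a match is found."""
    global scans
    for item in seq:
        scans += 1
        if pred(item):
            return item
    raise ValueError("no match")

scans = 0
single(data, lambda v: v == 0)
single_scans = scans  # 10000: the whole list, even though item 0 matched

scans = 0
first(data, lambda v: v == 0)
first_scans = scans   # 1: stopped immediately
```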

David B
If the original code is generating 10^16 instances of anything in a couple of seconds, I'm selling my PC to NASA.
well, the take(1) before the dump cuts it to 10^12.
David B
10^12 is still far too high.
*yawn*, you count 'em then.
David B