views:

292

answers:

2

There are similar questions to this, but I don't think anyone has asked this particular question.

Scenario:

Customer - Order (where Order has a CustomerID) - OrderPart - Part

I want a query that returns a customer with all its orders and each order with its parts.

Now I have two main choices:

  1. Use a nested loop (which produces separate queries)
  2. Use data loading options (which produces a single query join)

The question:

Most advice and examples on ORMs suggest using option 2 and I can see why. However, option 2 will potentially be sending back a huge amount of duplicated data, eg:

Option 1 results (3 queries):

ID  Name       Country
1   Customer1  UK

ID  Name
1   Order1
2   Order2

ID  Name
1   Part1
2   Part2
3   Part3

Option 2 results (1 query):

ID  Name       Country  ID  Name    ID Name
1   Customer1  UK       1   Order1  1  Part1
1   Customer1  UK       1   Order1  2  Part2
1   Customer1  UK       1   Order1  3  Part3
1   Customer1  UK       2   Order2  1  Part1
1   Customer1  UK       2   Order2  2  Part2

Option 1 sends back 13 fields with 3 queries. Option 2 sends back 42 fields in 1 query. Now imagine Customer table has 30 fields and Orders have more complex sub joins, the data duplication can quickly become huge.

What impact on overall performance do the following things have:

  • Overhead of making a database connection
  • Time taken to send data (potentially across network if on different server)
  • Bandwidth

Is option 2 always the best choice, option 1 the best choice or does it depend on the situation? If it depends, what criteria should you use to determine? Are any ORMs clever enough to work it out for themselves?

+1  A: 

a great deal of this is going to depend on the amount of data you are going through. The join, while returning more fields, is going to run markedly faster (as a rule) than the Option 1 set of queries. From my personal experience, slow-downs are almost always at that level, the actual running of the query, not the sheer amount of data being passed along whatever pipe you have.

moleboy
+4  A: 

Overhead of making a database connection

Very little if they are on the same subnet, which they usually are. If they're not then this is still not a huge overhead and can be overcome with caching, which most ORMs have (NHibernate has 1st and 2nd level caching).

Time taken to send data (potentially across network if on different server)

For SELECT N+1 this will obviously be longer as it will have to send the select statement each time, which might be up to 1k long. It will also have to grab a new connection from the pool. Chatty versus chunky use to be an argument around 2002-2003 but now it really doesn't make a huge difference unless this is a really big application, in which case you will probably want a more experienced (or better paid) pundit giving his views - i.e. a consultant.

I would favour joins however, as databases will be optimised for this usage over their 10 or more years of development. If performance is really slow a View can sort this out, or Stored Procedure.

By the way, SELECT N+1 is probably the commonest performance problem people experience with NHibernate when they first start using it (including me), and is something that actually takes tweaking to sort out. This is because NHibernate is to ORMs what C++ is to languages.

Bandwidth

An extra SELECT statement for every Customer will eventually build up to however many Customer objects * Orders. So for a large system this might be noticeable - but as I mentioned, ORMs usually have caching mechanisms in place to negate this problem. The amount of SELECT statements also isn't going to be that huge considering:

  • You're on the same network as the SQL server most of the time
  • The increased amount of bytes account for about an extra 0.5-50k of extra bandwidth? Think how fast that is on most servers.
Chris S