views: 30
answers: 3

Often I deal with aggregate or parent entities which have attributes derived from their constituent or children members. For example:

  • The byte_count and packet_count of a TcpConnection object are computed from the same attributes of its two constituent TcpStream objects, which are in turn computed from their constituent TcpPacket objects.

  • An Invoices object might have a total which is basically the SUM() of its constituent InvoiceLineItems' prices, with a little freight, discount and tax logic thrown in.

When dealing with millions of packets or millions of invoiced line items (I wish!), on-demand computation of these derived attributes -- either in a VIEW or more commonly in presentation logic like reports or web interfaces -- is often unacceptably slow.
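
For concreteness, here is a minimal sketch of the on-demand approach for the invoice case (the schema and column names are illustrative, and the tax logic is omitted):

    -- On-demand computation in a VIEW: the totals are recalculated from
    -- the line items on every read, which is the cost being discussed here.
    CREATE VIEW InvoiceTotals AS
    SELECT i.invoice_id,
           SUM(li.price * li.quantity) + i.freight - i.discount AS total
    FROM Invoices i
    JOIN InvoiceLineItems li ON li.invoice_id = i.invoice_id
    GROUP BY i.invoice_id, i.freight, i.discount;

Against millions of line items, every SELECT from such a view pays the full aggregation cost.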

How do you decide, before performance concerns force your hand, whether to "promote" derived attributes to precomputed fields?

+1  A: 

Basically, you don't. You let performance concerns force your hand.

That's the best answer because, 99% of the time, you should not be pre-optimizing like this; it's better to just calculate it on the fly.

However, it is quite common for client-application developers to come to the server-side with mistaken preconceptions like "on-demand computation of ...derived attributes... -- is often unacceptably slow", and this just IS NOT true. The correct wording here would be "is rarely unacceptably slow".

As such, unless you are an expert in this (a DB Development Architect, etc.), you should not be engaging in premature optimization. Wait until it's obvious that it has to be fixed, then look at pre-aggregation.

RBarryYoung
A: 

How current the data must be determines how you implement it, really.

I'll assume 2 simple states: current or not current.

  • Current: indexed views, triggers, stored procs to maintain aggregate tables, etc. (see the sketch after this list)
  • Not current: Reporting Service snapshots, log shipping/replication, data warehouse etc
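
A minimal sketch of one "current" option, a SQL Server indexed view: the engine maintains the stored aggregate on every write. The InvoiceLineItems schema here is assumed for illustration:

    -- SQL Server indexed view: SCHEMABINDING and COUNT_BIG(*) are required,
    -- and price/quantity are assumed NOT NULL (SUM over a nullable
    -- expression is not allowed in an indexed view).
    CREATE VIEW dbo.InvoiceTotals
    WITH SCHEMABINDING
    AS
    SELECT invoice_id,
           SUM(price * quantity) AS line_total,
           COUNT_BIG(*) AS line_count
    FROM dbo.InvoiceLineItems
    GROUP BY invoice_id;
    GO
    -- The unique clustered index is what actually persists the view's rows.
    CREATE UNIQUE CLUSTERED INDEX IX_InvoiceTotals
        ON dbo.InvoiceTotals (invoice_id);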

That said, I would develop against the same quantity of data as I have in prod so I have some confidence in response time. You should rarely be surprised by your code performance...

gbn
+3  A: 

I personally wouldn't denormalize until performance trade-offs force my hand (because the downsides of denormalization are too drastic, IMHO), but you might also consider:

  1. Convenience: e.g. if two different client apps want to calculate the same derived attributes, they both have to code up the queries to calculate them. Denormalization offers both client apps the derived attribute in a simpler way.
  2. Stability over time: e.g. if the formula for calculating a derived attribute is changeable, denormalization allows you to capture and store the derived value at a point in time, so future calculations will never get it wrong.
  3. Simpler queries: adding complexity to the DB structure can mean your Select query is simpler at the client end.
  4. Performance: Select queries on denormalized data can be quicker.

Ref: The Database Programmer: The Argument for Denormalization. Be sure to read his article on Keeping Denormalized Values Correct as well; his recommendation is to use triggers, which brings home the kind of trade-off denormalization requires.
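
As a rough illustration of that trigger-based approach (the schema is assumed, and a real version would also need the freight/discount/tax logic):

    -- Keeps a denormalized Invoices.total column in sync with its line items.
    CREATE TRIGGER trg_InvoiceLineItems_Total
    ON dbo.InvoiceLineItems
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Recompute the total for every invoice touched by this statement;
        -- COALESCE covers the case where the last line item was deleted.
        UPDATE i
        SET total = (SELECT COALESCE(SUM(li.price * li.quantity), 0)
                     FROM dbo.InvoiceLineItems li
                     WHERE li.invoice_id = i.invoice_id)
        FROM dbo.Invoices i
        WHERE i.invoice_id IN (SELECT invoice_id FROM inserted
                               UNION
                               SELECT invoice_id FROM deleted);
    END;

Every write path now pays the maintenance cost, which is exactly the trade-off the article brings home.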

vincebowdren
+1 - Very nice answer
Mark Brittingham