views:

561

answers:

7

I was once given this task to do in an RDBMS:

Given tables customer, order, orderlines and product. Everything done with the usual fields and relationships, with a comment memo field on the orderline table.

For one customer retrieve a list of all products that customer has ever ordered with product name, year of first purchase, dates of three last purchases, comment of the latest order, sum of total income for that product-customer combination last 12 months.

After a couple of days I gave up doing it as a Query and opted to just fetch every orderline for a customer, and every product and run through the data procedurally to build the required table clientside.

I regard this a symptom of one or more of the following:

  • I'm a lazy idiot and should have seen how to do it in SQL
  • Set operations are not as expressive as procedural operations
  • SQL is not as expressive as it should be

Did I do the right thing? Did I have other options?

+3  A: 

In most RDBMS you have the option of temporary tables or local table variables that you can use to break up a task like this into manageable chunks.

I don't see any way to easily do this as a single query (without some nasty subqueries), but it still should be doable without dropping out to procedural code, if you use temp tables.

GalacticCowboy
Erm, isn't using temporary tables or local table variables procedural code, then? I'd consider something "set based" to be achievable using a VIEW. If those subqueries ran really fast, would you still consider them 'nasty'?
onedaywhen
You're right. The subqueries would be very nasty indeed. But I think running through one result set by cursor and inserting result rows procedurally into a result temp table is the best way to do it serverside. It works, but it confirms that SQL Query Language is not up to the task.
Guge
No, the WHOLE POINT of a temp table is that you'd still use set-based logic to interact with it. True, cursor-based inserts are procedural, but the whole point of set-based logic is not to use cursors.
GalacticCowboy
@onedaywhen: a "stored procedure" is not "procedural" in the same sense you're thinking. Just because something is broken down into sub-steps, each of the substeps can be performed in set-logic rather than procedural code.
GalacticCowboy
+4  A: 

Set operations are not as expressive as procedural operations

Perhaps more like: "Set operations are not as familiar as procedural operations to a developer used to procedural languages" ;-)

Doing it iteratively as you have done now is fine for small sets of data, but simply doesn't scale the same way. The answer to whether you did the right thing depends on whether you are satisfied with the performance right now and/or don't expect the amount of data to increase much.

If you could provide some sample code, we might be able to help you find a set-based solution, which will be faster to begin with and scale far, far better. As GalacticCowboy mentioned, techniques such as temporary tables can help make the statements far more readable while largely retaining the performance benefits.

Sören Kuklau
Agreed. Can Amazon.com's application can fetch all products and iterate through them?
Bill Karwin
I agree with everything you say. The reason why I didn't provide sample code is just because I've left that project years ago. Don't have the code anymore. The code was ugly as hell. I didn't want to preload everyones idea of how to do it.
Guge
+1  A: 

SQL queries return results in the form of a single "flat" table of rows and columns. Reporting requirements are often more complex than this, demanding a "jagged" set of results like your example. There is nothing wrong with "going procedural" to solve such requirements, or using a reporting tool that sits on top of the database. However, you should use SQL as far as possible to get the best performance from the database.

Tony Andrews
Another way of thinking of this is that the result of a query is itself a relation (i.e. table) even though it is not stored. The definition of a relation therefore applies, e.g. each column has the same name, data type, and meaning on every row.
Bill Karwin
I agree very much about the performance part of it. I happen to think that it can't be solved in SQL alone. SQL was not created for this kind of reporting. The report is very human-oriented.
Guge
SQL was created for exactly this kind of problem - it was initially viewed as an ad-hoc querying interface. And it certainly can be dealt with in a single query.
le dorfier
Doofledorfer, I don't see how this problem can be USEFULLY answered in a single query, without either grouping with concatenated values or getting a cartesian product? The result doesn't fit neatly into 1 table.
Tony Andrews
+2  A: 

This problem may not have been solvable by one query. I see several distinct parts...

For one customer

  1. Get a list of all products ordered (with product name)
  2. Get year of first purchase
  3. Get dates of last three purchases
  4. Get comment on latest order
  5. Get sum of product purchases for the last 12 months

Your procedure is steps 1 - 5 and SQL gets you the data.

Anthony Mastrean
Maybe only 3 queries: can get min purchase year, latest comment and sum of purchases for last 12 months in a single query.
Tony Andrews
I'd like to see you get the dates of the last three purchases for each product in one nice ANSI-92 query. How can something that is so easy to explain be so hard to express in a language as "popular" as SQL?
Guge
+2  A: 

Sounds like a data warehouse project to me. If you need things like "three most recent things" and "sum of something over the last 12 months" then store them i.e. denormalize.

onedaywhen
This is actually a very good answer. Why not move the problem to incoming data? Much easier to delete the oldest purchase when a new is made and there are already 3 in the table, or even have 3 dates and update by setting date3=date2, date2=date1 and date1=order.date. Triggers rule.
Guge
Yup, at least the part apart "data warehouse". This is what analytical databases are for.
le dorfier
(But triggers don't rule. Avoid them at all costs. They're snakes that lurk in the dark and will bite when you least expect it.)
le dorfier
Until SQL Server CHECK constraints support subqueries (and can we have CREATE ASSERTION too, please), I'm stuck with triggers to enforce basic constraints e.g. a sequenced key in a valid-time state ('history') table.
onedaywhen
+7  A: 

You definitely should be able to do this exercise without doing the work equivalent to a JOIN in application code, i.e. by fetching all rows from both orderlines and products and iterating through them. You don't have to be an SQL wizard to do that one. JOIN is to SQL what a loop is to a procedural language -- in that both are fundamental language features that you should know how to use.

One trap people fall into is thinking that the whole report has to be produced in a single SQL query. Not true! Most reports don't fit into a rectangle, as Tony Andrews points out. There are lots of rollups, summaries, special cases, etc. so it's both simpler and more efficient to fetch parts of the report in separate queries. Likewise, in a procedural language you wouldn't try do all your computation in a single line of code, or even in a single function (hopefully).

Some reporting tools insist that a report is generated from a single query, and you have no opportunity to merge in multiple queries. If so, then you need to produce multiple reports (and if the boss wants it on one page, then you need to do some paste-up manually).

To get a list of all products ordered (with product name), dates of last three purchases, and comment on latest order is straightforward:

SELECT o.*, l.*, p.*
FROM Orders o
 JOIN OrderLines l USING (order_id)
 JOIN Products p USING (product_id)
WHERE o.customer_id = ?
ORDER BY o.order_date;

It's fine to iterate over the result row-by-row to extract the dates and comments on the latest orders, since you're fetching those rows anyway. But make it easy on yourself by asking the database to return the results sorted by date.

Year of first purchase is available from the previous query, if you sort by the order_date and fetch the result row-by-row, you'll have access to the first order. Otherwise, you can do it this way:

SELECT YEAR(MIN(o.order_date)) FROM Orders o WHERE o.customer_id = ?;

Sum of product purchases for the last 12 months is best calculated by a separate query:

SELECT SUM(l.quantity * p.price)
FROM Orders o
 JOIN OrderLines l USING (order_id)
 JOIN Products p USING (product_id)
WHERE o.customer_id = ?
 AND o.order_date > CURDATE() - INTERVAL 1 YEAR;

edit: You said in another comment that you'd like to see how to get the dates of the last three purchases in standard SQL:

SELECT o1.order_date
FROM Orders o1
  LEFT OUTER JOIN Orders o2 
  ON (o1.customer_id = o2.customer_id AND (o1.order_date < o2.order_date 
      OR (o1.order_date = o2.order_date AND o1.order_id < o2.order_id)))
WHERE o1.customer_id = ?
GROUP BY o1.order_id
HAVING COUNT(*) <= 3;

If you can use a wee bit of vendor-specific SQL features, you can use Microsoft/Sybase TOP n, or MySQL/PostgreSQL LIMIT:

SELECT TOP 3 order_date
FROM Orders
WHERE customer_id = ?
ORDER BY order_date DESC;

SELECT order_date
FROM Orders
WHERE customer_id = ?
ORDER BY order_date DESC
LIMIT 3;
Bill Karwin
Very thorough and nice answer. Balance between set operations and procedural, like I did. The hardest part is to get the last three purchase dates for each product a customer has bought. You, as I did, solve this one by going procedural.
Guge
I'm missing the 'Procedural' part of this solution that Guge identifies. Doesn't that require cursors and loops? This just looks like several select statements.
Jeff O
@Guiness: I think the point is that not every report has to be generated from a single overly-complex SQL query. The 'proceural' part is some application glue to run multiple simpler queries and combine/present their results.
Bill Karwin
+2  A: 

EDIT: This is a completely new take on the solution, using no temp tables or strange sub-sub-sub queries. However, it will ONLY work on SQL 2005 or newer, as it uses the "pivot" command that is new in that version.

The fundamental problem is the desired pivot from a set of rows (in the data) into columns in the output. While noodling on the issue, I recalled that SQL Server now has a "pivot" operator to deal with this.

This works on SQL 2005 only, using the Northwind sample data.

-- This could be a parameter to a stored procedure
-- I picked this one because he has products that he ordered 4 or more times
declare @customerId nchar(5)
set @customerId = 'ERNSH'

select c.CustomerID, p.ProductName, products_ordered_by_cust.FirstOrderYear,
    latest_order_dates_pivot.LatestOrder1 as LatestOrderDate,
    latest_order_dates_pivot.LatestOrder2 as SecondLatestOrderDate,
    latest_order_dates_pivot.LatestOrder3 as ThirdLatestOrderDate,
    'If I had a comment field it would go here' as LatestOrderComment,
    isnull(last_year_revenue_sum.ItemGrandTotal, 0) as LastYearIncome
from
    -- Find all products ordered by customer, along with first year product was ordered
    (
     select c.CustomerID, od.ProductID,
      datepart(year, min(o.OrderDate)) as FirstOrderYear
     from Customers c
      join Orders o on o.CustomerID = c.CustomerID
      join [Order Details] od on od.OrderID = o.OrderID
     group by c.CustomerID, od.ProductID
    ) products_ordered_by_cust
    -- Find the grand total for product purchased within last year - note fudged date below (Northwind)
    join (
     select o.CustomerID, od.ProductID, 
      sum(cast(round((od.UnitPrice * od.Quantity) - ((od.UnitPrice * od.Quantity) * od.Discount), 2) as money)) as ItemGrandTotal
     from
      Orders o
      join [Order Details] od on od.OrderID = o.OrderID
     -- The Northwind database only contains orders from 1998 and earlier, otherwise I would just use getdate()
     where datediff(yy, o.OrderDate, dateadd(year, -10, getdate())) = 0
     group by o.CustomerID, od.ProductID
    ) last_year_revenue_sum on last_year_revenue_sum.CustomerID = products_ordered_by_cust.CustomerID
     and last_year_revenue_sum.ProductID = products_ordered_by_cust.ProductID
    -- THIS is where the magic happens.  I will walk through the individual pieces for you
    join (
     select CustomerID, ProductID,
      max([1]) as LatestOrder1,
      max([2]) as LatestOrder2,
      max([3]) as LatestOrder3
     from
     (
      -- For all orders matching the customer and product, assign them a row number based on the order date, descending
      -- So, the most recent is row # 1, next is row # 2, etc.
      select o.CustomerID, od.ProductID, o.OrderID, o.OrderDate,
       row_number() over (partition by o.CustomerID, od.ProductID order by o.OrderDate desc) as RowNumber
      from Orders o join [Order Details] od on o.OrderID = od.OrderID
     ) src
     -- Now, produce a pivot table that contains the first three row #s from our result table,
     -- pivoted into columns by customer and product
     pivot
     (
      max(OrderDate)
      for RowNumber in ([1], [2], [3])
     ) as pvt
     group by CustomerID, ProductID
    ) latest_order_dates_pivot on products_ordered_by_cust.CustomerID = latest_order_dates_pivot.CustomerID
     and products_ordered_by_cust.ProductID = latest_order_dates_pivot.ProductID
    -- Finally, join back to our other tables to get more details
    join Customers c on c.CustomerID = products_ordered_by_cust.CustomerID
    join Orders o on o.CustomerID = products_ordered_by_cust.CustomerID and o.OrderDate = latest_order_dates_pivot.LatestOrder1
    join [Order Details] od on od.OrderID = o.OrderID and od.ProductID = products_ordered_by_cust.ProductID
    join Products p on p.ProductID = products_ordered_by_cust.ProductID
where c.CustomerID = @customerId
order by CustomerID, p.ProductID
GalacticCowboy
BTW, if I comment out the customer ID filter (so the results contain all customers in the Northwind database) the script still takes less than a second.
GalacticCowboy
You realize how short a step it is from that, to a single query without temp tables, don't you? And I bet it will run faster still - at least no slower.
le dorfier
@GalacticCowboy - My hat is off to you. Quite an effort. I am drawn to conclude that Sören Kuklau was right in that I am much procedural, that's how I read between his lines. However, I think that a query language should have been able to do something so proceduraly simple in a shorter expression.
Guge
@doofledorfer - Can't wait to see it. Would you first get the latest, then the next latest with a subquery and the third latest with a subquery in a subquery? That's what I didn't want in my original program.
Guge
Of course, now half of these comments don't make sense... :)
GalacticCowboy