tags:

views:

281

answers:

8

Edit 1 (clarification): Thank you for the answers so far! The response is gratifying.
I want to clarify the question a little because based on the answers I think I did not describe one aspect of the problem correctly (and I'm sure that's my fault as I was having a difficult time defining it even for myself).
Here's the rub: The result set should contain ONLY the records with tstamp BETWEEN '2010-01-03' AND '2010-01-09', AND the one record where the tstamp IS NULL for each order_num in the first set (there will always be one with null tstamp for each order_num).
The answers given so far appear to include all records for a certain order_num if there are any with tstamp BETWEEN '2010-01-03' AND '2010-01-09'. For example, if there were another record with order_num = 2 and tstamp = 2010-01-12 00:00:00 it should not be included in the result.

Original question:
Consider an orders table containing id (unique), order_num, tstamp (a timestamp), and item_id (the single item included in an order). tstamp is null, unless the order has been modified, in which case there is another record with identical order_num and tstamp then contains the timestamp of when the change occurred.

Example...

id  order_num  tstamp               item_id
__  _________  ___________________  _______
 0          1                           100
 1          2                           101
 2          2  2010-01-05 12:34:56      102
 3          3                           113
 4          4                           124
 5          5                           135
 6          5  2010-01-07 01:23:45      136
 7          5  2010-01-07 02:46:00      137
 8          6                           100
 9          6  2010-01-13 08:33:55      105

What is the most efficient SQL statement to retrieve all of the orders (based on order_num) which have been modified one or more times during a certain date range? In other words, for each order we need all of the records with the same order_num (including the one with NULL tstamp), for each order_num WHERE at least one of the order_num's has tstamp NOT NULL AND tstamp BETWEEN '2010-01-03' AND '2010-01-09'. It's the "WHERE at least one of the order_num's has tstamp NOT NULL" that I'm having difficulty with.

The result set should look like this:

id  order_num  tstamp               item_id
__  _________  ___________________  _______
 1          2                           101
 2          2  2010-01-05 12:34:56      102
 5          5                           135
 6          5  2010-01-07 01:23:45      136
 7          5  2010-01-07 02:46:00      137

The SQL that I came up with is this, which is essentially "A UNION (B in A)", but it executes slowly and I hope there is a more efficient solution:

SELECT history_orders.order_id, history_orders.tstamp, history_orders.item_id
FROM
   (SELECT orders.order_id, orders.tstamp, orders.item_id
    FROM orders
    WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09')
    AS history_orders
UNION
SELECT current_orders.order_id, current_orders.tstamp, current_orders.item_id
FROM
   (SELECT orders.order_id, orders.tstamp, orders.item_id
    FROM orders
    WHERE orders.tstamp IS NULL)
    AS current_orders
WHERE current_orders.order_id IN
   (SELECT orders.order_id
    FROM orders
    WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09');
+3  A: 

perhaps a subquery:

select * from order o where o.order_num in (select distinct
  order_num from order where tstamp between '2010-01-03' and '2010-01-09')
jspcal
+1 for being the most readable of the first three near-identical solutions.
egrunin
So, after including the change necessary based on my clarification ("Edit 1") in my original question, my solution is to use this answer while adding subquery to the first select with a WHERE clause that limits the results totstamp BETWEEN '2010-01-03' AND '2010-01-09'OR tstamp IS NULL.I will add a complete and correct answer shortly.
machinatus
+1  A: 

Unless I've misunderstood, something like this should do the trick:

SELECT o1.id, o1.order_num, o.tstamp, o.item_id
FROM  orders o1
WHERE EXISTS(
    SELECT * FROM orders o2 
    WHERE o1.order_num = o2.order_num 
        AND o2.tstamp BETWEEN '2010-01-03' AND '2010-01-09')

Benefit of using EXISTS is that it stops as soon as it fines the first match.

AdaTheDev
A: 

Hope I got your question right. This should return all orders which have on order that has been changed within the provided timestamp.

SELECT o.order_id, o.tstamp, o.item_id
FROM orders o
JOIN ( SELECT DISTINCT o2.order_num
       FROM orders o2
       WHERE o2.tstamp BETWEEN '2010-01-03' AND '2010-01-09' ) o3
ON ( o3.order_num = o.order_num )
Peter Lang
A: 

You can self-join the table. Simplified, this would look like:

select order_id
from orders all_orders
inner join orders not_null_orders
    on all_orders.order_id = not_null_orders.order_id
where
    not_null_orders.tstamp is not null
    and all_orders.tstamp between '2010-01-03' AND '2010-01-09'
Alison R.
A: 

You didn't mention how many rows in the tables you were dealing with here and it can make a BIG difference to which query strategy you take. The below will probably work fairly well for most medium to large tables.

I essentially do the A query first into a temporary table and index the results. I then do the (B in A) query, which hopefully is sped up by the index, and use UNION ALL - note the ALL there (which doesn't de-dupe the results) - unioned with the (A) query which we already have the results for.

If your dbo.orders table does not have an index on the order_num column, you may want to consider one...

SELECT id, order_num, tstamp, item_id
INTO #tempTstampExists
FROM dbo.orders
WHERE tstamp BETWEEN '01 Mar 2010' and '01 Sep 2010'

CREATE INDEX IX_TmpStamp_OrderNum ON #tempTstampExists (order_num)

SELECT   id, order_num, tstamp, item_id
FROM     dbo.orders
                    INNER JOIN
         #tempTstampExists tmp
                    ON orders.order_num = tmp.order_num
WHERE    
         orders.tstamp BETWEEN '01 Mar 2010' and '01 Sep 2010'
AND      orders.tstamp IS NULL

UNION ALL

SELECT   id, order_num, tstamp, item_id
FROM     #tempTstampExists
ORDER BY
         order_num, id

Hopefully that helps.

Neil Fenwick
Ps the strategy mentioned here is for ad-hoc type queries that would be long-running. If you run this query frequently (e.g. several times per minute or more), you probably need to look at separating your data out into more tables, and keep the NULL tstamp entries away from the non-NULL entries
Neil Fenwick
To comment on the table size, my actual "real world" problem involves 4 tables (connected with inner joins) containing 98, 2189, 43897, 785656 records respectively.
machinatus
A: 

Thank you again for all the suggestions. I found three solutions that work, including my original. At the end I've added some performance results, which are not as great as I had hoped. If anyone can improve on this I would be thrilled!

1) The best solution found so far seems to be:

SELECT history_orders.order_id, history_orders.tstamp, history_orders.item_id
FROM
   (SELECT orders.order_id, orders.tstamp, orders.item_id
    FROM orders
    WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09'
    OR orders.tstamp IS NULL)
    AS history_orders
WHERE history_orders.order_id IN
   (SELECT orders.order_id
    FROM orders
    WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09');

2) I also tried using EXISTS in place of IN, which requires an additional WHERE clause in the last SELECT:

SELECT history_orders.order_id, history_orders.tstamp, history_orders.item_id
FROM
   (SELECT orders.order_id, orders.tstamp, orders.item_id
    FROM orders
    WHERE orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09'
    OR orders.tstamp IS NULL)
    AS history_orders
WHERE EXISTS
   (SELECT orders.order_id
    FROM orders
    WHERE history_orders.order_id = orders.order_id
    AND orders.tstamp BETWEEN '2010-01-03' AND '2010-01-09');

3) And finally there is my original solution, using UNION.

Comments:
To comment on the table size, my actual "real world" problem involves 4 tables (connected with inner joins) containing 98, 2189, 43897, 785656 records respectively.

Performance - I ran each solution three times and here are my real world results:
1: 52, 51, 51 seconds
2: 54, 54, 53 s
3: 56, 56, 56 s

machinatus
Do you have indexes on order_id and tstamp?
benjynito
No, and I don't have the power to modify the design as this is not a key report. It will not be run very often so I'm fairly satisfied with what I have now. Not that I'm not interested in ways to improve things further, just for my own knowledge!
machinatus
A: 

I've added another answer after you've clarified your question. I've now realised it is not possible to execute any query to filter results without checking for NULL's...

A warning about NULL's - you can't compare anything to NULL & you can't index it. Doing anything involving something like

...
WHERE colX IS NOT NULL
AND ...

is going to result in a table scan. In your case, I can't see how to avoid scanning the entire table because you need to check for at least 1 NULL for each group of orders.

If this is a core query - I think I'd probably reconsider the table design and avoid relying on NULL's

Neil Fenwick
+1  A: 

I know it's very late for repliying but I just saw this posting and I thought maybe I should try this once, how about this query, it's really very small as compared to all of the above solutions and solves the purpose.

select * from orders_gc where order_num in 
    (select order_num
     from orders_gc 
     group by order_num 
     having count(id) > 1 and 
     MAX(tstamp) between '03-jan-2010' and '09-jan-2010')
Gourav C