views:

108

answers:

4

I see that in SQL, the GROUP BY has to precede ORDER BY expression. Does this imply that ordering is done after grouping discards identical rows/columns?

Because I seem to need to order rows by a timestamp column A first, THEN discarding rows with identical value in column A. Not sure how to accomplish this...

I am using MySQL 5.1.41

create table
(
    A int,
    B timestamp
)

The data could be:

+-----+-----------------------+
|  A  |  B                    |
+-----+-----------------------+
|  1  |  today                |
|  1  |  yesterday            |
|  2  |  yesterday            |
|  2  |  tomorrow             |
+-----+-----------------------+

The results I am aiming for would be:

+-----+-----------------------+
|  A  |  B                    |
+-----+-----------------------+
|  1  |  today                |
|  2  |  tomorrow             |
+-----+-----------------------+

Basically, I want the rows with the latest timestamp in column B (think ORDER BY), and only one row for each value in column A (think DISTINCT or GROUP BY).

My actual project details, if you need these:

In real life, I have two tables - users and payment_receipts.

create table users
(
    phone_nr int(10) unsigned not null,
    primary key (phone_nr)
)

create table payment_receipts
(
    phone_nr int(10) unsigned not null,
    payed_ts timestamp default current_timestamp not null,
    payed_until_ts timestamp not null,
    primary key (phone_nr, payed_ts, payed_until_ts)
)

The tables may include other columns, I omitted all that IMO is irrelevant here. As part of a mobile-payment scheme, I have to send SMS to users across the mobile cell network in periodic intervals, depending of course on whether the payment is due or not. The payment is actualized when the SMS is sent, which is premium-taxed. I keep records of all payments done with the payment_receipts table, for book-keeping, which simulates a real shop where both a buyer and seller get a copy of the receipt of purchase, for reference. This table stores my (sellers) copy of each receipt. The customers receipt is the received SMS itself. Each time an SMS is sent (and thus a payment is accomplished), the table is inserted a receipt record, stating who payed, when and "until when". To explain the latter, imagine a subscription service, but one which spans indefinitely until a user opt-out explicitly, at which point the user record is removed. A payment is made a month in advance, so as a rule, the difference between the payed_ts and payed_until_ts is 30 days worth of time.

Naturally I have a batch job that executes every day and needs to select a list of users that are due monthly payment as part automatic subscription renewal. To link this to the dumy example earlier, the phone number column phone_nr is a and payed_until_ts is b, but in actual code there are two tables, which bring me to the following behavior and its implications: when a user record is removed, the receipt remains, for bookkeeping. So, not only do I need to group payments by date and discard all but the latest payment receipt date, I also need to watch out not to select receipts where there no longer is a matching user record.

I am solving the problem of selecting records that are due payment by finding the receipts with the latest payed_until_ts value (as in most cases there will be several receipts for each phone number) for each phone_nr and out of those rows I further need to leave only those phone_numbers where the payed_until_ts is earlier than the time the batch job executes. I loop over the list of these numbers and send out payments, storing a new receipt for each sent SMS, where payed_ts is now() and payed_until_ts is now() + interval 30 days.

+4  A: 

Yes, grouping is done first, and it affects a single select whereas ordering affects all the results from all select statements in a union, such as:

select a, 'max', max(b) from tbl group by a
union all select a, 'min', min(b) from tbl group by a
order by 1, 2

(using field numbers in order by since I couldn't be bothered to name my columns). Each group by affects only its select, the order by affects the combined result set.

It seems that what you're after can be achieved with:

select A, max(B) from tbl group by A

This uses the max aggregation function to basically do your pre-group ordering (it doesn't actually sort it in any decent DBMS, rather it will simply choose the maximum from an suitable index if available).

paxdiablo
This though wouldn't gaurantee the whole row that contains the MAX(B), which I believe the asker wants. Is there any way to do that without the answer I provided?
Mike Sherov
Not sure what you mean, @Mike. This gets the whole row. If you have _extra_ columns that you want pulled from the row with the highest `B` then, yes, you need another way, since you _have_ to use an aggregation function and it may result in a value from a different row. That way is a nightmare in standard SQL and should probably be left to a higher (i.e., application) layer.
paxdiablo
@pax, yes it requires a subselec if you want other values from the row that has the max(b), which I agree is a nightmare (huge possible temp table). I would think though that doing it in the app. layer would be an even bigger nightmare :)
Mike Sherov
@Mike, I didn't understand what you mean. What's the meaning of to "guarantee the whole row"?
amn
@amn, imagine you're table had a third column, "c", which had usernames. Now, if you did "select a,max(b),c from table group by a", you'd get the max timestamp, but the username from the lowest timestamp.
Mike Sherov
Ah, yes, exactly, I get it now. Thanks.
amn
@paxdiablo You do have a very useful answer, and I have confirmed it works very well for me. The only caveat here is that one needs to be aware that this will NOT work (as mentioned by others) where there are other columns to be retrieved beside `a` and `max(b)`. Since I am planning to eventually select other columns as well, I will not use your solution. But wanted to say a big thank you nevertheless, as for certain scenarios it is perfectly valid!
amn
No probs, @amn. My solution is fine for the question posed but, as you say, problematic if you want other columns from the same row.
paxdiablo
+1  A: 
Select a,b from (select a,b from table order by b) as c group by a;
Mike Sherov
There is so much errors in this query that i dont even want to describe it...
Vash
@Vash, have you never heard of a subquery: http://dev.mysql.com/doc/refman/5.0/en/subqueries.html
Mike Sherov
@Vash please do! I want to get my head around this.
amn
@amn, this is a valid mysql query. Not sure what breed of SQL you're using. Perhaps you should add which version of SQL you're using to this questions tags?
Mike Sherov
Actually, it looks like what I need. I use MySQL 5.1.41, and don't aim for SQL portability, at least not to a degree where I won't use this because of portability concerns. Just out of curiosity, is it possible to rewrite this query using join syntax?
amn
@amn, you can write it using a join, but it might not be quicker because either way it involves temp tables. Updating my answer to include it...
Mike Sherov
@amn, forget the join version, it's too unnecessarily complex.
Mike Sherov
Vash
@vash, Amn is using mysql and doesn't care about SQL server. Second, you don't need to only grab aggregate columns when doing a group by in mysql. Third, the ordering is useful to gaurantee that the group by will choose the lowest timestamp's row, as we all know group by uses the row it first encounters. Ordering first makes sure it encounter the lowest timestamps first for every case. Fourth, you don't need to read about set theory to see that my answer is valid and accomplishes the goals presented by @amn.
Mike Sherov
I can confirm that this works as described. I wish I could comprehend what you mean @Vash, I really do. But it works. Are you saying this will break in some scenarios? To me it actually looks really simple - the subquery results are returned ordered only on timestamps, and then grouping selects a row for each distinct `a`, with `b` always the earliest/latest (depending on `asc`/`desc`) timestamp from all timestamps with the same value in `a`. What's not to like? :-)
amn
@amn, it will not break. Although I appreciated the discussion, all of vash's concerns seem to be based on SQL server, which you're not using.
Mike Sherov
@Mike Sherov - For my is something new, that You don't have to use any function or use for grouping. This because i've never work with MySql. In this case i assume the B i taken first form group A.. so if this possible in MySQL every thing as You sad has right place in that query. @amn No, if this is fine with one will be correct in every case for MySQL database engine. As Mike S. already sad my concerns came from that You do not specified at the beginning which db You are using, and my lack of knowledge in MySQL.
Vash
So what I ended up doing, largely derived from @Mike's answer:`select users.phone_nr, p.payed_until_ts from users inner join (select * from (select phone_nr, payed_until_ts from payment_receipts order by payed_until_ts desc) as o group by phone_nr) as p using (phone_nr) where payed_until_ts < now()`. I have indexes in proper places, so the query is fast, and it pleases me that it is correct.
amn
@amn, you can optimize this further: `select * from (select users.phone_nr as phone_nr, p.payed_until_ts as payed_until_ts from users inner join payment_receipts using (phone_nr) where payed_until_ts < now() order by payed_until_ts desc) as t group by phone_nr;` This will do the joining, ordering, and where in one pass, and then do the group by afterwards. I haven't tested it, and assuming there aren't syntax errors, it might be faster.
Mike Sherov
+2  A: 

According to your new rules (tested with PostgreSQL)


Query You'd Want:

SELECT    pr.phone_nr, pr.payed_ts, pr.payed_until_ts 
FROM      payment_receipts pr
JOIN      users
          ON (pr.phone_nr = users.phone_nr)
   JOIN      (select phone_nr, max(payed_until_ts) as payed_until_ts 
              from payment_receipts 
              group by phone_nr
             ) sub
             ON (    pr.phone_nr       = sub.phone_nr 
                 AND pr.payed_until_ts = sub.payed_until_ts)
ORDER BY  pr.phone_nr, pr.payed_ts, pr.payed_until_ts;


Original Answer (with updates):

CREATE TABLE foo (a NUMERIC, b TEXT, DATE);

INSERT INTO foo VALUES 
   (1,'a','2010-07-30'),
   (1,'b','2010-07-30'),
   (1,'c','2010-07-31'),
   (1,'d','2010-07-31'),
   (1,'a','2010-07-29'),
   (1,'c','2010-07-29'),
   (2,'a','2010-07-29'),
   (2,'a','2010-08-01');

-- table contents
SELECT * FROM foo ORDER BY c,a,b;
 a | b |     c      
---+---+------------
 1 | a | 2010-07-29
 1 | c | 2010-07-29
 2 | a | 2010-07-29
 1 | a | 2010-07-30
 1 | b | 2010-07-30
 1 | c | 2010-07-31
 1 | d | 2010-07-31
 2 | a | 2010-08-01

-- The following solutions both retrieve records based on the latest date
--    they both return the same result set, solution 1 is faster, solution 2
--    is easier to read

-- Solution 1: 
SELECT    foo.a, foo.b, foo.c 
FROM      foo
JOIN      (select a, max(c) as c from foo group by a) bar
  ON      (foo.a=bar.a and foo.c=bar.c)
ORDER BY  foo.a, foo.b, foo.c;

-- Solution 2: 
SELECT    a, b, MAX(c) AS c 
FROM      foo main
GROUP BY  a, b
HAVING    MAX(c) = (select max(c) from foo sub where main.a=sub.a group by a)
ORDER BY  a, b;

 a | b |     c      
---+---+------------
 1 | c | 2010-07-31
 1 | d | 2010-07-31
 2 | a | 2010-08-01
(3 rows)  


Comment:
1 is returned twice because their are multiple b values. This is acceptable (and advised). Your data should never have this problem, because c is based on b's value.

vol7ron
Yes, this works, although I don't need the final ordering of rows. Also, it exhibits the same problem as other solutions using `max` - if there are other columns to retrieve besides the `max` and `a`, I am not sure which row will return these.
amn
@amn: could you be a little more descriptive in what you're trying to accomplish? don't be afraid to make your example a little more explicit - SO is filled with programmers that perform all levels of programming. `group by`, groups on a set of keys to return a unique set of values. if you want to return more columns, include it in the `select` and `group by`
vol7ron
Hello, I have added a whole section with the actual problem, for your "pleasure" :-) Thank you for your time. Your update would not apply, because no two records with duplicate `a` may appear in the results.
amn
I have to go to a wedding, so I can't review this any further, but review my update before saying it won't apply. I gave the scenario that column `b` (any of your added columns) could have the same column `a` and same date. You may have data/system constraints to prevent this from happening, but the query would still be what you need.
vol7ron
I'm not sure why you selected the answer that you did, but the above should do what you need, based on your previous question. I'm trying not to answer your other question because I don't want to contribute to deleting "useless" records, since I believe no information is useless. If you're having to extract data from a table, then you need to either 1) change your application or 2) change your database (adding triggers / logging), so that the information is placed where it needs to be. I suggest using an `update` query and proper triggers/logs
vol7ron
vol7ron, well I did flag your answer as useful. Unfortunately I have to select a single 'correct' answer, in the spirit of SO. The answer given by Mike was very clear and the query was very easily understandable. Your query is more tedious, while both are correct. As simple as that. I avoid using features that I can do without using simply querying, and in my case simple querying without triggers and update worked. I also hence avoid extra queries. That is not to say your query is incorrect, I simply chose Mike' version after testing it fist.
amn
So long as you got the answer. `Select a,b from (select a,b from table order by b) as c group by a;` does not solve the problem, though. BTW, you won't need this query if triggers are in place. I'm just advocating good database design vs patching.
vol7ron
+2  A: 
SELECT DISTINCT a,b
FROM tbl t
WHERE b = (SELECT MAX(b) FROM tbl WHERE tbl.a = t.a);
dportas
Hmm, very nice, thank you.
amn