I have two tables, players and scores.

I want to generate a report that looks something like this:

player    first score             points
foo       2010-05-20              19
bar       2010-04-15              29
baz       2010-02-04              13

Right now, my query looks something like this:

select p.name        player,
       min(s.date)   first_score,
       s.points      points    
from  players p    
join  scores  s on  s.player_id = p.id    
group by p.name, s.points

I need the s.points that is associated with the row that min(s.date) returns. Is that happening with this query? That is, how can I be certain I'm getting the correct s.points value for the joined row?

Side note: I imagine this is somehow related to MySQL's lack of dense ranking. What's the best workaround here?

+3  A: 

This is the greatest-n-per-group problem that comes up frequently on Stack Overflow.

Here's my usual answer:

select
  p.name        player,
  s.date        first_score,
  s.points      points

from  players p

join  scores  s
  on  s.player_id = p.id

left outer join scores  s2
  on  s2.player_id = p.id
      and s2.date < s.date

where
  s2.player_id is null

;

In other words, given score s, try to find a score s2 for the same player, but with an earlier date. If no earlier score is found, then s is the earliest one.
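
To see the exclusion join at work, here is a minimal sketch on made-up data. The column types and the sample rows are assumptions for illustration; only the table and column names come from the question.

```sql
-- Hypothetical schema and rows, for illustration only.
create table players (
  id   int primary key,
  name varchar(50) not null
);

create table scores (
  id        int primary key,
  player_id int  not null,
  date      date not null,
  points    int  not null
);

insert into players (id, name) values (1, 'foo'), (2, 'bar'), (3, 'baz');

insert into scores (id, player_id, date, points) values
  (1, 1, '2010-05-20', 19),
  (2, 1, '2010-06-01', 40),  -- later score for foo, should be excluded
  (3, 2, '2010-04-15', 29),
  (4, 2, '2010-04-30',  7),  -- later score for bar, should be excluded
  (5, 3, '2010-02-04', 13);

-- Keep only the rows of s for which no earlier row s2 exists.
select p.name player, s.date first_score, s.points points
from players p
join scores s
  on s.player_id = p.id
left outer join scores s2
  on s2.player_id = p.id
     and s2.date < s.date
where s2.player_id is null;

-- Rows returned (order not guaranteed without ORDER BY):
--   foo  2010-05-20  19
--   bar  2010-04-15  29
--   baz  2010-02-04  13
```

Scores 2 and 4 each find an earlier s2 for the same player, so their s2.player_id is not null and the where clause drops them.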


Re your comment about ties: You need a policy for which row to use in case of a tie. One possibility, if you use auto-incrementing primary keys, is to treat the row with the lowest key value as the earlier one. See the additional term in the outer join below:

select
  p.name        player,
  s.date        first_score,
  s.points      points

from  players p

join  scores  s
  on  s.player_id = p.id

left outer join scores  s2
  on  s2.player_id = p.id
      and (s2.date < s.date or (s2.date = s.date and s2.id < s.id))

where
  s2.player_id is null

;

Basically you need to add tiebreaker terms until you get down to a column that's guaranteed to be unique, at least for the given player. The primary key of the table is often the best solution, but I've seen cases where another column was suitable.
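
A small sketch of the tie case, again on made-up data (the types and rows are assumptions, not from the question). Player foo has two scores on the same first day, and the id term picks exactly one of them.

```sql
-- Hypothetical schema and rows, for illustration only.
create table players (
  id   int primary key,
  name varchar(50) not null
);

create table scores (
  id        int primary key,
  player_id int  not null,
  date      date not null,
  points    int  not null
);

insert into players (id, name) values (1, 'foo');

insert into scores (id, player_id, date, points) values
  (1, 1, '2010-05-20', 19),
  (2, 1, '2010-05-20', 25),  -- same date as score 1: a tie
  (3, 1, '2010-06-01', 40);

-- With only "s2.date < s.date", scores 1 and 2 would both be returned,
-- because neither has a strictly earlier row. The id term breaks the tie.
select p.name player, s.date first_score, s.points points
from players p
join scores s
  on s.player_id = p.id
left outer join scores s2
  on s2.player_id = p.id
     and (s2.date < s.date or (s2.date = s.date and s2.id < s.id))
where s2.player_id is null;

-- Row returned:
--   foo  2010-05-20  19
```

Score 2 now matches s2 = score 1 (same date, lower id), so it is dropped and a single row per player remains.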

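Regarding the comments I shared with @OMG Ponies, remember that this type of query benefits hugely from the right index, for example a composite index on scores that starts with player_id and date (followed by the tiebreaker column, if you use one).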
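
As a rough sketch of what such an index might look like: the index name idx_scores_player_date and the column types are hypothetical, and the best column order depends on your data and MySQL version, so check the plan with EXPLAIN rather than taking this as given.

```sql
-- Hypothetical schema, for illustration only.
create table players (
  id   int primary key,
  name varchar(50) not null
);

create table scores (
  id        int primary key,
  player_id int  not null,
  date      date not null,
  points    int  not null
);

-- Assumed index: the join columns first, then the tiebreaker, so the
-- lookup of s2 can be answered from the index alone.
create index idx_scores_player_date on scores (player_id, date, id);

-- Inspect the plan to see whether the index is actually chosen for s2.
explain
select p.name player, s.date first_score, s.points points
from players p
join scores s
  on s.player_id = p.id
left outer join scores s2
  on s2.player_id = p.id
     and (s2.date < s.date or (s2.date = s.date and s2.id < s.id))
where s2.player_id is null;
```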

Bill Karwin
OMG Ponies
@Bill Karwin, if my `join scores s...` has more join conditions than `s.player_id = p.id`, would I copy all of those conditions for the `left outer join scores s2...` as well?
macek
@OMG Ponies: I have found that using GROUP BY in MySQL is a performance killer, because MySQL almost always creates a temp table. With the outer join solution (or the equivalent NOT EXISTS with a correlated subquery), it's possible to use covering indexes, so the join may be done in memory.
Bill Karwin
@macek: Yes, the join to s2 must use the same conditions as the join to s, plus the one about comparing dates. And if you have the possibility of ties (more than one score on the same date), you may need an extra join term to resolve the tie.
Bill Karwin
@Bill Karwin, you're exactly right! I'm getting multiple rows returned for some users because they have about 2-5 scores that fall on the first day they play. How do I resolve that?
macek
A: 

Most RDBMSs won't even let you include non-aggregate columns in your SELECT clause when using GROUP BY unless they also appear in the GROUP BY clause. MySQL allows it when ONLY_FULL_GROUP_BY is not enabled, but the values for those non-aggregate columns come from an arbitrary row in each group, not necessarily the row that produced the MIN() or MAX(). This is useful if you actually have the same value in a particular column for all the rows. Therefore, it's nice that MySQL doesn't restrict us, though it's an important thing to understand.
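
A short sketch of both behaviours, on made-up data (the types and rows are assumptions for illustration). It also shows why grouping by s.points, as in the question, does not pair the points with the earliest date.

```sql
-- Hypothetical schema and rows, for illustration only.
create table players (
  id   int primary key,
  name varchar(50) not null
);

create table scores (
  id        int primary key,
  player_id int  not null,
  date      date not null,
  points    int  not null
);

insert into players (id, name) values (1, 'foo');

insert into scores (id, player_id, date, points) values
  (1, 1, '2010-05-20', 19),
  (2, 1, '2010-06-01', 40);

-- The question's query: valid SQL, but it groups per (name, points),
-- so foo gets one row per distinct points value (two rows here).
select p.name player, min(s.date) first_score, s.points points
from players p
join scores s on s.player_id = p.id
group by p.name, s.points;

-- With strict grouping enabled, dropping s.points from GROUP BY is
-- rejected, because s.points is neither aggregated nor grouped.
set session sql_mode = 'ONLY_FULL_GROUP_BY';

select p.name player, min(s.date) first_score, s.points points
from players p
join scores s on s.player_id = p.id
group by p.name;  -- raises an error under ONLY_FULL_GROUP_BY
```

Without ONLY_FULL_GROUP_BY the last query runs, but nothing ties the returned points value to the row holding the minimum date.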

A whole chapter is devoted to this in SQL Antipatterns.

Marcus Adams
Thanks Marcus! :) Also you can make MySQL behave more standardly with `SET SQL_MODE = ONLY_FULL_GROUP_BY`
Bill Karwin
Coincidentally, @Bill Karwin (the writer of the accepted answer to this very question) happens to be the author of that book! Small world :)
macek