views:

106

answers:

1

I'm playing around with the Lahman Baseball Database in a MySQL instance. I want to find the players who topped home runs (HR) for each year. The Batting table has the following (relevant parts) of its schema:

+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| playerID  | varchar(9)           | NO   | PRI |         |       |
| yearID    | smallint(4) unsigned | NO   | PRI | 0       |       |
| HR        | smallint(3) unsigned | YES  |     | NULL    |       |
+-----------+----------------------+------+-----+---------+-------+

For each year, every player has an entry (between hundreds and 12k per year, going back to 1871). Getting the top N hitters for a single year is easy:

SELECT playerID,yearID,HR
FROM Batting
WHERE yearID=2009
ORDER BY HR DESC LIMIT 3;
+-----------+--------+------+
| playerID  | yearID | HR   |
+-----------+--------+------+
| pujolal01 |   2009 |   47 |
| fieldpr01 |   2009 |   46 |
| howarry01 |   2009 |   45 |
+-----------+--------+------+

But I'm interested in finding the top 3 from every year. I've found solutions like this, describing how to select the top from a category and I've tried to apply it to my problem, only to end up with a query that never returns:

SELECT
    b.yearID, b.playerID, b.HR
FROM
    Batting AS b
LEFT JOIN
    Batting b2
    ON
    (b.yearID=b2.yearID AND b.HR <= b2.HR)
GROUP BY b.yearID HAVING COUNT(*) <= 3;

Where have I gone wrong?

+3  A: 

Something like this should work:

SELECT b.playerID, b.yearID, b.HR
FROM Batting b 
WHERE HR >= (
    SELECT b2.HR 
    FROM Batting b2 
    WHERE b2.yearID=b1.yearID
    ORDER BY b2.HR DESC
    LIMIT 2, 1
)
ORDER BY b.yearID DESC, b.HR DESC;

Explanation: select all rows which have >= number of home runs as the third highest for that year. This won't break ties. So if there's more than one batter with the same number of home runs, they'll all show up.

The results are the ordered from the most recent year, sub-ordered by rank for each year.

Note: LIMIT is a 0-based offset, so 2, 1 means starting after the second row grab one row, i.e.: the third row.

Ben S
Plus 1 for the LIMIT explanation.
Ryan Liang
The LIMIT arguments actually go the other way: offset, then number of rows. Also, you have an error in the subquery—`b1` should be just `b`. Other than that, this is correct. It took 4:18 to find *just* the results since 2005 (Macbook Pro, OS X 10.6.3, Core 2 2.5GHz, enough RAM to have all the data in memory), so perhaps there are some optimizations to be made.
Drew Stephens
Thanks for pointing out the limit details. I edited my answer.Is you query cache actually set to large enough to hold everything in memory though? The reason it's so slow is that row every row in Batting, it does the subquery which is quite expensive.Possible optimizations would be to add an index on yearID and adding another index on HR. A further optimization would be to make a temporary table that holds each year along with the third-highest HR count to compare against.
Ben S
Drew Stephens