views:

95

answers:

5

I need to optimize a query for a ranking that is taking forever (the query itself works, but I know it's awful and I've just tried it with a good number of records and it gives a timeout).

I'll briefly explain the model. I have 3 tables: player, team and player_team. I have players, that can belong to a team. Obvious as it sounds, players are stored in the player table and teams in team. In my app, each player can switch teams at any time, and a log has to be mantained. However, a player is considered to belong to only one team at a given time. The current team of a player is the last one he's joined.

The structure of player and team is not relevant, I think. I have an id column PK in each. In player_team I have:

id          (PK)
player_id   (FK -> player.id)
team_id     (FK -> team.id)

Now, each team is assigned a point for each player that has joined. So, now, I want to get a ranking of the first N teams with the biggest number of players.

My first idea was to get first the current players from player_team (that is one record top for each player; this record must be the player's current team). I failed to find a simple way to do it (tried GROUP BY player_team.player_id HAVING player_team.id = MAX(player_team.id), but that didn't cut it.

I tried a number of querys that didn't work, but managed to get this working.

SELECT 
    COUNT(*) AS total,
    pt.team_id,
    p.facebook_uid AS owner_uid, 
    t.color 
FROM 
    player_team pt 
JOIN player p ON (p.id = pt.player_id)  
JOIN team t ON (t.id = pt.team_id) 
WHERE 
    pt.id IN (
        SELECT max(J.id) 
        FROM player_team J 
        GROUP BY J.player_id
    )  

GROUP BY 
    pt.team_id 
ORDER BY 
    total DESC 
LIMIT 50            

As I said, it works but looks very bad and performs worse, so I'm sure there must be a better way to go. Anyone has any ideas for optimizing this?

I'm using mysql, by the way.

Thanks in advance

Adding the explain. (Sorry, not sure how to format it properly)

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   PRIMARY     t   ALL     PRIMARY     NULL    NULL    NULL    5000    Using temporary; Using filesort
1   PRIMARY     pt  ref     FKplayer_pt77082,FKplayer_pt265938,new_index    FKplayer_pt77082    4   t.id    30  Using where
1   PRIMARY     p   eq_ref  PRIMARY     PRIMARY     4   pt.player_id    1
2   DEPENDENT SUBQUERY  J   index   NULL    new_index   8   NULL    150000  Using index
+1  A: 

I sometimes find that more complex queries in MySQL need to be broken into two pieces.

The first piece would pull the data required into a temporary table and the second piece would be the query that attempts to manipulate the dataset created. Doing this definitely results in a significant performance gain.

John M
Thanks. This one of the first ideas that came to my mind (but with an actual table). The other option I'm considering is having a flag to mark a player_team relation as current / active.
Juan Pablo Califano
+2  A: 

Its the subquery that is killing it - if you add a current field on the player_team table, where you give it value = 1 if it is current, and 0 if it is old you could simplify this alot by just doing:

SELECT 
    COUNT(*) AS total,
    pt.team_id,
    p.facebook_uid AS owner_uid, 
    t.color 
FROM 
    player_team pt 
JOIN player p ON (p.id = pt.player_id)  
JOIN team t ON (t.id = pt.team_id) 
WHERE 
    player_team.current = 1 
GROUP BY 
    pt.team_id 
ORDER BY 
    total DESC 
LIMIT 50  

Having multiple entries in the player_team table for the same relationship where the only way to distinguish which one is the 'current' record is by comparing two (or more) rows I think is bad practice. I have been in this situation before and the workarounds you have to do to make it work really kill performance. It is far better to be able to see which row is current by doing a simple lookup (in this case, where current=1) - or by moving historical data into a completely different table (depending on your situation this might be overkill).

Mailslut
Thanks. I'm considering adding that column. Just want to see if there are other options.
Juan Pablo Califano
Along with the current flag, you can add two more columns, activate_datetime and inactivate_datetime this way you will know, when the actual transition was happened.
Nitin Midha
@Nitin Midha. Thanks for the suggestion. I actually do have a "created" column to store the timestamp of the insertion of the row (which is the time the player joined the team). I just tried to leave less important stuff out of the post just not to add too much clutter.
Juan Pablo Califano
+2  A: 

Try this:

SELECT  t.*, cnt
FROM    (
        SELECT  team_id, COUNT(*) AS cnt
        FROM    (
                SELECT  player_id, MAX(id) AS mid
                FROM    player_team
                GROUP BY
                        player_id
                ) q
        JOIN    player_team pt
        ON      pt.id = q.mid
        GROUP BY
                team_id
        ) q2
JOIN    team t
ON      t.id = q2.team_id
ORDER BY
        cnt DESC
LIMIT 50

Create an index on player_team (player_id, id) (in this order) for this to work fast.

Quassnoi
Thanks Quassnoi. I think you meant pt.id = q.mid in the ON condition; change that and worked. I tried this and the results came up really fast. Have not checked if the results are correct yet, but will do as soon as possible. Thanks again!
Juan Pablo Califano
+1 for remembering indexes
Mailslut
Sorry, I meant this secondo ON condition, which should read "t.id= q2.team_id" instead of "t.team_id = q2.team_id"
Juan Pablo Califano
@Quassnoi. I was in a bit of a rush yesterday so I couldn't really try you solution. But today I did and it works great. Results are correct and the query runs fast with my test data (150000 records in player_team, which is frankly quite optimistic for this app). Furthermore, your query has helped me as a template for writting other related queries (listing the player details for a given team, getting player counts, etc). I'm glad I didn't have to add (and mantain) an extra "current" flag, so one thing less to worry about. So, thanks again, your answer has been really helpful.
Juan Pablo Califano
A: 

This will get the current teams with colours ordered by size:

  SELECT team_id, COUNT(player_id) c AS total, t.color 
    FROM player_team pt JOIN teams t ON t.team_id=pt.team_id  
    GROUP BY pt.team_id WHERE current=1
    ORDER BY pt.c DESC
    LIMIT 50;

But you've not given a condition for which player should be considered owner of the team. Your current query is arbitrarily showing one player as owner_id because of the grouping, not because that player is the actual owner. If your player_team table contained an 'owner' column, you could join the above query to a query of owners. Something like:

SELECT o.facebook_uid, a.team_id, a.color, a.c
FROM player_teams pt1 
  JOIN players o ON (pt1.player_id=o.player_id AND o.owner=1)
  JOIN (...above query...) a
    ON a.team_id=pt1.team_id;
dnagirl
A: 

You could add a column "last_playteam_id" to player table, and update it each time a player changes his team with the pk from player_team table.

Then you can do this:

SELECT 
    COUNT(*) AS total,
    pt.team_id,
    p.facebook_uid AS owner_uid, 
    t.color 
FROM 
    player_team pt 
JOIN player p ON (p.id = pt.player_id)  and p.last_playteam_id = pt.id
JOIN team t ON (t.id = pt.team_id) 
GROUP BY 
    pt.team_id 
ORDER BY 
    total DESC 
LIMIT 50   

This could be fastest because you don't have to update the old player_team rows to current=0.

You could also add instead a column "last_team_id" and keep it's current team there, you get the fastest result for the above query, but it could be less helpful with other queries.

ceteras