views: 254
answers: 3
This is a query that totals up each player's game results across all their games and displays the players who match the conditions.

select   *, 
         (kills / deaths)  as killdeathratio, 
         (totgames - wins) as losses 
from     (select   gp.name              as name, 
                   gp.gameid            as gameid, 
                   gp.colour            as colour, 
                   Avg(dp.courierkills) as courierkills, 
                   Avg(dp.raxkills)     as raxkills, 
                   Avg(dp.towerkills)   as towerkills, 
                   Avg(dp.assists)      as assists, 
                   Avg(dp.creepdenies)  as creepdenies, 
                   Avg(dp.creepkills)   as creepkills, 
                   Avg(dp.neutralkills) as neutralkills, 
                   Avg(dp.deaths)       as deaths, 
                   Avg(dp.kills)        as kills, 
                   sc.score             as totalscore, 
                   Count(* )            as totgames, 
                   Sum(case 
                         when ((dg.winner = 1 and dp.newcolour < 6) or 
                               (dg.winner = 2 and dp.newcolour > 6)) 
                         then 1 
                         else 0 
                       end) as wins 
          from     gameplayers as gp, 
                   dotagames   as dg, 
                   games       as ga, 
                   dotaplayers as dp, 
                   scores      as sc 
          where    dg.winner <> 0 
                   and dp.gameid = gp.gameid 
                   and dg.gameid = dp.gameid 
                   and dp.gameid = ga.id 
                   and gp.gameid = dg.gameid 
                   and gp.colour = dp.colour 
                   and sc.name   = gp.name 
          group by gp.name 
          having   totgames >= 30
          ) as h 
order by totalscore desc

I'm not too sure of the best way to go here; what, in your opinion, would be the best way to optimize this query?

I run a Q6600 @ 2.4 GHz with 4 GB of RAM on 64-bit Ubuntu 9.04, and this query can take up to 6.7 seconds to run (I do have a huge database).

I would also like to paginate the results, but executing extra conditions on top of this query is far too slow.

I use Django as a frontend, so any approach involving Python and/or Django would be great. MySQL and Apache2 tweaks are also welcome. And of course, I'm open to changing the query itself to make it run faster.

Thanks for reading my question; look forward to reading your answers!

Edit: EXPLAIN QUERY RESULTS

id  select_type  table  type  possible_keys  key  key_len  ref  rows  Extra 
1   PRIMARY  <derived2>  ALL   NULL   NULL   NULL   NULL  783  Using filesort
2   DERIVED  sc  ALL  name,name_2  NULL  NULL  NULL  2099  Using temporary; Using filesort
2   DERIVED  gp  ref  gameid,colour,name  name  17  development.sc.name  2   
2   DERIVED  ga  eq_ref  PRIMARY,id,id_2  PRIMARY  4  development.gp.gameid  1  Using index
2   DERIVED  dg  ref  gameid,winner  gameid  4  development.ga.id  1  Using where
2   DERIVED  dp  ref  gameid_2,colour  gameid_2  4  development.ga.id  10  Using where
A: 

First of all, the SQL is awkwardly formatted. The most obvious problem is the line splitting before each AS clause. The second is the use of implicit comma joins instead of explicit INNER JOIN ... ON ... clauses.

Now to answer the actual question.

Without knowing the data or the environment, the first thing I'd look at would be some of the MySQL server settings, such as sort_buffer_size and key_buffer_size. If you haven't changed any of these, go read up on them. The defaults are extremely conservative and can often be raised to more than ten times their defaults, particularly on relatively big iron like yours.
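For instance, a my.cnf fragment might look like the following (the values are purely illustrative starting points, not recommendations; measure before and after changing them):

```ini
# /etc/mysql/my.cnf -- illustrative values only, tune to your workload
[mysqld]
key_buffer_size     = 256M  # index block cache for MyISAM tables
sort_buffer_size    = 4M    # per-connection buffer used by filesort
tmp_table_size      = 64M   # in-memory temporary table limit
max_heap_table_size = 64M   # must be raised together with tmp_table_size
```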

Having reviewed that, I'd run pieces of the query to check their speed and what EXPLAIN says about them. The effect of indexing can be profound, but MySQL has a "fingers-and-toes" problem where it generally can't use more than one index per table in a query, while a JOIN with filtering can need two, so it has to fall back to a row scan for the other check. Having said that, dicing up the query and trying different combinations will show you where it starts stumbling.

Now you will have an idea where a "tipping point" might be: this is where a small increase in some raw data size, like how much it needs to extract, results in a big loss of performance as some internal structure gets too big. At this point, you will probably want to raise the temporary table size (tmp_table_size and max_heap_table_size). Beware that this kind of optimization is a bit of a black art. :-)

However, there is another approach: denormalization. In a simple implementation, regularly scheduled scripts run this expensive query from time to time and poke the data into a separate table in a structure much closer to what you want to display. There are multiple variations of this approach. It can be possible to keep this up-to-date on the fly, either in the application or using table triggers. At the other extreme, you could allow your application to run the expensive query occasionally, but cache the result for a little while. This is most effective if a lot of people will call it often: even a 2-second cache on a request that runs 15 times a second will show a visible improvement.
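The short-lived cache idea can be sketched in plain Python (in Django you would more likely use django.core.cache; make_cached and run_expensive_query are hypothetical names, and the latter stands in for the ranking query above):

```python
import time

calls = []

def run_expensive_query():
    # Hypothetical stand-in for the 6.7-second ranking query.
    calls.append(1)
    return [("player1", 42)]

def make_cached(fn, ttl_seconds):
    """Return a wrapper that reuses fn()'s last result for ttl_seconds,
    so the expensive query runs at most once per TTL window."""
    state = {"expires": 0.0, "value": None}

    def cached():
        now = time.monotonic()
        if now >= state["expires"]:
            state["value"] = fn()              # run the real query
            state["expires"] = now + ttl_seconds
        return state["value"]

    return cached

get_rankings = make_cached(run_expensive_query, ttl_seconds=2.0)
get_rankings()
get_rankings()   # within the TTL: served from cache, query not re-run
```

Under load, this caps the query rate at one execution per TTL window no matter how many requests arrive.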

You could find ways of producing the same data by running half a dozen queries that each return some of the data, and post-processing the results. You could also run a version of your original query that returns more data (which is likely to be much faster because it does less filtering) and post-process that. I have found several times that five simpler, smaller queries can be much faster - an order of magnitude, sometimes two - than one big query that is trying to do it all.

staticsan
A: 

No index will help you since you are scanning entire tables. As your database grows the query will always get slower.

Consider accumulating the stats: after every game, insert the row for that game and also increment counters in the player's row. Then you don't need COUNT() and SUM() at query time, because the information is already available.
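A minimal sketch of this, using sqlite3 for portability (the playerstats table, record_game helper, and the simplified schema are assumptions for illustration, not the OP's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dotaplayers (gameid INTEGER, name TEXT,
                              kills INTEGER, deaths INTEGER, win INTEGER);
    -- one pre-aggregated row per player, maintained as games arrive
    CREATE TABLE playerstats (name TEXT PRIMARY KEY, totgames INTEGER,
                              wins INTEGER, kills INTEGER, deaths INTEGER);
""")

def record_game(gameid, name, kills, deaths, win):
    """Insert the per-game row AND bump the running totals, so the
    rankings query never needs COUNT()/SUM() over the whole table."""
    conn.execute("INSERT INTO dotaplayers VALUES (?,?,?,?,?)",
                 (gameid, name, kills, deaths, win))
    conn.execute("""
        INSERT INTO playerstats VALUES (?, 1, ?, ?, ?)
        ON CONFLICT(name) DO UPDATE SET
            totgames = totgames + 1,
            wins     = wins   + excluded.wins,
            kills    = kills  + excluded.kills,
            deaths   = deaths + excluded.deaths
    """, (name, win, kills, deaths))

record_game(1, "alice", 10, 2, 1)
record_game(2, "alice", 5, 3, 0)
row = conn.execute("SELECT totgames, wins, kills, deaths "
                   "FROM playerstats WHERE name = 'alice'").fetchone()
```

Reading the rankings then becomes a cheap single-table select over playerstats instead of a five-way join with GROUP BY.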

peufeu
A: 
  • SELECT * is bad most of the time - select only the columns you need
  • break the select into multiple simple selects, and use temporary tables where needed
  • the SUM(CASE ...) part could be done with a subselect
  • MySQL performs badly with OR expressions; use two selects that you UNION together
codymanix