views:

171

answers:

7

I am using the following MySQL query in a PHP script on a database that contains over 300,000,000 (yes, three hundred million) rows. I know that it is extremely resource intensive and it takes ages to run this one query. Does anyone know how I can either optimise the query or get the information in another way that's quicker?

I need to be able to use any integer between 1 and 15 in place of the 14 in MID(). I also need to be able to match strings of lengths within the same range in the LIKE clause.

Table Info:

games | longint, unsigned, Primary Key
win   | bit(1)
loss  | bit(1)

Example Query:

SELECT MID(`game`,14,1) AS `move`,
       COUNT(*) AS `games`,
       SUM(`win`) AS `wins`,
       SUM(`loss`) AS `losses`
FROM `games`
WHERE `game` LIKE '1112223334%'
GROUP BY MID(`game`,1,14)

Thanks in advance for your help!

+4  A: 

First, have an index on the game field... :)

The query seems simple and straightforward, but it hides that fact that a datasbase design change is probably required.

In such cases I always prefer to maintain a field that holds aggregated data, either per day, per user, or per any other axis. This way you can have a daily task that aggregates the relevant data and saves it in the database.

If indeed you call this query often, you should use the principle of decreasing the efficiency of insertion for increasing the efficiency of retrieval.

Roee Adler
+1  A: 

The query is simple and, aside from making sure there are all the necessary indexes ("game" field obviously), there may be no obvious way to make it faster by rewriting the query only. Some modification of data structures will probably be necessary.

One way: precalculate the sums. Each of these records will most likely have a create_date or an autoincremented key field. Precalculate the sums for all records, where this field is ≤ some X, put results in a side table, and then you only need to calculate for all records > X, then summarize these partial results with your precalculated ones.

ttarchala
+1  A: 

It looks like the game column is storing two (or possibly more) different things that this query is using:

  1. Filtering by the start of game (first 10 characters)
  2. Grouping by and returning MID(game,1,14) (I'm assuming one of the MID expressions is a typo.

I'd split that up so that you don't have to use string operations on the game column, and also put indexes on the new columns so you can filter and group them properly.

This query is doing a lot of conversions (long to string) and string manipulations that wouldn't be necessary if the table were normalized (as in one piece of information per column instead of multiple like it is now).

Leave the game column the way it is, and create a game_filter string column based on it to use in your WHERE clause. Then set up a game_group column and populate it with the MID expression on insert. Set up these two columns as your clustered index, first game_filter, then game_group.

Welbog
A: 
SELECT  MID(`game`,14,1) AS `move`,
        COUNT(*) AS `games`,
        SUM(`win`) AS `wins`,
        SUM(`loss`) AS `losses`
FROM    `games`
WHERE   `game` LIKE '1112223334%'

Create an index on game:

CREATE INDEX ix_games_game ON games (game)

and rewrite your query as this:

SELECT  move,
        (
        SELECT  COUNT(*)
        FROM    games
        WHERE   game >= move
                AND game < CONCAT(SUBSTRING(move, 1, 13), CHR(ASCII(SUBSTRING(move, 14, 1)) + 1))
        ),
        (
        SELECT  SUM(win)
        FROM    games
        WHERE   game >= move
                AND game < CONCAT(SUBSTRING(move, 1, 13), CHR(ASCII(SUBSTRING(move, 14, 1)) + 1))
        ),
        (
        SELECT  SUM(lose)
        FROM    games
        WHERE   game >= move
                AND game < CONCAT(SUBSTRING(move, 1, 13), CHR(ASCII(SUBSTRING(move, 14, 1)) + 1))
        )
FROM    (
        SELECT  DISTINCT SUBSTRING(q.game, 1, 14) AS move
        FROM    games
        WHERE   game LIKE '1112223334%'
        ) q

This will help to use the index on game more efficiently.

Quassnoi
Why drop the GROUP BY clause? He wants the COUNT and SUM split out by the 14th digit of the game-column.
mlarsen
@mlarsen: I didn't get it first and deleted the answer. Now it's all rewritten.
Quassnoi
+1  A: 

You could precompute the MID(game,14,1) and MID(game,1,14) and store the first ten digits of the game in a separate gameid column which is indexed.

It might also be an idea to investigate if you could just store an aggregate table of the precomputed values so you increment the count and wins or losses column on insert instead.

mlarsen
A: 

Please note I have updated the info above...

A: 

Can you cache the result set with Memcache or something similar? That would help with repeated hits. Even if you only cache the result set for a few seconds, you might be able to avoid a lot of DB reads.

Ted Pennings