ansaurus

Question

Answer 1

+3 A:

Method 1 - improving the original query

I am pretty sure that in this case is an INNER JOIN will suffice, there is no reason to do a RIGHT JOIN (if an id exists in comm it will also exists in comments). INNER JOINs can result in better performance.

Moreover, you really want to push the LIMIT 10 inside comm (incidentally, keeping it together with the ORDER BY):

for one, not keeping LIMIT 10 and ORDER BY together will not get you the ten most recently posted-to topics (the ordering of the comm subquery will not necessarily be preserved into the final result which you are LIMITing.)
additionally, enforcing the LIMIT inside the innermost, aggregate subquery will encourage cost-based optimizers to favour nested loops (10, to be exact) over hash or merge joins (the 10 nested loops being by far the fastest for any respectably-sized comments table.)

So, your query should be rewritten as:

SELECT `comments`.* FROM `comments` 
INNER JOIN (
 SELECT MAX( id ) AS id, core_id, topic_id 
 FROM comments
 GROUP BY core_id, topic_id
 ORDER BY id DESC
 LIMIT 10
) comm 
ON comm.id = comments.id
ORDER BY comments.id

Finally, use EXPLAIN to see what the query is doing. Do not forget to check that you have created an index on comments.id to help with the JOIN nested loops.

Method 2 - a different approach

Note that while the above query could still be faster than your original query, the innermost comm subquery may still turn out to be a significant bottleneck if it results in a full table scan of comments. This really depends on how smart the database is when it sees GROUP BY, ORDER BY and LIMIT together.

If EXPLAIN indicates that the subquery is doing a table scan, then you can try a combination of SQL and application-level logic to get the best performance assuming that I have understood your requirement correctly and that you want to identify the ten most recent comments posted in ten different topics:

# pseudo-code
core_topics_map = { }
command = "SELECT * FROM comments ORDER BY id DESC;"
command.execute
# iterate over the result set, betting that we will be able to break
#  early, bringing only a handful of rows over from the database server
while command.fetch_next_row do
  # have we identified our 10 most recent topics?
  if core_topics_map.size >= 10 then
    command.close
    break
  else
    core_topic_key = pair(command.field('core_id'), command.field('topic_id'))
    if not defined?(core_topics_map[core_topic_key]) then
      core_topics_map[core_topic_key] = command.field('id')
    end
  end
done
# sort our 10 topics in reverse chronological order
sort_by_values core_topics_map

In most cases (that is, provided that your application's database driver does not attempt to buffer all rows into memory for you before giving you back control from execute), the above will only fetch a handful of rows, always using an index, with no table scans involved.

Method 3 - a hybrid approach

If ten seconds ago I knew what the ten most recent comments were, can I be smart about it when I ask the question again later on? Unless comments can be deleted from the database, then the answer is yes, because I know that, when I ask the question again, all comment IDs will be greatly than or equal to the oldest comment ID I got in my last query.

I can therefore rewrite the innermost query to be much, much more selective using an additional condition, WHERE id >= :last_lowest_id:

SELECT `comments`.* FROM `comments` 
INNER JOIN (
 SELECT MAX( id ) AS id, core_id, topic_id 
 FROM comments
 WHERE id >= :last_lowest_id
 GROUP BY core_id, topic_id
 ORDER BY id DESC
 LIMIT 10
) comm 
ON comm.id = comments.id
ORDER BY comments.id

When you run the query for the very first time, use 0 for :last_lowest_id. The query will return up to 10 rows, in descending order. Inside your application, put aside the id of the last row, and reuse its value as :last_lowest_id the next time you run the query, and repeat (again, put aside the id of the last row returned by the latest query etc.) This will essentially make the query incremental, and extremely fast.

Example:

running the query for the first time with :last_lowest_id set to 0
- returns 10 rows with IDs: 129, 100, 99, 88, 83, 79, 78, 75, 73, 70
- save 70
running the query for the second time with :last_lowest_id set to 70
- returns 10 rows with IDs: 130, 129, 100, 99, 88, 83, 79, 78, 75, 73
- save 73
etc.

Method 4 - yet another approach

If you expect to perform SELECT ... ORDER BY id DESC LIMIT 10 much more often than INSERTs into the comments table, consider putting a bit more work into the INSERT to make the SELECT faster. Thus, you can add an indexed updated_at column to your topics etc. table, and whenever you INSERT a comment into the comments table consider also updating the corresponding topic's updated_at value to NOW(). You can then easily select the 10 most recently updated topics (a simple and short index scan on updated_at returning 10 rows), inner joining with the comments table to get the MAX(id) for those 10 topics (infinitely more efficient than getting MAX(id) for all topics before picking the ten greatest, like in the original and Method 1), then inner joining again on comments to get the rest of thecolumn values for those 10.

I expect overall performance of Method 4 to be comparable with Methods 2 and 3. Method 4 will have to be used if you need to get arbitrary topics (e.g. by paginating them, LIMIT 10 OFFSET 50) or if topics or comments can be removed (no changes necessary to support topic removal; to support comment removal properly then the topic's updated_at should be updated on both comment INSERT and DELETE with the created_at value of the latest non-deleted comment for the topic.)

Cheers, V.

vladr 2010-03-26 23:58:42

Wonderful answer! +1

Nicolò Martini 2010-03-27 13:53:33

ansaurus

tags:

views:

answers:

Better way to do SELECT with GROUP BY

Method 1 - improving the original query

Method 2 - a different approach

Method 3 - a hybrid approach

Method 4 - yet another approach

related questions