views: 232
answers: 7

I'm still learning about MySQL. I may be making a very basic error, and I'm prepared to be chastened here...

What this query is trying to do is select the top members from our website based on a count of the number of book and recipe reviews they have made.

I'm calculating the total in the SQL query itself. The query is slow (9 seconds) and clearly won't scale: we only have 400 members and a few thousand reviews so far, and the site is growing quickly.

I presume it's doing a full table scan here, and that the calculation is slowing it down, but I don't know of an alternative way to do this and would love some wisdom.

Here's the SQL statement:

SELECT users.*,
       COUNT( DISTINCT bookshelf.ID ) AS titles,
       COUNT( DISTINCT book_reviews.ID ) AS bookreviews,
       COUNT( DISTINCT recipe_reviews.ID ) AS numreviews,
       COUNT( DISTINCT book_reviews.ID ) + COUNT( DISTINCT recipe_reviews.ID ) AS reviewtotal
FROM users
LEFT OUTER JOIN recipe_reviews ON recipe_reviews.user_id = users.ID
LEFT OUTER JOIN book_reviews ON book_reviews.user_id = users.ID
LEFT OUTER JOIN bookshelf ON users.ID = bookshelf.user_id
GROUP BY users.ID
ORDER BY reviewtotal DESC
LIMIT 8

Here is the EXPLAIN output:

+----+-------------+----------------+-------+-------------------+-------------------+---------+---------------------+------+---------------------------------+
| id | select_type | table          | type  | possible_keys     | key               | key_len | ref                 | rows | Extra                           |
+----+-------------+----------------+-------+-------------------+-------------------+---------+---------------------+------+---------------------------------+
|  1 | SIMPLE      | users          | index | NULL              | PRIMARY           | 4       | NULL                |  414 | Using temporary; Using filesort | 
|  1 | SIMPLE      | recipe_reviews | ref   | recipe_reviews_fk | recipe_reviews_fk | 5       | users.ID            |   12 |                                 | 
|  1 | SIMPLE      | book_reviews   | ref   | user_id           | user_id           | 5       | users.ID            |    4 |                                 | 
|  1 | SIMPLE      | bookshelf      | ref   | recipe_reviews_fk | recipe_reviews_fk | 5       | users.ID            |   13 |                                 | 
+----+-------------+----------------+-------+-------------------+-------------------+---------+---------------------+------+---------------------------------+

UPDATE & SOLVED:

I realized, and @recursive confirmed, that the query itself is the root of the problem: the three joins produce a Cartesian product. I rewrote it as a series of subqueries, and the final working code is here:

SELECT  *, bookreviews + recipereviews AS totalreviews
FROM (SELECT users.*,
            (SELECT count(*) FROM bookshelf WHERE bookshelf.user_id = users.ID) as titles,
            (SELECT count(*) FROM book_reviews WHERE book_reviews.user_id = users.ID) as bookreviews,
            (SELECT count(*) FROM recipe_reviews WHERE recipe_reviews.user_id = users.ID) as recipereviews
    FROM users) q

This gives me a result in milliseconds. There are also ways to do this with JOINs. See http://stackoverflow.com/questions/2042414/how-to-add-together-the-results-of-several-subqueries if you want to follow this up.

+3  A: 

For features like this, it always helps to use some kind of caching.

It might already be enough to compute the sums for all users on a nightly basis and store them with each user record. That alone will speed things up considerably.

You should also cache the result of this query for a minute or five, since the same query runs for everyone regardless of who is logged in.
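A minimal sketch of the nightly batch described here, using Python's sqlite3 in place of MySQL; `review_total` is an assumed column name, not one from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE users (ID INTEGER PRIMARY KEY, name TEXT,
                    review_total INT DEFAULT 0);
CREATE TABLE book_reviews (ID INTEGER PRIMARY KEY, user_id INT);
CREATE TABLE recipe_reviews (ID INTEGER PRIMARY KEY, user_id INT);
INSERT INTO users (ID, name) VALUES (1, 'bob'), (2, 'jane');
INSERT INTO book_reviews (user_id) VALUES (1), (1), (2);
INSERT INTO recipe_reviews (user_id) VALUES (2), (2);
""")

# The nightly job: one UPDATE refreshes every user's cached total.
c.execute("""
UPDATE users SET review_total =
    (SELECT count(*) FROM book_reviews WHERE user_id = users.ID)
  + (SELECT count(*) FROM recipe_reviews WHERE user_id = users.ID)
""")

# The page itself then needs only a trivial indexed/sorted read.
top = c.execute(
    "SELECT name, review_total FROM users ORDER BY review_total DESC"
).fetchall()
# top -> [('jane', 3), ('bob', 2)]
```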

Ralf
I suggest you also store an "as of" date alongside that batch-calculated sum.
Mike Atlas
A: 

I often find that creating a smaller temporary table from a larger table gives noticeable speed benefits.

So the basic process:

  1. store query (with joins) into temporary table
  2. run count/summary queries on temporary table
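The two steps above can be sketched with Python's sqlite3 (the tiny schema and sample data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE users (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book_reviews (ID INTEGER PRIMARY KEY, user_id INT);
INSERT INTO users VALUES (1, 'bob'), (2, 'jane');
INSERT INTO book_reviews (user_id) VALUES (1), (1), (2);
""")

# Step 1: store the query (with joins) into a temporary table.
c.execute("""
CREATE TEMPORARY TABLE tmp_reviews AS
SELECT users.ID AS user_id, users.name, book_reviews.ID AS review_id
FROM users
LEFT JOIN book_reviews ON book_reviews.user_id = users.ID
""")

# Step 2: run the count/summary query on the smaller temporary table.
counts = c.execute("""
SELECT name, COUNT(review_id)
FROM tmp_reviews
GROUP BY user_id
ORDER BY user_id
""").fetchall()
# counts -> [('bob', 2), ('jane', 1)]
```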
John M
+2  A: 

You might try seeing if there is an improvement from removing the DISTINCT modifiers. Assuming the DISTINCTed fields are primary keys anyway, this could be causing unnecessary work.

recursive
I tried this and ended up with a count of thousands for each of the count fields.
mandel
It sounds like you might have duplicate records in your database. Have you checked your tables to see if they make sense?
recursive
I'll review the tables to make sure - perhaps I need to make primary keys on some of these as combinations of fields rather than a straight ID. Bookshelf, for example, has ID, user_id, cookbook_id. The combo of user_id and cookbook_id should be unique...
mandel
A single column "artificial" primary key is not a bad idea, but they should already be guaranteed to be unique, so the fact that you get different results after removing the DISTINCT suggests database problems to me.
recursive
There are no other database issues that I've found so far - every other query using these tables works as expected. I'm going to review the logic here and see if I'm missing something. I'm also going to try a caching/calculated field.
mandel
I figured it out. You're doing a cartesian join across the reviews and the bookshelf. If you used a separate query for each type of review, I suspect your query would speed up by orders of magnitude.
recursive
YES! This is the problem. The only question left is how to add up the results to get a field which aggregates book reviews and recipe reviews.
mandel
+2  A: 

Index all tables on user_id. This could easily speed this query up by orders of magnitude if it hasn't been done yet.
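For reference, a small sqlite3 illustration (standing in for MySQL) that an index on user_id lets this kind of count be answered from the index rather than a full table scan; the index name here is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE book_reviews (ID INTEGER PRIMARY KEY, user_id INT);
CREATE INDEX idx_book_reviews_user ON book_reviews (user_id);
""")

# Ask the planner how it would execute the per-user count.
plan = c.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT COUNT(*) FROM book_reviews WHERE user_id = 1"
).fetchall()
# The plan's detail text names the index instead of a table scan.
```

The MySQL syntax is the same (`CREATE INDEX idx_name ON table (column)`), and its EXPLAIN output should likewise switch from `ALL` (full scan) to `ref` on the new index.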

recursive
Alas, there are already indexes on each of the user_id fields.
mandel
A: 

Why not just store the number of reviews per user as a column in the users table? Every new review a user posts should then also increment the review count on their user record by one.

For example:

user_id user_name number_of_reviews
1       bob       5
2       jane      10

Bob puts a new review in, and you up his number to 6:

review_id user_id review_text
16        1       "Great!"

user_id user_name number_of_reviews
1       bob       6
2       jane      10

Now you can simply get the top 5 reviewers like this:

SELECT * FROM users ORDER BY number_of_reviews DESC LIMIT 5
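A sketch of this counter-column approach, using Python's sqlite3 and the column names from the example above. Wrapping the insert and the increment in one transaction keeps the count from drifting if either statement fails:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT,
                    number_of_reviews INT DEFAULT 0);
CREATE TABLE reviews (review_id INTEGER PRIMARY KEY, user_id INT,
                      review_text TEXT);
INSERT INTO users VALUES (1, 'bob', 5), (2, 'jane', 10);
""")

def add_review(user_id, text):
    # One transaction: both statements commit together, or neither does.
    with conn:
        conn.execute(
            "INSERT INTO reviews (user_id, review_text) VALUES (?, ?)",
            (user_id, text))
        conn.execute(
            "UPDATE users SET number_of_reviews = number_of_reviews + 1 "
            "WHERE user_id = ?", (user_id,))

add_review(1, "Great!")  # bob goes from 5 to 6

top = conn.execute(
    "SELECT user_name, number_of_reviews FROM users "
    "ORDER BY number_of_reviews DESC LIMIT 5"
).fetchall()
# top -> [('jane', 10), ('bob', 6)]
```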
Mike Atlas
I'd considered something like this early in my design of the website and been told (on SO) that I shouldn't rely on incremented columns over queries. But that may have been a more general caution as I had started using incremented columns for a number of things.
mandel
I can't think of a scenario in your design where this (number_of_reviews) would be risky. If it represented real physical inventory or money amounts, I would advise a bit more caution, but otherwise this should suffice. Don't make it hard on yourself!
Mike Atlas
Also, if you were ever in doubt that the count was off, you could recalculate the number of reviews per user on an "offline" database copy to see if there is any difference, by doing what you were doing above (count * with a join).
Mike Atlas
Incremented counter columns are generally a terrible idea that inevitably leads to data inconsistency. They might be considered on very large databases with many reads, but your DB is small, so you should certainly avoid them.
eckesicle
Can you give some examples of how in this case it would "lead to" data inconsistency?
Mike Atlas
I will dismiss your comment as a drive-by job then...
Mike Atlas
+1  A: 

You are trying to accomplish too many things with this query, and I see problems with your db / query design. Why do you have a user_id in bookshelf? How about the following table structure:

CREATE TABLE users (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(20) NOT NULL,
  PRIMARY KEY (id)
)

CREATE TABLE recipe_reviews (
  id INT NOT NULL AUTO_INCREMENT,
  review VARCHAR(20),
  user_id INT,
  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES users(id)
)

CREATE TABLE bookshelf (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(20) NOT NULL,
  PRIMARY KEY (id)
)

CREATE TABLE book_reviews (
  id INT NOT NULL AUTO_INCREMENT,
  review VARCHAR(20),
  user_id INT,
  bookshelf_id INT,
  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES users(id),
  FOREIGN KEY (bookshelf_id) REFERENCES bookshelf(id)
)

If you want to aggregate on users, here is your query :

SELECT users.*,
       COUNT( book_reviews.ID ) AS bookreviews,
       COUNT( recipe_reviews.ID ) AS recipereviews,
       COUNT( book_reviews.ID ) + COUNT( recipe_reviews.ID ) AS reviewtotal
    FROM users
    LEFT OUTER JOIN recipe_reviews ON recipe_reviews.user_id = users.ID
    LEFT OUTER JOIN book_reviews ON book_reviews.user_id = users.ID
    GROUP BY users.ID
    ORDER BY reviewtotal DESC

You could also aggregate on both users and books, but then including recipe_reviews doesn't make sense.

PS: you don't need the DISTINCT, as the keys already take care of uniqueness.

Dave.Sol
Thanks for your thoughts. However, there's a user_id in bookshelf because each user has their own bookshelf to which they can add any book on the site, so there has to be an association with the user_id to find out how many books each user has on their shelf. As for the foreign keys, I'm using MyISAM tables for these, so I can't use FKs. Would switching to InnoDB and FKs make a real difference to performance?
mandel
Foreign keys would be a hit on the performance in general because of the constraint checks that have to be performed during insert (and possibly update/delete). But specifically for data retrieval with this query, I don't see any difference as you have the indices. I would go for InnoDB though - at least for the purposes of data integrity.
Dave.Sol
+2  A: 

You need to create indexes on user_id (clustered indexes, if possible).

Are you sure you have done this? Remember that having a foreign key does not automatically generate an index on that key.

If you are joining 4 B-Trees of 1k rows each, this should surely not take 9s, but a few milliseconds.

The long execution time indicates that you are performing table scans for each user.

I'm pretty convinced this is the correct answer.

Your query is fine except that you are COUNTing your reviews twice; build reviewtotal from the bookreviews and numreviews counts instead of counting them again.

eckesicle