I am using MySQL (MyISAM) 5.0.41 and I have this query:

SELECT `x`.`items`.id, `x`.`items`.name, COUNT(*) AS count
    FROM `x`.`items` INNER JOIN `x`.`user_items`
    ON `x`.`items`.id = `x`.`user_items`.item_id
    GROUP BY name HAVING count > 2 ORDER BY count DESC

I have about 36,000 users, 175,000 user_items, and 60,000 items, and rows are constantly being added. So this query is getting a bit slow...

Is it better to:

  • Have a count field in items and update that periodically (say each time a user adds an item)
  • Or keep running the query like this (slowly)...

Or is there any SQL that will populate the count field for me?

Thanks

A: 

My impulse would be to leave the data in something like normal form (in other words, do not increment a "count" field), and then cache the result of the slow query at the application level.

If caching is ineffective because many people are doing the query and few of them do it twice, then, yes, you can set up a stored procedure that automatically updates some row in some table. The details vary depending on the DB vendor. Here's how to do it in PostgreSQL. Doing it within the DB, and not from the application layer, is the only safe way due to race conditions.
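
In MySQL (5.0.2 and later) the same idea can be expressed with triggers. A minimal sketch, assuming you add a hypothetical `count` column to `items` (the trigger names are arbitrary):

    -- Sketch only: keep items.count in sync as user_items rows come and go.
    -- Assumes items has a `count` INT NOT NULL DEFAULT 0 column (not in the original schema).
    CREATE TRIGGER user_items_ai AFTER INSERT ON user_items
    FOR EACH ROW
        UPDATE items SET `count` = `count` + 1 WHERE id = NEW.item_id;

    CREATE TRIGGER user_items_ad AFTER DELETE ON user_items
    FOR EACH ROW
        UPDATE items SET `count` = `count` - 1 WHERE id = OLD.item_id;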

Jonathan Feinberg
I think caching seems like the best solution. I'm a little unclear on how/when to cache? As a cron job? Once an hour or something like that? If this helps; I'm using Django.
betamax
A: 

Are you really getting all 36,000 users every time that you run your query? If you're looking to find the source of a performance issue then that could be it right there.

Depending on your RDBMS you could look at things like indexed or materialized views. Including the count as part of the table and trying to maintain it will almost certainly be a mistake, especially with the small size of your database.

Tom H.
+2  A: 

You should start by indexing `user_items.item_id` and grouping on it instead of `name`. Strings are much slower to group by (try it for yourself), and the index should speed things up a bit more. If that is still too slow, you can run the GROUP BY as a subquery first and then join on the items table, in case your DBMS's execution plan isn't already doing that by default.
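
For example, a sketch of the regrouped query against the question's schema (MySQL without ONLY_FULL_GROUP_BY still lets you select `name` while grouping on `item_id`):

    SELECT  i.id, i.name, COUNT(*) AS count
    FROM    items i
    INNER JOIN user_items ui ON i.id = ui.item_id
    GROUP BY ui.item_id
    HAVING  count > 2
    ORDER BY count DESC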

NickLarsen
I grouped by `item_id` and it increased the speed by about 250ms. What exactly do you mean by indexing `user_items.item_id`?
betamax
Check out http://dev.mysql.com/doc/refman/5.0/en/create-index.html for the hard way to do it. If you have access to a MySQL administration tool, you should be able to do it from there. Check out http://en.wikipedia.org/wiki/Index_(database) if you want some background on database indexes.
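
For the schema in the question, the statement would be something like this (the index name is arbitrary):

    CREATE INDEX idx_item_id ON user_items (item_id);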
NickLarsen
Also, it increased the speed by 250ms relative to what? How long was it taking and how long does it take now?
NickLarsen
When I had `name` as the grouping, it took ~1.05s, and when I changed the grouping to `item_id` it took ~750ms (roughly). I have now added an index on `item_id` and it's running at about 430ms. I just don't know whether that's quick enough or still too slow. I will have to do some more testing...
betamax
If it is still too slow, run the GROUP BY as a subquery and then join on the items table afterwards. This won't do anything if MySQL already builds its execution plan that way, but it will help a small amount otherwise.
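
A sketch of that rewrite, assuming the schema from the question:

    SELECT  i.id, i.name, c.count
    FROM    (
            SELECT  item_id, COUNT(*) AS count
            FROM    user_items
            GROUP BY item_id
            HAVING  count > 2
            ) c
    INNER JOIN items i ON i.id = c.item_id
    ORDER BY c.count DESC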
NickLarsen
+1  A: 

You can use an intermediate solution:

  • Add a ts DATETIME column to the user_items table that records when the user added the item

  • Add a ts DATETIME column to the users table that records how current the cached count is, along with cnt, the cached count column itself

  • Periodically update the users table with the new count and timestamp:

    INSERT
    INTO    users (id, ts, cnt)
    SELECT  *
    FROM    (
            SELECT  ui.user_id, NOW() AS nts, COUNT(*) AS ncnt
            FROM    user_items ui
            WHERE   ui.ts <= NOW()
            GROUP BY
                    ui.user_id
            ) q
    ON DUPLICATE KEY
    UPDATE  ts = q.nts,
            cnt = q.ncnt
    
  • Invalidate the user's timestamp when a user_items entry is deleted (see the sketch after this list)

  • Issue this query to count the items:

    SELECT  u.id,
            u.cnt +
            (
            SELECT  COUNT(*)
            FROM    user_items ui
            WHERE   ui.ts > u.ts
                    AND ui.user_id = u.id
            ) AS cnt
    FROM    users u
    

This way, only the newly added rows in user_items need to be counted at query time, which is much faster, and you won't have concurrency issues from updating the records too often.
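
As a sketch of the invalidation step above, an AFTER DELETE trigger could reset the user's watermark so the live query recounts everything for that user (the epoch timestamp is just an arbitrary "recount from scratch" marker):

    -- Sketch: push the user's watermark back so the counting query starts over
    CREATE TRIGGER user_items_invalidate AFTER DELETE ON user_items
    FOR EACH ROW
        UPDATE users
        SET    cnt = 0,
               ts  = '1970-01-01 00:00:00'
        WHERE  id = OLD.user_id;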

Quassnoi
The result set is looking for `items.id`, `items.name` and a count of how many uses exist for each item.
NickLarsen
`@NickLarsen`: then add the `ts` and `cnt` columns to `items`, not to `users`, and use them in the query. In fact, you can do both: just update and invalidate both tables.
Quassnoi
@NickLarsen Exactly. I'm not saying this method won't work or is wrong, but I think my database schema is working at the moment and these changes might create unnecessary problems/work for me.
betamax
@betamax: in your question you considered the possibility to add a `count` field which would imply changing the schema anyway. The answer just describes a more efficient way of doing this.
Quassnoi
@Quassnoi that's why I didn't mark it as -1; the idea is still valid.
NickLarsen
@Quassnoi I know, thanks for the solution but I had decided I wasn't going to do that based on other answers; I should have edited the question to reflect that.
betamax
+1  A: 

That query is pretty much doing a full table scan every time. There is no way around that. Indexes will speed things up by speeding up the join, but the query will just get slower and slower as your data grows.

Storing summary data, like the "count", with the "items" would be the way to go. You can do this with stored procedures or through code. As a double check, you can periodically (e.g., once per day) update all the counts so you know they are accurate.
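
As a sketch, that periodic recount could be a single statement, assuming a hypothetical `count` column on `items`:

    -- Recompute every item's count from scratch (e.g. from a nightly cron job)
    UPDATE  items
    SET     `count` = (
            SELECT  COUNT(*)
            FROM    user_items ui
            WHERE   ui.item_id = items.id
            );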

Brent Baisley