I'm going through an application and trying to optimize some queries and I'm really struggling with a few of them. Here's an example:

SELECT `Item` . * , `Source` . * , `Keyword` . * , `Author` . *
FROM `items` AS `Item`
JOIN `sources` AS `Source` ON ( `Item`.`source_id` = `Source`.`id` )
JOIN `authors` AS `Author` ON ( `Item`.`author_id` = `Author`.`id` )
JOIN `items_keywords` AS `ItemsKeyword` ON ( `Item`.`id` = `ItemsKeyword`.`item_id` )
JOIN `keywords` AS `Keyword` ON ( `Keyword`.`id` = `ItemsKeyword`.`keyword_id` )
JOIN `keywords_profiles` AS `KeywordsProfile` ON ( `Keyword`.`id` = `KeywordsProfile`.`keyword_id` )
JOIN `profiles` AS `Profile` ON ( `Profile`.`id` = `KeywordsProfile`.`profile_id` )
WHERE `KeywordsProfile`.`profile_id` IN ( 17 )
GROUP BY `Item`.`id`
ORDER BY `Item`.`timestamp` DESC , `Item`.`id` DESC
LIMIT 0 , 20;

This one is taking 10-30 seconds...in the tables referenced, there are about 500k author rows, and about 750k items and items_keywords rows. Everything else is less than 500 rows.

Here's the explain output: http://img.skitch.com/20090220-fb52wd7jf58x41ikfxaws96xjn.jpg

EXPLAIN is relatively new to me, but I went through this line by line and it all seems fine. Not sure what else I can do, as I've got indexes on everything...what am I missing?

The server this sits on is just a 256MB slice over at Slicehost, but there's nothing else running on it and the CPU is at 0% before it's run. And yet it still cranks away on this query. Any ideas?

EDIT: Some further info; one of the things that makes this really frustrating is that if I repeatedly run this query, it takes less than .1 seconds. I'm assuming this is due to the query cache, but if I run RESET QUERY CACHE before it, it still runs extremely quickly. It's only after I wait a little while or run some other queries that the 10-30 second times return. All the tables are MyISAM...does this indicate that MySQL is loading stuff into memory and that's why it runs so much faster for awhile?
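One way to take the query cache out of the picture entirely when timing is MySQL's SQL_NO_CACHE hint (a sketch, using a cut-down version of the query):

```sql
-- Bypass the query cache for this one statement, so repeated runs
-- measure real execution time rather than a cached result set:
SELECT SQL_NO_CACHE i.id
FROM items AS i
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
```

Note this only skips the query cache; the key cache and OS filesystem cache will still warm up across runs, which would explain the fast repeats even after RESET QUERY CACHE.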

EDIT 2: Thanks so much to everyone for your help...an update...I cut everything down to this:

SELECT i.id
FROM items AS i
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;

Consistently took 5-6 seconds, despite there being only 750k records in the DB. Once I dropped the 2nd column from the ORDER BY clause, it was pretty much instant. There are obviously several things going on here, but when I cut the query down to this:

SELECT i.id
FROM items AS i
JOIN items_keywords AS ik ON ( i.id = ik.item_id )
JOIN keywords AS k ON ( k.id = ik.keyword_id )
JOIN keywords_profiles AS kp ON ( k.id = kp.keyword_id )
WHERE kp.profile_id IN (139)
ORDER BY i.timestamp DESC
LIMIT 20;

It's still taking 10+ seconds...what else can I do?

Minor curiosity: in the EXPLAIN output, the rows column for items_keywords is always 1544, regardless of which profile_id I use in the query. Shouldn't it change depending on the number of items associated with that profile?

EDIT 3: Ok, this is getting ridiculous :). If I drop the ORDER BY clause entirely, things are very speedy and the temp table / filesort disappears from the EXPLAIN. There's currently an index on the items.timestamp column, but is it not being used for some reason? I thought I remembered something about MySQL only using one index per table or something? Should I create a multi-column index over all the columns on the items table that this query references (source_id, author_id, timestamp, etc.)?
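On the one-index-per-table point: older MySQL versions generally pick at most one index per table per query, so the usual fix for an ORDER BY ... LIMIT filesort is a composite index over the sort columns. A sketch (the index name is made up):

```sql
-- A composite index over both sort columns lets the ORDER BY ... LIMIT
-- be satisfied by an index scan instead of a filesort:
CREATE INDEX ix_items_timestamp_id ON items (`timestamp`, id);

-- Verify: the EXPLAIN row for items should show this index under `key`,
-- with no "Using filesort" in the Extra column.
EXPLAIN
SELECT id FROM items ORDER BY `timestamp` DESC, id DESC LIMIT 20;
```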

A: 

The problem appears to be that it has to fully join every single table before it even applies the WHERE clause. With 500k rows per table, you're looking at millions of rows being populated in memory. I would try changing the JOINs to LEFT JOINs USING ().

Suroot
this is really not the problem
Sam Saffron
How is this not the problem? The problem is a query not returning quickly. Doing a LEFT JOIN should actually speed this query up.
Suroot
Tried it...no discernible difference
Ryan Waggoner
What are your keys set up as on each table? It seems like the tables aren't indexed much, if at all.
Suroot
LEFT JOIN in MySQL is generally slower than the equivalent INNER JOIN, never mind the change in result set.
staticsan
+3  A: 

It looks okay; every row in the EXPLAIN is using an index of some sort. One possible worry is the filesort bit. Try running the query without the ORDER BY clause and see if that improves it.

Then, what I would do is gradually take out each join until you (hopefully) get that massive speed increase, then concentrate on why that's happening.

The reason I mention the filesort is that I can't see any mention of timestamp anywhere in the EXPLAIN output (even though it's your primary sort criterion) - it might be requiring a full non-indexed sort.

UPDATE#1:

Based on edit#2, the query:

SELECT i.id
    FROM items AS i
    ORDER BY i.timestamp DESC, i.id DESC
    LIMIT 0, 20;

takes 5-6 seconds. That's abhorrent. Try creating a composite index on both TIMESTAMP and ID and see if that improves it:

create index timestamp_id on items(timestamp,id);
select id from items order by timestamp desc,id desc limit 0,20;
select id from items order by timestamp,id limit 0,20;
select id from items order by timestamp desc,id desc;
select id from items order by timestamp,id;

On two of the tests, I've left off the descending bit (DB2, for one, sometimes doesn't use indexes if they're in the opposite order). The other variation takes off the LIMIT in case that's affecting it.

paxdiablo
+1 a good approach to take
Sam Saffron
Removing the ORDER BY clause had no discernible effect; if anything it made it worse. Hard to tell, as you can read from my edit above, but I ran it on a different profile and it took ~60 seconds without the ORDER BY. Am working on the JOIN thing...
Ryan Waggoner
As you'll see from the edits, there are numerous things slowing the query down, so it's hard to isolate. It turns out that the ORDER BY was indeed a factor...my bad :) How can I make sure that my timestamp index is used?
Ryan Waggoner
Test query results: 1: ~.05s, 2: ~.05s, 3: ~.85s, 4: ~.85s. So the composite index clearly helped here, but the original query is still slow as hell :(
Ryan Waggoner
A: 

I would suggest you run a profiler on the query; then you can see how long each part took and where the time is being consumed. If you have phpMyAdmin, it's a simple checkbox you need to check to get this functionality, but my guess is you can get it manually from the mysql terminal client as well. I haven't seen this EXPLAIN thing before; if it is in fact the profiling I am used to in phpMyAdmin, I apologize for the nonsense.

Per Stilling
A: 

What is the GROUP BY clause achieving? There are no aggregate functions in the SELECT, so the GROUP BY should be unnecessary.

barrowc
I think it was a relic from a previous (probably misguided) attempt to avoid duplicate items from showing up. Is DISTINCT the correct way to achieve this?
Ryan Waggoner
DISTINCT is the more correct way to avoid duplicates, but the even more correct way is to write the query so that you don't have duplicates. They're almost invariably a sign that either your query or your schema is making a mistake somewhere.
ʞɔıu
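The "write the query so you don't have duplicates" approach above can be sketched with EXISTS against the schema in the question (untested against the real data; profile_id 17 is just the example value from the original query):

```sql
-- EXISTS returns each item at most once, no matter how many keywords
-- match, so no GROUP BY or DISTINCT is needed to de-duplicate:
SELECT i.id
FROM items AS i
WHERE EXISTS (
    SELECT 1
    FROM items_keywords AS ik
    JOIN keywords_profiles AS kp ON kp.keyword_id = ik.keyword_id
    WHERE ik.item_id = i.id
      AND kp.profile_id = 17
)
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
```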
A: 
ʞɔıu
How'd you get that neat upside-down name? I've been meaning to ask for a while.
paxdiablo
http://www.fileformat.info/convert/text/upside-down.htm?text=nick
lubos hasko
I noticed after seeing a cyrillic username that they seem to be allowing usernames using any unicode characters
ʞɔıu
A: 

Is it possible you're having issues because of filesystem I/O? The EXPLAIN shows that 1544 rows have to be fetched from the ItemsKeyword table. If you have to go to disk for each of those, you'll add about 10-15 seconds total to the run time (assuming a high-ish seek time because you're on a VM). Normally the tables are cached in RAM, or the data is stored close enough together on disk that reads can be combined. However, you're running on a VM with 256MB of RAM, so you may have no spare memory to cache into, and if your table file is fragmented enough, query performance could be degraded this much.

You could probably get some idea of what's happening with I/O during the query by running something like pidstat -d 1 or iostat 1 in another shell on the server.

EDIT: From looking at the query adding an index on (ItemsKeyword.item_id, ItemsKeyword.keyword_id) should fix it if my theory is right about it being a problem with the seeks for the ItemsKeyword table.

paranoidgeek
Actually I think it means that it needs to examine 1544 rows at that stage of the loop, not necessarily that they will all be returned.
ʞɔıu
Creating an index on (item_id, keyword_id) didn't help :(
Ryan Waggoner
+3  A: 

Try this and see how it does:

SELECT i.*, s.*, k.*, a.*
FROM items AS i
 JOIN sources AS s ON (i.source_id = s.id)
 JOIN authors AS a ON (i.author_id = a.id)
 JOIN items_keywords AS ik ON (i.id = ik.item_id)
 JOIN keywords AS k ON (k.id = ik.keyword_id)
WHERE k.id IN (SELECT kp.keyword_id
           FROM keywords_profiles AS kp
           WHERE kp.profile_id IN (17))
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;

I factored out a couple of the joins into a non-correlated subquery, so you wouldn't have to do a GROUP BY to map the result to distinct rows.

Actually, you may still get multiple rows per i.id in my example, depending on how many keywords map to a given item and also to profile_id 17.

The filesort reported in your EXPLAIN report is probably due to the combination of GROUP BY and ORDER BY using different fields.

I agree with @ʞɔıu's answer that the speedup is probably because of key caching.

Bill Karwin
I've tried everything and I can't get the query posted above to run in less than 10 seconds, even just selecting the i.id column, dropping the author / source joins, and only ordering by timestamp desc.
Ryan Waggoner
I think further diagnosis and optimization is going to be beyond what folks can do on StackOverflow without access to test on your database. You may need to hire a consultant who specializes in MySQL performance, e.g. http://www.percona.com/
Bill Karwin
A: 

MySQL loads a lot into different caches, including indexes and queries. In addition, your operating system will keep a file system cache that could speed up your query when executed repeatedly.

One thing to consider is how MySQL creates temporary tables during this type of query. As you can see in your EXPLAIN, a temporary table is being created, probably for sorting the results. Generally, MySQL will create these temporary tables in memory, except under two conditions. First, if they exceed the maximum size set in the MySQL settings (tmp_table_size or max_heap_table_size - check mysqlperformanceblogs.com for more info on these settings). The second and more important one is this:

  • Temporary tables will always be created on disk when TEXT or BLOB columns are selected in the query.

This can create a major performance hit, and even lead to an i/o bottleneck if your server is getting any amount of action.

Check to see whether any of your columns are of these data types. If they are, you can try to rewrite the query so that a temporary table is not created (GROUP BY always causes one, I think), or try not selecting those columns. Another strategy would be to break this up into several smaller queries that might execute in a fraction of the time.
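To confirm whether temporary tables are actually spilling to disk, MySQL exposes counters you can compare before and after running the slow query (a sketch):

```sql
-- Created_tmp_disk_tables rising while the query runs confirms the
-- on-disk temp-table theory; Created_tmp_tables counts all of them:
SHOW GLOBAL STATUS LIKE 'Created_tmp%';

-- The in-memory temp-table size limits mentioned above:
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
```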

Good luck!

jonstjohn
Yep, I have TEXT columns being returned...argh. How can I avoid having a temp table created?
Ryan Waggoner
Well, I changed the query to just return the id columns on all the tables and it's still taking forever...I'll keep the above in mind, though.
Ryan Waggoner
A: 

I may be completely wrong but what happens when you change

WHERE kp.profile_id IN (139)

to

WHERE kp.profile_id = 139
John Boker
Nope. This is so frustrating :-(
Ryan Waggoner
+1  A: 

For your query to run fast, you need:

  1. Create an index: CREATE INDEX ix_timestamp_id ON items (timestamp, id)
  2. Ensure that id's on sources, authors and keywords are primary keys.
  3. Force MySQL to use this index for items, and to perform NESTED LOOP joins for the other tables:

    EXPLAIN EXTENDED
    SELECT `Item`.*, `Source` . * , `Keyword` . * , `Author` . *
    FROM `items` AS `Item` FORCE INDEX FOR ORDER BY (ix_timestamp_id)
    JOIN `items_keywords` AS `ItemsKeyword` FORCE INDEX (ix_item_keyword) ON ( `Item`.`id` = `ItemsKeyword`.`item_id` AND `ItemsKeyword`.`keyword_id` IN
      (
      SELECT `keyword_id`
      FROM `keywords_profiles` AS `KeywordsProfile` FORCE INDEX (ix_keyword_profile)
      WHERE `KeywordsProfile`.`profile_id` = 17
      )
    )
    JOIN `sources` AS `Source` FORCE INDEX (primary) ON ( `Item`.`source_id` = `Source`.`id` )
    JOIN `authors` AS `Author` FORCE INDEX (primary) ON ( `Item`.`author_id` = `Author`.`id` )
    JOIN `keywords` AS `Keyword` FORCE INDEX (primary)  ON ( `Keyword`.`id` = `ItemsKeyword`.`keyword_id` )
    ORDER BY `Item`.timestamp DESC, `Item`.id DESC
    LIMIT 0, 20
    

As you can see, we get rid of GROUP BY, push the subquery into the JOIN condition and force PRIMARY KEYs to be used for joins.

That's how we ensure that NESTED LOOPS with items as the leading table will be used for all joins.

As a result:

1, 'PRIMARY', 'Item',         'index',  '', 'ix_timestamp_id', '12', '', 20, 2622845.00, ''
1, 'PRIMARY', 'Author',       'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.author_id', 1, 100.00, ''
1, 'PRIMARY', 'Source',       'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.source_id', 1, 100.00, ''
1, 'PRIMARY', 'ItemsKeyword', 'ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.id', 1, 100.00, 'Using where; Using index'
1, 'PRIMARY', 'Keyword',      'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.ItemsKeyword.keyword_id', 1, 100.00, ''
2, 'DEPENDENT SUBQUERY',      'KeywordsProfile', 'unique_subquery', 'PRIMARY', 'PRIMARY', '8', 'func,const', 1, 100.00, 'Using index; Using where'

When we run this, we get:

20 rows fetched in 0,0038s (0,0019s)

There are 500k rows in items, 600k in items_keywords, 512 values in keywords and 512 in keywords_profiles (all with profile 17).

Quassnoi
Really, really appreciate your help here, but this didn't work. CPU spiked above 100% and stayed there for about five minutes before I finally killed it. EXPLAIN revealed that there was still a temp table / filesort issue. Once I removed the FORCE INDEX on everything but items, it took ~60 secs :(
Ryan Waggoner
Did you create an index on (timestamp, id)? Did you put correct index name into FORCE INDEX?
Quassnoi
Ah, and make sure ItemsKeyword has a UNIQUE INDEX or PRIMARY KEY defined on (item_id, keyword_id) (in that order)
Quassnoi
And KeywordsProfile should have UNIQUE INDEX or PRIMARY KEY on (keyword_id, profile_id) too, order doesn't matter.
Quassnoi
I created the (timestamp, id) index on items. As far as the join tables go (ItemsKeyword and KeywordsProfile), they both have id columns defined as the PRIMARY column. Does it have to be a UNIQUE or PRIMARY key?
Ryan Waggoner
No matter UNIQUE or PRIMARY, just make sure BOTH columns are in the index: ALTER TABLE items_keywords ADD PRIMARY KEY (item_id, keyword_id)
Quassnoi
And they need to be in that order. I just checked with the order reversed, doesn't work.
Quassnoi
What I'm saying is that I already have a primary key on items_keywords.id. And creating a UNIQUE index fails because there are duplicates. It's a mess, I know :(
Ryan Waggoner
Create a plain index then: ALTER TABLE `test`.`items_keywords` ADD INDEX `ix_item_keyword` USING BTREE(`item_id`, `keyword_id`) and use its name in FORCE INDEX right after `items_keywords`. We're close :)
Quassnoi
Still no luck :( When I run explain, this is what I get...is the number of rows on that first line troubling? http://img.skitch.com/20090220-dyuaau9d9jp2ki9ah4n4rpb2jw.jpg
Ryan Waggoner
Right, it should be 20 there. Did you put the limit? Did you put the ORDER BY query? Could you please post whole EXPLAIN result?
Quassnoi
Here's the full explain...I just did the query you posted, but FORCE INDEX (ix_item_keyword) on that join. http://img.skitch.com/20090220-8cqejdhajbws7ne71i2bcd6eqa.jpg
Ryan Waggoner
Am I doing this backwards? I'm trying to get the most recent 20 items for a particular profile...
Ryan Waggoner
Plan is fine, except that it uses INDEX `keyword_id` instead of `keyword_profile`. See updated post on how to fix it. And sorry for a stupid question, but are you absolutely positively sure that you didn't miss the LIMIT 0, 20 clause?
Quassnoi
I checked and double-checked...I have no idea what's going on, but the edits don't seem to have made any difference. Something fundamental must be wrong here...
Ryan Waggoner
A: 

Try this:

SELECT i.id
FROM ((items AS i
   INNER JOIN items_keywords AS ik ON ( i.id = ik.item_id ))
   INNER JOIN keywords AS k ON ( k.id = ik.keyword_id ))
   INNER JOIN keywords_profiles AS kp ON ( k.id = kp.keyword_id AND kp.profile_id = 139)
ORDER BY i.timestamp DESC
LIMIT 20;
Jess
A: 

Looking at the pastie.org link in the comments to the question:

  • you're joining items.source_id int(4) to sources.id int(16)
  • also items.id int(16) to itemskeywords.item_id int(11)

I can't see any good reason for the two fields to have different widths in these cases

I realise that these are just display widths and that the actual range of numbers the column can store is determined solely by the INT part, but the MySQL 6.0 reference manual says:

Note that if you store larger values than the display width in an integer column, you may experience problems when MySQL generates temporary tables for some complicated joins, because in these cases MySQL assumes that the data fits into the original column width.

From the rough figures you quoted, it doesn't look as though you are exceeding the display width on any of the ID columns. You may as well tidy up these inconsistencies though just to eliminate another possible bug.

You might be as well to remove the display widths altogether if you don't have a need for them

edit:

I would hazard a guess that the original author of the database perhaps thought that int(4) meant "an integer with up to 4 digits", whereas it actually means "an integer between -2147483648 and 2147483647, displayed with at least 4 characters, left-padded with spaces if need be".

Definitions like authors.refreshed int(20) or items.timestamp int(30) don't really make sense, as there can only be 10 digits plus the sign in an INT. Even a BIGINT can't exceed 20 characters. Perhaps the original author thought that int(4) was analogous to varchar(4)?
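The display-width behaviour is easy to demonstrate (a sketch; the padding is only visible with ZEROFILL, and the table name is made up):

```sql
-- INT(4) stores the full INT range; the (4) only affects display.
CREATE TABLE width_demo (a INT(4) ZEROFILL, b INT(4));
INSERT INTO width_demo VALUES (7, 2147483647);
-- a comes back zero-padded to 4 characters ('0007'); b holds the
-- full INT maximum despite the (4).
SELECT a, b FROM width_demo;
```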

barrowc
A: 

Try making a backup copy of your tables. After that, rename the original tables to something else, rename the new tables to the original names, and try again with your new-but-old-named tables...

Or you can try repairing the tables, but that doesn't always help.

Edit: Man, this was an old question...

Vili