views:

463

answers:

5

I have a page that is taking 37 seconds to load. While it is loading it pegs MySQL's CPU usage through the roof. I did not write the code for this page and it is rather convoluted so the reason for the bottleneck is not readily apparent to me.

I profiled it (using kcachegrind) and find that the bulk of the time on the page is spent doing MySQL queries (90% of the time is spent in 25 different mysql_query calls).

The queries take the form of the following with the tag_id changing on each of the 25 different calls:

SELECT * FROM tbl_news WHERE news_id
 IN (select news_id from
 tbl_tag_relations WHERE tag_id = 20)

Each query is taking around 0.8 seconds to complete with a few longer delays thrown in for good measure... thus the 37 seconds to completely load the page.

My question is, is it the way the query is formatted with that nested select that is causing the problem? Or could it be any one of a million other things? Any advice on how to approach tackling this slowness is appreciated.

Running EXPLAIN on the query gives me this (but I'm not clear on the impact of these results... the NULL on primary key looks like it would be bad, yes? The number of results returned seems high to me as well as only a handful of results are returned in the end):

1    PRIMARY  tbl_news ALL NULL NULL NULL NULL 1318 Using where
2   DEPENDENT SUBQUERY tbl_tag_relations ref FK_tbl_tag_tags_1 FK_tbl_tag_tags_1 4 const 179 Using where
+2  A: 

If I understand correctly, this is just listing the news stories for a specific set of tags.

  1. First of all, you really shouldn't ever SELECT *

  2. Second, this can probably be
    accomplished within a single query, thus reducing the overhead cost of
    multiple queries. It seems like it is getting fairly trivial data so it could be retrieved within a single call instead of 20.

  3. A better approach to using IN might be to use a JOIN with a WHERE condition instead. When using an IN it will basically be a lot of OR statements.
  4. Your tbl_tag_relations should definitely have an index on tag_id
Joe Philllips
If you write a statement like that as an answer, you should explain why (at least in a few words) or provide better examples. Not everyone knows these things, and the OP clearly said that s/he doesn't really know much about the database.
Jason Coco
I'm down with giving examples, and frankly, I can do without the listing of the fields.
altCognito
Is there a specific reason why SELECT * is undesirable? How is SELECT * different from explicitly listing every single field in the table?
Calvin
It's not different. It makes it difficult to know what those fields are without looking at the table though. It's poor documentation. It may cause more data to be sent, hindering speed, etc.
Joe Philllips
+1  A: 
select * 
 from tbl_news, tbl_tag_relations 
 where 
      tbl_tag_relations.tag_id = 20 and 
      tbl_news.news_id = tbl_tag_relations.news_id 
 limit 20

I think this gives the same results, but I'm not 100% sure. Sometimes simply limiting the results helps.

altCognito
It will give all fields in tbl_tag_relations which technically wouldn't be the same results but there's probably only two fields in there anyway
Joe Philllips
And limiting the results would provide less results.
Joe Philllips
lolz. I imagine! ;)
altCognito
Oh wait, I forgot, there might not be 20 results to begin with, which is certainly possible, so it might be the same. :)
altCognito
That does give the same results and much faster to boot!
gaoshan88
+3  A: 

The SQL Query itself is definitely your bottleneck. The query has a sub-query in it, which is the IN(...) portion of the code. This is essentially running two queries at once. You can likely halve (or more!) your SQL times with a JOIN (similar to what d03boy mentions above) or a more targeted SQL query. An example might be:

SELECT * 
FROM tbl_news, tbl_tag_relations 
WHERE tbl_tag_relations.tag_id = 20 AND
tbl_news.news_id = tbl_tag_relations.news_id

To help SQL run faster you also want to try to avoid using SELECT *, and only select the information you need; also put a limiting statement at the end. eg:

SELECT news_title, news_body 
... 
LIMIT 5;

You also will want to look into the database schema itself. Make sure you are indexing all of the commonly referred to columns so that the queries will run faster. In this case, you probably want to check your news_id and tag_id fields.

Finally, you will want to take a look at the PHP code and see if you can make one single all-encompassing SQL query instead of iterating through several seperate queries. If you post more code we can help with that, and it will probably be the single greatest time savings for your posted problem. :)

Andy Moore
This isn't the problem. Your query will be optimised and executed the same as the original. You've just rewritten IN as a join, which is arguably wrong because the second table isn't selected at all.
cletus
The index Cletus suggested really did the trick. You are right about the SELECT * and I'm trying to find a simple way to take care of that but the code is pretty wild so changing it will probably affect queries I've not yet imagined... I'll have to be careful.
gaoshan88
+1  A: 

Unfortunately MySQL doesn't do very well with uncorrelated subqueries like your case shows. The plan is basically saying that for every row on the outer query, the inner query will be performed. This will get out of hand quickly. Rewriting as a plain old join as others have mentioned will work around the problem but may then cause the undesired affect of duplicate rows.

For instance the original query would return 1 row for each qualifying row in the tbl_news table but this query:

SELECT news_id, name, blah
FROM tbl_news n
JOIN tbl_tag_relations r ON r.news_id = n.news_id
WHERE r.tag_id IN (20,21,22)

would return 1 row for each matching tag. You could stick DISTINCT on there which should only have a minimal performance impact depending on the size of the dataset.

Not to troll too badly, but most other databases (PostgreSQL, Firebird, Microsoft, Oracle, DB2, etc) would handle the original query as an efficient semi-join. Personally I find the subquery syntax to be much more readable and easier to write, especially for larger queries.

EvilRyry
I have used PostgreSQL and it is powerful but changing databases simply isn't an option.
gaoshan88
+4  A: 

I'e addressed this point in Database Development Mistakes Made by AppDevelopers. Basically, favour joins to aggregation. IN isn't aggregation as such but the same principle applies. A good optimize will make these two queries equivalent in performance:

SELECT * FROM tbl_news WHERE news_id
 IN (select news_id from
 tbl_tag_relations WHERE tag_id = 20)

and

SELECT tn.*
FROM tbl_news tn
JOIN tbl_tag_relations ttr ON ttr.news_id = tn.news_id
WHERE ttr.tag_id = 20

as I believe Oracle and SQL Server both do but MySQL doesn't. The second version is basically instantaneous. With hundreds of thousands of rows I did a test on my machine and got the first version to sub-second performance by adding appropriate indexes. The join version with indexes is basically instantaneous but even without indexes performs OK.

By the way, the above syntax I use is the one you should prefer for doing joins. It's clearer than putting them in the WHERE clause (as others have suggested) and the above can do certain things in an ANSI SQL way with left outer joins that WHERE conditions can't.

So I would add indexes on the following:

  • tbl_news (news_id)
  • tbl_tag_relations (news_id)
  • tbl_tag_relations (tag_id)

and the query will execute almost instantaneously.

Lastly, don't use * to select all the columns you want. Name them explicitly. You'll get into less trouble as you add columns later.

cletus
+1 Your point is taken well, but I don't think cletus ever uses words like disabuse :)
altCognito
I would say that the above syntax is more explicit, not necessarily more clear.
altCognito
It's a good word. :) Alas I had to remove it since I went and checked and surprisingly MySQL does treat IN differently to joins (which I don't believe Oracle does). Edited accordingly.
cletus
Interesting. There were already keys set for all of those (by tbl_rows I assume you mean tbl_news) except for news_id on tbl_tag_relations. Adding only that index cut page load time in half... it's now at 16 seconds. I'm amazed that such a tiny change has such a massive impact on performance. Thank you very much... I'm off to read up on indexes a bit.
gaoshan88
Er yeah, getting my wires crossed with something else. Fixed. I'm surprised your query still takes time in the seconds range. How many rows in tbl_news and tbl_tag_relations? I tried with 100k and 600k records equivalent and get it in 0.7 seconds. Of course, the joins are instantaneous.
cletus
Well don't forget that I'm talking about total page load time taking so long, not just the sql queries. When I profiled the page I saw that those queries were what was taking up the bulk of the time. Currently the bulk is taken up by calls to our ad server (third party) which is something I can live with.Your index suggestion cut queries from taking 90% of the load time to taking only 9%... that’s pretty solid (and thanks very much for that).For the record tbl_news has about 1,400 records and tbl_tag_relations has about 9,000.
gaoshan88
Ah gotcha. That makes more sense (total load time vs query time). Glad it worked out for you.
cletus