views:

284

answers:

2

Hi,

I have two queries to get related tags from a mysql database, one works, one does not, my question is: "why?"

Problem: When executing the first query, the mysql server gets 100% cpu usage, and has to be restarted to function again.

Query 1 (does not work):

SELECT tags.*, COUNT(ct.company_id) AS count
FROM company2tag ct, tags
WHERE ct.company_id IN (
    SELECT ct.company_id FROM company2tag ct
    WHERE ct.tag_id = 18
    GROUP BY ct.company_id
    HAVING COUNT(ct.company_id) = 1
)
AND tags.id != 18
AND tags.id = ct.tag_id
GROUP BY ct.tag_id
ORDER BY count DESC
LIMIT 5;

Query 2 (works):

SELECT tags.*, COUNT(ct.company_id) AS count
FROM company2tag ct, tags
WHERE ct.company_id IN (5864, 5870, 6140, 6221, 6268)
    AND tags.id != 18
    AND tags.id = ct.tag_id
GROUP BY ct.tag_id
ORDER BY count DESC
LIMIT 5;

To my understanding the two queries above do completely the same, the only difference is that the first query retrieves its "company_id's" via a subselect.

How can this happen?

+2  A: 

MySQL is not very good in optimizing IN conditions.

The condition in your first query cannot be easily rewritten as EXISTS, that's why MySQL checks the results for each row.

If you want to select company_id's that are mentioned more than once in tag 18, it's better to rewrite this query as such:

SELECT  tags.*, COUNT(company_id) AS count
FROM    company2tag ct
JOIN    tags
ON      tags.id = ct.tag_id
WHERE   ct.tag_id <> 18
        AND NOT EXISTS
        (
        SELECT  NULL
        FROM    company2tag cti
        WHERE   cti.tag_id = 18
                AND cti.company_id = ct.company_id
        LIMIT 1, 1
        )
GROUP BY
        ct.tag_id
ORDER BY
        count DESC

The main idea here is that you don't need to COUNT(*): it's enough just to check that at least two values exist.

See this article in my blog for the similar problem:

Having the following index:

CREATE INDEX ix_company2tag_tag_company_id ON company2tag (tag_id, company_id)

will greatly improve this query.

Quassnoi
+3  A: 

First of all, you may be experiencing problems from the first query because you have two tables aliased to ct... one in the outer query, one in the sub-query.

Secondly, you can rewrite the IN as a JOIN:

SELECT tags.*, COUNT(ct.company_id) AS count
FROM company2tag ct
INNER JOIN tags ON tags.id = ct.tag_id
INNER JOIN (
    SELECT company_id FROM company2tag
    WHERE tag_id = 18
    GROUP BY company_id
    HAVING COUNT(company_id) = 1
) ctf ON ct.company_id = ctf.company_id
WHERE tags.id != 18
GROUP BY ct.tag_id
ORDER BY count DESC
LIMIT 5;

Note that I haven't actually tested this.

R. Bemrose
+1 for mentioning rewriting the IN as a JOIN. IN is supposed to be for matching against a short list. If you have a larger list based on other tables/queries it should be a JOIN.
Jason S
Thanks for this! Your solution is about 10 times faster and does not block the server.
smoove666