ansaurus

Question

MySQL tagging question: how to select an item that has been tagged as X, Y, and Z?

Answer 1

+3 A:

The method you have suggested is probably the most common way to perform the query but might not be the fastest. Using joins can be faster:

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3

You should ensure that you have the following indexes:

Primary key on (item_id, tag_id)
Index on (tag_id).

I performance tested this query against the original in a few different scenarios.

For the case where nearly every item in the table is tagged with at least one of the tags being searched for, the original query takes about 5 seconds and the JOIN version takes about 10 seconds - slightly slower.
For the case where two of the tags occur very frequently and one of the tags occurs only very rarely the original query takes about 0.9 seconds, whereas the JOIN query takes just 0.003 seconds - a considerable performance improvement.

The SQL I used to make performance test is pasted below. You can run this test yourself or modify it slightly and test other queries, or different scenarios.

Warning: Don't run this script on your production database as it modifies the contents of the item2tag table. Running the script can take a few minutes as it creates a lot of data.

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$
CALL prc_filler(1000000);

CREATE TABLE item2tag (
    item_id INT NOT NULL,
    tag_id INT NOT NULL,
    count INT NOT NULL
);

INSERT INTO item2tag (item_id, tag_id, count)
SELECT  id % 150001, id % 10, 1
FROM    filler;
ALTER TABLE item2tag ADD PRIMARY KEY (item_id, tag_id);
ALTER TABLE item2tag ADD KEY (tag_id);

-- Make tag 3 occur rarely.    
UPDATE item2tag SET tag_id = 10 WHERE tag_id = 3 AND item_id > 0;

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3;

SELECT item_id
FROM item2tag
WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;

Mark Byers 2010-07-15 22:23:54

+1 You're quicker than me. :) Yes, in MySQL particularly, group-by solutions are slow. I'd use the self-join solution. Be sure to create a compound index over (item_id, tag_id).

Bill Karwin 2010-07-15 22:25:55

Shouldn't the copies both be joining to T1, not chaining off each other?

OMG Ponies 2010-07-15 22:28:17

@OMG Ponies: In this case, if A=B and B=C then A=C. It's six of one and half-dozen of the other.

Bill Karwin 2010-07-15 22:38:05

Wow! Thank you so much for this. A few notes: The most common tags occur in less than 2% of the entries in item2tag. This signifies to me that your solution is a much better fit... thanks a ton for this.The second issue has to do with "3" tags. I used 3 as an example, but it's possible there may be up to (some arbitrary number, probably 10) tags included. If this occurs, I'd like to select the items *with the most matched tags* and get rid of the "must match all tags" requirement. How do I deal with that? Should I start a new question?

Sambo 2010-07-16 01:54:25

@Sambo: Starting a new question would be a very good idea, yes. Have you profiled the simple solution that you have suggested yourself? It might be good enough already.

Mark Byers 2010-07-16 08:11:52

Answer 2

A:

You'll be better placed having an index that has tag_id as the first column - otherwise finding all items with tag_id 1 will require a full table scan (same for any tag_id, of course).

Will A 2010-07-15 22:23:54

Answer 3

A:

Depending on how many items are tagged with individual tags, you might do it by getting list of items tagged with one tag, and then filtering it for occurences of other tags, like this:

select item_id from item2tag
where item_id in (
    select item_id from item2tag
    where item_id in (
        select item_id from item2tag where tag_id = TID1
    ) and tag_id = TID2
) and tag_id = TID3

che 2010-07-15 22:32:34

Answer 4

A:

Another way of doing it. Borrowing Mark's test this seems to be faster than the GROUP BY but slower than the JOINs. It is of course more extensible for dealing with an arbitrary number of tags.

Edit Now I've populated the item and tags tables with more records the above is no longer true as these answers don't currently use those tables. Probably worth trying all alternatives against your own data.

SELECT i.*
FROM item i
WHERE NOT EXISTS(
    SELECT * FROM tag t
    WHERE
    name IN ("X","Y","Z") 
    AND NOT EXISTS(
        SELECT * FROM item2tag it
        WHERE it.tag_id = t.id AND it.item_id = i.id
                  )
                )

Martin Smith 2010-07-15 23:21:03

ansaurus

tags:

views:

answers:

MySQL tagging question: how to select an item that has been tagged as X, Y, and Z?

related questions