views:

62

answers:

4

I'm dealing with a database where items are "tagged" a certain number of times.

item (100k rows)

  • id
  • name
  • other stuff

tag (10k rows)

  • id
  • name

item2tag (1,000,000 rows)

  • item_id
  • tag_id
  • count

I'm looking for the fastest solution to:

Select items that have been tagged as X, Y, and Z (where X, Y, and Z correspond to (possibly) tag names) ?

Here's what I have so far... I'd just like to make sure I'm doing it in the best way possible:

First get the tag_ids from the names:

SELECT tag.id WHERE name IN ("X","Y","Z");

Then I group by those tag_ids and use Having to filter the result:

SELECT item2tag.*, count(tag_id)
  FROM item2tag
  WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;

Then I can just select from item with those ids.

SELECT * FROM item WHERE id IN ([results from prior query])

I have millions of rows in item2tag, with an index on (item_id, tag_id). Is this going to be the fastest solution?

+3  A: 

The method you have suggested is probably the most common way to perform the query but might not be the fastest. Using joins can be faster:

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3

You should ensure that you have the following indexes:

  • Primary key on (item_id, tag_id)
  • Index on (tag_id).

I performance tested this query against the original in a few different scenarios.

  • For the case where nearly every item in the table is tagged with at least one of the tags being searched for, the original query takes about 5 seconds and the JOIN version takes about 10 seconds - slightly slower.
  • For the case where two of the tags occur very frequently and one of the tags occurs only very rarely the original query takes about 0.9 seconds, whereas the JOIN query takes just 0.003 seconds - a considerable performance improvement.

The SQL I used to make performance test is pasted below. You can run this test yourself or modify it slightly and test other queries, or different scenarios.

Warning: Don't run this script on your production database as it modifies the contents of the item2tag table. Running the script can take a few minutes as it creates a lot of data.

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$
CALL prc_filler(1000000);

CREATE TABLE item2tag (
    item_id INT NOT NULL,
    tag_id INT NOT NULL,
    count INT NOT NULL
);

INSERT INTO item2tag (item_id, tag_id, count)
SELECT  id % 150001, id % 10, 1
FROM    filler;
ALTER TABLE item2tag ADD PRIMARY KEY (item_id, tag_id);
ALTER TABLE item2tag ADD KEY (tag_id);

-- Make tag 3 occur rarely.    
UPDATE item2tag SET tag_id = 10 WHERE tag_id = 3 AND item_id > 0;

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3;

SELECT item_id
FROM item2tag
WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;
Mark Byers
+1 You're quicker than me. :) Yes, in MySQL particularly, group-by solutions are slow. I'd use the self-join solution. Be sure to create a compound index over (item_id, tag_id).
Bill Karwin
Shouldn't the copies both be joining to T1, not chaining off each other?
OMG Ponies
@OMG Ponies: In this case, if A=B and B=C then A=C. It's six of one and half-dozen of the other.
Bill Karwin
Wow! Thank you so much for this. A few notes: The most common tags occur in less than 2% of the entries in item2tag. This signifies to me that your solution is a much better fit... thanks a ton for this.The second issue has to do with "3" tags. I used 3 as an example, but it's possible there may be up to (some arbitrary number, probably 10) tags included. If this occurs, I'd like to select the items *with the most matched tags* and get rid of the "must match all tags" requirement. How do I deal with that? Should I start a new question?
Sambo
@Sambo: Starting a new question would be a very good idea, yes. Have you profiled the simple solution that you have suggested yourself? It might be good enough already.
Mark Byers
A: 

You'll be better placed having an index that has tag_id as the first column - otherwise finding all items with tag_id 1 will require a full table scan (same for any tag_id, of course).

Will A
A: 

Depending on how many items are tagged with individual tags, you might do it by getting list of items tagged with one tag, and then filtering it for occurences of other tags, like this:

select item_id from item2tag
where item_id in (
    select item_id from item2tag
    where item_id in (
        select item_id from item2tag where tag_id = TID1
    ) and tag_id = TID2
) and tag_id = TID3
che
A: 

Another way of doing it. Borrowing Mark's test this seems to be faster than the GROUP BY but slower than the JOINs. It is of course more extensible for dealing with an arbitrary number of tags.

Edit Now I've populated the item and tags tables with more records the above is no longer true as these answers don't currently use those tables. Probably worth trying all alternatives against your own data.

SELECT i.*
FROM item i
WHERE NOT EXISTS(
    SELECT * FROM tag t
    WHERE
    name IN ("X","Y","Z") 
    AND NOT EXISTS(
        SELECT * FROM item2tag it
        WHERE it.tag_id = t.id AND it.item_id = i.id
                  )
                )
Martin Smith