views:

104

answers:

6

Hello,

I am trying to set up a normalised MySQL database containing the three following tables. The first table contains a list of items which can be described by various tags. The third table contains the various tags used to describe the items in the first table. The middle table relates the other two tables to each other. In each table's case, the id is an auto-incrementing primary key (and each is used as the foreign key in the middle table)

+---------------+---------------------+---------------+
|   Table 1     |      Table 2        |   Table 3     |
+---------------+---------------------+---------------+
|id        item |id   item_id   tag_id|id          tag|
+---------------+---------------------+---------------+
| 1      spaniel| 1         1        4| 1         bird|
| 2        tabby| 2         1       23| 4          pet|
| 3      chicken| 3         1       41|23          dog|
| 4     goldfish| 4         2        4|24          cat|
|               | 5         2       24|25      reptile|
|               | 6         3        1|38         fish|
|               | 7         3       40|40    delicious|
|               | 8         4        4|41        cheap|
|               | 9         4       38|42    expensive|
|               |10         4       41|               |
|               |                     |               |
+---------------+---------------------+---------------+

I want to run a query of one ore more tags against the three tables to return the items that match ALL of the tags.

So for example, querying for "pet" would return the items (1)spaniel, (2)tabby and (4)goldfish, because all of them are tagged "pet". Querying for "cheap" and "pet" together would return (1)spaniel and (4)goldfish because they are both tagged "cheap" and "pet". Tabby would not be returned as it is only tagged "pet" but not "cheap" (in my world tabby cats are expensive :P)

Querying for "cheap", "pet" and "dog" would only return (1)Spaniel, since it is the only one matching all three tags.

Anyway, this is the desired behaviour. I have two questions.

  1. Is this the best way to set up my tables for my intended purposes? I am still new to ideas of normalising databases, and am picking this up as I go along - any input on efficiency or even if this is an appropriate layout for my database would be much appreciated.

  2. Provided the above setup is workable, how could I structure a single MySQL query to achieve my intended purpose?* (that being, for a series of tags, returning ONLY the item(s) that match ALL the specified tags). I have tried doing a variety of JOINs/UNIONs but none of them are giving me the desired effect(usually return ALL the items that match ANY of the tags). I've spent some time looking through the MySQL manual online but I feel like I'm missing something conceptually.

*I say single query since of course I could just run a series of simple WHERE/JOIN queries, one for each tag and then combine/sort the returned items in PHP or something after the fact but it seems a foolish and inefficient way of doing it. I feel like there is a way I should be able to do this with a single MySQL query, given the appropriate setup.

A: 
  1. This mapping-table concept is pretty standard and looks well-implemented here. The only thing I'd change is getting rid of the ID in Table 2; for what would you use it? Just make a joint key for Table 2 on both item ID and tag ID.

  2. Actually, selecting where an item matches ALL tags is hard. Try this:

    SELECT item_id,COUNT(tag_id) FROM Table2 WHERE tag_id IN (your set here) GROUP BY item_id

Where the count equals the number of tag IDs in your set, you have found a match.

Borealid
A: 

You could try something like this:

select item, count(*) 'NrMatches'
from #table1 i
inner join #table2 l ON i.id = l.item_id
inner join #table3 t on l.tag_id = t.id
where t.tag IN ('cheap', 'pet', 'dog')
group by item
having count(*) = (select count(*) from #table3 
                   where tag IN ('cheap', 'pet', 'dog'))

It means having your search terms twice, but it mostly does what you're after.

Jonathan
+6  A: 

Your schema is looking fairly good. There’s no need for the ID column in your join table—just create a primary key from the ID columns of the other tables (although see Marjan Venema's comment and Should I use composite primary keys or not? for alternative views on this). The following examples show how you can create the tables, add some data, and perform the queries that you requested.

Create tables, complete with foreign key constraints. In short, foreign key constraints help to ensure database integrity. In this example, they prevent items being inserted in the join table (item_tag), if there are no matching items in the item and tag tables:

CREATE  TABLE IF NOT EXISTS `item` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
  `item` VARCHAR(255) NOT NULL ,
  PRIMARY KEY (`id`) )
ENGINE = InnoDB;

CREATE  TABLE IF NOT EXISTS `tag` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
  `tag` VARCHAR(255) NOT NULL ,
  PRIMARY KEY (`id`) )
ENGINE = InnoDB;

CREATE  TABLE IF NOT EXISTS `item_tag` (
  `item_id` INT UNSIGNED NOT NULL ,
  `tag_id` INT UNSIGNED NOT NULL ,
  PRIMARY KEY (`item_id`, `tag_id`) ,
  INDEX `fk_item_tag_item` (`item_id` ASC) ,
  INDEX `fk_item_tag_tag` (`tag_id` ASC) ,
  CONSTRAINT `fk_item_tag_item`
    FOREIGN KEY (`item_id` )
    REFERENCES `item` (`id` )
    ON DELETE CASCADE
    ON UPDATE CASCADE,
  CONSTRAINT `fk_item_tag_tag`
    FOREIGN KEY (`tag_id` )
    REFERENCES `tag` (`id` )
    ON DELETE CASCADE
    ON UPDATE CASCADE)
ENGINE = InnoDB;

Insert some test data:

INSERT INTO item (item) VALUES
('spaniel'),
('tabby'),
('chicken'),
('goldfish');

INSERT INTO tag (tag) VALUES
('bird'),
('pet'),
('dog'),
('cat'),
('reptile'),
('fish'),
('delicious'),
('cheap'),
('expensive');

INSERT INTO item_tag (item_id, tag_id) VALUES
(1,2),
(1,3),
(1,8),
(2,2),
(2,4),
(3,1),
(3,7),
(4,2),
(4,6),
(4,8);

Select all items and all tags:

SELECT item.id, item.item, tag.tag
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id;

+----+----------+-----------+
| id | item     | tag       |
+----+----------+-----------+
|  1 | spaniel  | pet       |
|  1 | spaniel  | dog       |
|  1 | spaniel  | cheap     |
|  2 | tabby    | pet       |
|  2 | tabby    | cat       |
|  3 | chicken  | bird      |
|  3 | chicken  | delicious |
|  4 | goldfish | pet       |
|  4 | goldfish | fish      |
|  4 | goldfish | cheap     |
+----+----------+-----------+

Select items with a specific tag:

SELECT item.id, item.item, tag.tag
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id
WHERE tag = 'pet';

+----+----------+-----+
| id | item     | tag |
+----+----------+-----+
|  1 | spaniel  | pet |
|  2 | tabby    | pet |
|  4 | goldfish | pet |
+----+----------+-----+

Select items with one or more tags. Note that this will return items that have the tags cheap OR pet:

SELECT item.id, item.item, tag.tag
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id
WHERE tag IN ('cheap', 'pet');

+----+----------+-------+
| id | item     | tag   |
+----+----------+-------+
|  1 | spaniel  | pet   |
|  1 | spaniel  | cheap |
|  2 | tabby    | pet   |
|  4 | goldfish | pet   |
|  4 | goldfish | cheap |
+----+----------+-------+

The above query produces an answer that you might not want, as highlighted by the following query. In this case, there are no items with the house tag, but this query still returns some rows:

SELECT item.id, item.item, tag.tag
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id
WHERE tag IN ('cheap', 'house');

+----+----------+-------+
| id | item     | tag   |
+----+----------+-------+
|  1 | spaniel  | cheap |
|  4 | goldfish | cheap |
+----+----------+-------+

You can fix that by adding GROUP BY and HAVING:

SELECT item.id, item.item, tag.tag
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id
WHERE tag IN ('cheap', 'house')
GROUP BY item.id HAVING COUNT(*) = 2;

Empty set (0.00 sec)

GROUP BY causes all items with the same id (or whatever column you specify) to be grouped together into a single row, effectively removing duplicates. HAVING COUNT limits the results to those where the count of the matching grouped rows is equal to two. That ensures that only items with two tags are returned - note that this value must match the number of tags specified in the IN clause. Here’s an example that produces something:

SELECT item.id, item.item, tag.tag
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id
WHERE tag IN ('cheap', 'pet')
GROUP BY item.id HAVING COUNT(*) = 2;

+----+----------+-----+
| id | item     | tag |
+----+----------+-----+
|  1 | spaniel  | pet |
|  4 | goldfish | pet |
+----+----------+-----+

Note that in the previous example, the items have been grouped together so that you don’t get duplicates. In this case, there’s no need for the tag column, as that just confuses the results—you already know what tags there are, as you have requested items with those tags. You can therefore simplify things a little by removing the tag column from the query:

SELECT item.id, item.item
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id
WHERE tag IN ('cheap', 'pet')
GROUP BY item.id HAVING COUNT(*) = 2;

+----+----------+
| id | item     |
+----+----------+
|  1 | spaniel  |
|  4 | goldfish |
+----+----------+

You could go a step further, and use GROUP_CONCAT to provide a list of matching tags. This might be handy where you want a list of items that have one or more of the specified tags, but not necessarily all of them:

SELECT item.id, item.item, GROUP_CONCAT(tag.tag) AS tags
FROM item
JOIN item_tag ON item_tag.item_id = item.id
JOIN tag ON item_tag.tag_id = tag.id
WHERE tag IN ('cheap', 'pet', 'bird', 'cat')
GROUP BY id;

+----+----------+-----------+
| id | item     | tags      |
+----+----------+-----------+
|  1 | spaniel  | pet,cheap |
|  2 | tabby    | pet,cat   |
|  3 | chicken  | bird      |
|  4 | goldfish | pet,cheap |
+----+----------+-----------+

One problem with the above schema design is that it is possible to enter duplicate items and tags. That is, you could insert bird into the tag table as many times as you like, and this is not good. One way to fix that is to add a UNIQUE INDEX to the item and tag columns. This has the added benefit of helping to speed up queries which rely on these columns. The updated CREATE TABLE commands now look like this:

CREATE  TABLE IF NOT EXISTS `item` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
  `item` VARCHAR(255) NOT NULL ,
  UNIQUE INDEX `item` (`item`) ,
  PRIMARY KEY (`id`) )
ENGINE = InnoDB;

CREATE  TABLE IF NOT EXISTS `tag` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
  `tag` VARCHAR(255) NOT NULL ,
  UNIQUE INDEX `tag` (`tag`) ,
  PRIMARY KEY (`id`) )
ENGINE = InnoDB;

Now if you try to insert a duplicate value, MySQL will prevent you from doing so:

INSERT INTO tag (tag) VALUES ('bird');
ERROR 1062 (23000): Duplicate entry 'bird' for key 'tag'
Mike
+1 This is a much more detailed version of what I said; this answer is both correct and complete. Good job.
Borealid
The need for having a separate ID column in the link table may come from an ORM. The generic nature of ORM (aka OPF) may well dictate that every table row is identified by a single column and preclude tables with varying numbers of identifying columns.
Marjan Venema
@Marjan Venema: Good point. I've updated the first paragraph to include this.
Mike
+2  A: 

Yes. This is called relational division. A variety of techniques are discussed here http://www.simple-talk.com/sql/t-sql-programming/divided-we-stand-the-sql-of-relational-division/

One Approach would be to use a double negative. ie. to select all records from table 1 for which no tag in the list 'cheap','pet' does not have an associated record in table2

SELECT t1.id, t1.item
FROM Table1 t1
WHERE NOT EXISTS
(
    SELECT * FROM  
    table3 t3 WHERE tag IN ('cheap','pet')
    AND NOT EXISTS (
        SELECT * FROM table2 t2
        WHERE t2.tag_id = t3.id
        AND t1.id=t2.item_id
    )
)
Martin Smith
+1 for the relational division article link.
Mike
A: 

Not sure others might have already mentioned this but the id column in the second table is redundant. You can just create a join primary key:

PRIMARY KEY (item_id, tag_id)

Otherwise, it's a bretty standard m:n database scheme and it should work alright.

Richard Knop
A: 

Thank you everyone for your very detailed and helpful replies. The bit about using "WHERE tag IN ('tag_1'...'tag_x')" in conjunction with COUNT to select items that match all the tags was exactly what I was missing before.

The input on using composite primary keys was also really helpful - I felt like there was no point to using a unique ID key on the middle table, but never realised I could use composite keys.

Thank you once again! You guys are great!

Bamboo