views:

143

answers:

5

I have three tables, books, tags, and taggings (books-xref-tags):

books
id | title |      author     
 1 | Blink | Malcolm Gladwell
 2 |  1984 |    George Orwell

taggings
book_id | tag_id
      1 |      1
      1 |      2
      2 |      1
      2 |      3

tags
id | name
 1 | interesting
 2 |  nonfiction
 3 |     fiction

I'd like to search for all books tagged both "interesting" and "fiction." The best I've come up with is

select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tag.id
   and tag.name = "interesting"
intersect
select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tag.id
   and tag.name = "fiction"

That seems to work, but I'm not sure how it will scale, either in rows or number of tags. That is, what happens as I add hundreds of books, hundreds of tags, and thousands of taggings? What happens as the search becomes "'interesting' and 'fiction' and 'aquatic' and 'stonemasonry'"?

I have an alternative approach in mind if there's no better way of doing the query directly in SQL:

  1. select all books with the first tag, along with all of those books' tags
  2. remove any from the list that don't have all of the tags queried
+1  A: 

A little bit more 'old school' dialect of SQL here, but it's more compact syntax and still an inner join.

select * from books, taggings tg1, tags t1, taggings tg2, tags t2 
 where tg1.book_id = books.id
   and tg1.tag_id  = t1.id
   and t1.name = 'interesting'
   and tg2.book_id = books.id
   and tg2.tag_id  = t2.id
   and t2.name = 'fiction'

EDIT: Wow, that's a lot of hate from the stackers for joining too much in one query. More optimization can be had by using exists subqueries:

select * from books
 where exists (select * from taggings, tags
                where tags.name = 'fiction'
                  and taggings.tag_id = tags.id
                  and taggings.book_id = books.id)
   and exists (select * from taggings, tags
                where tags.name = 'interesting'
                  and taggings.tag_id = tags.id
                  and taggings.book_id = books.id)
Jeffrey Hantin
Not sure why you got a negative comment here. I guess it didn't really answer the question about scaling or efficiency but did it really deserve a negative vote?
Chuck Vose
How well does adding join dimensions scale?
James A. Rosen
-1 for outdated join syntax, incorrect string literal quotes, and returning each book twice.
Joel Coehoorn
edited query fixed all the problems in the original so removed the downvote, but it's probably less efficient than what the OP already has.
Joel Coehoorn
I noticed the quotes issue before you commented :-) but most SQL implementations I've seen don't have issues with ancient inner join syntax. Syntax aside, how does the original query return duplicate books, short of there being duplicated rows in a table?
Jeffrey Hantin
+3  A: 

If you want to keep the option of using more than two tags, this answer to a similar could be interesting for you.

It uses MySQL syntax (not sure what you use), but it is quite simple and you should be able to use it with other databases.

This would look like that for you (using MySQL syntax):

SELECT books.id, books.title, books.author
FROM books
INNER JOIN taggings ON ( taggings.book_id = books.book_id )
INNER JOIN tags ON ( tags.tag_id = taggings.tag_id )
WHERE tags.name IN ( @tag1, @tag2, @tag3 )
GROUP BY books.id, books.title, books.author
HAVING COUNT(*) = @number_of_tags

From my other post:

If you have 3 tags like in your example then number_of_tags would have to be 3, and the join would result in 3 rows per id that matches.

You can either create that query dynamically, or define it with, say, 10 tags and initialize them with a value that won't occur in tags.

Peter Lang
That's a genius way of doing it. **Plus** it lets me do "any 5 of 7," which is pretty awesome.
James A. Rosen
+1  A: 

I would recommend the ALL instead of intersect since mysql actually knows how to join this much better though I lack the proper benchmarks.

select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tag.id
   and tag.name ALL("interesting", "fiction");

As to it's scaling, with millions of books and a low cardinality on the tags table what you're going to end up doing is migrating the tag id into the code/memory so that you're using taggings.tag_id ALL(3, 7, 105) or something. That last join to get the tags table on isn't going to use an index unless you get over like 1k tags so you're going to do a table scan each time.

In my experience joins, intersections and unions are huge evils for performance. Mostly joins are the problem we commonly experience. The less joins you have, the faster you'll end up getting.

Chuck Vose
A: 

What database? That will change the answer a bit. For example, this works with sql server and should be faster because it eliminates the need to go to the tags table twice, but will fail on mysql because mysql doesn't do CTEs:

WITH taggingNames
AS
(
    SELECT tag.Name, tag.tag_id, tagging.book_id
    FROM tags
    INNER JOIN taggings ON tags.tag_id = taggings.tagid
) 
SELECT b.* 
FROM books b
INNER JOIN (
  SELECT t1.book_id
   FROM taggingNames 
   INNER JOIN taggingNames t2 ON t2.book_id = t1.book_id AND t2.Name='fiction'
   WHERE t1.Name='interesting' 
   GROUP BY t1.book_id
 ) ids ON b.book_id = ids.book_id

Thought now that I see it I like Peter Lang's answer as well.

Joel Coehoorn
+1  A: 
with
  tt as
  (
      select id
      from tags
      where name in ('interesting', 'fiction')
  ),
  mm as
  (
      select book_id
      from taggings join tt on taggings.tag_id = tt.id
      group by taggings.book_id having count(*) = 2
  )
select books.*
from books join mm on books.id = mm.book_id

This variation appears to produce a better execution plan (on Oracle, at least) than Peter Lang's solution for the following reasons (paraphrased from EXPLAIN PLAN):

  • The join between tags and taggings is performed table-to-index instead of table-to-table. I do not know if this would actually impact query performance for large datasets.

  • The plan groups and counts the dataset before performing the final join with books. This would most certainly impact performance for large datasets.

Vadim K.
`with` won't work on MySQL, but if it's a real win, I certainly wouldn't mind switching to PostgreSQL, which does support it. Another very fine answer!
James A. Rosen
Would changing the `mm` sub-query as follows be an effective way of getting books matching *any* of the tags, but ordered by number of matches? `mm as (select book_id, COUNT(*) as taggings_count from taggings join tt on taggings.tag_id = tt.id group by taggings.book_id) select books.*, mm.taggings_count ... order by mm.taggings_count`
James A. Rosen
James, your change to `mm` produces the result you describe. By the way, you can probably have MySQL produce a similar execution plan by using nested sub-queries in the `from` clause.
Vadim K.