As a follow-up to this question, http://stackoverflow.com/questions/1202668/problem-with-sql-query, which had a very neat solution, I was wondering what the next step would look like:

DOCUMENT_ID | TAG
------------+------
1           | tag1
1           | tag2
1           | tag3
2           | tag2
3           | tag1
3           | tag2
4           | tag1
5           | tag3

So, to get all the document_ids that have both tag1 and tag2, we would run a query like this:

SELECT document_id
FROM table
WHERE tag = 'tag1' OR tag = 'tag2'
GROUP BY document_id
HAVING COUNT(DISTINCT tag) = 2

Now, what would be interesting to know is how we would get all the distinct document_ids that have both tag1 and tag2, plus the ids that have tag3. We could imagine running the same query and taking a union with a second one:

SELECT document_id
FROM table
WHERE tag = 'tag1' OR tag = 'tag2'
GROUP BY document_id
HAVING COUNT(DISTINCT tag) = 2
UNION
SELECT document_id
FROM table
WHERE tag = 'tag3'
GROUP BY document_id

But I was wondering whether, with that condition added, a different initial query would be a better fit. I am imagining many "unions" like that, each with different tags and tag counts. Wouldn't chaining unions like that be very bad for performance?

+2  A: 

This still uses unions of sorts but may be easier to read and control. I am really interested in the speed of this query on a large data set, so please let me know how fast it is. When I put in your small data set it took 0.0001 secs.

SELECT DISTINCT (dt1.document_id)
FROM 
  document_tag dt1,
  (SELECT document_id
    FROM document_tag
    WHERE tag =  'tag1'
  ) AS t1s,
  (SELECT document_id
    FROM document_tag
    WHERE tag =  'tag2'
  ) AS t2s,
  (SELECT document_id
    FROM document_tag
    WHERE tag =  'tag3'
  ) AS t3s
WHERE
  (dt1.document_id = t1s.document_id
  AND dt1.document_id = t2s.document_id
  )
  OR dt1.document_id = t3s.document_id

This will make it easy to add new parameters because you have already specified the result set for each tag.

For example adding:

OR dt1.document_id = t2s.document_id

to the end will also pick up document_id 2.
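
One caveat with the comma-join form: the derived tables are cross joined, so if any one of them is empty (say, no document has tag3 at all), the whole query returns no rows, and the intermediate product can grow large. As a hedged alternative, not the query I timed above, the same conditions can be written with EXISTS. This is only a sketch; the aliases t1, t2 and t3 are made up for illustration.

```sql
-- Sketch: same logic as above, (tag1 AND tag2) OR tag3, using EXISTS
-- instead of cross-joined derived tables.
SELECT DISTINCT dt1.document_id
FROM document_tag dt1
WHERE (
        EXISTS (SELECT 1 FROM document_tag t1
                WHERE t1.document_id = dt1.document_id AND t1.tag = 'tag1')
    AND EXISTS (SELECT 1 FROM document_tag t2
                WHERE t2.document_id = dt1.document_id AND t2.tag = 'tag2')
      )
   OR EXISTS (SELECT 1 FROM document_tag t3
              WHERE t3.document_id = dt1.document_id AND t3.tag = 'tag3');
```

Adding another alternative is then one more OR EXISTS (...) term, and a tag that no document carries simply never matches instead of emptying the result.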

Justin Giboney
A: 

It's possible to do this within a single query; however, you'll need to move your conditions from the WHERE clause into the HAVING clause in order to use a disjunction.
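
As a rough sketch of what that could look like (assuming the table is named document_tag, which the question does not confirm), each tag group gets its own conditional count, and the counts are combined with OR in the HAVING clause:

```sql
-- Sketch: documents that have (tag1 AND tag2) OR tag3, in one pass.
-- CASE returns NULL for tags outside the group, and COUNT ignores NULLs.
SELECT document_id
FROM document_tag
WHERE tag IN ('tag1', 'tag2', 'tag3')   -- optional pre-filter to shrink the groups
GROUP BY document_id
HAVING COUNT(DISTINCT CASE WHEN tag IN ('tag1', 'tag2') THEN tag END) = 2
    OR COUNT(DISTINCT CASE WHEN tag = 'tag3' THEN tag END) = 1;
```

On the sample data in the question this should return document_ids 1, 3 and 5, the same set as the UNION version.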

Alex Gaynor
A: 

You're correct, that will get slower and slower as you add new tags you want to look for in additional UNION clauses. Each UNION clause is an additional query that needs to be planned and executed. Sorting also gets more awkward, since any ORDER BY has to be applied to the combined UNION result as a whole.

You're looking for a basic data warehousing technique. First, let me recreate your schema with one additional table.

create table a (document_id int, tag varchar(10));

insert into a values (1, 'tag1'), (1, 'tag2'), (1, 'tag3'), (2, 'tag2'), 
                     (3, 'tag1'), (3, 'tag2'), (4, 'tag1'), (5, 'tag3');

create table b (tag_group_id int, tag varchar(10));

insert into b values (1, 'tag1'), (1, 'tag2'), (2, 'tag3');

Table b contains "tag groups". Group 1 includes tag1 and tag2, while group 2 contains tag3.

Now you can modify table b to represent the query you are interested in. When you are ready to query, you create temp tables to store aggregate data:

create temporary table c 
(tag_group_id int, count_tags_in_group int, tags_in_group varchar(255));

insert into c
select 
    tag_group_id,
    count(tag),
    group_concat(tag)
from b
group by tag_group_id;

create temporary table d (document_id int, tag_group_id int, document_tag_count int);

insert into d
select
    a.document_id,
    b.tag_group_id,
    count(a.tag) as document_tag_count
from a
inner join b on a.tag = b.tag
group by a.document_id, b.tag_group_id;

Now c contains the number of tags in each tag group, and d contains the number of tags each document has from each tag group. If a row in d has the same tag_group_id and the same count as a row in c, then that document has all of the tags in that tag group.

select 
    d.document_id as "Document ID",
    c.tags_in_group as "Matched Tag Group"
from d
inner join c on d.tag_group_id = c.tag_group_id
            and d.document_tag_count = c.count_tags_in_group
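
If you would rather not create temp tables, the same match can be sketched as a single query with c and d as derived tables. This is a variation on the approach above, not something I benchmarked. It uses COUNT(DISTINCT ...) on the assumption that a document could list the same tag more than once, and it returns the group id instead of the concatenated tag list.

```sql
-- Sketch: the c/d comparison without temp tables.
SELECT
    d.document_id,
    c.tag_group_id
FROM (
    -- tags each document has from each tag group
    SELECT a.document_id, b.tag_group_id,
           COUNT(DISTINCT a.tag) AS document_tag_count
    FROM a
    INNER JOIN b ON a.tag = b.tag
    GROUP BY a.document_id, b.tag_group_id
) AS d
INNER JOIN (
    -- size of each tag group
    SELECT tag_group_id, COUNT(DISTINCT tag) AS count_tags_in_group
    FROM b
    GROUP BY tag_group_id
) AS c ON d.tag_group_id = c.tag_group_id
      AND d.document_tag_count = c.count_tags_in_group;
```

With the sample rows above, documents 1 and 3 should match group 1, and documents 1 and 5 should match group 2.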

One cool thing about this approach is that you could run reports like 'Which documents have 50% or more of the tags in each of these tag groups?'

select 
    d.document_id as "Document ID",
    c.tags_in_group as "Matched Tag Group"
from d
inner join c on d.tag_group_id = c.tag_group_id
            and d.document_tag_count >= 0.5 * c.count_tags_in_group
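
To turn that listing into a count per tag group, one option is to aggregate over the same join. A sketch, assuming the temp tables c and d built above; the column alias matching_documents is made up for illustration:

```sql
-- Sketch: number of documents holding at least half of each group's tags.
-- LEFT JOIN keeps tag groups that no document matches (count 0).
SELECT
    c.tag_group_id,
    c.tags_in_group,
    COUNT(d.document_id) AS matching_documents
FROM c
LEFT JOIN d ON d.tag_group_id = c.tag_group_id
           AND d.document_tag_count >= 0.5 * c.count_tags_in_group
GROUP BY c.tag_group_id, c.tags_in_group;
```

On the sample data this should report 4 documents for group 1 (documents 1, 2, 3 and 4 each have at least one of its two tags) and 2 documents for group 2 (documents 1 and 5).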
mehaase