Hello.

Sorry, I couldn't provide a better title for my problem as I am quite new to SQL. I am looking for a SQL query string that solves the below problem.

Let's assume the following table:

DOCUMENT_ID |     TAG
----------------------------
   1        |   tag1
   1        |   tag2
   1        |   tag3
   2        |   tag2
   3        |   tag1
   3        |   tag2
   4        |   tag1
   5        |   tag3

Now I want to select all distinct document IDs that have every tag in a given list of one or more tags. For example, selecting all document IDs with tag1 and tag2 would return 1 and 3 (but not 4, for example, as it doesn't have tag2).

What would be the best way to do that?

Regards, Kai

+13  A: 
SELECT document_id
FROM table
WHERE tag = 'tag1' OR tag = 'tag2'
GROUP BY document_id
HAVING COUNT(DISTINCT tag) = 2

Edit:

Updated for lack of constraints...

John Rasch
Make sure to adjust that last line with the correct number of tags you are looking for, i.e. if you are looking for those documents tagged with tag1, tag2 and tag3, you would need to use HAVING COUNT(document_id) >= 3
Peter Di Cecco
Aside from using or in favor of in, that's about what I came up with (you got there first). +1. Oh, and as long as you don't get dups, you don't need to worry about >= vs =
BCS
-1 If duplicates of Document_ID and Tag are allowed in the data, the above will not work. Two rows of (10, tag1) will return 10 even though no row has (10, tag2)
Shannon Severance
@Shannon - this was only recently added in the comments as an issue... most people use constraints properly
John Rasch
Making the last line HAVING COUNT(DISTINCT tag) = 2 will work on SQL Server and Oracle.
Shannon Severance
@John. Without knowing the context, it is impossible to know if a unique constraint on (document_id, tag) is an appropriate constraint or not.
Shannon Severance
Thank you all. It was also nice to see how dynamic stackoverflow is. Thanks again.
Zardoz
It's pretty awesome to watch in action, isn't it?
Chris Porter
A: 
select document_id 
from {TABLE} 
where tag in ('tag1','tag2')
group by document_id 
having count(distinct tag) >= 2

How you generate the list of tags in the where clause depends on your application structure. If you are dynamically generating the query as part of your code then you might simply construct the query as a big dynamically generated string.
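
As a rough illustration of the dynamically generated approach, here is a minimal sketch that builds the IN list from placeholders instead of concatenating tag values into the string, and passes the number of distinct tags to the HAVING clause. It assumes Python with the standard sqlite3 module, and the table name document_tags and the helper find_documents_with_all_tags are made up for the example; none of that comes from the question.

```python
import sqlite3

def find_documents_with_all_tags(conn, tags):
    """Return the document IDs that carry every tag in `tags` (hypothetical helper)."""
    wanted = sorted(set(tags))  # ignore duplicate tags in the input list
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)  # one "?" per requested tag
    sql = (
        "SELECT document_id FROM document_tags "
        f"WHERE tag IN ({placeholders}) "
        "GROUP BY document_id "
        "HAVING COUNT(DISTINCT tag) = ? "
        "ORDER BY document_id"
    )
    rows = conn.execute(sql, [*wanted, len(wanted)]).fetchall()
    return [row[0] for row in rows]

# Sample data from the question, loaded into an in-memory database.
# The table name document_tags is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document_tags (document_id INTEGER, tag TEXT)")
conn.executemany(
    "INSERT INTO document_tags VALUES (?, ?)",
    [(1, "tag1"), (1, "tag2"), (1, "tag3"), (2, "tag2"),
     (3, "tag1"), (3, "tag2"), (4, "tag1"), (5, "tag3")],
)

print(find_documents_with_all_tags(conn, ["tag1", "tag2"]))  # [1, 3]
```

Only the placeholder list is generated; the tag values themselves travel as bound parameters, which avoids quoting problems when tags come from user input.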

We always used stored procedures to query the data. In that case, we pass in the list of tags as an XML document. A procedure like that might look something like one of the following, where the input argument would be:

<tags>
   <tag>tag1</tag>
   <tag>tag2</tag>
</tags>


CREATE PROCEDURE [dbo].[GetDocumentIdsByTag]
@tagList xml
AS
BEGIN

-- number of distinct tags requested
declare @tagCount int
select @tagCount = count(distinct R.tags.value('.','varchar(20)'))
from @tagList.nodes('tags/tag') R(tags)


SELECT document_id
FROM {TABLE}
JOIN @tagList.nodes('tags/tag') R(tags) ON {TABLE}.tag = R.tags.value('.','varchar(20)')
group by document_id 
having count(distinct tag) >= @tagCount 

END

OR

CREATE PROCEDURE [dbo].[GetDocumentIdsByTag]
@tagList xml
AS
BEGIN

-- number of distinct tags requested
declare @tagCount int
select @tagCount = count(distinct R.tags.value('.','varchar(20)'))
from @tagList.nodes('tags/tag') R(tags)


SELECT document_id
FROM {TABLE}
WHERE tag in
(
SELECT R.tags.value('.','varchar(20)') 
FROM @tagList.nodes('tags/tag') R(tags)
)
group by document_id 
having count(distinct tag) >= @tagCount 

END

James Conigliaro
This is wrong. in ('tag1','tag2') would return 1, 2, 3, and 4. He stated that he only wanted IDs with BOTH tags returned.
Kevin Crowell
XML document, eh? Still in the stone-age myself with a split function and commas. Like the cut of your jib, sir.
Paul Alan Taylor
Missed the requirement of having both tags
James Conigliaro
Added clauses to account for requirement of having ALL tags
James Conigliaro
+1  A: 
select DOCUMENT_ID
   from {TABLE}
   where TAG in ('tag1', 'tag2', ... 'tagN')
   group by DOCUMENT_ID
   having count(distinct TAG) = N

Adjust N and the tag list as needed.

BCS
+7  A: 

This assumes DocumentID and Tag are the Primary Key.

Edit: Changed HAVING clause to count DISTINCT tags. That way it doesn't matter what the primary key is.

Test Data

-- Populate Test Data
CREATE TABLE #table (
  DocumentID varchar(8) NOT NULL, 
  Tag varchar(8) NOT NULL
)

INSERT INTO #table VALUES ('1','tag1')
INSERT INTO #table VALUES ('1','tag2')
INSERT INTO #table VALUES ('1','tag3')
INSERT INTO #table VALUES ('2','tag2')
INSERT INTO #table VALUES ('3','tag1')
INSERT INTO #table VALUES ('3','tag2')
INSERT INTO #table VALUES ('4','tag1')
INSERT INTO #table VALUES ('5','tag3')

INSERT INTO #table VALUES ('3','tag2')  -- Edit: test duplicate tags

Query

-- Return Results
SELECT DocumentID FROM #table
WHERE Tag IN ('tag1','tag2')
GROUP BY DocumentID
HAVING COUNT(DISTINCT Tag) = 2

Results

DocumentID
----------
1
3
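
For anyone who wants to check the effect of duplicate rows without a SQL Server instance, here is a small sketch that runs the same query shape with three different HAVING clauses. It assumes Python's standard sqlite3 module and an ordinary table named doc_tags standing in for #table. Document '6', which has the row ('6','tag1') twice and no tag2, is an extra row invented for this check and is not part of the test data above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# doc_tags stands in for #table; the name is an assumption for this sketch.
conn.execute("CREATE TABLE doc_tags (DocumentID TEXT NOT NULL, Tag TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO doc_tags VALUES (?, ?)",
    [("1", "tag1"), ("1", "tag2"), ("1", "tag3"), ("2", "tag2"),
     ("3", "tag1"), ("3", "tag2"), ("4", "tag1"), ("5", "tag3"),
     ("3", "tag2"),                   # duplicate row from the test data
     ("6", "tag1"), ("6", "tag1")],   # invented: duplicate tag1, no tag2
)

def matching_ids(having_clause):
    """Run the tag query with the given HAVING condition and return the IDs."""
    sql = (
        "SELECT DocumentID FROM doc_tags "
        "WHERE Tag IN ('tag1', 'tag2') "
        "GROUP BY DocumentID "
        f"HAVING {having_clause} "
        "ORDER BY DocumentID"
    )
    return [row[0] for row in conn.execute(sql)]

by_rows_eq = matching_ids("COUNT(*) = 2")             # misses 3, wrongly keeps 6
by_rows_ge = matching_ids("COUNT(*) >= 2")            # keeps 3, still keeps 6
by_distinct = matching_ids("COUNT(DISTINCT Tag) = 2") # only documents with both tags

print(by_rows_eq)   # ['1', '6']
print(by_rows_ge)   # ['1', '3', '6']
print(by_distinct)  # ['1', '3']
```

Counting rows goes wrong in both directions once duplicates exist, while counting distinct tags returns only documents 1 and 3.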
beach
HAVING COUNT(*) = 2 instead of >= 2 would rule out documents with more than 1 instance of a given DocumentID and Tag, assuming the data rules allow that.
Chris Porter
Yep. That is why I was assuming the primary key (or unique key) was DocumentID and Tag. Otherwise, as you suggest, changing to HAVING COUNT(*) >= 2 will account for that.
beach
If dup tags are legal, the whole thing crashes down because you are only counting tags, not distinct tags.
BCS
Very good point BCS, dup tags would break both beach's and John's answers.
Chris Porter
Okay, changed implementation to account for duplicate tags (assuming the primary key is not DocumentID,Tag). *HAVING COUNT(DISTINCT Tag) = 2* should now work in all cases.
beach
Wouldn't adding Tag to the GROUP BY statement correct for duplicate entries in #table?
Chris Porter
@Chris- nope. Adding the *Tag* to the GROUP BY and keeping the *HAVING COUNT (DISTINCT Tag) = 2* would return no rows as you will only ever get a single Tag in a group. Changing the GROUP BY and the HAVING to *COUNT (*) = 2* would only return groupings that contained duplicate tags (Document 3 in my updated example.) So in this case, adding Tag to the GROUP BY wouldn't work.
beach
+1 - more work than I wanted to do for this! :)
John Rasch
My question was posted before you did the HAVING COUNT(DISTINCT Tag) update (or before I noticed it). Either way, you are correct, it would not help with the duplication of a DocumentID + Tag instance. I've never used Distinct in a HAVING COUNT() statement but I definitely like that approach and am filing that away in my brain somewhere for later use. Hopefully I tag it properly so it shows up in future brain queries.
Chris Porter