views:

38

answers:

1

Hey all,

This is a question that's a bit different from the typical "how do I make a tagging system" question, which has been thoroughly discussed on SO and elsewhere.

I want to crowdsource tagging so you don't have to rely on the owner of each document to completely list out applicable tags. At the same time, I don't want a random schmoe to be able to mess up everyone's tags by intentionally mistagging lots of documents.

How does a system like this work in general? For example, Slashdot.org relies on something like this to come up with story tags. (never having edited tags, I would be interested in hearing more about how this works.)

Now to make this more concrete: Suppose my tagging DB schema looks like this:

doc:     id, name, ...
tag:     id, tag_name
doc_tag: doc_id, tag_id, user_id

Now, each user can assign his/her own tags to the documents. One way to determine a consensus is to look at the fraction of people who labeled a document with a particular tag. This results in the below monstrosity of an SQL statement.

SELECT
   doc_id, tag_id,
   num_times_tagged, taggers_count,
   num_times_tagged/taggers_count AS popularity

FROM doc_tag

LEFT JOIN (
   SELECT doc_id, tag_id, COUNT(*) AS num_times_tagged
   FROM doc_tag GROUP BY doc_id, tag_id
) num_times
ON doc_tag.doc_id = num_times.doc_id AND
   doc_tag.tag_id = num_times.tag_id

LEFT JOIN (
   SELECT doc_id, COUNT(DISTINCT user_id) AS taggers_count
   FROM doc_tag GROUP BY doc_id
) num_taggers
ON doc_tag.doc_id = num_taggers.doc_id

GROUP BY doc_tag.doc_id, doc_tag.tag_id

Am I going about this completely wrong? This seems like a really expensive query to make. Suppose I just wanted to get a list of documents and top tags for each one -- how would I even write a join for this? I don't want to run this query for every document that I get!

Thanks for any advice.

David

A: 

Here's a cleaner query:

SELECT
   doc_id, 
   tag_id,
   COUNT(*) AS num_times_tagged, 
   COUNT(DISTINCT user_id) AS taggers_count,
   COUNT(*)/COUNT(DISTINCT user_id) AS popularity

FROM doc_tag
GROUP BY doc_tag.doc_id, doc_tag.tag_id

Also, I'm not familiar with all RBDMS, but if you're using Sql Server, you can create a view, then a clustered index on top of the view. This will slow down your inserts to doctag, but make your reads from that view really fast.

Shlomo