views:

58

answers:

3

Hello,

me and my colleagues are developing a website with similar idea as Stackoverflow, but for submitting tasks (and for internal use). This morning we talked about tagging tasks and couldn't really figure which option would be the fastest one, or if we aren't missing something.

Let's imagine table with tags, which would dynamically update, depending on users. Users can create any tags and they'd be added to this table. Structure is following:

  • ID
  • name
  • count

I'll get to actual point now. If you click on, for example, tag "PHP", it'd show you another page with all tasks tagged with "PHP". Something similar to this page. Important thing is this list of related tags. How to represent it in database?

Two options came to our minds, but I don't think any of them is actually the most effective one.

  1. Make select of all tasks with "PHP" tag and check what other tags they contain. In few years we might get an answer from server.

  2. Make a table with cols tag, related tag, count where would be all possible tag relationships. Only problem we see is duplicity. We could have tag PHP and related tag DB2, but we could also have tag DB2 with related tag PHP, which is of course the very same relationship, with the very same count.

I actually quite like option #2, but without duplicity. Perhaps option where there wouldn't be so close relationship between tags (as if there weren't any "primary" and "secondary" tags) could work the best. I'm not really very sure at this point and I wouldn't like to model something that wouldn't work in the future or would be way too slow if there were, for example, one million tags.

We will use PHP and mySQL or DB2, but I guess that doesn't matter.

So, the actual questions is: Are there any other, possibly better options? In case of any questions, just ask me.

Thanks in advance.

+1  A: 

I suppose that if you have a table of "Tags assigned to Task X" with correct/clever indices, finding the tags as described in option 1) should not take that long using a join. That would be the most dynamic approach.

Option two would provide you with means to do a "Tag X is often used together with Tags Y and Z" query and could statically be filled when a new task is created, however, it would take more efford for example when a tag is added or removed from a task. That would be automatic for approach 1).

Approach 2) would (as you described it) not allow you to get exactly the related tags for the current task, as you're not storing the task ID. If you did that, however, you are about at the same point as approach 1).

Thorsten Dittmar
+1  A: 

I assume that you are doing this because want the "show the top N tags related to 'tag'" query to be really fast.

If you are doing this in a DB, then your second approach is best. You might even consider creating an index which is ascending on the tag field and descending on the related-tag-count field.

But if you really want speed, consider representing this as an in-memory data structure.

Stephen C
+1  A: 

I assume that you represent the task-tag relationship using a separate table (just task-id, tag-id), so the first option that you describe would be a "simple" join from your task table to the tags table using the task-tag relationship table. I'm afraid that my SQL-knowledge has dried up a bit, so I wouldn't trust myself to give you advice on exactly what type of INNER/OUTER/LEFT/RIGHT join it would call for, nor what type of performance you could expect from that with the proper indexes build etcetera. Try it out, that's probably the best thing to do... The sql statement can be built using Visual Studio/Access/probably something else.

I would assume that your second approach is faster if you expect there to be many items in your db. However, I would definitely recommend you to do proper performance testing to determine this rather than guessing. Either way, you can get rid of the duplicity by only storing for one of the tag-tag pairs (i.e. db2-php and not php-db2 for instance). Which one to store could be determined by ordering them by id for instance so that you always store them with the tag with the smallest id first.

I would also guess that your first option is faster to get started with, so that you could begin with using that and then go for the second option once you have time to do so or once it becomes a performance problem.

villintehaspam