views:

252

answers:

4

I'm in the early stages of my database design so nothing is final yet, and I'm using the "TOXI" 3-table design for my threads which have optional tags, but I can't help but feel that the joining is not really necessary and perhaps I need to just rely on a simple tags column in my posts table where I can just store a varchar of something like <tag>, <secondTag>.

So to recap:

  • is it worth the trouble of the extra left joins on the 2 tag tables instead of just having a tag column in my posts table.
  • is there a way I can optimize my query?

Schema:

posts
---------

post_id      post_name      
1            test


post_tags
---------
tag_id         tag_name
1              mma
2              ufc


posts_tags_map
--------
map_id         post_id         tags_id
1              1                 1
2              1                 2


Current query:

SELECT 
posts.*,
GROUP_CONCAT( post_tags.tag_name order by post_tags.tag_name ) AS tags

FROM posts
LEFT JOIN posts_tags_map
  ON posts_tags_map.post_id = posts.post_id
LEFT JOIN post_tags
  ON posts_tags_map.tags_id = posts_tags.tag_id

WHERE posts.post_id = 1
GROUP BY post_id

Result:

post_id     post_name        tags
1             test           mma, ufc

*IF* there are tags
+5  A: 

Having all tags in different records (normalized) means that you'll be able to rename the tags more easily should the need arise and track the tag name history.

SO, for instance, renamed SQL Server related tags at least thrice (mssql -> sqlserver -> sql-server).

Having all tags in one record (denormalized) means that you can index this column with a FULLTEXT index and search for posts having two or more tags at once:

SELECT  *
FROM    posts
WHERE   MATCH(tags) AGAINST('+mma +ufc')

which is possible too but less efficient with normalized design.

(Don't forget to adjust @ft_min_word_len to index tags of 3 characters or less for this to work)

You can combine both designs: store both the map table and the denormalized column. This will require more maintenance, though.

You can also store the normalized design in your database and use the query you provided to feed the tags to Sphinx or Lucene.

This way, you can do history digging with MySQL, fulltext tag searches using Sphinx, and no extra maintenance will be required.

Quassnoi
Note that, if using the default settings, a FULLTEXT index wouldn't index those two words: it wants at least 4 characters. You can change the settings, but it needs to be done.
Michael Madsen
`@Michael Madsen`: sure thing, thanks for pointing out.
Quassnoi
+3  A: 

Your query of a tag would be very slow if you had a varchar with a list of tags in. You would be doing something along the lines of where post.tag like '%mytag%' which would not perform anywhere near as well as searching on an indexed key.

[edit] This study shows performance of various ways of doing tagging systems (including FULLTEXT indexed) and suggests where and when you would like to use each one.

Jeremy French
`FULLTEXT` indexes are designed specially for the queries like that.
Quassnoi
Would it be as fast as querying an id on an indexed column?
Jeremy French
`@Jeremy French`: Which column? You cannot make a plain index to index one record with two keys. A query to retrieve all posts holding a series of id's is of course possible but less efficient.
Quassnoi
in the above schema querying post_tags for the name requires one text lookup where a fulltext index will help you. and then a join on post_tags_map which can use a numeric index. The alternative in the OP is to have a denormalized string with a comma separated list of tags in. This WILL be slower than the normalized version.
Jeremy French
`@Jeremy French`: `FULLTEXT` index is designed specially for fast querying of comma-separated (or any separated) strings. It indexes one record with as many keys are there are distinct words in the field. Say, you need to query all posts tagged `sql`, `mysql`, `optimization` and `normalization` (all at once). It's very easy and efficient with a `FULLTEXT` index: `MATCH (tags) AGAINST ('+sql +mysql +optimization +normalization')`. Now, how would you do the same with a normalized schema?
Quassnoi
"how would you do the same with a normalized schema?" With multiple joins. I'm not saying that you can't do it with a FULLTEXT index, only that it will not be as efficient. Indexes (especially text ones) add time to inserts and take up a lot memory.
Jeremy French
+3  A: 

If you use the VARCHAR hack, it will be nearly impossible for you to query the data. It will be hell to write a query which accurately and efficiently shows all posts with a given tag (and let's face it, that's a pretty big aspect of a tagging system): The accuracy part is hard because you need to consider all possibilities for the comma; the effeciency part is hard because searching in a string is much, much slower than looking at the full value of a field (moreso if you could use an integer).

So yes, it is most certainly worth it.

As far as making your query faster is concerned - make sure you have the relevant indexes on your tables. Run an EXPLAIN on the query to see where any bottleneck is placed. I don't think it would be better to fetch the tags for each post as you process it, but it might be - I'm not sure how efficient MySQL really is at string manipulation, which is what it's doing when you do the GROUP_CONCAT.

Michael Madsen
+1  A: 

Joining (when you have correct indexes) is generally much faster than trying to pull data out of the middle of a comma delimited string in a field even using full text search. Or you could go with a bunch of separate tag fields (Tag1, tag2, tag3) and querying will still be harder (let me search 5 fields to find if I have used that tag) and you would need to add a new column every time you need to add a new tag and you've used up the existing columns. The normalized database design is the best possible, most performant way to go. Databases are designed to use joins. Why you wouldn't want to use them is beyond me.

HLGEM