I know this question might sound a little cheesy, but this is the first time I am implementing a "tagging" feature for one of my project sites, and I want to make sure I do everything right.

Right now, I am using the very same tagging system as SO: tags are space-separated, and multi-word tags are joined with a dash (-). So when validating the user-input tag field, I check for the following:

  1. Empty string (cannot be empty)
  2. Make sure the string doesn't contain particular characters (suggestions are welcome here)
  3. At least one word
  4. If there is a space (i.e. more than one word), split the string
  5. For each resulting part, insert it into the DB

Am I missing something here, or is this roughly OK?

A: 

I hope you're doing the usual protection against injection attacks - maybe that's included under #2.

At the very least, you're going to want to escape quote characters and make embedded HTML harmless; in PHP, functions like addslashes and htmlentities can help you with that. Given that it's for a tagging system, my guess is you'll only want to allow alphanumeric characters. I'm not sure what the best way to accomplish that is, but regular expressions might do it.
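To make the two concerns concrete, here is a minimal sketch in Python rather than PHP (my choice of language, not the asker's). It swaps manual quote escaping for a parameterized query, which is generally considered more robust than addslashes-style escaping, and uses html.escape as a rough analogue of htmlentities. The `tags` table and both function names are made up for illustration.

```python
import html
import sqlite3

def store_tag(conn: sqlite3.Connection, tag: str) -> None:
    # The ? placeholder makes the driver bind the value, so quote
    # characters in the tag are never interpreted as SQL.
    conn.execute("INSERT INTO tags (name) VALUES (?)", (tag,))

def render_tag(tag: str) -> str:
    # Escape <, >, & and quotes before the tag is written into HTML.
    return '<span class="tag">' + html.escape(tag) + "</span>"

# Hypothetical in-memory table, just to exercise the two helpers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT)")
store_tag(conn, "o'reilly")
stored = conn.execute("SELECT name FROM tags").fetchone()[0]
```

The point is that escaping for the database and escaping for HTML output are separate steps, each done at the moment the value crosses that boundary.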

IanGreenleaf
+2  A: 

Split the string at " ", iterate over the parts, make sure that they comply with your expectations. If they do, put them into the DB.

For example, you can use this regex to check the individual parts:

^[-\w]{2,25}$

This would limit each tag to 2 to 25 consecutive characters drawn from alphanumerics, "_" (which is part of "\w"), and "-" (because you asked for it). This essentially removes any code injection threat you might be facing.

EDIT: In place of the "\w", you are free to use any more narrowly defined range of characters; I chose it for simplicity only.
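As a rough sketch of the split-then-check approach, here it is in Python (an assumption; the question doesn't say which language is in use). The function name `valid_tags` is made up. Note that whether "\w" covers non-ASCII letters depends on the regex engine; in Python it does by default, so the sketch passes re.ASCII to get plain [A-Za-z0-9_].

```python
import re

# Same pattern as above. re.ASCII restricts \w to [A-Za-z0-9_]
# instead of Python's default Unicode-aware matching.
TAG_RE = re.compile(r"^[-\w]{2,25}$", re.ASCII)

def valid_tags(raw: str) -> list[str]:
    """Split on " " and keep only the parts that match TAG_RE."""
    # fullmatch is used because "$" alone would also accept a part
    # that ends in a newline.
    return [part for part in raw.split(" ") if TAG_RE.fullmatch(part)]
```

This version silently drops parts that fail the check (including the empty strings produced by double spaces); rejecting the whole input with an error message would be an equally reasonable design.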

Tomalak
+1  A: 

Be sure your algorithm can handle leading/trailing/extra spaces with no trouble = )

Also worth thinking about might be a tag blacklist for inappropriate tags (profanity for example).
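A small Python sketch of both points, under a few assumptions of mine: the blacklist contents are placeholders, and lowercasing and dropping duplicate tags are extra normalization steps that nobody asked for but that usually come up at the same time.

```python
# Hypothetical blacklist; a real one would more likely live in the
# database or a config file.
BLOCKED_TAGS = {"spam", "badword"}

def clean_tags(raw: str) -> list[str]:
    """Normalize whitespace and case, then drop blocked and repeated tags."""
    cleaned: list[str] = []
    # str.split() with no argument splits on any run of whitespace and
    # ignores leading/trailing whitespace, so no empty parts appear.
    for tag in raw.lower().split():
        if tag in BLOCKED_TAGS or tag in cleaned:
            continue
        cleaned.append(tag)
    return cleaned
```

For example, `clean_tags("  PHP   mysql php spam ")` gives `["php", "mysql"]`.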

rlb.usa
+1  A: 

I've never implemented a tagging system, but am likely to do so soon for a project I'm working on. I'm primarily a database guy and it occurs to me that for performance reasons it may be best to relate your tagged entities with the tag keywords via a resolution table. So, for instance, with example tables such as:

TechQuestion
  TechQuestionID (pk)
  SubjectLine
  QuestionBody

TechQuestionTag
  TechQuestionID (pk)
  TagID (pk)
  Active (indexed)

Tag
  TagID (pk)
  TagText (indexed)

... you'd only add new Tag table entries when never-before-used tags were used. You'd re-associate previously provided tags via the TechQuestionTag table entry. And your query to pull TechQuestions related to a given tag would look like:

SELECT
 q.TechQuestionID,
 q.SubjectLine,
 q.QuestionBody
FROM
 Tag t INNER JOIN TechQuestionTag qt
  ON t.TagID = qt.TagID AND qt.Active = 1
 INNER JOIN TechQuestion q
  ON qt.TechQuestionID = q.TechQuestionID
WHERE
 t.TagText = @tagText

... or what have you. I don't know, perhaps this was obvious to everyone already, but I thought I'd put it out there, because I don't believe the alternative (redundant, indexed, text-tag entries) would query as efficiently.
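For anyone who wants to try the layout, here is a minimal sketch using Python's sqlite3 module (my choice; the tables above are not tied to any particular database). The column types, the UNIQUE constraint on TagText, and the helper names `tag_question` and `questions_for_tag` are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TechQuestion (
    TechQuestionID INTEGER PRIMARY KEY,
    SubjectLine    TEXT,
    QuestionBody   TEXT
);
CREATE TABLE Tag (
    TagID   INTEGER PRIMARY KEY,
    TagText TEXT UNIQUE  -- UNIQUE also gives TagText an index
);
CREATE TABLE TechQuestionTag (
    TechQuestionID INTEGER REFERENCES TechQuestion(TechQuestionID),
    TagID          INTEGER REFERENCES Tag(TagID),
    Active         INTEGER DEFAULT 1,
    PRIMARY KEY (TechQuestionID, TagID)
);
CREATE INDEX idx_tqt_active ON TechQuestionTag (Active);
""")

def tag_question(conn, question_id, tag_text):
    # Add a Tag row only when this text has never been used before.
    conn.execute("INSERT OR IGNORE INTO Tag (TagText) VALUES (?)", (tag_text,))
    tag_id = conn.execute(
        "SELECT TagID FROM Tag WHERE TagText = ?", (tag_text,)
    ).fetchone()[0]
    # Create the link row, or re-activate it if it already exists.
    conn.execute(
        "INSERT OR REPLACE INTO TechQuestionTag "
        "(TechQuestionID, TagID, Active) VALUES (?, ?, 1)",
        (question_id, tag_id),
    )

def questions_for_tag(conn, tag_text):
    # Same join as the query above, with a bound parameter for the tag.
    return conn.execute(
        """SELECT q.TechQuestionID, q.SubjectLine, q.QuestionBody
           FROM Tag t
           INNER JOIN TechQuestionTag qt
             ON t.TagID = qt.TagID AND qt.Active = 1
           INNER JOIN TechQuestion q
             ON qt.TechQuestionID = q.TechQuestionID
           WHERE t.TagText = ?""",
        (tag_text,),
    ).fetchall()
```

Calling `tag_question` twice with the same tag text leaves a single row in Tag, which is the "only add never-before-used tags" behaviour described above.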

codemonkey