views:

581

answers:

8

I have a StackOverflow-like tagging system for a database I'm working on. And I'm writing a stored procedure that looks for results based on an undetermined number of tags in a WHERE clause. There could be anywhere between 0 and 10 tags to filter results. So for example the user could be searching for items tagged with 'apple', 'orange', and 'banana' and each result must include all 3 tags. My query is made even more complicated because I'm also dealing with a cross reference table for the tagging, but for the purposes of this question I won't go into that.

I know I can do some string manipulation and feed the exec() function a query to take care of this but I'd rather not for performance problems associated with dynamic SQL. I figure it's best if SQL caches a query plan for the stored proc.

What are some techniques you've used to avoid dynamic SQL in this type of scenario?

By popular demand, here's the query I'm working with:

SELECT ft.[RANK], s.shader_id, s.page_name, s.name, s.description, s.download_count, s.rating, s.price FROM shader s 
INNER JOIN FREETEXTTABLE(shader, *, @search_term) AS ft ON s.shader_id = ft.[KEY]
WHERE EXISTS(SELECT tsx.shader_id FROM tag_shader_xref tsx INNER JOIN tag t ON tsx.tag_id = t.tag_id WHERE tsx.shader_id = s.shader_id AND t.tag_name = 'color')
AND EXISTS(SELECT tsx.shader_id FROM tag_shader_xref tsx INNER JOIN tag t ON tsx.tag_id = t.tag_id WHERE tsx.shader_id = s.shader_id AND t.tag_name = 'saturation')
ORDER BY ft.[RANK] DESC

This is functional but hard-coded. You'll see that I have it set to look for the 'color' and 'saturation' tags.

A: 

String the tags together with a comma separating them 'apple','orange' and then pass it in to one parameter that uses the IN clause on it in your stored procedure.

Of course if you have the values (key) from the lookup table for these tags I would use those.

EDIT:

Since you need all tags in the result....

Unfortunately, I think no matter what you do, the SP will be in jeopardy of the plan being regenerated.

You could use optional parameters and use CASE and ISNULL to build up the arguments.

I still think this means your SP has lost most of it's cached goodness but it's better than straight exec 'string' I believe.

klabranche
I thought about this. Unfortunately, this won't work in my case because IN acts as an OR. And I need the functionality of AND.
Steve Wortham
Ah, you need all the tags to be found not just any of them.... Sorry, you did mention that in your post.
klabranche
No problem. It's a good idea though. I reworded the question a bit to include that requirement.
Steve Wortham
You can't use a parameter in an IN clause that way anyway. You can only say "IN (@param1, @param2, @param3)", you can't put the parameters in a comma-separated list and do "IN @params".
Todd Owen
@Todd Owen - Good point. Although it didn't matter since the IN acts like an OR; Hence my edit.
klabranche
+9  A: 

For an extensive overview concerning this and similar problems see: http://www.sommarskog.se/dyn-search-2005.html

Specific to your question is the part here: http://www.sommarskog.se/dyn-search-2005.html#AND%5FISNOTNULL

Also take into account that a (straight) dynamic Solution is not necessarily slower than a (possibly convoluted) static one, as query plans can still get cached: see http://www.sommarskog.se/dyn-search-2005.html#dynsql

So you'll have to carefully test/measure your options against realistic amounts of data, taking into account realistic queries (e.g. searches with one or two parameters might be way more common than searches with ten, etc.)


EDIT: Questioner gave a good reason to optimize this in the comments, hence moving the 'premature' warning a bit out of the way:

The (standard ;) word of warning applies, though: This smells a lot like premature optimization! - Are you sure this sproc will get called that often that using dynamic SQL will be significantly slower (that is, compared to other stuff going on in your app)?

Henrik Opel
+1 Yep, this looks like it will work and also have good performance, although it means that the maximum number of tags (10 in this case) is hard-coded.
Todd Owen
Thanks for the suggestion. I'm going to review this in-depth tomorrow as well as the other answers here to be sure. As for the premature optimization thought -- this query will be absolutely critical to the operation of the site. I'm even initiating this query via AJAX on an as-you-type basis. So it's important that I extract the most performance possible out of it. I'm planning on testing several techniques to determine which is fastest.
Steve Wortham
I have more details about my final implementation in my own answer below. You can also see this search engine in action athttp://beta.silverlightshaders.net/pixel/
Steve Wortham
A: 

This may not be the fastest method but could you just generate a query string for each tag and then join them with " INTERSECT "?

Edit: Didn't see the sproc tag so I don't know if this would be possible.

llamaoo7
+1  A: 

I've seen two types of solutions to this problem:

The first is to join the shader table to tags (via the xref as needed) once for each tag that you're looking for. The result of the inner join includes only shaders that have a match for all tags.

SELECT s.*
FROM shader s
JOIN tag_shader_xref x1 ON (s.shader_id = x1.shader_id)
JOIN tag t1 ON (t1.tag_id = x1.tag_id AND t1.tag_name = 'color')
JOIN tag_shader_xref x2 ON (s.shader_id = x2.shader_id)
JOIN tag t2 ON (t2.tag_id = x2.tag_id AND t2.tag_name = 'saturation')
JOIN tag_shader_xref x3 ON (s.shader_id = x3.shader_id)
JOIN tag t3 ON (t3.tag_id = x3.tag_id AND t3.tag_name = 'transparency');

The second solution is to join to that tags once, restricting tags to the three you need, and then GROUP BY the shader_id so you can count the matches. The count will be three only if all tags were found (assuming uniqueness in the xref table).

SELECT s.shader_id
FROM shader s
JOIN tag_shader_xref x ON (s.shader_id = x.shader_id)
JOIN tag t ON (t.tag_id = x.tag_id 
  AND t.tag_name IN ('color', 'saturation', 'transparency'))
GROUP BY s.shader_id
HAVING COUNT(DISTINCT t.tag_name) = 3;

Which should you use? Depends on how well your brand of database optimizes one method or the other. I usually use MySQL, which doesn't do as well with GROUP BY, so it's better to use the former method. In Microsoft SQL Server, the latter solution might do better.

Bill Karwin
The second example could provide false positives - a SHADER record might have 2+ tag_name values of 'color'/etc.
OMG Ponies
How about now? cf. `COUNT(DISTINCT t.tag_name)`
Bill Karwin
The next problem is that you'd have to use dynamic SQL in order to specific the list of possibilities for your IN clause. A string/varchar variable containing a comma delimited list won't be accepted. Even if it, it won't see into the variable to grasp that it should be dealing with a comma separated list.
OMG Ponies
You can still use query parameters, but you need as many parameter placeholders as you have elements in your array: "`tag.name IN (?, ?, ?)`" See: http://stackoverflow.com/questions/337704/parameterizing-a-sql-in-clause/337792#337792
Bill Karwin
Yes, but you have to hardcode the parameters. Depending on the constraints, you might not be able to use NULL in these cases. The first example doesn't raise these concerns.
OMG Ponies
You would literally *never* have to use NULL for this query. "I want to make sure the shader has the following three properties: 'color', 'saturation', and, um..."
Bill Karwin
I don't know what you mean by hardcode the parameters.
Bill Karwin
tag_name IN (?, ?) == tag_name IN (@param1, @param2) etc. Hence, they are hard coded as opposed to dynamically constructing the IN clause in a single variable (which won't be accepted in anything but dynamic SQL). Per the OP, the parameters are optional - the param values would be defaulted to NULL until set.
OMG Ponies
You have to use dynamic SQL anyway. And if only two out of five possible tags are non-null, you'd only put two parameter placeholders in the expression. Don't put more placeholders in, if you already know the parameter values would be null.
Bill Karwin
+1  A: 

Your query is perfect for using a Common Table Expression (CTE) because of the duplicated correlated subquery in the EXISTS clauses:

WITH attribute AS(
  SELECT tsx.shader_id,
         t.tag_name
    FROM TAG_SHADER_XREF tsx ON tsx.shader_id = s.shader_id
    JOIN TAG t ON t.tad_id = tsx.tag_id)
SELECT ft.[RANK], 
       s.shader_id, 
       s.page_name, 
       s.name, 
       s.description, 
       s.download_count, 
       s.rating, 
       s.price 
  FROM SHADER s 
  JOIN FREETEXTTABLE(SHADER, *, @search_term) AS ft ON s.shader_id = ft.[KEY]
  JOIN attribute a1 ON a1.shader_id = s.shader_id AND a1.tag_name = 'color'
  JOIN attribute a2 ON a2.shader_id = s.shader_id AND a2.tag_name = 'saturation'
 ORDER BY ft.[RANK] DESC

By using the CTE, I also converted the EXISTS into JOINs.

Speaking to your original question regarding the use of dynamic SQL - the only alternative is to check the incoming parameter for an escape criteria before applying it. IE:

WHERE (@param1 IS NULL OR a1.tag_name = @param1)

If @param1 contains a NULL value, the later portion of the SQL in brackets is not executed. I prefer the dynamic SQL approach for sake that otherwise you're making JOINs/etc that might not be used - that's a waste of resources.

What performance problems do you believe to exist with dynamic SQL? Using sp_executesql does cache the query plan. Frankly I find it odd that a query plan wouldn't be cached if the query is validated for syntax/etc (using exec or sp_executesql) - the validation would happen prior to the query plan, why the step afterwards be skipped?

OMG Ponies
Wow... I didn't know about CTE's or that sp_executesql caches query plans. I learn something new every day. By the way, I've done dynamic sql in the past with the exec() function (which does not cache query plans) and that's why I try to stay away from it. But that's good to know about sp_executesql... good stuff.
Steve Wortham
A: 
Todd Owen
+1  A: 

How do I avoid dynamic SQL when using an undetermined number of parameters?

You may dynamically generate the appropriate parameterized (prepared) SQL templates instead.

Build and prepare the statement template when the parameters present themselves for the first time, caching the prepared statements for re-use when the same number of parameters appears again.

This could be done in the application or a sufficiently sophisticated stored procedure.

I much prefer this approach to, say, a procedure that takes at most 10 tags and has grody logic to deal with any of them being NULL.

Bill Karwin's GROUP BY answer in this question is probably the easiest template to construct -- you're simply concatenating placeholders for the IN predicate and updating the COUNT clause. Other solutions involving joins-per-tag would require incrementing table aliases (e.g., xref1, xref2, and so on) as you go.

pilcrow
+2  A: 

So this was easier than I expected. After implementing a rather simple query to take care of this, I instantly had far better performance than I thought I would. So I'm not sure it's necessary to implement and test the other solutions.

I currently have my database filled with around 200 shaders and 500 tags. I ran what I think is a somewhat realistic test where I performed 35 different search queries against my stored proc with a varying number of tags, with and without a search term. I put all of this in a single SQL statement and then I benchmarked the results in ASP.NET. It consistently ran these 35 searches in under 200 milliseconds. If I reduced it to just 5 searches then the time goes down to 10 ms. That kind of performance is awesome. It helps that my database size is small. But I think it also helps that the query utilizes indexes well.

One thing I changed in my query was the way I was looking up tags. I'm now looking up the tags by their id instead of the name. By doing this I can get away with doing 1 less join, and have the benefit of using an index for the search. And then I also added "dbo." to the front of the table names after learning that SQL caches queries on a per-user basis.

In case anyone is interested, here's my finished stored proc:

ALTER PROCEDURE [dbo].[search] 
    @search_term varchar(100) = NULL,
    @tag1   int = NULL,
    @tag2   int = NULL,
    @tag3   int = NULL,
    @tag4   int = NULL,
    @tag5   int = NULL,
    @tag6   int = NULL,
    @tag7   int = NULL,
    @tag8   int = NULL,
    @tag9   int = NULL,
    @tag10   int = NULL
AS
BEGIN
    SET NOCOUNT ON;

    IF LEN(@search_term) > 0
     BEGIN
      SELECT s.shader_id, s.page_name, s.name, s.description, s.download_count, s.rating, s.price FROM dbo.shader s 
      INNER JOIN FREETEXTTABLE(dbo.shader, *, @search_term) AS ft ON s.shader_id = ft.[KEY]
      WHERE (@tag1 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag1))
      AND   (@tag2 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag2))
      AND   (@tag3 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag3))
      AND   (@tag4 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag4))
      AND   (@tag5 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag5))
      AND   (@tag6 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag6))
      AND   (@tag7 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag7))
      AND   (@tag8 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag8))
      AND   (@tag9 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag9))
      AND   (@tag10 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag10))
      ORDER BY ft.[RANK] DESC
     END
    ELSE
     BEGIN
      SELECT s.shader_id, s.page_name, s.name, s.description, s.download_count, s.rating, s.price FROM dbo.shader s 
      WHERE (@tag1 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag1))
      AND   (@tag2 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag2))
      AND   (@tag3 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag3))
      AND   (@tag4 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag4))
      AND   (@tag5 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag5))
      AND   (@tag6 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag6))
      AND   (@tag7 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag7))
      AND   (@tag8 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag8))
      AND   (@tag9 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag9))
      AND   (@tag10 IS NULL OR EXISTS(SELECT 1 AS num FROM dbo.tag_shader_xref tsx WHERE tsx.shader_id = s.shader_id AND tsx.tag_id = @tag10))
     END
END

Even though I didn't exhaust every option, this was still a good exercise because I have proven to myself that my database design is working very well for this task. And I also learned a lot from posting this question. I knew exec() was bad because it doesn't cache the query plan. But I didn't know that sp_executesql caches query plans, and that's very cool. I also didn't know about Common Table Expressions. And the link Henrik Opel posted is packed full of good tips for this type of task.

Of course I still may revisit this a year from now if the database grows drastically. Until then, thanks everyone for the help.

UPDATE:

So I have a working beta of this search engine online at http://beta.silverlightshaders.net/pixel/ if anyone is interested in seeing this in action.

There are a couple bugs I need to work out. But for the most part I've been very happy with it. It is fast, and I think it's pretty user-friendly as well.

Steve Wortham
+1 for posting a follow up
Henrik Opel