views:

659

answers:

4

I have a case where i need to translate (lookup) several values from the same table. First way I wrote it, was using subqueries:

SELECT
   (SELECT id FROM user WHERE user_pk = created_by) AS creator,
   (SELECT id FROM user WHERE user_pk = updated_by) AS updater,
   (SELECT id FROM user WHERE user_pk = owned_by) AS owner,
   [name]
FROM asset

As I'm using this subquery a lot (ie I have about 50 tables with these fields), and I might need to add some more code to the subquery (f.eks "AND active = 1" ) I thougth id put these into a UDF and use that. But the performance using that UDF was abysmal.

CREATE FUNCTION dbo.get_user ( @user_pk INT )
RETURNS INT
AS BEGIN 
   RETURN ( SELECT id
            FROM   ice.dbo.[user]
            WHERE  user_pk = @user_pk )
END

SELECT dbo.get_user(created_by) as creator, [name]
FROM asset

Performance of #1 is less than 1 sec. Performance of #2 is about 30 seconds...

Could anyone explain to me why, or more importantly, is there anyway I can code in SQL server 2008 so that I don't have to use so many subqueries?

edit:

just a litte more explanation of when this is usefull. This simple querey (ie get userid) gets a lot more complex when I want to have a tekst for a user, since I have to join with profile to get language, with company to see if language should be fetch'ed from there instead, and with the translation table to get the translated text. And for most of these queries, performance is a secondary issue to read/maintainability.

+18  A: 

The udf is a black box to the query optimiser so it's executed every row. You are doing a row by row cursor. For each row in asset, look up an id 3 times in another table.

One of many articles on the problem

The subqueries can be optimised to correlate and avoid the row by row operations.

What you really want is this:

select 
   uc.id as creator,
   uu.id as updater,
   uo.id as owner,
   a.[name]
from
    asset a
    JOIN
    user uc ON uc.user_pk = a.created_by
    JOIN
    user uu ON uu.user_pk = a.updated_by
    JOIN
    user uo ON uo.user_pk = a.owned_by
gbn
To be equivalent to his original query he actually needs a left join, not an inner join, because the nested queries may return null
Scott Wisniewski
True, to be 100% like for like
gbn
Good explanation on the why the UDF kills performance, what throws me off is that the execution plan shows 3 steps: index seek 36%, Compute scalar 1% and hash match 63%. Anyway: a join is NOT an option, allready unreadable code will only be a LOOT worse. I'm looking for other solutions.
devzero
devzero, the join is the absolute fastest solution and is not unreadable to anyone familiar with t-sql.Joins are almost always far better than subqueries and tremendously better than UDFs which should be avoided at all costs.
HLGEM
HLGEM: the performance gain using joins is not so much compared to subselects, and have when one of the joins above is expanded into 4 join it's NOT readable. (ie the total number of joins above will be 13, and for my real query it would be about 25+)
devzero
The syntax for CTE and inline function is actually just about the same when used (declaration is very diffrent). So pro CTE: it does not clutter the DB, pro UDF: it can be reused in several queries, and thus better to maintain. I guess a view could work just as well as a UDF.
devzero
Reusability has a place: but udfs in this context are a performance killer. I'd use this udf to resolve a single name in places, not for a set based operation. I also wouldn't use a view, personally, because the temptation is to reuse it and kill performance again.
gbn
"The udf is a black box to the query optimiser so it's executed every row" sold me to upvote this question. I was never sure why execution plan did not calculate UDF well... but now i have a better idea why.
Sung Meister
+2  A: 

To get the same result (NULL if user is deleted or not active).

 select 
    u1.id as creator,
    u2.id as updater,
    u3.id as owner,
    [a.name]
 FROM asset a
        LEFT JOIN user u1 ON (u1.user_pk = a.created_by AND u1.active=1) 
        LEFT JOIN user u2 ON (u2.user_pk = a.created_by AND u2.active=1) 
        LEFT JOIN user u3 ON (u3.user_pk = a.created_by AND u3.active=1)
dmajkic
+6  A: 

As other posters have suggested, using joins will definitely give you the best overall performance.

However, since you've stated that that you don't want the headache of maintaining 50-ish similar joins or subqueries, try using an inline table-valued function as follows:

CREATE FUNCTION dbo.get_user_inline (@user_pk INT)
RETURNS TABLE AS
RETURN
(
    SELECT TOP 1 id
    FROM ice.dbo.[user]
    WHERE user_pk = @user_pk
        -- AND active = 1
)

Your original query would then become something like:

SELECT
    (SELECT TOP 1 id FROM dbo.get_user_inline(created_by)) AS creator,
    (SELECT TOP 1 id FROM dbo.get_user_inline(updated_by)) AS updater,
    (SELECT TOP 1 id FROM dbo.get_user_inline(owned_by)) AS owner,
    [name]
FROM asset

An inline table-valued function should have better performance than either a scalar function or a multistatement table-valued function.

The performance should be roughly equivalent to your original query, but any future changes can be made in the UDF, making it much more maintainable.

LukeH
It's overkill, really and makes it worse to read...
gbn
Just answering the question! I agree with you that JOINs are the best way to do this, but the OP indicated in their question that they wanted to do this with a UDF but were disappointed with the performance. They also state in their comment to your post that "join is NOT an option".
LukeH
I'm aware that join's will give the best performance, but it gives me 20+ joins all in all. This is not easy to maintain, and the performance gain (0.00x seconds to 0.0x seconds or something) isn't an issue.
devzero
UDF's used as above might seem worse to read when it's simplified like here, but when you start adding to it, it's much better. In my case I have another similar UDF that contains a tripple join, with 4 where clauses. Try repeating this 3 times inside a select and you get the point.
devzero
@devzero, Did you just accept this answer and then change your mind? Your comments seem to suggest that this is your preferred solution.
LukeH
yes, this is the preferred solution, however, this is solution is the same as the solution in the link from the answer above. And he was first. If I could I would accept both answers, as they are both correct.
devzero
A: 

am I missing something? Why can't this work? You are only selecting the id which you already have in the table.

select   created_by as creator, updated_by as updater, 
owned_by as owner, [name]
from asset

BTW in designing you really should avoid keywords like name as field names.

HLGEM
"id" is the required column and it's only in the "user" table. "user_pk" is the reference column that's present in both "asset" and "user".
LukeH
To clarify, "asset.created_by", "asset.updated_by" and "asset.owned_by" refer to "user.user_pk", and the column that's required is the corresponding "user.id".
LukeH
Also in our discussions we have come to the conclusion that "name" is not a bad field name. This is because most of the time the field name will be prefixed (ie user.name) and this is better than f.eks table_user.user_name or some such.
devzero
It is a bad field name because it is a reserved word. You really should try to avoid those as they can create unnecessary bugs when people forget, as they inevitably do, to qualify them properly.
HLGEM