Hi All,

I'm selecting some rows from a table-valued function, but have found an inexplicably massive performance difference when I add TOP to the query.

SELECT  col1, col2, col3 -- etc.
FROM    dbo.some_table_function()
WHERE   col1 = @parameter

is taking upwards of 5 or 6 mins to complete.

However

SELECT  TOP 6000 col1, col2, col3 -- etc.
FROM    dbo.some_table_function()
WHERE   col1 = @parameter

completes in about 4 or 5 seconds.

This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rows out of 200,000.

So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)? Has anyone else witnessed something like this?

Thanks

+3  A: 

Table-valued functions can have a non-linear execution time.

Let's consider a function equivalent to this query:

SELECT  (
        SELECT  SUM(mi.value)
        FROM    mytable mi
        WHERE   mi.id <= mo.id
        )
FROM    mytable mo
ORDER BY
        mo.id

This query (which calculates a running SUM) is fast at the beginning and slow at the end, because for each row from mo it must sum all the preceding values, which requires rewinding the row source.

The time taken to calculate the SUM for each row grows as the row number increases.

If you make mytable large enough (say, 100,000 rows, the same order of magnitude as in your example) and run this query, you will see that it takes considerable time.

However, if you apply TOP 5000 to this query, you will see that it completes in much less than 1/20 of the time required for the full table.
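
As a rough back-of-the-envelope check of that claim, here is the arithmetic under simplifying assumptions that are mine, not measurements: rows are processed in id order, ids run densely from 1 to N, and the inner SUM for row k reads about k rows.

```latex
% Work needed to produce the first m rows (assumption: row k costs about k reads)
\[
\mathrm{cost}(m) \approx \sum_{k=1}^{m} k = \frac{m(m+1)}{2}
\]
% TOP 5000 compared with the full 100,000-row table
\[
\frac{\mathrm{cost}(5000)}{\mathrm{cost}(100000)}
  = \frac{5000 \cdot 5001}{100000 \cdot 100001}
  \approx \frac{1}{400}
\]
```

So under these assumptions the first 5% of the rows account for roughly 0.25% of the work, which is why the TOP version finishes far sooner than a linear estimate would suggest.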

Most probably, something similar happens in your case too.

To say something more definitely, I need to see the function definition.

Update:

SQL Server can push predicates into the function.

For instance, I just created this TVF:

CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN  (
        SELECT  *
        FROM    master
        );

These queries:

SELECT  *
FROM    fn_test()
WHERE   name = @name

SELECT  TOP 1000 *
FROM    fn_test()
WHERE   name = @name

yield different execution plans: the first one uses a clustered index scan, the second one uses an index seek with a TOP.

Quassnoi
'Fraid not in this case. The point of my query is that the _same_ rows are returned regardless of whether the TOP clause is used or not (TOP 6000 being bigger than the result set). It therefore can't be to do with the calculation of those rows themselves.
Arj
@Arj: could you please post your function definition?
Quassnoi
@Quassnoi: the inline TVF is simply a macro.
gbn
A: 

It's not necessarily true that the whole table is processed if col1 has an index.

The SQL optimizer will choose whether or not to use an index. Perhaps your "TOP" is pushing it towards using the index.

If you are using the MSSQL Query Analyzer (if I have the name right), press Ctrl-L. This shows the estimated execution plan for the query instead of executing it (Ctrl-K, by contrast, toggles display of the actual plan when the query is run). Mousing over the icons will show the estimated IO/CPU cost, I believe.

I bet one is using an index seek, while the other isn't.

If you have a generic client:

SET SHOWPLAN_ALL ON;
GO
select ...;
GO

see http://msdn.microsoft.com/en-us/library/ms187735.aspx for details.
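
An estimated plan only shows what the optimizer expects. To compare what the two queries actually do at run time, something along these lines may help. This is a sketch, not a tested script: it assumes the function takes no arguments and that @parameter is an int, neither of which is stated in the question, and the value 1 is a placeholder.

```sql
-- Assumptions (not from the question): @parameter is an int, and
-- dbo.some_table_function takes no arguments.
DECLARE @parameter int;
SET @parameter = 1;          -- placeholder value

SET STATISTICS IO ON;        -- report logical/physical reads per table
SET STATISTICS TIME ON;      -- report CPU and elapsed time per statement

-- Query without TOP
SELECT  col1, col2, col3
FROM    dbo.some_table_function()
WHERE   col1 = @parameter;

-- Same query with TOP
SELECT  TOP 6000 col1, col2, col3
FROM    dbo.some_table_function()
WHERE   col1 = @parameter;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If the two statements report very different read counts, they are almost certainly using different plans (for example a scan versus a seek).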

ericp
Yeah - I'm having a look at the plan right now. Though I've altered the query for posting. In reality it's doing SELECT *. I can't see how using TOP would prompt an index use?
Arj
The SQL optimizer will decide whether or not to use an index. I've written queries where the WHERE clause hits a "tipping point" at which the optimizer decides to do a full table scan instead of using an index.
ericp
+1  A: 

You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?

In any case, the best way to satisfy your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Server Management Studio, and it'll tell you EXACTLY which operations are being performed and the estimated cost of each.

All SQL implementations are quirky in their own way, and SQL Server's no exception. These kinds of "whaaaaaa?!" moments are pretty common. ;^)

Jim Davis
+1  A: 

Your TOP has no ORDER BY, so it's effectively the same as issuing SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and it would take a lot longer.

If dbo.some_table_function is an inline table-valued UDF, then it's simply a macro that's expanded, so it returns the first 6000 rows, as mentioned, in no particular order.

If the UDF is a multi-statement one, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening.
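
To make the distinction concrete, here is a minimal sketch of the two kinds of TVF. The names dbo.fn_inline_example, dbo.fn_multi_example and dbo.some_table, and the int column types, are hypothetical and only for illustration; the real function definition was not posted.

```sql
-- Inline TVF: a single RETURN (SELECT ...). The optimizer expands it into
-- the calling query, so an outer WHERE or TOP can influence the plan.
CREATE FUNCTION dbo.fn_inline_example()
RETURNS TABLE
AS
RETURN
(
    SELECT  col1, col2, col3
    FROM    dbo.some_table      -- hypothetical base table
);
GO

-- Multi-statement TVF: fills a table variable inside BEGIN ... END.
-- The result is built in full before the caller's WHERE or TOP is applied.
CREATE FUNCTION dbo.fn_multi_example()
RETURNS @result TABLE (col1 int, col2 int, col3 int)
AS
BEGIN
    INSERT INTO @result (col1, col2, col3)
    SELECT  col1, col2, col3
    FROM    dbo.some_table;     -- hypothetical base table
    RETURN;
END;
GO
```

Checking which of these two shapes the real function has is a quick way to tell whether the predicate can be pushed down at all.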

Not directly related, but another SO question on TVFs

gbn
+1  A: 

I think Quassnoi's suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows is going to be returned. The optimizer then uses an index seek instead of a clustered index scan or table scan.

Another possible explanation could be caching, as Jim Davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.
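
One way to run that check from a cold start is sketched below. The DBCC commands clear caches for the whole instance, so this is only suitable for a test server. It assumes @parameter is an int and that the function takes no arguments, neither of which is given in the question.

```sql
-- Assumption: @parameter is an int; the value is a placeholder.
DECLARE @parameter int;
SET @parameter = 1;

-- Test servers only: write dirty pages to disk, then empty the
-- buffer pool and the plan cache so both queries start cold.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;

-- Run the TOP version first, on a cold cache
SELECT  TOP 6000 col1, col2, col3
FROM    dbo.some_table_function()
WHERE   col1 = @parameter;

-- Then the version without TOP, which now benefits from a warm cache
SELECT  col1, col2, col3
FROM    dbo.some_table_function()
WHERE   col1 = @parameter;
```

If the TOP query is still fast when cold and the plain query is still slow when warm, caching is not the explanation.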

Sven Olausson