My system does some pretty heavy processing, and I've been attacking the performance in order to give me the ability to run more test runs in shorter times.

I have quite a few cases where a UDF has to get called on, say, 5 million rows (and I pretty much thought there was no way around it).

Well, it turns out there is a way to work around it, and it gives huge performance improvements whenever the set of distinct parameters the UDF is called with is significantly smaller than the total set of rows.

Consider a UDF that takes a set of inputs and returns a result based on complex logic. Over 5m rows there might be only, say, 100,000 distinct inputs, so it can only produce 100,000 distinct result tuples (my particular cases vary from interest rates to complex code assignments, but they are all discrete - the fundamental point with this technique is that you can determine whether the trick will work simply by running the SELECT DISTINCT).
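
For instance, a quick count of the distinct combinations - using the same big_table and parameter names as the example below - shows how far the set collapses:

SELECT COUNT(*) AS distinct_combinations
FROM (
    SELECT DISTINCT param1, param2 FROM big_table
) AS d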

I found that by doing something like this:

INSERT INTO PreCalcs
SELECT param1
       ,param2
       ,dbo.udf_result(param1, param2) AS result
FROM (
    SELECT DISTINCT param1, param2 FROM big_table
) AS d
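
The PreCalcs staging table itself can be as simple as this - a minimal sketch, with illustrative column types, plus a unique index on the parameter columns so the join back is cheap:

CREATE TABLE PreCalcs (
    param1 int NULL,
    param2 int NULL,
    result decimal(18, 6) NULL
)

CREATE UNIQUE CLUSTERED INDEX IX_PreCalcs_params
    ON PreCalcs (param1, param2)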

With PreCalcs suitably indexed, you can then replace the per-row UDF call with:

SELECT big_table.param1
    ,big_table.param2
    ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON PreCalcs.param1 = big_table.param1
    AND PreCalcs.param2 = big_table.param2

You get a HUGE boost in performance. Apparently, just because a function is deterministic, it doesn't mean SQL Server caches the results of past calls and re-uses them, as one might think.

The only thing you have to watch out for is where NULLs are allowed; then you need to fix up your joins carefully:

SELECT big_table.param1
    ,big_table.param2
    ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON (
        PreCalcs.param1 = big_table.param1
        OR COALESCE(PreCalcs.param1, big_table.param1) IS NULL
    )
    AND (
        PreCalcs.param2 = big_table.param2
        OR COALESCE(PreCalcs.param2, big_table.param2) IS NULL
    )

Hope this helps - any similar tricks with UDFs, or ways of refactoring queries for performance, are welcome.

I guess the question is, why is manual caching like this necessary - isn't that the point of the server knowing that the function is deterministic? And if it makes such a big difference, and if UDFs are so expensive, why doesn't the optimizer just do it in the execution plan?

+1  A: 

How would SQL Server know that you have 100,000 discrete combinations within 5 million rows?

By using the PreCalcs table, you are simply running the udf over 100k rows rather than 5 million rows, before expanding back out again.

No optimiser in existence would be able to divine this useful information. The scalar udf is a black box.

For a more practical solution, I'd use a computed, persisted column that does the udf call, so it's available in all queries and can be indexed/included.
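
For example, something along these lines, using the function and column names from the question (the udf has to be schema-bound and deterministic for PERSISTED to be accepted):

ALTER TABLE big_table
    ADD result AS dbo.udf_result(param1, param2) PERSISTED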

This suits OLTP more, maybe... I query a table to get trading cash and positions in real time in many different ways, so this approach suits me because it avoids the udf math overhead every time.

gbn
Well, it certainly does know, since it's called the UDF 5m times. What is the point of the database knowing that a UDF is deterministic if that is never used to avoid calling it over and over again with the same parameters?
Cade Roux
Now I can understand if it only caches a certain number of results and there are cache misses, but it appears to do, basically, nothing.
Cade Roux
It does not cache. Deterministic means same output for same inputs (close enough), but row to row (5 million) this is not persisted. The optimiser does not keep this information.
gbn
Also, you say it gives the same output for same inputs. Does SQL know that? SELECT OBJECTPROPERTYEX(OBJECT_ID('udf_result'), 'IsDeterministic')
gbn
Yes, all my UDFs are deterministic - but apparently that information is not used for anything.
Cade Roux
I have a System_Health SP which warns me about heap tables, non-deterministic UDFs, unused indexes, etc.
Cade Roux
It is: for indexed views and indexed computed columns... which is what I'd do here.
gbn
The problem is I can't. A lot of this calculation work should have been done in the ETL and stored in the DW, but all I have is views to a DW I'm not allowed to change. I'm stuck in another database, so no indexed views possible for me.
Cade Roux
We're mostly OLTP so it suits. But the nested/staging approach works so sometimes that's what you have to do...
gbn
+2  A: 

Yes, the caching has to be done manually - the optimizer will not memoize UDF calls for you. Your trick is very nice in the cases where you can collapse the output set down in this way.

Another technique that can improve performance, if your UDF's parameters are indices into other tables and the UDF selects values from those tables to calculate the scalar result, is to rewrite your scalar UDF as a table-valued UDF that selects the result value over all your potential parameters.

I've used this approach when the tables we based the UDF query on were subject to a lot of inserts and updates, the query involved was relatively complex, and the number of rows the original UDF had to be applied to was large. You can achieve some great improvement in performance in this case, as the table-valued UDF only needs to be run once and can run as an optimized set-oriented query.
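
A rough sketch of that shape, with hypothetical names (lookup_table, base_rate and adjustment stand in for whatever tables and logic the original scalar udf used):

CREATE FUNCTION dbo.udf_result_set ()
RETURNS TABLE
AS
RETURN
    SELECT l.param1
        ,l.param2
        ,l.base_rate * l.adjustment AS result   -- stand-in for the real calculation
    FROM lookup_table l
GO

-- then join it in once, instead of calling the scalar udf per row
SELECT b.param1
    ,b.param2
    ,r.result
FROM big_table b
INNER JOIN dbo.udf_result_set() r
    ON r.param1 = b.param1
    AND r.param2 = b.param2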

mwigdahl
Yes, unfortunately, "all the potential parameters" is the problem. I'm trying to re-write some of this code to be natively table-driven, but in other cases the original logic is proving very resistant to refactoring. I'm finding the UDFs perform so poorly that they're close to useless.
Cade Roux