views:

608

answers:

2

I'm planning a data warehouse migration to SQL Server 2008, and trying to think of ways to replicate the LAG, LEAD, FIRST_VALUE and LAST_VALUE analytic functions from Oracle in SQL Server 2008. They are not included in SQL Server 2008, although the basic machinery for windowed analytic functions is (e.g. ROW_NUMBER, RANK and DENSE_RANK are all present).

For these functions, it is possible to achieve the same function by creating a subquery that assigns each row a number using ROW_NUMBER and then doing self-joins of that query to find related rows with nearby row numbers (for LAG and LEAD), or a row number of 1 (for FIRST_VALUE).

I expect that doing the self-joins would detract from the efficiency of an operation: but I don't yet have a SQL Server to test this. So without having actually assessed the performance, I'm wondering whether there is a better workaround that avoids the self-joins.

Looking at the documentation for user-defined aggregate functions, it is conceivable that the same code structure could be used to provide user-defined analytic functions.

So my question is: can you add an OVER() clause after a user-defined aggregate function to have it called as an analytic function?

If so, is the Terminate() method called once per row? Is there anything special needed to ensure that rows are sent to your UDF in the ordering specified in the OVER() clause?

+1  A: 

I'd use self-joins not udfs.

You are looking at scalar UDFS that use table access which almost always gives bad performance (it's a cursor). Otherwise, you could probably use APPLY, but this also is row by row.

Also, the Oracle functions are not aggregate functions. A user defined aggregate would still have to do the same processing over the result set.

Remember, internally Oracle would still have do some row by row processing to work out the values anyway.

So, SQL Server 2005+ example for FIRST_VALUE (not tested) using self join.

Note the cross join to decouple the FIRST_VALUE and the rest 2 because the result sets have no relation. If you used a UDF or user defined agg, then most likely you'd have to calculate FIRST_VALUE over and over per row from the 1st result set.

;WITH CTE AS
(
    SELECT
        department_id, last_name, salary,
        ROW_NUMBER() OVER (ORDER BY salary) AS ranking
    FROM employees
    WHERE department_id = 90
)
SELECT
    c1.department_id, c1.last_name, c1.salary,
    c2.last_name as Poorest
FROM
    CTE c1
    CROSS JOIN
    (SELECT last_name FROM CTE WHERE Ranking = 1) c2
ORDER BY
    c1.employee_id
gbn
Fair enough: I was hoping from the design of the Aggregate UDF examples that there might be more options available to the optimiser that could result in better performance than a straight cursor. Oracle presumably does do row-by-row processing for the analytic functions, but it is able to do this in a single pass over the data (following a sort), which is often better than the equivalent self-join.Do you think the UDFs gain any advantage if there are multiple functions in each row, which would lead to many self-joins?
William Rose
It may be that you can service multiple outputs in one self join. In my example, you could ass an extra CTE column for salary DESC and do LAST_VALUE too. Multiple udfs would be worse that a single UDF.
gbn
+1  A: 

In SQL server, analytic is part of SSAS; you will find FirstNonEmpty, LastNonEmpty, FirstChild, LastChild there. It is included with standard and enterprise versions of SQL server; see here. That is, if you wish to build cubes.

Damir Sudarevic