views:

301

answers:

4

We generate a lot of SQL procedurally and SQL Server is killing us. Because of some issues documented elsewhere we basically do SELECT TOP 2 ** 32 instead of TOP 100 PERCENT.

Note: we must use the subqueries.

Here's our query:

SELECT * FROM ( 
    SELECT [me].*, ROW_NUMBER() OVER( ORDER BY (SELECT(1)) ) 
    AS rno__row__index FROM (
       SELECT [me].[id], [me].[status] FROM (
          SELECT TOP 4294967296 [me].[id], [me].[status] FROM 
          [PurchaseOrders] [me] 
          LEFT JOIN [POLineItems] [line_items] 
          ON [line_items].[id] = [me].[id] 
          WHERE ( [line_items].[part_id] = ? ) 
          ORDER BY [me].[id] ASC
       ) [me]
   ) [me] 
) rno_subq 
WHERE rno__row__index BETWEEN 1 AND 25

Are there better ways to do this that anyone can see?

UPDATE: here is some clarification on the whole subquery issue:

The key word of my question is "procedurally". I need the ability to reliably encapsulate resultsets so that they can be stacked together like building blocks. For example I want to get the first 10 cds ordered by the name of the artist who produced them and also get the related artist for each cd.. What I do is assemble a monolithic subselect representing the cds ordered by the joined artist names, then apply a limit to it, and then join the nested subselects to the artist table and only then execute the resulting query. The isolation is necessary because the code that requests the ordered cds is unrelated and oblivious to the code selecting the top 10 cds which in turn is unrelated and oblivious to the code that requests the related artists.

Now you may say that I could move the inner ORDER BY into the OVER() clause, but then I break the encapsulation, as I would have to SELECT the columns of the joined table, so I can order by them later. An additional problem would be the merging of two tables under one alias; if I have identically named columns in both tables, the select me.* would stop right there with an ambiguous column name error.

I am willing to sacrifice a bit of the optimizer performance, but the 2**32 seems like too much of a hack to me. So I am looking for middle ground.

+1  A: 

Use:

SELECT x.*
  FROM (SELECT po.id, 
               po.status,
               ROW_NUMBER() OVER( ORDER BY po.id) AS rno__row__index
          FROM [PurchaseOrders] po
          JOIN [POLineItems] li ON li.id = po.id
         WHERE li.pat_id = ?) x
 WHERE x.rno__row__index BETWEEN 1 AND 25
ORDER BY po.id ASC

Unless you've omitted details in order to simplify the example, there's no need for all your subqueries in what you provided.

OMG Ponies
The OP does say "we *must* use the subqueries" although it's not clear why.
LukeH
In order to reference ROW_NUMBER() output, that's correct - which is preserved in the answer.
OMG Ponies
There's also no reason I see to have the TOP clause, just move the ORDER BY to the outer query. I think you're actually messing up the query optimizer by adding it to the inner query - there's a reason they don't let you specify ORDER in subqueries by default.
Chris Haas
@OMG Ponies: If that's the case then your query does the job, although I'd drop the `TOP`/`ORDER BY` from the subquery and just add an `ORDER BY x.rno__row__index` to the outermost query. But as I said, the OP isn't clear on *why* the subqueries are required; maybe because they're autogenerated in some crazy way.
LukeH
@Luke: You *will* have a single subquery at minimum in order to use pagination with the row_number function call.
OMG Ponies
Re: TOP use - on reflection, I agree - it's unnecessary.
OMG Ponies
@OMG Ponies: I know that a subquery is required in order to use `ROW_NUMBER()` for pagination, but maybe there are also *other* reasons why the OP requires them (difficult to see what those reasons might be though).
LukeH
See clarification in post.
Frew
+3  A: 
  • If you want top rows by me.id, just ask for that in the ROW_NUMBER's ORDER BY. Don't chase your tail around subqueries and TOP.
  • If you have a WHERE clause on a joined table field, you can have an outer JOIN. All the outer fields will be NULL and filtered out by the WHERE, so is effectively an inner join.

.

WITH cteRowNumbered AS (
  SELECT [me].id, [me].status 
  ROW_NUMBER() OVER (ORDER BY me.id ASC) AS rno__row__index 
  FROM [PurchaseOrders] [me]           
  JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]          
  WHERE [line_items].[part_id] = ?)
SELECT me.id, me.status 
FROM cteRowNumbered
WHERE rno__row__index BETWEEN 1 and 25

I use CTEs instead of subqueries just because I find them more readable.

Remus Rusanu
LEFT JOIN vs. inner join
Remus Rusanu
@Remus: Thanks - in my haste, I didn't correct the join. Still, given the where clause null line_item rows would be filtered out.
OMG Ponies
@Ponies: yes, that was exactly my point. I wanted to call out that a WHERE clause and a LEFT JOIN just don't mix, they cancel each other out to make an ordinary inner JOIN.
Remus Rusanu
As for why I posted a query almost identical to yours, w/o reading yours first and perhaps just adding a comment... well step 1 is admition: "Hello, my name is Remus and I'm a SO rep addict"
Remus Rusanu
Note that although this returns the first 25 rows (according to `PurchaseOrders.id`), it doesn't guarantee the order in which those rows will appear in the resultset. This is "stronger" than the code given in the question, which doesn't even guarantee which 25 rows will be returned. (The only guarantee in the OP's code is that the 25 rows will be selected from somewhere in the top 4294967296.)
LukeH
A: 

Your update makes things much clearer. I think that the approach which you're using is seriously flawed. While it's nice to be able to have encapsulated, reusable code in your applications, front-end applications are a much different animal than a database. They typically deal with small structures and small, discrete process that run against those structures. Databases on the other hand often deal with tables that are measured in the millions of rows and sometimes more than that. Using the same methodologies will often result in code that simply performs so badly as to be unusable. Even if it works now, it's very likely that it won't scale and will cause major problems down the road.

Best of luck to you, but I don't think that this approach will end well in all but the smallest of databases.

Tom H.
I'm gonna vote this down because it doesn't really help our question, but you can feel free to vote my flawed question down too :-)
Frew
I'm not too worried about helping you. I'm hoping it helps the people who see this question so that they don't fall into the same trap as you have. That's worth the -2 rep to me, so guess I'll just have to live with it."A fool despises good counsel, but a wise man takes it to heart." - Confucius
Tom H.
+1  A: 

Kudos to the only person who saw through naysaying and actually tried the query on a large table we do not have access to. To all the rest saying this simply will not work (will return random rows) - we know what the manual says, and we know it is a hack - this is why we ask the question in the first place. However outright dismissing a query without even trying it is rather shallow. Can someone provide us with a real example (with preceeding CREATE/INSERT statements) demonstrating the above query malfunctioning?

Peter Rabbitson