views:

225

answers:

3

I have a SQL Server 2008 database with approximately 14 millions rows. In it there are two tables

Table1
    rowId int, primary key
    someData1 int
    someData2 int...

Table2
    id int, primary key
    rowId ==> int, refers to the rowId from Table1
    someCalculatedData int...

    Table2.rowId is not a foreign key, but I did make a Non-clustered, Non-Unique index on it

When I import the data I insert all the data into Table1 and then find the rows that don't have an entry in Table2 and insert data into this table.

I was doing this in batches by selecting 250,000 rows at a time, processing the data and inserting them into Table2, and then finding the next 250,000 rows and so on until they were no rows in Table1 that don't have any entry in Table2.

select TOP 250000 rowId from Table1 
       where NOT EXISTS (select rowId from Table2 where Table1.rowId=Table2.rowId)

This query has really slowed down and now it takes over two minutes to get the next batch. If I run the query without the TOP or SET ROWCOUNT keywords and get all of the rows, then the query returns the results in ~15 seconds.

Does anyone know why TOP and SET ROWCOUNT cause the query to take so much longer then getting all of the data?

Can I improve the performance of the query and still only get a subset of the data each time?

+1  A: 

This occurs because all of the query needs to run on the server before the server can decide which the "top" 250,000 rows are. Only then will the data start coming back to your client over the network.

Without the "top" statement the server will start sending data immediately, though I suspect the total time for all the data to be sent from the server and received by your machine is likely to be similar to the statement including the top.

Sonny Boy
interesting....so, is there a clever workaround?
andy
Unfortunately not. Whether it's the server working throught the 14 million rows or your client machine doing it it still has to be done.
Sonny Boy
The query seems acceptably fast both locally and remotely if I select all the rowIDs in Table1 but not in Table2. If I try to get any subset of them then it takes a lot longer.
nick
+3  A: 

see if this helps

select top 250000 t1.rowid from Table1 t1 left outer join table2 t2 on t1.rowid=t2.rowid where t2.rowid is null

ps
This a good suggestion. "Not exists" especially in a subquery can be slow. A "left join where null" will give you the same thing.
DanO
Exactly what I'd recommend!!! Depending on what indexes you have set this switch can make quite a bit of a difference.
KSimons
+1  A: 

Apparently my index or the index statistics on Table2.rowId weren't fresh and the query plan for the TOP/SET ROWCOUNT queries to perform poorly.

I reorganized/rebuilt the index and the query performance for the TOP/SET ROWCOUNT queries was greatly improved.

nick
If anyone can expound on the nuances of what was going on and/or how to diagnose/more quickly identify these issues please share :)
nick