I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I process this file, I often run into SQL Server timeout errors.

Here's the process I follow in my VB application that processes the data (in general):

  1. Delete all records from the temporary table, to remove last month's data (e.g. DELETE FROM tempTable)
  2. Rip the text file into the temp table
  3. Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
  4. Update the data in the real tables based on the data computed in the temp table

The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id), and these commands frequently time out. I have tried bumping the timeout up as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong in how I am going about processing this data?

Please help. Any and all suggestions welcome.

A: 

There are more efficient ways of importing large blocks of data. Look in SQL Server Books Online under bcp (the Bulk Copy Program).
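
The same idea is available from T-SQL via BULK INSERT; a minimal sketch, where the file path and delimiters are assumptions to adjust to your actual layout:

    -- Load the customer file into the staging table in one bulk operation.
    -- The path and delimiters below are hypothetical.
    BULK INSERT tempTable
    FROM 'C:\imports\customer_updates.txt'
    WITH (
        FIELDTERMINATOR = ',',   -- column delimiter in the text file
        ROWTERMINATOR = '\n',    -- row delimiter
        TABLOCK                  -- table lock allows a faster, minimally logged load
    );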

Jekke
In this case, the import is fine. What we are having trouble with is manipulating the data and adding to it once we get it into SQL Server.
cdeszaq
As I read it, it's not the copying of the data but the enriching of it that is giving timeout problems.
extraneon
+1  A: 

Needs more information. I am regularly manipulating 3-4 million rows in a 150 million row table and I do NOT consider this a lot of data. I have a "products" table that contains about 8 million entries - including full text search. No problems there either.

Can you elaborate on your hardware? I assume "normal desktop PC" or "low-end server", both with absolutely non-optimal disk layout, and thus tons of IO problems - on updates.

TomTom
+1  A: 

Are you indexing your temp table after importing the data?

tempTable.external_id should definitely have an index, since it is used in the WHERE clause.
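
A minimal sketch, using the table and column names from the question:

    -- Create the lookup index after the bulk load, so the enrichment
    -- UPDATEs can seek on external_id instead of scanning 500k rows.
    CREATE INDEX IX_tempTable_external_id ON tempTable (external_id);

    -- The join side matters too: myUsers.external_id should be indexed
    -- (it may already be, if it is a key).
    CREATE INDEX IX_myUsers_external_id ON myUsers (external_id);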

dan
There are a couple of indexes on the temp table, but they are on rather untouched fields.
cdeszaq
+5  A: 

Subqueries like the one you give us in the question:

UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) 

are only good for one row at a time, so you must be looping. Think set-based:

UPDATE t
    SET user_id = u.user_id
    FROM tempTable         t
        INNER JOIN myUsers u ON t.external_id = u.external_id

and remove your loops; this will update all rows in one statement and be significantly faster!

KM
The update statement is not being run in a loop...to update a field in all rows of the table, we are only firing off 1 command to SQL Server, as I indicated above. We did it in a loop at first, but switched because it was about an order of magnitude faster to use 1 command.
cdeszaq
+1: I was just gonna write the same query. You beat me to it.
Numenor
@cdeszaq: your query itself is like a for loop, since it has a subquery that runs for each row in tempTable.
Numenor
+1: I was about to add the same example. This row-by-row subquery is much slower than a join.
Hogan
So is it the case then that the SET clause works much like a SELECT clause, but instead of returning those fields, it updates them? And I assume the WHERE clause works the same way, in that it can filter on anything that has been joined in?
cdeszaq
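
Yes on both counts: the SET takes its values from the joined rows much as a SELECT list would return them, and the WHERE can filter on any table in the join. A minimal sketch, using a hypothetical is_active column on myUsers:

    -- Same join as above, but only rows whose matching user is active
    -- get updated. is_active is an illustrative column, not from the question.
    UPDATE t
        SET user_id = u.user_id
        FROM tempTable         t
            INNER JOIN myUsers u ON t.external_id = u.external_id
        WHERE u.is_active = 1;
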
+1  A: 

Make sure you have indexes on the tables you are doing the selects from. In your example UPDATE command, you select the user_id from the myUsers table. Do you have an index with the user_id column on the myUsers table? The downside of indexes is that they increase the time for inserts and updates. Make sure you don't have indexes on the tables you are trying to update. If the tables you are trying to update do have indexes, consider dropping them and then rebuilding them after your import.
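
A minimal sketch of that drop-and-rebuild pattern, with a hypothetical index name:

    -- Drop the index so the import does not pay index maintenance per row.
    IF EXISTS (SELECT 1 FROM sys.indexes
               WHERE name = 'IX_tempTable_group_code'
                 AND object_id = OBJECT_ID('tempTable'))
        DROP INDEX IX_tempTable_group_code ON tempTable;

    -- ... bulk load and enrichment updates run here ...

    -- Rebuild once at the end instead of maintaining the index row by row.
    CREATE INDEX IX_tempTable_group_code ON tempTable (group_code);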

Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
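
If you would rather capture the plan from a query window, one option is SHOWPLAN; a sketch, shown here with the join form of the update from KM's answer:

    -- Return the estimated execution plan as XML instead of running the batch.
    SET SHOWPLAN_XML ON;
    GO
    UPDATE t
        SET user_id = u.user_id
        FROM tempTable         t
            INNER JOIN myUsers u ON t.external_id = u.external_id;
    GO
    SET SHOWPLAN_XML OFF;
    GO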

TLiebe
+1  A: 

Look at KM's answer and don't forget about indexes and primary keys.
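
For instance, if tempTable currently has no key at all, something like this could help (a sketch; it assumes external_id is unique in the file):

    -- A clustered primary key gives the optimizer a sorted access path
    -- and avoids repeated heap scans during the enrichment updates.
    ALTER TABLE tempTable
        ADD CONSTRAINT PK_tempTable PRIMARY KEY CLUSTERED (external_id);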