I am reading and validating large fixed-width text files (ranging from 10-50K lines) that are submitted via our ASP.net website (coded in VB.Net). I do an initial scan of the file to check for basic issues (line length, etc.). Then I import each row into an MS SQL table. Each DB row basically consists of a record_ID (primary key, auto-incrementing) and about 50 varchar fields.

After the insert is done, I run a validation function on the file that checks each field in each row based on a bunch of criteria (trimmed length, isnumeric, range checks, etc.). If it finds an error in any field, it inserts a record into the Errors table, which has an error_ID, the record_ID and an error message. In addition, if the field fails in a particular way, I have to do a "reset" on that field. A reset might consist of blanking the entire field, or simply replacing the value with another value (e.g. replacing the string with a new one that has all illegal chars taken out).

I have a 5,000 line test file. The upload, initial check, and import take about 5-6 seconds. The detailed error check and insert into the Errors table takes about 5-8 seconds (this file has about 1200 errors in it). However, the "resets" part takes about 40-45 seconds for 750 fields that need to be reset. When I comment out the resets function (returning immediately without actually calling the UPDATE stored proc), the process is very fast. With the resets turned on, the page takes 50 seconds to return.

My UPDATE stored proc is using some recommended code from http://sommarskog.se/dynamic_sql.html, whereby it uses CASE instead of dynamic SQL:

UPDATE dbo.Records
SET    dbo.Records.file_ID = CASE @field_name WHEN 'file_ID' THEN @field_value ELSE file_ID END,
.
. (all 50 varchar field CASE statements here)
.
WHERE dbo.Records.record_ID = @record_ID

Is there any way I can improve performance here? Can I somehow group all of these UPDATE calls into a single transaction? Should I be reworking the UPDATE query somehow? Or is it just the sheer quantity of 750+ UPDATEs, and things are simply going to be slow (it's a quad-proc server with 8 GB of RAM)?

Any suggestions appreciated.

+2  A: 

Don't do this in SQL; fix the data up in code, then do your updates.

If you have SQL 2008, then look into table-valued parameters. They enable you to pass an entire table as a parameter to a sproc. From there you just have the one insert/update or merge statement.
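
For illustration, here is a rough sketch of what the TVP route could look like from the VB.Net side. The table type name (dbo.FieldResetType), proc name (dbo.ResetFields) and connection string are made-up placeholders; the proc itself would apply all of the resets in one set-based UPDATE joined to the TVP:

' Imports System.Data
' Imports System.Data.SqlClient
Dim resets As New DataTable()
resets.Columns.Add("record_ID", GetType(Integer))
resets.Columns.Add("field_name", GetType(String))
resets.Columns.Add("field_value", GetType(String))

' ... add one row per field that needs a reset ...
resets.Rows.Add(12345, "file_ID", "cleaned value")

Using conn As New SqlConnection(connectionString)
    Using cmd As New SqlCommand("dbo.ResetFields", conn)
        cmd.CommandType = CommandType.StoredProcedure
        Dim p As SqlParameter = cmd.Parameters.AddWithValue("@resets", resets)
        p.SqlDbType = SqlDbType.Structured
        p.TypeName = "dbo.FieldResetType"   ' user-defined table type with the same three columns
        conn.Open()
        cmd.ExecuteNonQuery()               ' one round trip for all of the resets
    End Using
End Using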

Chris Lively
I am doing all of the checking and "fixing" in code. Then I have a resetField(record_ID, field_name, field_value) function which makes the call to the stored proc to send the fixed data. But it's called once per field that needs fixing... I think that is causing the slowdown. I will look into the Table-Valued Parameters option... it looks like a possibility (we have 2008).
Bryan Lewis
I ended up using SqlBulkCopy for the inserts and then using TVP for the bulk update. The speed increase was massive. That 40-50 seconds of processing time (for the updates) is now down to about 5-7 seconds. Thanks all.
Bryan Lewis
A: 

I would try changing the recovery model to simple and look at my indexes. Kimberly Tripp did a session showing a scenario with improved performance using a heap.

Jason Love
A: 

I believe you are doing 50 case statements on every update. Sounds like that would be slow.

It is possible to solve this problem with injection-proof code via parameterized queries and a table of string constants.

Quick and dirty example code.

string[] queryList = { "UPDATE records SET col1 = @val WHERE ID = @key",
                       "UPDATE records SET col2 = @val WHERE ID = @key",
                       "UPDATE records SET col3 = @val WHERE ID = @key",
                       // ... one entry per column, up to col50 ...
                       "UPDATE records SET col50 = @val WHERE ID = @key" };

Then in your call to SQL you just pick the item in the array corresponding to the column you want to update and set the value and key parameters.
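
On the calling side it might look roughly like this (a VB.Net sketch; it assumes an open SqlConnection named conn, and that colIndex, newValue and recordId come from your validation code):

' Requires System.Data.SqlClient
Using cmd As New SqlCommand(queryList(colIndex), conn)
    cmd.Parameters.AddWithValue("@val", newValue)
    cmd.Parameters.AddWithValue("@key", recordId)
    cmd.ExecuteNonQuery()
End Using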

I'm guessing you will see a significant improvement... let me know how it goes.

Hogan
NB: The reason I'm not building the queries dynamically is that doing it this way guarantees you are injection-proof.
Hogan
I don't think the CASE statement is really the issue. Before that, I was using dynamic SQL and it was equally slow: SET @sql = 'UPDATE dbo.Records ' + 'SET ' + @field_name + ' = ''' + @field_value + '''' + ' WHERE dbo.Records.record_ID = ' + @record_ID; EXEC (@sql)
Bryan Lewis
A: 

Um. Why are you inserting numeric data into VARCHAR fields then trying to run numeric checks on it? This is yucky.

Apply correct data typing and constraints to your table, do the INSERT, and see if it failed. SQL Server will happily report errors back to you.
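
For what it's worth, the catch side of that could be as simple as this (a VB.Net sketch; insertCmd is assumed to be a parameterized INSERT against a properly typed and constrained table, and LogError is a made-up helper that writes to your Errors table):

Try
    insertCmd.ExecuteNonQuery()
Catch ex As SqlException
    ' Type/constraint violations surface here instead of needing a second validation pass.
    For Each sqlErr As SqlError In ex.Errors
        LogError(recordId, sqlErr.Message)
    Next
End Try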

Matt Rogish
I actually have no control over the datatable layout. I have to use a pre-determined fixed width file layout and DB table design. The INSERT is NOT the problem. The system can loop through the data, do the validation and inserts very quickly. Only when I do my second pass of validation and have to do some UPDATEs do things get slow.
Bryan Lewis
+1  A: 

If you're looping through the lines and doing individual updates/inserts, this can be really expensive... Consider using SqlBulkCopy, which can speed up all your inserts. Similarly, you can create a DataSet, make your updates on the dataset and then submit them all in one shot through a SqlDataAdapter.
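
The two pieces might look roughly like this (a VB.Net sketch; recordsTable, fileId and the open SqlConnection conn are assumptions standing in for your existing code):

' Imports System.Data
' Imports System.Data.SqlClient

' Bulk insert: load the parsed file into a DataTable whose columns match dbo.Records.
Using bulk As New SqlBulkCopy(conn)
    bulk.DestinationTableName = "dbo.Records"
    bulk.WriteToServer(recordsTable)
End Using

' Batched updates: fill a DataTable, apply the "resets" to the rows in memory,
' then push all the changes back with a single Update() call.
Dim adapter As New SqlDataAdapter("SELECT * FROM dbo.Records WHERE file_ID = @file_ID", conn)
adapter.SelectCommand.Parameters.AddWithValue("@file_ID", fileId)
Dim builder As New SqlCommandBuilder(adapter)   ' generates the UPDATE command from the SELECT

Dim dt As New DataTable()
adapter.Fill(dt)
' ... modify the fields that need resetting here ...
adapter.Update(dt)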

JoeGeeky
I have changed my insert code over to use SqlBulkCopy and it's working great. Now I just need to figure out the updates. Trying to compare the SqlDataAdapter option versus the Table-Valued Parameters option that Chris mentioned above.
Bryan Lewis