Hi,

I'm building a system for updating large amounts of data through various CSV feeds. Normally I would just loop through each row in the feed, do a select query to check if the item already exists, and then insert or update the item depending on whether it exists.

I feel this method isn't very scalable and could hammer the server on larger feeds. My solution is to loop through the items as normal but store them in memory. Then, for every 100 or so items, do a select on those 100 items to get a list of the matching items that already exist in the database. Then concatenate the insert/update statements together and run them against the database. This would essentially cut down on the trips to the database.

Is this solution scalable enough, and are there any example tutorials on importing large feeds into a production environment?

Thanks

+2  A: 

One way is to load your CSV into a DataTable (or more likely a DataReader) and then batch slam in the results using SqlBulkCopy -

http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx

It's pretty efficient and you can do some column mapping. Tip - when you map columns using SqlBulkCopy, the mappings are case sensitive.

Kris Krause
A: 

Another approach would be to write a .Net stored procedure on the server to operate on the entire file...

Only if you need more control than Kris Krause's solution though - I'm a big fan of keeping it simple (and reusable) where we can...

Martin Milan
A: 

Do you need to be rolling your own here at all? Would it be possible to provide the data in such a way that SQL Server can use Bulk Import to load it in, and then deal with duplicates in the database once the import is complete?

When it comes to heavy lifting with a lot of data, my experience tends to be that working in the database as much as possible is much quicker and less resource-intensive.
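
For example, a bulk import into a holding table might look something like this (the file path, table name and options are just placeholders for illustration, not something specific to your setup):

-- Load the raw CSV straight into a staging/holding table;
-- duplicates can then be dealt with inside the database afterwards.
-- File path and table name below are illustrative only.
BULK INSERT dbo.FeedStaging
FROM 'C:\feeds\latest_feed.csv'
WITH
(
    FIELDTERMINATOR = ',',   -- CSV column separator
    ROWTERMINATOR   = '\n',  -- one record per line
    FIRSTROW        = 2,     -- skip the header row
    TABLOCK                  -- allows a minimally logged, faster load
);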

glenatron
+2  A: 

Your way is the worst possible solution. In general, you should not think in terms of looping through records individually. We used to have a company-built import tool that looped through records; it would take 18-20 hours to load a file with over a million records (something that wasn't a frequent occurrence when it was built, but which happens many times a day now).

I see two options. First, use bulk insert to load into a staging table and do whatever clean-up you need to do on that table. How are you determining if the record already exists? You should be able to build a set-based update by joining to the staging table on those fields which determine the update. Often I have added a column to my staging table for the id of the record it matches, populated that through a query, then done the update. Then you do an insert of the records which don't have a corresponding id. If you have too many records to do all at once, you may want to run in batches (which, yes, is a loop), but make the batches considerably larger than one record at a time (I usually start with 2000 and then, based on the time that takes, determine if I can do more or less in the batch).
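
A rough sketch of that pattern (the table and column names are made up for the example; the join columns are whatever identifies an existing record in your data):

-- 1. Populate the staging table's "matched id" column with one set-based query.
UPDATE s
SET    s.TargetId = t.ProductId
FROM   dbo.FeedStaging AS s
JOIN   dbo.Products    AS t ON t.ProductCode = s.ProductCode;

-- 2. Set-based update of the records that matched.
UPDATE t
SET    t.Price       = s.Price,
       t.Description = s.Description
FROM   dbo.Products    AS t
JOIN   dbo.FeedStaging AS s ON s.TargetId = t.ProductId;

-- 3. Insert the records that have no corresponding id.
INSERT dbo.Products (ProductCode, Price, Description)
SELECT s.ProductCode, s.Price, s.Description
FROM   dbo.FeedStaging AS s
WHERE  s.TargetId IS NULL;

-- If the volume is too big for a single pass, wrap steps 2 and 3 in a loop
-- that works through the staging table a few thousand rows at a time.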

I think 2008 also has a MERGE statement, but I have not yet had a chance to use it. Look it up in Books Online.

The alternative is to use SSIS, which is optimized for speed. SSIS is a complex thing, though, and the learning curve is steep.

HLGEM
Nick Kavadias
Thanks for your suggestion. The reason I loop through each item is because I need to perform some validation and formatting logic before adding it to the database. Any problems with the feed itself are then relayed back to the user. I like the idea of merging the data though, I'll look into that.
CL4NCY
You can easily do validation and formatting in a set-based fashion as well. Looping through individual records is almost always a poor choice and you should not consider doing it until all other options have been eliminated.
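
For example, a single set-based pass over the staging table can flag every problem row at once; the checks and column names here are purely illustrative:

-- Hypothetical validation pass: mark bad rows in one statement
-- instead of checking each record individually.
UPDATE s
SET    s.ValidationError =
           CASE
               WHEN s.ProductCode IS NULL OR s.ProductCode = '' THEN 'Missing product code'
               WHEN ISNUMERIC(s.Price) = 0                      THEN 'Price is not numeric'
               WHEN ISDATE(s.AvailableFrom) = 0                 THEN 'Invalid availability date'
           END
FROM   dbo.FeedStaging AS s
WHERE  s.ProductCode IS NULL OR s.ProductCode = ''
   OR  ISNUMERIC(s.Price) = 0
   OR  ISDATE(s.AvailableFrom) = 0;

-- Rows with a non-NULL ValidationError can be reported back to the user
-- and excluded from the insert/update steps.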
HLGEM
+3  A: 

Seeing that you're using SQL Server 2008, I would recommend this approach:

  • first bulkcopy your CSV files into a staging table
  • update your target table from that staging table using the MERGE command

Check out the MSDN docs and a great blog post on how to use the MERGE command.

Basically, you create a link between your actual data table and the staging table on a common criterion (e.g. a common primary key), and then you can define what to do when

  • the rows match, e.g. the row exists in both the source and the target table --> typically you'd either update some fields, or just ignore it altogether
  • the row from the source doesn't exist in the target --> typically a case for an INSERT

You would have a MERGE statement something like this:

MERGE TargetTable AS t
USING SourceTable AS src
ON t.PrimaryKey = src.PrimaryKey

WHEN NOT MATCHED THEN
  INSERT (list of columns)
  VALUES (list of values)

WHEN MATCHED THEN
  UPDATE
    SET (list of SET assignments, e.g. t.SomeColumn = src.SomeColumn)
;

Of course, the ON clause can be much more involved if needed. And of course, your WHEN statements can also be more complex, e.g.

WHEN MATCHED AND (some other condition) THEN ......

and so forth.

MERGE is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!

marc_s