views: 1666 · answers: 10

Hi,

I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?

I've created a simple Repository method to save a record:

    public void SaveOffer(Offer offer)
    {
        Offer dbOffer = this.db.Offers.SingleOrDefault(
            o => o.offer_id == offer.offer_id);

        // add new offer
        if (dbOffer == null)
        {
            this.db.Offers.InsertOnSubmit(offer);
        }
        //update existing offer
        else
        {
            dbOffer = offer;
        }

        this.db.SubmitChanges();
    }

But using this method, the program is much slower than inserting the data using ADO.NET SQL inserts (new SqlConnection, new SqlCommand for the existence select, new SqlCommand for the update/insert).

On 100k CSV rows it takes about an hour, versus a minute or so the ADO.NET way. For 2M CSV rows, ADO.NET took about 20 minutes; LINQ added about 30k of those 2M rows in 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were run with all tables empty.

P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.

Updates/Edits: After 18 hours, the LINQ version had added only ~200K rows.

I've also tested the import with LINQ inserts only, and it's still really slow compared with ADO.NET. I haven't seen a big difference between inserts/SubmitChanges alone and selects/updates/inserts/SubmitChanges.

I still have to try batched commits, manually connecting to the db, and compiled queries.

+6  A: 

Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?

Transactions help throughput by reducing the need for fsyncs, and delaying db.SubmitChanges reduces the number of .NET<->db round trips.
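A rough sketch of both ideas combined (assuming the same Offer entity and generated data context as in the question; the batch size of 500 is arbitrary):

    // Sketch: one transaction for the whole import, one SubmitChanges per
    // batch instead of per row. Requires a reference to System.Transactions.
    public void SaveOffers(IEnumerable<Offer> offers)
    {
        const int batchSize = 500; // arbitrary; tune for your data

        using (var scope = new System.Transactions.TransactionScope())
        {
            int pending = 0;
            foreach (var offer in offers)
            {
                this.db.Offers.InsertOnSubmit(offer);
                if (++pending == batchSize)
                {
                    this.db.SubmitChanges(); // one round trip per batch
                    pending = 0;
                }
            }
            if (pending > 0)
                this.db.SubmitChanges();

            scope.Complete(); // commit everything once at the end
        }
    }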

Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.

Eamon Nerbonne
I haven't tried anything yet; I'm trying to find out what the options are for speeding this up...
Marius Ursache
I'm about to try something similar for a project I'm doing. What worked wonders for me (though this was SQLite) was using triggers that rewrote inserts into updates when the inserted row already existed; that way you avoid checking whether the to-be-inserted row already exists in the database.
Eamon Nerbonne
Good suggestion. Maybe have your method accept an IEnumerable<Offer> to further enforce batch processing?
ph0enix
+3  A: 

Why not pass an Offer[] into that method and do all the changes in memory before submitting them to the database? Or you could submit in groups so you don't run out of cache. The main factor is how long until you send over the data; the biggest time waster is in the closing and opening of the connection.

Yuriy Faktorovich
+3  A: 

I wonder if you're suffering from an overly large set of data accumulating in the data-context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and for "misses" I would expect to see a second hit when the entity is materialized).

I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).

I would try ditching the data-context (submit-changes and replace with an empty one) every n operations for some n - maybe 250 or something.


Given that you're calling SubmitChanges per instance at the moment, you may also be wasting a lot of time checking the delta, which is pointless if you've only changed one row. Only call SubmitChanges in batches, not per record.
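A rough sketch of the context-recycling idea (CreateContext() is a hypothetical factory returning your generated DataContext subclass; n = 250 as suggested):

    // Sketch: flush and replace the DataContext every n records so the
    // internal identity cache never grows large.
    int count = 0;
    var db = CreateContext(); // hypothetical factory method
    foreach (var offer in offers)
    {
        db.Offers.InsertOnSubmit(offer);
        if (++count % 250 == 0)
        {
            db.SubmitChanges();
            db.Dispose();
            db = CreateContext(); // fresh context, empty identity cache
        }
    }
    db.SubmitChanges(); // flush the remainder
    db.Dispose();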

Marc Gravell
This is the most likely problem. Look for high cpu-usage.
usr
Yeap- batching is a good idea.
RichardOD
A: 

I suspect it isn't the insert or update operations that are taking a long time, but rather the code that determines whether your offer already exists:

    Offer dbOffer = this.db.Offers.SingleOrDefault(
        o => o.offer_id == offer.offer_id);

If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help prove me right or wrong.
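For instance, a hedged sketch that times the lookup separately from the write (the accumulator names are illustrative):

    // Sketch: accumulate lookup time vs. write time across the import.
    long lookupMs = 0, writeMs = 0;
    var sw = new System.Diagnostics.Stopwatch();

    foreach (var offer in offers)
    {
        sw.Reset(); sw.Start();
        Offer dbOffer = this.db.Offers.SingleOrDefault(
            o => o.offer_id == offer.offer_id);
        sw.Stop();
        lookupMs += sw.ElapsedMilliseconds;

        sw.Reset(); sw.Start();
        if (dbOffer == null)
            this.db.Offers.InsertOnSubmit(offer);
        this.db.SubmitChanges();
        sw.Stop();
        writeMs += sw.ElapsedMilliseconds;
    }
    Console.WriteLine("lookups: {0} ms, writes: {1} ms", lookupMs, writeMs);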

Usually, when not using LINQ to SQL, you would have an insert/update procedure or SQL script that determines whether the record you pass already exists. You're doing this expensive operation in LINQ, which can never hope to match the speed of native SQL (which is what happens when you use a SqlCommand and select whether the record exists) looking up a primary key.

Paul Suart
ADO probably has the same problem. Although your comment is still helpful.
usr
Ahh, but the ADO method is *massively* quicker because you're asking the database to do what it's best at. Iterating through an IEnumerable to see if an item with the same id exists will be much slower.
Paul Suart
A: 

Well, you must understand that LINQ creates code dynamically for all the ADO operations you do, instead of using handwritten code, so it will always take more time than your manual code. It's simply an easy way to write code, but if you want to talk about performance, ADO.NET code will always be faster, depending on how you write it.

I don't know whether LINQ tries to reuse its last statement; if it does, separating the insert batch from the update batch may improve performance a little.

Akash Kava
A: 

Converting this to a compiled query is the easiest way I can think of to boost your performance here:

Change the following:

    Offer dbOffer = this.db.Offers.SingleOrDefault(
        o => o.offer_id == offer.offer_id);

to:

    Offer dbOffer = RetrieveOffer(this.db, offer.offer_id);

with RetrieveOffer defined once as:

    // Substitute your generated DataContext subclass for DataContext
    // so that .Offers resolves.
    private static readonly Func<DataContext, int, Offer> RetrieveOffer =
        CompiledQuery.Compile((DataContext context, int offerId) =>
            context.Offers.SingleOrDefault(o => o.offer_id == offerId));

This change alone will not make it as fast as your ADO.NET version, but it will be a significant improvement, because without the compiled query you are dynamically building the expression tree every time you run this method.

As one poster already mentioned, you must refactor your code so that SubmitChanges is called only once if you want optimal performance.

Otter
+8  A: 

SubmitChanges does not batch changes, it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.

While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.

See question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx

It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx which uses extension methods to fix LINQ's inability to batch inserts, updates, etc.

Alex Black
Yes, bulk insert is really the best way to go. Either that or let the LINQ run overnight.
Kirk Broadhurst
@Alex, @Kirk and @Marius, I can't find any implementation that fixes batch inserts in the second article (batch updates and deletes...). Can you please shed some light on that? It does batch updates and deletes well, but I desperately need batch inserts. Using SqlBulkCopy is great, but the manual sync/tracking of my automatic identity field (PK) after insert, for use in my linked table, is quite a pain for a large number of tables. Any advice will be greatly appreciated.
Fadrian Sudaman
+2  A: 

Do you really need to check whether the record exists before inserting it into the DB? It looked strange to me, as the data comes from a CSV file.

P.S. I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.

I don't think it defeats the purpose at all. Why would it? Just fill a simple DataTable with all the data from the CSV, applying your transformations as you go, and do a SqlBulkCopy. I did a similar thing with a collection of 30,000+ rows and the import time went from minutes to seconds.
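A minimal sketch of that approach (the column names, CSV layout, and connection string are hypothetical; the point is that the transformation happens while filling the DataTable, before the single bulk copy):

    // Sketch: transform each CSV row in memory, then bulk-copy once.
    var table = new DataTable();
    table.Columns.Add("offer_id", typeof(int));   // hypothetical columns
    table.Columns.Add("price", typeof(decimal));

    foreach (string line in File.ReadAllLines(csvPath))
    {
        string[] fields = line.Split(',');
        // apply whatever per-row transformation you need here
        table.Rows.Add(int.Parse(fields[0]), decimal.Parse(fields[1]));
    }

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "Offers";
        bulk.WriteToServer(table); // one bulk operation, not row-by-row inserts
    }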

TT
I need to add/update the data from the CSV. My original file is much bigger than 30k rows, and I need to do this daily/weekly. Loading it all into the db with SqlBulkCopy just to select each row again would be even slower.
Marius Ursache
+1  A: 

Alex gave the best answer, but I think a few things are being overlooked.

One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, the DataContext will repeatedly open and close it itself. However, if you open it yourself, and close it yourself when you're absolutely finished, things run a lot faster, since it won't have to reconnect to the database every time. I found this out while trying to work out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.

A few other areas where you could speed things up:

While LINQ to SQL doesn't support straight-up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.

If live data integrity isn't super crucial, you could retrieve the list of existing offer_id values from the server before you start checking whether an offer already exists. This could significantly reduce the number of round trips made just to discover that an item isn't there.
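A sketch of the manual open/close idea (MyDataContext stands in for your generated context class; LINQ to SQL's DataContext does expose its Connection):

    // Sketch: open the connection once so the DataContext doesn't
    // open/close it around every operation.
    using (var db = new MyDataContext(connectionString)) // hypothetical generated context
    {
        db.Connection.Open(); // stays open for the whole import
        foreach (var offer in offers)
        {
            db.Offers.InsertOnSubmit(offer);
        }
        db.SubmitChanges();
    } // Dispose closes the connection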

rossisdead
LINQ is slow on inserts alone, too. The first test I did was just InsertOnSubmit/SubmitChanges.
Marius Ursache
Yeah, that part's still going to be slow, because LINQ also returns any data that's db-generated.
rossisdead
I realize LINQ-to-SQL is not the best tool for bulk inserts. However, your suggestion of manually opening/closing the connection resulted in less CPU utilization and better performance.
Mayo
A: 

Hello,

This code runs OK and guards against large amounts of data accumulating in the change set:

    if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
    {
        repository2.GeoItems.SubmitChanges();
    }

and at the end of the bulk insertion:

    repository2.GeoItems.SubmitChanges();

mariano