Hi, all.

I wrote a VB.NET console program to process CSV records that come in a text file. I'm using the FileHelpers library along with the Microsoft Enterprise Library 4 to read the records one at a time and insert them into the database.

It took about 3-4 hours to process the 5+ million records in the text file.

Is there any way to speed up the process? Has anyone dealt with such a large number of records before, and how would you update those records if there were new data to be applied?

Edit: Can someone recommend a profiler? Preferably open source or free.

A: 

I'd speed it up the same way I'd speed anything up: by running it through a profiler and figuring out what's taking the longest.

There is absolutely no way to guess what the bottleneck here is -- maybe there is a bug in the code which parses the CSV file, resulting in polynomial runtimes? Maybe there is some very complex logic used to process each row? Who knows!

Also, for the "record", 5 million rows isn't all THAT heavy -- an off-the-top-of-my-head guess says that a reasonable program should be able to churn through that in half an hour, and a good program in much less.

Finally, if you find that the database is your bottleneck, check to see if a transaction is being committed after each insert. That can lead to some nontrivial slowdown...
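
For illustration only (this sketch is not from the answer above): committing once per batch rather than once per row, using an explicit transaction. The table, columns and two-field records are made-up placeholders.

    Imports System.Collections.Generic
    Imports System.Data
    Imports System.Data.SqlClient

    Module SingleTransactionSketch
        ' Commit once for the whole batch instead of once per row.
        ' "dbo.ImportRecords" and "Col1"/"Col2" are invented names.
        Sub InsertAll(ByVal records As IEnumerable(Of String()), ByVal connectionString As String)
            Using conn As New SqlConnection(connectionString)
                conn.Open()
                Using tran As SqlTransaction = conn.BeginTransaction()
                    Using cmd As New SqlCommand( _
                            "INSERT INTO dbo.ImportRecords (Col1, Col2) VALUES (@c1, @c2)", conn, tran)
                        cmd.Parameters.Add("@c1", SqlDbType.NVarChar, 100)
                        cmd.Parameters.Add("@c2", SqlDbType.NVarChar, 100)
                        For Each fields As String() In records
                            cmd.Parameters("@c1").Value = fields(0)
                            cmd.Parameters("@c2").Value = fields(1)
                            cmd.ExecuteNonQuery()
                        Next
                    End Using
                    tran.Commit()   ' one commit instead of millions
                End Using
            End Using
        End Sub
    End Module

For a file this size you would probably commit every 10,000-50,000 rows rather than holding one transaction open for the whole run, to keep the transaction log manageable.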

David Wolever
@David don't underestimate 5 million round trips; that can be a lot depending on the environment.
eglasius
+2  A: 

Use a profiler - find out where the time is going.

Short of a real profiler, try the following (a rough timing harness is sketched after the list):

  • Time how long it takes to just read the files line by line, without doing anything with them
  • Take a sample line, and time how long it takes just to parse it and do whatever processing you need, 5+ million times
  • Generate random data and insert it into the database, and time that
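
As a rough illustration of those three measurements (a sketch, not part of the answer; ParseLine below is only a stand-in for the real FileHelpers parsing call):

    Imports System.Diagnostics
    Imports System.IO

    Module TimingSketch
        Sub TimePhases(ByVal path As String)
            ' 1. Raw read: how long does it take just to stream the file?
            Dim sw As Stopwatch = Stopwatch.StartNew()
            Dim sample As String = Nothing
            Using reader As New StreamReader(path)
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    If sample Is Nothing Then sample = line
                    line = reader.ReadLine()
                End While
            End Using
            Console.WriteLine("Read only:  {0}", sw.Elapsed)

            ' 2. Parse only: one sample line, 5 million times.
            sw = Stopwatch.StartNew()
            For i As Integer = 1 To 5000000
                ParseLine(sample)
            Next
            Console.WriteLine("Parse only: {0}", sw.Elapsed)

            ' 3. Insert only: generate random rows and time the existing
            '    database-insert code the same way.
        End Sub

        Private Function ParseLine(ByVal line As String) As String()
            ' Stand-in for the real parsing (e.g. the FileHelpers engine call).
            Return line.Split(","c)
        End Function
    End Module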

My guess is that the database will be the bottleneck. You should look into doing a batch insert - if you're inserting just a single record at a time, that's likely to be a lot slower than batch inserting.

Jon Skeet
+2  A: 

I have done many applications like this in the past and there are a number of ways that you can look at optimizing.

  1. Ensure that the code you are writing is properly managing memory; with something like this, one little mistake can slow the process to a crawl.

  2. Think about making the database calls asynchronous, as the database may be the bottleneck, so a bit of queuing could be OK.

  3. Consider dropping the indexes, doing the import, and then re-creating the indexes (see the sketch after this list).

  4. Consider using SSIS to do the import; it is already optimized and does this kind of thing out of the box.
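
A sketch of item 3, assuming SQL Server 2005 or later; the index and table names are invented, and only nonclustered indexes should be disabled (disabling the clustered index would make the table unusable):

    Imports System.Data.SqlClient

    Module IndexSketch
        ' Disable a (hypothetical) nonclustered index before the load, rebuild it afterwards.
        Sub RunImportWithIndexDisabled(ByVal connectionString As String, ByVal doImport As Action)
            Using conn As New SqlConnection(connectionString)
                conn.Open()
                Exec(conn, "ALTER INDEX IX_ImportRecords_Col1 ON dbo.ImportRecords DISABLE")
                doImport()   ' the existing import routine
                Exec(conn, "ALTER INDEX IX_ImportRecords_Col1 ON dbo.ImportRecords REBUILD")
            End Using
        End Sub

        Private Sub Exec(ByVal conn As SqlConnection, ByVal sql As String)
            Using cmd As New SqlCommand(sql, conn)
                cmd.CommandTimeout = 0   ' rebuilding an index over millions of rows can take a while
                cmd.ExecuteNonQuery()
            End Using
        End Sub
    End Module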

Mitchel Sellers
Just don't spend too long doing this -- unless it's something really obvious (like reading the file one character at a time, an exponential-time algorithm, or something similarly obvious), you're wasting your time when you try to speculate where the slowdown is.
David Wolever
what's wrong with reading the file one character at a time?
John
@david good point -- and for my #1 item, something like ANTS Profiler is potentially a good troubleshooting tool.
Mitchel Sellers
+1  A: 

Your best bet would be to try using a profiler with a relatively small sample -- this could identify where the actual hold-ups are.

Rowland Shaw
+3  A: 

read the records one at a time and insert them into the database

Read them in batches and insert them in batches.
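
One way to do that from ADO.NET (a sketch with invented table and column names, not code from this answer) is a DataAdapter with UpdateBatchSize, which sends many inserts per round trip:

    Imports System.Collections.Generic
    Imports System.Data
    Imports System.Data.SqlClient

    Module BatchedInsertSketch
        ' Buffer parsed records into a DataTable and let the adapter send them in batches.
        ' Flush the table every 10,000-100,000 rows rather than holding all 5 million in memory.
        Sub InsertBatch(ByVal rows As IEnumerable(Of String()), ByVal connectionString As String)
            Dim table As New DataTable()
            table.Columns.Add("Col1", GetType(String))
            table.Columns.Add("Col2", GetType(String))
            For Each fields As String() In rows
                table.Rows.Add(fields(0), fields(1))
            Next

            Using conn As New SqlConnection(connectionString)
                Dim insert As New SqlCommand( _
                    "INSERT INTO dbo.ImportRecords (Col1, Col2) VALUES (@c1, @c2)", conn)
                insert.Parameters.Add("@c1", SqlDbType.NVarChar, 100, "Col1")
                insert.Parameters.Add("@c2", SqlDbType.NVarChar, 100, "Col2")
                insert.UpdatedRowSource = UpdateRowSource.None   ' required when batching

                Dim adapter As New SqlDataAdapter()
                adapter.InsertCommand = insert
                adapter.UpdateBatchSize = 1000   ' rows per round trip

                conn.Open()
                adapter.Update(table)
            End Using
        End Sub
    End Module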

Justice
Do you have a link on how to do batch insert/update?
Jack
Bulk Insert is your friend. http://msdn.microsoft.com/en-us/library/ms188365.aspx
Joel Coehoorn
+2  A: 

Why not just insert the data directly into the SQL Server database using Microsoft SQL Server Management Studio or the command line (SQLCMD)? It knows how to process CSV files.

The BulkInsert property should be set to True on your database.

If the data has to be modified, you can insert it into a temporary (staging) table and then apply your modifications with T-SQL.
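
As a sketch of that approach (the staging table, target table, columns, and file path are all invented, and the CSV path must be visible to the SQL Server service account):

    Imports System.Data.SqlClient

    Module StagingLoadSketch
        ' Bulk load the file into a staging table server-side, then apply the
        ' updates/inserts against the real table in T-SQL.
        Sub LoadViaStaging(ByVal connectionString As String)
            Dim bulkLoad As String = _
                "BULK INSERT dbo.ImportStaging FROM 'C:\data\records.csv' " & _
                "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')"

            Dim applyChanges As String = _
                "UPDATE t SET t.Col2 = s.Col2 " & _
                "FROM dbo.ImportRecords t INNER JOIN dbo.ImportStaging s ON t.Col1 = s.Col1; " & _
                "INSERT INTO dbo.ImportRecords (Col1, Col2) " & _
                "SELECT s.Col1, s.Col2 FROM dbo.ImportStaging s " & _
                "WHERE NOT EXISTS (SELECT 1 FROM dbo.ImportRecords t WHERE t.Col1 = s.Col1)"

            Using conn As New SqlConnection(connectionString)
                conn.Open()
                Using cmd As New SqlCommand(bulkLoad, conn)
                    cmd.CommandTimeout = 0
                    cmd.ExecuteNonQuery()
                End Using
                Using cmd As New SqlCommand(applyChanges, conn)
                    cmd.CommandTimeout = 0
                    cmd.ExecuteNonQuery()
                End Using
            End Using
        End Sub
    End Module

This also covers the "how would you update such records" part of the question: existing rows are updated and new rows are inserted from the staging table in set-based T-SQL rather than row by row.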

Koistya Navin
A: 

Load it into memory and then insert it into the DB. 5 million rows shouldn't tax your memory. The problem is that you are essentially thrashing your disk -- both reading the CSV and writing to the DB.

Jeff
A: 

Not sure what you're doing with them, but have you considered Perl? I recently rewrote a VB script that was doing something similar -- processing thousands of records -- and the time went from about an hour for the VB script to about 15 seconds for Perl.

chris
A: 

After reading all the records from the file (I would read the entire file in one pass, or in blocks), use the SqlBulkCopy class to import the records into the DB. SqlBulkCopy is, as far as I know, the fastest approach to importing a block of records. There are a number of tutorials online.
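
A minimal SqlBulkCopy sketch along those lines (the destination table, columns, and the naive comma split are placeholders -- real parsing would stay with FileHelpers):

    Imports System.Data
    Imports System.Data.SqlClient
    Imports System.IO

    Module BulkCopySketch
        ' Stream the CSV into a DataTable and push it to the server with SqlBulkCopy.
        ' For 5+ million rows, fill and flush the table in blocks instead of all at once.
        Sub BulkLoad(ByVal csvPath As String, ByVal connectionString As String)
            Dim table As New DataTable()
            table.Columns.Add("Col1", GetType(String))
            table.Columns.Add("Col2", GetType(String))

            Using reader As New StreamReader(csvPath)
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    Dim fields() As String = line.Split(","c)
                    table.Rows.Add(fields(0), fields(1))
                    line = reader.ReadLine()
                End While
            End Using

            Using bulk As New SqlBulkCopy(connectionString)
                bulk.DestinationTableName = "dbo.ImportRecords"
                bulk.BatchSize = 10000
                bulk.BulkCopyTimeout = 0
                bulk.WriteToServer(table)
            End Using
        End Sub
    End Module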

Kyle B.
A: 

As others have suggested, profile the app first.

That said, you will probably gain from doing batch inserts. This was the case for one app I worked on, and the impact was high.

Consider that 5 million round trips are a lot, especially if each of them is for a simple insert.

eglasius
A: 

In a similar situation we saw a considerable performance improvement by switching from one-row-at-a-time inserts to using the SqlBulkCopy API.

There is a good article here.

Jason DeFontes
A: 

You need to bulk load the data into your database, assuming it has that facility. In SQL Server you'd be looking at BCP, DTS or SSIS -- BCP is the oldest but maybe the fastest. OTOH, if that's not possible in your DB, turn off all indexes before doing the run. I'm guessing it's the DB that's causing the problems, not the .NET code.

MrTelly