views: 427 · answers: 7
What's the most efficient method to load a large volume of data from CSV (3 million+ rows) into a database?

  • The data needs to be formatted (e.g. the name column needs to be split into first name and last name, etc.)
  • I need to do this as efficiently as possible, i.e. there are time constraints.

I am leaning towards reading, transforming, and loading the data row by row using a C# application. Is this ideal? If not, what are my options? Should I use multithreading?

+3  A: 

You will be I/O bound, so multithreading will not necessarily make it run any faster.

Last time I did this, it was about a dozen lines of C#. In one thread it ran the hard disk as fast as it could read data from the platters. I read one line at a time from the source file.

If you're not keen on writing it yourself, you could try the FileHelpers library. You might also want to have a look at Sébastien Lorion's work; his CSV reader is written specifically to deal with performance issues.
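For what it's worth, a minimal sketch of that single-threaded, line-at-a-time loop might look like the following (the file name, the naive comma split, and the transform placeholder are assumptions; a CSV with quoted commas or embedded newlines would need a real parser such as FileHelpers or Lorion's reader):

    using System;
    using System.IO;

    class CsvLoader
    {
        static void Main()
        {
            long rowCount = 0;

            // Stream the file so only one row is ever held in memory.
            using (var reader = new StreamReader("input.csv"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Naive split; does not handle quoted commas or embedded newlines.
                    var fields = line.Split(',');

                    // Transform here (e.g. split fields[0] into first/last name)
                    // and buffer the row for insertion into the database.
                    rowCount++;
                }
            }

            Console.WriteLine("Read {0} rows.", rowCount);
        }
    }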

Robert Harvey
Yes, the C# I/O library is well designed and buffered. Recently I had to transform one CSV into another (1.5 million lines) in about a minute.
call me Steve
I recommend FileHelpers too. It saved me from having to write a parser to deal with values that have commas in them. If the CSV has any such nasty details, consider FileHelpers.
Charles
I know that in the past, seek time on drives was an issue. In the case of large image files, we would read from one drive and write to another to cut down on the number of times the drive heads had to be repositioned.
yamspog
+1  A: 

I would agree with your solution. Reading the file one line at a time avoids the overhead of pulling the whole file into memory at once, so the application should run quickly and efficiently, spending its time mainly on reading the file (which is relatively quick) and parsing the lines. The one note of caution I have for you is to watch out for embedded newlines in your CSV. I don't know whether the specific CSV format you're using outputs newlines between quotes in the data, but that would of course confuse a line-by-line algorithm.

Also, I would suggest batching the insert statements (many INSERT statements in one string) before sending them to the database, provided this doesn't cause problems retrieving generated key values that you need for subsequent foreign keys (hopefully you don't need to retrieve any). Keep in mind that SQL Server (if that's what you're using) can only handle about 2100 parameters per batch, so size your batches to account for that. I would also recommend using parameterized T-SQL statements to perform the inserts. I suspect more time will be spent inserting records than reading them from the file.
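As a hedged sketch of what a batched, parameterized insert might look like (the People table, its columns, and the row representation are assumptions; with two parameters per row, a batch of 1000 rows stays comfortably under the 2100-parameter limit):

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Text;

    class BatchInserter
    {
        // Sends one batch of rows (each row = { firstName, lastName }) in a single command.
        static void InsertBatch(SqlConnection conn, IList<string[]> rows)
        {
            var sql = new StringBuilder();
            using (var cmd = conn.CreateCommand())
            {
                for (int i = 0; i < rows.Count; i++)
                {
                    // One INSERT per row, all concatenated and sent in one round trip.
                    sql.AppendLine($"INSERT INTO People (FirstName, LastName) VALUES (@f{i}, @l{i});");
                    cmd.Parameters.AddWithValue($"@f{i}", rows[i][0]);
                    cmd.Parameters.AddWithValue($"@l{i}", rows[i][1]);
                }
                cmd.CommandText = sql.ToString();
                cmd.ExecuteNonQuery();   // e.g. call with batches of 1000 rows (2000 parameters)
            }
        }
    }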

BlueMonkMN
+1  A: 

You don't state which database you're using, but given the language you mention is C# I'm going to assume SQL Server.

If the data can't be imported using BCP (which it sounds like it can't if it needs significant processing) then SSIS is likely to be the next fastest option. It's not the nicest development platform in the world, but it is extremely fast. Certainly faster than any application you could write yourself in any reasonable timeframe.

Greg Beech
I'm with Greg and JayRiggs on this one. Skip the C# (unless you're writing a CLR module for SQL Server) and let SQL do the work. It's kinda good at working with mass volumes of data from files, in case you hadn't heard. ;) That'll save you all kinds of headaches with opening connections, etc.
drachenstern
Doesn't this make unit testing very difficult?
guazz
This isn't really the kind of problem where unit testing is much use. People focus too much on unit testing and ignore the bigger picture. What you should be looking to test is that the data that gets into the database is correct, given a known set of data in a CSV, and that known-bad cases are handled (either fixed, discarded or failed) as expected. If you do it that way then it doesn't really matter how it gets into the database. So from any practical perspective I'd say SSIS is just as testable as anything else.
Greg Beech
+2  A: 

You could use the csvreader to quickly read the CSV.

Assuming you're using SQL Server, you can use csvreader's CachedCsvReader to read the data into a DataTable, which you can then use with SqlBulkCopy to load into SQL Server.
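A sketch of that pipeline, assuming the reader in question is the LumenWorks "Fast CSV Reader" (Sébastien Lorion's), whose CachedCsvReader implements IDataReader so a DataTable can load it directly; the file path, connection string, and destination table are placeholders:

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using LumenWorks.Framework.IO.Csv;

    class BulkLoad
    {
        static void Main()
        {
            string connectionString = "...";   // your SQL Server connection string

            using (var csv = new CachedCsvReader(new StreamReader("input.csv"), true)) // true = file has a header row
            {
                // CachedCsvReader implements IDataReader, so DataTable.Load can consume it.
                var table = new DataTable();
                table.Load(csv);

                using (var bulk = new SqlBulkCopy(connectionString))
                {
                    bulk.DestinationTableName = "dbo.People";   // placeholder table name
                    bulk.BatchSize = 10000;                     // send rows to the server in chunks
                    bulk.WriteToServer(table);
                }
            }
        }
    }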

Jay Riggs
This is what I use. I like csvreader; it's a very convenient way to access a delimited file.
galford13x
+1 for the SqlBulkCopy
Lirik
A: 

BCP is pretty quick so I'd use that for loading the data. For string manipulation I'd go with a CLR function on SQL once the data is there. Multi-threading won't help in this scenario except to add complexity and hurt performance.
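For the CLR route, a name-splitting function might look roughly like this (the function names and the split-on-first-space rule are assumptions; the assembly would be registered with CREATE ASSEMBLY / CREATE FUNCTION and called from T-SQL after the BCP load):

    using Microsoft.SqlServer.Server;
    using System.Data.SqlTypes;

    public class NameFunctions
    {
        // Returns everything before the first space as the first name.
        [SqlFunction]
        public static SqlString FirstName(SqlString fullName)
        {
            if (fullName.IsNull) return SqlString.Null;
            var parts = fullName.Value.Split(new[] { ' ' }, 2);
            return new SqlString(parts[0]);
        }

        // Returns everything after the first space, or empty if there is none.
        [SqlFunction]
        public static SqlString LastName(SqlString fullName)
        {
            if (fullName.IsNull) return SqlString.Null;
            var parts = fullName.Value.Split(new[] { ' ' }, 2);
            return new SqlString(parts.Length > 1 ? parts[1] : string.Empty);
        }
    }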

Kofi Sarfo
A: 

Read the contents of the CSV file line by line into an in-memory DataTable. You can manipulate the data (i.e. split the first name and last name, etc.) as the DataTable is being populated.

Once the CSV data has been loaded in memory then use SqlBulkCopy to send the data to the database.

See http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx for the documentation.
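A rough sketch of that approach, with the file layout, split-on-first-space rule, destination table, and connection string all assumed for illustration:

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;

    class CsvToDataTable
    {
        static void Main()
        {
            var table = new DataTable();
            table.Columns.Add("FirstName", typeof(string));
            table.Columns.Add("LastName", typeof(string));

            using (var reader = new StreamReader("input.csv"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    var fields = line.Split(',');                  // naive split; no quoted-comma handling
                    var name = fields[0].Split(new[] { ' ' }, 2);  // split the name while loading
                    table.Rows.Add(name[0], name.Length > 1 ? name[1] : string.Empty);
                }
            }

            using (var bulk = new SqlBulkCopy("..."))              // connection string placeholder
            {
                bulk.DestinationTableName = "dbo.People";
                bulk.WriteToServer(table);
            }
        }
    }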

Hidden_au
A: 

If you really want to do it in C#, create & populate a DataTable, truncate the target db table, then use System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt).
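The truncate-then-reload step might be sketched like this (table name is a placeholder, and the DataTable is assumed to have been populated as in the answers above):

    using System.Data;
    using System.Data.SqlClient;

    class Reload
    {
        // Empties the target table, then bulk-loads the freshly built DataTable.
        static void TruncateAndLoad(string connectionString, DataTable dataTable)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();

                using (var cmd = new SqlCommand("TRUNCATE TABLE dbo.People", conn))
                    cmd.ExecuteNonQuery();

                using (var bulk = new SqlBulkCopy(conn))
                {
                    bulk.DestinationTableName = "dbo.People";
                    bulk.WriteToServer(dataTable);
                }
            }
        }
    }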

mhanney
Unfortunately, I need to update existing records and the data will be loaded daily.
guazz