ansaurus

Question

Answer 1

+1 A:

Are you doing a bulk insert? I'd use it if you arnt already.

INSERT INTO dbo.NewTable(fields) 
SELECT fields 
FROM dbo.oldTable 
WHERE ...

In the above example you would want to ensure the tables used in the select statement have the appropriate indexes... correctly assigning the clustered index to the most relevant field.

If the select statement is slow, check the execution plan to possibly find the bottleneck.

Chris Klepeis 2009-06-27 00:23:39

Unfortunately source data is not coming from a query on an old database

pablo 2009-06-27 00:35:16

DaveE 2009-06-27 00:52:24

If this approach were taken you could also look at the SQLBulkLoad object

Simon Wilson 2009-06-27 01:46:42

Answer 2

+6 A:

You will see greater speed writing to a flat file for a few reasons:

ExecuteNonQuery does not group multiple insert statements into batches, so you are incurring a full inter-process communication turnaround per record. Send your insert statements in groups.
The data you have are already in the shape of a flat file, so you can fire it all off in one write, or a few writes with buffering.
Database operations tend to use trees which take n log n time, while a simple array-shaped construct will take linear time. On the other hand, if you're merging into a sorted flat file, that will take a while.

Jeffrey Hantin 2009-06-27 00:24:25

Answer 3

+2 A:

So that's about 8 milliseconds for a single row versus about that for the entire file. Fair?

A database certainly has a lot more potentially going on:

Parsing, validating, executing SQL
Calculating the values of any indexes
Managing rollback logs if this is a single transaction
Writing to its own file

I'll assume that you're running locally, so there's no network latency to include.

So I would guess that a database is slower. I wouldn't have thought 600K times slower, though.

duffymo 2009-06-27 00:25:13

Add constraints and triggers to the list, please :)

Eugene 2009-06-27 00:30:59

Fortunately I'm not using triggers :-P.

pablo 2009-06-27 00:36:12

Answer 4

A:

you are probably running the command over and over against the database server, what if you construct a command text that includes multiple inserts and then run this ? ie

string commandText = "insert into x ( y, z) values ( 1, 2 );\r\n"
commandText += "insert into x ( y, z) values ( 2, 3 );"

command.Text = commandText;
command.ExecuteNonQuery();

Konstantinos 2009-06-27 00:25:51

It dramatically improves things up in MySql, but is still not the fastest thing around: and old test I did with 400K records needed 77sec to execute inserting 1 by 1, and 14sec when done in "batch mode", but then you also have to check you don't pass the mysql package limit (you can tune it on my.cnf). But you can't do that with SqlServer 2005, can you?

pablo 2009-06-27 00:44:31

i don't know the sqlserver limit, but yeah, i have to tell you that inserting those lines to a flat file is by definition faster than contacting a database server

Konstantinos 2009-06-27 00:55:23

Answer 5

+1 A:

I can't help you much with MySQL. However, SQL Server 2005 and greater have some pretty intriguing XML support that might help you out. I recommend looking into Updategrams, a feature that allows you to submit a batch of data to be inserted, updated, or deleted. This might help you improve the performance with SQL Server, as you only need to issue a single statement rather than 600,000 statements. I am not sure it would be quite as fast as writing to a raw file, but it should be significantly faster than issuing individual statements.

You can start learning about updategrams here: http://msdn.microsoft.com/en-us/library/aa258671(SQL.80).aspx

jrista 2009-06-27 00:37:20

never though I would see XML being linked with faster performance. The verbosity of XML usually is a bottleneck.

Joshua 2009-06-27 16:38:31

@Joshua: Given the sheer volume of individual sql statements currently being sent, I think in this case XML could be a true saving grace. ;)

jrista 2009-06-28 01:38:45

Answer 6

A:

If you do not require many concurrent users try using MS-Jet, i.e. "Microsoft Access" as your DBMS. The MSJet performance can be about 10x faster than SqlServer. BTW, inserting 600k records in just 50 seconds (12k/sec) is very fast for SqlServer.

2009-06-27 00:37:58

I've seen it on a manual test (actually copying pasting records on an Access table), how can it be so dammed fast?

pablo 2009-06-27 00:46:08

Lots of reasons - first, there's no logging. The plus side of that is that it's fast, but the downside is that if anything goes wrong, you're plain out of luck. You can't recover to a point in time with Access.

Brent Ozar 2009-06-27 14:49:31

Or you could just turn off logging on SQL Server. I'd see little reason to step back from SQL Server to Access.

duffymo 2009-06-27 15:32:58

Answer 7

+3 A:

Use SqlBulkCopy:

http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx

http://www.sqlteam.com/article/use-sqlbulkcopy-to-quickly-load-data-from-your-client-to-sql-server

AlexKuznetsov 2009-06-27 01:32:59

+1 the fastest way to move into sql server.

Sam Saffron 2009-06-27 01:37:48

Answer 8

+1 A:

As Alex said: use SqlBulkCopy, nothing beats it when it comes to performance.

It is a bit tricky to use, for sample code have a look here:

http://github.com/sambo99/So-Slow/blob/1552b1293525bfe36f6c9b522e370de626ac6f05/Importer.cs

Sam Saffron 2009-06-27 01:37:06

Answer 9

+3 A:

If all you need is to insert the data and never read it back then you can write a noop function and pretend you inserted them in /dev/nul. The real question is how do you plan to consume the said data? Do you need to interrogate, filter, sort, reference the individual records? Ie. why did you even consider a database to start with, if a flat file appears to be just as good?

With SQL Server you can certainly achieve better performance with a database and insert at a rate of about 50-100k per second at least. Your current chocking point is probably the lgo flush on each insert. You must batch commits and make sure your log is on a fast array of spindles. Start a transaction, insert roughly enough records to fill a log page (64kb) then commit. Also is worth using a battery of 5-10 SqlCommands and connections and use async commands (BeginExecuteNonReader with callback) to launch multiple inserts in parallel, this way you can leverage all dead times you now loose in network round-trip and execution context preparation.

Remus Rusanu 2009-06-27 10:59:49

Hi, what do you mean by "You must batch commits and make sure your log is on a fast array of spindles"? I'm inserting them all inside a transaction, and yes, it would be great to reach 50k records per second but, do I need to use a Batch Mode for that?

pablo 2009-06-27 12:39:20

Answer 10

A:

My guess is that you're doing transactional inserts: inserts that look like this:

INSERT INTO dbo.MyTable (Field1, Field2, Field3)
VALUES (50, 100, 150)

That'll work, but like you've found, that doesn't scale. In order to push a lot of data into SQL Server very quickly, there are tools and techniques to pull it off.

Probably the simplest way to do it is with BCP. Here's a couple of links about it:

Next, you'll want to set up SQL Server in order to insert as many records as possible. Is your database in full recovery mode or simple recovery mode? To find out, go into SQL Server Management Studio, right-click on the database name, and click Properties. Full recovery mode will log every transaction, but simple recovery mode will run somewhat faster. Are the data files and log files located on separate arrays? How many drives are in each array, and what RAID type is it (1, 5, 10)? If both the data and log files are on the C drive, for example, you'll have poor performance.

Next, you'll want to set up your table, too. Do you have constraints and indexes on the table? Do you have other records in it already, and do you have other people querying it at the same time? If so, consider building an empty table for data loads with no indexes or constraints. Dump all the data in there as fast as possible, and then apply the constraints or indexes, or move the data into its final destination.

Brent Ozar 2009-06-27 15:07:37

Answer 11

A:

my SQL Server 2005 solution

StringBuilder sb = new StringBuilder();
bool bFirst = true;

foreach(Record r in myData)
{
    if (bFirst)
        sb.AppendLine("INSERT INTO tbl (f1, f2, f3)");
    else
        sb.AppendLine("UNION ALL");
    bFirst = false;

    sb.AppendLine("SELECT " + r.data1.ToString() + "," + 
        r.data2.ToString() + "," + r.data3.ToString());
}

SqlCommand cmd = new SqlCommand(sb.ToString(), conn);
cmd.ExecuteNonQuery();

wonder how that would perform ;)

devio 2009-06-27 15:27:55

Answer 12

+1 A:

Ayende has some interesting code to batch up exactly these ExecuteNonQuery situations. Opening Up Query Batching was the intro post where he talks about SqlCommandSet, then releases the code in There Be Dragons: Rhino.Commons.SqlCommandSet.

If you can optimise for SQL2008, you could also try the shiny new table value parameters. This sqlteam article is a good intro to them.

Dan F 2009-06-27 16:17:45

ansaurus

tags:

views:

answers:

Database or flat file for 600K records?

related questions