views: 242

answers: 3

I am facing an application designed to import huge amounts of data into a Microsoft SQL Server 2000 database. The application seems to take an awfully long time to complete, and I suspect its design is flawed. Someone asked me to dig into the application to find and fix any serious bottlenecks. I would like a structured approach to this job and have decided to prepare a checklist of potential problems to look for. I have some experience with SQL databases and have so far written down some things to look for.

But some outside inspiration would be very helpful as well. Can any of you point me to some good resources or checklists for good database schema design and good database application design?

I plan on developing checklists for the following main topics:

  1. Database hardware - First, establish whether the server hardware is appropriate for the job.
  2. Database configuration - Next, ensure the database is configured for optimal performance.
  3. Database schema - Does the database schema have a sound design?
  4. Database application - Does the application incorporate sound algorithms?
+1  A: 

Three items I would add to the list:

  1. Bulk insert - Are you importing the data using a bulk provider (e.g. BCP or SqlBulkCopy) or via individual INSERT/UPDATE statements? (A sketch of the contrast follows this list.)
  2. Indexes - Do you have indexes on your target tables? Can they be dropped prior to import and then rebuilt after?
  3. Locking - Is there any contention occurring while you are trying to import data?
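
As a rough illustration of points 1 and 3 (the table name, file path, and delimiters below are invented placeholders, not anything from the question), the contrast between row-by-row inserts and a single bulk load with a table lock looks roughly like this in T-SQL:

    -- Hypothetical names: dbo.TargetTable and C:\load\data.csv are placeholders.

    -- Slow pattern: one statement, one round trip, and one log record per row.
    -- INSERT INTO dbo.TargetTable (Id, Amount) VALUES (@Id, @Amount)   -- repeated for every row

    -- Faster pattern: one bulk operation; TABLOCK takes a single table-level lock
    -- for the duration of the load instead of many row locks, which also reduces contention.
    BULK INSERT dbo.TargetTable
    FROM 'C:\load\data.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n',
        TABLOCK
    );
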
Ben Hoffstein
+1  A: 

Good start. Here are the recommended priorities.

First Principle. Import should do little or no processing other than source file reads and SQL Inserts. Other processing must be done prior to the load.

  1. Application Design is #1. Do the apps do as much as possible on the flat files before attempting to load? This is the secret sauce in large data warehouse loads: prepare offline and then bulk load the rows.

  2. Database Schema is #2. Do you have the right tables and the right indexes? A load doesn't require any indexes, so mostly you want to drop them before the load and rebuild them afterward (see the sketch after this list).

    A load had best not require any triggers. All that triggered processing can be done off-line to prepare the file for a load.

    A load had best not be done as a stored procedure. You want to be using a simple utility program from Microsoft to bulk load rows.

  3. Configuration. Matters, but much, much less than schema design and application design.

  4. Hardware. Unless you have money to burn, you're not going far here. If -- after everything else -- you can prove that hardware is the bottleneck, then spend money.
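
To make the index, trigger, and bulk-utility points in item 2 concrete, a minimal sketch of the drop/load/rebuild pattern might look like the following; the table, index, trigger, and column names are placeholders, and the load itself would typically be done with the bcp utility or BULK INSERT rather than a stored procedure:

    -- Placeholders: dbo.TargetTable, IX_TargetTable_Code, and the Code column are invented.

    -- Before the load: drop the nonclustered index and disable triggers.
    -- (BULK INSERT and bcp do not fire triggers unless explicitly asked to.)
    DROP INDEX TargetTable.IX_TargetTable_Code;
    ALTER TABLE dbo.TargetTable DISABLE TRIGGER ALL;

    -- ... bulk load the prepared flat file here with bcp.exe or BULK INSERT ...

    -- After the load: rebuild the index and re-enable triggers.
    CREATE INDEX IX_TargetTable_Code ON dbo.TargetTable (Code);
    ALTER TABLE dbo.TargetTable ENABLE TRIGGER ALL;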

S.Lott
+1  A: 

You have left out the first place I would start looking: the technique used to import the data.

If the application is inserting single rows, is there a reason for that? Is it using DTS or BULK INSERT or BCP? Is it loading to a staging table or a table with triggers? Is it loading in batches or attempting to load and commit the entire batch? Is there extensive transformation or data type conversion of rows on their way in? Is there extensive transformation of data into a different schema or model?

I wouldn't worry about 1 and 2 until I had seen whether the ETL techniques used were sound, and if they are importing data into an existing schema, you aren't going to have much room to change anything to do with 3. With regard to the import and 4, I prefer not to run much algorithmic processing on the data during the load phase.

For the best performance in the most general cases, load to a flat staging table with good reliable basic type conversion, with exceptions handled at that time (using SSIS or DTS). For very large data (multi-million row daily loads, say), I load in 100,000 or 1,000,000 record batches (this is easily set in BCP or SSIS). Any derived columns are either created at the time of the load (SSIS or DTS) or right after with an UPDATE. Run exceptions, validate the data and create constraints. Then manipulate the data into the final schema as part of one or more transactions - UPDATEs, INSERTs, DELETEs, GROUP BYs for dimensions or entities or whatever.
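
A minimal T-SQL sketch of this staging-table flow, under the assumption of an invented dbo.StageOrders staging table, a dbo.Orders target table, and a comma-delimited source file (real exception handling would divert bad rows to an error table rather than just filter them):

    -- 1. Flat staging table with forgiving types; exceptions are handled at this stage.
    CREATE TABLE dbo.StageOrders (
        OrderId   VARCHAR(20) NULL,
        OrderDate VARCHAR(30) NULL,
        Amount    VARCHAR(30) NULL
    );
    GO

    -- 2. Bulk load in batches so each 100,000 rows commits as its own unit.
    BULK INSERT dbo.StageOrders
    FROM 'C:\load\orders.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK, BATCHSIZE = 100000);
    GO

    -- 3. Derived column populated right after the load with an UPDATE.
    ALTER TABLE dbo.StageOrders ADD OrderYear INT NULL;
    GO
    UPDATE dbo.StageOrders
    SET OrderYear = YEAR(CAST(OrderDate AS DATETIME))
    WHERE ISDATE(OrderDate) = 1;
    GO

    -- 4. Validate, then move the data into the final schema inside a transaction.
    BEGIN TRANSACTION;
    INSERT INTO dbo.Orders (OrderId, OrderDate, Amount, OrderYear)
    SELECT CAST(OrderId AS INT),
           CAST(OrderDate AS DATETIME),
           CAST(Amount AS DECIMAL(12, 2)),
           OrderYear
    FROM dbo.StageOrders
    WHERE ISNUMERIC(Amount) = 1 AND ISDATE(OrderDate) = 1;   -- crude validation only
    COMMIT TRANSACTION;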

Obviously, there are exceptions to this, and it depends a lot on the input data and the model. For instance, with EBCDIC packed data in the inputs, there is no such thing as good reliable basic type conversion in the load stage, so the load becomes more complex and slower because the data has to be processed more heavily.

Using this overall approach, we let SQL do what it is good for, and let client applications (or SSIS) do what they are good for.

Cade Roux