We have a system that concurrently inserts a large amount of data from multiple stations while also exposing a data-querying interface. The schema looks something like this (sorry about the poor formatting):

[SyncTable]
  SyncID
  StationID
  MeasuringTime


[DataTypeTable]
  TypeID
  TypeName


[DataTable]
  SyncID
  TypeID
  DataColumns...

Data insertion is done in a "Synchronization" and goes like this (we only ever insert data into the system, we never update):

INSERT INTO SyncTable(StationID, MeasuringTime) VALUES (X,Y); SELECT @@IDENTITY

INSERT INTO DataTable(SyncID, TypeID, DataColumns) VALUES 
  (SyncIDJustInserted, InMemoryCachedTypeID, Data)
  ... lots (500) similar inserts into DataTable ...

And queries go like this (for a given station, measuring time, and data type):

SELECT SyncID FROM SyncTable WHERE StationID = @StationID 
                               AND MeasuringTime = @MeasuringTime 
SELECT DataColumns FROM DataTable WHERE SyncID = @SyncIDJustSelected
                                  AND TypeID = @TypeID

My question is how we can combine the transaction level on the inserts with NOLOCK/READPAST hints on the queries (the hint syntax is shown after the list below) so that:

  1. We maximize the concurrency in our system while favoring the inserts (we need to store a lot of data, at rates as high as 2000+ records a second)
  2. Queries only return data from "committed" synchronizations (we don't want a result set with a half-inserted synchronization, or a synchronization with some entries skipped due to lock-skipping)
  3. We don't care if the "newest" data is included in the query; we care more for consistency and responsiveness than for "live" and up-to-date data
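
For reference, the hints mentioned above are applied per table; this is purely a syntax illustration of the queries in question, not a recommendation:

-- Syntax illustration only: READPAST skips rows locked by in-flight
-- inserts, while NOLOCK reads uncommitted rows, so on their own they
-- conflict with goal 2 above.
SELECT SyncID FROM SyncTable WITH (READPAST)
WHERE StationID = @StationID AND MeasuringTime = @MeasuringTime

SELECT DataColumns FROM DataTable WITH (NOLOCK)
WHERE SyncID = @SyncIDJustSelected AND TypeID = @TypeID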

These may be very conflicting goals and may require a high transaction isolation level, but I am interested in all tricks and optimizations to achieve high responsiveness on both inserts and selects. I'll be happy to elaborate if more details are needed to flesh out more tweaks and tricks.

UPDATE: Just adding a bit more information for future replies. We are running SQL Server 2005 (probably 2008 within six months) on a SAN with 5+ TB of storage initially. I'm not sure what kind of RAID the SAN is set up with or precisely how many disks we have available.

A: 
  1. What type of disk system will you be using? If you have a large striped RAID array, writes should perform well. If you can estimate your required reads and writes per second, you can plug those numbers into a formula and see if your disk subsystem will keep up. Maybe you have no control over hardware...

  2. Wouldn't you wrap the inserts in a transaction, which would make them unavailable to the reads until the insert is finished? (A sketch follows this list.)

  3. This should follow if your hardware is configured correctly and you're paying attention to your SQL coding - which it seems you are.
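
Here is a minimal sketch of one synchronization wrapped in a single transaction, with illustrative variable names; SCOPE_IDENTITY() is used instead of @@IDENTITY so that a trigger on SyncTable can't return the wrong identity value:

BEGIN TRANSACTION;

INSERT INTO SyncTable (StationID, MeasuringTime)
VALUES (@StationID, @MeasuringTime);

-- SCOPE_IDENTITY() is limited to the current scope, unlike @@IDENTITY
DECLARE @SyncID int;
SET @SyncID = SCOPE_IDENTITY();

INSERT INTO DataTable (SyncID, TypeID, DataColumns)
VALUES (@SyncID, @TypeID, @Data);
-- ... the remaining ~500 inserts for this synchronization ...

COMMIT TRANSACTION;

Until the COMMIT, readers under the default READ COMMITTED level will block on (or, with READPAST, skip) the new rows, which is what keeps half-inserted synchronizations out of result sets.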

Look into the SQLIO.exe and SQLIOStress.exe tools:

SQLIOStress.exe simulates various patterns of SQL Server 2000 I/O behavior to ensure rudimentary I/O safety.

The SQLIOStress utility can be downloaded from the Microsoft Web site. See the following article.

• How to Use the SQLIOStress Utility to Stress a Disk Subsystem such as SQL Server http://support.microsoft.com/default.aspx?scid=kb;en-us;231619

Important: The download contains a complete white paper with extended details about the utility.

SQLIO.exe is a SQL Server 2000 I/O utility used to establish basic benchmark testing results.

The SQLIO utility can be downloaded from the Microsoft Web site. See the following:

• SQLIO Performance Testing Tool (SQL Development) – Customer Available http://download.microsoft.com/download/f/3/f/f3f92f8b-b24e-4c2e-9e86-d66df1f6f83b/SQLIO.msi

Sam
Marking this as the answer since part of the "solution" was tied to setting up a particular disk system correctly, which greatly improved the throughput.
soren.enemaerke
+1  A: 

If you are running SQL 2005 or above, look into implementing snapshot isolation. You will not be able to get consistent results with NOLOCK.
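
A minimal sketch of enabling and using it (the database name is hypothetical):

-- One-time setup; "StationData" is a hypothetical database name.
ALTER DATABASE StationData SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Readers then opt in per session and see only committed data,
-- without blocking the concurrent inserts:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT SyncID FROM SyncTable
WHERE StationID = @StationID AND MeasuringTime = @MeasuringTime;
COMMIT TRANSACTION;

The related READ_COMMITTED_SNAPSHOT database option makes the default READ COMMITTED level use row versioning as well, so readers don't even need the explicit SET.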

Solving this on SQL 2000 is much harder.

Sam Saffron
+1  A: 

This is a great scenario for SQL Server 2005/2008 Enterprise's Partitioning feature. You can create a partition for each StationID, and each StationID's data can go into its own filegroup (if you want; it may not be necessary depending on your load). A rough sketch follows the list below.

This buys you some advantages with concurrency:

  • If you partition by StationID, then users can run select queries for StationIDs that aren't currently loading, and they won't run into any concurrency issues at all
  • If you partition by StationID, then multiple stations can insert data simultaneously without concurrency issues (as long as they're on different filegroups)
  • If you partition by SyncID range, then you can put the older data on slower storage.
  • If you partition by SyncID range, AND if your ranges are small enough (meaning not a range with thousands of SyncIDs), then you can do loads at the same time your users are querying without running into concurrency issues
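
A rough sketch of partitioning by StationID; the function and scheme names and the boundary values are all hypothetical, and real boundaries would come from your actual station list:

-- Hypothetical boundaries: three boundary points yield four partitions.
CREATE PARTITION FUNCTION pfStation (int)
AS RANGE LEFT FOR VALUES (100, 200, 300);

-- Everything maps to PRIMARY here; in production each partition could
-- go to its own filegroup, as suggested above.
CREATE PARTITION SCHEME psStation
AS PARTITION pfStation ALL TO ([PRIMARY]);

-- Creating the table on the scheme partitions its rows by StationID.
CREATE TABLE SyncTable (
    SyncID        int IDENTITY(1,1) NOT NULL,
    StationID     int NOT NULL,
    MeasuringTime datetime NOT NULL
) ON psStation (StationID);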

The scenario you're describing has a lot in common with nightly data warehouse loads. Microsoft did a technical reference project called Project Real that you might find interesting. They published it as a reference implementation, and you can read through the design docs and the implementation code to see how they pulled off really fast loads:

http://www.microsoft.com/technet/prodtechnol/sql/2005/projreal.mspx

Partitioning is even better in SQL Server 2008, especially around concurrency. It's still not a silver bullet - it requires manual design and maintenance by a skilled DBA. It's not a set-it-and-forget-it feature, and it does require Enterprise Edition, which costs more than Standard Edition. I love it, though - I've used it several times and it's solved specific problems for me.

Brent Ozar