Hi,

we have a 500 GB database that performs about 10,000 writes per minute.

This database has a requirement for real-time reporting. To service this need we have 10 reporting databases hanging off the main server.

The 10 reporting databases are all fed from the 1 master database using transactional replication.

The issue is that the server and replication are starting to fail with heavy PAGEIOLATCH_SH waits - these seem to be caused by the master database being overworked. We are upgrading the server to a quad proc / quad core machine.
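
(A rough way to confirm this kind of symptom - just a sketch against sys.dm_os_wait_stats, not our exact monitoring:)

    -- Top I/O-related waits since the last restart (or last stats clear).
    -- PAGEIOLATCH_SH near the top means sessions are waiting on data-file reads.
    SELECT TOP (10)
        wait_type,
        waiting_tasks_count,
        wait_time_ms,
        wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE 'PAGEIOLATCH%'
       OR wait_type LIKE 'WRITELOG%'
    ORDER BY wait_time_ms DESC;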

As this database and the need for reporting are only going to grow (20% growth per month), I wanted to know whether we should start looking at hardware (or another 3rd party application) to manage the replication - and if so, what should we use - OR whether we should change the topology so that, instead of the master database replicating to each of the reporting databases, the master replicates to reporting server 1, reporting server 1 replicates to reporting server 2, and so on.

Ideally the solution will cover us up to a 1.5 TB database with 100,000 writes per minute.

Any help greatly appreciated

+1  A: 

One common model is to have your main database replicate to 1 other node, then have that other node deal with replicating the data out from there. It takes the load off your main server and also has the benefit that if, heaven forbid, your reporting system's replication does max out, it won't affect your live database at all.
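
As a very rough sketch of what that middle node ends up looking like (all names below are invented, and the real setup also needs the distributor, agents and security configured), it subscribes to the master's publication and then republishes the same tables to each reporting server:

    -- On the intermediate (re-publishing) server, which already subscribes
    -- to the master's publication. Every name below is a placeholder.
    EXEC sp_replicationdboption
        @dbname  = N'ReportingStage',
        @optname = N'publish',
        @value   = N'true';

    EXEC sp_addpublication
        @publication = N'ReportingRepub',
        @status      = N'active',
        @repl_freq   = N'continuous';

    EXEC sp_addarticle
        @publication   = N'ReportingRepub',
        @article       = N'Orders',
        @source_owner  = N'dbo',
        @source_object = N'Orders';

    -- One push subscription per downstream reporting server.
    EXEC sp_addsubscription
        @publication       = N'ReportingRepub',
        @subscriber        = N'REPORT01',
        @destination_db    = N'Reporting',
        @subscription_type = N'push';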

I haven't gone much further than a handful of replicated hosts, but if you add enough nodes that your distribution node can't replicate it all it's probably sensible to expand the hierarchy so that your distributor is actually replicated to other distributors which then replicate to the nodes you report from.

How many databases you can have replicated off a single node will depend on how up-to-date your reporting data needs to be (e.g. whether it's fine to have it replicate only once a day or whether you need it up to the second) and how much data you're replicating at a time. It might be worth some experimentation to find out exactly how many nodes 1 distributor could power if it didn't have the overhead of actually running your main services.
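
If you want a hard number for "how up to date", tracer tokens are a handy way to measure latency through the publisher -> distributor -> subscriber chain while you experiment (the publication name below is just a placeholder):

    -- Run at the publisher, in the published database.
    DECLARE @token_id int;
    EXEC sys.sp_posttracertoken
        @publication     = N'ReportingPub',
        @tracer_token_id = @token_id OUTPUT;

    -- Give the agents a little while, then check how long the token took
    -- to reach the distributor and each subscriber.
    EXEC sys.sp_helptracertokenhistory
        @publication = N'ReportingPub',
        @tracer_id   = @token_id;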

Tim Schneider
+1  A: 

Depending on what you're inserting, a load of 100,000 writes/min is pretty light for SQL Server. In my book, I show an example that generates 40,000 writes/sec (2.4M/min) on a machine with simple hardware. So one approach might be to see what you can do to improve the write performance of your primary DB, using techniques such as batch updates, multiple writes per transaction, table valued parameters, optimized disk configuration for your log drive, etc.
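
As a trivial illustration of the transaction-size point (table and column names invented), the difference is between paying a log flush per row and paying it once per batch:

    -- Slow pattern: an implicit transaction (and a log flush) for every row.
    INSERT INTO dbo.Events (EventTime, Payload) VALUES (GETUTCDATE(), 'x');

    -- Faster pattern: hundreds of rows per explicit transaction,
    -- so the log is flushed once per batch instead of once per row.
    BEGIN TRANSACTION;
    INSERT INTO dbo.Events (EventTime, Payload) VALUES (GETUTCDATE(), 'x1');
    INSERT INTO dbo.Events (EventTime, Payload) VALUES (GETUTCDATE(), 'x2');
    -- ...more rows...
    COMMIT TRANSACTION;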

If you've already done as much as you can on that front, the next question I have is what kind of queries are you doing that require 10 reporting servers? Seems unusual, even for pretty large sites. There may be a bunch you can do to optimize on that front, too, such as offloading aggregation queries to Analysis Services, or improving disk throughput. While you can, scaling up is usually a better way to go than scaling out.
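
Analysis Services itself is more than a snippet, but as one example of taking aggregation load off the raw tables on the relational side, an indexed view materialises a summary once so reports can read that instead of scanning the detail table (the schema below is invented, and the summed column is assumed NOT NULL):

    -- Hypothetical pre-aggregated daily summary.
    CREATE VIEW dbo.vSalesByDay
    WITH SCHEMABINDING
    AS
    SELECT
        CONVERT(date, OrderDate) AS OrderDay,
        COUNT_BIG(*)             AS OrderCount,   -- required for indexed views with GROUP BY
        SUM(Amount)              AS TotalAmount   -- Amount assumed NOT NULL
    FROM dbo.Orders
    GROUP BY CONVERT(date, OrderDate);
    GO
    -- The unique clustered index is what materialises the view.
    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay ON dbo.vSalesByDay (OrderDay);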

I tend to view replication as a "solution of last resort." Once you've done as much optimization as you can, I would look into horizontal or vertical partitioning for your reporting requirements. One reason is that partitioning tends to result in better cache utilization, and therefore higher total throughput.
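
For the horizontal case, a date-based partition function and scheme is the usual shape (all names, types and boundary values below are invented):

    -- Hypothetical monthly partitioning of a large reporting table.
    CREATE PARTITION FUNCTION pfOrdersByMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('20100101', '20100201', '20100301');

    CREATE PARTITION SCHEME psOrdersByMonth
    AS PARTITION pfOrdersByMonth ALL TO ([PRIMARY]);  -- in practice, spread over filegroups/spindles

    CREATE TABLE dbo.OrdersPartitioned
    (
        OrderId   int IDENTITY(1,1) NOT NULL,
        OrderDate datetime          NOT NULL,
        Amount    money             NOT NULL
    ) ON psOrdersByMonth (OrderDate);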

If you finally get to the point where you can't escape replication, then the hierarchical approach suggested by fyjham is definitely a reasonable one.

In case it helps, I cover most of these issues in depth in my book: Ultra-Fast ASP.NET.

RickNZ
Do you have 40K writes/sec on storage with or without hardware "Sync" flags?
RocketSurgeon
I'm not sure what you mean by a hardware sync flag. My test (the final step in a performance optimization process) used a single vanilla 7200 rpm disk for the DB log, without enabling the disk write cache. The client was on a separate machine, connected by a 1 Gbps network. The test involves writing 20,000 rows. Each row has a datetime, a uniqueidentifier, a 33-byte varchar and an integer identity.
RickNZ
Thanks for the comments Rick. There are optimisations that could be carried out, but it is a bit chicken-and-egg at the moment as the server is crashing today and the underlying changes will take a month or two to resolve. The need for so many reporting servers is due to Analysis Services not being implemented properly etc., but again, resolving these issues will take time. Really appreciated the feedback.
David E
To RickNZ: Thank you for the details about the "writes" and the "write cache disabled" flag on your storage. I was asking to make sure that it's not thousands of write transactions we're talking about, but just one or a few small transactions per second with 40K rows in total.
RocketSurgeon
@Rocket: I walk readers through a number of steps and techniques to demonstrate the perf effect of batch size and transaction size, starting with one row per transaction per round-trip. Best perf for this example ended up with a batch size of 500 rows and 40 round-trips running in 490 ms, using table valued parameters (if you have the book, see page 278).
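
The general T-SQL side of that kind of TVP insert looks roughly like this (invented names, not the listing from the book); the client passes the batch as a single structured parameter, so each round-trip carries the whole batch:

    -- The table type carries only the data columns; the identity stays on the target table.
    CREATE TYPE dbo.EventRow AS TABLE
    (
        EventTime datetime         NOT NULL,
        EventGuid uniqueidentifier NOT NULL,
        Payload   varchar(33)      NOT NULL
    );
    GO
    CREATE PROCEDURE dbo.InsertEvents
        @rows dbo.EventRow READONLY
    AS
    BEGIN
        SET NOCOUNT ON;
        -- One round-trip inserts the whole batch (e.g. 500 rows) as one statement.
        INSERT INTO dbo.Events (EventTime, EventGuid, Payload)
        SELECT EventTime, EventGuid, Payload FROM @rows;
    END;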
RickNZ
A: 

Check that your publisher and distributor's transaction log files don't have too many VLFs (Virtual Log Files) as detailed here (step 8):

http://www.sqlskills.com/BLOGS/KIMBERLY/post/8-Steps-to-better-Transaction-Log-throughput.aspx
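
A quick way to count VLFs is DBCC LOGINFO, which returns one row per VLF in the current database's log (undocumented, but widely used for exactly this check):

    -- Run in the publisher database and in the distribution database;
    -- a row count in the hundreds or thousands is the symptom described in the article.
    DBCC LOGINFO;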

If your distribution database is co-located with your publisher database, consider moving it to its own dedicated server.
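
If you do split it out, the publisher then just needs pointing at the remote distributor - a very rough sketch (the distributor box has to be configured with its own distribution database first, and both names below are placeholders):

    -- Run at the publisher once the remote server is set up as a distributor.
    EXEC sp_adddistributor
        @distributor = N'DISTSRV01',
        @password    = N'<distributor admin password>';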

SuperCoolMoss