views:

89

answers:

3

I'm building a system which has the potential to require support for 500+ concurrent users, each making dozens of queries (selects, inserts AND updates) each minute. Based on these requirements and tables with many millions of rows I suspect that there will be the need to use database replication in the future to reduce some of the query load.

Having not used replication in the past, I am wondering if there is anything I need to consider in the schema design?

For instance, I was once told that it is necessary to use GUIDs for primary keys to enable replication. Is this true?
What special considerations or best practices for database design are there for a database that will be replicated?

Due to time constraints on the project I don't want to waste any time by implementing replication when it may not be needed. (I have enough definite problems to overcome at the moment without worrying about having to solve possible ones.) However, I don't want to have to make potentially avoidable schema changes when/if replication is required in the future.

Any other advice on this subject, including good places to learn about implementing replication, would also be appreciated.

+1  A: 

You may want to use GUIDs for primary keys - in a replicated system rows must be unique throughout your entire topology, and GUID PKs is one way of achieving this.

Here's a short article about use of GUIDs in SQL Server

flesh
+3  A: 

While every row must have a rowguid column, you are not required to use a Guid for your primary key. In reality, you aren't even required to have a primary key (though you will be stoned to death for failing to create one). Even if you define your primary key as a guid, not making it the rowguid column will result in Replication Services creating an additional column for you. You definitely can do this, and it's not a bad idea, but it is by no means necessary nor particularly advantageous.

Here are some tips:

  1. Keep table (or, rather, row) sizes small; unless you use column-level replication, you'll be downloading/uploading the entire contents of a row, even if only one column changes. Additionally, smaller tables make conflict resolution both easier and less frequent.
  2. Don't use sequential or deterministic algorithm-driven primary keys. This includes identity columns. Yes, Replication Services will handle identity columns and allocating key allotments by itself, but it's a headache that you don't want to deal with. This alone is a great argument for using a Guid for your primary key.
  3. Don't let your applications perform needless updates. This is obviously a bad idea to begin with, but this issue is made exponentially worse in replication scenarios, both from a bandwidth usage and a conflict resolution perspective.
Adam Robinson
+1  A: 

I'd say your real question is not how to handle replication, but how to handle scale out, or at least scale out for queryability. And while there are various answers to this conundrum, one answer will stand out: not using replication.

The problem with replication, specially with merge replication, is that writes gets multiplied in replication. Say you have a system which handles a load of 100 queries (90 reads and 10 writes) per second. You want to scale out and you choose replication. Now you have 2 systems, each handling 50 queries, 45 reads and 5 writes each. Now those writes have to be replicated so the actual number of writes is not 5+5, but 5+5 (original writes ) and then another 5+5 (the replica writes), so you have 90 reads and 20 writes. So while the load on each system was reduced, the ratio of writes and reads has increased. This not only changes the IO patterns, but most importantly it changes the concurency pattern of the load. Add a third system and you'll have 90 reads and 30 writes and so on and so forth. Soon you'll have more writes than reads and the replication update latency combined with the concurency issues and merge conflicts will derail your project. The gist of it is that the 'soon' is much sooner than you expect. Is soon enough to justify looking into scale up instead, since you're talking a scale out of 6-8 peers at best anyway, and 6-8 times capacity increase using scale up will be faster, much more simpler and possible even cheaper to start with.

And keep in mind that all these are just purely theorethical numbers. In practice what happens is that the replication infrastructure is not free, it adds its own load on the system. Writes needs to be tracked, changes have to be read, a distributor has to exists to store changes until distributed to subscribers, then changes have to be writes and mediated for possible conflicts. That's why I've seen very few deployments that could claim success with a replication based scale out strategy.

One alternative is to scale out only reads and here replication does work, usualy using transactional replication, but so does log-shipping or mirroring with a database snapshot.

The real alternative is partitioning (ie. sharding). Requests are routed in the application to the proper partition and land on the server containig the appropiate data. Changes on one partiton that need to be reflected on another partition are shipped via asynchronous (usually messaging based) means. Data can only be joined within a partition. For a more detailed discussion of what I'm talking about, read how MySpace does it. Needless to say, such a strategy has a major impact on the application design and cannot be simply glued in after v1.

Remus Rusanu
Thanks for the points on scaling. This is going to ba a major consideration. Unfortunately there is a requirement that we implement replication as part of the solution though.
Matt Lacey