views: 122 | answers: 3

I'm looking to build a scalable database solution for the back end of my website. I've been reading about database design lately, and I've come up with an idea of my own that might work. I think this is a novel way of maintaining n databases with synchronized data, but I could be wrong. So I'm asking SO to evaluate the idea and tell me whether it's crazy or not (or whether it already exists and is implemented).

In this scheme there is a group of server nodes. One node runs a query load balancer (let's call it A) and the rest run a typical DBMS (let's call those nodes N collectively).

Each N is disconnected from the others, i.e., a node in N doesn't need to communicate with any of the others. Each N has a connection to A only.

The process works like this:

  • All database queries are passed through A. (Let's assume for now that A has infinite throughput and processing ability.)
  • A inspects each query (Q) and determines whether it is an operation that will read from the database or one that will write to the database. (In SQL, a read would be a SELECT and a write would be an INSERT, UPDATE, or DELETE.)
  • If Q is a read operation, forward it to one of the nodes in N.
  • If Q is a write operation, forward it to all of the nodes in N.

Assuming it's implemented properly, this results in all of the nodes in N having synchronized database content, and queries that only read data need to be sent to just one node.
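
As a rough illustration (not part of the original scheme), a minimal sketch of A's routing step might look like the following. The node list, the `send_query` transport, and the keyword-based classification are all assumptions made for the sake of the example.

```python
import random

READ_VERBS = {"select"}
WRITE_VERBS = {"insert", "update", "delete"}

def send_query(node, sql):
    """Hypothetical transport; a real A would use an actual DB driver or wire protocol."""
    raise NotImplementedError

def route(query, nodes):
    """Dispatch a query the way the scheme above describes."""
    verb = query.lstrip().split(None, 1)[0].lower()
    if verb in READ_VERBS:
        # Read: any one synchronized node can answer.
        return [send_query(random.choice(nodes), query)]
    if verb in WRITE_VERBS:
        # Write: fan out to every node so their contents stay identical.
        return [send_query(node, query) for node in nodes]
    raise ValueError("cannot classify query: %r" % query)
```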

This idea seems to work especially well for me because my system has very few write operations (fewer than 1%).

So, a few questions on this idea:

  • Does a scheme like this make sense from a theoretical point of view?
  • If it does make sense, is there an existing implementation, either commercial or free?
+1  A: 

Your scheme only works with infinitely available nodes. How are you going to deal with node downtime? If a node is down for any reason and misses an update, it will serve dirty data the next time it is asked.

Remus Rusanu
That's not difficult to mitigate. When a node comes back online, it can either read all records with a timestamp greater than or equal to the latest write it knows of, or read all records. The load balancer makes this particularly easy for large N.
JoshJordan
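
To make the catch-up idea in the comment above concrete, here is a hedged sketch. It assumes every table carries a last_modified timestamp column and that the recovering node pulls rows through its connection to A; the table name, the column name, and the upsert helper are all hypothetical.

```python
def catch_up(local_conn, balancer_conn, last_known_write, tables=("records",)):
    """Replay rows written while this node was offline (sketch only)."""
    for table in tables:
        cur = balancer_conn.cursor()
        # Pull everything at or after the newest write this node saw before going down.
        cur.execute(
            "SELECT * FROM %s WHERE last_modified >= ?" % table,
            (last_known_write,),
        )
        for row in cur.fetchall():
            upsert(local_conn, table, row)

def upsert(conn, table, row):
    """Placeholder; real code would issue INSERT ... ON CONFLICT UPDATE or equivalent."""
    raise NotImplementedError
```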
How does a node decide which other node to read the missed records from? How can you guarantee that that node isn't missing data? Or any node for that matter?
Asaph
@Josh: that is actually *very* hard to mitigate. How would a node even *know* it missed an update? You'll soon find that *any* write becomes a two-phase commit across the N nodes in this scheme. Not exactly scalable.
Remus Rusanu
@JoshJordan: The question states: "Each N is disconnected from the others, i.e., a node in N doesn't need to communicate with any of the others. Each N has a connection to A only." - how is a restarting node going to "know the latest write" or "read all records"?
Mitch Wheat
@Asaph: It doesn't ask a node; it asks the load balancer. The load balancer handles the query. Every node, except those that have been offline, should be fully up to date. A node that has just come back online should not be marked available in the load balancer until it has received its updates.
JoshJordan
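
One way to picture the "not marked available until it has received its updates" rule is a small piece of bookkeeping inside A. This is only a sketch of the idea in the comment above, not an existing implementation; the class and method names are invented for illustration.

```python
class LoadBalancerState:
    def __init__(self):
        self.available = set()   # nodes eligible for reads and writes
        self.recovering = set()  # nodes still replaying missed writes

    def node_returned(self, node):
        # A returning node stays out of rotation until catch-up finishes.
        self.recovering.add(node)

    def catch_up_complete(self, node):
        self.recovering.discard(node)
        self.available.add(node)

    def read_candidates(self):
        # Only fully synchronized nodes may answer reads.
        return list(self.available)
```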
@Remus: It would not *know* it missed an update; it would *always assume* it missed updates if it was offline, and update itself accordingly.
JoshJordan
@Mitch: It never communicates with another node; it communicates with the load balancer, which we know it has a connection to. As for your latter question: you program it that way! Maybe I should just turn this into an answer.
JoshJordan
+6  A: 

The typical setup for many reads and few writes is to have a read/write master DB and n replicated slave DBs which are read-only. Replication is handled by the RDBMS. Read-only queries can be load balanced across all n read-only nodes, and if your read/write master goes down temporarily, at least your app will still be able to service read operations. You don't need a central "A" proxy to decide whether a query is a read or a write; the client issuing the query should be smart enough to know whether it's reading or writing. That way you won't be bottlenecked on your "A" server.

Your proposed setup has a distinct flaw: if you are writing to n nodes simultaneously, what happens when one or more of those writes fails?

Asaph
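
Not part of the answer above, but a minimal sketch of the client-side read/write split it describes, assuming DB-API-style connection objects for the master and the read-only slaves (the connection setup itself is omitted):

```python
import random

class SplitClient:
    """Route writes to the master and balance reads across read-only slaves."""

    def __init__(self, master_conn, slave_conns):
        self.master = master_conn
        self.slaves = slave_conns

    def write(self, sql, params=()):
        cur = self.master.cursor()
        cur.execute(sql, params)
        self.master.commit()

    def read(self, sql, params=()):
        # Any replica will do; replication keeps them close to the master.
        cur = random.choice(self.slaves).cursor()
        cur.execute(sql, params)
        return cur.fetchall()
```

With asynchronous replication, reads from a slave can lag the master slightly, which is usually acceptable when writes make up less than 1% of traffic.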
I think I'm going to end up using a master/slave system. Thanks to everyone who answered.
Mike
+1  A: 

Not a direct answer to your question, but SQL Server 2008 already supports something equivalent to what you are describing. It's called Peer-to-Peer Transactional Replication. I'm sure other RDBMSs do as well; I think MySQL calls it master-master replication.

Mitch Wheat