views: 35 · answers: 1

I want to learn enough simple/practical queuing theory to model the behavior of a standard web application stack: a load balancer with multiple application-server backends.

Given a simple traffic pattern extracted from a tool like NewRelic (the percentage of traffic going to a given part of the application and the average response time for that part), I think I should be able to model different queueing behaviors as a function of load-balancer configuration, number of app servers, and queuing model.
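
If I remember the basics correctly, the single-worker case reduces to something like the following (an M/M/1 sketch only, assuming Poisson arrivals and exponentially distributed service times; lambda would come from overall throughput times the traffic percentage, and mu from the inverse of the average service time):

    \rho = \lambda / \mu                        % utilization; must stay below 1 for a stable queue
    W_q  = \rho / (\mu - \lambda)               % mean time spent waiting in the queue
    W    = W_q + 1/\mu = 1 / (\mu - \lambda)    % mean total response time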

Can anyone point me to the introductory queuing-theory material I would need to represent this system mathematically? I'm embarrassed to say I knew how to do this as an undergrad but have since forgotten all of the fundamentals.

My goal is to model different load-balancer and app-server queuing strategies and measure the results.

For example, it seems clear that an N-Mongrel Ruby on Rails stack, with a separate queue on each Mongrel, will have worse latency/wait time than a Unicorn/Passenger setup with a single queue shared by the whole group of app workers; a rough simulation sketch of that comparison follows.
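
Here is the kind of comparison I have in mind, as a rough discrete-event simulation (a sketch only, in Python for illustration: the worker count, arrival rate, and service rate are made-up numbers, and it assumes Poisson arrivals and exponential service times):

    import heapq
    import random

    random.seed(42)

    N_WORKERS = 4          # assumed number of app workers
    ARRIVAL_RATE = 30.0    # requests/sec (made up; derive from NewRelic throughput)
    SERVICE_RATE = 10.0    # requests/sec per worker = 1 / average service time
    N_REQUESTS = 200000

    # One shared trace of arrivals and service demands, so both policies
    # see exactly the same workload.
    arrivals, t = [], 0.0
    for _ in range(N_REQUESTS):
        t += random.expovariate(ARRIVAL_RATE)
        arrivals.append(t)
    services = [random.expovariate(SERVICE_RATE) for _ in range(N_REQUESTS)]

    def round_robin(arrivals, services, n):
        """One FIFO queue per worker, requests dealt out blindly in rotation
        (the N-Mongrel style)."""
        free_at = [0.0] * n
        total_wait = 0.0
        for i, (a, s) in enumerate(zip(arrivals, services)):
            w = i % n
            start = max(a, free_at[w])       # wait behind that worker's own backlog
            total_wait += start - a
            free_at[w] = start + s
        return total_wait / len(arrivals)

    def shared_queue(arrivals, services, n):
        """A single FIFO queue feeding whichever worker frees up first
        (the Unicorn/Passenger style)."""
        free_at = [0.0] * n                  # min-heap of worker-available times
        heapq.heapify(free_at)
        total_wait = 0.0
        for a, s in zip(arrivals, services):
            earliest = heapq.heappop(free_at)
            start = max(a, earliest)
            total_wait += start - a
            heapq.heappush(free_at, start + s)
        return total_wait / len(arrivals)

    print("mean queue wait, round-robin : %.4f sec" % round_robin(arrivals, services, N_WORKERS))
    print("mean queue wait, shared queue: %.4f sec" % shared_queue(arrivals, services, N_WORKERS))

I would expect the gap to widen further with heavier-tailed service times, which is exactly the slow-request case I care about.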

Answer (+1):

I can't point you at theory, but there are a few basic methods in popular usage:

  • Blind (linear or weighted) round-robining - requests are cycled through n servers, maybe according to some weighting. Each backend maintains a request queue. A slow-running request backs up that worker's request queue. A worker that stops returning results is eventually dropped out of the balancer pool, with all requests currently queued on it getting dropped. This is common for haproxy/nginx balancing setups.

  • Global pooling - a master queue maintains a list of requests, and workers report when they are free to accept a new request. The master hands the front of the queue off to the next available worker. If a worker goes down, only the request it was actively handling is lost. This gives slightly diminished performance under ideal circumstances (all workers up and returning requests quickly), since the queue master and the backends have to communicate before a job is actually handed off, but it has the benefit of naturally routing around slow, dead, or stalled workers. Passenger uses this balancing algorithm by default, and haproxy uses a variant of it in its "leastconn" balancing algorithm.

  • Hashed balancing - some component of the request is hashed, and the resulting hash determines which backend to use. memcached uses this sort of strategy for sharded setups. The downside is that if your cluster configuration changes, the hash-to-backend mapping changes with it, and most keys suddenly map to different backends than before. In the case of memcached specifically, that likely means invalidating most or all of your cached data (reddit suffered some massive performance problems recently due to exactly this sort of problem); there is a short sketch of the effect after this list.

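To make the remapping problem concrete, here is a toy sketch (naive modulo hashing; the server names and key count are made up):

    import hashlib

    def backend_for(key, servers):
        # Naive hashed balancing: hash the key, take it modulo the pool size.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return servers[h % len(servers)]

    old_pool = ["cache1", "cache2", "cache3"]
    new_pool = old_pool + ["cache4"]   # the cluster configuration changes

    keys = ["user:%d" % i for i in range(10000)]
    moved = sum(backend_for(k, old_pool) != backend_for(k, new_pool) for k in keys)
    print("%.0f%% of keys now map to a different backend" % (100.0 * moved / len(keys)))

With plain modulo hashing, most keys move when the pool size changes (roughly three quarters of them in this example); consistent hashing (e.g. the ketama scheme commonly used with memcached clients) limits the churn to roughly the keys that land on the new server, which is the usual mitigation.
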
Generally speaking, for web apps, I tend to prefer the global pooling method, since it maintains the smoothest user experience when you have slow or dead workers.

Chris Heald
Thanks Chris. I agree with the discussion conceptually; my goal was to actually measure/model the two alternatives and see how they perform given different traffic patterns, especially with slow requests.
Winfield