We need to build a system capable of processing 40,000 messages per second. No messages can be lost in case of any software or hardware failures.

Each message is about 2-4 KB in size.

Processing a message consists of validating it, doing some simple arithmetic calculations, saving the result to a database, and (sometimes) sending notifications to other systems.

The preferred software technology is .NET.

What software and hardware patterns are the most suitable for such a task?

How much hardware will it require?

+2  A: 

The first thing I'd do is try to find out exactly what your requirements mean. "No messages can be lost in case of any software or hardware failures" is impossible. Suppose you write the message to 5000 different disks in 5000 different locations. If all of those disks fail simultaneously, you'll lose data, unavoidably.

Likewise, if you have a bug somewhere, that could lose data. It is impossible to design a solution that will always work in the face of a bug anywhere in the system.

Once you've decided the level of redundancy and reliability you really need, it'll be more feasible to help you. It'll also be easier for you to have confidence that you've hit that level of reliability.

Jon Skeet
+8  A: 
  1. Message queuing. Your process flow sounds like a prime target for it.
  2. Clustering / load balancing.
  3. Streamline your code

First thing I'd do is queue the notifications. Then I'd queue all database writes that don't need to return a value. Then I'd look at scaling out.

Other considerations:

  * Avoid a big, clunky framework that does far more work behind the scenes than you likely need.
  * Make use of caching and static variables wherever possible.

40,000 messages per second is doable, but when you add IO to the mix, it can be unpredictable even on super fast hardware with a ton of memory. Try to do as much out of band processing as you can. Where that fails, see if you can run multiple threads (on a multi-core or multi-proc machine) and look into multiple servers in a cluster if need be.
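As a rough illustration of the out-of-band idea, here is a minimal Python sketch (Python only to keep it short; the same shape applies in .NET). Everything in it is an assumption for illustration: the `validate` and `calculate` helpers, the message fields, and the in-memory queues, which stand in for a durable queue and a real database. In-memory queues lose their contents on a crash, so this shows the flow only, not the no-loss requirement.

```python
import queue
import threading

def validate(message):
    # Hypothetical rule: a message is valid if it carries a numeric "value".
    return isinstance(message.get("value"), (int, float))

def calculate(message):
    # Stand-in for the "simple arithmetic" step.
    return message["value"] * 2

def run_pipeline(messages, worker_count=4):
    inbox = queue.Queue()          # incoming messages
    notifications = queue.Queue()  # notifications are queued, not sent inline
    results = []                   # stand-in for the database
    results_lock = threading.Lock()
    sent = []                      # stand-in for calls to other systems

    def worker():
        while True:
            message = inbox.get()
            if message is None:    # sentinel: no more work
                break
            if not validate(message):
                continue
            result = calculate(message)
            with results_lock:
                results.append((message["id"], result))
            if message.get("notify"):
                notifications.put(message["id"])

    def notifier():
        # Runs separately so slow notifications never block processing.
        while True:
            message_id = notifications.get()
            if message_id is None:
                break
            sent.append(message_id)

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    notifier_thread = threading.Thread(target=notifier)
    for thread in workers:
        thread.start()
    notifier_thread.start()

    for message in messages:
        inbox.put(message)
    for _ in workers:
        inbox.put(None)            # one sentinel per worker
    for thread in workers:
        thread.join()
    notifications.put(None)
    notifier_thread.join()
    return sorted(results), sorted(sent)

demo = [{"id": i, "value": i, "notify": i % 2 == 0} for i in range(10)]
demo.append({"id": 99, "value": "bad"})  # fails validation and is skipped
results, sent = run_pipeline(demo)
print(len(results), "results stored;", len(sent), "notifications sent")
```

The point of the separate notifier is that the workers only pay the cost of an enqueue; how many workers to run is something to settle by measurement rather than by guessing.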

Edit:

I can't stress enough the benefits of load testing in a scenario like this. Make a simple prototype and load test. Refine the prototype until you get desired results. Then architect a final solution based on the prototype. Until you test for the desired performance level, you're guessing at the solution.
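A prototype load test can start very small. The sketch below is an assumption about what such a harness might look like, not a recommended tool: `handle` is a hypothetical placeholder for the real validate/calculate/save path, and the 4 KB payload and 40,000/s target are taken from the question. It runs on one thread with no I/O, so the number it prints is only an upper bound on what that handler can do.

```python
import time

TARGET_PER_SECOND = 40_000
PAYLOAD = bytes(4 * 1024)  # one 4 KB message body, all zero bytes

def handle(payload):
    # Hypothetical placeholder for validation plus arithmetic.
    return len(payload) > 0 and sum(payload[:16]) >= 0

def measure_throughput(handler, payloads):
    """Return messages handled per second for one pass over payloads."""
    start = time.perf_counter()
    for payload in payloads:
        handler(payload)
    elapsed = time.perf_counter() - start
    return len(payloads) / elapsed if elapsed > 0 else float("inf")

rate = measure_throughput(handle, [PAYLOAD] * 100_000)
meets_target = rate >= TARGET_PER_SECOND
print(f"{rate:,.0f} messages/s, meets target: {meets_target}")
```

Swapping the placeholder for real database writes and queue operations, one piece at a time, shows which step actually caps the rate.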

Chris
+2  A: 

4 KB * 40,000/s = 160 MB/s is quite some bandwidth.

You probably need that bandwidth in both directions, since the no-message-lost requirement means that all communicating parties both send and receive (messages one way, acknowledgements the other).

Divide that number by the average throughput of your network card, or the write speed of your harddisk, to find that this is going to be a highly parallel and redundant system.

You also need to benchmark your database operations and the calculations for each message, then multiply by 40,000 per second (or about 3.5 billion for a single day) to get an estimate of the required hardware.
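The arithmetic above can be written out as a short script. The link and disk throughput figures are illustrative assumptions, not measurements; substitute numbers from your own hardware.

```python
import math

MESSAGES_PER_SECOND = 40_000
MESSAGE_SIZE_KB = (2, 4)        # lower and upper bound from the question
SECONDS_PER_DAY = 24 * 60 * 60

def bandwidth_mb_per_s(rate, size_kb):
    """Raw payload bandwidth in MB/s, taking 1 MB = 1000 KB for simplicity."""
    return rate * size_kb / 1000

def units_needed(load_mb_per_s, unit_mb_per_s):
    """How many devices of a given sustained throughput cover the load."""
    return math.ceil(load_mb_per_s / unit_mb_per_s)

low = bandwidth_mb_per_s(MESSAGES_PER_SECOND, MESSAGE_SIZE_KB[0])
high = bandwidth_mb_per_s(MESSAGES_PER_SECOND, MESSAGE_SIZE_KB[1])
messages_per_day = MESSAGES_PER_SECOND * SECONDS_PER_DAY

# Assumed figures, for illustration only.
ASSUMED_LINK_MB_PER_S = 125.0   # a 1 Gbit/s link at its theoretical maximum
ASSUMED_DISK_MB_PER_S = 100.0   # hypothetical sustained write speed of one disk

print(f"payload bandwidth: {low:.0f}-{high:.0f} MB/s")
print(f"messages per day:  {messages_per_day:,}")
print(f"links needed:      {units_needed(high, ASSUMED_LINK_MB_PER_S)}")
print(f"disks needed:      {units_needed(high, ASSUMED_DISK_MB_PER_S)}")
```

These counts cover raw payload in one direction only. Redundant copies, acknowledgements, and protocol overhead all push the real requirement higher.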

I guess the .NET requirement will be the least of your problems.

devio
+1  A: 

If you're on a Microsoft stack, you will almost certainly need to use MSMQ (Microsoft Message Queuing). It has a lot of options you can configure for reliability or performance. Have a look at the MSMQ FAQ.

The bottleneck is not processing but disk I/O. Have a lot of RAM and do as much as you can in memory.

MSMQ manages its queue in memory, but if the hardware fails, everything in memory is lost. If you mark your messages as recoverable, they are written to disk, but then you can easily run into disk I/O bottlenecks.

aleemb
+1  A: 

My advice is to hire someone who has already built a similar system. Let them choose the architecture and the development tools. Dealing with such high transaction rates will require specialist hardware and software knowledge, and the cheapest way to acquire such knowledge is to pay money for it.

anon
+1  A: 

If you use MSMQ and mark the messages as recoverable, be very careful about reliably taking the messages off the queue. Make that process as failsafe as you can, because if something goes wrong, messages can pile up so fast that the drive will fill up in a fraction of a second and crash the system. Then all incoming messages will be lost. Ask me how I know. (I didn't create it, I just had to support it. Not fun.)
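One general way to make dequeuing safer is to remove a message only after it has been processed successfully, and to park messages that keep failing so they cannot block everything behind them. The sketch below shows that idea in plain Python with an in-memory `deque`; it is not MSMQ code, and `drain`, `max_attempts`, and the dead-letter list are hypothetical names chosen for illustration.

```python
from collections import deque

def drain(pending, handler, max_attempts=3):
    """Process messages in order, removing each only after handler succeeds.

    A message that fails max_attempts times in a row is moved to a
    dead-letter list so the queue keeps draining instead of piling up.
    """
    dead_letters = []
    attempts = 0
    processed = 0
    while pending:
        message = pending[0]                 # peek: the message stays queued
        try:
            handler(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(pending.popleft())
                attempts = 0
            continue                         # retry, or move on after parking
        pending.popleft()                    # remove only after success
        attempts = 0
        processed += 1
    return processed, dead_letters

def picky_handler(message):
    # Hypothetical handler that always fails on one poison message.
    if message == "bad":
        raise ValueError("cannot process")

backlog = deque(["a", "bad", "b"])
processed, dead = drain(backlog, picky_handler)
print(processed, "processed;", len(dead), "parked for inspection")
```

If the process crashes between the handler call and the removal, the message is processed again on restart, so the handler has to tolerate duplicates. That is the usual price of not losing messages.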

I never did figure out how to tell MSMQ to persist messages to a drive other than C:, but that would be a necessity. At least that way the system will be able to tell you there is a problem.

As was mentioned above, disk and the database will be the bottleneck. I think that MSMQ can handle that volume, especially if you avoid triggers and such.

IBM's MQ is probably better suited to the task.

R Ubben