I have a daemon to write in C that will need to handle 20-150K TCP connections simultaneously. They are long-running connections that rarely tear down. They have a very small amount of data in transit at any given time (rarely exceeding the MTU, even; it's a stimulus/response protocol), but response times are critical. I'm wondering what the current UNIX community is using to handle large numbers of sockets while minimizing response latency. I've seen designs revolving around multiplexing connections onto forked worker pools, a thread per connection, and static-sized thread pools. Any suggestions?

A: 

Several systems have been developed to improve on select(2) performance: kqueue, epoll, and /dev/poll. In all of these systems, you can have a pool of worker threads waiting for tasks; you are not forced to set up all the file handles over and over again when done with one of them.
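
For illustration, a minimal level-triggered epoll loop on Linux might look like the sketch below (error handling largely omitted; the echo write stands in for your real response logic, and the function assumes an already-listening socket). The point, compared to select(2), is that each descriptor is registered with the kernel exactly once instead of rebuilding an fd_set on every iteration:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Single-threaded event loop over an already listening socket. */
void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* registered once */

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                    /* new connection */
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN,
                                           .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {                                  /* stimulus arrived */
                char buf[1500];          /* requests fit within one MTU */
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0)
                    close(fd);           /* closing removes it from epoll */
                else
                    write(fd, buf, r);   /* echo stands in for the reply */
            }
        }
    }
}
```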

Martin v. Löwis
A: 

Do you have to start from scratch? You could use something like Gearman.

sean riley
To me it looks like Gearman would introduce a lot of latency (because of the separate job server).
cmeerw
I was thinking the same thing... it's likely fine for non-critical response times and job batching.
Obi
+7  A: 

The easiest suggestion is to use libevent; it makes it easy to write a simple non-blocking single-threaded server that would comply with your requirements.
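
A rough sketch of such a server, assuming the libevent 2.x bufferevent API (port 5000 and the echo reply are placeholders for your real protocol; compile with -levent):

```c
#include <event2/listener.h>
#include <event2/bufferevent.h>
#include <event2/event.h>
#include <netinet/in.h>
#include <string.h>

static void on_read(struct bufferevent *bev, void *ctx)
{
    char buf[1500];                        /* requests fit in one MTU */
    size_t n;
    while ((n = bufferevent_read(bev, buf, sizeof buf)) > 0)
        bufferevent_write(bev, buf, n);    /* respond to the stimulus */
}

static void on_event(struct bufferevent *bev, short events, void *ctx)
{
    if (events & (BEV_EVENT_EOF | BEV_EVENT_ERROR))
        bufferevent_free(bev);             /* drops the connection */
}

static void on_accept(struct evconnlistener *l, evutil_socket_t fd,
                      struct sockaddr *sa, int salen, void *ctx)
{
    struct event_base *base = evconnlistener_get_base(l);
    struct bufferevent *bev =
        bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
    bufferevent_setcb(bev, on_read, NULL, on_event, NULL);
    bufferevent_enable(bev, EV_READ | EV_WRITE);
}

int main(void)
{
    struct event_base *base = event_base_new();
    struct sockaddr_in sin;
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_port = htons(5000);            /* placeholder port */
    evconnlistener_new_bind(base, on_accept, NULL,
                            LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE,
                            -1, (struct sockaddr *)&sin, sizeof sin);
    return event_base_dispatch(base);      /* single-threaded event loop */
}
```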

If the processing for each response takes some time, or if it uses some blocking API (like almost anything involving a DB), then you'll need some threading.

  • One option is worker threads: you spawn a set of threads, each listening on some queue for work. These can be separate processes instead of threads, if you like; the main difference is the communication mechanism used to tell the workers what to do.

  • A different way is to use several threads and give each of them a portion of those 150K connections. Each has its own event loop and works mostly like the single-threaded server, except for the listening port, which is handled by a single thread. This helps spread the load between cores, but if you use a blocking resource, it will block all the connections handled by that specific thread.

libevent lets you use the second way if you're careful; but there's also an alternative: libev. It's not as well known as libevent, but it specifically supports the multi-loop scheme.
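
A rough sketch of that multi-loop scheme, assuming libev 4.x (the worker_t type, queue size, and round-robin handoff are illustrative choices, not anything libev prescribes; link with -lev -lpthread):

```c
#include <ev.h>
#include <pthread.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <unistd.h>

#define NWORKERS 4
#define QSIZE    1024

typedef struct {                     /* illustrative type, not libev's */
    struct ev_loop *loop;
    ev_async        wakeup;          /* cross-thread notification */
    pthread_mutex_t lock;
    int             queue[QSIZE];    /* fds waiting to be adopted */
    unsigned        head, tail;
} worker_t;

static worker_t workers[NWORKERS];

static void handle_read(struct ev_loop *loop, ev_io *w, int revents)
{
    char buf[1500];                  /* requests rarely exceed one MTU */
    ssize_t n = read(w->fd, buf, sizeof buf);
    if (n <= 0) {                    /* peer closed, or error: clean up */
        ev_io_stop(loop, w);
        close(w->fd);
        free(w);
        return;
    }
    write(w->fd, buf, n);            /* echo stands in for the real reply */
}

/* Runs inside a worker thread whenever the acceptor signals it. */
static void on_wakeup(struct ev_loop *loop, ev_async *w, int revents)
{
    worker_t *self = w->data;
    pthread_mutex_lock(&self->lock);
    while (self->head != self->tail) {
        int fd = self->queue[self->head++ % QSIZE];
        ev_io *watcher = malloc(sizeof *watcher);
        ev_io_init(watcher, handle_read, fd, EV_READ);
        ev_io_start(loop, watcher);  /* connection now lives on this loop */
    }
    pthread_mutex_unlock(&self->lock);
}

static void *worker_main(void *arg)
{
    worker_t *self = arg;
    ev_run(self->loop, 0);           /* each thread spins its own loop */
    return NULL;
}

/* The single accepting thread hands each new fd to a loop, round-robin. */
static void on_accept(struct ev_loop *loop, ev_io *w, int revents)
{
    static unsigned next;
    int fd = accept(w->fd, NULL, NULL);
    if (fd < 0)
        return;
    worker_t *wk = &workers[next++ % NWORKERS];
    pthread_mutex_lock(&wk->lock);
    wk->queue[wk->tail++ % QSIZE] = fd;    /* overflow unchecked: sketch */
    pthread_mutex_unlock(&wk->lock);
    ev_async_send(wk->loop, &wk->wakeup);  /* wake that thread's loop */
}

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { 0 };
    sa.sin_family = AF_INET;
    sa.sin_port   = htons(5000);           /* placeholder port */
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    bind(lfd, (struct sockaddr *)&sa, sizeof sa);
    listen(lfd, SOMAXCONN);

    for (int i = 0; i < NWORKERS; i++) {
        worker_t *wk = &workers[i];
        wk->loop = ev_loop_new(EVFLAG_AUTO);
        pthread_mutex_init(&wk->lock, NULL);
        ev_async_init(&wk->wakeup, on_wakeup);
        wk->wakeup.data = wk;
        ev_async_start(wk->loop, &wk->wakeup);
        pthread_t tid;
        pthread_create(&tid, NULL, worker_main, wk);
    }

    ev_io listener;
    ev_io_init(&listener, on_accept, lfd, EV_READ);
    ev_io_start(EV_DEFAULT, &listener);    /* acceptor on the default loop */
    ev_run(EV_DEFAULT, 0);
}
```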

Javier
You should consider both a worker-thread model and a worker-process model. In fact, if you want this to scale, you should really write code for both, threaded and process-based, and then time them. On a modern Linux kernel, process-switching overhead is similar to thread-switching overhead, but the app does not have to do any semaphore locking.
Michael Dillon
A: 

If you have system configuration access, don't over-do it: just set up some iptables/pf/etc. rules to load-balance connections across n daemon instances (processes), as this will work out of the box. Depending on how blocking the daemon's nature is, n should range from the number of cores on the system to several times that. This approach looks crude, but it can handle broken daemons and even restart them if necessary. Migration is also smooth, as you can start diverting new connections to another set of processes (for example, a new release, or a move to a new box) instead of interrupting service. On top of that you get features like source affinity, which can help significantly with caching and with contention from problematic sessions.
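
For instance, assuming an iptables build with the statistic match (all port numbers hypothetical), three instances of the daemon listening on ports 5001-5003 could share new connections arriving on port 5000; since only the first packet of a connection traverses the nat table, each connection sticks to one instance:

```
# 1/3 of new connections -> instance on port 5001
iptables -t nat -A PREROUTING -p tcp --dport 5000 \
    -m statistic --mode nth --every 3 --packet 0 -j REDIRECT --to-ports 5001
# half of the remaining 2/3 (i.e. another 1/3) -> port 5002
iptables -t nat -A PREROUTING -p tcp --dport 5000 \
    -m statistic --mode nth --every 2 --packet 0 -j REDIRECT --to-ports 5002
# whatever is left -> port 5003
iptables -t nat -A PREROUTING -p tcp --dport 5000 \
    -j REDIRECT --to-ports 5003
```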

If you don't have system access (or ops can't be bothered), you can use a load-balancer daemon (there are plenty of open-source ones) instead of iptables/pf/etc., again with n service daemons as above.

This approach also helps with separating port privileges: if the external service needs to listen on a low port (<1024), only the load balancer needs to run privileged (as admin/root), or the work happens in the kernel (in the iptables/pf case).

I've written several IP load balancers in the past, and they can be very error-prone in production. You don't want to support and debug that yourself. Also, operations and management will tend to second-guess your code more than external code.

alecco
Contrary to how it sounds... it's not a web server. :) It's a server for a lot of clients; the clients keep a TCP connection up, and have a set response they expect for their input.
Obi
You **read** a lot of Mad Magazine as a **child**, didn't you?
caf
@Obi iptables/pf work at the IP/TCP/UDP level, not HTTP. You can use a library (libevent?), threading, or straight OpenMP. The last can do quite a lot of inter-core balancing at specific critical points in the source. Also, Intel's recent Nehalem/i7 parts have hyper-threading again, but I don't know if ICC or GCC already support that with OpenMP (they did a few years ago, for the previous hyper-threading era in the early 2000s). IMHE threading gives O(n^2) performance degradation due to synchronization and cache mixups, and it usually comes with a lot of branch mispredictions. YMMV. @caf **What**?
alecco
I see what you're getting at as far as processes vs. threads, and yeah, using the packet-level firewall could be useful for migrations, but I think it would add more complexity than needed. Thanks for the suggestions though, I'll keep it in mind.
Obi
+2  A: 

If performance is critical then you'll really want to go for a multithreaded event-loop solution - i.e. a pool of worker threads to handle your connections. Unfortunately, there is no abstraction library to do this that works on most Unix platforms (note that libevent is only single-threaded, as are most of these event-loop libraries), so you'll have to do the dirty work yourself.

On Linux that means using edge-triggered epoll with a pool of worker threads (Windows has I/O completion ports, which also work fine in a multithreaded environment - I am not sure about other Unixes).
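
A rough sketch of that model (port and echo reply are placeholders; link with -lpthread): one epoll instance shared by a pool of threads, with EPOLLONESHOT added so that only one thread services a given socket at a time, re-arming the descriptor after each request:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#define NTHREADS 8

static int epfd;                         /* shared epoll instance */

static void rearm(int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
                              .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

static void *worker(void *arg)
{
    struct epoll_event ev;
    char buf[1500];

    for (;;) {
        if (epoll_wait(epfd, &ev, 1, -1) < 1)
            continue;
        int fd = ev.data.fd;
        for (;;) {                       /* edge-triggered: drain fully */
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                write(fd, buf, n);       /* echo stands in for the reply */
            } else if (n == 0 || errno != EAGAIN) {
                close(fd);               /* peer closed, or real error */
                fd = -1;
                break;
            } else {
                break;                   /* EAGAIN: wait for next edge */
            }
        }
        if (fd >= 0)
            rearm(fd);                   /* hand the fd back to the pool */
    }
    return NULL;
}

int main(void)
{
    epfd = epoll_create1(0);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { 0 };
    sa.sin_family = AF_INET;
    sa.sin_port   = htons(5000);         /* placeholder port */
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    bind(lfd, (struct sockaddr *)&sa, sizeof sa);
    listen(lfd, SOMAXCONN);

    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    for (;;) {                           /* main thread just accepts */
        int fd = accept(lfd, NULL, NULL);
        if (fd < 0)
            continue;
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
        struct epoll_event ev = { .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
                                  .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }
}
```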

BTW, I have done some work trying to abstract edge-triggered epoll on Linux and Windows I/O completion ports on http://nginetd.cmeerw.org (it is work in progress, but might provide some ideas).

cmeerw
+1  A: 

I think Javier's answer makes the most sense. If you want to test the theory out, then check out the Node JavaScript project.

Node is based on Google's V8 engine, which compiles JavaScript to machine code and is as fast as C for certain tasks. It is also based on libev and is designed to be completely non-blocking, meaning you don't have to worry about context switching between threads (everything runs on a single event loop). It is very similar to Erlang in that respect.

Writing high-performance servers in JavaScript is now really, really easy with Node. You could also, with a little bit of effort, write your custom code in C and create bindings for Node to call into it for your actual processing (look at the Node source to see how to do this - documentation is a little sketchy at the moment). As an uglier alternative, you could build your custom C code as a standalone application and use stdin/stdout to communicate with it.
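
For that stdin/stdout route, the C side can be a plain filter that Node spawns as a child process; a toy line-oriented version (the "ok" reply is just a placeholder for your real processing):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1500];                      /* one request per line */
    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\n")] = '\0';
        /* real processing would happen here */
        printf("ok %s\n", line);          /* placeholder response */
        fflush(stdout);                   /* Node reads a pipe, so flush */
    }
    return 0;
}
```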

I've tested Node myself with upwards of 150K connections with absolutely no issues (of course, you will need some serious hardware if all these connections are going to be communicating at once). A TCP connection in Node.js uses only 2-3 KB of memory on average, so you could theoretically handle 350-500K connections per 1 GB of RAM.

Note - Node.js is not currently supported on Windows, but it is only at an early stage of development and I'd imagine it will be ported at some stage.

Note 2 - you will have to ensure that the code you are calling into from Node does not block.

billywhizz