I'm developing an application under Linux that will need to support around 250 connections and transmit large files (100 MB+) over TCP sockets. The aim is to tune for throughput rather than latency. I want to keep 2x1Gbit Ethernet connections saturated at all times. These will be channel bonded.

It's expected that the application will be busy continuously, just throwing out data as quickly as possible. The connections will remain up most of the time, so unlike HTTP they won't be torn down often.

I've been looking at the various options for high performance, such as epoll, the sendfile API, etc., as well as AIO (which looks too immature and risky IMHO).

I've also been looking at the Boost.Asio API, which uses epoll underneath. I've used it before, but not for a high-performance application like this.

I have more than 4 processor cores available, so I can make use of them.

However, I read that Boost.Asio is not very good with multiple threads because of some locking in the reactor design. Is this likely to be an issue for me?

If I have lots of CPU cores available, should I just create that many threads or forked processes and pin one to each processor core?

What about locking, etc.? I'd like some design suggestions. I suspect my main bottleneck is going to be disk I/O, but nonetheless I want a good design up front without much rework later.

Any suggestions?

+1  A: 

IMHO your main problem is going to be disk I/O - file serving usually isn't CPU bound, so many cores aren't necessarily going to help much. Things get somewhat worse if you're serving lots of different files, as you seem to imply; at that point, the simultaneous reads from disk are going to cause you major pain.

I would try to cache as much data as possible in memory and serve from that in order to speed things up.

Timo Geusch
@Timo: there is no need to cache anything in memory on Linux. `sendfile()` plus Linux's own I/O cache already do a good job.
Dummy00001
+1  A: 

sendfile() is definitely the way to go if you're sending large amounts of sequential data from disk files. epoll() is unlikely to be particularly helpful - it primarily helps when you're dealing with large numbers of connections. 250 isn't very large at all, so plain old select() or poll() will likely be just as good.
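
For concreteness, here is a minimal sketch of a sendfile()-based send path on Linux. The serve_file helper and its error handling are illustrative assumptions, not part of the answer:

```c
/* Minimal sketch: serve one whole file over a connected socket
 * with sendfile(2).  Error handling trimmed to the essentials. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int serve_file(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel copies straight from the page cache to the
         * socket; no userspace buffer is involved. */
        ssize_t sent = sendfile(sock_fd, file_fd, &offset,
                                st.st_size - offset);
        if (sent <= 0) {
            close(file_fd);
            return -1;  /* caller decides how to handle EAGAIN etc. */
        }
    }
    close(file_fd);
    return 0;
}
```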

caf
+4  A: 

I'm developing an application under Linux that will need to support around 250 connections and transmit large files (100 MB+) over TCP sockets. The aim is to tune for throughput rather than latency. I want to keep 2x1Gbit Ethernet connections saturated at all times. These will be channel bonded.

Disk I/O is generally slower than the network, and 250 clients are nothing for modern CPUs.

And how large the files are isn't important. The real question is whether the total amount of data fits into RAM or not, and whether RAM can be extended so that the data fits. If the data fits into RAM, then do not bother over-optimizing: a dumb single-threaded server with sendfile() will do fine.

An SSD should be considered for storage, especially if reading data is the priority.

It's expected that the application will be busy continuously, just throwing out data as quickly as possible. The connections will remain up most of the time, so unlike HTTP they won't be torn down often.

"As quick as possible" is a recipe for a disaster. I'm responsible for at least one such multi-threaded disaster which simply can't scale because of the amount of disk seeks it causes.

Generally you might want to have a few (e.g. 4) disk-reading threads per storage device which would call read() or sendfile() for very large blocks, so that the OS has a chance to optimize the I/O. A few threads are needed since one wants to be optimistic that some data can be served from the OS's I/O cache in parallel.
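
A rough sketch of that idea: a small fixed pool of workers pulling jobs from a shared queue, each issuing one very large sendfile() per pass. All names here (disk_job, disk_worker) and the 4 MB block size are invented for illustration:

```c
#include <pthread.h>
#include <sys/sendfile.h>
#include <unistd.h>

#define DISK_THREADS 4           /* a few threads per storage device */
#define BLOCK_SIZE   (4 << 20)   /* very large blocks: 4 MB per call */

struct disk_job {
    int file_fd, sock_fd;
    off_t offset;
    off_t remaining;
    struct disk_job *next;
};

static struct disk_job *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

static void *disk_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct disk_job *job = queue_head;
        queue_head = job->next;
        pthread_mutex_unlock(&queue_lock);

        /* One large sendfile() per pass: the kernel can read ahead
         * sequentially instead of interleaving many tiny reads. */
        size_t chunk = job->remaining < BLOCK_SIZE
                     ? (size_t)job->remaining : BLOCK_SIZE;
        ssize_t sent = sendfile(job->sock_fd, job->file_fd,
                                &job->offset, chunk);
        if (sent > 0 && (job->remaining -= sent) > 0) {
            /* Re-queue the job; real code would first wait for
             * POLLOUT on the socket (see the next point). */
            pthread_mutex_lock(&queue_lock);
            job->next = queue_head;
            queue_head = job;
            pthread_cond_signal(&queue_cond);
            pthread_mutex_unlock(&queue_lock);
        }
        /* else: done or failed; real code would clean up here. */
    }
    return NULL;
}

void start_disk_workers(void)
{
    for (int i = 0; i < DISK_THREADS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, disk_worker, NULL);
    }
}
```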

Do not forget to also set a large socket send buffer. In your case it also makes sense to poll for writability of the socket: if the client can't receive as fast as you can read/send, there is no point in reading. The network pipe on your server might be fat, but the clients' NICs/disks are not.
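
A sketch of both pieces; the 4 MB buffer size is an arbitrary assumption (note that Linux doubles the value you pass to SO_SNDBUF, and caps it at net.core.wmem_max):

```c
#include <sys/socket.h>
#include <poll.h>

/* Enlarge the per-socket send buffer. */
static void tune_socket(int sock_fd)
{
    int sndbuf = 4 * 1024 * 1024;   /* assumption: tune for your setup */
    setsockopt(sock_fd, SOL_SOCKET, SO_SNDBUF,
               &sndbuf, sizeof(sndbuf));
}

/* Block until the client has drained enough of the buffer that
 * another read()/sendfile() is worth issuing. */
static int wait_writable(int sock_fd)
{
    struct pollfd pfd = { .fd = sock_fd, .events = POLLOUT };
    return poll(&pfd, 1, -1);   /* -1: no timeout */
}
```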

I've been looking at the various options for high performance, such as epoll, the sendfile API, etc., as well as AIO (which looks too immature and risky IMHO).

Virtually all FTP servers now use sendfile(). Oracle uses AIO, and Linux is their primary platform.

I've also been looking at the Boost.Asio API, which uses epoll underneath. I've used it before, but not for a high-performance application like this.

IIRC that is only for sockets. IMO any utility which facilitates handling of sockets is fine.

I have more than 4 processor cores available, so I can make use of them.

TCP is accelerated by the NICs, and disk I/O is largely done by the controllers themselves. Ideally your application would be idle, waiting for disk I/O.

However, I read that Boost.Asio is not very good with multiple threads because of some locking in the reactor design. Is this likely to be an issue for me?

Check out libevent as an alternative. You would likely need a limited number of threads, and only for sendfile(). And the number should be limited, since otherwise you would kill the throughput with disk seeks.
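
A minimal libevent 2.x skeleton, assuming a hypothetical on_writable callback that would issue the next chunk for that client:

```c
#include <event2/event.h>

static void on_writable(evutil_socket_t fd, short events, void *arg)
{
    (void)events; (void)arg;
    /* issue the next sendfile() chunk to fd here */
    (void)fd;
}

int run_loop(int client_fd)
{
    struct event_base *base = event_base_new();
    struct event *ev = event_new(base, client_fd,
                                 EV_WRITE | EV_PERSIST,
                                 on_writable, NULL);
    event_add(ev, NULL);               /* no timeout */
    return event_base_dispatch(base);  /* runs until broken out of */
}
```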

If I have lots of CPU cores available, should I just create that many threads or forked processes and pin one to each processor core?

No. Disks are most affected by seeks. (Have I repeated that a sufficient number of times?) And if you had many autonomous reading threads, you would lose the ability to control the I/O which is sent to the disk.

Consider the worst case: all read()s have to go to the disk == more threads, more disk seeks.

Consider the best case: all read()s are served from the cache == no I/O at all. Then you are working at the speed of RAM and probably do not need threads at all (RAM is faster than the network).

What about locking, etc.? I'd like some design suggestions. I suspect my main bottleneck is going to be disk I/O, but nonetheless...

That is a question with a very, very long answer which isn't going to fit here (nor do I have the time to write it). And it also depends largely on how much data you are going to serve, what kind of storage you are using, and how you access that storage.

If we take an SSD as storage, then any dumb design (like starting a thread for every client) will work fine. If you have real spinning media in the back-end, then you have to slice and queue the I/O requests from the clients, trying on one side to avoid starving clients and on the other to schedule the I/O in a way that causes the least possible number of seeks.
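
As a toy illustration of seek-aware scheduling (the io_req structure and pick_next are invented, and a file offset is only a crude proxy for physical placement on disk):

```c
#include <sys/types.h>
#include <stdlib.h>

struct io_req { off_t offset; /* ... per-client state ... */ };

/* Pick the pending request closest to the previously served offset:
 * a crude elevator pass over the outstanding requests. */
static size_t pick_next(struct io_req *reqs, size_t n, off_t last_off)
{
    size_t best = 0;
    off_t best_dist = -1;
    for (size_t i = 0; i < n; i++) {
        off_t d = reqs[i].offset > last_off
                ? reqs[i].offset - last_off
                : last_off - reqs[i].offset;
        if (best_dist < 0 || d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;   /* index of the request to serve next */
}
```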

I personally would start with a simple single-threaded design with poll() (or boost.asio, or libevent) in the main loop. If the data is cached, there is no point in starting a new thread. If the data has to be fetched from disk, single-threadedness ensures that I avoid seeks. Fill the socket buffer with the data read, then switch to waiting in POLLOUT mode to learn when the client has consumed the data and is ready to receive the next chunk. That means I would have at least three types of sockets in the main loop: the listening socket, client sockets I'm waiting for a request from, and client sockets I'm waiting on to become writable again.
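
A skeleton of that loop might look like this; the role bookkeeping and handler bodies are sketched, not a complete server:

```c
#include <poll.h>
#include <sys/socket.h>

#define MAX_FDS 256   /* 250 clients plus the listener */

enum role { LISTENER, AWAIT_REQUEST, AWAIT_DRAIN };

static struct pollfd fds[MAX_FDS];
static enum role    roles[MAX_FDS];
static int          nfds;

void main_loop(int listen_fd)
{
    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;
    roles[0] = LISTENER;
    nfds = 1;

    for (;;) {
        poll(fds, nfds, -1);
        for (int i = 0; i < nfds; i++) {
            if (!(fds[i].revents & (POLLIN | POLLOUT)))
                continue;
            switch (roles[i]) {
            case LISTENER:
                /* accept() the new client and add it in
                 * AWAIT_REQUEST mode (events = POLLIN) */
                break;
            case AWAIT_REQUEST:
                /* read the request, read()/sendfile() a chunk,
                 * then flip events to POLLOUT (AWAIT_DRAIN) */
                break;
            case AWAIT_DRAIN:
                /* client drained the buffer: send the next chunk,
                 * or flip back to POLLIN when the file is done */
                break;
            }
        }
    }
}
```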

I want a good design up front without much rework later.

Ah... sweet dreams......

Dummy00001
Thanks for the pointers. I guess the answer is never that simple, so I'll start with a simple solution and see how it performs.
Matt H
A: 

Do the simplest thing that could possibly work, as it sounds like even that is likely to have no performance problems that you could fix in your code anyway.

One thread per client sounds good; it makes the programming as simple as possible. Ideally, don't write a custom file server at all, but use an already-existing one - HTTP, rsync, etc. The rsync protocol is good for many small files, as it supports pipelining.

Having 250 threads really is no problem - indeed, 1000 would be fine too.
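
A sketch of the thread-per-client accept loop; handle_client is a placeholder for the custom protocol:

```c
#include <pthread.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <unistd.h>

static void *client_thread(void *arg)
{
    int fd = *(int *)arg;
    free(arg);
    /* handle_client(fd); -- serve files over this connection */
    close(fd);
    return NULL;
}

void accept_loop(int listen_fd)
{
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            continue;
        int *arg = malloc(sizeof(int));
        if (!arg) {
            close(fd);
            continue;
        }
        *arg = fd;
        pthread_t tid;
        /* one detached thread per connection */
        if (pthread_create(&tid, NULL, client_thread, arg) == 0) {
            pthread_detach(tid);
        } else {
            free(arg);
            close(fd);
        }
    }
}
```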

Depending on whether the files fit in RAM and how fast your I/O is, you may bottleneck on the network. If your network is just 1-2 Gbit/sec, it seems likely that your storage can beat it on sequential I/O, so the network will be the bottleneck.

MarkR
Thanks, I think the network will definitely be the bottleneck. I can't use existing protocols. This is all custom for good reasons.
Matt H