views: 1810
answers: 4
I've developed a mini HTTP server in C++ using boost::asio, and now I'm load testing it with multiple clients, but I've been unable to get close to saturating the CPU. I'm running on a 4-CPU box and getting about 50% usage on one CPU, 20% on another, and the remaining two are idle (according to htop).

Details:

  • The server fires up one thread per core (see the sketch after this list)
  • Requests are received, parsed, processed, and responses are written out
  • The requests are for data, which is read out of memory (read-only for this test)
  • I'm 'loading' the server from two machines, each running a Java application with 25 threads sending requests
  • I'm seeing about 230 requests/sec throughput (this is application requests, which are composed of many HTTP requests)
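
For reference, the usual way to run "one thread per core" with boost::asio is a single io_service serviced by N threads all calling run(). Here is a minimal sketch of that pattern, assuming one shared io_service (acceptor setup elided; the OP's actual server may of course differ):

```
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

int main()
{
    boost::asio::io_service io_service;

    // Keep run() from returning while there is no pending work yet.
    boost::asio::io_service::work work(io_service);

    // ... set up the acceptor and start the first async_accept here ...

    // One thread per core, all servicing the same io_service.
    boost::thread_group threads;
    for (unsigned i = 0; i != boost::thread::hardware_concurrency(); ++i)
        threads.create_thread(
            boost::bind(&boost::asio::io_service::run, &io_service));
    threads.join_all();
}
```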

So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.

Ideas I've had:

  • The requests are very small and often fulfilled in a few ms; I could modify the client to compose and send bigger requests (perhaps using batching)
  • I could modify the HTTP server to use the Select design pattern; is this appropriate here?
  • I could do some profiling to try to understand what the bottleneck(s) are
+15  A: 

boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp, which means that only one thread at a time can call into the kernel's epoll syscall. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).

Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
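
For the curious, the multi-threaded edge-triggered approach looks roughly like the sketch below: several threads block in epoll_wait on the same epoll descriptor, and EPOLLONESHOT ensures each readiness event is delivered to exactly one thread, which re-arms the socket when it is done. This is an illustrative sketch of the general technique, not code from nginetd:

```
#include <sys/epoll.h>

int epfd;  // shared epoll descriptor, created once with epoll_create

void worker_loop()
{
    for (;;) {
        epoll_event ev;
        // Many threads can block here on the same epoll fd.
        int n = epoll_wait(epfd, &ev, 1, -1);
        if (n <= 0)
            continue;

        int fd = ev.data.fd;
        // ... read until EAGAIN (mandatory for edge-triggered mode) ...

        // EPOLLONESHOT disabled the fd after delivering the event;
        // re-arm it so the next event can be claimed by one thread.
        epoll_event rearm = {};
        rearm.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
        rearm.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &rearm);
    }
}
```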

BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.

cmeerw
Thanks for the info, cmeerw, that's interesting stuff.
Alex Black
(+1) cmeerw, I have an unanswered post relating to the performance of boost::asio in general on Windows and Linux. If you have read large sections of asio, please come and answer my post :P
Hassan Syed
I was really worried about this global lock. It is not as big an issue as it would seem; the bottleneck can only occur in high-throughput scenarios. However, when asio is running in epoll mode (Linux), it preemptively tries to write or read when the `async_*` call is issued. In a high-input scenario the socket will usually be ready for reading, letting `async_read` skip epoll entirely. You can't ask for better network performance than that.
caspin
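
What caspin describes is, in effect, a speculative non-blocking read: try the syscall first and only fall back to the reactor if the socket is not ready. An illustrative sketch of the idea (paraphrased, not asio's actual source; `complete_read` and `register_with_epoll` are hypothetical stand-ins for asio's internals):

```
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <cstddef>

void complete_read(std::size_t n);                  // hypothetical: runs the user's handler
void register_with_epoll(int fd, unsigned events);  // hypothetical: arms the reactor

void start_async_read(int fd, char* buf, std::size_t len)
{
    // Speculative attempt: the socket is non-blocking, so this returns
    // immediately whether or not data is available.
    ssize_t n = ::recv(fd, buf, len, 0);
    if (n >= 0) {
        // Data was already waiting: complete at once, with no trip
        // through epoll_wait (or the lock around it).
        complete_read(static_cast<std::size_t>(n));
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        // Not ready: fall back to the epoll reactor.
        register_with_epoll(fd, EPOLLIN);
    }
}
```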
I don't think that's the case. Yes, it looks like the epoll reactor holds a scoped lock for the entire duration of the run() function, but it's temporarily released (`lock.unlock();`) before calling into epoll_wait and locked again after epoll_wait returns (`lock.lock();`). Not sure why it's done this way instead of two scoped locks, though.
Alex B
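
To illustrate the structure described above, the reactor's run loop is shaped roughly like this (a paraphrased sketch of what Alex B describes, not the actual asio source):

```
#include <sys/epoll.h>
#include <mutex>

std::mutex reactor_mutex;  // stands in for the reactor's internal lock
int epoll_fd;              // the reactor's epoll descriptor

void reactor_run()
{
    std::unique_lock<std::mutex> lock(reactor_mutex);

    // ... queue maintenance happens under the lock ...

    // The lock is dropped only around the syscall itself, so threads
    // are not holding it while blocked inside epoll_wait.
    lock.unlock();
    epoll_event events[128];
    int n = ::epoll_wait(epoll_fd, events, 128, /*timeout=*/-1);
    lock.lock();

    // ... the n ready events are then dispatched under the lock again ...
    (void)n;
}
```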
@Alex Black bump, so that the previous comment reaches the OP. What were your results with this question? Did replacing boost::asio help?
Alex B
@Checkers: Sorry, I didn't go far enough with this to come to any conclusion.
Alex Black
A: 

From your comments on network utilization, you do not seem to have much network traffic.

3 + 2.5 MiB/sec adds up to about 5.5 MiB/sec, which is roughly 46 Mbps, in the 50 Mbps ball-park (compared to your 1 Gbps port).

I'd say you have one of the following two problems:

  1. Insufficient work-load (low request-rate from your clients)
  2. Blocking in the server (something is interfering with response generation)

Looking at cmeerw's notes and your CPU utilization figures (50% + 20% + 0% + 0%), it seems most likely to be a limitation in your server implementation. I second cmeerw's answer (+1).

nik
A: 

230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation - get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of unneeded locking may get things up to speed.

This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?

soru
Keep in mind the 230 requests/sec are 'application requests', each of which is composed of many actual HTTP requests.
Alex Black
There isn't much locking to get rid of (none in my code), but as cmeerw points out, boost::asio does some internal locking. The HTTP server does purely CPU-bound work, so not using the additional cores would be an expensive waste.
Alex Black
If the goal is just to saturate the CPU, do the work in one thread and have the other three calculate pi or something. Having multiple user-level threads won't make it easier or faster for the OS and I/O hardware to read and write network packets. Threads and cores are for computational work; if you aren't doing any, they can't possibly gain you anything, and they risk contention with whatever else the system is doing.
soru
As I said: "the HTTP server does purely CPU-bound work".
Alex Black
Except, demonstrably, it's not. The optimal solution is probably one thread doing I/O and 2 or 3 doing the parsing and so on (see the sketch below). But that's very likely premature optimisation until you can get your I/O properly asynchronously scheduled, so that you either saturate one CPU core or your network.
soru
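
A minimal sketch of the split soru suggests: one thread owns the network I/O, and a small worker pool does the CPU-bound parsing and response generation. `handle_request` and `on_read` here are hypothetical stand-ins, not the OP's code:

```
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

boost::asio::io_service io_service;  // serviced by the single I/O thread
boost::asio::io_service workers;     // serviced by the worker pool

void handle_request()
{
    // CPU-bound work goes here: parse the request, build the response,
    // then post the resulting write back to io_service.
}

void on_read()
{
    // Runs on the I/O thread when a read completes: hand the CPU-bound
    // part to the worker pool instead of doing it here.
    workers.post(&handle_request);
}

int main()
{
    // Keep both event loops alive while work comes and goes.
    boost::asio::io_service::work io_work(io_service);
    boost::asio::io_service::work worker_work(workers);

    boost::thread_group pool;
    for (int i = 0; i != 3; ++i)
        pool.create_thread(
            boost::bind(&boost::asio::io_service::run, &workers));

    io_service.run();  // network I/O stays on this one thread
    pool.join_all();
}
```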
I see what you're saying. Well, I'll fire up the server with 1 thread as a quick test and see what comes of that.
Alex Black
+1  A: 

As you are using EC2, all bets are off.

Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.

I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.

MarkR
This system is going to be deployed on EC2, so testing its performance on real hardware wouldn't be helpful, I don't think.
Alex Black
Mark's point is valid: for profiling, use a real machine, or at least a more controlled environment. Deploy to EC2 all you like, but understand that you are running in a VM image, which means that your "idle" CPU might just be because some other tenant on the box got all the CPU for a while. And that makes profiling difficult.
janm
(+1) I hate ill-informed downvotes.
Hassan Syed