The "acceptable delay" entirely depends on your application. Dealing with everything on the same thread can indeed help if you've got very strict latency requirements. Fortunately most applications don't have requirements quite that strict.
Of course, if only one thread is able to receive requests, then tying up that thread to compute the response means you can't accept any other requests in the meantime. Depending on what you're doing, you can use asynchronous IO (etc.) to avoid the "thread per request" model, but it's significantly harder IMO, and you still end up with thread context switching.
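For what it's worth, here's a rough sketch of that style using Java's NIO.2 asynchronous channels (the port number and the echo behaviour are just made up for illustration) - the point is that a small internal pool of threads handles IO completions, rather than each connection tying up a thread of its own:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

public class AsyncEchoServer {
    public static void main(String[] args) throws Exception {
        // One accepting channel; completion handlers run on the default
        // asynchronous channel group's pool, not one thread per connection.
        AsynchronousServerSocketChannel server =
                AsynchronousServerSocketChannel.open().bind(new InetSocketAddress(8080));

        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            @Override
            public void completed(AsynchronousSocketChannel client, Void att) {
                server.accept(null, this); // keep accepting further connections
                ByteBuffer buffer = ByteBuffer.allocate(1024);
                client.read(buffer, buffer, new CompletionHandler<Integer, ByteBuffer>() {
                    @Override
                    public void completed(Integer bytesRead, ByteBuffer buf) {
                        if (bytesRead < 0) { return; } // client closed the connection
                        buf.flip();
                        client.write(buf); // echo back; fire-and-forget for brevity
                    }
                    @Override
                    public void failed(Throwable exc, ByteBuffer buf) { /* log and drop */ }
                });
            }
            @Override
            public void failed(Throwable exc, Void att) { /* log and drop */ }
        });

        Thread.currentThread().join(); // keep the JVM alive
    }
}
```

Even in this toy form, the callback-based code is noticeably more convoluted than the blocking equivalent would be, which is what I mean by "significantly harder".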
Sometimes it's appropriate to queue requests to avoid having too many threads processing them: if your handling is CPU-bound, it doesn't make much sense to have hundreds of threads - better to have a producer/consumer queue of tasks and distribute them across roughly one thread per core. That's basically what ThreadPoolExecutor will do if you set it up properly, of course. That doesn't work as well if your requests spend a lot of their time waiting for external services (including disks, but primarily other network services)... at that point either you use asynchronous execution models whenever you would otherwise make a core idle with a blocking call, or you take the thread context switching hit and have lots of threads, relying on the thread scheduler to make it work well enough.
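Coming back to the CPU-bound case, here's a minimal sketch of the "roughly one thread per core" setup (the queue capacity, rejection policy and dummy handler are just illustrative assumptions):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CpuBoundPool {
    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();

        // Fixed pool sized to the core count; the bounded queue holds pending
        // requests instead of spawning hundreds of threads. CallerRunsPolicy
        // gives you crude back-pressure when the queue fills up.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                cores, cores,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 10_000; i++) {
            final int request = i;
            // Queued, then run on one of the core-count worker threads.
            executor.execute(() -> handle(request));
        }

        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
    }

    // Stand-in for whatever CPU-bound work you're actually doing.
    private static void handle(int request) {
        long result = 0;
        for (int i = 0; i < 1_000_000; i++) {
            result += (long) i * request;
        }
    }
}
```

(`Executors.newFixedThreadPool(cores)` gets you most of the way there too, but it uses an unbounded queue, so you lose the back-pressure.)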
The bottom line is that latency requirements can be tough - in my experience they're significantly harder to meet than throughput requirements, because you can't just scale out to fix them. It really does depend on the context though.