views: 72
answers: 3
I have a low latency system that receives UDP messages. Depending on the message, the system responds by sending out 0 to 5 messages. Figuring out each possible response takes 50 us (microseconds), so if we have to send 5 responses, it takes 250 us.

I'm considering splitting the system up so that each possible response is calculated by a different thread, but I'm curious about the minimum "work time" needed to make that better. While I know I need to benchmark this to be sure, I'm interested in opinions about the minimum piece of work that should be done on a separate thread.

If I have 5 threads waiting on a signal to do 50 us of work, and they don't contend much, will the total time before all 5 are done be more or less than 250 us?
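The question above can be turned into a micro-benchmark. This is a minimal sketch, not the asker's actual system: `work()` is a hypothetical stand-in that busy-loops for roughly 50 us in place of the real response calculation, and it assumes enough free cores for all five tasks to run concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOutBench {
    // Simulate roughly 50 us of compute-bound work with a busy loop.
    static void work() {
        long end = System.nanoTime() + 50_000;
        while (System.nanoTime() < end) { /* spin */ }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            tasks.add(() -> { work(); return null; });
        }
        pool.invokeAll(tasks);                 // warm up: create the pool threads before timing

        long t0 = System.nanoTime();
        for (int i = 0; i < 5; i++) work();    // sequential: roughly 5 x 50 us
        long seqUs = (System.nanoTime() - t0) / 1_000;

        t0 = System.nanoTime();
        pool.invokeAll(tasks);                 // parallel: ~50 us of work plus hand-off overhead
        long parUs = (System.nanoTime() - t0) / 1_000;

        System.out.println("sequential " + seqUs + " us, parallel " + parUs + " us");
        pool.shutdown();
    }
}
```

The interesting number is the parallel time: it is the 50 us of work plus whatever the scheduler charges for waking the workers and collecting the results, which is exactly the overhead the question is asking about.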

+1  A: 

Is that 50 us compute-bound or IO-bound? If compute-bound, do you have multiple cores available to run these in parallel?

Sorry - lots of questions, but your particular environment will affect the answer to this. You need to profile and determine what makes a difference in your particular scenario (perhaps run tests with differently sized thread pools?).

Don't forget (also) that threads take up a significant amount of memory by default for their stack (512 KB by default, IIRC), and that could affect performance too (through paging requests etc.)

Brian Agnew
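On the stack-size point above: in Java the per-thread stack can be requested explicitly via the four-argument `Thread` constructor. A small sketch (the 64 KiB figure is an illustration; the JVM treats it as a hint and may round it up or ignore it):

```java
public class SmallStack {
    public static void main(String[] args) throws Exception {
        // Fourth constructor argument requests a smaller stack than the
        // platform default (often 512 KiB to 1 MiB). The JVM may adjust it.
        Thread t = new Thread(null, () -> System.out.println("running"),
                              "small-stack", 64 * 1024);
        t.start();
        t.join();
    }
}
```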
I believe the 50 us is 40 us computation and 10 us IO. We have 8 cores available and would not have more worker threads than cores.
Ted Graham
Are you sure you mean microseconds and not milliseconds? Even at 250 microseconds you still haven't used up a full millisecond! You also say there will never be more than 8 threads. In that case, I would think that the overhead of creating and using a thread (not to mention the resulting code complexity) would greatly outweigh the benefits of parallel processing. Now if you mean milliseconds, then 250 is a quarter of a second and it may be more feasible to parallel process with threads.
BigMac66
Yes, we are talking about microseconds, not milliseconds. A modern machine can do a fair amount of work in 20 us.
Ted Graham
Don't forget that a thread pool will create and reuse threads, so you won't pay for thread creation repeatedly.
Brian Agnew
A: 

If you have more cores than threads, and if the threads are truly independent, then I would not be surprised if the multi-threaded approach took less than 250 us. Whether it does or not will depend on the overhead of creating and destroying threads. Your situation seems ideal, however.

Steve Emmerson
I won't be creating and destroying threads on each packet. I'll start the threads once and they will block until a packet arrives.
Ted Graham
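Ted's start-once-and-block design can be sketched with a `BlockingQueue` per worker. This is an illustration, not his actual code: the `Worker` class, its `inbox`, and the `handled` counter are all hypothetical names, and `handle` stands in for the ~50 us response calculation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Worker implements Runnable {
    // Hypothetical per-worker inbox; the real signalling mechanism isn't shown in the thread.
    private final BlockingQueue<byte[]> inbox = new ArrayBlockingQueue<>(1024);
    volatile int handled;                       // count of packets processed

    public void submit(byte[] packet) {
        inbox.offer(packet);                    // hand a packet to this worker
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] packet = inbox.take();   // block until a packet arrives;
                handle(packet);                 // no thread creation per packet
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit cleanly on shutdown
        }
    }

    private void handle(byte[] packet) {
        handled++;                              // stand-in for the response work
    }
}
```

The worker threads are started once at startup; each `take()` blocks until the dispatcher submits a packet, so the per-packet cost is only the queue hand-off, not thread creation.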
Ted, sounds perfect.
Steve Emmerson
+1  A: 

Passing data from one thread to another is very fast, 1-4 us, provided the receiving thread is already running on a core (and not sleeping, waiting, or yielding). If your thread has to wake up, that can take 15 us, and the task will also take longer because the cache is likely to have many misses. This means the task can take 2-3x longer.

Peter Lawrey
I don't understand. If only one thread is on a given core at a time, then how can a thread pass data to a "thread already running on the core"
Ted Graham
I think he means "already running on a core", but if it's already running on a core, perhaps it's in the middle of some other work and can't process the message immediately, unless it's busy-waiting
MarkR
I meant a second thread running on a second core, busy waiting. Even if the core isn't doing anything, it takes time to wake up and load caches etc. Note: your first core has the same issue. If you are not busy waiting, you may find it takes a lot longer to process requests than it does in tests which run continuously.
Peter Lawrey
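Peter's busy-waiting hand-off can be measured directly. This sketch (with an assumed `AtomicLong` mailbox; not his actual code) has a consumer spin until the producer publishes a timestamp, so the printed figure is the raw core-to-core hand-off latency with no sleep or park in the path:

```java
import java.util.concurrent.atomic.AtomicLong;

public class HandOff {
    // One-word mailbox: producer publishes a timestamp, consumer spins until it appears.
    static final AtomicLong mailbox = new AtomicLong(0);

    public static void main(String[] args) throws Exception {
        Thread consumer = new Thread(() -> {
            long sent;
            while ((sent = mailbox.get()) == 0) { /* busy wait: no sleep, no park */ }
            long latencyNs = System.nanoTime() - sent;
            System.out.println("hand-off " + latencyNs / 1000 + " us");
        });
        consumer.start();
        Thread.sleep(100);                      // let the consumer reach its spin loop
        mailbox.set(System.nanoTime());          // publish: this is the "signal"
        consumer.join();
    }
}
```

Replacing the spin loop with a lock or a `wait`/`notify` pair and re-running shows the wake-up cost Peter describes, since the consumer then has to be rescheduled and refill its caches before it can read the timestamp.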