I have a C#.NET application that needs to tell anywhere from 4,000 to 40,000 connected devices to perform a task all at once (or as close to simultaneously as possible).

The application works well; however, I am not satisfied with the performance. In a perfect world, as soon as I send the command I would like to see all of the devices respond simultaneously. Yet, there seems to be a delay as all the threads I have created spin up and perform the task.

I have used the .NET 4.0 ThreadPool, created my own solution using custom threads, and even tweaked the existing ThreadPool to allow more threads to execute at once.

I still want better performance, and that is why I am here. Any ideas? Comments? Suggestions? Thank you.

-Shaun

Let me add that the application notifies these 'connected devices' that they need to go listen for audio on a multicast address.

+9  A: 

You cannot execute 4000 threads simultaneously, let alone 40k. At best on a desktop box with hyperthreading, you might get up to 8 threads running simultaneously (this assumes quad core). Beyond that, threads are only pseudo-parallel: the scheduler time-slices them across the available cores. And that's not even digging into the issues of bus contention.

If you absolutely need simultaneity for 40k devices, you want some form of hardware synchronization.

Randolpho
And I'd be willing to bet any hardware synchronization system that can execute 40k nodes simultaneously is going to be uber-expensive.
Randolpho
Appreciate your response. I would like to think that this is possible, but only because I believe I have seen some applications do it. That said, maybe it was hardware-based like you said. Thanks.
SCMcDonnell
+10  A: 

A dual-core hyperthreaded processor MAY be able to execute 4 threads simultaneously, depending on what each thread is doing (no contention on IO or memory access, etc). A quad-core hyperthreaded processor, perhaps 8. But 40K just can't physically happen.

If you want near simultaneous, you're better off spinning up just as many threads as the computer has free cores and having each thread fire off notifications then end. You'll get rid of a bunch of context switching this way.
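A minimal sketch of that idea, assuming the per-device send can be wrapped in a delegate. `PerCoreNotifier`, `Partition` and `NotifyAll` are hypothetical names, and the counter increment stands in for whatever call actually reaches a device:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

static class PerCoreNotifier
{
    // Splits the device list into at most `workers` contiguous slices.
    public static List<List<T>> Partition<T>(IList<T> items, int workers)
    {
        var slices = new List<List<T>>();
        int chunk = (items.Count + workers - 1) / workers; // ceiling division
        for (int start = 0; start < items.Count; start += chunk)
            slices.Add(items.Skip(start).Take(chunk).ToList());
        return slices;
    }

    // Starts one thread per slice; each thread notifies its devices, then ends.
    public static void NotifyAll<T>(IList<T> devices, Action<T> notify)
    {
        int workers = Math.Max(1, Environment.ProcessorCount);
        var threads = Partition(devices, workers)
            .Select(slice => new Thread(() => { foreach (var d in slice) notify(d); }))
            .ToList();
        threads.ForEach(t => t.Start());
        threads.ForEach(t => t.Join());
    }

    static void Main()
    {
        var devices = Enumerable.Range(1, 40000).ToList(); // stand-in device ids
        int sent = 0;
        NotifyAll(devices, id => Interlocked.Increment(ref sent)); // stand-in for the real send
        Console.WriteLine(sent);
    }
}
```

The thread count here follows `Environment.ProcessorCount` rather than the number of devices, so the scheduler has little reason to context-switch between workers.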

Or, look elsewhere. As SB recommended in the comments, use a UDP multicast to notify listening machines that they should do something.
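Since the devices are already expected to listen for audio on a multicast address, the command itself could travel the same way. A rough sketch, assuming the devices have joined a control group; the group address `239.1.2.3`, port `5000` and the `LISTEN` command text are made up for illustration:

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

static class MulticastNotifier
{
    // Hypothetical group and port; use whatever the devices actually join.
    static readonly IPEndPoint Group = new IPEndPoint(IPAddress.Parse("239.1.2.3"), 5000);

    // Encodes the command text as the datagram payload.
    public static byte[] BuildCommand(string command)
    {
        return Encoding.ASCII.GetBytes(command);
    }

    static void Main()
    {
        byte[] payload = BuildCommand("LISTEN"); // hypothetical command text
        using (var client = new UdpClient(AddressFamily.InterNetwork))
        {
            // TTL 1 keeps the datagram on the local subnet; raise it if routers forward multicast.
            client.Client.SetSocketOption(SocketOptionLevel.IP, SocketOptionName.MulticastTimeToLive, 1);
            client.Send(payload, payload.Length, Group); // one send, delivered to every group member
        }
    }
}
```

One datagram reaches every subscribed device, so the sender does a constant amount of work regardless of how many devices there are. UDP gives no delivery guarantee, so a device that misses the packet would need some other way to catch up.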

Philip Rieck
I'd give you another +1 for suggesting a thread limit, if I could.
Randolpho
+3  A: 

The overhead of creating thousands of threads is (very) significant; I would seek an alternative solution. This sounds like a job for asynchronous IO: your computer presumably only has one network connection, so no more than one message can be sent at a time - threads cannot improve on this!

Rafe
+2  A: 

Am I correct in guessing that you're using a synchronous API call on your device, which is why it must be executed in a thread? Does the API have an asynchronous version of the call? If the device API can really support 40k+ devices, then it should. It should also have internal handling of whatever wait handles (or equivalent) are required to synchronize the return data for callback. This isn't something you can handle at the client application side; you don't have enough visibility of the underlying implementation of the device API to know how to parallelize the tasks. As you've discovered, creating 40k threads with blocking calls doesn't cut it.

Dan Bryant
+1  A: 

You should do async IO to the devices. This is very efficient and uses a different (larger) set of threads, the IO thread pool, to handle some of the work. The devices will certainly receive the commands much faster, and the IO thread pool will handle the replies (if any).
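As an illustration only, here is what that could look like if each device accepted a UDP datagram, which the question does not confirm. It uses the Task-based `UdpClient.SendAsync`, which needs .NET 4.5 or later; on .NET 4.0 the `BeginSend`/`EndSend` pair plays the same role. `AsyncNotifier` and `NotifyAllAsync` are hypothetical names, and the loopback endpoints stand in for real device addresses:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

static class AsyncNotifier
{
    // Starts every send without blocking a thread per device,
    // and returns a task that completes when all sends have finished.
    public static Task NotifyAllAsync(UdpClient client, IEnumerable<IPEndPoint> devices, byte[] payload)
    {
        var sends = devices
            .Select(ep => client.SendAsync(payload, payload.Length, ep))
            .ToList();
        return Task.WhenAll(sends);
    }

    static void Main()
    {
        // Stand-in endpoints on loopback; real code would use the device addresses.
        var devices = Enumerable.Range(0, 100)
            .Select(i => new IPEndPoint(IPAddress.Loopback, 20000 + i))
            .ToList();
        byte[] payload = Encoding.ASCII.GetBytes("LISTEN"); // hypothetical command text

        using (var client = new UdpClient())
        {
            NotifyAllAsync(client, devices, payload).Wait();
        }
        Console.WriteLine("sent to " + devices.Count + " endpoints");
    }
}
```

No thread is parked per device: the calling thread only queues the sends, and the completions are picked up by the runtime's IO threads.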

pm100
+3  A: 

It sounds like you have some control over what software runs on each device. In which case, you could look to HPC usage and architect your devices (nodes) hierarchically and/or use MPI to execute your remote processes.

For the hierarchy example: designate, say, 8 nodes as primary masters, each with 8 slave nodes. Each slave can in turn act as a master with 8 slaves of its own (you might need to look at an automated subscription algorithm to set this up). You will need a hierarchy 6 deep to cover 40,000 nodes. Each master has a small portion of code running continually, waiting for instructions to pass to its slaves.

All you then do is pass the instruction to the 8 primary masters, and it will be propagated through the 'cluster' on the wire asynchronously by the masters. The instruction only has to be passed on a maximum of 5 times, and thus will be propagated very quickly.
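The depth figure can be checked with a little arithmetic: with a fan-out of 8, the levels hold 8, 64, 512, 4,096 and 32,768 nodes, which totals 37,448 after five levels, so a sixth level is needed for 40,000. A small sketch of that calculation (`FanOut` and `LevelsNeeded` are hypothetical names):

```csharp
using System;

static class FanOut
{
    // Smallest number of levels such that fanOut + fanOut^2 + ... covers all nodes.
    public static int LevelsNeeded(long nodes, int fanOut)
    {
        int levels = 0;
        long covered = 0, levelSize = 1;
        while (covered < nodes)
        {
            levelSize *= fanOut;   // nodes on the next level down
            covered += levelSize;  // running total of nodes reached
            levels++;
        }
        return levels;
    }

    static void Main()
    {
        Console.WriteLine(LevelsNeeded(40000, 8)); // 6
        Console.WriteLine(LevelsNeeded(4000, 8));  // 4
    }
}
```

With 6 levels, an instruction handed to the primary masters is relayed at most 5 more times before it reaches the deepest nodes.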

Alternatively (or in conjunction) you could look at MPI, which is an out-of-the-box solution. There are some established C# implementations.

jnielsen
I will look into this. Thanks.
SCMcDonnell