Hi,

We have to make our system highly scalable. It was developed for the Windows platform using VC++. Say that, initially, we would like to process 100 requests (from MSMQ) simultaneously. What would be the best approach: a single process with 100 threads, or 2 processes with 50 threads each? What is the gain, apart from process memory, with the second approach?

Does Windows first allocate CPU time to a process and then split it between that process's threads? Or does the OS count the threads of each process and allocate CPU on the basis of threads rather than processes?

We notice that in the first case CPU utilization is 15-25%, and we want to consume more CPU. Remember that we want optimal performance, so the 100 requests are just an example. We have also noticed that if we increase the number of threads in the process above 120, performance degrades due to context switches.

One more point: our product already supports clustering, but we want to utilize more CPU on a single node.

Any suggestions will be highly appreciated.

+2  A: 

The standard approach on Windows is multiple threads. I'm not saying that is always your best solution, but there is a price to be paid for each thread or process, and on Windows a process is more expensive. As for the scheduler, I'm not sure, but you can set the priority of the process and of its threads. The real benefit of threads is their shared address space and the ability to communicate without IPC; however, synchronization must be carefully maintained.

If your system is already developed, which it appears to be, it is likely to be easier to implement a multiple-process solution, especially if there is a chance that more than one machine may be utilized later, since your IPC between 2 processes on one machine can scale to multiple machines in the general case. Most attempts at massive parallelization fail because the entire system is not evaluated for bottlenecks. For example, if you implement 100 threads that all write to the same database, you may gain little in actual performance and just wait on your database.

just my .02

rerun
Yes, we have faced it. The database is the major bottleneck. Not to offend anyone, but we have had far better performance with Oracle as compared to SQL Server. But here in our region most of our clients are sticking with SQL Server (it being less expensive), so it is a real bottleneck for us.
Mubashir Khan
+1  A: 

You can't process more requests at the same instant than you have CPU cores. "Fast" scalable solutions involve setting up thread pools where the number of active (not blocked on IO) threads == the number of CPU cores. So creating 100 threads because you want to service 100 MSMQ requests is not good design.
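As a rough illustration of sizing a pool to the core count, here is a minimal portable sketch. It assumes a C++11 compiler with `std::thread` (on an older toolchain the workers would have to be created another way), and the function name `run_on_core_sized_pool` is made up for this example. It only covers CPU-bound tasks; it does nothing clever about threads that block on IO.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs every task on a fixed pool of worker threads, one per CPU core.
// Returns the number of worker threads that were used.
// (Hypothetical helper for illustration, not part of any library.)
unsigned run_on_core_sized_pool(const std::vector<std::function<void()>>& tasks) {
    // hardware_concurrency() may return 0 when the core count is unknown,
    // so fall back to a single worker in that case.
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<std::size_t> next(0);  // index of the next unclaimed task

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&] {
            for (;;) {
                std::size_t idx = next.fetch_add(1);
                if (idx >= tasks.size()) return;  // nothing left to claim
                tasks[idx]();
            }
        });
    }
    for (auto& t : pool) t.join();
    return workers;
}
```

With 100 queued requests this still creates only as many threads as there are cores; the requests wait in the vector instead of each getting a thread of its own.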

Windows has a thread pooling mechanism called IO Completion Ports.

Using IO Completion ports does push the design to a single process as, in a multi process design, each process would have its own IO Completion Port thread pool that it would manage independently and hence you could get a lot more threads contending for CPU cores.

The "core" idea of an IO Completion Port is that it's a kernel-mode queue: you can manually post events to the queue, or get asynchronous IO completions posted to it automatically by associating file (file, socket, pipe) handles with the port.

On the other side, the IO Completion Port mechanism automatically dequeues events onto waiting worker threads - but it does NOT dequeue jobs if it detects that the current "active" threads in the thread pool >= the number of CPU cores.
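To make the queue idea concrete, here is a small sketch of the "manually post events" half, using `CreateIoCompletionPort`, `PostQueuedCompletionStatus` and `GetQueuedCompletionStatus`. It is not taken from any real service: the function name `drain_posted_jobs` and the use of completion key 0 as a "quit" signal are assumptions made for the example, and it assumes a C++11 compiler for `std::thread`. The non-Windows branch exists only so the snippet compiles elsewhere and simply returns -1.

```cpp
#include <atomic>
#include <thread>
#include <vector>

#ifdef _WIN32
#include <windows.h>

// Posts `jobs` packets to a fresh completion port and drains them with
// `workers` threads. Returns the number of job packets handled, or -1 on error.
int drain_posted_jobs(int workers, int jobs) {
    // No file handle is associated here. Passing 0 as the last argument lets
    // the port allow as many concurrently running threads as there are CPUs.
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    if (port == NULL) return -1;

    std::atomic<int> handled(0);
    const ULONG_PTR kQuitKey = 0;  // assumed sentinel: tells one worker to exit

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        pool.emplace_back([&] {
            DWORD bytes = 0;
            ULONG_PTR key = 0;
            LPOVERLAPPED ov = NULL;
            // Blocks until the port hands this thread a packet.
            while (GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE)) {
                if (key == kQuitKey) break;
                ++handled;  // a real service would process the request here
            }
        });
    }

    // Job packets carry a non-zero key; one quit packet follows per worker.
    for (int j = 1; j <= jobs; ++j)
        PostQueuedCompletionStatus(port, 0, static_cast<ULONG_PTR>(j), NULL);
    for (int i = 0; i < workers; ++i)
        PostQueuedCompletionStatus(port, 0, kQuitKey, NULL);

    for (auto& t : pool) t.join();
    CloseHandle(port);
    return handled.load();
}
#else
// Stub so the sketch compiles on other platforms; IOCP is Windows-only.
int drain_posted_jobs(int, int) { return -1; }
#endif
```

A common pattern is to start more worker threads than cores and let the port's concurrency limit decide how many run at once, so that a worker blocking on IO does not leave a core idle.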

Using IO Completion Ports can potentially increase the scalability of a service a lot. Usually, however, the gain is a lot smaller than expected, as other factors quickly come into play when all the CPU cores are contending for the service's other resources.

If your services are developed in C++, you might find that serialized access to the heap is a big performance minus, although Windows version 6.1 seems to have implemented a low-contention heap, so this might be less of an issue.

To summarize - theoretically your biggest performance gains would be from a design using thread pools managed in a single process. But you are heavily dependent on the libraries you are using not serializing access to critical resources, which can quickly lose you all the theoretical performance gains. If you do have library code serializing your nicely thread-pooled service (as in the case of C++ object creation and destruction being serialized because of heap contention), then you need to change your use of the library, switch to a low-contention version of the library, or just scale out to multiple processes.

The only way to know is to write test cases that stress the server in various ways and measure the results.
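One way to start measuring is a tiny harness that runs the same batch of work at different thread counts and reports throughput. The sketch below is only a starting point: `jobs_per_second` is a made-up name, the `job` callback stands in for whatever a real request does, and it assumes a C++11 compiler.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Runs `jobs` invocations of `job` across `threads` workers and returns the
// measured throughput in jobs per second. (Hypothetical helper.)
double jobs_per_second(unsigned threads, unsigned jobs,
                       const std::function<void()>& job) {
    std::atomic<unsigned> next(0);  // number of jobs claimed so far
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i) {
        pool.emplace_back([&] {
            // Each worker claims job indices until all have been handed out.
            while (next.fetch_add(1) < jobs) job();
        });
    }
    for (auto& t : pool) t.join();

    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return jobs / elapsed.count();
}
```

Calling it with, say, 1, 2, 4, 8 and 128 threads on the same job and comparing the results should show where extra threads stop helping. If the job allocates heavily or hits a shared database, the curve will flatten for reasons other than the core count.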

Chris Becke
I will definitely write the test case. But I would like to know how CPU scheduling is done in Windows, theoretically. This could help me in analyzing the tests.
Mubashir Khan
Just to add one more point: we are using VS2008, and I guess it uses the 6.1 SDK. How could I confirm this?
Mubashir Khan
It doesn't matter which version of the SDK you develop against - the performance gains of "Windows 7" heap (being windows version 6.1) are automatic as soon as you upgrade the server.
Chris Becke
This article by Mark Russinovich might give some insight into how NT implements IO Completion Ports http://doc.sch130.nsc.ru/www.sysinternals.com/ntw2k/info/comport.shtml
Chris Becke