I have a multithreaded C++ server program that uses MSXML6 to continuously parse XML messages and then apply a prepared XSLT transform to produce text. I am running it on a server with 4 CPUs. Each thread is completely independent and uses its own transform object. No COM objects are shared among the threads.
This works well, but the problem is scalability. When running:
- with 1 thread, I get about 26 parse+transformations per second per thread,
- with 2 threads, about 20/s/thread,
- with 3 threads, about 18/s/thread,
- with 4 threads, about 15/s/thread.
With nothing shared between threads, I expected near-linear scalability, so aggregate throughput with 4 threads should be about 4 times that of 1 thread. Instead, it is only about 2.3 times higher (4 x 15 = 60/s in total versus 26/s).
It looks like a classic contention problem. I've written test programs to eliminate the possibility of the contention being in my code. I am using the DOMDocument60 class instead of the FreeThreadedDOMDocument one in order to avoid unnecessary locking since the documents are never shared between threads. I looked hard for any evidence of cache-line false sharing and there isn't any, at least in my code.
Another clue: the context switch rate is more than 15k/s for each thread. I am guessing the culprit is the COM memory manager or the memory manager within MSXML. Maybe it has a global lock that must be acquired and released for every memory allocation and deallocation. I just can't believe that, in this day and age, the memory manager is not written in a way that scales well in multithreaded, multi-CPU scenarios.
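One way to check the global-lock guess in isolation is a small allocation benchmark along these lines. This is only a sketch: measure_alloc_rates is a made-up name, and it exercises the process default allocator through malloc/free, which is an assumption on my part, since MSXML may well allocate from a heap of its own that this test never touches.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: runs `iterations` small malloc/free pairs on each of
// `thread_count` threads and returns each thread's rate in operations per second.
// Assumption: this measures the default process allocator, not MSXML's own.
std::vector<double> measure_alloc_rates(unsigned thread_count, std::size_t iterations) {
    std::vector<double> rates(thread_count, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&rates, t, iterations] {
            const auto start = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < iterations; ++i) {
                // Vary the size a little so requests span a few size classes.
                void* p = std::malloc(64 + (i % 8) * 16);
                // Touch the block so the compiler cannot drop the malloc/free pair.
                if (p != nullptr) *static_cast<volatile char*>(p) = 1;
                std::free(p);
            }
            const std::chrono::duration<double> elapsed =
                std::chrono::steady_clock::now() - start;
            // Each thread writes only its own slot, and only once at the end.
            rates[t] = elapsed.count() > 0.0 ? iterations / elapsed.count() : 0.0;
        });
    }
    for (auto& w : workers) w.join();
    return rates;
}
```

Calling it with 1, 2, 3, and 4 threads and a few million iterations, then comparing the per-thread rates, should show whether plain allocation degrades the same way the parse+transform rate does. If the per-thread rate stays flat here, the default allocator is probably not the bottleneck and the contention is more likely somewhere inside MSXML or COM.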
Does anyone have any idea what is causing this contention or how to eliminate it?