I have a multithreaded server C++ program that uses MSXML6 and continuously parses XML messages, then applies a prepared XSLT transform to produce text. I am running this on a server with 4 CPUs. Each thread is completely independent and uses its own transform object. There is no sharing of any COM objects among the threads.

This works well, but the problem is scalability. When running:

  1. with one thread, I get about 26 parse+transform operations per second per thread;
  2. with 2 threads, I get about 20/s/thread;
  3. with 3 threads, 18/s/thread;
  4. with 4 threads, 15/s/thread.

With nothing shared between threads, I expected near-linear scalability, so 4 threads should be about 4 times faster than 1. Instead, the aggregate throughput is only about 2.3 times higher (4 × 15 = 60/s versus 26/s).
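The arithmetic behind that figure can be sketched as follows. This is only an illustration built from the rates quoted above; the function names are made up for the example and are not part of any library.

```cpp
#include <cassert>

// Aggregate throughput when `threads` workers each sustain `perThreadRate` ops/s.
double aggregateRate(int threads, double perThreadRate) {
    assert(threads > 0);
    return threads * perThreadRate;
}

// Speedup of the multithreaded run relative to the single-thread rate.
double speedup(int threads, double perThreadRate, double singleThreadRate) {
    assert(singleThreadRate > 0.0);
    return aggregateRate(threads, perThreadRate) / singleThreadRate;
}

// Parallel efficiency: 1.0 would mean perfectly linear scaling.
double efficiency(int threads, double perThreadRate, double singleThreadRate) {
    return speedup(threads, perThreadRate, singleThreadRate) / threads;
}
```

With the measurements above, `speedup(4, 15.0, 26.0)` comes out at roughly 2.3 and `efficiency(4, 15.0, 26.0)` at roughly 0.58, so a bit over 40% of the potential throughput is being lost somewhere.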

It looks like a classic contention problem. I've written test programs to eliminate the possibility of the contention being in my code. I am using the DOMDocument60 class instead of the FreeThreadedDOMDocument one in order to avoid unnecessary locking since the documents are never shared between threads. I looked hard for any evidence of cache-line false sharing and there isn't any, at least in my code.

Another clue: the context switch rate is over 15,000 per second for each thread. I am guessing the culprit is the COM memory manager or the memory manager within MSXML. Maybe it has a global lock that has to be acquired and released for every memory allocation and deallocation. I just can't believe that, in this day and age, the memory manager is not written in a way that scales nicely in multithreaded, multi-CPU scenarios.
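One way to probe the allocator hypothesis in isolation is a small harness that runs allocation-heavy work on N independent threads and reports the per-thread rate. The sketch below is an assumption-laden stand-in, not my actual test program: `allocationHeavyWork` is a hypothetical placeholder for one parse+transform, and it exercises the C++ runtime heap rather than the COM or MSXML allocator.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one parse+transform: allocation-heavy work
// that shares no data with other threads.
std::size_t allocationHeavyWork(std::size_t iterations) {
    std::size_t checksum = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        std::string s(64 + i % 64, 'x');  // long enough to need a heap allocation
        checksum += s.size();
    }
    return checksum;
}

// Runs `threads` independent workers and returns iterations per second per thread.
double perThreadRate(unsigned threads, std::size_t iterations) {
    std::vector<std::thread> pool;
    std::vector<std::size_t> results(threads, 0);
    auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&results, t, iterations] {
            results[t] = allocationHeavyWork(iterations);  // written once, at the end
        });
    }
    for (auto& worker : pool) worker.join();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return iterations / elapsed.count();
}
```

If `perThreadRate(4, n)` drops well below `perThreadRate(1, n)` on a 4-CPU machine, the heap being exercised is a likely source of contention. An aggressive optimizer may remove some of the allocations, so a real test should call the actual workload instead.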

Does anyone have any idea what is causing this contention or how to eliminate it?

+2  A: 

It is fairly common for heap-based memory managers (your basic malloc/free) to use a single mutex, and there are fairly good reasons for it: a heap memory area is a single coherent data structure.

There are alternative memory management strategies (e.g. hierarchical allocators) that do not have this limitation. You should investigate customizing the allocator used by MSXML.
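To illustrate the general idea of avoiding a shared heap lock, here is a minimal sketch of a per-thread bump arena. It is not a hierarchical allocator and not anything MSXML is known to expose; `Arena` and `threadArena` are hypothetical names, and the 1 MiB capacity is an arbitrary assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread bump arena: each thread owns its own instance,
// so allocating from it never takes a lock.
class Arena {
public:
    explicit Arena(std::size_t capacity) : buffer_(capacity), used_(0) {}

    // Returns nullptr when the arena is exhausted.
    // `alignment` must be a power of two; offsets are aligned relative to the buffer start.
    void* allocate(std::size_t size, std::size_t alignment = alignof(std::max_align_t)) {
        std::size_t aligned = (used_ + alignment - 1) & ~(alignment - 1);
        if (aligned > buffer_.size() || size > buffer_.size() - aligned) return nullptr;
        used_ = aligned + size;
        return buffer_.data() + aligned;
    }

    // Releases everything at once, e.g. after one message has been processed.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// One arena per thread (capacity is an assumed value for the sketch).
Arena& threadArena() {
    thread_local Arena arena(1 << 20);
    return arena;
}
```

A worker would call `threadArena().allocate(...)` while handling a message and `threadArena().reset()` when it is done. The trade-off is that individual blocks cannot be freed, which suits short-lived per-message data but not long-lived objects.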

Alternatively, you should investigate moving away from a multi-threaded architecture to a multi-process architecture, with a separate process for each MSXML worker. Since your MSXML workers take string data as input and produce string data as output, you do not have a serialization problem.

In summary: use a multi-process architecture. It's a better fit for your problem, and it will scale better.

ddaa
+1 for the multi-process instead of multi-thread.
call me Steve
+1  A: 

MSXML uses BSTRs, whose heap management uses a global lock. It caused us a ton of trouble in a massively multiuser app a few years ago.

We removed XML from our app. You may not be able to do that, so you might be better off using an alternative XML parser.

gbjbaanb
A: 

Thanks for the answers. I ended up implementing a mix of the two suggestions.

I made a COM+ ServicedComponent in C#, hosted it as a separate server process under COM+, and used XslCompiledTransform to run the transformation. The C++ server connects to this external process using COM, sends it the XML, and gets back the transformed string. This doubled the performance.

Carlos A. Ibarra