Hi!

I'm facing a really strange problem with a .NET service.

I developed a multithreaded x64 Windows service.

I tested this service on an x64 server with 8 cores. The performance was great!

Now I have moved the service to a production server (x64, 32 cores). During the tests I found that performance is at least 10 times worse than on the test server.

I've checked loads of performance counters trying to find the reason for this poor performance, but I couldn't pinpoint a cause.

Could it be a GC problem? Have you ever faced a problem like this?

Thank you in advance! Alexandre

A: 

Could it be down to differences in memory or the disk? If either of those were the bottleneck, you wouldn't get any benefit from the additional processing power. Can't really tell without more details of your application/configuration.

dommer
+2  A: 

There are way too many variables to know why one machine is slower than the other. 32-core machines are usually more specialized, whereas an eight-core box could just be a dual-processor, quad-core machine. Are there VMs or other things running at the same time? Usually with that many cores, I/O bandwidth becomes the limiting factor (even if the CPUs still have plenty of headroom).

To start off, you should probably add lots of timers in your code (or profiling or whatever) to figure out what part of your code is taking up the most time.

Performance troubleshooting 101: what is the bottleneck (where in the code, and in which subsystem: memory, disk, or CPU)?
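
As a rough illustration of the "add lots of timers" suggestion, here is a minimal sketch. It is written in Python only to keep it short, and it is not taken from the original service; the helper names `timed` and `report` and the section labels are hypothetical. The same idea carries over to any language: wrap each suspect section, accumulate elapsed time per section, and compare the totals on the 8-core and 32-core machines.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from threading import Lock

# Hypothetical helper names; nothing here comes from the original service.
_totals = defaultdict(float)   # section name -> accumulated seconds
_counts = defaultdict(int)     # section name -> number of calls
_lock = Lock()                 # protects both dicts across threads


@contextmanager
def timed(section):
    """Accumulate wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        with _lock:
            _totals[section] += elapsed
            _counts[section] += 1


def report():
    """Return (section, total_seconds, calls) rows, slowest section first."""
    with _lock:
        rows = [(name, _totals[name], _counts[name]) for name in _totals]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    with timed("parse"):            # assumed section label
        sum(range(100_000))
    with timed("io"):               # assumed section label
        time.sleep(0.05)
    for name, total, calls in report():
        print(f"{name:10s} {total:8.4f}s over {calls} call(s)")
```

Note that the single lock guarding the totals is itself shared state. On a hot path with many threads it would be better to accumulate per thread and merge at the end, otherwise the instrumentation can add the very contention you are trying to measure.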

Alan Jackson
+1  A: 

There are so many factors here:

  • are you actually using the cores?
  • are your extra threads causing locking issues to be more obvious?
  • do you not have enough memory to support all the extra stacks / data you can process?
  • can your IO (disk/network/database) stack keep up with the throughput?

etc

Marc Gravell
+8  A: 

This is a common problem which people are generally unaware of, because very few people have experience on many-CPU machines.

The basic problem is contention.

As the CPU count increases, contention increases in all shared data structures. For low CPU counts, contention is low and the fact you have multiple CPUs improves performance. As the CPU count becomes significantly larger, contention begins to drown out your performance improvements; as the CPU count becomes large, contention actually starts reducing performance below that of a lower number of CPUs.

You are basically facing one of the aspects of the scalability problem.

I'm not sure, however, where this problem lies: in your data structures, or in the operating system's data structures. The former you can address - lock-free data structures are an excellent, highly scalable approach. The latter is difficult, since it essentially requires avoiding certain OS functionality.
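
To make the idea of reducing contention on a shared structure concrete, here is a minimal sketch of a related but simpler technique: sharding (striping) a counter so that each thread writes only to its own cell. This is not a lock-free data structure, and it is written in Python purely for illustration; the class name `ShardedCounter` and the function `run` are hypothetical. Under CPython's global interpreter lock it shows the structure of the approach rather than a measurable speedup.

```python
import threading


class ShardedCounter:
    """Counter where each thread increments its own private cell.

    Writers take no shared lock on the hot path. Only a thread's first
    increment (registration) and reads take the lock.
    """

    def __init__(self):
        self._local = threading.local()
        self._cells = []                        # one single-element list per thread
        self._register_lock = threading.Lock()

    def _cell(self):
        cell = getattr(self._local, "cell", None)
        if cell is None:
            cell = [0]
            with self._register_lock:           # rare: once per thread
                self._cells.append(cell)
            self._local.cell = cell
        return cell

    def increment(self, n=1):
        self._cell()[0] += n                    # only this thread writes this cell

    def value(self):
        # Exact once writers have finished; approximate while they are running.
        with self._register_lock:
            return sum(cell[0] for cell in self._cells)


def run(threads=8, per_thread=10_000):
    """Increment one ShardedCounter from several threads and return the total."""
    counter = ShardedCounter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value()
```

The trade-off is that reads become more expensive (they sum every cell) in exchange for writes that do not fight over one lock or one cache line.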

Blank Xavier
Not arguing with any of this, but for a 10x slow-down (rather than a disappointing lack of improvement) I'd start by looking for something more fundamental than lock weirdness. (Network or disk problems, for example)
Will Dean
I understand your reluctance, but when you run inappropriate data structures on many-CPU platforms, you end up spending almost all your time in destructive contention. It's a killer - and imagine what it's going to be like when Intel bring out that 80 core CPU in two years. Software isn't ready.
Blank Xavier
So I'm happy to ascribe his problems to this. However, of course, given that I know nothing about his software, he may *also* have other types of problems.
Blank Xavier
Not just software locking either. The smaller system might have had a different memory architecture or caching system. Data that might have sat in the same cache among cores on a socket might now be spread out much more.
MichaelGG
A: 

With that many threads running concurrently, you're going to have to be really careful to get around issues of threads fighting with each other to access your data. Read up on Non-blocking synchronization.

Adam Jaskiewicz
A: 

How many threads are you using? Using too many thread pool threads could cause thread starvation, which would make your program slower.

Some articles: http://www2.sys-con.com/ITSG/virtualcd/Dotnet/archives/0112/gomez/index.html http://codesith.blogspot.com/2007/03/thread-starvation-in-shared-thread-pool.html

(search for thread starvation in them)

You could use a .NET profiler to find your bottlenecks. Here is a good free one: http://www.eqatec.com/tools/profiler

jgauffin
A: 

I agree with Blank, it's likely to be some form of contention. It's likely to be very hard to track down, unfortunately. It could be in your application code, the framework, the OS, or some combination thereof. Your application code is the most likely culprit, since Microsoft has expended significant effort on making the CLR and the OS scale on 32P boxes.

The contention could be in some hot locks, but it could be that some processor cache lines are sloshing back and forth between CPUs.

What's your metric for 10x worse? Throughput?

Have you tried booting the 32-proc box with fewer CPUs? Use the /NUMPROC option in boot.ini or BCDedit.

Do you achieve 100% CPU utilization? What's your context switch rate like? And how does this compare to the 8P box?
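
One way to put numbers on these questions is a small scaling sweep: run the same workload at several concurrency levels and record throughput at each. The sketch below is an illustration in Python, not the poster's service. `fake_request` is a hypothetical stand-in for one unit of work, and the sweep varies the number of worker threads inside the process, which is not the same thing as booting with fewer CPUs via /NUMPROC.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request():
    """Hypothetical stand-in for one unit of the service's work."""
    time.sleep(0.002)


def throughput(workers, requests=200, task=fake_request):
    """Run `requests` tasks on `workers` threads and return tasks per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for _ in range(requests)]
        for future in futures:
            future.result()         # re-raise any exception from a worker
    return requests / (time.perf_counter() - start)


def sweep(worker_counts=(1, 2, 4, 8, 16, 32)):
    """Map each worker count to its measured throughput."""
    return {n: throughput(n) for n in worker_counts}


if __name__ == "__main__":
    for n, rate in sweep().items():
        print(f"{n:3d} workers: {rate:10.1f} tasks/s")
```

If throughput stops rising, or starts falling, well before the worker count reaches the core count, that points towards contention or an I/O limit rather than a lack of CPU.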

George V. Reilly