views: 364 · answers: 7

I recently learned that sometimes people will lock specific processes or threads to specific processors or cores, and it's thought that this manual tuning will best distribute the load. This is a bit counter-intuitive to me -- I would think the OS scheduler would be able to make a better decision than a human about how to spread the load. I could see it being true for older operating systems that perhaps weren't aware of issues like there being more latency between specific pairs of cores, or shared cache between one pair of cores but not another pair. But I assume 'modern' OSs like Linux, Solaris 10, OS X, and Vista should have schedulers that know this information. Am I mistaken about their capabilities? Am I mistaken that it's a problem the OS can actually solve? I'm particularly interested in the answer for Solaris and Linux.

The practical consequence is whether I need to tell users of my (multithreaded) software how they might consider balancing it on their box.

A: 

I am not even sure you can pin processes to a specific CPU on Linux. So, my answer is "NO" - let the OS handle it, it's smarter than you most of the time.

Edit: It seems that on Win32 you have some control over which CPUs a process is going to run on. Now I'm just waiting for someone to prove me wrong on Linux/POSIX as well ...
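
Edit 2: I have to prove myself wrong: Linux does support this, via sched_setaffinity(2) (and pthread_setaffinity_np(3) for individual threads). A minimal sketch that pins the calling process to CPU 0:

```c
/* Minimal sketch: pin the calling process to CPU 0 on Linux. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);              /* allow CPU 0 only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}
```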

elcuco
On Windows you can do so from Task Manager, I'm pretty sure. I can't say with absolute confidence since it's been a while, but pretty sure.
Joseph Garvin
Yes. You can set the "CPU affinity" of a process to lock it to a specific processor.
Michael Aaron Safyan
You can pin on Windows; SetThreadAffinityMask/SetProcessAffinityMask are the APIs to do it.
Michael
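
For reference, a minimal sketch using those two calls to restrict the current process (and then the current thread) to logical processor 0:

```c
/* Minimal Win32 sketch: restrict the current process, then the
 * current thread, to logical processor 0 via an affinity bitmask. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* bit 0 set => logical processor 0 only */
    DWORD_PTR mask = 1;

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());

    /* returns the previous thread mask, or 0 on failure */
    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());

    return 0;
}
```
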
@joseph, you can. Right click a task, pick "Set affinity...". You get a list of processors and you tick the checkbox next to those that are allowed to run the task in question.
Andrew
-1 Revise to remove the conjecture about not being able to set affinity, which clearly can be done.
Will Bickford
A: 

Most modern operating systems will do an effective job of allocating work between cores. They also attempt to keep threads running on the same core, to get the cache benefits you mentioned.

In general, you should never be setting your thread affinity unless you have a very good reason to. You don't have as good an insight as the OS into the other work that threads on the system are doing. Kernels are constantly being updated based on new processor technology (from a single CPU per socket, to hyper-threading, to multiple cores per socket). Any attempt by you to set hard affinity may backfire on future platforms.

Michael
+1  A: 

For general-purpose applications, there is no reason to set the CPU affinity; you should just allow the OS scheduler to choose which CPU should run the process or thread. However, there are instances where it is necessary to set the CPU affinity. For example, in real-time systems, migrating a thread from one core to another (which can happen at any time if the CPU affinity has not been set) can introduce unpredictable delays that cause tasks to miss their deadlines and preclude real-time guarantees.
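
On Linux, for instance, this can be done per thread with pthread_setaffinity_np (a GNU extension). A minimal sketch, with the core number chosen arbitrarily:

```c
/* Minimal sketch (Linux, GNU extension): pin the calling thread to
 * one core so the scheduler can never migrate it mid-deadline. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static void pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np: %d\n", err);
}

int main(void)
{
    pin_self_to_core(1);   /* hypothetical core choice */
    /* ... real-time work runs here, free of migration jitter ... */
    return 0;
}
```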

You can take a look at this article about a multi-core aware implementation of real-time CORBA that, among other things, had to set the CPU affinity so that CPU migration could not result in missed deadlines.

The paper is: Real-Time Performance and Middleware for Multiprocessor and Multicore Linux Platforms

Michael Aaron Safyan
Does it ever make sense outside of realtime scheduling?
Joseph Garvin
No. At least not to my knowledge.
Michael Aaron Safyan
On a two-CPU system (on Win 7), setting the affinity to only one core increased the speed of the application by 30% (I can't refrain from mentioning that it's huge).
call me Steve
A: 

This article from MSDN Magazine, Using concurrency for scalability, gives a good overview of multithreading on Win32. Regarding CPU affinity,

Windows automatically employs so-called ideal processor affinity in an attempt to maximize cache efficiency. For example, a thread running on CPU 1 that gets context switched out will prefer to run again on CPU 1 in the hope that some of its data will still reside in cache. But if CPU 1 is busy and CPU 2 is not, the thread could be scheduled on CPU 2 instead, with all the negative cache effects that implies.

The article also warns that CPU affinity shouldn't be manipulated without a deep understanding of the problem. Based on this information, my answer to your question would be No, except for very specific, well-understood scenarios.
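
As an aside, Win32 exposes this hint directly via SetThreadIdealProcessor. A minimal sketch (note this is advisory, unlike a hard affinity mask):

```c
/* Minimal sketch: hint (not force) that the current thread should
 * preferably run on logical processor 0. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* returns the previous ideal processor, or (DWORD)-1 on failure */
    DWORD prev = SetThreadIdealProcessor(GetCurrentThread(), 0);
    if (prev == (DWORD)-1)
        printf("SetThreadIdealProcessor failed: %lu\n", GetLastError());
    else
        printf("previous ideal processor: %lu\n", prev);
    return 0;
}
```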

gWiz
If CPU1 is busy, couldn't the cost of the negative cache effect of running the thread on CPU2 be less than the cost of waiting for CPU1 to be free? It's not clear to me that what Windows is doing is the wrong answer. It actually sounds like if you used CPU affinity in this scenario you might slow things down. I thought I read at some point that Linux at least takes this cost into account when determining whether to wait or move the thread to another CPU.
Joseph Garvin
Yes, at least according to the article, you are right.
gWiz
+1  A: 

For applications designed with parallelism and multiple cores in mind, OS-default thread affinity is sometimes not enough. There are many approaches to parallelism, but so far all require involvement of the programmer and knowledge - at some level at least - of the architecture onto which the solution will be mapped. This includes the machines, CPUs and threads that are involved.

This is an actively researched subject, and there is an excellent course on MIT's OpenCourseWare that delves into these issues: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-189January--IAP--2007/CourseHome/

bright
But *why* is OS default thread affinity not enough? What is it that an application or user can know that it's not possible for the scheduler to know?
Joseph Garvin
The application and user can know the specific work load that they want the computer to do, and the particular performance characteristics they want the computer to exhibit. A general purpose OS will by necessity try to work well in all common cases, and so it can't take advantage of the particulars of an unusual individual installation.
Jeremy Friesner
A: 

Well, something many people haven't mentioned here is the idea of forbidding two processes from running on the same processor (socket). It might be worth helping the system bind different heavily used processes to different processors. This can avoid contention if the scheduler is not clever enough to figure it out itself.

But this is more a system admin task than one for programmers. I have seen optimizations like this for a few high performance database servers.
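
On Linux, an administrator can do this without touching the program, e.g. taskset -cp 0-3 <pid>. Programmatically it is one sched_setaffinity(2) call; a minimal sketch, assuming (hypothetically) that cores 0-3 belong to socket 0:

```c
/* Minimal sketch: confine an already-running process (by PID) to
 * cores 0-3, assumed (hypothetically) to be the cores of socket 0.
 * Roughly what `taskset -cp 0-3 <pid>` does. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu <= 3; cpu++)   /* socket 0's cores, by assumption */
        CPU_SET(cpu, &mask);

    if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}
```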

Lothar
+1  A: 

First of all, 'lock' is not the correct term to describe it. 'Affinity' is the more suitable term.

In most cases, you don't need to care about it. However, in some cases, manually setting CPU/process/thread affinity could be beneficial.

Operating systems are usually oblivious to the details of modern multicore architecture. For example, say we have a 2-socket system with quad-core processors, and the processors support SMT (= HyperThreading). In this case, we have 2 processors, 8 cores, and 16 hardware threads, so the OS will see 16 logical processors. If an OS does not recognize this hierarchy, it is highly likely to lose some performance gains. The reasons are:

  1. Caches: in our example, the two processors (installed in two different sockets) do not share any on-chip caches. Say an application has 4 busy threads and a lot of data is shared by the threads. If the OS schedules the threads across the processors, then we may lose some cache locality, resulting in a performance loss. However, if the threads do not share much data (each having a distinct working set), then separating them onto different physical processors would be better, because it increases the effective cache capacity. Also, trickier scenarios can happen that are very hard for the OS to be aware of.

  2. Resource conflicts: let's consider the SMT (= HyperThreading) case. SMT shares a lot of important CPU resources such as caches, TLBs, and execution units. Say there are only two busy threads. An OS may naively schedule these two threads onto two logical processors of the same physical core. In such a case, significant resources are contended for by the two logical threads (see the sketch after this list).
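
As a rough illustration, here is a minimal sketch that discovers SMT siblings on Linux by reading sysfs (the path is standard on modern kernels), so that busy threads can then be pinned to non-sibling CPUs:

```c
/* Minimal sketch (Linux-specific, assumes sysfs at /sys): print which
 * logical CPUs share a physical core, so busy threads can be pinned
 * to non-sibling CPUs and avoid SMT resource conflicts. */
#include <stdio.h>

int main(void)
{
    char path[128], line[64];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                      /* no more CPUs (or a gap) */
        if (fgets(line, sizeof(line), f))
            /* e.g. "0,8" or "0-1": logical CPUs sharing this core */
            printf("cpu%d shares a core with: %s", cpu, line);
        fclose(f);
    }
    return 0;
}
```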

One good example is Windows 7. Windows 7 now supports a smart scheduling policy that considers SMT (related article). Windows 7 actually prevents case 2 above. Here is a snapshot of Task Manager in Windows 7 with 20% load on a Core i7 (quad-core with HyperThreading = 8 logical processors):

[screenshot: Windows 7 Task Manager, CPU usage history across 8 logical processors]

The CPU usage history is very interesting, isn't it? :) You can see that only a single CPU in each pair is utilized, meaning Windows 7 avoids scheduling two threads on the same core simultaneously as much as possible. This policy definitely decreases the negative effects of SMT such as resource conflicts.

I'd like to say OSes are not yet smart enough to fully understand modern multicore architecture, with its many caches, shared last-level cache, SMT, and even NUMA. So, there can be good reasons you may need to manually set CPU/process/thread affinity.

However, I won't say this is really needed. Try it only when you fully understand your workload patterns and your system architecture, and then measure whether your change is actually effective.

minjang
"I'd like to say OS are not very smart to understand modern multicore architecture where a lot of caches, shared last-level cache, SMT, and even NUMA." <--- "A lot of caches", "shared last-level cache" and "NUMA" all really sound like, "NUMA." And Linux and other OSes have been NUMA aware for awhile now. Even SMT is just a CPU having another set of registers, so I'd be surprised if a NUMA aware scheduler didn't already take it into account. Are you sure schedulers aren't already handling this?
Joseph Garvin
SMT is not just another set of registers. Yes, registers are replicated (but only the architectural registers, not the physical registers). Some resources such as some TLBs are duplicated. However, many important resources (such as the ROB, the scheduler, caches, and the instruction pipeline) are either statically partitioned or shared.
minjang
NUMA is supported to some extent, but not fully. For example, Windows 7/Windows 2008 R2 started to support 64+ NUMA logical processors; before that, only <64 processors were handled. It definitely affects scheduling logic. AFAIK, schedulers do not consider the detailed cache hierarchy well, such as how caches are shared by cores. "Many caches" -> private L1/L2 caches; "shared LLC" -> the L3 cache: they are not the same. Very different.
minjang
As I pointed out with the Task Manager figure, Windows Vista/XP can't recognize SMT. By default, each thread can be scheduled on any logical processor. So, on legacy OSes, setting affinity to minimize conflicts between SMT logical processors is necessary. I was able to measure performance differences by setting affinities.
minjang
My comment about hyper-threading being just another register set was inspired by Ulrich Drepper's paper, "What Every Programmer Should Know About Memory," located here, http://people.redhat.com/drepper/cpumemory.pdf Specifically I was given the impression by, "They all share almost all the processing resources except for the register set," on pg. 29. I didn't know about the TLB duplication, but the quote's wording certainly leaves room for that. Nonetheless, my point was that SMT can be thought of as a specific case of cache sharing. I think that's still true.
Joseph Garvin
Your 64-CPU limit would then mean that you should only be able to do a better job than the scheduler when using more than 64 CPUs, correct?
Joseph Garvin
Also, you'd need to demonstrate more than a performance difference by setting affinities. I don't doubt you can *affect* the performance. But can you get better performance by setting affinity than if you hadn't at all? Since Vista/XP can't recognize SMT, I assume you're correct that it can make a difference there. But on modern Windows 7/Linux/Solaris?
Joseph Garvin
(1) No, SMT is much more than cache sharing. *Execution units* are shared. SMT is not that simple. When fetching instructions under SMT, we also need to consider scheduling: Intel's Nehalem has a 4-wide pipeline (you can issue at most 4 instructions per cycle). With SMT, how would you divide the pipeline? Just 2 and 2? What about 1 and 3? Or 0 and 4? SMT poses a lot of problems. It is not just sharing caches.
minjang
(2) Yes; however, nowadays it's easy to find systems with hundreds of logical processors (with the address space shared by all processors). It's just an example that shows the CPU scheduler can't always and perfectly handle all the details of the architecture.
minjang
(3) Yes, but it depends on the programs and workloads. I did several experiments. Let me show one result: I ran two SPEC 2006 benchmarks on a Core i7 w/ SMT and Vista. Note that SPEC 2006 is single-threaded. Without setting affinities (i.e., a thread can be scheduled on any core), it took 708 sec, while setting affinities (i.e., a thread can run on only a single core) gave 662 sec. **6.50% improvement in execution time.** It's not about SMT, just affinity. Pinning a very busy thread to a specific core will increase cache affinity (fewer L1/L2/iTLB/dTLB misses, which are not trivial at all).
minjang
My point is very simple. Even current, modern schedulers can hardly follow all the details of multicore architecture. And the scheduler is generalized; it can't be specialized for specific workloads. So, there should be some cases where you *could* get more performance if you manually tweak affinities. All the examples that I've shown are for this claim.
minjang
But if a single busy single threaded process like your SPEC example runs faster with affinity set, either Vista's scheduler is awful, or there is some confounding factor, like maybe it automatically raises the priority of processes that have affinity, or your sample size of only 2 consecutive runs was too small (assuming you only did 2). If there's only 1 busy process, why would it ever migrate it? That only makes sense if it doesn't think changing CPUs doesn't have *any* cost, which is only true if all CPUs share all cache (rarely true, hard to believe they'd assume it).
Joseph Garvin
I double negativified myself into a corner there, I meant, "That only makes sense if it thinks changing CPUs has no cost..."
Joseph Garvin