We got a 12-core Mac Pro to do some Monte Carlo calculations. Its Intel Xeon processors have Hyper-Threading (HT) enabled, so 24 processes should run in parallel to utilize them fully. However, our calculations run more efficiently as 12 tasks at 100% than as 24 tasks at 50%, so we tried to turn Hyper-Threading off via the Processor pane in System Preferences in order to get higher performance. One can also turn HT off with
hwprefs -v cpu_ht=false
Then we ran some tests and here is what we got:
- 12 parallel tasks run in the same time w/ or w/o HT, to our disappointment.
- 24 parallel tasks lose 20% if HT is off (not 50% as we thought).
- When HT is on, switching from 24 to 12 tasks decreases efficiency by 20% (also surprising)
- When HT is off, switching from 24 to 12 doesn't change anything.
It seems that Hyper-Threading only decreases performance for our calculations, and there is no way to avoid it. The program we use for the calculations is written in Fortran and compiled with gfortran. Is there a way to make it more efficient on this piece of hardware?
Update: Our Monte Carlo calculations (MCC) are typically done in steps, to avoid data loss and for other reasons (it's not always possible to avoid such steps). In our case each step consists of many simulations of variable duration. Since each step is split between a number of parallel tasks, the tasks also have variable duration, and all faster tasks have to wait until the slowest one is done. This forces us to make the steps bigger: thanks to averaging, bigger steps finish with less spread in time, so the processors waste less time waiting. This is our motivation for having 12*2.66 GHz instead of 24*1.33 GHz. If it were possible to turn HT off, we would get about +10% performance by switching from 24 tasks w/ HT to 12 tasks w/o HT. However, the tests show that we lose 20% instead, so the calculation ends up roughly 30% less efficient than it should be.
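To illustrate why bigger steps help, here is a toy model (not our actual code; the task count and the uniform random durations are made up for illustration) of the step efficiency, i.e. the fraction of CPU time not lost to waiting for the slowest task:

```fortran
! Toy model of the synchronisation cost described above: a step ends
! only when the slowest of ntasks parallel tasks finishes, so the
! step efficiency is (mean task time) / (max task time).
! All numbers are made up for illustration.
program step_efficiency
  implicit none
  integer, parameter :: ntasks = 12
  integer :: nsim, i, k, p
  real :: r, t(ntasks)

  call random_seed()
  do p = 0, 3
     nsim = 10**p                  ! simulations per task in one step
     t = 0.0
     do i = 1, ntasks
        do k = 1, nsim
           call random_number(r)   ! duration of one simulation
           t(i) = t(i) + r
        end do
     end do
     print '(a,i6,a,f6.3)', 'simulations per task:', nsim, &
          '  step efficiency:', (sum(t)/ntasks)/maxval(t)
  end do
end program step_efficiency
```

With only a few simulations per task, a large part of each step is spent waiting for the slowest task; with many simulations per task the task times average out and the efficiency approaches 1.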
For the tests I used quite large steps; usually the steps are shorter, so the efficiency drops even further.
There is one more reason: some of our calculations require 3-5 GB of memory each, so you can probably see how economical it would be for us to have 12 fast tasks instead of 24. We are working on implementing shared memory, but it's going to be a looong-term project. Therefore we need to find out how to make the existing hardware/software as fast as possible.
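To clarify what we mean by shared memory: the goal is to keep a single copy of the big data per machine instead of one copy per task. A minimal sketch of one possible way to do it (MPI-3 shared-memory windows; the names and sizes are hypothetical, and this is not our code):

```fortran
! Sketch: one shared copy of a big table for all MPI ranks on the node.
! Assumes an MPI-3 implementation; all ranks run on the same machine,
! so MPI_COMM_WORLD can be used directly (on a cluster one would split
! it with MPI_COMM_TYPE_SHARED first).
program shared_table
  use mpi
  use iso_c_binding
  implicit none
  integer, parameter :: n = 100000000    ! ~0.8 GB of real(8), made up
  integer :: ierr, rank, win, disp_unit
  integer(kind=MPI_ADDRESS_KIND) :: winsize
  type(c_ptr) :: baseptr
  real(8), pointer :: table(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! only rank 0 contributes memory; the other ranks map the same pages
  winsize = 0
  if (rank == 0) winsize = int(n, MPI_ADDRESS_KIND) * 8
  call MPI_Win_allocate_shared(winsize, 8, MPI_INFO_NULL, &
       MPI_COMM_WORLD, baseptr, win, ierr)
  if (rank /= 0) call MPI_Win_shared_query(win, 0, winsize, disp_unit, &
       baseptr, ierr)
  call c_f_pointer(baseptr, table, [n])

  if (rank == 0) table = 0.0d0           ! rank 0 fills the shared table
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  ! ... every rank can now read table(:) without its own private copy ...

  call MPI_Win_free(win, ierr)
  call MPI_Finalize(ierr)
end program shared_table
```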