views: 246
answers: 2

Are SSE registers shared or duplicated between logical processors (hyper-threading)? Can I expect the same kind of speedup from parallelization for an SSE-heavy program as for a normal program? (Intel claims up to 30% for processors with hyper-threading.)

+1  A: 

They are logically duplicated: each thread gets its own copy of the register state. Physically, they may be shared; that depends on the hyperthreading implementation.

Keith Randall
+1  A: 

It's unclear to me from Intel's documentation whether hyperthreading processors share the register file between threads or have two distinct ones. I would guess they are in fact distinct, since otherwise the context-switch time between HT threads would be quite high, but this is purely a guess.

As to the speedup: it is going to depend on your instruction mix and scheduling. Remember that an HT CPU doesn't have any extra execution resources (ALUs, load/store units, etc.); the performance improvement comes from better utilization of those resources, since typical code, especially on a modern processor, spends a fair amount of time blocked waiting for memory loads and stores to complete before execution can continue. HT allows these loads and stores to be interleaved, so that when one thread stalls on a read, the other can be switched in and start using the execution resources that had been sitting idle.

I would guess that the performance increase you see from multithreading an SSE program will depend on the ratio of memory ops to arithmetic ops. If, for instance, your SSE program loads 4 SSE registers from memory, does 10,000 SSE operations on them, and then writes the 4 registers back, you're not likely to see much advantage from HT's ability to overlap memory accesses, because 99% of your program's runtime will be spent in the SIMD ALUs and not on memory access.

On the other hand, if your program is very compute-heavy, then multithreading it could improve performance greatly on multicore processors, and might give you much better than a 30% improvement, since in that case your code can use the full execution resources of multiple cores at once.

Jack Lloyd