I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?

Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.

+8  A: 

The speed depends on the TLS implementation.

Yes, you are correct: TLS can be as fast as a pointer lookup. On systems with a memory management unit it can even be faster than that.

For the pointer-lookup approach you need help from the scheduler, though: on a task switch it must update the pointer so that it points at the current thread's TLS data.

Another fast way to implement TLS is via the memory management unit. Here TLS data is treated like any other data, with the exception that TLS variables are allocated in a special segment. On a task switch, the scheduler maps the correct chunk of memory into the address space of the task.
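
As a concrete illustration of the pointer-based variant (my example, not part of the original answer): on x86-64 Linux the kernel keeps a per-thread base address in the FS segment register, so an ordinary thread-local access can compile down to one segment-relative load:

// Illustration only: a C++ thread-local counter. With the local-exec TLS
// model on x86-64 Linux, the access below typically compiles to a single
// FS-segment-relative load, roughly:
//     mov eax, DWORD PTR fs:counter@tpoff
// The kernel updates the FS base on every task switch, which is exactly
// the "scheduler updates the pointer" scheme described above.
thread_local int counter;

int bump() {
    return ++counter;
}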

If the scheduler does not support any of these methods, the compiler/library has to do the following:

  • Get the current ThreadId
  • Take a semaphore
  • Look up the pointer to the TLS block by ThreadId (perhaps using a map)
  • Release the semaphore
  • Return that pointer

Obviously, doing all this for each TLS data access takes a while and may need up to three OS calls: getting the ThreadId, and taking and releasing the semaphore.

The semaphore is, by the way, required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and thus allocating a new TLS block and modifying the data structure).
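
A rough, hypothetical C++ sketch of that slow path (not any particular runtime's actual code) shows why it hurts:

#include <map>
#include <mutex>
#include <thread>

// Hypothetical emulation of the slow TLS fallback described above: every
// access takes a lock and looks the block up by thread id.
static std::mutex g_tls_lock;                          // the "semaphore"
static std::map<std::thread::id, void*> g_tls_blocks;  // ThreadId -> TLS block

void* slow_tls_get() {
    std::thread::id tid = std::this_thread::get_id();  // get current ThreadId
    std::lock_guard<std::mutex> guard(g_tls_lock);     // take the semaphore
    auto it = g_tls_blocks.find(tid);                  // look up the TLS block
    return it == g_tls_blocks.end() ? nullptr : it->second; // return that pointer
}                                                      // semaphore released here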

Unfortunately it's not uncommon to see the slow TLS implementation in practice.

Nils Pipenbrinck
The slow version might only be needed once per stack frame, since the result could be cached and reused (the stack is TLS, after a fashion).
BCS
That's the sick idea behind LinuxThreads. It failed terribly because you have to assign the stack address space in a computable way. With LinuxThreads it was a fixed 2 MB per thread and a maximum of a few hundred threads. I'm so happy that somebody kicked Linus's ass and convinced him that not everything can be done with a "clone" system call.
Lothar
+4  A: 

One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.

To see what kind of code is generated for TLS, compile and obj2asm this code:

__thread int x; int foo() { return x; }

TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.

I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.
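
To make the caching advice concrete, here is a hypothetical sketch (Pool, allocate_from and fill are made-up stand-ins, not the actual allocator):

#include <cstddef>

// Stand-in for the allocator's per-thread state.
struct Pool {
    char* next;
    char* end;
};

thread_local Pool* tls_pool;    // set up elsewhere, once per thread

void* allocate_from(Pool* p, std::size_t n) {
    char* r = p->next;          // simplified bump allocation, no bounds check
    p->next += n;
    return r;
}

void fill(void** out, std::size_t count, std::size_t size) {
    Pool* pool = tls_pool;      // one TLS access, cached in a local
    for (std::size_t i = 0; i < count; ++i)
        out[i] = allocate_from(pool, size);   // no TLS access inside the loop
}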

Walter Bright
+2  A: 

We have seen similar performance issues from TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel". After some effort I decided to try and improve on this.

I'm pleased to say that we now have a small API that offers a > 50% reduction in CPU time for an equivalent operation when the calling thread doesn't "know" its thread-id, and a > 65% reduction if the calling thread has already obtained its thread-id (perhaps for some other earlier processing step).

The new function (get_thread_private_ptr()) always returns a pointer to a struct we use internally to hold all sorts of things, so we only need one per thread.

All in all, I think the Win32 TLS support is really poorly crafted.
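
The implementation isn't shown, but the "one pointer to one per-thread struct" idea can be sketched with the raw Win32 API; ThreadPrivate and the function body below are hypothetical, not Hugh's actual code:

#include <windows.h>

// Hypothetical sketch: a single TLS index whose value is a pointer to one
// struct holding all of the thread's private state, including its id.
struct ThreadPrivate {
    DWORD thread_id;
    // ... everything else that is per-thread ...
};

static DWORD g_tls_index = TlsAlloc();   // one index for the whole process

ThreadPrivate* get_thread_private_ptr() {
    ThreadPrivate* p = static_cast<ThreadPrivate*>(TlsGetValue(g_tls_index));
    if (!p) {                            // first access on this thread
        p = new ThreadPrivate();
        p->thread_id = GetCurrentThreadId();
        TlsSetValue(g_tls_index, p);     // freeing at thread exit is omitted
    }
    return p;                            // callers can cache this pointer
}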

Hugh

A: 

If you can't use compiler TLS support, you can manage TLS yourself. I built a wrapper template for C++, so it is easy to replace the underlying implementation. In this example, I've implemented it for Win32. Note: since you cannot obtain an unlimited number of TLS indices per process (at least under Win32), you should point to heap blocks large enough to hold all thread-specific data. This way you have a minimum number of TLS indices and related queries. In the "best case", you'd have just one TLS pointer pointing to one private heap block per thread.

In a nutshell: don't point to single objects; instead, point to thread-specific heap memory/containers holding object pointers to achieve better performance.

Don't forget to free memory if it isn't used again. I do this by wrapping a thread in a class (like Java does) and handling TLS in the constructor and destructor. Furthermore, I store frequently used data like thread handles and IDs as class members.

usage:

  for type*:             tl_ptr<type>
  for const type*:       tl_ptr<const type>
  for type* const:       const tl_ptr<type>
  for const type* const: const tl_ptr<const type>


#include <windows.h>   // TlsAlloc / TlsGetValue / TlsSetValue / TlsFree
#include <cassert>

// Wraps one Win32 TLS index so an instance behaves like a thread-local T*.
template<typename T>
class tl_ptr {
protected:
    DWORD index;

    // Not copyable: a copy would share the index and TlsFree it twice.
    tl_ptr(const tl_ptr&);
public:
    tl_ptr(void) : index(TlsAlloc()){
        assert(index != TLS_OUT_OF_INDEXES);
        set(NULL);
    }
    void set(T* ptr){
        TlsSetValue(index,(LPVOID) ptr);
    }
    T* get(void)const {
        return (T*) TlsGetValue(index);
    }
    tl_ptr& operator=(T* ptr){
        set(ptr);
        return *this;
    }
    tl_ptr& operator=(const tl_ptr& other){ // copies the value, not the index
        set(other.get());
        return *this;
    }
    T& operator*(void)const{
        return *get();
    }
    T* operator->(void)const{
        return get();
    }
    ~tl_ptr(){
        TlsFree(index);
    }
};
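
For example, a hypothetical usage of the "one heap block per thread" approach (ThreadData and worker are made-up names):

// One TLS index for the whole process, pointing at a single heap block that
// holds all of the thread's data.
struct ThreadData {
    void* current_pool;
    int   scratch[256];
    // ... everything else the thread needs ...
};

tl_ptr<ThreadData> g_thread_data;

DWORD WINAPI worker(LPVOID) {
    ThreadData* data = new ThreadData(); // allocate the block on thread start
    g_thread_data = data;                // the only TlsSetValue this thread does
    // ... use g_thread_data->current_pool, g_thread_data->scratch, ...
    delete g_thread_data.get();          // free it before the thread exits
    return 0;
}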
sam
This question is about D, not C++, and you don't address any of the questions asked by the OP.
Roger Pate
True, I added some more about the concept I've been following so far (IMHO, the language doesn't really matter).
sam