I've been pondering this question for a while:

Can you build a faster fundamental datastructure (i.e. linked list, hashtable, set, skiplist, bloom filter, red-black tree, etc.) on a multicore machine, by taking advantage of the fact that you've got more than one CPU?

I did some preliminary experimenting with pthreads, and found that pthread_create() takes on the order of 30us, but a simple hash_map insert takes far less time than that on a single core. And thus, it's become hard for me to imagine creating a faster hash_map<>, since synchronization primitives and thread creation are so slow. I can also imagine tree traversal and balancing in parallel, but again, synchronization primitives would seem to make runtimes longer, not shorter.

It still feels intuitive to me that "I've got more CPU, and thus, I should be able to do it faster", but I can't quite wrap my head around a proof or counter-proof for that statement. I've experimented quite a bit in C++, but am now suspecting that other languages may offer better solutions (erlang?) for this task. Thoughts?

EDIT details: I think there are several programming / datastructure paradigms that are frequently used that could possibly be sped up. For example, I find myself frequently writing code that basically looks like this (where real data has been replaced with "rand()")

static const int N = 1000000; 
static const int M = 10000000; // 10x more lookups 
hash_map<int, int> m; 
// batch insert a bunch of interesting data 
for (int i = 0; i < N; i++) m[rand()] = rand(); 

// Do some random access lookups. 
for (int i = 0; i < M; i++) m[rand()]++;

This kind of paradigm is frequently used for things like name-value settings & configuration data, batch processing, etc. The 10x (or more) lookup/insert ratio is what makes a traditional hash_map<> ideal for this sort of operation.

This could easily be split in half, with an insert phase and a lookup phase, and in a parallel world, there may be some "flush queue" operation between the two halves. More difficult is the interleaved insert + lookup version:

hash_map<int, int> m; 

for (int i = 0; i < N; i++) { 
   if (rand() % LOOKUP_RATIO == 0) 
     m[rand()]++;  // "lookup" 
   else 
     m[rand()] = rand();  // "insert" 
}

In that scenario, insert could be asynchronous as long as the insert queue were flushed before each lookup, and if LOOKUP_RATIO is large enough (say, >1000) then it becomes quite similar to the batch example above, but with some queueing. Although, the queueing implies synchronization primitives.

Imagine for a second, the following snippet:

hash_map<int,int> a;
hash_map<int,int> b; 
for (int i = 0; i < N; i++) { 
  // the following 2 lines could be executed in parallel 
  a[rand()] = rand(); 
  b[rand()] = rand(); 
}

And thus, a lookup might be done in "parallel" by:

int lookup(int value) { 
  // The following 2 lines could be executed in parallel: 
  int v1 = a[value]; 
  int v2 = b[value]; 
  if (v1)  // pseudo code for "value existed in a" 
    return v1; 
  else 
    return v2; 
}
+1  A: 

I would think you'd need to look at the data structures and ask "What in this can be done Asynchronously?"

And for a lot of data structures, there isn't much if anything that I see.

But for some more esoteric or less-used structures, I bet there is. I bet rebalancing some sorts of trees could be parallelized. I bet traversing graphs could be (although that may be more algorithm than data structure). I bet traversing a doubly-linked list (from each end) could be.

Tom Ritter
Well, not so much "asynchronously" but "in batch" or "queued" is the right mindset, I think. Take a 2nd look at my hash_map<> example -- any set of accesses on a hashtable can be thought of as a sequence of inserts followed by a sequence of lookups. Queueing could give you transparent parallelism.
slacy
+6  A: 

The problem is that shared data is itself the bane of parallel computing. Ideally you want each core to be working on separate data, otherwise there will be overhead associated with synchronization. (How to communicate without shared state? By message passing.)

Also, it's a bit strange to talk about data structures being sped up. I find it more natural to talk about operations on the data structure being sped up, since different operations on different data structures have different characteristics. Is there a particular type of access that you want to speed up?

EDIT, in response to the extra details: I assume that the goal is to have a single hash map which can be accessed in parallel, and its underpinnings could be multiple hash tables, but which would be transparently presented to the user of this data structure as a single hash table. Naturally, we would be concerned about spending too much time spinning on locks. Also at this level, we have to be aware of cache consistency problems. That is, if cores or processors have separate caches pointing to the same data, and one modifies the data, then the cached data on the other is invalidated. If this happens repeatedly, it can impose huge costs, and parallelism could be worse than having things on a single core. So I'm very wary of shared data.

My instinct would be to have a pool of threads, each of which owns a different section of the hash table. The Hash would first map from key to hash table section, and then to an offset within that section. The update would be passed as a message to that thread which owns that section of the hash table. And that way, no one is trying to modify the same thing at once. Naturally, this is easier in languages (Erlang) which have features for asynchronous message passing concurrency than in others.

Rob Lachlan
There are a couple of standard use cases that would be interesting to optimize. One would look like this: parallel_hash_map<int,int> m; for (i = 0; i < N; i++) m[rand()] = rand(); for (i = 0; i < N; i++) m[rand()]++; Which would simulate a "batch insert + batch lookup" system.
slacy
Ugh, botched formatting -- see my edits to the original question.
slacy
I don't think that we're really going to come up with an optimal solution-- this is an ongoing area of research. But nice question, +1.
Rob Lachlan
See my other comments on the other answers -- anything involving a shared queue or pushing function calls to other threads means locking primitives, which are slower than just doing the hash_map<> operation directly. I like Zan's queueing answer the best...
slacy
It's true that each thread's input queue will need locking, but properly done the overhead needn't be prohibitive. Zan's scheme is similar in that he divides to table into per thread areas, and sends updates to the right thread. This is the way I would implement that scheme.
Rob Lachlan
+2  A: 

first, i don't think it's appropriate to compare pthread_create() time with a hashmap operation. better compare with (un)lock times, both in the contended and noncontended cases.

still, you're right, synchronisation times are the bottleneck and getting worse, since they have to go to the inter-CPU bus/bridge/channel, whatever, while most other datastructs try to stay in cache (or even in shadow registers).

there are two main directions to attack this problem:

  1. better shared structures: check lock-free structures and/or transactional-memory. both try to maximize accessibility by replacing the 'lock-modify-release' cycle with 'try-check-commit/rollback'. in most cases, the check should succeed, so the rollback shouldn't affect average performance. usually the check/commit is done atomically, so it's expensive in terms of CPU bandwidth, but it's a lot less than traditional locks.

  2. less sharing: that's what erlang/haskell languages emphasize. by making it easy and inexpensive to transfer small messages, inter-thread communication looks more like function calls with parameters, and less like shared memory. this is far more scalable, since only two processes have to synchronize, and could (in theory) use non-RAM channels with lower latencies.

edit: i'm surprised nobody has any opinion about lock-free structures. check this (pdf) and this (video) about a lock-free hashtable implementation in Java that scales (almost) linearly up to 300 CPUs

Javier
If erlang, haskell make inter-thread communication "look like" function calls, does that mean that they're comparable in performance to a C++ function call? I can imagine that a parallel-focused language might make IPC fast, but doesn't that mean that regular function calls become slow?
slacy
no, regular function calls are still function calls. only inter-process 'function calls' are message passing underneath. also, they 'look like' but are not indistinguishable from regular calls
Javier
+3  A: 

I deal with this question every day. I have found that things like linked lists are really useful, as you can have each thread of your parallel algorithm construct its own linked list, then just sew them together on the master when you're done. Almost no overhead, as long as your threads are truly independent.

If you ever have arrays of data to use, I find it's almost always better to allocate a smaller array to work on for each thread, then merge the small arrays back into the master array upon completion - in fact, if you are in a clustered environment, using the "same" array isn't even a possibility!

If you're implementing an algorithm that uses associative arrays (think .NET Dictionary), you're almost always going to duplicate some work somewhere between threads. Try to avoid these when possible.

If you're coding for a CUDA (GPU) environment, you'll learn very quickly that the whole world can (nay, should!) be recast as an array before operating :)

Mike
Yeah, what I'm hoping for is something along these lines, but I'm having a hard time wrapping my head around the details. I'm imagining something like memcached that runs on localhost but offers better performance than a "native" hash_map<>.
slacy
Massively parallel stuff like CUDA is weird. It can be faster to forget everything you knew about sorted structures and binary search and just compare every item in A to every item in B!
Zan Lynx
@Zan Lynx: haha :) So true. I'm vectorizing some scientific programs against CUDA now. To quote Yoda: "you must unlearn what you have learned"
Mike
+1  A: 

I do not believe there is much parallelism to be had in a single lookup. But if you have an entire list of items to look up, it is a different case.

Take a hash table and a large list of keys to look up in it. Splitting the key list between 2 CPUs would roughly double performance.

Or take a large list of items to insert. Divide the hash table into per-CPU areas and divide the key list. Then each CPU can stuff items into its own hash table.

This also applies to vectors, B+trees and binary trees although I believe hash tables can be constructed to need slightly less locking for updates.

Zan Lynx
A: 

Javier has a good point: if you're performing operations in parallel, you've already got the threads, you just need to give them something to do.

I think a lot of what this comes down to is the standard readers and writers problem. You should be able to have a virtually unlimited number of threads using a hash table if all they're doing is reads or other non-destructive operations. However, once one of them needs to do a write, then they have to grab an exclusive lock on the whole hash table (unless you hash your key externally first, then they could in theory just get a lock on the bucket they hash to, depending on your collision resolution mechanism).

One thing to consider is having one (or a small pool) of threads per data structure, and treat access as a "service". That is, instead of a thread looking something up in the hash map, it issues a synchronous request to the thread servicing that data structure. That localizes lock operations (only the threads servicing the requests have to know about the locking technique), but may make the request queue a bottleneck.

I think, as someone else has said, that the best way to exploit parallelism is through your algorithms, not the data structures.

TMN
This gets to the heart of the problem: You suggest using a threadpool and synchronous calls to those threads. I've found that the time to make a synchronous function call in another thread (i.e. pthread_create) is just as long as actually doing the operation on the hashtable directly.
slacy
@slacy: That isn't how you do it. You write a "message", which can just be a variable somewhere or better yet a message queue, and unlock the waiting thread. Much faster than thread creation.
Zan Lynx
@slacy: But then you've got your hashtable access control (locking and waiting) spread throughout your code, and all it takes is one place in the code to get that wrong and you've got a subtle data-dependent bug that might take days to find.
TMN
A: 

Put everything into work queues. That is the key -- and gets you closer to scaling across multiple machines. Synchronization is expensive and will only get more expensive later on (imagine having a memory barrier with 128 CPUs).

twk
hey, you have to hand off a batch of work each time you lock, or it isn't worth it
twk