I've been pondering this question for a while:

Can you build a faster fundamental datastructure (i.e. linked list, hashtable, set, skiplist, bloom filter, red-black tree, etc.) on a multicore machine, by taking advantage of the fact that you've got more than one CPU?

I did some preliminary experimenting with pthreads, and found that pthread_create() takes on the order of 30us, but a simple hash_map insert takes far less time than that on a single core. And thus, it's become hard for me to imagine creating a faster hash_map<>, since synchronization primitives and thread creation are so slow. I can also imagine tree traversal and balancing in parallel, but again, synchronization primitives would seem to make runtimes longer, not shorter.

It still feels intuitive to me that "I've got more CPU, and thus, I should be able to do it faster", but I can't quite wrap my head around a proof or counter-proof for that statement. I've experimented quite a bit in C++, but am now suspecting that other languages may offer better solutions (erlang?) for this task. Thoughts?

EDIT details: I think there are several programming / datastructure paradigms that are frequently used that could possibly be sped up. For example, I find myself frequently writing code that basically looks like this (where real data has been replaced with "rand()")

static const int N = 1000000; 
static const int M = 10000000; // 10x more lookups 
hash_map<int, int> m; 
// batch insert a bunch of interesting data 
for (int i = 0; i < N; i++) m[rand()] = rand(); 

// Do some random access lookups. 
for (int i = 0; i < M; i++) m[rand()]++;

This kind of paradigm is frequently used for things like name-value settings & configuration data, batch processing, etc. The 10x (or more) lookup/insert ratio is what makes a traditional hash_map<> ideal for this sort of operation.

This could easily be split in half, with an insert phase and a lookup phase, and in a parallel world, there may be some "flush queue" operation between the two halves. More difficult is the interleaved insert + lookup version:

hash_map<int, int> m; 

for (int i = 0; i < N; i++) { 
   if (rand() % LOOKUP_RATIO == 0) 
     m[rand()]++;  // "lookup" 
   else 
     m[rand()] = rand();  // "insert" 
}

In that scenario, insert could be asynchronous as long as the insert queue were flushed before each lookup, and if LOOKUP_RATIO is large enough (say, >1000) then it becomes quite similar to the batch example above, but with some queueing. Although, the queueing implies synchronization primitives.

Imagine for a second, the following snippet:

hash_map<int,int> a;
hash_map<int,int> b; 
for (int i = 0; i < N; i++) { 
  // the following 2 lines could be executed in parallel 
  a[rand()] = rand(); 
  b[rand()] = rand(); 
}

And thus, a lookup might be done in "parallel" by:

int lookup(int value) { 
  // The following 2 lines could be executed in parallel: 
  int v1 = a[value]; 
  int v2 = b[value]; 
  if (v1)  // pseudo code for "value existed in a" 
    return v1; 
  else 
    return v2; 
}
+1  A: 

I would think you'd need to look at the data structures and ask "What in this can be done Asynchronously?"

And for a lot of data structures, there isn't much if anything that I see.

But for some more esoteric or less-used structures, I bet there is. I bet rebalancing some sorts of trees could be parallelized. I bet traversing graphs could be (although that may be more algorithm than data structure). I bet traversing a doubly-linked list (from each end) could be.

Tom Ritter
Well, not so much "asynchronously" but "in batch" or "queued" is the right mindset, I think. Take a 2nd look at my hash_map<> example -- any set of accesses on a hashtable can be thought of as a sequence of inserts followed by a sequence of lookups. Queueing could give you transparent parallelism.
slacy
+6  A: 

The problem is that shared data is itself the bane of parallel computing. Ideally you want each core to be working on separate data, otherwise there will be overhead associated with synchronization. (How to communicate without shared state? By message passing.)

Also, it's a bit strange to talk about data structures being sped up. I find it more natural to talk about operations on the data structure being sped up, since different operations on different data structures have different characteristics. Is there a particular type of access that you want to speed up?

EDIT, in response to the extra details: I assume that the goal is to have a single hash map which can be accessed in parallel, and its underpinnings could be multiple hash tables, but which would be transparently presented to the user of this data structure as a single hash table. Naturally, we would be concerned about spending too much time spinning on locks. Also at this level, we have to be aware of cache consistency problems. That is, if cores or processors have separate caches pointing to the same data, and one modifies the data, then the cached data on the other is invalidated. If this happens repeatedly, it can impose huge costs, and parallelism could be worse than having things on a single core. So I'm very wary of shared data.

My instinct would be to have a pool of threads, each of which owns a different section of the hash table. The Hash would first map from key to hash table section, and then to an offset within that section. The update would be passed as a message to that thread which owns that section of the hash table. And that way, no one is trying to modify the same thing at once. Naturally, this is easier in languages (Erlang) which have features for asynchronous message passing concurrency than in others.

Rob Lachlan
There are a couple of standard use cases that would be interesting to optimize. One would look like this: parallel_hash_map<int,int> m; for (i = 0; i < N; i++) m[rand()] = rand(); for (i = 0; i < N; i++) m[rand()]++; Which would simulate a "batch insert + batch lookup" system.
slacy
Ugh, botched formatting -- see my edits to the original question.
slacy
I don't think that we're really going to come up with an optimal solution-- this is an ongoing area of research. But nice question, +1.
Rob Lachlan
See my other comments on the other answers -- anything involving a shared queue or pushing function calls to other threads means locking primitives, which are slower than just doing the hash_map<> operation directly. I like Zan's queueing answer the best...
slacy
It's true that each thread's input queue will need locking, but properly done the overhead needn't be prohibitive. Zan's scheme is similar in that he divides to table into per thread areas, and sends updates to the right thread. This is the way I would implement that scheme.
Rob Lachlan
+2  A: 

first, i don't think it's appropriate to compare pthread_create() time with a hashmap operation. better compare with (un)lock times, both in the contended and noncontended cases.

still, you're right, synchronisation times are the bottleneck and getting worse, since they have to go to the inter-CPU bus/bridge/channel, whatever, while most other datastructs try to stay in cache (or even in shadow registers).

there are two main directions to attack this problem:

  1. better shared structures: check lock-free structures and/or transactional-memory. both try to maximize accessibility by replacing the 'lock-modify-release' cycle with 'try-check-commit/rollback'. in most cases, the check should succeed, so the rollback shouldn't affect average performance. usually the check/commit is done atomically, so it's expensive in terms of CPU bandwidth, but it's a lot less than traditional locks.

  2. less sharing: that's what erlang/haskell languages emphasize. by making it easy and inexpensive to transfer small messages, inter-thread communication looks more like function calls with parameters, and less like shared memory. this is far more scalable, since only two processes have to synchronize, and could (in theory) use non-RAM channels with lower latencies.

edit: i'm surprised nobody has any opinion about lock-free structures. check this (pdf) and this (video) about a lock-free hashtable implementation in Java that scales (almost) linearly up to 300 CPUs

Javier
If erlang, haskell make inter-thread communication "look like" function calls, does that mean that they're comparable in performance to a C++ function call? I can imagine that a parallel-focused language might make IPC fast, but doesn't that mean that regular function calls become slow?
slacy
no, regular function calls are still function calls. only inter-process 'function calls' are message passing underneath. also, they 'look like' but are not indistinguishable from regular calls
Javier
+3  A: 

I deal with this question every day. I have found that things like linked lists are really useful, as you can have each thread of your parallel algorithm construct its own linked list, then just sew them together on the master when you're done. Almost no overhead, as long as your threads are truly independent.

If you ever have arrays of data to use, I find it's almost always better to allocate a smaller array to work on for each thread, then merge the small arrays back into the master array upon completion - in fact, if you are in a clustered environment, using the "same" array isn't even a possibility!

If you're implementing an algorithm that uses associative arrays (think .NET Dictionary), you're almost always going to duplicate some work somewhere between threads. Try to avoid these when possible.

If you're coding for a CUDA (GPU) environment, you'll learn very quickly that the whole world can (nay, should!) be recast as an array before operating :)

Mike
Yeah, what I'm hoping for is something along these lines, but I'm having a hard time wrapping my head around the details. I'm imagining something like memcached that runs on localhost but offers better performance than a "native" hash_map<>.
slacy
Massively parallel stuff like CUDA is weird. It can be faster to forget everything you knew about sorted structures and binary search and just compare every item in A to every item in B!
Zan Lynx
@Zan Lynx: haha :) So true. I'm vectorizing some scientific programs against CUDA now. To quote Yoda: "you must unlearn what you have learned"
Mike
+1  A: 

I do not believe there is much parallelism to be had in a single lookup. But if you have an entire list of items to look up, it is a different case.

Take a hash table and a large list of keys to look up in it. Splitting the key list between 2 CPUs would roughly double performance.

Or take a large list of items to insert. Divide the hash table into per-CPU areas and divide the key list. Then each CPU can stuff items into its own hash table.

This also applies to vectors, B+trees and binary trees although I believe hash tables can be constructed to need slightly less locking for updates.

Zan Lynx
A: 

Javier has a good point: if you're performing operations in parallel, you've already got the threads, you just need to give them something to do.

I think a lot of what this comes down to is the standard readers and writers problem. You should be able to have a virtually unlimited number of threads using a hash table if all they're doing is reads or other non-destructive operations. However, once one of them needs to do a write, then they have to grab an exclusive lock on the whole hash table (unless you hash your key externally first, then they could in theory just get a lock on the bucket they hash to, depending on your collision resolution mechanism).

One thing to consider is having one (or a small pool) of threads per data structure, and treat access as a "service". That is, instead of a thread looking something up in the hash map, it issues a synchronous request to the thread servicing that data structure. That localizes lock operations (only the threads servicing the requests have to know about the locking technique), but may make the request queue a bottleneck.

I think, as someone else has said, that the best way to exploit parallelism is through your algorithms, not the data structures.

TMN
This gets to the heart of the problem: You suggest using a threadpool and synchronous calls to those threads. I've found that the time to make a synchronous function call in another thread (i.e. pthread_create) is just as long as actually doing the operation on the hashtable directly.
slacy
@slacy: That isn't how you do it. You write a "message", which can just be a variable somewhere or better yet a message queue, and unlock the waiting thread. Much faster than thread creation.
Zan Lynx
@slacy: But then you've got your hashtable access control (locking and waiting) spread throughout your code, and all it takes is one place in the code to get that wrong and you've got a subtle data-dependent bug that might take days to find.
TMN
A: 

Put everything into work queues. That is the key -- and gets you closer to scaling across multiple machines. Synchronization is expensive and will only get more expensive later on (imagine having a memory barrier with 128 CPUs).

twk
hey, you have to hand off a batch of work each time you lock, or it isn't worth it
twk