I've been pondering this question for a while:
Can you build a faster fundamental datastructure (i.e. linked list, hashtable, set, skiplist, bloom filter, red-black tree, etc.) on a multicore machine, by taking advantage of the fact that you've got more than one CPU?
I did some preliminary experimenting with pthreads and found that pthread_create() takes on the order of 30us, while a simple hash_map insert takes far less time than that on a single core. It has therefore become hard for me to imagine building a faster hash_map<>, since synchronization primitives and thread creation are so slow. I can also imagine parallelizing tree traversal and rebalancing, but again, the synchronization primitives would seem to make runtimes longer, not shorter.
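For what it's worth, the kind of measurement behind that 30us number can be sketched like this (a rough sketch, not a rigorous benchmark; the function names are mine and the numbers will vary by machine and libc):

```cpp
#include <chrono>
#include <pthread.h>
#include <unordered_map>

static void* noop(void*) { return nullptr; }

// Nanoseconds to create and join one no-op thread.
long long measure_thread_ns() {
    auto t0 = std::chrono::steady_clock::now();
    pthread_t tid;
    pthread_create(&tid, nullptr, noop, nullptr);
    pthread_join(tid, nullptr);
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

// Amortized nanoseconds per hash_map-style insert.
long long measure_insert_ns() {
    std::unordered_map<int, int> m;
    const int N = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i++) m[i] = i;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now() - t0).count() / N;
}
```

On my machine the per-insert cost comes out orders of magnitude below the thread-creation cost, which is the whole problem.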
It still feels intuitive to me that "I've got more CPU, and thus, I should be able to do it faster", but I can't quite wrap my head around a proof or counter-proof for that statement. I've experimented quite a bit in C++, but am now suspecting that other languages may offer better solutions (erlang?) for this task. Thoughts?
EDIT details: I think there are several frequently used programming / datastructure paradigms that could possibly be sped up. For example, I find myself frequently writing code that basically looks like this (where real data has been replaced with "rand()"):
static const int N = 1000000;
static const int M = 10000000; // 10x more lookups
hash_map<int, int> m;
// batch insert a bunch of interesting data
for (int i = 0; i < N; i++) m[rand()] = rand();
// Do some random access lookups.
for (int i = 0; i < M; i++) m[rand()]++;
This kind of paradigm is frequently used for things like name-value settings & configuration data, batch processing, etc. The 10x (or more) lookup/insert ratio is what makes a traditional hash_map<> ideal for this sort of operation.
This could easily be split in half, with an insert phase and a lookup phase, and in a parallel world, there may be some "flush queue" operation between the two halves. More difficult is the interleaved insert + lookup version:
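One concrete way to exploit that two-phase structure without any locks: shard the map by key, so each thread owns its own hash_map during the insert phase and lookups are routed to the owning shard. A minimal sketch with std::thread and std::unordered_map (the two-shard, parity-based routing is just an illustration):

```cpp
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Insert phase: each thread owns one shard; key parity picks the shard,
// so the threads never touch the same map and need no synchronization.
std::vector<std::unordered_map<int, int>>
sharded_insert(const std::vector<std::pair<int, int>>& data) {
    std::vector<std::unordered_map<int, int>> shards(2);
    auto worker = [&](int which) {
        for (const auto& kv : data)
            if ((kv.first & 1) == which)
                shards[which][kv.first] = kv.second;
    };
    std::thread t0(worker, 0), t1(worker, 1);
    t0.join();
    t1.join();
    return shards;
}

// Lookup phase: route each lookup to the shard that owns the key.
int sharded_lookup(std::vector<std::unordered_map<int, int>>& shards, int key) {
    return shards[key & 1][key];
}
```

Note that both threads scan the full input here; a real version would pre-partition the data, but that doesn't change the lock-free property.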
hash_map<int, int> m;
for (int i = 0; i < N; i++) {
  if (rand() % LOOKUP_RATIO == 0)
    m[rand()]++;          // "lookup"
  else
    m[rand()] = rand();   // "insert"
}
In that scenario, inserts could be asynchronous as long as the insert queue were flushed before each lookup; and if LOOKUP_RATIO is large enough (say, >1000), it becomes quite similar to the batch example above, but with some queueing. The queueing itself, though, implies synchronization primitives.
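To make the "flush queue" idea concrete, here's a single-threaded sketch where inserts are buffered and only applied when a lookup forces a flush (the class name and shape are mine; in a parallel version a worker thread would drain the queue, which is exactly where the synchronization cost comes back in):

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Buffers inserts and applies them lazily; a lookup flushes first, so it
// always observes every insert issued before it.
class BufferedMap {
public:
    void insert(int key, int value) { queue_.emplace_back(key, value); }

    int lookup(int key) {
        flush();  // pending inserts must land before any lookup
        return map_[key];
    }

private:
    void flush() {
        for (const auto& kv : queue_) map_[kv.first] = kv.second;
        queue_.clear();
    }

    std::unordered_map<int, int> map_;
    std::vector<std::pair<int, int>> queue_;
};
```

With a large LOOKUP_RATIO, each flush amortizes over many queued inserts, which is what makes the batching attractive in the first place.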
Imagine for a second, the following snippet:
hash_map<int,int> a;
hash_map<int,int> b;
for (int i = 0; i < N; i++) {
// the following 2 lines could be executed in parallel
a[rand()] = rand();
b[rand()] = rand();
}
And thus, a lookup might be done in "parallel" by:
int lookup(int value) {
  // The following 2 lines could be executed in parallel:
  int v1 = a[value];
  int v2 = b[value];
  if (v1)  // pseudo code for "value existed in a"
    return v1;
  else
    return v2;
}