views:

107

answers:

2

I am looking for an approach that will let me assign ordinal numbers 0..(N-1) to N O/S threads, such that the threads are in numeric order. That is, the thread that gets will have a lower O/S thread ID than the thread with ordinal 1.

In order to carry this out, the threads communicate via a shared memory space.

The memory ordering model is such that writes will be atomic (if two concurrent threads write a memory location at the same time, the result will be one or the other). The platform will not support atomic compare-and-set operations.

I am looking for an algorithm that is efficient in the number of writes to shared memory, and will complete rapidly with up to tens of thousands of threads, and with no nasty worse-case thread arrival conditions.

The O/S will assign thread numbers in arbitrary order throughout a 32-bit space. There may be arbitrary thread creation delays - the algorithm can be considered complete when all N threads are present.

I am unable to use the obvious solution of collecting all the threads, and then having one thread sort them - without an atomic operation, I have no way of safely collecting all the individual threads (another thread could rewrite the slot).

+1  A: 

If I have this right you want to map Integer->Integer where the input in an arbitrary 32 bit number, and the output is a number from 0-N where N is the number of threads?

In that case, every time you create a new thread call this method, the returned value is the ID:

integer nextId = 0;
integer GetInteger()
{
    return AtomicIncrement(nextId);
}

This algorithm is obviously O(N)

Assuming several things:

  1. threads never die
  2. You have some kind of atomic increment which increments a number and returns the old value or the new value, the difference is between 0 based and 1 based IDs
  3. Threads call a method asking what their ID is just once when they're created
Martin
Where can I find an implementation of AtomicIncrement that doesn't rely on Atomic primitives? Atomic Compare-and-Set seems necessary to implement AtomicIncrement.
caffiend
some platforms have an intrinsic AtomicIncrement operator, if not, then you need to build it with CAS and you're stuffed.
Martin
Because this only runs once for each thread, you could just use a spinlock. In most cases it won't spin at all, in some cases it will take a while longer but even then not very long unless you're constantly creating hundreds of threads a second.
Martin
Wouldn't a spinlock also require an atomic operation?
caffiend
Normally they're implemented with a CAS, I suspect you can do it without one though. See my question here http://stackoverflow.com/questions/2754968/how-to-write-a-spinlock-without-using-cas
Martin
Of course, a spinlock isn't necessarily required here. You could just use a normal lock. It's a bit slower but that's ok, the runtime of this method is spread across all the time you're starting new threads, and is a tiny overhead for each thread.
Martin
i'm not sure what you mean by "normal lock"?
caffiend
just a normal thread lock which puts the thread to sleep and is usually provided by the operating system. in C# it's the "lock" keyword, for example
Martin
+3  A: 

With no claim to being optimal in any sense (there are clearly faster ways to do this with atomic compare-and-set operations, or as Martin indicated, atomic increment)...

Assuming N is known to all the threads, and each thread has a unique non-zero ID value, such as its stack address in 32-bit space...

Use an array of size N in shared space; ensure that this array is initialized to zero.

Each thread owns the first slot in the array that holds an ID lower than or equal to the thread's ID; the thread writes its ID there. This continues until the array is full of non-zero values, and all the values are in decreasing order.

At the completion of the algorithm, the index of the thread's slot in the array is its ordinal number.

Doug Currie
I'm not sure I completely understand this - making assumptions, it seems like you could have a scenario where all thread but the lowest one have slots. The low one comes in, overwrites the current lowest one. The 2nd-to-lowest now overwrites the 3rd, and this chain continues until all threads have moved by one?
caffiend
@caffiend, yes, you could certainly get the behavior you describe; that's fine, though worst case kind of behavior. It's fine because the slots aren't final until the algorithm terminates; it's worst case since every thread has to "start over" when a new low ID thread enters the party. Thinking about the worst case, each thread would move its ID in the array at most N times, so it is a O(N-squared) algorithm [Note that the threads only ever need to look at slots at or beyond their present slot in the array, so they only traverse the array once].
Doug Currie
O(N**2) is very scary with the potential number of threads - worse case unresponsiveness is important, and I think this algorithm needs to run until all threads have numbers.. Thinking about the array, it seems this ought to be solveable in O(log N) using sort techniques?
caffiend
The O(log N) sort assumes that all IDs are known to one thread, but you said you couldn't "collect" all the threads. I wouldn't be too scared by O(N**2) since the constant factor is very small (a memory access).
Doug Currie
The memory accesses seem to incur a large penalty when these threads are split across physical devices because only one device can 'own' a memory line. Would it make sense to use a hash-with-secondary-probing method instead of the linear probing?
caffiend
A randomized/hashing version can work: a thread may (a) have at most one slot with its ID, and (b) write a slot if it contains zero. Each thread chooses a random slot in the array; if it is zero, it writes its ID there; if it is not zero it tries another slot. If the contents of that remains its ID, the thread idles. Otherwise it chooses another random slot and tries again. -- The problem with this algorithm is determining when it is safe to stop waiting; the threads must all wait for the array to fill, and since the slots are written randomly, that means reading all N slots, so O(N*N) again.
Doug Currie
This seems to be the only viable approach - but the O(N*N) is bothersome and seems noticeable in my experiment. Is there a reason that this MUST be O(N*N), or is there a chance that something clever could make this better?
caffiend
Lots of people smarter than me have worked on process synchronization without atomic increment or CAS. Their solutions take O(N) per sync even *after* each thread has an ordinal (see Dekker's, Peterson's, and Lamport's solutions on http://en.wikipedia.org/wiki/Mutex). Now with N threads each doing an O(N) sync, guess what, O(N*N).
Doug Currie
I can see the reason in that, but is it required that the assignment of an ordinal require N x O(N)?
caffiend
That's seems to be the best I can do without an atomic read-modify-write primitive.
Doug Currie