You can allocate state in thread-local memory. Thread-local memory is thread safe (as long as you don't pass pointers around).
You can use two integers to generate the unique number, and only one of them needs to be synchronized.
Integer 1: an incrementing integer that identifies the thread. A new number is generated each time you initialize a thread (which should be a rare event).
Integer 2: on thread initialization, this integer starts at 0.
You use both integers -- which are stored in thread-local memory -- together as the unique number, and integer 2 is incremented normally (unlocked).
This way the generation of unique integers is absolutely thread safe -- i.e., you don't have to use an atomic CPU instruction such as Interlocked.Increment (which does incur hardware-level performance penalties).
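A minimal sketch of this scheme (in Java, with `AtomicInteger` playing the role of Interlocked.Increment; all names here are my own, not from the original):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Two-integer scheme: a synchronized counter hands out a thread id once
// per thread (rare), and a plain per-thread counter is incremented
// without any locking on the hot path.
public final class ThreadLocalUidGenerator {
    // Integer 1: incremented only the first time a thread asks for an id,
    // so the atomic cost is paid once per thread.
    private static final AtomicInteger nextThreadId = new AtomicInteger();

    // Per-thread state: { threadId, Integer 2 (the unlocked counter) }.
    private static final ThreadLocal<long[]> state = ThreadLocal.withInitial(
            () -> new long[] { nextThreadId.getAndIncrement(), 0 });

    // Pack (threadId, localCounter) into one 64-bit unique id.
    // No lock and no interlocked instruction here.
    public static long nextId() {
        long[] s = state.get();
        return (s[0] << 32) | (s[1]++ & 0xFFFFFFFFL);
    }
}
```

Each thread draws from its own sequence, so two threads can never produce the same (threadId, counter) pair even though the counter itself is unsynchronized.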
-- edit: cache coherence --
from:
> **Cache Coherence**
>
> To decrease the time required for memory access, different caches are used: recently accessed memory is duplicated in the CPU cache, which is significantly faster than common memory. Future accesses to the same address will use the data saved in the cache, decreasing fetch time. The problems appear in SMP (symmetric multiprocessing) systems, where several processors have their own cache memory: when one processor changes a variable in a memory region used by several processors simultaneously, it actually changes its own copy of the variable, located in its cache, while the shared variable still has the original value. This problem cannot be solved by using the `volatile` keyword on a shared variable, since that only guarantees that a write-to-memory instruction will be present in the resulting program, while the operations related to the cache remain unspecified. Of course, it is possible to disable the CPU cache by mapping memory as no-cache (the `PAGE_NOCACHE` protection flag in the `VirtualAlloc()` Win32 API function), but along with a significant slowdown this imposes some limitations: for example, interlocked instructions may raise a hardware exception on no-cache memory.
>
> For the correct operation of SMP systems, data stored in the cache of more than one processor should be the same in all caches. This means that CPU caches must be synchronized (kept coherent) at the hardware level. **But it is important to note that cache synchronization (the flow of cache-coherency traffic) happens asynchronously with program execution:** when one CPU changes the value of a shared variable, another CPU temporarily observes the old value. The CPU continues execution without waiting for the cache-coherency operation to complete. Furthermore, if two variables (a, then b) were changed by the first CPU, another CPU could observe that b changed earlier than a.
>
> Interlocked instructions differ considerably on this point. An interlocked instruction is a command to operate directly on physical memory under a locked bus. Because of this, cache inconsistency does not affect programs in which shared variables are accessed only with interlocked instructions (note that both sides, the one that reads the variable and the one that writes to it, must use interlocked instructions).
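The quoted requirement that both the reader and the writer go through interlocked operations can be sketched like this (a Java sketch, with `AtomicLong` standing in for the `Interlocked.*` APIs; the class name is my own):

```java
import java.util.concurrent.atomic.AtomicLong;

// Both sides of a shared counter go through the atomic (interlocked)
// API. A plain, non-atomic read of an ordinary field could observe a
// stale value while coherency traffic is still in flight.
public class SharedCounter {
    private final AtomicLong value = new AtomicLong();

    // Writer: atomic read-modify-write (the "locked bus" operation).
    public void increment() {
        value.incrementAndGet();
    }

    // Reader: atomic load, so it cannot observe a stale cached value.
    public long read() {
        return value.get();
    }
}
```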
-- edit: further clarification --
Under your current design, Interlocked.Increment is indeed your best bet, but it is far from ideal. Your CPU has a very fast on-chip cache (often clocked at the same speed as the CPU). If your state is local to your thread, it will be pulled into the cache of the CPU your thread is running on, which means your CPU won't have to go to main memory and can run at full speed.
If you use interlocked increment, your CPU will have to:
- lock the bus,
- increment the 32-bit word,
- release the bus.
You can do without this. I might appear to be pedantic, as the overhead might only amount to roughly doubling the cost of each increment. However, in an industrial server app with 4 physical CPUs and 16 cores, with this UID generator firing on every request... trust me, your bus will be hammered. Micro-optimization is an important area of programming, especially now that we are scaling horizontally.
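To make the contrast concrete, here is a rough sketch (again Java, names mine) comparing the two hot paths: an atomic increment, which triggers the lock/increment/release sequence in hardware, versus a plain increment of thread-local state. The timings vary wildly by hardware and JIT behavior, so treat this as the shape of an experiment rather than a benchmark:

```java
import java.util.concurrent.atomic.AtomicLong;

// Rough sketch, not a rigorous benchmark: compares n atomic (interlocked)
// increments against n plain increments of local state.
public class IncrementCost {
    public static long[] run(int n) {
        AtomicLong shared = new AtomicLong(); // each increment is a locked RMW
        long plain = 0;                       // thread-local: an ordinary add

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) shared.incrementAndGet();
        long atomicNs = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) plain++;
        long plainNs = System.nanoTime() - t1;

        // Both paths produce the same count; only the cost differs.
        return new long[] { shared.get(), plain, atomicNs, plainNs };
    }
}
```

Note this is single-threaded, so it only captures the uncontended cost of the locked instruction; under real contention across cores, the cache-line ping-ponging makes the gap much worse.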