Registers are the "working storage" in a CPU. They are very fast, but a very limited resource. Typically, a CPU has a small fixed set of named registers, the names being part of the assembler language convention for that CPU's machine code. For example, 32-bit Intel x86 CPUs have four main data registers named eax, ebx, ecx and edx, along with a number of indexing and other more specialised registers.
Strictly speaking, this isn't quite true these days - register renaming, for example, is common, and some processors have enough registers that they are numbered rather than named. It remains, however, a good basic model to work from - register renaming, in fact, is used to preserve the illusion of this basic model despite out-of-order execution.
Manually written assembler tends to follow a simple pattern of register use. A few variables will be kept purely in registers for the duration of a subroutine, or some substantial part of it. Other registers are used in a read-modify-write pattern. For example...
mov eax, [var1]
add eax, [var2]
mov [var1], eax
IIRC, that is valid (though probably inefficient) x86 assembler code. On a Motorola 68000, I might write...
move.l var1, d0
add.l var2, d0
move.l d0, var1
This time, the source is usually the left operand, with the destination on the right. The 68000 had 8 data registers (d0..d7) and 8 address registers (a0..a7), with a7 IIRC also serving as the stack pointer.
On a 6510 (back on the good old Commodore 64) I might write...
lda var1
clc        ; adc adds with carry, so clear it first
adc var2
sta var1
The registers here are mostly implicit in the instructions - the above all use the A (accumulator) register.
Please forgive any silly errors in these examples - I haven't written any significant amount of "real" (rather than virtual) assembler for at least 15 years. The principle is the point, though.
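All three fragments are doing the same job. In C, the whole read-modify-write sequence collapses into a single statement (the function name here is just for illustration), with the compiler deciding which register to use behind the scenes...

int var1, var2;

void add_vars ()
{
    var1 += var2;  // compiler loads var1 into a register, adds var2, stores the result back
}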
Usage of registers is specific to a particular code fragment. What a register holds is basically whatever the last instruction left in it. It's the responsibility of the programmer to keep track of what is in each register at each point in the code.
When calling a subroutine, either the caller or callee must take responsibility for ensuring there's no conflict, which generally means that the registers are saved out to the stack at the start of the call and then read back in at the end. Similar issues occur with interrupts. Things like who is responsible for saving the registers (caller or callee) are typically a part of the documentation of each subroutine.
A compiler will typically decide how to use registers in a much more sophisticated way than a human programmer, but it operates on the same principles. The mapping from registers to particular variables is dynamic, and varies dramatically according to which fragment of code you are looking at. Saving and restoring registers is mostly handled according to standard conventions, though the compiler may improvise "custom calling conventions" in some circumstances.
Typically, local variables in a function are imagined to live on the stack. This is the general rule with "auto" variables in C. Since "auto" is the default, these are normal local variables. For example...
void myfunc ()
{
    int i;  // normal (auto) local variable
    //...
    nested_call ();
    //...
}
In the above code, "i" may well be held primarily in a register. It may even be moved from one register to another and back as the function progresses. However, when "nested_call" is called, the value from that register will almost certainly be on the stack - either because the variable is a stack variable (not a register), or because the register contents are saved to allow nested_call its own working storage.
In a multithreading app, normal local variables are local to a particular thread. Each thread gets its own stack and, while it is running, exclusive use of the CPU's registers. In a context switch, those registers are saved and restored. Whether in registers or on the stack, local variables are not shared between threads.
This basic situation is preserved in a multicore application, even though two or more threads may be genuinely active at the same time. Each thread still has its own stack, and each core has its own registers.
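As a rough sketch of what that means in practice (using POSIX threads here - the worker function and its arguments are purely illustrative), each thread works on its own copy of any local variable...

#include <pthread.h>
#include <stdio.h>

void *worker (void *arg)
{
    int i = *(int *) arg;  // "i" lives on this thread's own stack (or in a register)
    i *= 10;               // changes here are invisible to the other thread
    printf ("this thread sees i = %d\n", i);
    return NULL;
}

int main ()
{
    pthread_t t1, t2;
    int a = 1, b = 2;

    pthread_create (&t1, NULL, worker, &a);
    pthread_create (&t2, NULL, worker, &b);
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);
    return 0;
}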
Data stored in shared memory requires more care. This includes global variables, static variables within both classes and functions, and heap-allocated objects. For example...
void myfunc ()
{
    static int i;  // static variable
    //...
    nested_call ();
    //...
}
In this case, the value of "i" is preserved between function calls. A static region of main memory is reserved to store this value (hence the name "static"). In principle, there is no need for any special action to preserve "i" during the call to "nested_call", and at first sight, the variable can be accessed from any thread running on any core (or even on a separate CPU).
However, the compiler is still working hard to optimise the speed and size of your code. Repeated reads and writes to main memory are much slower than register accesses. The compiler will almost certainly choose not to follow the simple read-modify-write pattern described above, but will instead keep the value in the register for a relatively extended period, avoiding repeated reads and writes to the same memory.
This means that modifications made in one thread may not be seen by another thread for some time. Two threads could end up having very different ideas about the value of "i" above.
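A classic way this bites (the function names below are made up for illustration) is a busy-wait on a shared flag - the compiler may load the flag into a register once and never look at memory again...

static int done = 0;  // shared between threads, not volatile, not protected by a lock

void wait_for_done ()
{
    // The compiler is free to read "done" into a register once and test
    // that register forever - a write by another thread may never be seen.
    while (done == 0)
    {
        // spin
    }
}

void signal_done ()  // called from another thread
{
    done = 1;
}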
There is no magic hardware solution for this. For example, there is no mechanism for synchronising registers between threads. To the CPU, the variable and the register are completely separate entities - it doesn't know that they need to be synchronised. There's certainly no synchronisation between registers in different threads or on different cores - there's no reason to believe that another thread is even using the same register for the same purpose at any particular time.
A partial solution is to flag a variable as "volatile"...
void myfunc ()
{
    volatile static int i;
    //...
    nested_call ();
    //...
}
This tells the compiler not to optimise reads and writes to the variable. The processor doesn't have a concept of volatility. This keyword tells the compiler to generate different code, doing immediate reads and writes to memory as specified by assignments, instead of avoiding those accesses by using a register.
This is not a multithreading synchronisation solution, however - at least not in itself. One appropriate multithreading solution is to use some kind of lock to manage access to this "shared resource". For example...
void myfunc ()
{
    static int i;
    //...
    acquire_lock_on_i ();
    // do stuff with i
    release_lock_on_i ();
    //...
}
There is more going on here than is immediately obvious. In principle, rather than write the value of "i" back to its variable ready for the "release_lock_on_i" call, it could be saved on the stack. As far as the compiler is concerned, this isn't unreasonable. It's doing stack access anyway (e.g. saving the return address), so saving the register on the stack may be more efficient than writing it back to "i" - more cache friendly than accessing a completely separate block of memory.
Unfortunately, though, the release lock function doesn't know that the variable hasn't been written back to memory yet, so it can do nothing to fix it. After all, that function is just a library call (the real lock-release may be hidden in a more deeply nested call), and that library may have been compiled years before your application - it doesn't know how its callers use registers or the stack. That's a big part of why we use a stack in the first place, and why calling conventions have to be standardised (e.g. who saves which registers). The release lock function cannot force its callers to "synchronise" registers back to memory.
Equally, you might relink an old app with a new library - the caller doesn't know what "release_lock_on_i" does or how; it's just a function call. The caller doesn't know that it needs to save registers back out to memory first.
To resolve this, we can bring back the "volatile".
void myfunc ()
{
    volatile static int i;
    //...
    acquire_lock_on_i ();
    // do stuff with i
    release_lock_on_i ();
    //...
}
We may use a normal local variable temporarily while the lock is active, giving the compiler the chance to use a register for that brief period. In principle a lock should be held for as short a time as possible, so there shouldn't be that much code in there anyway. If we do use a temporary, though, we write its value back to "i" before releasing the lock, and the volatility of "i" ensures that it really is written back to main memory rather than left sitting in a register.
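To make that concrete (reusing the hypothetical lock functions from the example above), the pattern looks something like this...

void acquire_lock_on_i ();  // hypothetical lock functions, as above
void release_lock_on_i ();

volatile static int i;

void myfunc ()
{
    //...
    acquire_lock_on_i ();
    int temp = i;          // copy into a normal local - free to live in a register
    temp += 42;            // do stuff with the register copy
    i = temp;              // write back to the volatile "i"...
    release_lock_on_i ();  // ...before the lock is released
    //...
}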
In principle, this isn't enough. Writing to main memory doesn't mean you have written to main memory - there are layers of cache to traverse in between, and your data could sit in any one of those layers for a while. There is a "memory barrier" issue here, and I don't know a great deal about this - but fortunately this issue is the responsibility of thread synchronisation calls such as the lock acquire and release calls above.
This memory barrier issue doesn't remove the need for the "volatile" keyword, however.
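For what it's worth, with a real threading library such as POSIX threads the lock calls themselves take care of the memory barrier side of things (POSIX specifies that pthread_mutex_lock and pthread_mutex_unlock synchronise memory), so a fleshed-out version of the example might look like this - the variable and the increment are, as before, just for illustration...

#include <pthread.h>

static pthread_mutex_t i_lock = PTHREAD_MUTEX_INITIALIZER;
volatile static int i;

void myfunc ()
{
    //...
    pthread_mutex_lock (&i_lock);    // acquire - includes the necessary memory barrier
    i += 1;                          // do stuff with i
    pthread_mutex_unlock (&i_lock);  // release - the update is visible to the next thread that takes the lock
    //...
}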