Unfortunately, volatile semantics are kinda wishy-washy. The concept of volatile was never really meant to be used for threading.
Potatoswatter is correct that calling the OS synchronization primitives will normally prevent the optimizing compiler from hoisting the read of num out of the loop. But it works for sorta the same reason that using an accessor method works... by accident.
The compiler sees you calling a function that isn't immediately available for inlining or analysis, so it has to assume that any variable which could be reached by some other function might be read or altered inside that opaque call. So before making the call, the compiler has to write all those "global" variables back to memory, and after the call it has to re-read them... which is exactly what keeps the read of num inside the loop.
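To make the hoisting problem concrete, here's a hypothetical sketch (spin_broken/spin_works are made-up names, and sleep stands in for any opaque OS call):

#include <unistd.h>

int num = 0;

void spin_broken(void) {
    /* Nothing opaque in the body: the optimizer may read num once into
     * a register, compiling this down to: if (num == 0) for (;;); */
    while (num == 0) { }
}

void spin_works(void) {
    /* sleep() isn't visible for analysis, so the compiler must assume it
     * could modify num, and re-reads it from memory on every iteration. */
    while (num == 0) { sleep(1); }
}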
At Corensic, we added an inline function to jinx.h that does this in a more direct way. Something like the following:
inline void memory_barrier() { asm volatile("nop" ::: "memory"); }
This is rather subtle. The volatile qualifier tells the compiler (gcc) that it can't get rid of this chunk of opaque asm, and the "memory" clobber tells it that the asm may read or write any globally visible piece of memory. Together these stop the compiler from reordering loads and stores across the boundary.
For your example:
memory_barrier();
while (num == 0) {
    memory_barrier();
    ...
}
Now the read of num is stuck in place. And, perhaps more importantly, it's stuck in place relative to other code. So you could have:
while (flag == 0) { memory_barrier(); } // spin
process data[0..N]
And another thread does:
populate data[0..N]
memory_barrier();
flag = 1;
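To see the whole pattern end-to-end, here's a minimal, self-contained sketch using pthreads (the array, N, and the arithmetic are made up for illustration). One caveat: a "memory" clobber is only a compiler barrier, so this relies on strongly ordered hardware like x86; weaker architectures would need a real CPU fence as well.

#include <pthread.h>
#include <stdio.h>

#define N 16

static inline void memory_barrier(void) { asm volatile("nop" ::: "memory"); }

static int data[N];
static int flag = 0;

void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        data[i] = i * i;       /* populate data[0..N] */
    memory_barrier();          /* keep the data stores before the flag store */
    flag = 1;                  /* publish */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (flag == 0) { memory_barrier(); }  /* spin until published */
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += data[i];        /* process data[0..N] */
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Build with gcc -pthread; under those assumptions, once the consumer sees flag == 1 it observes fully populated data.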
PS. If you do this type of thing (essentially creating your own synchronization primitives), the perf wins can be big but the quality risk is high. Jinx is particularly good at finding bugs in these kinds of lock-free structures, so you might want to use it or some other tool to help test this stuff.
PPS. The Linux kernel community has a nice document about this called "volatile considered harmful"; check it out.