In short, the compiler is allowed to reorder or transform the program as it likes, as long as the observable behavior on a C++ virtual machine does not change. The C++ standard has no concept of threads, and so this fictive VM only runs a single thread. And on such an imaginary machine, we don't have to worry about what other threads see. As long as the changes don't alter the outcome of the current thread, all code transformations are valid, including reordering memory accesses across sequence points.
understanding why volatile does not appear to be a requirement for the mutex protected Data is really part of this question
Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".
But that's all. When the write occurs, a write will be performed, and when the read occurs, a read will be performed. But it doesn't guarantee anything about when this read/write will take place. The compiler may, as it usually does, reorder operations as it sees fit (as long as it doesn't change the observable behavior in the current thread, the one that the imaginary C++ CPU knows about). So volatile doesn't really solve the problem. On the other hand, it offers a guarantee that we don't really need. We don't need every write to the variable to be written out immediately, we just want to ensure that they get written out before crossing this boundary. It's fine if they're cached until then - and likewise, once we've crossed the critical section boundary, subsequent writes can be cached again for all we care -- until we cross the boundary the next time. So volatile offers a too strong guarantee which we don't need, but doesn't offer the one we do need (that reads/writes won't get reordered)
So to implement critical sections, we need to rely on compiler magic. We have to tell it that "ok, forget about the C++ standard for a moment, I don't care what optimizations it would have allowed if you'd followed that strictly. You must NOT reorder any memory accesses across this boundary".
Critical sections are typically implemented via special compiler intrinsics (essentially special functions that are understood by the compiler), which 1) force the compiler to avoid reordering across that intrinsic, and 2) makes it emit the necessary instructions to get the CPU to respect the same boundary (because the CPU reorders instructions too, and without issuing a memory barrier instruction, we'd risk the CPU doing the same reordering that we just prevented the compiler from doing)