I'm trying to understand how a memory barrier works, why it is used, and in what cases it should be used. However, I'm not sure in what cases it would actually be more efficient for instructions to be reordered. Can anyone give me an example of that?
One case where it's useful is floating-point calculations. These generally take longer than 'normal' integer instructions, so it's useful for the CPU to let them run in a separate floating-point unit over several cycles, while other program instructions that don't depend on the result carry on in the main ALU.
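As a rough illustration of the floating-point case, here is a minimal C sketch. The function name `scale_and_count` and its parameters are made up for this example. Whether the divide and the integer loop really overlap depends on the CPU and on what the compiler emits, so treat it as a picture of the idea rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a slow floating-point divide followed by integer
 * work that never reads its result. */
double scale_and_count(double numerator, double denominator,
                       const int *values, size_t count, long *sum_out)
{
    assert(sum_out != NULL);
    assert(values != NULL || count == 0);

    /* Long-latency operation: the result is not needed until the return. */
    double ratio = numerator / denominator;

    /* Independent integer work: none of these additions read `ratio`, so an
     * out-of-order CPU is free to execute them while the divide is still
     * in flight instead of waiting for it to finish. */
    long sum = 0;
    for (size_t i = 0; i < count; ++i) {
        sum += values[i];
    }
    *sum_out = sum;

    return ratio;
}
```

If the loop used `ratio` in every iteration, there would be nothing for the CPU to overlap with the divide, and it would simply have to wait.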
It can also help to keep all the pipelines active. Some CPUs have multiple instruction pipelines (say, one specialised for branches, a couple specialised for arithmetic operations, and a couple for floating-point and SIMD instructions). Reordering the instructions allows the CPU to keep all the pipelines full, rather than leaving one idle for a few instructions, and so speeds up program execution.
Even for a single pipeline, reordering instructions can help keep the pipeline full by separating consecutive dependent instructions, so that an instruction doesn't stall waiting for the result of the one immediately before it - see http://en.wikipedia.org/wiki/Instruction_pipeline
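To make "consecutive dependent instructions" concrete, here is a small C sketch that does by hand what the reordering achieves. The function names `sum_single_chain` and `sum_two_chains` are invented for this example. An optimising compiler may already do something similar on its own, and any speed difference depends on the hardware, so no timing is claimed here.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition needs the result of the previous one,
 * so the additions form a single dependency chain. */
long sum_single_chain(const int *values, size_t count)
{
    assert(values != NULL || count == 0);

    long sum = 0;
    for (size_t i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum;
}

/* Two accumulators: the additions into even_sum and odd_sum do not depend
 * on each other, so neither has to wait for the other's result. */
long sum_two_chains(const int *values, size_t count)
{
    assert(values != NULL || count == 0);

    long even_sum = 0;
    long odd_sum = 0;
    size_t i = 0;
    for (; i + 1 < count; i += 2) {
        even_sum += values[i];
        odd_sum += values[i + 1];
    }
    if (i < count) {
        even_sum += values[i]; /* leftover element when count is odd */
    }
    return even_sum + odd_sum;
}
```

Both functions return the same total. The second one just gives the CPU two independent streams of additions to interleave.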