Let A represent the original array a1,a2,...,an,b1,b2,...,bn. We swap the second quarter of the array with the third quarter. We then group the first and second quarters into an array A1, and the third and fourth quarters into an array A2, and apply the technique recursively to A1 and then to A2 (a code sketch follows the examples below).
For example:
Assume A={a1,a2,b1,b2}. The four quarters are {a1},{a2},{b1},{b2}. We swap the second quarter with the third quarter and get {a1},{b1},{a2},{b2}. We are done.
Assume A={a1,a2,a3,a4,b1,b2,b3,b4}. We swap the second quarter with the third quarter and get {a1,a2,b1,b2,a3,a4,b3,b4}. Then A1={a1,a2,b1,b2} and A2={a3,a4,b3,b4}, and we apply the algorithm recursively to A1 and A2.
The algorithm takes O(1) extra memory, but how much time does it take? Assume that swapping two individual elements of the array takes O(1) time (in fact it takes 3 operations with an in-place XOR swap and no extra memory; a sketch follows below). Then, with n=|A| the length of A, we need n/4 swaps and are left with two subproblems of half the size, giving T(n)=n/4+2T(n/2). By the master theorem, T(n)=O(n log n). This is not O(n).
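The 3-operation XOR swap mentioned in the parenthesis looks like this in C; a small sketch, with the caveat that it only works when the two pointers are distinct (XOR-ing a location with itself zeroes it):

```c
/* Swap *x and *y in place with three XOR operations and no temporary.
   Precondition: x != y, otherwise the stored value becomes 0. */
static void xor_swap(int *x, int *y) {
    *x ^= *y;   /* x holds x^y */
    *y ^= *x;   /* y holds the original x */
    *x ^= *y;   /* x holds the original y */
}
```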
Can we do better using the exact same technique? Perhaps we can improve the time it takes to swap things around. Swapping two individual elements clearly takes O(1); nothing can be done about that. But we can try to exploit the fact that an array occupies contiguous memory locations. Swapping an entire second quarter with an entire third quarter could then be done in O(1) if we had an operator like XOR(memory_start_address1, memory_start_address2, length_in_bytes), where the result is always placed at memory_start_address1. The recurrence above then becomes T(n)=1+2T(n/2), and by the master theorem T(n)=O(n).
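No such single-instruction block operator is assumed to exist here; to pin down the semantics it would need, here is a software model in C. Implemented as a loop it costs O(length), which is precisely why it does not rescue the recurrence: the scheme needs it to run in O(1).

```c
#include <stddef.h>

/* Software model of the hypothetical XOR(addr1, addr2, len) operator:
   dst ^= src, byte by byte. As a loop this is O(len), not the O(1)
   the recurrence T(n)=1+2T(n/2) requires. */
static void xor_blocks(unsigned char *dst, const unsigned char *src, size_t len) {
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* Swapping two non-overlapping equal-length blocks then takes three calls. */
static void swap_blocks(unsigned char *p, unsigned char *q, size_t len) {
    xor_blocks(p, q, len);   /* p = p ^ q                 */
    xor_blocks(q, p, len);   /* q = q ^ (p^q) = old p     */
    xor_blocks(p, q, len);   /* p = (p^q) ^ old p = old q */
}
```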
The question is: do we have such an operator XOR(memory_start_address1, memory_start_address2, length_in_bytes)?