There are libraries for this, in some cases. And, notably, there are tricks you can play with vectorized data (e.g., four 32-bit elements in a 128-bit vector, but this also applies to four 8-bit bytes in a 32-bit register) to go faster than individual-element accesses.
For a transpose, the standard idea is that you use "shuffle" instructions, which allow you to create a new data vector out of two existing vectors, in any order. You work with 4x4 blocks of the input array. So, starting out, you have:
v0 = 1 2 3 4
v1 = 5 6 7 8
v2 = 9 A B C
v3 = D E F 0
Then, you apply shuffle instructions to the first two vectors (interleaving their odd elements, A0B0 C0D0 -> ACBD, and interleaving their even elements, 0A0B 0C0D -> ACBD), and likewise to the last two, to create a new set of vectors with each 2x2 block transposed:
1 5 3 7
2 6 4 8
9 D B F
A E C 0
Finally, you apply shuffle instructions to the odd pair and the even pair (combining their first pairs of elements, AB00 CD00 -> ABCD, and their last pairs, 00AB 00CD -> ABCD), to get:
1 5 9 D
2 6 A E
3 7 B F
4 8 C 0
And there, 16 elements transposed in eight instructions!
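To make the two stages concrete, here is a minimal sketch in portable C that emulates the shuffles with plain array accesses. The `vec4` type and the four helper names are hypothetical stand-ins for whatever shuffle instructions or intrinsics your target provides; "odd" and "even" follow the 1-based element numbering used above.

```c
#include <stdint.h>

/* Four 32-bit lanes, standing in for one 128-bit SIMD register
   (hypothetical type, for illustration only). */
typedef struct { uint32_t lane[4]; } vec4;

/* Stage-1 shuffles: A0B0 C0D0 -> ACBD and 0A0B 0C0D -> ACBD. */
static vec4 interleave_odd(vec4 a, vec4 b) {
    vec4 r = {{ a.lane[0], b.lane[0], a.lane[2], b.lane[2] }};
    return r;
}
static vec4 interleave_even(vec4 a, vec4 b) {
    vec4 r = {{ a.lane[1], b.lane[1], a.lane[3], b.lane[3] }};
    return r;
}

/* Stage-2 shuffles: AB00 CD00 -> ABCD and 00AB 00CD -> ABCD. */
static vec4 combine_first(vec4 a, vec4 b) {
    vec4 r = {{ a.lane[0], a.lane[1], b.lane[0], b.lane[1] }};
    return r;
}
static vec4 combine_last(vec4 a, vec4 b) {
    vec4 r = {{ a.lane[2], a.lane[3], b.lane[2], b.lane[3] }};
    return r;
}

/* Transpose a 4x4 block in place using eight shuffles. */
static void transpose4x4(vec4 v[4]) {
    /* Stage 1: transpose each 2x2 block. */
    vec4 t0 = interleave_odd(v[0], v[1]);   /* 1 5 3 7 */
    vec4 t1 = interleave_even(v[0], v[1]);  /* 2 6 4 8 */
    vec4 t2 = interleave_odd(v[2], v[3]);   /* 9 D B F */
    vec4 t3 = interleave_even(v[2], v[3]);  /* A E C 0 */

    /* Stage 2: swap the off-diagonal 2x2 blocks. */
    v[0] = combine_first(t0, t2);           /* 1 5 9 D */
    v[1] = combine_first(t1, t3);           /* 2 6 A E */
    v[2] = combine_last(t0, t2);            /* 3 7 B F */
    v[3] = combine_last(t1, t3);            /* 4 8 C 0 */
}
```

On real hardware each helper would ideally map to a single shuffle instruction; the scalar version here only shows which element ends up where.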
Now, for 8-bit bytes in 32-bit registers, ARM doesn't have shuffle instructions as such, but you can synthesize the first set of shuffles with shifts and a SEL (select) instruction, and each shuffle in the second set is a single instruction: PKHBT (pack halfword bottom top) or PKHTB (pack halfword top bottom).
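The same two stages can be sketched for four bytes packed into a `uint32_t`, written in portable C rather than ARM assembly. This is only an illustration: the function names are hypothetical, element 1 is assumed to live in the least significant byte, and the mask-and-shift expressions stand in for what SEL, PKHBT, and PKHTB would do (a compiler may or may not pick those instructions).

```c
#include <stdint.h>

/* Stage 1 (shift + byte select): bytes a0 b0 a2 b2, lowest byte first. */
static uint32_t bytes_interleave_odd(uint32_t a, uint32_t b) {
    return (a & UINT32_C(0x00FF00FF)) | ((b & UINT32_C(0x00FF00FF)) << 8);
}

/* Stage 1 (shift + byte select): bytes a1 b1 a3 b3, lowest byte first. */
static uint32_t bytes_interleave_even(uint32_t a, uint32_t b) {
    return ((a >> 8) & UINT32_C(0x00FF00FF)) | (b & UINT32_C(0xFF00FF00));
}

/* Stage 2: low halfword of a, then low halfword of b.
   Same effect as PKHBT rd, a, b, LSL #16. */
static uint32_t halves_first(uint32_t a, uint32_t b) {
    return (a & UINT32_C(0x0000FFFF)) | (b << 16);
}

/* Stage 2: high halfword of a, then high halfword of b.
   Same effect as PKHTB rd, b, a, ASR #16. */
static uint32_t halves_last(uint32_t a, uint32_t b) {
    return (a >> 16) | (b & UINT32_C(0xFFFF0000));
}

/* Transpose a 4x4 block of bytes held in four 32-bit words, in place. */
static void transpose4x4_bytes(uint32_t r[4]) {
    uint32_t t0 = bytes_interleave_odd(r[0], r[1]);
    uint32_t t1 = bytes_interleave_even(r[0], r[1]);
    uint32_t t2 = bytes_interleave_odd(r[2], r[3]);
    uint32_t t3 = bytes_interleave_even(r[2], r[3]);

    r[0] = halves_first(t0, t2);
    r[1] = halves_first(t1, t3);
    r[2] = halves_last(t0, t2);
    r[3] = halves_last(t1, t3);
}
```

With that byte order, the example block from above is stored as `0x04030201`, `0x08070605`, `0x0C0B0A09`, `0x000F0E0D`.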
Finally, if you're using a larger ARM processor with the NEON vector extension, you can apply the same idea to 16-element vectors (sixteen bytes in a 128-bit register) on 16x16 blocks.