It sounds like you want to perform a matrix transpose, which is a little different from rotation. In a rotation, the rows become columns, but either the rows or the columns end up in reverse order, depending on the rotation direction. Transposition maintains the original ordering of the rows and columns.
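To make the difference concrete, here is a minimal sketch in C. It assumes row-major matrices stored in flat arrays; the function names `transpose` and `rotate_cw` are made up for illustration. The only difference between the two is the reversed index in the rotation.

```c
#include <assert.h>
#include <stddef.h>

/* in is rows x cols, out is cols x rows, both row-major flat arrays.
   These helpers are illustrative; in and out must not overlap. */
void transpose(const int *in, int *out, size_t rows, size_t cols)
{
    assert(in != out);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}

/* 90 degrees clockwise: like a transpose, but each output row is
   written in reverse order (input row r lands in column rows-1-r). */
void rotate_cw(const int *in, int *out, size_t rows, size_t cols)
{
    assert(in != out);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + (rows - 1 - r)] = in[r * cols + c];
}
```

For the 2x3 input {1,2,3 / 4,5,6}, the transpose is {1,4 / 2,5 / 3,6}, while the clockwise rotation is {4,1 / 5,2 / 6,3}.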
I think using the right algorithm is much more important than whether you use assembly or just C. Rotation by 90 degrees or transposition really boils down to just moving memory. The biggest thing to consider is the effect of cache misses if you use a naive algorithm like this:
    for (int x = 0; x < width; x++)
    {
        for (int y = 0; y < height; y++)
            out[x][y] = in[y][x];
    }
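This will cause a lot of cache misses because the inner loop walks down a column of the input, so consecutive reads are a whole row apart in memory. It is more efficient to use a block-based approach that works on small tiles which fit in the cache. Google for "cache efficient matrix transpose".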
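A minimal sketch of the block-based idea, assuming row-major flat arrays of `float`. The name `transpose_blocked` and the tile size `BLOCK` are my own choices, not something from a particular library, and the best tile size depends on your cache, so treat 16 as a starting point to measure against.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16  /* assumed tile size; tune for the target cache */

/* in is height x width, out is width x height, both row-major.
   in and out must not overlap. */
void transpose_blocked(const float *in, float *out,
                       size_t height, size_t width)
{
    assert(in != out);
    for (size_t by = 0; by < height; by += BLOCK) {
        for (size_t bx = 0; bx < width; bx += BLOCK) {
            /* Clamp the tile at the matrix edges. */
            size_t y_end = by + BLOCK < height ? by + BLOCK : height;
            size_t x_end = bx + BLOCK < width ? bx + BLOCK : width;

            /* Transpose one tile; both the reads and the writes
               stay within a small region of memory. */
            for (size_t y = by; y < y_end; y++)
                for (size_t x = bx; x < x_end; x++)
                    out[x * height + y] = in[y * width + x];
        }
    }
}
```

The result is identical to the naive loop; only the order in which elements are visited changes, so that each tile of the input and output is finished before moving on.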
One place you may be able to make some gains is by using SSE instructions to move more than one piece of data at a time. These are available in assembly and, through compiler intrinsics, in C. Also check out this link. About halfway down there is a section on computing a fast matrix transpose.
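As a rough illustration of the SSE route in C, the sketch below transposes a single 4x4 tile of floats using the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>`. It assumes an x86 target with SSE; the wrapper name `transpose4x4_sse` is hypothetical, and unaligned loads and stores are used so the arrays need no special alignment.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics, x86 only */

/* Transpose one row-major 4x4 block of floats from in to out. */
void transpose4x4_sse(const float *in, float *out)
{
    assert(in && out);

    /* Load the four rows, four floats per register. */
    __m128 r0 = _mm_loadu_ps(in + 0);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);

    /* Shuffle the registers in place so each holds a column. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_storeu_ps(out + 0, r0);
    _mm_storeu_ps(out + 4, r1);
    _mm_storeu_ps(out + 8, r2);
    _mm_storeu_ps(out + 12, r3);
}
```

A larger matrix could be handled by applying this to each 4x4 tile inside a blocked loop, though whether that beats a plain blocked transpose is something you would have to measure.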
edit:
I just saw your comment that you are doing this in assembly for a class, so you can probably disregard most of what I said. I had assumed you were using assembly to squeeze out the best performance.