At first I did not want to post this answer, because testing on x86 showed no benefit, but testing on SPARC Solaris showed a real performance gain over the "obvious" solution, so maybe it will be useful to someone. I've taken the code from a PDF that accompanies the book Hacker's Delight. Here it goes:
unsigned msec2sec(unsigned n) {
    unsigned q, r, t;
    n = n + 500;                      /* bias by half the divisor so the result rounds to nearest */
    /* shift-and-add approximation of n/1000 */
    t = (n >> 7) + (n >> 8) + (n >> 12);
    q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14);
    q = q >> 9;
    r = n - q*1000;                   /* remainder for the estimated quotient */
    return q + ((r + 24) >> 10);      /* add 1 back if the estimate was one too small */
}
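The way I read the companion PDF, the sum (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14) is approximately n * 0.512, so after the >> 9 (i.e. dividing by 512) q is either n/1000 exactly or one too small. The last line fixes that up: r then lies in 0..1999, and (r + 24) >> 10 is 0 for r <= 999 and 1 for r >= 1000 (since 1000 + 24 = 1024), which adds the missing 1 exactly when it is needed.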
as opposed to:
unsigned msec2sec_obvious(unsigned n) {
    return (n + 500)/1000;
}
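If you want to convince yourself that the two give the same results, a brute-force check like the one below will do. The 4,000,000 bound is arbitrary (roughly an hour's worth of milliseconds), and it assumes both functions above are in the same file:

#include <stdio.h>

/* Compare the shift-and-add version against the obvious one over a range
   of inputs; assumes msec2sec() and msec2sec_obvious() from above are in
   the same translation unit. */
int main(void) {
    unsigned n;
    for (n = 0; n < 4000000u; n++) {
        if (msec2sec(n) != msec2sec_obvious(n)) {
            printf("mismatch at %u: %u vs %u\n", n, msec2sec(n), msec2sec_obvious(n));
            return 1;
        }
    }
    printf("all values agree\n");
    return 0;
}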
On x86 the "obvious" version compiles into adding 500 and then a long multiply by 274877907, followed by grabbing the most significant 32 bits from edx and shifting them 6 bits right - so it beats the code above hands down (roughly a 5x performance difference).
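In C terms, the compiler's transformation looks roughly like the sketch below (my own paraphrase of the generated code, not taken from the book). The constant 274877907 is the "magic" reciprocal, about 2^38/1000, so a multiply followed by a 38-bit right shift reproduces the division:

#include <stdint.h>

/* Sketch of the x86 codegen for n/1000: a 32x32->64 multiply by the magic
   constant, keep the high 32 bits (edx), then shift right by 6 more,
   i.e. a total right shift of 38. */
static unsigned div1000_by_magic(unsigned n) {
    uint64_t prod = (uint64_t)n * 274877907u;  /* 274877907 = ceil(2^38 / 1000) */
    return (unsigned)(prod >> 38);             /* equals n / 1000 for all 32-bit n */
}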
However, on Solaris/SPARC the "obvious" version is compiled into a call to .udiv, and there the roles reverse: the shift-and-add code comes out roughly 2.5x faster.