Use the best counter available on your platform, and fall back to time() for portability.
I am using QueryPerformanceCounter, but see the comments in the other reply.
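For reference, timing an inner loop with QueryPerformanceCounter looks roughly like this (a sketch; DoWork and the function name are placeholders, and it is Windows-only — elsewhere you would substitute clock_gettime() or, worst case, time()):

```
#include <windows.h>
#include <cstdio>

// Time `innerLoops` calls of DoWork() with the high-resolution counter.
void TimeInnerLoop(void (*DoWork)(), int innerLoops)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);      // counter ticks per second

    QueryPerformanceCounter(&start);
    for (int i = 0; i < innerLoops; ++i)
        DoWork();
    QueryPerformanceCounter(&stop);

    double ms = 1000.0 * (stop.QuadPart - start.QuadPart) / freq.QuadPart;
    std::printf("%d iterations: %.3f ms total, %.6f ms each\n",
                innerLoops, ms, ms / innerLoops);
}
```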
General advice:
The inner loop should run for at least about 20 times the resolution of your clock, to keep the resolution error below 5%. (So when using time(), your inner loop should run for at least 20 seconds.)
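One way to apply that rule with a high-resolution counter is to grow the inner-loop count until a single measurement spans at least 20 ticks of the counter's effective resolution. A sketch, assuming the helper names are my own:

```
#include <windows.h>

// Smallest observable step of QueryPerformanceCounter, in ticks.
LONGLONG EffectiveResolutionTicks()
{
    LARGE_INTEGER a, b;
    QueryPerformanceCounter(&a);
    do {
        QueryPerformanceCounter(&b);
    } while (b.QuadPart == a.QuadPart);   // wait for the counter to advance
    return b.QuadPart - a.QuadPart;
}

// Double the iteration count until one measurement is >= 20x the resolution,
// which keeps the quantization error below 5%.
template <typename Fn>
long long CalibrateInnerLoop(Fn workload)
{
    const LONGLONG minTicks = 20 * EffectiveResolutionTicks();
    long long iterations = 1;
    for (;;)
    {
        LARGE_INTEGER start, stop;
        QueryPerformanceCounter(&start);
        for (long long i = 0; i < iterations; ++i)
            workload();
        QueryPerformanceCounter(&stop);

        if (stop.QuadPart - start.QuadPart >= minTicks)
            return iterations;            // long enough to trust the reading
        iterations *= 2;                  // too short -- double and retry
    }
}
```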
Repeat these measurements to see whether they are consistent.
I use an additional outer loop, running ten times and ignoring the fastest and the slowest measurement when calculating average and deviation. Deviation comes in handy when comparing two implementations: if you have one algorithm taking 2.0 +/- 0.5 ms and the other 2.2 +/- 0.5 ms, the difference is not significant enough to call one of them "faster".
(Max and min should still be displayed.) So IMHO a valid performance measurement should look something like this:
10000 x 2.0 +/- 0.2 ms (min = 1.2, max = 12.6), 10 repetitions
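Something along these lines produces that kind of output (a sketch; `Benchmark` and its parameters are made up for illustration, not from any library):

```
#include <windows.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Run the timed inner loop `repetitions` times, drop the fastest and slowest
// run, and report average +/- deviation plus the overall min and max.
template <typename Fn>
void Benchmark(Fn timedRun, int innerLoops, int repetitions = 10)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    std::vector<double> ms;
    for (int r = 0; r < repetitions; ++r)
    {
        LARGE_INTEGER start, stop;
        QueryPerformanceCounter(&start);
        for (int i = 0; i < innerLoops; ++i)
            timedRun();
        QueryPerformanceCounter(&stop);
        ms.push_back(1000.0 * (stop.QuadPart - start.QuadPart)
                            / freq.QuadPart / innerLoops);
    }

    std::sort(ms.begin(), ms.end());
    double minMs = ms.front(), maxMs = ms.back();

    // Average and deviation over the middle runs only (fastest/slowest dropped).
    double sum = 0.0, sumSq = 0.0;
    int n = 0;
    for (size_t i = 1; i + 1 < ms.size(); ++i, ++n)
    {
        sum   += ms[i];
        sumSq += ms[i] * ms[i];
    }
    double mean = sum / n;
    double dev  = std::sqrt(std::max(0.0, sumSq / n - mean * mean));

    std::printf("%d x %.1f +/- %.1f ms (min = %.1f, max = %.1f), %d repetitions\n",
                innerLoops, mean, dev, minMs, maxMs, repetitions);
}
```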
If you know what you are doing, purging the cache and setting thread affinity can make your measurements much more robust.
However, this is not without pitfalls. The more "stable" the measurement is, the less realistic it is as well. Any implementation will vary strongly over time, depending on the state of the data and instruction caches. I'm lazy here, using the max= value to judge the first-run penalty; this might not be sufficient for some scenarios.
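For the thread affinity and cache purging mentioned above, a minimal Windows-only sketch (the function name and the 8 MB scratch size are my own choices — pick something larger than your CPU's last-level cache):

```
#include <windows.h>
#include <vector>

void PrepareForMeasurement()
{
    // Pin the benchmark to one core so it is not migrated mid-measurement.
    SetThreadAffinityMask(GetCurrentThread(), 1);

    // Write through a large buffer to evict earlier data from the caches,
    // so every repetition starts from a comparable state. The volatile
    // pointer keeps the compiler from optimizing the writes away.
    std::vector<char> scratch(8 * 1024 * 1024);
    volatile char* p = scratch.data();
    for (size_t i = 0; i < scratch.size(); i += 64)   // one write per cache line
        p[i] = static_cast<char>(i);
}
```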