views: 153

answers: 4
How credible are the benchmarks carried out within a virtual machine, as opposed to real hardware?

Let's dissect a specific situation. Assume we want to benchmark the performance impact of a recent code change. Assume for simplicity that the workload is fully CPU bound (though IO bound and mixed workloads are also of interest). Assume that the machine is running under VirtualBox because it's the best one ;)

Assume that we measured the original code and the new code, and the new code was 5% faster (when benchmarked in virtual machine). Can we safely claim that it will be at least 5% faster on real hardware too?

And, even more importantly, assume that the new code is 3% slower. Can we be completely sure that on real hardware it will be at most 3% slower, and definitely not worse than that?

UPDATE: what I'm most interested in is your battlefield results. I.e., have you witnessed a case where code that was 10% slower in a VM performed 5% faster on real iron, or vice versa? Or was it always consistent (i.e. if it's faster/slower in the VM, it's always proportionally faster/slower on the real machine)? Mine have been more or less consistent so far; at the very least, always going in the same direction.
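For concreteness, my measurement loop looks roughly like the sketch below (the workload function and run count are placeholders, not my actual code): time the same CPU-bound routine with a monotonic wall clock a few times, keep the best run, and do that once for a binary built from the old code and once for the new one.

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for the CPU-bound routine under test; the real workload
 * differs, this one just burns cycles deterministically. */
static volatile unsigned long sink;
static void workload(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 100000000UL; i++)
        acc += i * i;
    sink = acc;  /* keep the loop from being optimized away */
}

static double elapsed_sec(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    const int runs = 5;
    double best = 1e9;

    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        workload();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = elapsed_sec(t0, t1);
        if (s < best)
            best = s;
        printf("run %d: %.3f s\n", r, s);
    }
    printf("best of %d runs: %.3f s\n", runs, best);
    return 0;
}
```

The 5%/3% figures above would then just be the ratio of the two best-of-N times, both measured inside the same VM.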

+1  A: 

If you are comparing results on a VM to results not run on a VM, then no, the results are not credible.

On the other hand, if both tests were run in the same environment, then yes, the results are credible. Both tests will be slower inside a VM, but the difference should still be credible.

Jim Anderson
Please elaborate why so? Results obtained on a VM that fully emulates the CPU (including rdtsc) would be perfectly credible. So it's at least not theoretically impossible.
Emulating instruction "x" might take 5x longer than "y" in a VM, when in hardware "x" is actually faster than "y"...
Brian Knoblauch
In the theoretical case that would not matter, because the whole CPU state would be emulated - and the tick count as measured within the *virtual* CPU would be correct. In practice, modern VMs do not emulate most instructions; they run them.
You have the overhead of the VM itself to account for. For the tests to be valid, you must eliminate environmental performance differences.
Jim Anderson
@shodan. I guess we have to decide if we're benchmarking based on ticks or time elapsed. :-)
Brian Knoblauch
@Brian, I'm talking about time measured within (!) the VM itself too. I.e. this program took 100 sec as measured within the VM; now it takes 97 sec of virtual time (even though maybe it takes 200 sec of real time). Can we consider this (minor) change "good" for perf w/o testing on real iron? ;)
My VMs always measure time the same, they just don't necessarily get the same amount of work done in that time.
Brian Knoblauch
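To make the ticks-versus-elapsed-time distinction from the comments concrete, here is a minimal sketch (x86 with GCC/Clang and its __rdtsc intrinsic assumed; the busy loop is just a stand-in for real work). It times the same work with the OS monotonic clock and with the raw TSC; on bare metal the ratio of the two is essentially the constant TSC frequency, while under a hypervisor it can vary depending on how the TSC is virtualized and whether the vCPU gets descheduled mid-run.

```c
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 assumed */

static volatile unsigned long sink;

int main(void)
{
    struct timespec t0, t1;
    unsigned long long c0, c1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    c0 = __rdtsc();

    unsigned long acc = 0;
    for (unsigned long i = 0; i < 50000000UL; i++)   /* CPU-bound busy work */
        acc += i ^ (i << 1);
    sink = acc;   /* keep the loop from being optimized away */

    c1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("wall clock : %.3f s\n", wall);
    printf("TSC ticks  : %llu\n", c1 - c0);
    /* On bare metal, ticks/wall is roughly the (constant) TSC frequency;
     * under a hypervisor the ratio can shift if the TSC is virtualized or
     * the vCPU loses the physical CPU mid-run. */
    printf("ticks/sec  : %.0f\n", (double)(c1 - c0) / wall);
    return 0;
}
```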
+1  A: 

All things considered, using Fair Witness principles, all you can assert is how well the application performs in a VM, because that is what you are actually measuring.

Now, if you wish to try to extrapolate from what you observe based on the environment, then, assuming you're running a native VM (as opposed to an emulated one, e.g. PPC on x86), a CPU-bound task is a CPU-bound task even in a VM, because the CPU is doing most of the heavy lifting.

Arguably there may be some memory management issues involved that distinguish a VM from a native application, but once the memory is properly mapped, I can't imagine there would be dramatic differences in CPU-bound run times between a VM and a native machine.

So, I think it is fair to intuit that a performance change observed between one build of the application and another when run in a VM would translate into a similar performance change when run on a native machine, particularly for a CPU-heavy application.

However, I don't think you can fairly say that you "know" unless you actually test it yourself in the correct environment.

Will Hartung
It's understood there can be no guarantees. However, if in 95% of cases the performance difference within the VM is proportional to that on real hardware, plus or minus 10%, and never ever goes in a different direction, then that's good enough for me. That's what I'm trying to find out.
@shodan: I think the point is that it's not necessarily proportional. There are so many variables in play that it's like comparing apples to Volkswagens. They may both be red, but that's about it, and even then not all apples and not all Volkswagens are red.
Chris Lively
@Chris, I do know the theory ;) but the whole question is about practice - i.e. have you witnessed an actual situation where it was a) *severely* disproportional, or maybe even b) where the performance changes went in different directions? I tried to reflect that in the Update.
A: 

The ONLY way to get credible performance results between a testing and production environment is to run IDENTICAL hardware and software. Right down to hardware version and software patch levels.

Otherwise you are pretty much wasting your time.

As an example, some memory sticks perform better than others which could easily account for a 5% throughput difference on otherwise identical boxes.

With regard to software, the VM software will ALWAYS have an impact; and certain operations may be impacted more than others, depending on so many different factors that there is no possible way to compare them.

Chris Lively
> Otherwise you are pretty much wasting your time. Not really. A 2x speedup achieved within a VM *does* confirm there will be a speedup on real iron. Of course we would not know exactly how much it will be (2.0x too, or 2.1x, or 1.8x), but that's out of the scope of my question.
@shodan: you said, "Can we safely claim that it will be at least 5% faster on real hardware too?" My answer was: no, you can't, because you don't know for sure that 1) it really will be faster or 2) by how much, until you test on the real hardware.
Chris Lively
+1  A: 

I don't think there is anything that special about a VM for this. Even on a 'real' machine, you are still running with virtual memory and sharing the CPU(s) with other processes, so similar considerations apply.

frankodwyer
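One cheap way to see how much that sharing affects a given run, whether on real hardware or in a VM, is to compare the CPU time the process actually received with the wall-clock time of the run; a minimal sketch (POSIX getrusage assumed, with another placeholder busy loop):

```c
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

static volatile unsigned long sink;

int main(void)
{
    struct timespec t0, t1;
    struct rusage ru;

    clock_gettime(CLOCK_MONOTONIC, &t0);

    unsigned long acc = 0;
    for (unsigned long i = 0; i < 200000000UL; i++)  /* CPU-bound busy work */
        acc += i * 31 + 7;
    sink = acc;  /* keep the loop from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    getrusage(RUSAGE_SELF, &ru);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;

    printf("wall clock : %.3f s\n", wall);
    printf("user CPU   : %.3f s\n", user);
    /* If user CPU is much lower than wall clock, other processes (or the
     * hypervisor) had the CPU for part of the run -- noise that affects a
     * benchmark on real hardware just as it does in a VM. */
    return 0;
}
```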