views:

84

answers:

4

I want to perform a comparison of multiple implementations of essentially the same algorithm, written in Java, C++ and Python, the latter executed using PyPy, Jython and CPython, on a MacBook Pro running Mac OS X 10.6.4 with a normal (non-SSD) HDD.

It's a "decode a stream of data from a file" type of algorithm, where the relevant measurement is total execution time, and I want to prevent bias through e.g. OS and HDD caches, other programs running simultaneously, a sample file that is too large or too small, etc. What do I need to pay attention to in order to create a fair comparison?

A: 

To prevent bias, I would recommend first stopping all unnecessary processes running in the background.

I'm not sure about Windows, but under Linux you can clear the HDD cache via /proc/sys/vm/drop_caches; the kernel documentation describes how to use it.

Additionally, you may want to take an average over several runs of the application, so that any HDD or OS interference won't skew the results.
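A minimal sketch of such a repeated-run measurement in Python (the command line, file name, and run count below are illustrative placeholders, not from the question):

```python
import subprocess
import sys
import time
from statistics import mean

def time_command(cmd, runs=20):
    """Run `cmd` several times and return each run's wall-clock duration."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return times

# Hypothetical invocation -- substitute the real decoder command and input file.
# times = time_command([sys.executable, "decode.py", "sample.bin"])
# print(mean(times))
```

The same harness works for the Java and C++ binaries by passing their command lines instead.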

BOMEz
I'm on Mac OS X 10.6.4 (and I do realize that's probably not the best testing platform, but it's all I have at the moment)
Daniel Beck
+1  A: 

These are difficult to do well.

In many cases the operating system will cache files, so the second time a program is executed it suddenly performs much better.

The other problem is that you're comparing interpreted languages against compiled ones. The interpreted languages require an interpreter loaded into memory somewhere or they can't run. To be scrupulously fair, you really should consider whether memory usage and load time for the interpreter should be part of the test.

If you're looking at performance in an environment where you can assume the interpreter is always preloaded, then you can ignore that; many web server setups keep an interpreter preloaded. If you're running ad hoc client applications on a desktop, then start-up can be very slow while the interpreter is loaded.

Jay
The programs are fairly short (<50 lines), and I've seen Python finish very short scripts in less than 0.01 seconds. Since the files are rather large (400+ MB) and decoded bytewise, interpreter start-up probably won't be much of an issue.
Daniel Beck
+1  A: 

I would recommend that you simply run each program many times (20 or so) and take the lowest measurement of each set. With that many runs it is highly likely that each program will benefit from the HDD cache and similar effects, and if they all do, the comparison isn't biased.

Evan Teran
Wouldn't the *lowest* measurement possibly be an outlier? I'm currently running the fastest program repeatedly until performance bottoms out, to be sure everything is cached as much as possible, then running each implementation 5 times and taking the median.
Daniel Beck
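Daniel's procedure above (untimed warm-up runs to populate the caches, then the median of five timed runs) could be sketched like this; the helper name and default parameters are illustrative:

```python
import subprocess
import time
from statistics import median

def median_time(cmd, warmup=3, runs=5):
    """Warm the OS/HDD caches with throwaway runs, then return the
    median wall-clock time of the measured runs."""
    for _ in range(warmup):
        subprocess.run(cmd, check=True)   # untimed warm-up runs
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return median(times)
```

The median is less sensitive to a single outlier than either the mean or the minimum.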
Sure, it could be an outlier, but I don't think it is likely to be. If you feel that's the case, you could use a clustering algorithm to determine which value the times tend to cluster around.
Evan Teran
A: 

Getting totally unbiased results is impossible. You can do various things like running a minimum of processes, but IMO the best way is to run the scripts in random order over a long period, across different days, and take the average, which will be as close to unbiased as possible.

Ultimately, the code will run in just such an environment, in random order, and you are interested in average behavior, not a few isolated numbers.
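The randomized schedule suggested above could be sketched as follows; the labels and command lines are hypothetical placeholders:

```python
import random
import subprocess
import time
from collections import defaultdict
from statistics import mean

def randomized_comparison(commands, rounds=10):
    """Time each labelled command `rounds` times in a shuffled order,
    so systematic drift (caches, background load) spreads evenly
    across all implementations instead of favoring one."""
    schedule = [label for label in commands for _ in range(rounds)]
    random.shuffle(schedule)
    results = defaultdict(list)
    for label in schedule:
        start = time.perf_counter()
        subprocess.run(commands[label], check=True)
        results[label].append(time.perf_counter() - start)
    return {label: mean(ts) for label, ts in results.items()}

# Hypothetical command table -- substitute the real decoder invocations.
# averages = randomized_comparison({
#     "cpython": ["python", "decode.py", "sample.bin"],
#     "cpp":     ["./decode", "sample.bin"],
#     "java":    ["java", "Decode", "sample.bin"],
# })
```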

Anurag Uniyal
In case the numbers determined as described above (see my comment on Evan's answer) are very close together (<1% difference between fastest and slowest execution), do you think this still applies? I don't want to measure HDD seek times while the machine is busy with something else; I want to determine the impact of the technology on the algorithm.
Daniel Beck
At <1%, I think there is no point in testing further unless you have very special, minimalistic hardware; modern operating systems and computers are, I think, complex enough to defeat your purpose.
Anurag Uniyal
@Anurag Sorry, I meant that multiple executions of the same implementation are very close together; C++ is several times faster than Python, for example. That's what I meant regarding randomness, etc.
Daniel Beck