Along with producing incorrect results, one of the worst fears in scientific programming is not being able to reproduce the results you've generated. What best practices help ensure your analysis is reproducible?

+1  A: 

Post code, data, and results on the Internet. Write the URL in the paper.

Also, submit your code to "contests". For example, in music information retrieval, there is MIREX.

Steve
+3  A: 

Publish the program code and make it available for review.

This is not directed at you by any means, but here is my rant:

If you do work sponsored by taxpayer money and you publish the results in a peer-reviewed journal, provide the source code, under an open source license or in the public domain. I am tired of reading about some great algorithm somebody came up with, which they claim does x, but with no way to verify or check the source code. If I cannot see the code, I cannot verify your results, because implementations of the same algorithm can differ drastically.

It is not moral, in my opinion, to keep work paid for by taxpayers out of reach of fellow researchers. It's against the spirit of science to push out papers yet provide no tangible benefit to the public in terms of usable work.

aaa
+9  A: 
Michael Aaron Safyan
I really wish all researchers shared your philosophy
aaa
Randomization: you should provide a flag for the random seed, so you can choose whether or not you want to replicate the exact results.
wisty
@wisty: and should the flag used be stored as well?
Andrew Grimm
Haha, of course. Or you can have a default seed in the code, and only use other seeds for exploration/testing. It depends how rigorous you want to be. Note that in Python, both numpy.random and random need to be seeded separately.
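
Something along those lines, for example (the seed value is arbitrary, and this assumes NumPy's global random state):

    import random
    import numpy as np

    SEED = 12345          # record this value alongside your results
    random.seed(SEED)     # seeds the standard-library generator
    np.random.seed(SEED)  # seeds NumPy's global generator (separate state)

    # With both seeded, repeated runs produce identical random sequences.
    print(random.random())
    print(np.random.random())
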
wisty
+2  A: 

I think a lot of the previous answers missed the "scientific computing" part of your question, and answered with very general advice that applies to any science (make the data and method public), merely specialized to CS.

What they're missing is that you have to be even more specialized: you have to specify which version of the compiler you used, which switches were used when compiling, which version of the operating system you ran, which versions of all the libraries you linked against, what hardware you were using, what else was being run on your machine at the same time, and so forth. There are published papers out there where every one of these factors influenced the results in a non-trivial way.
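
As a rough sketch of what capturing that environment could look like for a Python-based analysis (this assumes NumPy is installed and gcc is on the PATH; the output file name is arbitrary):

    import json, platform, shutil, subprocess, sys
    import numpy

    env = {
        "python": sys.version,
        "os": platform.platform(),
        "machine": platform.machine(),
        "numpy": numpy.__version__,
    }
    if shutil.which("gcc"):
        # the first line of "gcc --version" identifies the compiler release
        env["gcc"] = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True
        ).stdout.splitlines()[0]

    with open("run_environment.json", "w") as fh:
        json.dump(env, fh, indent=2)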

For example (on Intel hardware), you could be using a library which uses the FPU's 80-bit floats, do an O/S upgrade, and that library might now only use 64-bit doubles, and your results can change drastically if your problem was the least bit ill-conditioned.

A compiler upgrade might change the default rounding behaviour, or a single optimization might flip the order in which two instructions are executed, and again, for ill-conditioned systems: boom, different results.
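
To make the point concrete: floating-point addition is not associative, so merely reordering operations changes the result, and an ill-conditioned problem amplifies such differences. A tiny Python illustration (not tied to any particular compiler or library):

    a = (0.1 + 0.2) + 0.3   # one evaluation order
    b = 0.1 + (0.2 + 0.3)   # the same sum, reassociated
    print(a == b)           # False
    print(a, b)             # 0.6000000000000001 0.6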

Heck, there are some funky stories of sub-optimal algorithms showing up as 'best' in practical tests because they were tested on a laptop which automatically slowed down the CPU when it overheated (which is exactly what the optimal algorithm made it do).

None of these things are visible from the source code or the data.

Jacques Carette
However, such things can only be verified/checked if the source code and data are available.
aaa
Good points, but +1 especially for how "optimal" algorithms can overheat the CPU and run slower.
DarenW
A: 

Record configuration parameters somehow (e.g. the fact that you set a certain variable to a certain value). These may be recorded in the data output, or in version control.

If you're changing your program all the time (I am!), make sure you record what version of your program you're using.
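
For example, a small sketch of stamping an output file with the revision and parameters that produced it (this assumes the code lives in a Git repository with git on the PATH; the file name and parameters are made up):

    import subprocess

    params = {"threshold": 0.05, "iterations": 1000}   # illustrative parameters
    revision = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"], capture_output=True, text=True
    ).stdout.strip()

    with open("results.csv", "w") as fh:
        fh.write(f"# program revision: {revision}\n")
        fh.write(f"# parameters: {params}\n")
        fh.write("x,y\n")   # the actual results would follow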

Andrew Grimm
+4  A: 
dmckee
Can you expand a little on "all the surprises"? Are you talking about the program needing to do things differently than you initially anticipated, and why it has to do those things? And yes, a README for yourself can be useful!
Andrew Grimm
"Surprises" means 1) anything that is contrary to the usual practice in your discipline; 2) anything that you had to re-implement because the "obvious" way didn't work for some fundamental (as opposed to language-related) reason; 3) any "gotchas" in setting up and running the code; and 4) anything else about the analysis that will have to be explained in the eventual paper.
dmckee
+3  A: 

I'm a software engineer embedded in a team of research geophysicists and we're currently (as always) working to improve our ability to reproduce results upon demand. Here are a few pointers gleaned from our experience:

  1. Put everything under version control: source code, input data sets, makefiles, etc
  2. When building executables, we embed the compiler directives in the executables themselves, tag the build log with a UUID and tag the executable with the same UUID, compute checksums for the executables, autobuild everything, and auto-update a database (OK, it's just a flat file really) with the build details, etc.
  3. We use Subversion's keywords to include revision numbers (etc) in every piece of source, and these are written into any output files generated.
  4. We do lots of (semi-)automated regression testing to ensure that new versions of the code, or new build variants, produce the same (or similar enough) results, and I'm working on a bunch of programs to quantify the changes which do occur (see the sketch after this list).
  5. My geophysicist colleagues analyse the programs' sensitivities to changes in inputs. I analyse their (the codes', not the geos') sensitivity to compiler settings, platform, and the like.
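
As a rough idea of what point 4 can look like (illustrative only, not our actual harness; the file names are made up and NumPy is assumed), the comparison accepts agreement to a tolerance rather than to the last digit:

    import numpy as np

    def outputs_match(reference_file, candidate_file, rtol=1e-3):
        """True if the candidate agrees with the reference to roughly 3 s.f."""
        reference = np.loadtxt(reference_file)
        candidate = np.loadtxt(candidate_file)
        return reference.shape == candidate.shape and np.allclose(
            candidate, reference, rtol=rtol
        )

    print(outputs_match("baseline_run.txt", "new_run.txt"))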

We're currently working on a workflow system which will record the details of every job run: input datasets (including versions), output datasets, program (including version and variant) used, parameters, etc. -- what is commonly called provenance. Once this is up and running, the only way to publish results will be through the workflow system. Any output datasets will contain details of their own provenance, though we haven't done the detailed design of this yet.
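
A provenance record for a single job might end up looking something like this (the field names and values are purely illustrative; as noted, the detailed design isn't done yet):

    import datetime, json, uuid

    job_record = {
        "job_id": str(uuid.uuid4()),
        "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "program": {"name": "migrate_seismic", "version": "4.2.1", "variant": "openmp"},
        "inputs": [{"dataset": "survey_2009_raw", "version": 7}],
        "parameters": {"grid_spacing_m": 25, "max_iterations": 500},
        "outputs": ["survey_2009_migrated"],
    }
    with open("job_provenance.json", "w") as fh:
        json.dump(job_record, fh, indent=2)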

We're quite (perhaps too) relaxed about reproducing numerical results to the least-significant digit. The science underlying our work, and the errors inherent in the measurements of our fundamental datasets, do not support the validity of any of our numerical results beyond 2 or 3 s.f.

We certainly won't be publishing either code or data for peer review; we're in the oil business.

High Performance Mark