Along with producing incorrect results, one of the worst fears in scientific programming is not being able to reproduce the results you've generated. What best practices help ensure your analysis is reproducible?

+1  A: 

Post code, data, and results on the Internet. Write the URL in the paper.

Also, submit your code to "contests". For example, in music information retrieval, there is MIREX.

Steve
+3  A: 

Publish the program code and make it available for review.

This is not directed at you by any means, but here is my rant:

If you do work sponsored by taxpayer money and you publish the results in a peer-reviewed journal, provide the source code, under an open source license or in the public domain. I am tired of reading about some great algorithm somebody came up with, which they claim does x, but with no way to verify or check the source code. If I cannot see the code, I cannot verify your results, because implementations of the same algorithm can differ drastically.

It is not moral, in my opinion, to keep work paid for by taxpayers out of reach of fellow researchers. It's against the spirit of science to push out papers yet provide no tangible benefit to the public in terms of usable work.

aaa
+9  A: 
Michael Aaron Safyan
I really wish all researchers shared your philosophy
aaa
Randomization: you should provide a flag for the random seed, so you can choose whether or not you want to replicate the exact results.
wisty
@wisty: and should the flag used be stored as well?
Andrew Grimm
Haha, of course. Or you can have a default seed in the code, and only use other seeds for exploration/testing. It depends how rigorous you want to be. Note that in Python, both numpy.random and random need to be seeded separately.
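
Something along those lines, for example (the seed value is arbitrary, and this assumes NumPy's global random state):

    import random
    import numpy as np

    SEED = 12345          # record this value alongside your results
    random.seed(SEED)     # seeds the standard-library generator
    np.random.seed(SEED)  # seeds NumPy's global generator (separate state)

    # With both seeded, repeated runs produce identical random sequences.
    print(random.random())
    print(np.random.random())
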
wisty
+2  A: 

I think a lot of the previous answers missed the "scientific computing" part of your question, and answered with very general advice that applies to any science (make the data and method public), merely specialized to CS.

What they're missing is that you have to be even more specialized: you have to specify which version of the compiler you used, which switches were used when compiling, which version of the operating system you ran, which versions of all the libraries you linked against, what hardware you were using, what else was being run on your machine at the same time, and so forth. There are published papers out there where every one of these factors influenced the results in a non-trivial way.
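
As a rough sketch of what capturing that environment could look like for a Python-based analysis (this assumes NumPy is installed and gcc is on the PATH; the output file name is arbitrary):

    import json, platform, shutil, subprocess, sys
    import numpy

    env = {
        "python": sys.version,
        "os": platform.platform(),
        "machine": platform.machine(),
        "numpy": numpy.__version__,
    }
    if shutil.which("gcc"):
        # the first line of "gcc --version" identifies the compiler release
        env["gcc"] = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True
        ).stdout.splitlines()[0]

    with open("run_environment.json", "w") as fh:
        json.dump(env, fh, indent=2)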

For example (on Intel hardware), you could be using a library which uses the FPU's 80-bit floats, do an O/S upgrade, and that library might now only use 64-bit doubles, and your results can change drastically if your problem was the least bit ill-conditioned.

A compiler upgrade might change the default rounding behaviour, or a single optimization might flip the order in which two instructions are executed, and again, for ill-conditioned systems: boom, different results.
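
To make the point concrete: floating-point addition is not associative, so merely reordering operations changes the result, and an ill-conditioned problem amplifies such differences. A tiny Python illustration (not tied to any particular compiler or library):

    a = (0.1 + 0.2) + 0.3   # one evaluation order
    b = 0.1 + (0.2 + 0.3)   # the same sum, reassociated
    print(a == b)           # False
    print(a, b)             # 0.6000000000000001 0.6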

Heck, there are some funky stories of sub-optimal algorithms showing up as 'best' in practical tests because they were tested on a laptop which automatically slowed down the CPU when it overheated (which is exactly what the optimal algorithm made it do).

None of these things are visible from the source code or the data.

Jacques Carette
However, such things can only be verified/checked if the source code and data are available.
aaa
Good points, but +1 especially for how "optimal" algorithms can overheat the CPU and run slower.
DarenW
A: 

Record configuration parameters somehow (e.g. the fact that you set a certain variable to a certain value). These may be recorded in the data output, or in version control.

If you're changing your program all the time (I am!), make sure you record what version of your program you're using.
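
For example, a small sketch of stamping an output file with the revision and parameters that produced it (this assumes the code lives in a Git repository with git on the PATH; the file name and parameters are made up):

    import subprocess

    params = {"threshold": 0.05, "iterations": 1000}   # illustrative parameters
    revision = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"], capture_output=True, text=True
    ).stdout.strip()

    with open("results.csv", "w") as fh:
        fh.write(f"# program revision: {revision}\n")
        fh.write(f"# parameters: {params}\n")
        fh.write("x,y\n")   # the actual results would follow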

Andrew Grimm
+4  A: 
dmckee
Can you expand a little on "all the surprises"? Are you talking about the program needing to do things differently than you initially anticipated, and why it has to do those things? And yes, a README for yourself can be useful!
Andrew Grimm
"Surprises" means 1) anything that is contrary to the usual practice in your discipline; 2) anything that you had to re-implement because the "obvious" way didn't work for some fundamental (as opposed to language-related) reason; 3) any "gotchas" in setting up and running the code; and 4) anything else about the analysis that will have to be explained in the eventual paper.
dmckee
+3  A: 

I'm a software engineer embedded in a team of research geophysicists and we're currently (as always) working to improve our ability to reproduce results upon demand. Here are a few pointers gleaned from our experience:

  1. Put everything under version control: source code, input data sets, makefiles, etc
  2. When building executables, we embed the compiler directives in the executables themselves, tag the build log with a UUID and tag the executable with the same UUID, compute checksums for the executables, autobuild everything, and auto-update a database (OK, it's just a flat file really) with the build details, etc.
  3. We use Subversion's keywords to include revision numbers (etc) in every piece of source, and these are written into any output files generated.
  4. We do lots of (semi-)automated regression testing to ensure that new versions of the code, or new build variants, produce the same (or similar enough) results, and I'm working on a bunch of programs to quantify the changes which do occur (see the sketch after this list).
  5. My geophysicist colleagues analyse the programs' sensitivities to changes in inputs. I analyse their (the codes', not the geos') sensitivity to compiler settings, platform, and the like.
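
As a rough idea of what point 4 can look like (illustrative only, not our actual harness; the file names are made up and NumPy is assumed), the comparison accepts agreement to a tolerance rather than to the last digit:

    import numpy as np

    def outputs_match(reference_file, candidate_file, rtol=1e-3):
        """True if the candidate agrees with the reference to roughly 3 s.f."""
        reference = np.loadtxt(reference_file)
        candidate = np.loadtxt(candidate_file)
        return reference.shape == candidate.shape and np.allclose(
            candidate, reference, rtol=rtol
        )

    print(outputs_match("baseline_run.txt", "new_run.txt"))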

We're currently working on a workflow system which will record the details of every job run: input datasets (including versions), output datasets, program (including version and variant) used, parameters, etc. -- what is commonly called provenance. Once this is up and running, the only way to publish results will be through the workflow system. Any output datasets will contain details of their own provenance, though we haven't done the detailed design of this yet.
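
A provenance record for a single job might end up looking something like this (the field names and values are purely illustrative; as noted, the detailed design isn't done yet):

    import datetime, json, uuid

    job_record = {
        "job_id": str(uuid.uuid4()),
        "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "program": {"name": "migrate_seismic", "version": "4.2.1", "variant": "openmp"},
        "inputs": [{"dataset": "survey_2009_raw", "version": 7}],
        "parameters": {"grid_spacing_m": 25, "max_iterations": 500},
        "outputs": ["survey_2009_migrated"],
    }
    with open("job_provenance.json", "w") as fh:
        json.dump(job_record, fh, indent=2)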

We're quite (perhaps too) relaxed about reproducing numerical results to the least-significant digit. The science underlying our work, and the errors inherent in the measurements of our fundamental datasets, do not support the validity of any of our numerical results beyond 2 or 3 s.f.

We certainly won't be publishing either code or data for peer review; we're in the oil business.

High Performance Mark