Hello all,
I remember coming across R users writing that they use "Revision control" (e.g: "Source control"), and I am curious to know: How do you combine "Revision control" with your statistical analysis WorkFlow?
Two (very) interesting discussions talk about how to deal with the WorkFlow. But neither of them refer to the revision control element:
- http://stackoverflow.com/questions/1266279/how-to-organize-large-r-programs
- http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing
A Long Update To The Question: Following some of the people's answers, and Dirk's question in the comment, I would like to direct my question a bit more.
After reading the Wiki article about "revision control" (which I was previously not familiar with), it was clear to me that when using revision control, what one does is to build a development structure of his code. This structure either leads to a "final product" or to several branches.
When building something like, let's say, a website. There is usually one end product you work towards (the website), with some prototypes along the way.
But when doing a statistical analysis, the work (to my view) is different. Sometimes you know where you want to get to. But more often, you explore. Explore cleaning the dataset. Explore different methods for statistical analysis, and ask various questions of your data (and I am writing this, knowing how Frank Harrell, and other experience statisticians feels about Data dredging).
That is way the WorkFlow question with statistical programming is (in my view) a serious and deep question, raising many issues, The simpler ones are technical:
- Which revision control software do you use (and why) ?
- Which IDE do you use(and why) ? The more interesting question are about work process:
- How do you structure your files?
- What do you keep as a separate file and what as a revision? or asking in a different way - What should be a "branch" and what should be a "sub project" in your code? For example: When starting to explore your data, should a plot be creating and then erased because it didn't lead any where (but kept as a revision) or should there be a backup file of that path?
How you solve this tension was my initial curiosity. The second question is "what might I be missing?". What rules (of thumb) should one follow so to avoid common pitfalls doing statistical programming with version control?
In my intuition, I feel that statistical programming is inherently different then software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's way I am unsure which of the lessons I have read here about version control would be applicable.
Thanks a lot, Tal