Many data analysts that I respect use version control. For example:

However, I'm evaluating whether adopting a version control system such as git would be worthwhile.

A brief overview: I'm a social scientist who uses R to analyse data for research publications. I don't currently produce R packages. My R code for a project typically includes a few thousand lines of code for data input, cleaning, manipulation, analyses, and output generation. Publications are typically written using LaTeX.

With regard to version control, I have read about many benefits, yet they seem less relevant to the solo data analyst.

  • Backup: I have a backup system already in place.
  • Forking and rewinding: I've never felt the need to do this, but I can see how it could be useful (e.g., you are preparing multiple journal articles based on the same dataset; you are preparing a report that is updated monthly, etc)
  • Collaboration: Most of the time I am analysing data myself, thus, I wouldn't get the collaboration benefits of version control.

There are also several potential costs involved with adopting version control:

  • Time to evaluate and learn a version control system
  • A possible increase in complexity over my current file management system

However, I still have the feeling that I'm missing something. General guides on version control seem to be addressed more towards computer scientists than data analysts.

Thus, specifically in relation to data analysts in circumstances similar to those listed above:

  1. Is version control worth the effort?
  2. What are the main pros and cons of adopting version control?
  3. What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?
+5  A: 

I would still recommend version control for a solo act like you because having a safety net to catch mistakes can be a great thing to have.

I've worked as a solo Java developer, and I still use source control. If I'm checking things in continuously I can't lose more than an hour's work if something goes wrong. I can experiment and refactor without worrying, because if it goes awry I can always roll back to my last working version.

If that kind of safety net appeals to you, I'd recommend using source control. It's not hard to learn.

duffymo
+9  A: 

I do economics research using R and LaTeX, and I always put my work under version control. It's like having unlimited undo. Try Bazaar: it's one of the simplest to learn and use, and if you're on Windows it has a graphical user interface (TortoiseBZR).

Yes, there are additional benefits to version control when working with others, but even on solo projects it makes a lot of sense.

Ana Nelson
Thanks for sharing your first hand experience.
Jeromy Anglim
Thanks for sharing about bazaar, very helpful!
Stedy
+3  A: 

Version control for solo development (of any kind) is really useful for:

  • exploring the history and comparing the current work with past commits
  • branching and trying different versions of the same set of files

If you do not see yourself using either of those two basic version control features, a simple backup tool might be all you need.
If you do need those features, then you will get backup as well (with git bundle, for instance).
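
As a rough sketch of the git bundle idea, assuming git is installed: the commands below pack a whole repository, history included, into a single file that can be copied to a backup drive and cloned from later. The directory layout, the file name analysis.R, and the identity values are placeholders made up for the example.

```shell
# Throwaway working area so the example does not touch real projects.
work="$(mktemp -d)"
mkdir "$work/project"
cd "$work/project"

git init -q
# Git needs an identity to commit; these values are placeholders.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

echo 'summary(cars)' > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Pack every branch and tag, with full history, into one file.
git bundle create "$work/project.bundle" --all

# Check that the bundle is valid and complete.
git bundle verify "$work/project.bundle"

# Restoring is an ordinary clone that uses the bundle as its source.
git clone -q "$work/project.bundle" "$work/restored"
ls "$work/restored"
```

The bundle is just a file, so it can go wherever your existing backups already go.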

VonC
+3  A: 

I also do solo scripting work, and I find that it keeps things simpler, rather than makes them more complex. Backup is integrated into the coding workflow and doesn't require a separate set of file system procedures. The time it takes to learn the basics of any version control system would definitely be time well spent.

MW Frost
+3  A: 

You have to use version control software; otherwise your analysis won't be perfectly reproducible.

If you want to publish your results somewhere, you should always be able to reconstruct the state of your scripts at the moment you produced them. Let's say that one of the reviewers discovers an error in one of your scripts: how would you know which results are affected and which are not?

In this sense, a backup system is not sufficient, because it probably runs only once per day and doesn't apply labels to the different backups, so you don't know which versions correspond to which results. And learning a VCS is simpler than you think: if you learn how to add a file and how to commit changes, that is already enough.
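
To give an idea of how little is needed, here is a minimal sketch using git (any VCS has equivalents). It adds a file, commits it, labels the committed state with a tag, and later shows what changed since that label. The script name, tag name, commit messages, and identity values are all invented for the illustration.

```shell
# Throwaway directory standing in for an analysis project.
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, required before the first commit.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Add a file and commit it: the two operations mentioned above.
printf 'x <- rnorm(100)\nprint(mean(x))\n' > analysis.R
git add analysis.R
git commit -q -m "First version of analysis script"

# Label the exact state that produced a given set of results.
git tag -a submitted-v1 -m "Scripts as used for the submitted manuscript"

# A later change, committed separately.
printf 'x <- rnorm(100)\nprint(median(x))\n' > analysis.R
git commit -q -am "Report median instead of mean"

# Show how the script differs from the labelled version.
git diff submitted-v1 HEAD -- analysis.R
```

If a reviewer later finds an error, the tag tells you exactly which version of the script stood behind the submitted numbers.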

dalloliogm
You make a strong argument. However, I think reproducible research is possible without a formal version control system. It's just less elegant and less flexible. I try to write R code using principles of literate programming so that R output is automatically integrated into the final document. The files associated with this final product can then be saved.
Jeromy Anglim
That helps you re-apply the whole analysis to your data, but it doesn't tell you which of your former results were affected by the error.
dalloliogm
+2  A: 

Right now, you probably think of your work as developing code that will do what you want it to do. After you adopt using a revision control system, you'll think of your work as writing down your legacy in the repository, and making brilliant incremental changes to it. It feels way better.

Ken Williams
+4  A: 

Is version control worth the effort?

A big YES.

What are the main pros and cons of adopting version control?

Pros: you can track what you have done before. This is especially useful for LaTeX, as you may need an old paragraph that you deleted! When your computer crashes or you move to a new one, you can get your data back right away.

Cons: you need to do some setup.

What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?

Just start using it. I use TortoiseSVN on Windows as a client tool, and my department has an SVN server where I put all my code and data (yes, you also put your data there!).

Yin Zhu
+10  A: 

I feel the answer to your question is a resounding yes: the benefits of managing your files with a version control system far outweigh the costs of implementing such a system.

I will try to respond in detail to some of the points you raised:

  • Backup: I have a backup system already in place.

Yes, and so do I. However, there are some questions to consider regarding the appropriateness of relying on a general purpose backup system to adequately track important and active files relating to your work. On the performance side:

  • At what interval does your backup system take snapshots?
  • How long does it take to build a snapshot?
  • Does it have to image your entire hard drive when taking a snapshot, or could it be easily told to just back up two files that just received critical updates?
  • Can your backup system show you, with pinpoint accuracy, what changed in your text files from one backup to the next?

And most importantly:

  • How many locations are the backups saved in? Are they in the same physical location as your computer?
  • How easy is it to restore a given version of a single file from your backup system?

For example, I have a Mac and use Time Machine to back up to another hard drive in my computer. Time Machine is great for recovering the odd file or restoring my system if things get messed up. However, it simply doesn't have what it takes to be trusted with my important work:

  • When backing up, Time Machine has to image the whole hard drive which takes a considerable amount of time. If I continue working, there is no guarantee that my file will be captured in the state that it was when I initiated the backup. I also may reach another point I would like to save before the first backup finishes.

  • The hard drive to which my Time Machine backups are saved is located in my machine, which makes my data vulnerable to theft, fire and other disasters.

With a version control system like git, I can initiate a backup of specific files with no more effort than requesting a save in a text editor, and the file is imaged and stored instantaneously. Furthermore, git is distributed, so each computer that I work at has a full copy of the repository.

This amounts to having my work mirrored across four different computers. Nothing short of an act of god could destroy my files and data, at which point I probably wouldn't care too much anyway.

  • Forking and rewinding: I've never felt the need to do this, but I can see how it could be useful (e.g., you are preparing multiple journal articles based on the same dataset; you are preparing a report that is updated monthly, etc)

As a soloist, I don't fork that much either. However, the time I have saved by having the option to rewind has single-handedly paid back my investment in learning a version control system many, many times. You say you have never felt the need to do this, but has rewinding any file under your current backup system really been a painless, feasible option?

Sometimes the report just looked better 45 minutes, an hour or two days ago.
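
A minimal sketch of that kind of rewind with git, assuming a report kept in a single file. The file name report.Rnw, its contents, and the identity values are placeholders for the illustration.

```shell
# Throwaway directory for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, required before the first commit.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

echo "Version from two days ago" > report.Rnw
git add report.Rnw
git commit -q -m "Draft report"

echo "Today's rewrite" > report.Rnw
git commit -q -am "Rewrite report"

# List the recorded versions of this one file.
git log --oneline -- report.Rnw

# Bring back the file as it was one commit earlier; other files are untouched.
git checkout HEAD~1 -- report.Rnw
cat report.Rnw
```

The newer version is not lost by doing this: it is still stored in the last commit, so you can switch back again just as easily.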

  • Collaboration: Most of the time I am analysing data myself, thus, I wouldn't get the collaboration benefits of version control.

Yes, but you would learn a tool that may prove to be indispensable if you do end up collaborating with others on a project.

  • Time to evaluate and learn a version control system

Don't worry too much about this. Version control systems are like programming languages: they have a few key concepts that need to be learned and the rest is just syntactic sugar. Basically, the first version control system you learn will require investing the most time; switching to another one just requires learning how the new system expresses key concepts.

Pick a popular system and go for it!

  • A possible increase in complexity over my current file management system

Do you have one folder, say Projects, that contains all the folders and files related to your data analysis activities? If so, then slapping version control on it is going to increase the complexity of your file system by exactly 0. If your projects are strewn about your computer, then you should centralize them before applying version control, and this will end up decreasing the complexity of managing your files. That's why we have a Documents folder, after all.

  1. Is version control worth the effort?

Yes! It gives you a huge undo button and allows you to easily transfer work from machine to machine without worrying about things like losing your USB drive.

  2. What are the main pros and cons of adopting version control?

The only con I can think of is a slight increase in file size, but modern version control systems can do absolutely amazing things with compression and selective saving, so this is pretty much a moot point.

  3. What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?

Be selective: keep the files that generate data or reports under version control. If you are using something like Sweave, store your .Rnw files and not the .tex files that get produced from them. Store raw data if it would be a pain to re-acquire. If possible, write and store a script that acquires your data and another that cleans or modifies it, rather than storing changes to raw data.
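
One way to enforce this in git is a .gitignore file. The sketch below is only an illustration: the file names paper.Rnw and paper.tex are hypothetical, the .aux and .log patterns are an assumption about typical LaTeX build output, and the identity values are placeholders.

```shell
# Throwaway directory for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, required before the first commit.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Stand-ins for a Sweave source file and the .tex file generated from it.
echo "% Sweave source (placeholder)" > paper.Rnw
echo "% generated from paper.Rnw (placeholder)" > paper.tex

# Ignore the generated file by name, plus common LaTeX build debris.
printf '%s\n' 'paper.tex' '*.aux' '*.log' > .gitignore

git add .gitignore paper.Rnw
git commit -q -m "Track Sweave source, ignore generated output"

# Ignored files are listed with a leading "!!".
git status --short --ignored
```

Ignoring paper.tex by name rather than every *.tex file keeps any hand-written LaTeX in the same project eligible for tracking.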

As for learning a version control system, I highly recommend git and this guide to it:

http://www-cs-students.stanford.edu/~blynn/gitmagic/

These websites also have some nice tips and tricks related to performing specific actions with git:

http://www.gitready.com/

http://progit.org/blog.html

Sharpie
+2  A: 

I'd agree with the sentiments above and say that yes, version control is useful.

Advantages:

  • It keeps your research recorded as well as backed up (tagging)
  • It lets you try different ideas out and go back if they don't work (branching)
  • You can share your work with other people, and they can share their changes to it with you (I know you didn't specify this, but it's great)
  • Most version control systems make it easy to create a compressed bundle of all the files under control at a certain point, for instance at the point you submit an article for publication. This can help when others review your articles. (You can do this manually, but why make up these processes when version control just does it?)
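
For the last point, a minimal sketch with git: tag the state of the project at submission, then export exactly that state as a zip file. The tag name, file names, output location, and identity values are made up for the example.

```shell
# Throwaway repository and a separate directory for the exported archive.
repo="$(mktemp -d)"
out="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, required before the first commit.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

echo 'summary(cars)' > analysis.R
echo '% manuscript (placeholder)' > article.tex
git add analysis.R article.tex
git commit -q -m "Analysis and manuscript as submitted"

# Record the point of submission under a memorable name.
git tag -a article-submitted -m "State of the project at submission"

# Export the tagged state, and nothing else, as a compressed archive.
git archive --format=zip --output="$out/article-submitted.zip" article-submitted
ls -l "$out"
```

Because the archive is built from the tag rather than from the working directory, later edits cannot leak into it.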

In terms of toolsets, I use Git along with StatET and Eclipse, which works well, although you certainly don't have to use Eclipse. There are a few Git plugins for Eclipse, but I generally use the command line options.

PaulHurleyuk
I do use StatET and Eclipse for R; so perhaps I'll try git first.
Jeromy Anglim
+2  A: 

I worked for nine years in an analytics shop, and introduced the idea of version control for our analysis projects to that shop. I'm a big believer in version control, obviously. I would make the following points, however.

  1. Version control may not be appropriate if you are doing analysis for possible use in court. It doesn't sound like this applies to you, but it would have made our clients very nervous to know that every version of every script that we had ever produced was potentially discoverable. We used version control for code modules that were reused in multiple engagements, but did not use version control for engagement-specific code, for that reason.
  2. We found the biggest benefit to version control came from storing canned modules of code that were re-used across multiple projects. For example, you might have a particular favorite way of processing certain Census PUMS extracts. Organize this code into a directory and put it into your VCS. You can then check it out into each new project every time you need it. It may even be useful to create specific branches of certain code for certain projects, if you are doing special processing of a particular common dataset for that project. Then, when you are done with that project, decide how much of your special code to merge back to the main branch.
  3. Don't put processed data into version control. Only code. Our goal was always to have a complete set of scripts so that we could delete all of our internally processed data, push a button, and have every number for the report regenerated from scratch. That's the only way to be sure that you don't have old bugs living on mysteriously in your data.
  4. To make sure that your results are really completely reproducible, it isn't sufficient just to keep your code in a VCS. It is critical to keep careful track of which version of which modules were used to create any particular deliverable.
  5. As for software, I had good luck with Subversion. It is easy to set up and administer. I recognize the appeal of the new-fangled distributed VCSs, like git and mercurial, but I'm not sure there are any strong advantages if you are working by yourself. On the other hand, I don't know of any negatives to using them, either--I just haven't worked with them in an analysis environment.
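
On point 4, one possible way to keep track of which code version produced a deliverable is to write the revision identifier next to the output. The sketch below uses git rather than Subversion purely for illustration; the file name PROVENANCE.txt, the output directory, and the identity values are hypothetical, and running the actual analysis is left out.

```shell
# Throwaway directory for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, required before the first commit.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

echo 'summary(cars)' > analysis.R
git add analysis.R
git commit -q -m "Analysis script for the report"

# Note whether tracked files differ from the last commit.
if git diff --quiet HEAD; then
  state="clean"
else
  state="uncommitted changes"
fi

# Write the commit identifier and state alongside the deliverable.
mkdir -p output
{
  echo "commit: $(git rev-parse HEAD)"
  echo "state: $state"
  echo "generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
} > output/PROVENANCE.txt

cat output/PROVENANCE.txt
```

A deliverable generated while the state is not clean cannot be traced back to a single revision, so it is worth committing before producing anything that leaves the building.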
Dan Menes
thanks for the great first hand advice.
Jeromy Anglim