I do a lot of solo data analysis using a combination of tools: R, Python, PostgreSQL, and whatever else gets the job done. I use version control software (currently Subversion, though I'm playing with Git on the side) to manage all of my scripts, but the data is a perpetual challenge. My scripts tend to run for long periods (hours, occasionally days) and generate datasets, small and large, which I in turn use as input for more scripts.

The challenge I face is how to roll back what I do if I want to check out my scripts from an earlier point in time. Getting the old scripts is easy. Getting the old data would also be easy if I put it under version control, but conventional wisdom says to keep data out of version control because it's so darned big and cumbersome.

My question: how do you manage your processed data alongside the version control system that holds your code?

A: 

Subversion (and perhaps other VCSs, distributed or not) supports symbolic links. The idea is to store the raw data, well organized, on the filesystem, while tracking the relation between 'script' and 'generated data' with symbolic links under version control.

data -> data-1.2.3
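
A minimal sketch of the workflow might look like this (the dataset name, the analysis script, and its flag are hypothetical; ln and the svn commands are standard):

# write each generated dataset into its own immutable directory
Rscript analyze.R --out data-1.2.3    # hypothetical script and flag

# repoint the symbolic link at the new dataset and commit it
ln -sfn data-1.2.3 data
svn add data                          # first time only; svn versions the link itself
svn commit -m "scripts now read dataset 1.2.3"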

All of your scripts retrieve data through the symbolic link, and since the link itself is versioned, every revision of your code is tied to the specific dataset it was run against.
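
This is also what makes rollback work: checking out an earlier revision restores the old scripts and repoints the link in one step. Something like (the revision number and dataset name are hypothetical):

svn update -r 1234
ls -l data    # e.g. data -> data-1.1.0, the dataset those scripts used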

Using this approach, code and calculated datasets are tracked within one tool, without bloating your repository with binary data.

zellus