views: 140
answers: 3

After almost two years of using DVCS, it seems that one inherent "flaw" is accidental data loss: I have lost code that wasn't pushed, and I know other people who have as well.

I can see a few reasons for this: off-site data duplication (i.e., "commits have to go to a remote host") is not built in, the repository lives in the same directory as the code, and the notion of "hack 'til you've got something to release" is prevalent... But that's beside the point.

I'm curious to know: have you experienced DVCS-related data loss? Or have you been using DVCS without trouble? And, related, apart from "remember to push often", is there anything which can be done to minimize the risk?

+3  A: 

I have lost data from a DVCS, both by removing the tree along with the repository (not remembering it had important information), and through mistakes in using the DVCS command line (git, in the specific case): an operation that was meant to revert a change I had made actually deleted a number of already-committed revisions from the repository.
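For what it's worth, revisions "deleted" this way in git are usually still recoverable for a while, because the reflog records where HEAD and each branch used to point (entries are kept for a limited time; 90 days by default for reachable ones). A minimal recovery sketch, assuming the lost commits were recently reachable from HEAD:

    # list where HEAD has pointed recently, newest entries first
    git reflog

    # suppose the lost commit shows up as HEAD@{4}: attach a branch
    # to it so it can no longer be garbage-collected
    git branch rescue HEAD@{4}

This is a safety net, not a backup: once git's garbage collection prunes the unreachable objects, the data is gone for good.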

Martin v. Löwis
+2  A: 

I've lost more data from clobbering uncommitted changes in a centralized VCS, and then deciding that I actually wanted them, than from anything I've done with a DVCS. Part of that is that I've been using CVS for almost a decade and git for under a year, so I've had a lot more opportunities to get into trouble with the centralized model, but differences in the properties of the workflow between the two models are also major contributing factors.

Interestingly, most of the reasons for this boil down to "BECAUSE it's easier to discard data, I'm more likely to keep it until I'm sure I don't want it". (The only difference between discarding data and losing it is that you meant to discard it.) The biggest contributing factor is probably a quirk of my workflow habits - my "working copy" when I'm using a DVCS is often several different copies spread out over multiple computers, so corruption or loss in a single repo or even catastrophic data loss on the computer I've been working on is less likely to destroy the only copy of the data. (Being able to do this is a big win of the distributed model over centralized ones - when every commit becomes a permanent part of the repository, the psychological barrier to copying tentative changes around is a lot higher.)

As far as minimizing the risks goes, it's possible to develop habits that minimize them, but you have to develop those habits. Two general principles there:

  • Data doesn't exist until there are multiple copies of it in different places. There are workflow habits that will give you multiple copies for free - for example, if you work in two different places, you'll have a reason to push to a common location at the end of every work session, even if it's not ready to publish.
  • Don't try to do anything clever, stupid, or beyond your comfort zone with the only reference to a commit you might want to keep. Create a temporary tag that you can revert to, or create a temporary branch to do the operations on. (git's reflog lets you recover old references after the fact; I'd be unsurprised if other DVCSs have similar functionality. So manual tagging may not be necessary, but it's often more convenient anyway.) Both habits are sketched below.
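A minimal sketch of both habits in git (the remote name and locations are made up for illustration; the backup repository would need to exist already, e.g. created with `git init --bare`):

    # habit 1: push work-in-progress somewhere off-machine at the end
    # of each session, even if the branch isn't ready to publish
    git remote add backup somehost:repos/project.git
    git push backup my-topic-branch

    # habit 2: leave a temporary tag before attempting anything risky,
    # so there's always a named reference to come back to
    git tag before-rebase
    git rebase -i origin/master
    git reset --hard before-rebase   # if the rebase went wrong
    git tag -d before-rebase         # once you're happy with the result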
A: 

There is an inherent tension between being distributed and making sure everything is "saved" (with the underlying assumption that saved means being backed up somewhere else).

IMO, this is only a real problem if you work on several computers at the same time on the same line of work (or more exactly in several repositories: I often need to share changes between several VMs on the same computer, for example). In this case, a "centralized" workflow would be ideal: you would set up a temporary server and, on some given branches, use a centralized workflow. None of the current DVCSs I know of (git/bzr/hg) support this well. That would be a good feature to have, though.
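None of them automates it, but you can approximate the temporary server by hand. A rough sketch with git, assuming a host reachable from all the machines (names and paths are illustrative):

    # one-off setup: a bare repository on a shared host acts as the hub
    ssh somehost 'git init --bare sync/project.git'

    # on each machine or VM, point at the hub and work "centralized"
    git remote add sync somehost:sync/project.git
    git push sync work    # end of a session on machine A
    git pull sync work    # start of a session on machine B

What's missing compared to a real centralized workflow is that nothing happens implicitly: every commit still needs an explicit push before it exists anywhere else.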

David Cournapeau
Bazaar does have the distinction between "branch" and "checkout", where the latter is a working copy bound to a branch living in another location. On such trees every commit is implicitly a push (just like in a centralized VCS). How much this gets you in avoiding the poster's problem is another story, but you can get the centralized workflow you are talking about.
quark
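Concretely, assuming a master branch already exists on a reachable host (locations made up):

    # create a bound working copy of the remote branch
    bzr checkout bzr+ssh://somehost/srv/bzr/project ~/work/project
    cd ~/work/project
    # ...edit files...
    bzr commit -m "work in progress"   # lands directly in the master branch

A plain `bzr branch` of the same URL would give you the usual distributed behavior instead.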
Actually Mercurial, as of 1.3, has a similar ability with the share extension: http://mercurial.selenic.com/wiki/ShareExtension.
quark
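A minimal sketch of the share extension (it ships with Mercurial but is disabled by default; paths are illustrative):

    # enable it once in ~/.hgrc:
    #   [extensions]
    #   share =

    # create a second working directory backed by the same store
    hg share ~/work/project ~/work/project-copy

Both working directories commit into the same underlying repository, so nothing needs to be pushed between them - but only on the same machine.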
Actually, with git you can use `git-new-workdir` from contrib.
Jakub Narębski
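A quick sketch of that (`git-new-workdir` is a script shipped in git's contrib/workdir directory, so it may not be installed by default; paths are illustrative):

    # create a second working directory that shares the original's
    # object store via symlinks
    git-new-workdir ~/work/project ~/work/project-topic topic

As with hg's share extension, commits made in either working directory are immediately visible in the other, but only on the same machine.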
I am not sure I understand why ShareExtension or git-new-workdir would help: to be able to share between computers, the sharing needs to go over the network (the network is also more practical even on the same machine with VMware, IMHO).
David Cournapeau