As far as I know, all distributed revision control systems require you to clone the whole repository. For this reason it is not wise to put huge amounts of content into one single repository (thanks for this answer). I know that this is not a bug but a feature, but I wonder whether this is a requirement for all distributed revision control systems.

In distributed rcs the history of a file (or a chunk of content) is a directed acyclic graph, so why can't you just clone this single DAG instead of the set of all graphs in a repository? Maybe I'm missing something, but the following use cases are hard to do:

  • clone only a part of a repository
  • merge two repositories (preserving their histories!)
  • copy some files with their history from one repository to another

If I reuse parts of other people's code from multiple projects I cannot preserve their full history. At least in git I can think of a (rather complex) workaround:

  1. clone a full repository
  2. delete all content that I am not interested in
  3. rewrite the history to delete everything that is not in the master
  4. merge the remaining repository into an existing repository

I don't know if this is also possible with Mercurial or Bazaar but at least it is not easy at all. So is there any distributed rcs that supports partial checkout/clone by design? It should support one simple command to get a single file with its history from one repository and merge it into another. This way you would not need to think about how to structure your content into repositories and submodules but you could happily split and merge repositories as needed (the extreme would be one repository for each single file).

+1  A: 

There's a subtree module for git, allowing you to split off a portion of a repository into a new repo and then merge changes to/from the original and the subtree. Here's its readme on github: http://github.com/apenwarr/git-subtree/blob/master/git-subtree.txt

kwatford
+3  A: 

In distributed rcs the history of a file (or a chunk of content) is a directed acyclic graph, so why can't you just clone this single DAG instead of the set of all graphs in a repository?

At least in Git, the DAG representing the repository history applies to the whole repository, not just a single file. Each commit object points to a "tree" object which represents the entire state of the repository at that time.
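To illustrate the point, here is a minimal Python sketch of a history where every commit holds a snapshot of the whole repository. It is a toy model, not Git's actual object format: the `Commit` class and the `file_history` helper are hypothetical names made up for this example. A per-file history is not stored anywhere; it has to be derived by walking the repository-wide DAG and comparing snapshots.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """Toy commit: a snapshot of the *entire* repository plus parent links."""
    name: str
    tree: dict           # path -> content, for every file in the repository
    parents: tuple = ()  # names of the parent commits


def file_history(commits, head, path):
    """Walk the whole-repository DAG from `head` and return the names of
    the commits in which `path` differs from all of their parents."""
    by_name = {c.name: c for c in commits}
    seen, stack, touched = set(), [head], []
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        commit = by_name[name]
        current = commit.tree.get(path)
        if commit.parents:
            changed = all(current != by_name[p].tree.get(path)
                          for p in commit.parents)
        else:
            # A root commit "touches" the file if the file exists in it.
            changed = current is not None
        if changed:
            touched.append(name)
        stack.extend(commit.parents)
    return touched


commits = [
    Commit("c1", {"a.txt": "1", "b.txt": "x"}),
    Commit("c2", {"a.txt": "1", "b.txt": "y"}, ("c1",)),
    Commit("c3", {"a.txt": "2", "b.txt": "y"}, ("c2",)),
]
history_of_a = file_history(commits, "c3", "a.txt")
history_of_b = file_history(commits, "c3", "b.txt")
```

In this model, the history of `a.txt` can only be found by visiting every commit, including `c2`, which did not change that file at all.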

Git 1.7 supports "sparse checkouts", which allow you to restrict the size of your working copy. The entire repository data is still cloned, however.

Greg Hewgill
Ok this answers my question at least for git. I wonder whether every distributed rcs is designed this way or if you can have a design that allows splitting and joining of repositories.
Jakob
+2  A: 

In bazaar you can split and join parts of a repository.

The split-command allows you to split a repository into multiple repositories. The join-command allows you to merge repositories. Both keep the history.

However, this isn't as handy as the SVN model, where you can check out and commit a sub-tree.

There's a planned feature called Nested-Trees for bazaar, which might allow partial checkouts.

Gamlor
Hm, I tried split and join but they keep the whole history instead of only the history of a subset of the repository. The fast-import plugin (https://launchpad.net/bzr-fastimport) seems to do the job but afterwards I cannot merge updates from the source repository which I split from. I hope that nested trees is not vaporware.
Jakob
I'm not 100% sure, but bazaar has only global versions and history. Each change-set applies to the whole repository. So when you split, the whole history also applies to the sub-directory. That's why the whole history is still there after the split, except that some entries don't have any effect. Nested-Trees: I don't know. Let's hope it's not vaporware.
Gamlor
+4  A: 

As of version 1.6, it is not possible to make a so-called "narrow clone" with Mercurial, that is, a clone where you only retrieve data for a specific sub-directory. We call it a "shallow clone" when you only retrieve part of the history, say, the last 100 revisions.
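The difference between the two terms can be sketched in a few lines of Python. This is only a toy model over a list of snapshots (oldest first); `shallow_clone` and `narrow_clone` are hypothetical helper names for this example and are not part of Mercurial.

```python
def shallow_clone(history, depth):
    """Keep only the newest `depth` revisions, each with all of its files."""
    return history[-depth:] if depth > 0 else []


def narrow_clone(history, prefix):
    """Keep every revision, but only the files whose path starts with `prefix`."""
    return [
        {path: data for path, data in revision.items() if path.startswith(prefix)}
        for revision in history
    ]


# Each revision maps path -> content for the whole repository.
history = [
    {"docs/a": "1", "src/m": "x"},
    {"docs/a": "2", "src/m": "x"},
    {"docs/a": "2", "src/m": "y"},
]
last_two = shallow_clone(history, 2)     # fewer revisions, all files
docs_only = narrow_clone(history, "docs/")  # all revisions, fewer files
```

A shallow clone cuts the history along the time axis, while a narrow clone cuts it along the directory axis.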

As you say, there is nothing in the common DAG-based history model that excludes this feature and we have been working on it. Peter Arrenbrecht, a Mercurial contributor, has implemented two different approaches for narrow clones, but neither approach has been merged yet. There is a Google Summer of Code student working on shallow clones this year, so there is hope that the feature will appear soonish.

Btw, you can of course split an existing Mercurial repository into pieces where each smaller repository only has the history for a specific sub-directory of the original repository. The convert extension is the tool for this. Each of the smaller repositories will be unrelated to the bigger repository, though -- the tricky part is to make the splitting seamless so that the changesets keep their identities.
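Why the identities change can be shown with a simplified Python sketch. It assumes that a changeset id is a hash over the parent ids, the full snapshot, and the commit message; the real Mercurial hash format differs in detail, and `changeset_id`, `chain_ids` and `split` are hypothetical names for this example. Once the other directories are filtered out, the hashed content changes, so every id in the split repository differs from its counterpart in the original.

```python
import hashlib


def changeset_id(parent_ids, files, message):
    """Hash the parents, the full file snapshot, and the message into one id."""
    h = hashlib.sha1()
    for pid in sorted(parent_ids):
        h.update(pid.encode())
    for path in sorted(files):
        h.update(f"{path}\0{files[path]}\0".encode())
    h.update(message.encode())
    return h.hexdigest()


def chain_ids(history):
    """Compute ids for a linear history given oldest first as (files, message)."""
    ids, parents = [], []
    for files, message in history:
        cid = changeset_id(parents, files, message)
        ids.append(cid)
        parents = [cid]
    return ids


def split(history, prefix):
    """Keep only the files under `prefix` in every snapshot."""
    return [
        ({p: d for p, d in files.items() if p.startswith(prefix)}, message)
        for files, message in history
    ]


history = [
    ({"docs/a": "1", "src/m": "x"}, "initial import"),
    ({"docs/a": "2", "src/m": "x"}, "update docs"),
]
original_ids = chain_ids(history)
docs_ids = chain_ids(split(history, "docs/"))
```

Even the second changeset, which only touched `docs/a`, gets a new id after the split, because both its snapshot and its parent id have changed. This is why the split repositories are unrelated to the original as far as the tool is concerned.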

Martin Geisler
+1  A: 

I hope one of these RCS's will add narrow clone capability. My understanding is that the architecture of Git (changes/moves tracked across the whole repo) makes this very difficult.

Bazaar prides itself on supporting many different types of workflows. Lack of narrow clone capability prohibits an SVN/CVS-like workflow in bzr/hg/git, so I'm hoping they'll be motivated to find some way to do this.

New features shouldn't come at the expense of basic functionality, like the ability to fetch a single file/directory from the repo. The "distributed" feature of modern rcs's is "cool," but in my opinion discourages good development practices (frequent merges / continuous integration). These new RCS's all seem to lack very basic functionality. Even SVN without real branching/tagging support seemed like a step backwards from CVS imo.

nairbv