tags:

views: 969

answers: 3

I've been using Mercurial for my own personal projects for a while, and I love it. My employer is considering a switch from CVS to SVN, but I'm wondering whether I should push for Mercurial (or some other DVCS) instead.

One wrinkle with Mercurial is that it seems to be designed around the idea of having a single repository per "project". In this organization, there are dozens of different executables, DLLs, and other components in the current CVS repository, hierarchically organized. There are a lot of generic reusable components, but also some customer-specific components, and customer-specific configurations. The current build procedures generally get some set of subtrees out of the CVS repository.

If we move from CVS to Mercurial, what is the best way to organize the repository/repositories? Should we have one huge Mercurial repository containing everything? If not, how fine-grained should the smaller repositories be? I think people will find it very annoying if they have to pull and push updates from a lot of different places, but they will also find it annoying if they have to pull/push the entire company codebase.

Anybody have experience with this, or advice?


Related questions:

+24  A: 

Disclosure: This is a cross-post from another thread that focused on git, but I ended up recommending Mercurial anyway. It deals with DVCS in an enterprise context in general, so I hope cross-posting it is fine. I have modified it a little to better fit this question:


Contrary to common opinion, I think that using a DVCS is an ideal choice in an enterprise setting because it enables very flexible workflows. I will talk about using a DVCS vs. a CVCS first, then best practices, and then about Mercurial in particular.

DVCS vs. CVCS in an enterprise context:

I won't talk about the general pros and cons here, but rather focus on your context. The common conception is that using a DVCS requires a more disciplined team than using a centralized system. This is because a centralized system gives you an easy way to enforce your workflow, whereas a decentralized system requires more communication and discipline to stick to the established conventions. While this may seem to induce overhead, I see benefit in the increased communication necessary to make it a good process. Your team will need to communicate about code, about changes, and about project status in general.

Another dimension in the context of discipline is encouraging branching and experiments. Here's a quote from Martin Fowler's recent bliki entry on Version Control Tools, where he found a very concise description of this phenomenon:

DVCS encourages quick branching for experimentation. You can do branches in Subversion, but the fact that they are visible to all discourages people from opening up a branch for experimental work. Similarly a DVCS encourages check-pointing of work: committing incomplete changes, that may not even compile or pass tests, to your local repository. Again you could do this on a developer branch in Subversion, but the fact that such branches are in the shared space makes people less likely to do so.

DVCS enable flexible workflows because they provide changeset tracking via globally unique identifiers in a directed acyclic graph (DAG) instead of simple textual diffs. This allows them to transparently track the origin and history of a changeset, which can be quite important.

Workflows:

Larry Osterman (a Microsoft dev working on the Windows team) has a great blog post about the workflow they employ at the Windows team. Most notably they have:

  • A clean, high quality code only trunk (master repo)
  • All development happens on feature branches
  • Feature teams have team repos
  • They regularly merge the latest trunk changes into their feature branch (Forward Integrate)
  • Complete features must pass several quality gates, e.g. review, test coverage, QA (each gate a repo of its own)
  • If a feature is completed and has acceptable quality, it is merged into the trunk (Reverse Integrate)

As you can see, with each of these repositories living on its own, you can decouple different teams advancing at different paces. The possibility of implementing a flexible quality gate system also distinguishes a DVCS from a CVCS. You can solve your permission issues at this level too: only a handful of people should be allowed access to the master repo. For each level of the hierarchy, have a separate repo with the corresponding access policies. Indeed, this approach can be very flexible at the team level. You should leave it up to each team to decide whether they want to share their team repo among themselves or prefer a more hierarchical approach where only the team lead may commit to the team repo.

Hierarchical Repositories

(The picture is stolen from and served by Joel Spolsky's hginit.com.)

One thing remains to be said at this point: even though a DVCS provides great merging capabilities, this is never a replacement for Continuous Integration. Even there you have a great deal of flexibility: CI for the trunk repo, CI for team repos, QA repos, etc.

Mercurial in an enterprise context:

I don't want to start a git vs. hg flamewar here, you are already on the right track by considering switching to DVCS. Here are a couple of reasons to use Mercurial instead of git:

  • All platforms that run Python are supported
  • Great GUI tools on all major platforms (Windows/Linux/OS X), first-class merge/vdiff tool integration
  • Very consistent interface, easy transition for SVN users
  • Can do most of the things git can do too, but provides a cleaner abstraction. Dangerous operations are always explicit. Advanced features are provided via extensions that must be explicitly enabled.
  • Commercial support is available from Selenic.

In short, when using DVCS in an enterprise I think it's important to choose a tool that introduces the least friction. For the transition to be successful it's especially important to consider the varying skill between developers (in regards to VCS).

There are a couple of resources I'd like to point you to in the end. Joel Spolsky has recently written an article refuting a lot of the arguments brought up against DVCS. It must be mentioned that others discovered these counter-arguments long before. Another good resource is Eric Sink's blog, where he wrote an article about Obstacles to an enterprise DVCS.

Johannes Rudolph
+1: very nice post
RedFilter
Very interesting. +1
VonC
@VonC: I'm proud to hear that from one of the SCM gurus here on SO :-)
Johannes Rudolph
@Johannes: "guru" with big double-quotes here ;) I am just learning the stuff through the SO questions (hence my many answers). A SO user like Jakub Narębski (http://stackoverflow.com/users/46058/jakub-narebski) is an actual Guru (less answers, because he is busy actually working on a DVCS instead of talking about it here)
VonC
+1 great post. very informative.
Brandon Montgomery
+5  A: 

First of all, some recent discussion on using a DVCS in huge projects is relevant:

http://stackoverflow.com/questions/2476356/distributed-version-control-for-huge-projects-is-it-feasible

One wrinkle with Mercurial is that it seems to be designed around the idea of having a single repository per "project".

Yes, while the norm with Subversion is to have one monolithic repository containing multiple projects, with a DVCS it is preferable to have more granular repositories, with one per component. Subversion has the svn:externals feature to aggregate multiple source trees at checkout time (which has its own logistical and technical issues). Both Mercurial and Git have a similar feature, called subrepos in hg.

The idea with subrepos is you have one repo per component, and a releasable product (comprising multiple reusable components) will simply refer to its dependent repos. When you clone the product repo, it brings along the components it needs.
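As a sketch, the product repo declares its dependencies in a `.hgsub` file mapping local paths to the component repositories (all paths and URLs below are hypothetical):

```
libs/common   = https://hg.example.com/components/common
libs/widgets  = https://hg.example.com/components/widgets
customer/acme = https://hg.example.com/customers/acme-config
```

Once `.hgsub` is committed, Mercurial records the exact revision of each subrepo in `.hgsubstate` on every commit, so a clone of the product repo reproduces the same component versions.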

Should we have one huge Mercurial repository containing everything? If not, how fine-grained should the smaller repositories be? I think people will find it very annoying if they have to pull and push updates from a lot of different places, but they will also find it annoying if they have to pull/push the entire company codebase.

It is certainly possible to have one monolithic repository (and you can even split it up down the track if you need to). The issues with this approach are more likely to come down to release schedules, and how you manage different versions of different components. If you have multiple products with their own release schedules sharing common components, you would probably be better off with a more granular approach, to facilitate configuration management.

One caveat is that the subrepo support is a relatively recent feature, and is not as fully fledged as other features. Specifically, not all hg commands know about subrepos, although the most important ones do.

I suggest you perform a test conversion, and experiment with the subrepo support, organising products and dependent components, etc. I am in the process of doing the same thing, and this seems to be the way to go.

gavinb
I strongly endorse test conversions. As ever, it depends on how well your management will view work that isn't, on the face of it, "productive" ...
crazyscot
+13  A: 

AFAICS most of the resistance to any of the DVCSes comes from people not understanding how to use them. The oft-repeated statement that "there is no central repository" is very scary to people who have been locked into the CVS/SVN model since time immemorial and can't imagine anything else, especially so for management and senior (experienced and/or cynical) developers who want strong source code tracking and reproducibility (and perhaps also if you have to satisfy certain standards regarding your development processes, like we did at a place I once worked). Well, you can have a central "blessed" repo; you just aren't shackled to it. It's easy for a subteam to set up an internal playground repo on one of their workstations for a while, for example.

There are so many ways to skin the proverbial cat that it will pay you to sit down and think carefully about your workflow. Think about your current practices and the power that nearly-free cloning and branching gives you. It's likely that some of what you currently do will have evolved to work around the limitations of the CVS-type model; be prepared to break the mould. You will probably need to appoint a champion or two to ease everybody through the transition; with a big team you probably want to think about restricting commit access to blessed.

At my work (small software house) we moved from CVS to hg and wouldn't go back. We're using it in a mostly-centralised way. Converting our main (ancient and very large) repo was painful, but it will be whatever way you go, and when it's done it's done - it'll be a lot easier to change VCS later. (We found a number of situations where the CVS conversion tools just can't figure out what happened; where somebody's commit only partially succeeded and they didn't notice for days; resolving vendor branches; general madness and insanity caused by time appearing to go backwards, not helped by commit timestamps in local time from different timezones...)

The great benefit I've found of a DVCS is the ability to commit early and commit often and only push when it's ready. As I reach various work-in-progress milestones I like to lay down a line in the sand so that I have somewhere I can rewind to if need be - but these are not commits which should be exposed to the team as they are manifestly incomplete in myriad ways. (I do this mostly with mercurial queues.) It's all about the workflow; I could never have done this with CVS.

I guess you already know this, but if you're contemplating moving away from CVS, you can do so much better than SVN...


To monolith, or to module? Any paradigm shift is going to be tricky whatever VCS you work with, distributed or not; the CVS model is quite special in how it allows you to commit on a file by file basis without checking whether the rest of the repo is up to date (and let's not mention the headache that module aliases have been known to cause).

  • Dealing with monolithic repositories can be pretty slow. Your vcs client has to scan your copy of the entire universe for changes, as opposed to just a single module. (If you're working on Linux, look into the hg inotify extension if you haven't already done so.)
  • A monolithic repo also causes unnecessary race conditions when committing (pushing). It's like the CVS up-to-date check, but applied across the entire repo: if you have many active developers, committing frequently, this one will bite you.
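Enabling inotify is a one-line addition to your `~/.hgrc` (a sketch; the extension ships with Mercurial but is Linux-only):

```
[extensions]
# Keeps a daemon-maintained index of the working copy so that
# 'hg status' on a huge repo needn't rescan the entire tree.
inotify =
```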

I'd suggest that it's worth the effort to stay away from monolithic, but beware that it will impose its own overhead in terms of added complexity in your build system. (Side note: If you find something a tiresome chore, automate it! We programmers are lazy creatures, after all.) Splitting your repo out into all its component modules may be too extreme; there may be a halfway house to be found with related components grouped together among a small number of repositories. You may also find it useful to look into Mercurial's submodule support - Nested Repositories and the Forest Extension (both of which I ought to try and get my head around).

At a former workplace we had several dozen components which were kept as independent CVS modules with a fairly regimented metastructure. Components declared what they depended on and which built parts were to be exported where; the build system automatically wrote make fragments so that what you were working on would pick up what it needed. It generally worked very well and it was quite rare to fail the CVS up-to-date check. (There was also a fiendishly complicated but extremely powerful build bot with a least-effort attitude to dependency resolution: it wouldn't rebuild a component if there was already one which met your requirements. Add to that meta-components which assembled installers and whole ISO images, and you have a good recipe for easy start-to-finish builds and for things going Sorcerer's Apprentice. Somebody ought to write a book about it...)

crazyscot