I've heard in a few places that one of the main ways distributed version control systems shine is much better merging than traditional tools like SVN. Is this actually due to inherent differences in how the two systems work, or do specific DVCS implementations like Git/Mercurial just have cleverer merging algorithms than SVN?

+5  A: 

SVN tracks files while Git tracks content. Git is clever enough to track a block of code that was refactored from one class/file to another. The two use completely different approaches to tracking your source.

I still use SVN heavily but am very pleased with the few times I've used Git.

A nice read if you have the time:
http://plasmasturm.org/log/487/

used2could
+1  A: 

It is a difference caused by the way revisions are stored. SVN logically stores the file state at different points in time (though using deltas), but Git and most other DVCSs store changesets.

alvin
Actually, Git stores file contents too.
Andrew Aylett
So conceptually, the way SVN saves files could be plugged into a GIT-style merge?
John
Didn't understand. Can you clarify a bit more?
alvin
Technically, Git stores file content (explained here: http://book.git-scm.com/1_the_git_object_model.html). Changesets are inferred by calculating the difference between revisions. Mercurial (hg), on the other hand, stores changesets plus snapshots of the content every now and then (explained here: http://hgbook.red-bean.com/read/behind-the-scenes.html).
Spoike
+3  A: 

Historically, Subversion has only been able to perform a straight two-way merge because it didn't store any merge information. This involves taking a set of changes and applying them to a tree. Even with merge information, this is still the most commonly used merge strategy.

Git uses a 3-way merge algorithm by default, which involves finding a common ancestor to the heads being merged and making use of the knowledge that exists on both sides of the merge. This allows Git to be more intelligent in avoiding conflicts.
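To make the difference concrete, here is a minimal Python sketch of a line-by-line three-way merge next to a two-way comparison. It is not Git's actual implementation: the function names are made up for illustration, and it assumes all versions have the same number of lines so that lines can be compared by position (real tools first align the texts with a diff).

```python
def two_way_differences(ours, theirs):
    """Without a base, every differing line is a potential conflict."""
    return [i for i, (o, t) in enumerate(zip(ours, theirs)) if o != t]


def three_way_merge(base, ours, theirs):
    """Merge two edited versions of `base`, comparing lines by position.

    Simplifying assumption: all three versions have the same length.
    Returns the merged lines and the indexes that really conflict.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles equal-length inputs")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (or neither side changed)
            merged.append(o)
        elif o == b:      # only their side changed this line
            merged.append(t)
        elif t == b:      # only our side changed this line
            merged.append(o)
        else:             # both sides changed it differently
            merged.append(o)
            conflicts.append(i)
    return merged, conflicts


base = ["a", "b", "c"]     # common ancestor
ours = ["a", "B", "c"]     # we changed line 1
theirs = ["a", "b", "C"]   # they changed line 2

print(two_way_differences(ours, theirs))    # [1, 2]
print(three_way_merge(base, ours, theirs))  # (['a', 'B', 'C'], [])
```

The two-way comparison only sees that two lines differ and cannot tell who changed what. With the common ancestor available, both edits can be applied automatically.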

Git also has some sophisticated rename finding code, which also helps. It doesn't store changesets or store any tracking information -- it just stores the state of the files at each commit and uses heuristics to locate renames and code movements as required (the on-disk storage is more complicated than this, but the interface it presents to the logic layer exposes no tracking).

Andrew Aylett
+3  A: 

Just read an article on Joel's blog (sadly his last one). This one is about Mercurial, but it actually talks about the advantages of distributed VC systems such as Git.

With distributed version control, the distributed part is actually not the most interesting part. The interesting part is that these systems think in terms of changes, not in terms of versions.

Read the article here.

rubayeet
That was one of the articles I was thinking about before posting here. But "thinks in terms of changes" is a very vague, marketing-sounding term (remember, Joel's company sells DVCS now).
John
I thought that was vague as well... I always thought changesets were an integral part of versions (or revisions, rather), so it surprises me that some programmers don't think in terms of changes.
Spoike
+45  A: 

The claim that merging is better in a DVCS than in Subversion was largely based on how branching and merging worked in Subversion a while ago. Subversion prior to 1.5.0 didn't store any information about when branches were merged, so when you wanted to merge you had to specify which range of revisions had to be merged.

So why did Subversion merges suck?

Ponder this example:

      1   2   4     6     8
trunk o-->o-->o---->o---->o
       \
        \   3     5     7
b1       +->o---->o---->o

When we want to merge b1's changes into the trunk we'd issue the following command from within a folder that has trunk checked out:

svn merge -r 3:7 {link to branch b1}

… which will attempt to merge the changes from b1 into your local working directory. You then commit the changes after you have resolved any conflicts and tested the result. After the commit, the revision tree would look like this:

      1   2   4     6     8   9
trunk o-->o-->o---->o---->o-->o      "the merge commit is at r9"
       \
        \   3     5     7
b1       +->o---->o---->o

However, this way of specifying ranges of revisions quickly gets out of hand as the version tree grows, because Subversion didn't have any metadata on when and which revisions got merged together. Ponder what happens later:

           12        14
trunk  …-->o-------->o
                                     "Okay, so when did we merge last time?"
              13        15
b1     …----->o-------->o

This is largely an issue with the repository design that Subversion has: in order to create a branch you need to create a new virtual directory in the repository, which will house a copy of the trunk, but it doesn't store any information regarding when and what things got merged back in. That will lead to nasty merge conflicts at times. What was even worse is that Subversion used two-way merging by default, which has some crippling limitations in automatic merging because the two branch heads are not compared with their common ancestor.
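To make the missing bookkeeping concrete, here is a small Python sketch of what a record of already-merged revision ranges buys you. This is only an illustration of the idea, not how Subversion implements it; the function name and the revision numbers (taken loosely from the diagrams above) are assumptions.

```python
def eligible_revisions(branch_revs, merged_ranges):
    """Return the branch revisions not covered by any recorded merge.

    branch_revs   -- revision numbers committed on the branch
    merged_ranges -- inclusive (first, last) ranges already merged
    """
    return [r for r in branch_revs
            if not any(lo <= r <= hi for lo, hi in merged_ranges)]


# Hypothetical history: b1 has revisions 3, 5, 7 and later 13, 15.
b1_revs = [3, 5, 7, 13, 15]

# Without a record, nobody knows where the last merge stopped.
# With a record of the earlier "-r 3:7" merge, the answer is mechanical.
merged_into_trunk = [(3, 7)]

print(eligible_revisions(b1_revs, merged_into_trunk))  # [13, 15]
```

Without such a record, the person merging has to dig through the log to find the right starting revision by hand.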

To mitigate this, Subversion now stores metadata for branches and merges. That would solve all problems, right?

And oh, by the way, Subversion still sucks…

On a centralized system, like Subversion, virtual directories suck. Why? Because everyone has access to view them… even the garbage experimental ones. Branching is good if you want to experiment, but you don't want to see everyone's and their aunt's experimentation. This is serious cognitive noise. The more branches you add, the more crap you'll get to see.

The more public branches you have in a repository, the harder it will be to keep track of all the different branches. So the question you'll have is whether a branch is still in development or really dead, which is hard to tell in any centralized version control system.

Most of the time, from what I've seen, an organization will default to using one big branch anyway. That is a shame, because it in turn makes it difficult to keep track of testing and release versions, and whatever other good comes from branching.

So why is DVCS, such as Git and Mercurial, better than Subversion at branching and merging?

There is a very simple reason why: branching is a first-class concept. There are no virtual directories by design, and branches are hard objects in a DVCS, which they need to be in order for synchronization of repositories (i.e. push and pull) to work simply.

The first thing you do when you work with a DVCS is to clone repositories (git's clone and hg's clone). Cloning is conceptually the same thing as creating a branch in version control. Some call this forking, but that's just the same thing. In fact, every user is running their own repository, which means you have per-user branching going on.

The version structure is not a tree, but rather a graph. More specifically, a directed acyclic graph (DAG, meaning a directed graph that doesn't have any cycles). You really don't need to delve into the specifics of a DAG other than that each commit (apart from the very first) has one or more parent references, which point to what the commit was based on. So the following graphs will show the arrows between revisions in reverse because of this.

A very simple example of merging would be this: imagine a central repository called origin and a user, Alice, cloning the repository to her machine.

         a…   b…   c…
origin   o<---o<---o
                   ^master
         |
         | clone
         v

         a…   b…   c…
alice    o<---o<---o
                   ^master
                   ^origin/master

What happens during a clone is that every revision is copied to Alice exactly as it was (which is validated by the uniquely identifiable hash IDs), and the clone marks where the origin's branches are.
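Why a hash ID can validate a copy is easiest to see in a toy version. The Python sketch below derives a revision's ID from its parents' IDs and its content. The exact input format here is made up for illustration and is not the object format git or hg actually use.

```python
import hashlib


def revision_id(parent_ids, content):
    """Derive an ID from the parent IDs and the content.

    Hypothetical format: parent IDs, one per line, followed by the content.
    """
    h = hashlib.sha1()
    for parent in parent_ids:
        h.update(parent.encode("ascii"))
        h.update(b"\n")
    h.update(content.encode("utf-8"))
    return h.hexdigest()


# Build a tiny chain a <- b, as in the diagram.
a = revision_id([], "first version")
b = revision_id([a], "second version")

# A faithful copy reproduces the same IDs...
assert revision_id([a], "second version") == b

# ...while tampering with an earlier revision changes every ID after it.
tampered_a = revision_id([], "first version, altered")
assert revision_id([tampered_a], "second version") != b
```

Because each ID depends on the parents' IDs, two repositories that agree on an ID agree on that revision's content and on its ancestry (barring hash collisions).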

Alice then works on her repo, committing in her own repository, and decides to push her changes:

         a…   b…   c…
origin   o<---o<---o
                   ^ master

              "what'll happen after a push?"


         a…   b…   c…   d…   e…
alice    o<---o<---o<---o<---o
                             ^master
                   ^origin/master

The solution is rather simple: the only thing that the origin repository needs to do is to take in all the new revisions and move its branch to the newest revision (which git calls "fast-forward"):

         a…   b…   c…   d…   e…
origin   o<---o<---o<---o<---o
                             ^ master

         a…   b…   c…   d…   e…
alice    o<---o<---o<---o<---o
                             ^master
                             ^origin/master

The use case I illustrated above doesn't even need to merge anything. So the issue really isn't with merging algorithms, since the three-way merge algorithm is pretty much the same between all version control systems. The issue is more about structure than anything.

So how about you show me an example that has a real merge?

Admittedly the above example is a very simple use case, so let's do a much more twisted one, albeit a more common one. Remember that origin started out with three revisions? Well, the guy who did them, let's call him Bob, has been working on his own and made a commit in his own repository:

         a…   b…   c…   f…
bob      o<---o<---o<---o
                        ^ master
                   ^ origin/master

                   "can Bob push his changes?" 

         a…   b…   c…   d…   e…
origin   o<---o<---o<---o<---o
                             ^ master

Now Bob can't push his changes directly to the origin repository. The system detects this by checking whether Bob's revisions directly descend from origin's, which in this case they don't. Any attempt to push will result in the system saying something akin to "Uh... I'm afraid I can't let you do that, Bob."
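The descent check can be sketched in a few lines of Python. This is only a model of the idea, using the single-letter revision names from the diagrams and a hypothetical `parents` mapping; it is not how git or hg implement it internally.

```python
def ancestors(parents, rev):
    """Return rev plus everything reachable from it through parent links."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen


def can_fast_forward(parents, remote_tip, local_tip):
    """A push is a plain fast-forward if the remote tip is an ancestor
    of (or the same as) the tip being pushed."""
    return remote_tip in ancestors(parents, local_tip)


# Each revision maps to the list of revisions it was based on.
parents = {
    "a": [], "b": ["a"], "c": ["b"],   # the original three revisions
    "d": ["c"], "e": ["d"],            # Alice's commits
    "f": ["c"],                        # Bob's commit
}

print(can_fast_forward(parents, "c", "e"))  # Alice's push: True
print(can_fast_forward(parents, "e", "f"))  # Bob's push: False
```

Alice's tip e has origin's old tip c among its ancestors, so origin only needed to move its branch pointer. Bob's tip f does not descend from e, so his push is refused.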

So Bob has to pull in the changes first and then merge. This is an automated two-step process both in git and hg. First Bob has to fetch the new revisions, which will copy them as they are from the origin repository. We can now see that the graph diverges:

                        v master
         a…   b…   c…   f…
bob      o<---o<---o<---o
                   ^
                   |    d…   e…
                   +----o<---o
                             ^ origin/master

         a…   b…   c…   d…   e…
origin   o<---o<---o<---o<---o
                             ^ master

The second step of the pull process is to merge the diverging tips and make a commit of the result:

                                 v master
         a…   b…   c…   f…       1…
bob      o<---o<---o<---o<-------o
                   ^             |
                   |    d…   e…  |
                   +----o<---o<--+
                             ^ origin/master
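The common ancestor that the merge step relies on can be found from the parent links alone. A minimal Python sketch, again using the letters from the diagrams and a hypothetical `parents` mapping (with "m" standing in for the merge commit drawn as 1…):

```python
def ancestors(parents, rev):
    """Return rev plus everything reachable from it through parent links."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen


def merge_bases(parents, x, y):
    """Common ancestors of x and y that are not themselves an ancestor
    of another common ancestor (the "nearest" ones)."""
    common = ancestors(parents, x) & ancestors(parents, y)
    return {c for c in common
            if not any(c != other and c in ancestors(parents, other)
                       for other in common)}


parents = {
    "a": [], "b": ["a"], "c": ["b"],
    "d": ["c"], "e": ["d"],   # fetched from origin
    "f": ["c"],               # Bob's own commit
}

print(merge_bases(parents, "f", "e"))  # {'c'}

# The merge commit simply records both tips as its parents.
parents["m"] = ["f", "e"]

# origin's tip e is now an ancestor of m, so pushing m is a fast-forward.
print("e" in ancestors(parents, "m"))  # True
```

Nobody had to tell the system which revisions to merge: the graph itself says that c is where the two lines of work diverged.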

Hopefully the merge won't run into conflicts, but if you anticipate them it's good to at least do this pull process manually (with git's fetch and merge, or hg's pull and merge). What needs to be done afterwards is to push those changes to origin again, which will result in a fast-forward since the merge commit is a direct descendant of the latest revision in the origin repository:

                                 v origin/master
                                 v master
         a…   b…   c…   f…       1…
bob      o<---o<---o<---o<-------o
                   ^             |
                   |    d…   e…  |
                   +----o<---o<--+

                                 v master
         a…   b…   c…   f…       1…
origin   o<---o<---o<---o<-------o
                   ^             |
                   |    d…   e…  |
                   +----o<---o<--+

There is another option to merge in git and hg, called rebase, which'll move Bob's changes to after the newest changes. Since I don't want this answer to be any more verbose I'll let you read the git or mercurial docs about that instead.

As an exercise for the reader, try drawing out how it'll work with another user involved. It is done similarly to the example above with Bob. Merging between repositories is easier than you'd think because all the revisions/commits are uniquely identifiable.

There is also the issue of sending patches between developers, which was a huge problem in Subversion and is mitigated in git and hg by uniquely identifiable revisions. Once someone has merged his changes (i.e. made a merge commit) and sent the result for everyone else in the team to consume, by either pushing to a central repository or sending patches, nobody else has to worry about the merge, because it already happened. Martin Fowler calls this way of working promiscuous integration.

Because the structure is different from Subversion's, employing a DAG instead, branching and merging are easier not only for the system but for the user as well.

Spoike
I don't agree with your branches==noise argument. Lots of branches doesn't confuse people because the lead dev should tell people which branch to use for big features... so two devs might work on branch X to add "flying dinosaurs", 3 might work on Y to "let you throw cars at people"
John
John: Yes, for a small number of branches there is little noise and it is manageable. But come back after you've witnessed 50+ branches and tags or so in Subversion or ClearCase, where for most of them you can't tell if they're active or not. Usability issues with the tools aside, why have all that litter around in your repository? At least in p4 (since a user's "workspace" is essentially a per-user branch), git or hg you've got the option to not let everyone know about the changes you make until you push them upstream, which is a safeguard for when the changes are relevant to others.
Spoike
Well I think personally I'd _consider_ a branch for deletion every time it is merged back to trunk, although of course in an iterative build process it might happen many times before a feature is marked done. Perhaps feature-branches are better for a waterfall model, where you can deliver a new feature and close the branch.
John
I fought the trunk ... and the .. trunk won (singing). +1
Tim Post
This may be one of the best answers I've ever read on this site. Nicely done!
Shaun
I don't get your "too many experimental branches are noise" argument either, @Spoike. We have a "Users" folder where every user has his own folder. There he can branch as often as he wishes. Branches are inexpensive in Subversion, and if you ignore the folders of the other users (why should you care about them anyway), then you don't see noise. But for me merging in SVN does not suck (and I do it often, and no, it's not a small project). So maybe I do something wrong ;) Nevertheless the merging of Git and Mercurial is superior and you pointed it out nicely.
John Smithers
@John Smithers: I do admit that I might be a bit inflammatory with some of my claims, but that's pretty much the nature of the criticism that SVN gets. I don't really hate SVN, but I have seen one project where all branches are in one virtual directory, and reintegrating branches with trunk and back again makes people pull their hair out. So the whole branches==noise argument is not valid for well-organized projects; however, Subversion does not enforce well-organized projects by convention at all.
Spoike
@Spoike: Maybe they should have read the documentation first. Saves a lot of hair pulling ;)
John Smithers
@John Smithers: Well, the svn book is quite verbose and apologetic about branching. The whole chapter about branching alone made me realize very early (coming from a CVS background) that branching feels "tacked on" in SVN; as if you don't need it and the authors sometimes feel like they're sorry that SVN has virtual directories and cheap copies. In git and hg, you can't even get code without branching (by cloning the repository); it has to be explained from start.
Spoike
In svn it's easy to kill inactive branches, you just delete them. The fact that people don't remove unused branches therefore creating clutter is just a matter of housekeeping. You could just as easily wind up with lots of temporary branches in Git as well. In my workplace we use a "temp-branches" top-level directory in addition to the standard ones - personal branches and experimental branches go in there instead of cluttering the branches directory where "official" lines of code are kept (we don't use feature branches).
Ken Liu
Very nice answer.
Ant
I don't think this answer answers the question. The biggest difference between Git/Mercurial and Subversion in how they merge is that the DVCSs track revision graphs, while Subversion history is a tree. But is that *inherently* so? Is there any reason Subversion couldn't model history as a graph, while remaining a centralized VCS?
Avi
Avi: I think you're asking a whole other question. In practice, an SVN repository models the history as a _sequence of commits_, each touching relevant files under the same directory. It is still a centralized VCS. Subversion wasn't designed to be a distributed VCS.
Spoike