Transferring a legacy code base from CVS to a distributed repository (e.g. Git or Mercurial). Suggestions needed for initial repository design.

Just a quick reminder: these migrations often offer the opportunity to reorganize the sources, not along modules (each with its own repository) but along functional domains (several modules for the same functional domain being put in the same repository).

Submodules are then used as a way to define a configuration.

Git is alright, but by Linus's own admission, putting everything into one repository can be problematic:

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Those two points argue for a more component-oriented approach for a large system (and a large legacy repository).

With Git submodules, you can check those components out in your project (even if it is a two-step process). There are, however, tools that can make submodule management easier (git.rake for instance).
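As a rough illustration of that two-step process, here is a minimal sketch that builds two throwaway repositories in a temporary directory. The names `framework`, `project`, `vendor/framework` and the helper `g` are hypothetical, chosen only for this example. The `protocol.file.allow` override is there only because the submodule URL is a local path, which recent Git releases refuse by default for submodules.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"

# Helper: run git with a fixed identity, and allow local-path submodule URLs.
g() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c protocol.file.allow=always "$@"
}

# A stand-in for a shared module living in its own repository.
g init -q framework
( cd framework && echo "v1" > lib.txt && g add lib.txt && g commit -q -m "framework v1" )

# The project that consumes it records the module as a submodule.
g init -q project
cd project
g submodule --quiet add "$work/framework" vendor/framework
g commit -q -m "Reference framework as a submodule"
cd ..

# Step 1: a plain clone only creates an empty vendor/framework directory.
g clone -q project project-clone
cd project-clone

# Step 2: fetch the submodule and check out the commit the project recorded.
g submodule --quiet update --init
```

After the second step, `vendor/framework` in the clone contains the module's files at the commit recorded by `project`.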

When I'm thinking of fixing a bug in a module that's shared between several projects, I just fix the bug and commit it and all just do their updates

That is what I describe in the post Vendor Branch as the "system approach": everyone works on the latest (HEAD) of everything, and it is effective for a small number of projects.
For a large number of modules, though, the notion of "module" is still very useful, but its management is not the same with a DVCS:

for closely related modules (i.e. "in the same functional domain", like all modules related to "PNL" (Profit aNd Losses) or "Risk analysis" in a financial domain), you do need to work with the latest (HEAD) of all components involved.
That would be achieved with the use of a subtree strategy, not in order for you to publish (push) corrections on those other submodules, but to track work done by other teams.
Git allows that with the added bonus that this "tracking" does not have to take place between your repository and one "central" repository, but can also take place between you and the local repository of the other team, allowing for very quick back-and-forth integration and testing between projects of a similar nature.
however, for modules which are not directly in your functional domain, submodules are a better option, because they refer to a fixed version of a module (a commit):
when a low-level framework changes, you do not want the change to be propagated instantaneously, since it would impact all the other teams, which would then have to drop what they were doing to adapt their code to that new version (you do, though, want all the other teams to be aware of this new version, so that they do not forget to update that low-level component or "module").
That allows you to work only with official, stable, identified versions of other modules, and not with potentially unstable or not fully tested HEADs.
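To make the subtree side of that distinction more concrete, here is a minimal sketch of tracking another team's HEAD directly from their local repository. It is one possible way to set this up, not necessarily the exact procedure meant above. The repository names `pnl-lib` and `project`, the prefix `libs/pnl` and the helper `g` are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"

# Helper: run git with a fixed identity so commits work anywhere.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# Another team's module, in the same functional domain as ours.
g init -q pnl-lib
( cd pnl-lib && echo "v1" > pnl.txt && g add pnl.txt && g commit -q -m "pnl v1" )

# Our own project.
g init -q project
cd project
echo "our project" > README
g add README
g commit -q -m "initial project commit"

# One-time setup: graft the other team's history under libs/pnl/.
g fetch -q ../pnl-lib
g merge -q -s ours --no-commit --allow-unrelated-histories FETCH_HEAD
g read-tree --prefix=libs/pnl/ -u FETCH_HEAD
g commit -q -m "Import pnl-lib under libs/pnl"

# The other team keeps working in their own repository...
( cd ../pnl-lib && echo "v2" > pnl.txt && g commit -q -am "pnl v2" )

# ...and we pull in their latest HEAD, shifted into our subdirectory.
g fetch -q ../pnl-lib
g merge -q -X subtree=libs/pnl -m "Track pnl-lib HEAD" FETCH_HEAD
```

No central repository is involved here: each `fetch` reads straight from the other team's working repository. A submodule, by contrast, would stay on its recorded commit until someone deliberately moves it.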

As for the Mercurial side, the recommendation is also to refactor large legacy CVS/SVN repositories into smaller components. Common code should be put into its own libraries, and the application code will then depend on those libraries in a similar way to how it depends on other libraries.

Mercurial has the forest extension, which allows you to manage a "forest" of "source trees". With that approach you combine several smaller repositories into a larger one. With CVS you do the opposite: you check out a smaller portion of a large repository.

I have not personally used the forest extension and its page says that one should use an updated version compared to the one bundled with Mercurial. However, it is used by a big organization like Sun in its OpenJDK project.

There is also work underway to add sub-repository support directly to the core of Mercurial, as per the design on the nested repositories page in the Mercurial wiki.

Thank you for your reply. I guess the biggest obstacle for me (and perhaps for most users of single non-distributed repositories) is how the mind is set in a certain way. I mean how you look at things and how you organize your code etc. I'm beginning to get the taste of it. To be continued.

Magnus Skog 2009-05-22 19:40:52

So you are more or less saying that you have to let go (perhaps not fully) of the "module thinking". When I'm thinking of fixing a bug in a module that's shared between several projects, I just fix the bug and commit it and all just do their updates. But with git, that "same" module in projectA is a repository of its own and another repository in projectB? So when I fix a bug in my repository version of that module they can just pull the changes from me.

Magnus Skog 2009-05-22 19:44:00

+1 for answering. Will look into this as well. Thanks :)

Magnus Skog 2009-05-26 07:00:17

As of version 1.3 (July 1, 2009) Mercurial has the beginnings of submodule support built in under the name "subrepos" (http://mercurial.selenic.com/wiki/subrepos). I wouldn't assume the feature will stabilize immediately, but it is coming.

quark 2009-07-25 19:41:10
