The theory that "disk" is cheap has gotten a bit out of hand lately. There are some powerful aspects of version control that have enabled us to onboard new developers with a few bootstrap files and one simple command to pull the toolchain over.
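For context, the bootstrap amounts to something like the following rough sketch (the repository URL and paths here are made up, not our real ones):

    #!/usr/bin/env python3
    """Hypothetical bootstrap: pull the versioned toolchain onto a new developer machine."""
    import subprocess
    from pathlib import Path

    TOOLCHAIN_REPO = "git@example.com:build/toolchain.git"  # illustrative URL
    TOOLCHAIN_DIR = Path.home() / "toolchain"

    if TOOLCHAIN_DIR.exists():
        # Already bootstrapped once: just bring the tools up to date.
        subprocess.run(["git", "-C", str(TOOLCHAIN_DIR), "pull", "--ff-only"], check=True)
    else:
        subprocess.run(["git", "clone", TOOLCHAIN_REPO, str(TOOLCHAIN_DIR)], check=True)
    print(f"Toolchain ready in {TOOLCHAIN_DIR}; add {TOOLCHAIN_DIR / 'bin'} to your PATH.")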

Recent upgrades to the systems have prompted requests to store built binaries, followed by a request to version the entire virtualized build system. Each layer added on top creates important relationships between repositories, and a good fundamental design is necessary to manage them.

Storing the toolchain brought instant benefit, while storing the built binaries brought instant liabilities. Git, unfortunately, has some fundamental issues when dealing with large binary files.

Where do you draw the line on using version control in the right ways, and when do you start investigating more appropriate solutions?

+7  A: 

I'd say there's an order of operations here:

If they need to store files, use a file system.

If they need to track changes, use version control.

If they need to track relationships to data, use a database.

The more complicated the requirements, the more complicated the solution. But discipline is in order for those who want the more complicated solution; in these uncertain times one must avoid wasting time.

Michael Hedgpeth
A: 

I stick to the classic answer of storing anything that is needed to build the final product. The binaries and intermediate files aren't needed, but any scripts that are used in the build are included.

I use my git repos as backups, storing bare clones in at least two places, so I try not to leave out anything that is needed for a build, but I don't bother storing anything transient.
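A rough sketch of that habit, assuming two remotes named "backup1" and "backup2" that already point at the bare clones (the remote names are just examples):

    #!/usr/bin/env python3
    """Push every branch and tag to each backup remote (remote names are illustrative)."""
    import subprocess

    BACKUP_REMOTES = ["backup1", "backup2"]  # bare clones kept in two separate places

    for remote in BACKUP_REMOTES:
        # --all pushes all branches, --tags pushes all tags; both are needed for a usable backup.
        subprocess.run(["git", "push", "--all", remote], check=True)
        subprocess.run(["git", "push", "--tags", remote], check=True)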

Abizern
+1  A: 

Version control that which cannot be recreated without it. So, the tool chain cannot readily be recreated - there is sense in version controlling that. With the tool chain (and source code) under version control, there is no need to archive the build products - or, at least, not after the testing of the build is complete.

Jonathan Leffler
What if a patch is applied to the OS? How do you recreate the machine then?
ojblass
You should be keeping a record of where the install media for the o/s is, and of which patches are installed. Done properly, it is hard; that's why most people and companies don't do it properly - me included.
Jonathan Leffler
+9  A: 

You probably shouldn't be storing the "entire virtualized build system" as a giant binary. If you want to version your application, you version the source code, not the compiled binary.

What many shops do is store in version control the steps to recreate the build server. Then you need one fixed image (a stock, out-of-the-box OS install), plus a small number of files (what to install on it, and how). Some places even have their server rebuild the app from source, on a clean OS install, for every deploy/reboot.
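As a sketch of how small those files can be, here is one way it might look, assuming a plain package list plus a rebuild script (the file name and package manager are assumptions, not anything from the question):

    #!/usr/bin/env python3
    """Rebuild a build server from a stock OS image plus a versioned package list.

    This script and packages.txt live in version control; the OS image itself does not.
    """
    import subprocess
    from pathlib import Path

    PACKAGE_LIST = Path("packages.txt")  # one package name per line, e.g. gcc, make, doxygen

    packages = [
        line.strip()
        for line in PACKAGE_LIST.read_text().splitlines()
        if line.strip() and not line.startswith("#")
    ]
    # Assumes a Debian-style host; substitute the local package manager as needed.
    subprocess.run(["apt-get", "install", "-y", *packages], check=True)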

Versioning the OS image itself as a giant binary isn't nearly as useful. You can't branch. You can't merge. You can't diff. What's the point? You might save space if your VCS can do binary diffs, but that probably takes a ton of CPU and memory to do, and if they're on a "disk is cheap" binge, then there's no reason to make life painful just to save disk space.

Either store your install scripts/libraries in VC and rebuild the VM image as needed, or just store VM images in normal files. I don't see any point in putting the images in VCS.

Ken
+1. If you don't have recreation steps, then consider named backups of the VM. VCS adds nothing, and creates a chicken-and-egg situation (how will you *check out* the image? should you version control the machine used for checkout? what about the VM hosting environment? etc).
Zac Thompson
You /can/ branch COW images... but I would not do this in git! (i.e., take a look at the 'snapshot manager' in VMware.) It is, however, impossible to merge, and very difficult to diff (manually, of course). :) Enjoy. Ken's answer is still the best.
Arafangion
+1  A: 

Common sense, rather than IT fussiness, should guide how you control and configure your toolchain. If you have standard hardware and are often adding developers, storing your built toolchain as images makes sense; but the images don't have to be under version control. If you have 50 developers, a configuration management system for your toolchain will cut overhead; if you have 5 developers, it is more overhead -- another system to learn.

So, is Git getting in the way of what you want to do? Or are you getting these requests because users think you should make your system more complicated simply because you could?

If your build tools are mature, then the date of the build may be sufficient to determine the versions of the tools that were used. You can have your build script write a text file listing your build tools and their versions, similar to a dependencies list.
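A quick sketch of what that manifest step might look like, assuming the tools all answer to a --version flag (the tool list is only an example):

    #!/usr/bin/env python3
    """Record the versions of the build tools used, next to the build output."""
    import subprocess

    TOOLS = ["gcc", "make", "ld"]  # illustrative; list whatever the build actually invokes

    with open("toolchain-versions.txt", "w") as manifest:
        for tool in TOOLS:
            # Most toolchain binaries print their version on the first line of --version output.
            result = subprocess.run([tool, "--version"], capture_output=True, text=True, check=True)
            manifest.write(f"{tool}: {result.stdout.splitlines()[0]}\n")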

If you are using rapidly changing in-house tools, or alpha/beta versions of projects under active development, then there would be a good rationale for putting the build tools under version control -- but it would be solving the wrong problem. Why would you build with an unstable toolchain?

Paul
It's a strange spot to sit in when a vendor gives you customizations not found elsewhere.
ojblass
You need to back them up, then. If you put them under version control or config management, does that reduce or add to the info that your developers have to sort through? Your developers' attention is your most limited resource.
Paul
+2  A: 

For a rather extreme approach check out Vesta.

From Allan Heydon, Roy Levin, Timothy Mann, Yuan Yu. The Vesta Approach to Software Configuration Management:

The Vesta approach is based on the following foundations:

  • Immutable, immortal, versioned storage of all sources and tools. Unlike ClearCASE, Vesta uses explicit version numbers rather than views.

  • Complete, source-based configuration descriptions. By complete, we mean that the descriptions name all elements contributing to a build. Every aspect of the computing environment, including tools, libraries, header files, and environment variables, is fully described and controlled by Vesta. By source-based, we mean that configuration descriptions specify how to build a system from scratch using only immutable sources (i.e., non-derived files). The descriptions themselves are versioned and immutable sources, and their meaning is constant; a particular top-level description always describes precisely the same build using the same sources, even after new versions of sources and descriptions have been created.

  • Automatic derived file management. The storage and naming of derived files is managed automatically by the Vesta storage repository, thereby easing the burden of building multiple releases or building for multiple target platforms.

  • Site-wide caching of all build work. Vesta features a shared site-wide cache of build results so developers can benefit from each others’ builds.

  • Automatic dependency detection. The Vesta builder dynamically detects and records all dependencies, so none can be omitted by human error. By dynamically, we mean that the builder detects which sources are actually used in the process of constructing a build result and records dependencies only on them. Vesta’s dependency analysis does not make use of any knowledge of how the build tools work; it is thus semantics-independent in the terminology of Gunter [7]. For example, if a compiler reads file foo.h in the process of compiling file foo.c, Vesta will assume that the compiler’s output depends on all of foo.h, even though a tool with knowledge of C might be able to find individual items in foo.h that could be changed without changing the result of the compilation.

starblue
No, they use NFS to track file accesses.
starblue
Not the dependencies but the storage of the system: "Automatic derived file management. The storage and naming of derived files is managed automatically by the Vesta storage repository." The dependencies are possible through NFS monitoring of access.
ojblass
+4  A: 

What I always put in version control:

  • source code and makefiles: the minimum needed for building binaries.
  • test suites

What I never put in version control:

  • built binaries: they can be recreated from source control, and if I know that I may need a specific release immediately, I store them on the file system in a way similar to the Linux kernel releases.

What I put in version control depending on project:

  • build chain: I don't put it in version control when I trust the provider or when I can recreate the environment (Apple's Xcode, open-source tools such as gcc or doxygen, ...). I put it in version control when it is specifically dedicated to the project (e.g., a home-made cross-compilation chain) and when I need to recreate a binary exactly as it was built for delivery (for finding heisenbugs where any component may be involved, from the code to the OS or compiler).
mouviciel
A: 

I've been using source control for my entire toolchain. As stated above, this has great benefits:

  • Everyone uses the same tools, so we never have to worry about incompatibilities.
  • The build machine produces the same artifacts as developers.
  • We can always recreate any artifact at any point in the future, because the toolchain is fully versioned.

I drew the line somewhere above the operating system; some of what I have submitted is:

Windows

Linux

  • gcc
  • make
  • glibc

Both

  • JDK

Some of the issues I've run into while trying to do this were:

  • On Linux, things like Perl and gcc embed their installation directory into their executables. This means that developers and the build machine have to run a post-checkout script that patches the local paths into those binaries (see the sketch after this list).
  • On either platform you have a much longer and more complicated compilation option list to specify each header and library directory; that sort of thing is automatic with a "normal" install. One thing that isn't obvious is that crti.o and friends are found in /usr/lib by default and are actually owned by glibc-devel (or libc6-dev), so they're not in the filesystem unless glibc-devel is installed.
  • For Windows the compilers after 2003 all use Side-by-side Assemblies, so to avoid an installation procedure on the target machine, I had to dig them out and place them next to the compiler executables in source control.
  • Windows SDK v6.1 with compilers (and without help/samples) is huge: 427MB if I counted right.
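Here's a rough sketch of the path-fixing idea from the first bullet, assuming the checked-in binaries were built with a known placeholder prefix (the placeholder is made up, and this only works when the real path is no longer than the placeholder):

    #!/usr/bin/env python3
    """Post-checkout fix-up: rewrite the install prefix embedded in checked-in tool binaries."""
    from pathlib import Path

    OLD_PREFIX = b"/placeholder/toolchain/root"  # illustrative prefix baked in at build time
    NEW_PREFIX = str(Path.home() / "toolchain").encode()

    def patch_file(path: Path) -> None:
        data = path.read_bytes()
        if OLD_PREFIX not in data:
            return
        if len(NEW_PREFIX) > len(OLD_PREFIX):
            raise ValueError(f"new prefix is too long to patch into {path}")
        # Pad with NULs so string lengths and the file layout stay unchanged.
        replacement = NEW_PREFIX.ljust(len(OLD_PREFIX), b"\0")
        path.write_bytes(data.replace(OLD_PREFIX, replacement))

    for exe in Path("toolchain/bin").glob("*"):
        if exe.is_file():
            patch_file(exe)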

I've started to try to use Apache Ivy (similar to Maven) to help me manage the toolchain, but I have not yet seen any examples of Ivy or Maven being used to manage something that isn't a Java .jar file. I don't know if I'll be able to manage things like the C compiler.

Ideally I want either a source control checkout or an Ivy/Maven resolve to put every tool and library in the developer's file system, ready to use. But I'm starting to think that requiring the developer to install a small number of critical things, like the Windows SDK or the gcc and glibc-devel packages, isn't such a bad idea. As mentioned above, it's a question of 5 or 50 developers, and the time involved in creating such a solution.

Jared Oberhaus
By the way, I found git to be horribly slow for my toolchain management technique. git does not seem usable when handling more than a few thousand files at once. http://www.jaredoberhaus.com/tech_notes/2008/12/git-is-slow-too-many-lstat-operations.html
Jared Oberhaus
Lots of large binary files can exacerbate the problem, as it uses lots of memory trying to calculate deltas. Yes, I tried to put my photos under Git control to be able to deal with syncing them up from different machines... :)
araqnid