My organisation is preparing to release an open-source version of our software using github, but I'm not sure of the best way to approach this:

We have two branches master and release, master contains some proprietary components that we have decided not to release, and release contains the cleaned-up version that we want to distribute. The problem is, if we just push the release branch to github, the proprietary components can be retrieved by looking through the revision history.

I was considering creating a separate repository, copying the HEAD of release into it, running git init, and pushing that repository to github. However, we want to retain the ability to cherry-pick certain patches from master into release in the future, and push those changes up to github.

Is there a way to do this without maintaining two separate repositories?

Thanks!

Update:

To be a little more specific, this is sort-of what our commit history looks like at the moment:

--- o - o - o - o - f - o - o - f - master
             \
              c - c - c - c - c - c - c - REL - f - f

Where 'o' are commits in the master, proprietary branch, 'c' are commits that remove things that should not be published (often not removing entire files, but reworking existing ones not to rely on proprietary components), and 'f' are fixes in master that apply to release as well, and so have been cherry-picked. REL is a tagged version of the code we deem safe to publish, with no history whatsoever (even previous versions of the release branch, since not all the proprietary material had been removed before the REL tag).

+2  A: 

The SHA of a commit is based on the commit object, which includes the parent SHA, the commit message and the SHA of the tree of files. The tree contains the SHA of every blob in the tree. Thus any given commit depends on everything in that revision and every parent revision back to an empty repository. If you have a commit derived from a version (no matter how indirectly) that includes files you don't want to release, then you don't want to release that branch.
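To make that dependency concrete, here is a minimal sketch on a throwaway repository. The file names and the commit identity are made up for illustration. It shows that a commit records its parent's SHA, and that a file deleted in the latest commit can still be read back through that parent.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity
git config user.name "Demo"

echo "secret" > private.txt
git add private.txt
git commit -q -m "add private file"
first=$(git rev-parse HEAD)

git rm -q private.txt
echo "hello" > public.txt
git add public.txt
git commit -q -m "remove private file"

# The second commit object names its parent, so it cannot be shared without it.
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
echo "parent of HEAD: $parent"

# The private file is gone from HEAD but still readable through the parent.
leaked=$(git show "$first:private.txt")
echo "recovered from history: $leaked"
```

Anyone who receives the second commit also receives the first, which is why pushing the cleaned-up branch as it stands would expose the removed content.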

The very first example of git filter-branch talks about removing a confidential file from a repository. It does this by creating an alternate history (rewriting all of the trees and commits). You can see why this must be true if you understand the first part of my answer.

You should be able to run the filter-branch commands to create a new commit from your "clean" commit. The history will be somewhat odd (older versions may not build because they are now incomplete or otherwise broken). This won't destroy any of your existing branches or blobs in your repository. It will create all new (parallel) ones which share the file blobs but not the trees or commits. You should be able to safely push that branch without exposing any of the objects that it does not refer to (when you push a branch, only the SHA named by that branch and its dependencies are pushed). However, this would be somewhat risky because one git merge into the "clean" branch and you could end up dragging in "private" branches and objects. You may want to use a hook (commit or push trigger) to double check that private files are not escaping.
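The double check mentioned above could look roughly like the following. This is only a sketch: check_clean is a hypothetical helper, the branch and file names are invented, and it only catches leaks at the granularity of whole paths (it will not notice private content inside an otherwise public file). It could be called from a pre-push hook, but the hook wiring is left out here.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/public      # name the first branch explicitly
git config user.email "demo@example.com"     # hypothetical identity
git config user.name "Demo"

echo "hello" > app.txt
git add app.txt
git commit -q -m "public work"

git checkout -q -b private
echo "secret" > secret.txt
git add secret.txt
git commit -q -m "private work"

# Hypothetical guard: fail if any commit reachable from the branch touches
# one of the given private paths.
check_clean() {
    branch=$1
    shift
    hits=$(git rev-list "$branch" -- "$@")
    if [ -n "$hits" ]; then
        echo "refusing: $branch has history touching: $*" >&2
        return 1
    fi
}

if check_clean public secret.txt; then public_ok=yes; else public_ok=no; fi
if check_clean private secret.txt 2>/dev/null; then private_ok=yes; else private_ok=no; fi
echo "public clean: $public_ok, private clean: $private_ok"
```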

Ben Jackson
Thanks, can you elaborate on 'run the filter-branch commands to create a new commit from your "clean" commit'? I'm not sure what flags I would use for filter-branch to get the right version to come out. As per my comment on Jefromi's answer, it is not a simple case of some files being private and some being public, but rather the 'release' branch has introduced commits that remove private content from files in both branches.
David Claridge
+4  A: 

Ben Jackson's answer already covers the general idea, but I'd like to add a few notes (more than a comment's worth) about the ultimate goal here.

You can quite easily have two branches, one with an entirely clean (no private files) history, and one complete (with the private files), and share content appropriately. The key is to be careful about how you merge. An oversimplified history might look something like this:

o - o - o - o - o - o - o (public)
 \       \           \   \
  x ----- x ----x---- x - x (private)

The o commits are the "clean" ones, and the x are the ones containing some private information. As long as you merge from public to private, they can both have all the desired shared content, without ever leaking anything. As Ben said, you do need to be careful about this - you can't ever merge the other way. Still, it's quite possible to avoid - and you don't have to limit yourself to cherry-picking. You can use your normal desired merge workflow.

In reality, your workflow could end up a little more complex, of course. You could develop a topic (feature/bugfix) on its own branch, then merge it into both the public and the private versions. You could even cherry-pick now and then. Really, anything goes, with the key exception of merging private into public.
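A toy run of that workflow might look like this. Branch names follow the diagram above; the file names and the topic branch are invented for the example. The one assumption worth stressing is that the topic branch starts from public, since a topic started from private would carry private commits along with it.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/public      # name the first branch explicitly
git config user.email "demo@example.com"     # hypothetical identity
git config user.name "Demo"

echo "v1" > app.txt
git add app.txt
git commit -q -m "public: initial work"

git checkout -q -b private
echo "key" > secret.txt
git add secret.txt
git commit -q -m "private: add secret"

# Start the topic from public so it has no private ancestry.
git checkout -q -b topic public
echo "bugfix" > fix.txt
git add fix.txt
git commit -q -m "topic: bugfix"

git checkout -q public
git merge -q --no-edit topic      # fast-forward
git checkout -q private
git merge -q --no-edit topic      # real merge; secret.txt is untouched

public_files=$(git ls-tree -r --name-only public | tr '\n' ' ')
private_files=$(git ls-tree -r --name-only private | tr '\n' ' ')
echo "public:  $public_files"
echo "private: $private_files"
```

Both branches end up with the fix, and nothing reachable from public ever touched secret.txt.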

filter-branch

So, your problem right now is simply getting your repository into this state. Unfortunately, this can be pretty tricky. Assuming that some commits exist which touch both private and public files, I believe that the simplest method is to use filter-branch to create the public (clean) version:

git branch public master   # create the public branch from current master
git filter-branch --tree-filter ... -- public    # filter it (remove private files with a tree filter)

then create a temporary private-only branch, containing only the private content:

git branch private-temp master
git filter-branch -f --tree-filter ... -- private-temp    # remove public files; -f overwrites the refs/original backup left by the first run

And finally, create the private branch. If you're okay with only having one complete version, you can simply merge once:

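git checkout -b private private-temp   # create the private branch and switch to it
git merge public                       # on Git 2.9+, add --allow-unrelated-histories (the branches share no root)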

That'll get you a history with only one merge:

o - o - o - o - o - o - o - o - o - o (public)
                                     \
  x -- x -- x -- x -- x -- x -- x --- x (private)

Note: there are two separate root commits here. That's a little weird; if you want to avoid it, you can use git rebase --root --onto <SHA1> to transplant the entire private-temp branch onto some ancestor of the public branch.
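Here is the single-merge variant end to end on a tiny repository, as a sketch only. The two files and the rm -f tree filters are stand-ins for whatever separates public from private content in a real project, and the example assumes Git 2.9 or later for --allow-unrelated-histories. The FILTER_BRANCH_SQUELCH_WARNING variable just silences the notice that newer Git versions print before running filter-branch.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "demo@example.com"     # hypothetical identity
git config user.name "Demo"

# Two commits that each touch a public file and a private file.
echo "v1" > app.txt
echo "key1" > secret.txt
git add app.txt secret.txt
git commit -q -m "first"
echo "v2" > app.txt
echo "key2" > secret.txt
git commit -q -am "second"

# Public branch: same history with the private file filtered out.
git branch public master
git filter-branch --tree-filter 'rm -f secret.txt' -- public >/dev/null

# Private-only branch: same history with the public file filtered out.
# -f is needed because the first run left a backup under refs/original.
git branch private-temp master
git filter-branch -f --tree-filter 'rm -f app.txt' -- private-temp >/dev/null

# Complete branch: private-only history plus one merge of public.
git checkout -q -b private private-temp
git merge -q --no-edit --allow-unrelated-histories public

leaks=$(git rev-list public -- secret.txt)
echo "commits on public touching secret.txt: ${leaks:-none}"
```

The original master is left untouched, so nothing is lost if the filters turn out to be wrong.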

If you'd like to have some intermediate complete versions, you can do the exact same thing, just stopping here and there to merge and rebase:

git checkout -b private <private-SHA1>  # use the SHA1 of the first ancestor of private-temp
                                        # you want to merge something from public into
git merge <public-SHA1>           # merge a corresponding commit of the public branch
git rebase private private-temp   # rebase private-temp to include the merge
git checkout private
git merge <private-SHA1>          # use the next SHA1 on private-temp you want to merge into
                                  # this is a fast-forward merge
git merge <public-SHA1>           # merge something from public
git rebase private private-temp   # and so on and so on...

This will get you a history something like this:

o - o - o - o - o - o - o - o - o - o (public)
      \              \               \
  x -- x -- x -- x -- x -- x -- x --- x (private)

Again, if you want them to have a common ancestor, you can do an initial git rebase --root --onto ... to get started.

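Note: if you have merges in your history already, you'll want to use the -p (--preserve-merges) option on any rebases to preserve the merges. Git 2.18 and later offer -r (--rebase-merges) for the same purpose, and recent releases have dropped -p.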

fake it

Edit: If reworking the history really turns out to be intractable, you can always totally fudge it: squash the entire history down to one commit, on top of the same root commit you already have. Something like this:

git checkout public
git reset --soft <root SHA1>
git commit

So you'll end up with this:

o - A' (public)
 \
  o - x - o - x - X - A (public@{1}, the previous position of public)
               \
                x - x (private)

where A and A' contain exactly the same content, and X is the commit in which you removed all private content from the public branch.

At this point, you can do a single merge of public into private, and from then on, follow the workflow that I described at the top of the answer:

git checkout private
git merge -s ours public

The -s ours tells git to use the "ours" merge strategy. This means it keeps all content exactly as it is in the private branch, and simply records a merge commit showing that you merged the public branch into it. This prevents git from ever applying those "remove private" changes from commit X to the private branch.
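As a sketch of the whole "fake it" sequence on a throwaway repository (file names and commit messages are invented, and the history is much shorter than in the diagram): squash public onto the root, record the one-off -s ours merge, then check that a later public fix flows into private without disturbing the private file.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/public
git config user.email "demo@example.com"     # hypothetical identity
git config user.name "Demo"

echo "v1" > app.txt
git add app.txt
git commit -q -m "root (clean)"
root=$(git rev-parse HEAD)

echo "v2" > app.txt
echo "key" > secret.txt
git add app.txt secret.txt
git commit -q -m "x: work that includes private content"
git branch private                           # private keeps the secret

git rm -q secret.txt
git commit -q -m "X: remove private content"
old_tip=$(git rev-parse HEAD)                # commit A

# Squash everything after the root into one commit (A').
git reset -q --soft "$root"
git commit -q -m "Initial public release"
new_tip=$(git rev-parse HEAD)

# Record the merge without applying the removal from X to private.
git checkout -q private
git merge -q --no-edit -s ours public

# From here on, ordinary merges from public to private work.
git checkout -q public
echo "bugfix" > fix.txt
git add fix.txt
git commit -q -m "public fix"
git checkout -q private
git merge -q --no-edit public

echo "private files: $(git ls-tree -r --name-only private | tr '\n' ' ')"
```

The old commits stay in the local repository (for example via the reflog), but pushing public only sends what is reachable from it.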

If the root commit has private information in it, then you'll probably want to create a new root commit, instead of committing once on top of the current one.
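One possible way to get such a new root commit is git checkout --orphan, which starts a branch with no parents while keeping the current index and working tree. This is an assumption about how you might do it, not something the steps above prescribe, and the branch name public-clean is made up.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "demo@example.com"     # hypothetical identity
git config user.name "Demo"

# The root commit itself contains private content.
echo "v1" > app.txt
echo "key" > secret.txt
git add app.txt secret.txt
git commit -q -m "root with private content"

git rm -q secret.txt
git commit -q -m "remove private content"

# Start a parentless branch from the cleaned-up tree and commit it once.
git checkout -q --orphan public-clean
git commit -q -m "Initial public release"

count=$(git rev-list --count public-clean)
echo "commits on public-clean: $count"
```

Because public-clean then shares no ancestor with private, the first merge of it into private would need --allow-unrelated-histories on Git 2.9 or later.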

Jefromi
Thanks for all the detail. The thing that will be really tricky in my case, though, is creating the filter or branch containing only the private content. The granularity of entire files isn't enough: some files contained private content in 'master', and have been reworked to not rely on anything private in 'release'.
David Claridge
@David: Ouch. You are indeed going to have some trouble. You could potentially use an interactive rebase to retroactively apply those removals/separations of private content? There's not an easy answer, I don't think.
Jefromi
@David: I've added another option, which will work in all cases, but result in a public repository with no history at all, unfortunately.
Jefromi
Thanks! your 'fake it' strategy is just what I was looking for. It doesn't bother me that old history won't appear in the public repository.
David Claridge
The workflow for pulling in fixes is the other way around, though, isn't it? 'git checkout public; git merge -s ours private' - right?
David Claridge
@David: You never want to merge anything from private into public. Just think about that qualitatively! Clearly a bad thing. You want to merge from public into private, or from topic into both public and private.
Jefromi
Oh yes, I see what you mean. Most of our fixes will probably be taking place in master (private), which is why I was originally going to cherry-pick fixes that are safe to include in public, but having a topic branch is probably a better way to go about it. Thanks.
David Claridge