Incremental linearizing of git DAG

views:

714

answers:

+10 Q:

Incremental linearizing of git DAG

I'm the author of GitX. One of the features GitX has is the visualization of branches, as can be seen here.

This visualization is currently done by reading commits which are emitted from git in the correct order. For each commit the parents are known, so it's fairly easy to build up the lanes in the correct way.

I'd like to speed up this process by using my own commit pool and linearizing the commits myself. This allows me to reuse existing loaded commits and allows git to emit commits faster because it doesn't have to emit them in the correct order.

However, I'm not sure what algorithm to use to accomplish this. It is important that the building is incremental, as the loading of commits can take a long time (>5 seconds for 100,000 commits, which should all be displayed).

Gitk has gone the same way, and there's a patch here that shows how it is implemented, but my TCL skills are weak and the patch isn't very thoroughly commented and a bit hard to follow.

I'd also like this algorithm to be efficient, as it'll have to handle hundreds of thousands of commits. It also has to be displayed in a table, so it's important that access to specific rows is fast.

I'll describe the input I have so far, the output that I want and a few observations.

Input:

I have a current pool of commits in the form of a hash table that maps commit ids to commit objects. This pool does not have to be complete (have all commits necessary)
I have a separate thread loading in new commits from git, with a callback that can be called every time a new commit is loaded. There is no guaranteed order in which the commits come in, but in most of the cases the next commit is a parent of the previous commit.
A commit object has its own revision id and the revision ids of all its parents
I have a list of branch heads that should be listed. That is, there isn't a single 'top' of the DAG that should be displayed. There also does not have to be a single graph root.

Output:

I'll need to linearize these commits in topological order. That is, a commit cannot be listed after its parents have been listed.
I also need the 'branch lines' that can be seen in the screenshot above. These probably need to be precomputed as most of them depend on their children.

A few remarks:

It's necessary to relocate a list of commits. For example, we might have to commits (branch heads) that are unrelated, until a commit shows up which makes one head an ancestor of the other.
Multiple branch tips must be shown
It's important that this process is incremental, so that at least a partial view is available while the data is still loading. This means that new data has to be inserted halfway and that the branch lines have to be readjusted.

I hope I have described the problem clear enough :)

I haven't used GitX, so maybe I'm missing something, but it seems like you could walk back from child to parent(s) from the head of each current branch until you can draw a few screens of the graph.

That might not give you the optimal visual layout of branches that are rooted earlier. But it seems like responsiveness would be more important than waiting to draw a graph with the fewest crossings, since most users are likely to be interested in recent activity.

Paul 2009-03-31 18:17:18

Yes, you describe the basic process that I want to do, but don't answer my question, which is how to do it and how to do it efficiently :). Walking the DAG is just one of the things you have to do, if you want to linearize it you'll also have to do some relocation for example.

Pieter 2009-04-01 09:50:37

I think it's efficient to only process the most recent commits, maybe the last 200, to decide the order from left to right that the branches are drawn across a page which shows maybe 25 commits. That would simplify the problem to 6-10 cases. Or is there another reason to 'linearize' the whole repo?

Paul 2009-04-01 21:44:25

yes, it really should take the whole repository. I want to be able to scroll all the way down, paginating it just won't do.Also, it currently already shows all commits. I don't want to loose functionality over this.

Pieter 2009-04-02 11:03:21

Looks like a nice project; good luck with it. If I can't simplify the problem, it looks like maybe 300 or 400 lines of Objective C, which for me is beyond the scope of a StackOverflow response. Thanks for clarifying things.

Paul 2009-04-02 16:58:43

+4 A:

The standard topological sort is O(n) (OK, O(V+E)), i.e. you should be able to sort a million commits in memory in a fraction of a second. No incremental hack like those in Tcl is needed.

BTW, I use GitX (looks much better than Gitk on OS X) everyday and don't have any issue with it (maybe because I don't have those crazy merges in my repositories) :)

obecalp 2009-04-03 02:46:01

That might be possible, I'll check it out. I think it's more expensive to compute the graph lines, which I cache now. computation of those lines takes quite some time (~1second for 100k commits), so I won't be able to recalculate that everytime. I'll still need some incremental updating for that.

Pieter 2009-04-06 16:49:23

+2 A:

OK, so I'm having a similarly hard time reading the entirety of that patch, but let's see if I can piece it together from what I did figure out.

To start with, gitk simplifies things by condensing a string of commits into an arc, containing a series of commits that each only have one parent and one child. Aside from anything else, doing this should cut down pretty dramatically on the number of nodes you have to consider for your sort, which will help out any algorithm you use. As a bonus, related commits will end up grouped together.

This does introduce some complexity in terms of finding an arc when you read a new commit. There are a few situations:

The new commit has a single parent, or no parents. It extends a (possibly empty) arc. Most of the time, you'll just extend the most recent arc. There are a few interesting subcases:
- It may cause an existing arc to be split, if its parent already has a child (i.e. its parent turns out to be a branch point, which I gather you don't know ahead of time).
- It could be a "missing link" that connects two arcs together.
- You may already know that this commit has multiple children
The new commit has multiple parents (a merge commit).

You may want to include the multi-child or multi-parent commits in arcs, or it may make more sense to keep them separate. Either way, it shouldn't be too difficult to build up this set of arcs incrementally.

Once you have these arcs, you're still left with trying to linearize them. In your case, the first algorithm described on the aforementioned Wikipedia page sounds useful, as you have a known set of branch points to use as your initial set S.

Other notes:

Relocating commits should be manageable. First of all, you only have to care when you connect two arcs, either through a new merge commit, a newly-discovered branch point, or combining two arcs into one. Any given arc can easily maintain its current row number range (assuming you're fine with putting an arc on sequential rows), so traversing up the tree checking that all new ancestors show up later should be pretty quick.
I don't know enough to say much about drawing the graph lines, but I imagine it won't be too different from what you do now.

Anyway, I hope that helps. It was interesting to think about, at least.

C Pirate 2009-04-04 22:09:11

Do you really need to display 100k commits at once? What kind of user can soak up that kind of info?

Have you thought about paging? I.e just compute for ~100 commits or something. If a branch-line goes way back (off-page), you could use something like Github's back-pointing arrow to show that.

Marcus Lindblom 2009-04-09 11:36:23

Yes, the current system works very nicely and allows you to quickly scroll to a specific date, for instance. Using paging is just tedious and annoying. Also, I'm not going back to a less capable system (paging). If I can't find a nicer solution that the current, I'll just stick to that.

Pieter 2009-04-09 16:40:06

ansaurus

tags:

views:

answers:

Incremental linearizing of git DAG

related questions