I have a git repository (covering more or less the whole project history) and separate sources (just a tarball with a few files) which were forked from it some time ago (somewhere in 2004 or 2005).

The sources from the tarball have undergone quite a lot of changes, some of which I'd like to incorporate. Now the question is: how do I find out what the actual branch point for the changed sources was, so that I get a minimal diff of what has happened there?

So what I basically want is to find the place in the git history where the code is most similar to the tarball of sources I have. And I don't want to do that manually.

It is also worth mentioning that the changed sources include only a subset of the files and that some files have been split into several. However, the code in there seems to have received only small modifications and several additions.

If you want to play with that yourself, the tarball with sources is here and Git is hosted at Gitorious: git://gitorious.org/gammu/mainline.git

A: 

How was the fork made? Was it a clone that someone else made and then did their own work in? If so, then this is really easy. All you need to do is create a local branch that pulls in the code from the fork. Git will see the ancestry of the forked branch pointing to one of the commits from your original repository and will "connect the dots", so to speak: it will reconnect the history from your original repository to the fork.

You should be able to do this:

git remote add thefork git://wherever.it.lives/thefork.git

git fetch thefork

git branch -f thefork-branch thefork/branchname

git checkout thefork-branch

At this point, you can run gitk and see the complete history of the forked branch and your local repository, and see if they connect or not.
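If the two histories do connect, `git merge-base` can report the branch point directly instead of reading it off gitk. A minimal sketch, assuming both refs exist locally; `branch_point` is a hypothetical helper name, not part of git:

```shell
# Hypothetical helper: print the branch point of two refs, or report that
# the histories are unrelated.
# Usage: branch_point master thefork-branch
branch_point() {
    # git merge-base exits non-zero when the refs share no common ancestor.
    if bp_base=$(git merge-base "$1" "$2"); then
        # Show hash, commit date and subject of the common ancestor.
        git log -1 --format='%H %ad %s' --date=short "$bp_base"
    else
        echo "no common ancestor between $1 and $2" >&2
        return 1
    fi
}
```

This only helps when the fork really is a git clone; for a plain tarball there is no ancestry to follow.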

Derick Bailey
Ah, I did not make it clear that the forked sources are just a tarball, not an actual git repo. I will update the question to make that clear.
Michal Čihař
ouch! yeah... that's new to me... not sure i know how to handle that situation.
Derick Bailey
+1  A: 

Not a great solution, but to get a guess of which revisions it might be: assume that some of the files in the tarball have not been changed since they were branched. Run git hash-object against each file in the tarball, then search for those blobs in the repository using git show. Then try to find the commits under which these files were included, possibly using git whatchanged. The answer to your question might then be the commit with the most files in common, but it'll still be a bit hit and miss.
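A rough sketch of the first two steps, assuming the tarball is unpacked into a directory and the function is run from inside the repository. `find_known_blobs` is a hypothetical helper name, and it checks for the blob with `git cat-file -e` rather than `git show`:

```shell
# Hypothetical helper: list tarball files whose exact content already exists
# as a blob in the repository.
# Usage: find_known_blobs /path/to/unpacked/tarball
find_known_blobs() {
    tarball_dir=$1
    find "$tarball_dir" -type f | while IFS= read -r file; do
        # Compute the blob ID the file would have, without storing it.
        blob=$(git hash-object "$file")
        # cat-file -e exits 0 only if the object is present in the repository.
        if git cat-file -e "$blob" 2>/dev/null; then
            printf '%s %s\n' "$blob" "$file"
        fi
    done
}
```

For each blob ID it prints, something like `git log --all --oneline --find-object=<blob>` should list the commits that added or removed that exact file version (this option needs Git 2.16 or later, as far as I know).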

Douglas
This is a great idea, actually - I wrote my answer assuming all of the files would have small diffs, and so you wouldn't be able to find the exact version in the repo.
Jefromi
Great idea, unfortunately there is no file without changes.
Michal Čihař
@Michal Čihař: Then move on to my answer, which provides some basic ways to try and find a minimal diff version!
Jefromi
+2  A: 

In the general case, you'd actually have to examine every single commit, because you have no way of knowing if you might have a huge diff in one, small diff the next, then another huge diff, then a medium diff...

Your best bet is probably going to be to limit yourself to specific files. If you consider just a single file, it should not take long to iterate through all the versions of that file (use git rev-list HEAD -- <path> to get a list, so you don't have to test every commit). For each commit which modified the file, you can check the size of the diff, and fairly quickly find a minimum. Do this for a handful of files, and hopefully they'll agree!

The best way to set yourself up for the diffing is to make a temporary commit by simply copying in your tarball, so you can have a branch called tarball to compare against. That way, you could do this:

git rev-list HEAD -- path/to/file | while read hash; do printf '%s ' "$hash"; git diff --numstat tarball "$hash" -- path/to/file; done

to get a nice list of all the commits with their diff sizes (the first three columns will be SHA1, number of lines added, and number of lines removed). Then you could just pipe it on into awk '{print $1,$2+$3}' | sort -n -k 2, and you'd have a sorted list of commits and their diff sizes!
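The same idea extends to a handful of files at once by summing the numstat columns per commit. A minimal sketch, assuming the tarball has been committed to a branch as described above; `rank_commits` is a hypothetical helper name:

```shell
# Hypothetical helper: rank commits by the total number of changed lines
# against a reference branch, restricted to the given paths.
# Usage: rank_commits tarball path/one.c path/two.h
rank_commits() {
    ref=$1
    shift
    # Only commits that touched at least one of the paths are tested.
    git rev-list HEAD -- "$@" | while read -r hash; do
        # numstat prints "added<TAB>removed<TAB>path" per file; sum both
        # columns. Binary files show "-" and count as 0 here.
        total=$(git diff --numstat "$ref" "$hash" -- "$@" |
                awk '{ s += $1 + $2 } END { print s + 0 }')
        printf '%s %s\n' "$total" "$hash"
    done | sort -n
}
```

The first line of the output is the commit with the fewest changed lines for those paths. It is still only a heuristic: counting lines treats a reindented block the same as a rewritten one.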

If you can't limit yourself to a small handful of files to test, I might be tempted to hand-implement something similar to git-bisect - just try to narrow your way down to a small diff, making the assumption that in all likelihood, commits near to your best case will also have smaller diffs, and commits far from it will have larger diffs. (Somewhere between Newton's method and a full on binary/grid search, probably?)

Jefromi
I think that a good place to start in limiting the file set that you are looking at is probably files which are common to both but have either not changed in a long time or have changed rarely in either tree (or, better yet, in both). Header files are likely to be good candidates as well, as long as they do not contain too much crazy preprocessor conditional stuff. It's much easier to quantify changes in a diff of a long list of `#define`s than of actual code.
nategoose
This seems to be the best approach. I only changed it to use not a single file but the complete list of files I have in the changed tree, and limited the list of revisions to an interval I guessed from some parts of the code. Thanks.
Michal Čihař
A: 

Import the files in the tarball into a git revision, on a separate branch or a completely new one: the position in the revision graph isn't important, we just want it available as a tree.

Now for each revision in master, just diff against that tree/revision ('imported') and just output how big the diff is. Something like:

git rev-list master | while read rev; do patchsize=$(git diff $rev imported | wc -c); echo $rev $patchsize; done

So the revision with the smallest patch size will be the "closest", by a very rough rule of thumb. (An identical revision will produce a patch size of 0, and anything else will certainly be non-zero, and the more that's changed, the bigger).
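When the tarball holds only a subset of the project, a whole-tree diff also counts every file that is missing from the tarball. One possible variant restricts the diff to the paths that exist in the imported tree. This is a sketch, with `rank_by_patch_size` as a hypothetical helper name, and it assumes an `xargs` that supports `-0`:

```shell
# Hypothetical helper: patch size between each revision and the imported
# tree, counting only paths that exist in the imported tree.
# Usage: rank_by_patch_size master imported
rank_by_patch_size() {
    branch=$1
    imported=$2
    git rev-list "$branch" | while read -r rev; do
        # List the imported paths NUL-separated and pass them as pathspecs,
        # so files that exist only in $rev do not inflate the diff.
        size=$(git ls-tree -r --name-only -z "$imported" |
               xargs -0 git diff "$rev" "$imported" -- |
               wc -c | tr -d ' ')
        printf '%s %s\n' "$size" "$rev"
    done | sort -n
}
```

The output is sorted by patch size, smallest first. Byte count remains a very rough measure, so it is worth checking the top few candidates by eye.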

araqnid
Unfortunately, diffing the whole tree always leads to the oldest revision, because it has the fewest extra files.
Michal Čihař
A: 

Based on what araqnid said, I came up with 9c6c864426bf88429e77c7e22b5aa78e9295b97a (I only asked for revisions between 0.61.0 and HEAD). This is probably not the best match; you might do better with something like

git rev-list --no-merges --all | while read rev; do patchsize=$(git diff $rev | wc -c); echo $patchsize $rev; done | sort -n | less

assuming you've imported the tarball into git and have that revision checked out. I did this by untarring and then running:

git init
git add .
git commit -m "import tarball"
git remote add origin git://gitorious.org/gammu/mainline.git
git fetch origin

After you do that and then run the above, it should output the sizes of all the diffs in ascending order of patch size (the first one will be 0, since it'll find the current HEAD). It'll take a long time, but it should find the smallest diff.

Spudd86