I'm new to git and I have a moderately large number of weekly tarballs from a long-running project. Each tarball has on average a few hundred files in it. I'm looking for a git strategy that will allow me to add the expanded contents of each tarball to a new git repository, starting from version 1.001 and going through version 1.650. At this stage of the project, 99.5% of tarball(n) is just a copy of tarball(n-1) - in other words, a perfect candidate for git. The desired end result is to have only the master branch remaining at the end of the process.

I think I know git well enough to do this "by hand". As I understand it, there is no possibility of a merge conflict, since there will be no opportunity to change master before the next version is added and committed. A shell script is my first guess, but I'm not sure how well bash will like it when git checkout branch_n gets processed while bash is executing in branch_n-1. For the purposes of this project the host environment is Ubuntu 10.04; available resources are 8 GB RAM, 500 GB of free disk space, and a 4-core CPU at 3 GHz.

I don't need someone else to solve the problem, but I could use a nudge in the right direction as to how a git expert would approach it. Any advice from someone who's "been there, done that" would be appreciated.

Hotei

PS: I have looked at the site's suggested "related questions" and found nothing relevant.

+1  A: 

Without having been exactly there, you should simply:

  • untar an archive anywhere you want
  • rsync it with the git working directory in order to:
    • update the files that changed
    • add the new files from that archive to the working directory
    • remove the files from the working directory that are no longer part of the current archive
  • git add -A
  • git commit -m "archive n"
  • repeat

The idea is not to check out branch_n+1, but to stay on the same branch, committing each tarball's contents one after the other within the same branch of the same git repo.
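A minimal bash sketch of that loop (the tarball naming scheme, the staging directory, and the assumption that each archive extracts directly into the current directory are mine, not part of the answer):

    #!/bin/bash
    set -e
    git init repo && cd repo
    for n in $(seq -w 1 650); do                 # versions 1.001 .. 1.650
        v="1.$n"
        staging=$(mktemp -d)
        tar -xzf "../tarballs/project-$v.tar.gz" -C "$staging"
        # rsync mirrors the extracted tree into the work tree:
        #   --delete removes files no longer present in this archive,
        #   --exclude=.git keeps the repository metadata intact
        rsync -a --delete --exclude=.git "$staging/" .
        rm -rf "$staging"
        git add -A
        git commit -m "archive $v"
    done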
Should you truly have somehow two concurrent processes, you could then:

  • git clone the first git repo
  • git checkout -b a_new_branch to make sure you isolate that parallel process in its own branch, which you will be able to push back to the first repo when done.
VonC
I like this idea but it might be easier to combine the first two answers. So far it looks like: git init; untar tarball(n) into new repo; git add -A; git commit -m "version number foo"; rm -rf * (but not .git); repeat.
Hotei
+2  A: 

What I would do in this situation, since your tarballs are in the end 'tagged versions':

  1. create empty git repository
  2. extract a tarball to that directory overwriting any files
  3. add all files: git add .
  4. git commit -a -m 'version foo'
  5. tag the current version (git tag)
  6. remove all files
  7. repeat from step 2 for each tarball

In your case it's not necessary to create branches, as all your tarballs are distinct, successive versions; each iteration overwrites the previous one.

Marcin Gil
You're missing one step - removing the previous contents before dumping in the tarball.
Jefromi
True, added it to the list. Otherwise deletions wouldn't be handled.
Marcin Gil
I called them versions but that's not exactly what they are. It's more like a 'snapshot number'. I usually do about 1 real version update every month or two.
Hotei
I gave a +1 for mentioning the git tag step. I need the ability to go back and check out an earlier commit, and I believe the tag will let me do that easily.
Hotei
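To illustrate (the tag name here is made up), a tag per import makes any snapshot easy to revisit later:

    git tag v1.042              # right after committing tarball 1.042
    git checkout v1.042         # detached HEAD, work tree at that snapshot
    git checkout master         # back to the tip when done browsing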
+2  A: 

Regarding this comment:

I'm not sure how well bash will like it when git checkout branch_n gets processed while bash is executing in branch_n-1

Are you concerned about two operations running concurrently and getting in each other's way? This shouldn't be a problem unless you intentionally run operations in parallel.

Assuming the tarballs follow a linear evolution, branching shouldn't come into this at all.

The process should be fairly straightforward:

  1. git init
  2. untar ball n
  3. git add; git commit (with appropriate flags)
  4. rm -rf * (to handle deletions in the history; you want to leave .git intact, of course)
  5. goto 2 (a scripted version of this loop is sketched below)
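A minimal bash sketch of these steps; the tarball location and naming are assumptions, and find is used instead of a bare rm -rf * so that dotfiles are removed too while .git is spared:

    #!/bin/bash
    set -e
    git init
    for tarball in ../tarballs/*.tar.gz; do    # lexical order matches version order if names are zero-padded
        # step 4, hoisted before extraction: clear the previous version's
        # files, everything except .git (a bare rm -rf * misses dotfiles)
        find . -mindepth 1 -maxdepth 1 ! -name .git -exec rm -rf {} +
        tar -xzf "$tarball"                    # step 2
        git add -A                             # step 3: stages additions and deletions
        git commit -m "import $(basename "$tarball" .tar.gz)"
    done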
Marcelo Cantos
Maybe 'concerned' isn't the right word. I'm "unsure what will happen" if git takes $PWD out from under bash. <detour>Maybe bash doesn't do it, but ages ago I vaguely remember reading that sh creates a copy of the commands being executed somewhere as a text file and then modifies that text file as the script gets executed. If that doesn't happen in $PWD then it's not a problem anyway.</detour>
Hotei
Back to your other question - no, I didn't intend to run this in parallel, as I want the result to be a chronologically ordered line of commits which I can then browse with something like gitg.
Hotei
I like your suggestion and incorporated with the one above in a comment.
Hotei
Tested the solution on the first 3 tarballs and it seems to be doing what I want. Scripting it should be relatively simple. Thanks to all who assisted.
Hotei
+4  A: 

Take a look at $GIT_SRC_DIR/contrib/fast-import/import-tars.perl

Stefan Näwe
http://git.kernel.org/?p=git/git.git;a=blob;hb=HEAD;f=contrib/fast-import/import-tars.perl
Jakub Narębski
Stefan, good suggestion!! Perl isn't my favorite scripting language, but it does validate the general approach and points out some potential problems if I have symbolic links inside the tarball. I need to apply the updates in a specific order to get the desired result, but this will be a good baseline. import-tars also seems to rely on fast-import, which I'm not familiar with - but that's another story. Thanks
Hotei
@Hotei: import-tars.perl is an *example* script; it serves (among other things) to illustrate how to use the fast-import interface. You can write your own script in your favorite scripting language (there is an example import-zips.py in contrib/fast-import, in Python).
Jakub Narębski
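For reference, the invocation documented in import-tars.perl's own header comments is essentially the following; the import-tars branch name comes from that header, so verify it against your copy of the script:

    git init imported && cd imported
    # the glob expands in lexical order, which matches chronological
    # order when the version numbers are zero-padded
    perl /path/to/git/contrib/fast-import/import-tars.perl ../tarballs/*.tar.gz
    git whatchanged import-tars          # the branch the script commits to
    git branch -m import-tars master     # optional: end up with only master, per the question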