+2  Q: 

Redaction in git

I started working on a little Python script for FTP recently. To start off with, I had server, login and password details for an FTP site hardwired in the script, but this didn't matter because I was only working on it locally.

I then had the genius idea of putting the project on github. I realised my mistake soon after, and replaced the hardwired details with a solution involving .netrc. I've now removed the project from github, as anyone could look at the history, and see the login details in plain text.

The question is, is there any way to go through the git history and remove user name and password throughout, but otherwise leave the history intact? Or do I need to start a new repo with no history?

A: 

First, if the credentials are for your local development machine only, why bother?

Second: the whole point of an SCM is to keep a complete history, so short of starting a fresh project/repo, it can't be done.

DaDaDom
The credentials are for a remote FTP.
Skilldrick
Of course it can be done. It's just some bits on a disk somewhere.
Thomas
+6  A: 

http://help.github.com/removing-sensitive-data/ should help

MBO
This is not particularly helpful, as it mentions `--index-filter`, which will only work for deleting the entire file, not modifying the file to modify a single line.
Brian Campbell
@brian You're right, my bad. I didn't check that carefully.
MBO
Actually, I should say that it is helpful, as it mentions the other information you need to know about dealing with confidential information propagated to GitHub; but it's not the whole story, as it doesn't help the OP with the exact cleanup he needs.
Brian Campbell
+3  A: 

I believe you should be able to change all of your commits using the filter-branch command. See the section in the ProGit book for details.

However, as @MBO's link notes

force-pushing does not erase commits on the remote repo, it simply introduces new ones and moves the branch pointer to point to them

So you'll need to remove the repository completely from GitHub to remove those commits (i.e. even if they're not in your commit history, they're still floating around in the repository)

bdukes
+6  A: 

Maybe just easier to change your password on the FTP site? Unless you're embarrassed by the code...

Paddy
+1, Even if he successfully removes the file from the git history the password and associated account must be treated as compromised.
Iceman
+9  A: 

First of all, you should change the password on the FTP site. The password has already been made public; you can't guarantee that no one has cloned the repo, that it isn't sitting in plain text in a backup somewhere, or something of the sort. If the password is at all valuable, I would consider it compromised by now.

Now, for your question about how to edit history. The git filter-branch command is intended for this purpose; it will walk through each commit in your repository's history, apply a command to modify it, and then create a new commit.

In particular, you want git filter-branch --tree-filter. This allows you to edit the contents of the tree (the actual files and directories) for each commit. It runs a command in a directory containing the entire tree; your command may edit files, add new files, delete files, move them, and so on. Git will then create a new commit object with all of the same metadata (commit message, date, and so on) as the previous one, but with the tree as modified by your command, treating new files as adds, missing files as deletes, etc. (so your command does not need to do git add or git rm; it just needs to modify the tree).

For your purposes, something like the following should work, with the appropriate regular expression and file name depending on your exact situation:

git filter-branch --tree-filter "sed -i -e 's/SekrtPassWrd/REDACTED/' myscript.py" -- --all

Remember to do this to a copy of your repository, so if something goes wrong, you will still have the original and can start over again. filter-branch will also save references to your original branches, as refs/original/refs/heads/master and so on, so you should be able to recover even if you forget to do this. When making a global modification to my source code history, I like to make sure I have multiple fallbacks in case something goes wrong.
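One consequence of those saved references is that the old commits, password included, remain in the rewritten repository until refs/original is removed and the unreachable objects are pruned. Here is a minimal sketch of both fallbacks, assuming a POSIX shell; the function names restore_current_branch and discard_originals are hypothetical helpers, not Git commands:

```shell
# Hypothetical helper: put the checked-out branch back to its pre-rewrite tip,
# using the backup ref that filter-branch left under refs/original/.
restore_current_branch() {
    branch=$(git symbolic-ref --short HEAD) || return 1
    git reset --hard "refs/original/refs/heads/$branch"
}

# Hypothetical helper: once the rewritten history looks right, drop the backup
# refs and prune the old objects so the secret no longer sits in this clone.
discard_originals() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
    git reflog expire --expire=now --all
    git gc --quiet --prune=now
}
```

Run restore_current_branch if the rewrite went wrong, or discard_originals once you are satisfied. After the latter there is no way back from within this repository, which is another reason to keep a separate copy.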

To explain how this works in more detail:

sed -i -e 's/SekrtPassWrd/REDACTED/' myscript.py

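This will replace SekrtPassWrd in your myscript.py file with REDACTED; the -i option to sed tells it to edit the file in place, with no backup file (as that backup would be picked up by Git as a new file). Note that this is GNU sed syntax; BSD sed (as shipped with macOS) expects a backup suffix after -i, so there you would write -i '' instead. Also, without a g flag on the substitution, only the first match on each line is replaced.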

If you need to do something more complicated than a single substitution, you can write a script and invoke that as your command; just be sure to call it with an absolute pathname, as git filter-branch calls your command from within a temporary directory.
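As a sketch of what such a script might look like, the file below defines one shell function that applies several substitutions and quietly skips commits where the file does not exist yet. The script name redact.sh, the function name redact_file, and the myftpuser login are placeholders I made up; substitute your own.

```shell
#!/bin/sh
# redact.sh (hypothetical name): helper for git filter-branch --tree-filter.

# redact_file FILE: replace the secrets in FILE, or do nothing if FILE is absent
# from the commit being rewritten, so the filter never fails on early commits.
redact_file() {
    file=$1
    [ -f "$file" ] || return 0
    # Temporary file lives outside the work tree, so Git never sees it.
    tmp=$(mktemp) || return 1
    # Placeholder patterns: the example password and a made-up login name.
    # Writing back with cat keeps the original file's mode bits intact.
    sed -e 's/SekrtPassWrd/REDACTED/g' \
        -e 's/myftpuser/REDACTED_USER/g' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Because the filter string is evaluated by the shell, it could then be invoked along the lines of: git filter-branch --tree-filter '. /absolute/path/to/redact.sh && redact_file myscript.py' -- --all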

git filter-branch --tree-filter <command> -- --all

This tells git to run a tree filter, as described above, over every branch in your repository. The -- --all part tells Git to apply this to all branches; without it, it would only edit the history of the current branch, leaving all of the other branches unchanged (which probably isn't what you want).
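After the rewrite it is worth checking that the string really is gone from every branch. Here is a rough sketch, assuming a POSIX shell: the hypothetical function secret_in_history prints the ID of each commit reachable from a branch or tag whose tree still contains the given string. It deliberately ignores the backup references that filter-branch keeps, and it will be slow on large histories.

```shell
# Hypothetical helper: list commits (reachable from branches or tags) whose
# tree still contains the fixed string given as the first argument.
secret_in_history() {
    git rev-list --branches --tags |
    while read -r rev; do
        # -F treats the argument as a literal string, -q suppresses output.
        if git grep -q -F -e "$1" "$rev"; then
            echo "$rev"
        fi
    done
}
```

Running secret_in_history SekrtPassWrd should print nothing once the rewrite has succeeded; any commit ID it does print still carries the password.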

See the documentation on GitHub on Removing Sensitive Data (as originally pointed out by MBO) for some more information about dealing with the copies of the information that have been pushed to GitHub. Note that they reiterate my advice to change your password, and provide some tips for dealing with cached copies that GitHub may still have.

Brian Campbell
This is very helpful, thanks very much. Also the note about changing the password - I needed that kick up the arse, cheers.
Skilldrick