As I understand it, when Git assigns a SHA1 hash to a file, the hash is determined solely by the file's contents.

As a result, if a file moves from one repository to another, its SHA1 remains the same because its contents have not changed.

How does Git calculate the SHA1 digest? Does it do it on the full uncompressed file contents?

I would like to emulate assigning SHA1s outside of Git.

+1  A: 

Take a look at the man page for git-hash-object. You can use it to compute the git hash of any particular file. I think that git feeds more than just the contents of the file into the hash algorithm, but I don't know for sure, and if it does feed in extra data, I don't know what it is.

Dale Hagglund
+49  A: 

This is how Git calculates the SHA1 for a file (or, in Git terms, a "blob"):

sha1("blob " + filesize + "\0" + data)

So you can easily compute it yourself without having Git installed. Here filesize is the length of the data in bytes, written as a decimal number, and "\0" is a single NUL byte, not a two-character string.

For example, the hash of an empty file:

sha1("blob 0\0") = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

$ touch empty
$ git hash-object empty
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Another example:

sha1("blob 7\0foobar\n") = "323fae03f4606ea9991df8befbb2fca795e648fa"

$ echo "foobar" > foo.txt
$ git hash-object foo.txt 
323fae03f4606ea9991df8befbb2fca795e648fa

Here is a Python implementation:

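from hashlib import sha1

def githash(data):
    # data must be bytes, e.g. open(path, "rb").read()
    s = sha1()
    # Header: object type, size in bytes as decimal, then a NUL byte
    s.update(b"blob %d\0" % len(data))
    s.update(data)
    return s.hexdigest()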
Ferdinand Beyer
Very useful. thanks
robw
This is awesome, +10 if I could man!
hasen j
A: 

This is great! I was wondering (just out of curiosity) how Git calculated its sha1s, and this page explains it perfectly! Many thanks!

pythonic
+6  A: 

You can make a bash shell function to calculate it quite easily if you don't have git installed.

git_id () { printf 'blob %s\0' "$(ls -l "$1" | awk '{print $5;}')" | cat - "$1" | sha1sum | awk '{print $1}'; }
Charles Bailey
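
For files too large to read into memory at once, the same calculation as the shell function above can be sketched in Python by taking the size from the filesystem and feeding the contents to the hash in chunks. This is a minimal sketch, not Git's own code: the function name git_blob_hash and the chunk_size parameter are made up for illustration, and it hashes the raw bytes on disk without any line-ending conversion Git may be configured to apply.

```python
import hashlib
import os

def git_blob_hash(path, chunk_size=65536):
    """Return the hex SHA1 Git would assign to the file's raw bytes as a blob."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    # Header first: "blob", the size in bytes as a decimal number, then a NUL byte.
    h.update(b"blob %d\0" % size)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

For an empty file this should return e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, matching the example earlier in the thread.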
A: 
/// Calculates the SHA1 for a given string
let calcSHA1 (text:string) =
    text 
      |> System.Text.Encoding.ASCII.GetBytes
      |> (new System.Security.Cryptography.SHA1CryptoServiceProvider()).ComputeHash
      |> Array.fold (fun acc e -> 
           let t = System.Convert.ToString(e, 16)
           if t.Length = 1 then acc + "0" + t else acc + t) 
           ""
/// Calculates the SHA1 like git
let calcGitSHA1 (text:string) =
    let s = text.Replace("\r\n","\n")
    sprintf "blob %d%c%s" (s.Length) (char 0) s
      |> calcSHA1

This is a solution in F#.

forki23
I still have problems with umlauts: calcGitSHA1("ü").ShouldBeEqualTo("0f0f3e3b1ff2bc6722afc3e3812e6b782683896f"), but my function gives 0d758c9c7bc06c1e307f05d92d896aaf0a8a6d2c. Any ideas how git hash-object handles umlauts?
forki23
It should handle the blob as a byte stream, which means ü probably has a length of 2 bytes (in UTF-8), while F#'s Length property returns 1 because it counts characters, not bytes.
knittl
But System.Text.Encoding.ASCII.GetBytes("ü") returns a byte array with 1 element.
forki23
Maybe I have to use UTF8?
forki23
Using UTF8 and 2 as the string length gives a byte array [98; 108; 111; 98; 32; 50; 0; 195; 188] and therefore a SHA1 of 99fe40df261f7d4afd1391fe2739b2c7466fe968, which is also not the Git SHA1.
forki23
System.Text.Encoding.Default.GetBytes solves the problem.
forki23
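
The comments above turn on the difference between characters and bytes: the size in the header and the hashed data must both come from the same byte sequence, so the result depends on how the text is encoded. A small Python sketch makes this visible; git_blob_hash_text is a hypothetical helper name, and the two encodings are only examples.

```python
import hashlib

def git_blob_hash_text(text, encoding):
    """Encode text first, then hash it as a Git blob using the byte length."""
    data = text.encode(encoding)
    header = b"blob %d\0" % len(data)  # len(data) counts bytes, not characters
    return hashlib.sha1(header + data).hexdigest()

# Show the byte length and resulting hash of "ü" under each encoding.
for enc in ("utf-8", "latin-1"):
    encoded = "ü".encode(enc)
    print(enc, len(encoded), git_blob_hash_text("ü", enc))
```

The two encodings yield different lengths (2 bytes and 1 byte) and therefore different hashes. Which one matches git hash-object depends on the bytes actually stored in the file.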
+2  A: 

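A little goodie in shell (this assumes $CONTENTS is plain ASCII, since ${#CONTENTS} counts characters rather than bytes in a multibyte locale):

printf 'blob %d\0%s' "${#CONTENTS}" "$CONTENTS" | sha1sum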
knittl
A: 

And in Perl (see also Git::PurePerl at http://search.cpan.org/dist/Git-PurePerl/):

use strict;
use warnings;
use Digest::SHA1;

my @input = <>;

my $content = join("", @input);

my $git_blob = 'blob' . ' ' . length($content) . "\0" . $content;

my $sha1 = Digest::SHA1->new();

$sha1->add($git_blob);

print $sha1->hexdigest();
Alec the Geek
A: 

How do we do this for an entire project? If my entire project is checked out (say, a PHP web page), is there a programmatic way to determine the commit name (the SHA1 of the commit that corresponds to this exact layout)?

leighmcc
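
One partial answer, offered as a sketch rather than a full solution: a commit's SHA1 also covers its author, committer, timestamps, message, and parent commits, so it cannot be derived from the checked-out files alone. What the files do determine is the SHA1 of the tree object the commit points to. Assuming a single flat directory with no subdirectories, and using a hypothetical helper name git_tree_hash, the tree hash could be computed like this:

```python
import hashlib

def git_tree_hash(entries):
    """Hash a flat Git tree.

    entries: iterable of (mode, name, sha1_hex) tuples, e.g.
    ("100644", "index.php", "<40 hex digits of the blob hash>").
    """
    body = b""
    # Git stores tree entries sorted by name (byte order for plain files).
    for mode, name, sha1_hex in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        # Each entry is "<mode> <name>\0" followed by the 20 raw hash bytes.
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(sha1_hex)
    header = b"tree %d\0" % len(body)
    return hashlib.sha1(header + body).hexdigest()

# An empty tree hashes to Git's well-known empty-tree ID.
print(git_tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Subdirectories would appear as entries with mode 40000 pointing at their own tree hash, and Git applies a slightly different sort rule to them, which this sketch does not handle. With Git installed, comparing the result against the output of git rev-parse HEAD^{tree} is a simple check.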