This is sort of a follow-up to this question.

If multiple blobs have the same contents, they are stored only once in the git repository because their SHA-1s are identical. How would one go about finding all duplicate files for a given tree?

Would you have to walk the tree and look for duplicate hashes, or does git provide backlinks from each blob to all files in a tree that reference it?

+1  A: 

The scripting answers from your linked question pretty much apply here too.

Try the following git command from the root of your git repository.

git ls-tree -r HEAD

This generates a recursive list of all 'blobs' in the current HEAD, including their path and their sha1 id.

Git doesn't maintain back links from a blob to the trees that reference it, so it would be a scripting task (Perl, Python?) to parse the git ls-tree -r output and produce a summary report of all sha1s that appear more than once in the list.
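As a rough sketch of that scripting task, here is a shell version that uses awk rather than Perl or Python. It assumes the usual git ls-tree -r line layout (mode, type and sha1 separated by spaces, then a tab, then the path). The name find_dup_blobs is made up for this example and is not a git command.

```shell
# Reads `git ls-tree -r` output on stdin and prints every blob sha1 that
# appears more than once, followed by the paths that share it.
# find_dup_blobs is a hypothetical helper name, not part of git.
find_dup_blobs() {
    awk -F '\t' '
        {
            split($1, meta, " ")          # meta[1]=mode, meta[2]=type, meta[3]=sha1
            if (meta[2] != "blob") next   # skip non-blob entries such as submodules
            sha = meta[3]
            count[sha]++
            paths[sha] = paths[sha] "    " $2 "\n"   # $2 is the path after the tab
        }
        END {
            for (sha in count)
                if (count[sha] > 1)
                    printf "%s\n%s", sha, paths[sha]
        }
    '
}

# Usage, from the root of a repository:
#   git ls-tree -r HEAD | find_dup_blobs
```

The groups come out in no particular order, because awk does not define the iteration order of its arrays.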

Charles Bailey
+3  A: 

Running this on the codebase I work on was an eye-opener, I can tell you!

#!/usr/bin/perl

# usage: git ls-tree -r HEAD | $PROGRAM_NAME

use strict;
use warnings;

my $sha1_path = {};

while (my $line = <STDIN>) {
    chomp $line;

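    # Expect "<mode> <type> <sha1> <path>"; (.+) rather than (\S+) so that
    # paths containing spaces are matched too.
    if ($line =~ m{ \A \d+ \s+ \w+ \s+ (\w+) \s+ (.+) \z }xms) {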
        my $sha1 = $1;
        my $path = $2;

        push @{$sha1_path->{$sha1}}, $path;
    }
}

foreach my $sha1 (keys %$sha1_path) {
    if (scalar @{$sha1_path->{$sha1}} > 1) {
        foreach my $path (@{$sha1_path->{$sha1}}) {
            print "$sha1  $path\n";
        }

        print '-' x 40, "\n";
    }
}
lmop
You're right...The results are very interesting!
Readonly
A small correction to support spaces in paths: the end of the regex should be "\s+ (.+) \z" rather than "\s+ (\S+) \z".
Mathieu Longtin
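The reason spaces matter is that git ls-tree separates the sha1 from the path with a tab, and the path itself may contain spaces. A minimal shell sketch of splitting a line at that tab follows; print_sha_and_path is a hypothetical name used only for illustration. Paths with unusual characters are quoted by git unless -z is used, and this sketch does not try to unquote them.

```shell
# Splits each line of `git ls-tree -r` output at the first tab and prints
# "<sha1>  <path>". print_sha_and_path is a hypothetical helper name.
print_sha_and_path() {
    tab=$(printf '\t')
    while IFS= read -r line; do
        meta=${line%%"$tab"*}   # "<mode> <type> <sha1>", everything before the tab
        path=${line#*"$tab"}    # everything after the first tab, spaces included
        sha=${meta##* }         # last space-separated field of the metadata
        printf '%s  %s\n' "$sha" "$path"
    done
}

# Usage, from the root of a repository:
#   git ls-tree -r HEAD | print_sha_and_path
```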
+1  A: 

I just made a one-liner that highlights the duplicates in the git ls-tree output.
It might be useful:

git ls-tree -r HEAD |
    sort -t ' ' -k 3 |
    perl -ne '$1 && / $1\t/ && print "\e[0;31m" ; / ([0-9a-f]{40})\t/; print "$_\e[0m"'
Romuald Brunet