I have a git repository I store random things in. Mostly random scripts, text files, websites I've designed and so on.

There are some large binary files (generally 1-5MB) that I have deleted over time, but they are still sitting in the revision history, inflating the size of the repository, and I don't need them there.

Basically I want to be able to do..

me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old

..then be able to go through each result, checking if it's no longer required and, if so, removing it (probably using filter-branch, roughly as sketched below)
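
For reference, the removal step I have in mind for each unwanted file is the usual index-filter recipe (an untested sketch, using the example path above):

$ git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch example/blah.psd' \
    --prune-empty -- --all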

+12  A: 

This is an adaptation of the git-find-blob script I posted previously:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

# Translate the size argument into a byte cutoff: b = bytes, k = KiB, m = MiB.
my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp;

# Recursively collect [ size, path components... ] for every blob in $tree
# whose size is at or above the cutoff.
sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

# Many commits share identical (sub)trees; memoizing avoids re-walking them.
memoize 'walk_tree';

# Walk the history, emitting each commit's root tree, abbreviated hash and relative age.
open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
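
Usage is git-large-blob <size[b|k|m]> [git-log arguments]. A quick sketch, assuming you save it as git-large-blob somewhere on your PATH and make it executable:

$ git-large-blob 500k            # blobs of 500 KiB or more, whole history
$ git-large-blob 2m master       # blobs of 2 MiB or more, master only

Each qualifying commit is printed with its relative age, followed by the size and path of every large blob in its tree that hasn't already been reported.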
Aristotle Pagaltzis
I'm having difficulties understanding this code. Any examples of how to use your nice command?
neoneye
aha. no arguments. it just took some time for it to output anything to the screen. git-large-blob 500k
neoneye
love it! thnx a ton!
alex
+4  A: 

Aristotle's script will show you what you want. You also need to know that deleted files will still take up space in the repo.

By default, Git keeps changes around for 30 days before they can be garbage-collected. If you want to remove them now:

$ git reflog expire --expire=1.minute refs/heads/master
    # all deletions up to 1 minute ago are now available to be garbage-collected
$ git fsck --unreachable
    # lists all the blobs (file contents) that will be garbage-collected
$ git prune
$ git gc
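
If you would rather expire everything across all refs in one go (more aggressive, so use with care), I believe this also works:

$ git reflog expire --expire=now --all
$ git gc --prune=now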

A side comment: While I'm a big fan of Git, Git doesn't bring any advantages to storing your collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes. And it doesn't track changes within binaries, so every new revision of a binary stores another full copy in the repo.

Of course this use of Git is a perfectly good way to get familiar with how it works.

Paul
There's no advantage to using git like this, but it handles it fine, and using a different VCS just because it handles binary files (or random bunches of files) better would be inconvenient (convenience being the only reason I keep the directory in git!)
dbr