I am still learning Perl. Can anyone suggest Perl code to compare the files in a .tar.gz archive against a directory path?

Let's say I have a tar.gz backup of the following directory path, which I took a few days ago.

a/file1
a/file2
a/file3
a/b/file4
a/b/file5
a/c/file5
a/b/d/file and so on..

Now I want to compare files and directories under this path with the tar.gz backup file.

Please suggest Perl code to do that.

+1  A: 

Perl is kind of overkill for this, really. A shell script would do fine. The steps you need to take:

  • Extract the tar to a temporary folder somewhere.
  • diff -ur the two folders (-r makes diff recurse into subdirectories) and redirect the output somewhere (or pipe it to less, as appropriate)
  • Clean up the temporary folder.

And you're done. Shouldn't be more than 5-6 lines. Something quick and untested:

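#!/bin/sh
# Create a private scratch directory instead of relying on $TEMP being set.
tmp=$(mktemp -d) || exit 1
# -C tells tar where to extract; without it the path is taken as a member name.
tar -xzf ../backups/backup.tgz -C "$tmp"
diff -ur "$tmp" ./ | less
rm -rf "$tmp"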
Matthew Scharley
I don't want to create any folder. Isn't there a way to read the files from the .tar.gz, put them in a hash, and compare them?
Space
Why extract and compare using diff? Why not compress and then compare using zdiff? It should take less space, although I am not sure how zdiff works. Just curious :)
Neeraj
I think zdiff will only work for files, but I have directories inside the .tar.gz file.
Space
*"I don't want to create any folder."* Don't want to, or can't? It's relatively difficult to do what you're describing, certainly non-trivial and beyond anything anyone will be willing to do for you here.
Matthew Scharley
Also, tar (from a file structure point of view) doesn't really care about files/directories beyond the idea that files in subdirectories have really long names with '/'s in them.
Matthew Scharley
A note on zdiff... while it would (likely) work correctly, it would be difficult to tell which files the changes actually occurred in (unless zdiff has switches to specifically deal with tar files, which it may; I've never used it before).
Matthew Scharley
To do it without extracting the tarball you would want to use `Archive::Tar`, loop over each member of the archive, and then compare it to the existing file on disk in a manner dependent on the type of the file (comparing the contents and possibly times for regular files, `readlink` for symlinks, peeking at `stat` info for device specials, etc.) It's not an ideal task for a beginner. Oh, and `Archive::Tar` doesn't know how to stream the file from disk; it loads all the data into memory. I think the low-tech diff solution wins. You can help yourself out by putting `/tmp` on a tmpfs.
hobbs
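The per-member comparison described in the comment above can also be approximated from the shell, without writing anything to disk. This is only a rough sketch: `compare_members` is a made-up helper name, it assumes the archive holds only regular files and directories (symlinks and device files are not handled), and it re-reads the archive once per member, so it will be slow on large backups.

```shell
#!/bin/sh
# compare_members ARCHIVE BASE   (hypothetical helper, not part of any tool)
# ARCHIVE is the .tar.gz; BASE is the directory the archive would be
# extracted into (the parent of a/ in the question).
compare_members() {
  archive=$1
  base=$2
  # List member names, one per line, without unpacking anything.
  tar -tzf "$archive" | while IFS= read -r member; do
    # Directory entries are listed with a trailing slash; skip them.
    case $member in */) continue ;; esac
    if [ ! -e "$base/$member" ]; then
      echo "missing: $member"
    # -O writes the member to stdout; cmp reads it as "-" and compares bytes.
    elif ! tar -xzOf "$archive" "$member" </dev/null | cmp -s - "$base/$member"; then
      echo "differs: $member"
    fi
  done
}
```

Called as `compare_members backup.tar.gz /path/to/parent`, it prints one line per missing or changed file and nothing for files that match.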
+5  A: 

See Archive::Tar.

Sinan Ünür
+5  A: 

The Archive::Tar and File::Find modules will be helpful. A basic example is shown below. It just prints information about the files in a tar and the files in a directory tree.

It was not clear from your question how you want to compare the files. If you need to compare the actual content, the get_content() method in Archive::Tar::File will likely be needed. If a simpler comparison is adequate (for example, name, size, and mtime), you won't need much more than methods used in the example below.

#!/usr/bin/perl
use strict;
use warnings;

# A utility function to display our results.
sub Print_file_info {
    print map("$_\n", @_), "\n";
}

# Print some basic information about files in a tar.
use Archive::Tar qw();
my $tar_file = 'some_tar_file.tar.gz';
my $tar = Archive::Tar->new($tar_file);
for my $ft ( $tar->get_files ){
    # The variable $ft is an Archive::Tar::File object.
    Print_file_info(
        $ft->name,
        $ft->is_file ? 'file' : 'other',
        $ft->size,
        $ft->mtime,
    );
}

# Print some basic information about files in a directory tree.
use File::Find;
my $dir_name = 'some_directory';
my @files;
find(sub {push @files, $File::Find::name}, $dir_name);
Print_file_info(
    $_,
    -f $_ ? 'file' : 'other',
    -s,
    (stat)[9],
) for @files;
FM
@FM AFAIK, `Archive::Tar->new` needs to be told the file is compressed.
Sinan Ünür
@Sinan Unur. Good point; that's how I read the documentation too. However, I just tested `$ft->get_content` in the script above, and it returned the correct content, even without adding the compressed flag (on a Windows box). At this point I'm not certain one way or the other ... sounds like a good question for SO.
FM
@FM A-ha! Looking at the source code, it seems like the `$compressed` flag is used for output by `Archive::Tar` whereas the internal `_get_handle` detects if the file is compressed.
Sinan Ünür
@Sinan Unur. Good to know. Thanks.
FM
+2  A: 

Here's an example that checks whether every file in an archive also exists in a folder.

# $1 is the archive to test
# $2 is the base folder
tar --list -f "$1" | while IFS= read -r file
do
  # read -r keeps names with spaces intact, so no quoting tricks are needed
  if [[ -e "$2$file" ]]
    then
      echo "   \"$2$file\""
    else
      echo "no \"$2$file\""
  fi
done

This is how I tested this:

I removed / renamed config, then ran the following:

bash test Downloads/update-dnsomatic-0.1.2.tar.gz Downloads/

Which gave the output of:

   "Downloads/update-dnsomatic-0.1.2/"
no "Downloads/update-dnsomatic-0.1.2/config"
   "Downloads/update-dnsomatic-0.1.2/update-dnsomatic"
   "Downloads/update-dnsomatic-0.1.2/README"
   "Downloads/update-dnsomatic-0.1.2/install.sh"

I am new to bash / shell programming, so there is probably a better way to do this.
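
The check above only goes one way (archive to folder). A rough sketch of a two-way name comparison using sort and comm follows. `name_diff` and its three arguments are made up for illustration, and it assumes member names contain no newlines and were stored without a leading "./".

```shell
#!/bin/sh
# name_diff ARCHIVE BASE TOP   (hypothetical helper)
# ARCHIVE: the tar.gz; BASE: the folder the archive unpacks into; TOP: the
# top-level directory inside the archive, given without a trailing slash,
# so that unrelated files in BASE are not reported.
name_diff() {
  archive=$1
  base=$2
  top=$3
  in_tar=$(mktemp) && on_disk=$(mktemp) || return 1
  # Member names, minus the trailing slash tar prints for directories.
  tar -tzf "$archive" | sed 's,/$,,' | LC_ALL=C sort > "$in_tar"
  # Paths on disk, relative to BASE so they line up with the member names.
  (cd "$base" && find "$top") | LC_ALL=C sort > "$on_disk"
  # comm -23 keeps lines only in the first file, comm -13 only in the second.
  LC_ALL=C comm -23 "$in_tar" "$on_disk" | sed 's/^/missing file: /'
  LC_ALL=C comm -13 "$in_tar" "$on_disk" | sed 's/^/file not in archive: /'
  rm -f "$in_tar" "$on_disk"
}
```

For the test case above it would be called as `name_diff Downloads/update-dnsomatic-0.1.2.tar.gz Downloads update-dnsomatic-0.1.2`. It compares names only, not sizes or contents.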

Brad Gilbert
+1  A: 

This might be a good starting point for a proper Perl program. It does what the question asked for, though.

It was just hacked together, and ignores most of the best practices for Perl.

perl test.pl full                            \
     Downloads/update-dnsomatic-0.1.2.tar.gz \
     Downloads/                              \
     update-dnsomatic-0.1.2

#! /usr/bin/env perl
use strict;
use 5.010;
use warnings;
use autodie;

use Archive::Tar;
use File::Spec::Functions qw'catfile catdir';

my($action,$file,$directory,$special_dir) = @ARGV;

if( @ARGV == 1 ){
  $file = *STDOUT{IO};
}
if( @ARGV == 3 ){
  $special_dir = '';
}

sub has_file(_);
sub same_size($$);
sub find_missing(\%$);

given( lc $action ){

  # only compare names
  when( @{[qw'simple name names']} ){
    my @list = Archive::Tar->list_archive($file);

    say qq'missing file: "$_"' for grep{ ! has_file } @list;
  }

  # compare names, sizes, contents
  when( @{[qw'full aggressive']} ){
    my $next = Archive::Tar->iter($file);
    my( %visited );

    while( my $file = $next->() ){
      next unless $file->is_file;
      my $name = $file->name;
      $visited{$name} = 1;

      unless( has_file($name) ){
        say qq'missing file: "$name"' ;
        next;
      }

      unless( same_size( $name, $file->size ) ){
        say qq'different size: "$name"';
        next;
      }

      next unless $file->size;

      unless( same_checksum( $name, $file->get_content ) ){
        say qq'different checksums: "$name"';
        next;
      }
    }

    say qq'file not in archive: "$_"' for find_missing %visited, $special_dir;
  }

}

sub has_file(_){
  my($file) = @_;
  if( -e catfile $directory, $file ){
    return 1;
  }
  return;
}

sub same_size($$){
  my($file,$size) = @_;
  if( -s catfile($directory,$file) == $size ){
    return $size || '0 but true';
  }
  return; # empty list/undefined
}

sub same_checksum{
  my($file,$contents) = @_;
  require Digest::SHA1;

  my($outside,$inside);

  my $sha1 = Digest::SHA1->new;
  {
    open my $io, '<', catfile $directory, $file;
    binmode $io; # hash raw bytes, matching what get_content returns
    $sha1->addfile($io);
    close $io;
    $outside = $sha1->digest;
  }

  $sha1->add($contents);
  $inside = $sha1->digest;


  return 1 if $inside eq $outside;
  return;
}

sub find_missing(\%$){
  my($found,$current_dir) = @_;

  my(@dirs,@files);

  {
    my $open_dir = catdir($directory,$current_dir);
    opendir my($h), $open_dir;

    while( my $elem = readdir $h ){
      next if $elem =~ /^[.]{1,2}[\\\/]?$/;

      my $path = catfile $current_dir, $elem;
      my $open_path = catfile $open_dir, $elem;

      given($open_path){
        when( -d ){
          push @dirs, $path;
        }
        when( -f ){
          push @files, $path, unless $found->{$path};
        }
        default{
          die qq'not a file or a directory: "$path"';
        }
      }
    }
  }

  for my $path ( @dirs ){
    push @files, find_missing %$found, $path;
  }

  return @files;
}

After renaming config to config.rm, adding an extra character to README, changing a character in install.sh, and adding a file named .test, this is what it output:

missing file: "update-dnsomatic-0.1.2/config"
different size: "update-dnsomatic-0.1.2/README"
different checksums: "update-dnsomatic-0.1.2/install.sh"
file not in archive: "update-dnsomatic-0.1.2/config.rm"
file not in archive: "update-dnsomatic-0.1.2/.test"
Brad Gilbert
http://search.cpan.org/dist/Archive-Tar/bin/ptardiff is probably better though.
Brad Gilbert
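In the same spirit as ptardiff, GNU tar has a built-in compare mode that checks an archive against the files on disk without extracting it. The wrapper below is a minimal sketch that assumes GNU tar is installed (other tar implementations may not support -d); `tar_compare` is a made-up name. Note that this mode only looks at names that are in the archive, so files added on disk since the backup are not reported.

```shell
#!/bin/sh
# tar_compare ARCHIVE BASE   (hypothetical wrapper; assumes GNU tar)
# ARCHIVE is the .tar.gz; BASE is the directory the archive would be
# extracted into.
tar_compare() {
  archive=$1
  base=$2
  # -d (--compare) checks each archive member against the file of the same
  # name on disk; -C makes tar change into BASE first. Differences and
  # missing files are reported as messages, and the exit status is
  # non-zero when anything differs.
  tar -dzf "$archive" -C "$base" 2>&1
}
```

For example, `tar_compare backup.tar.gz /path/to/parent | less` prints one message per member that differs from its counterpart on disk.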