Duplicate data removal using Perl, called via a batch file within Windows: a DOS window in Windows runs a batch file, and the batch file calls the Perl script which carries out the actions. I already have the batch file. The Perl script I have works for duplicate data removal so long as the data file is not too big. The problem that needs resolving is with larger data files (2 GB or more): with files of this size, a memory error occurs when trying to load the complete file into an array for duplicate data removal. The memory error occurs in the subroutine at:

@contents_of_the_file = <INFILE>;

(A completely different method is acceptable so long as it solves this issue; please suggest one.) The subroutine is:

sub remove_duplicate_data_and_file
{
    open(INFILE, "<" . $output_working_directory . $output_working_filename)
        or dienice("Can't open $output_working_filename : INFILE :$!");
    if ($test ne "YES") {
        flock(INFILE, 1);                  # 1 = LOCK_SH (shared lock)
    }
    @contents_of_the_file = <INFILE>;      # memory error occurs here on large files
    if ($test ne "YES") {
        flock(INFILE, 8);                  # 8 = LOCK_UN (unlock)
    }
    close(INFILE);
    ### TEST print "$#contents_of_the_file\n\n";

    @unique_contents_of_the_file = grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);

    open(OUTFILE, ">" . $output_restore_split_filename)
        or dienice("Can't open $output_restore_split_filename : OUTFILE :$!");
    if ($test ne "YES") {
        flock(OUTFILE, 1);
    }
    for ($element_number = 0; $element_number <= $#unique_contents_of_the_file; $element_number++) {
        print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
    }
    if ($test ne "YES") {
        flock(OUTFILE, 8);
    }
    close(OUTFILE);
}
+2  A: 

Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.

How much RAM do you have?

If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.

I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.

I'd rather not risk it. I hope the Perl cavalry will arrive soon.

In the meantime, here's a link.

pavium
Slurping a 2GB file is always a bad idea, whether the OS complains or otherwise.
Matthew Scharley
Can you suggest a modification to my code for that?
Sunny Rockzzs
pavium, don't worry about offending a Perl guru. It's a good way to learn, and if people comment, it's not you, it's your code. Not the same thing. One of the Perl mottos is "Have fun".
dland
+4  A: 

You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...

  • Don't slurp - Read one line at a time.
  • Hash the line.
  • Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
  • If the key already exists, append the duplicate line number to the list stored under that key.

At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
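
A minimal sketch of that two-pass idea, assuming Digest::MD5 for the line hashing (the file names below are illustrative, not taken from the question):

use strict;
use warnings;
use Digest::MD5 qw(md5);

my %line_nums_for;   # hash of lists: digest => [ line numbers with that content ]

open my $in, '<', 'data.dat' or die $!;
while (my $line = <$in>) {
    push @{ $line_nums_for{ md5($line) } }, $.;
}

# Every line number after the first in each list is a duplicate.
my %is_duplicate;
for my $nums (values %line_nums_for) {
    $is_duplicate{$_} = 1 for @{$nums}[1 .. $#$nums];
}

# Second pass: copy only the first occurrence of each line.
seek $in, 0, 0;
$. = 0;               # reset the input line counter after the seek
open my $out, '>', 'data_no_dups.dat' or die $!;
while (my $line = <$in>) {
    print $out $line unless $is_duplicate{$.};
}
close $in;
close $out;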

ire_and_curses
+1 for the general idea. But unless I'm overlooking something, storing the dup info as a hash of lists does not seem very convenient as far as the second pass over the data is concerned -- no quick way to know whether to print the line. Seems handier to build up a Perl hash with the wanted line numbers as the hash keys.
FM
@FM: Yes, I take your point. I was trying to avoid having a second hash of line numbers to keep the memory usage down, but the tradeoff is that reconstructing the file from my representation is quite complex compared to your solution. I like your approach more. ;)
ire_and_curses
+5  A: 

You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.

Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.

use strict;
use warnings;

use Digest::MD5 qw(md5);

my (%seen, %keep_line_nums);
my $in_file  = 'data.dat';
my $out_file = 'data_no_dups.dat';

open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;

while ( defined(my $line = <$in_handle>) ){
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}

seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
    print $out_handle $line if $keep_line_nums{$.};
}    

close $in_handle;
close $out_handle;
FM
+1 for actually constructing the code.
ire_and_curses
This will be a win as long as the lines being hashed are 16 characters or greater. If the line length is less than 16, use the line itself as the `%seen` key instead. `my $hashed_line = length($line) > 15 ? md5($line) : $line;` will do the trick. See also `Bit::Vector` as a replacement for `%keep_line_nums` to reduce the memory footprint.
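A rough sketch of that substitution (the vector size here is only a placeholder; it needs at least one more bit than the number of lines in the file):

use Bit::Vector;

# Pre-size to at least one more bit than the number of input lines.
my $keep = Bit::Vector->new(100_000_000);

# First pass, in place of  $keep_line_nums{$.} = 1:
#     $keep->Bit_On($.) unless $seen{$hashed_line};
#
# Second pass, in place of the %keep_line_nums lookup:
#     print $out_handle $line if $keep->bit_test($.);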
dland
A: 

In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):

cat infile | sort | uniq > outfile

This ought to work - no need for Perl at all - which may or may not solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).

EDIT: An alternative solution that's better able to deal with large files may be to use the following algorithm (a rough sketch of the first few steps follows below):

  1. Read INFILE line-by-line
  2. Hash each line to a small hash value (e.g. a hash number mod 10)
  3. Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
  4. Close INFILE
  5. Open and sort each tmp-# to a new file sortedtmp-#
  6. Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each result to the final output file

This will be safer, for very large files, than slurping.

Parts 2 & 3 could be changed to a random# instead of a hash number mod 10.
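
A rough Perl sketch of steps 1-4 of that algorithm, using Digest::MD5 to pick the bucket (the bucket count and file names are illustrative):

use strict;
use warnings;
use Digest::MD5 qw(md5);

my $num_buckets = 10;
my @bucket;
for my $i (0 .. $num_buckets - 1) {
    open $bucket[$i], '>', "tmp-$i" or die $!;       # step 3's per-bucket files
}

open my $in, '<', 'infile' or die $!;
while (my $line = <$in>) {                           # step 1: read line by line
    my $b = unpack('L', md5($line)) % $num_buckets;  # step 2: hash mod bucket count
    print { $bucket[$b] } $line;                     # step 3: append to that bucket
}
close $in;                                           # step 4
close $_ for @bucket;

# Steps 5 and 6 can then use the system sort, e.g. one "sort -u tmp-N -o sortedtmp-N"
# per bucket followed by "sort -m sortedtmp-*" to merge the results.
# (Identical lines always land in the same bucket, so the per-bucket sort -u
# already removes all duplicates before the merge.)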

Here's a script BigSort that may help (though I haven't tested it):

# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ   sort -t";" -k 1,1 $1 > $2

BigSort()
{
if [ -s $1 ]; then
  rm $1.split.* > /dev/null 2>&1
  split -l 2500 -a 5 $1 $1.split.
  rm $1.sort > /dev/null 2>&1
  touch $1.sort1
  for FILE in `ls $1.split.*`
  do
    echo "sort $FILE"
    sort -t";" -k 1,1 $FILE > $FILE.sort
    sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
    mv $1.sort2 $1.sort1
  done
  mv $1.sort1 $2
  rm $1.split.* > /dev/null 2>&1
else
  # work for empty file !
  cp $1 $2
fi
}
Brian M. Hunt
Sort can't work without having the entire file available to process, so this will suffer from the same memory issues as the OP's original example. No minus from me though, as it's a useful solution in a number of related situations.
ire_and_curses
A: 

Well, you could use the in-place edit mode of command-line perl.

perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
Scimon
You're still looking to store the entire contents of the file in RAM though, which is the original problem.
David Precious
Very good point.
Scimon
A: 

Here's a solution that works no matter how big the file is. But it doesn't use RAM exclusively, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.

The solution uses a temporary file that the program treats as a database with SQLite.

#!/usr/bin/perl

use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;

my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
my $cx= DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(FILE, $input) or die $!;
my ($line, $sha, $next_line_number);
my $line_number = 1;
while($line= <FILE>) {
  $line=~ s/\s+$//s;
  $sha= sha1_base64($line);
  unless($cx->selectrow_array($find, undef, $sha)) {
    $insert->execute($sha, $line_number);
  }
  $line_number++;
}
seek FILE, 0, 0;
$list->execute;
$line_number= 1;
$next_line_number= $list->fetchrow_array;
while($line= <FILE>) {
  $line=~ s/\s+$//s;
  if($next_line_number == $line_number) {
    say $line;
    $next_line_number= $list->fetchrow_array;
    last unless $next_line_number;
  }
  $line_number++;
}
close FILE;
Donnie