Duplicate data removal using Perl, called via a batch file within Windows: a DOS window in Windows runs a batch file, and the batch file calls the Perl script which carries out the actions. I already have the batch file. The Perl script I have works for duplicate data removal so long as the data file is not too big. The problem that needs resolving is with larger data files (2 GB or more): with files of this size, a memory error occurs when trying to load the complete file into an array for duplicate data removal. The memory error occurs in the subroutine at:

@contents_of_the_file = <INFILE>;

(A completely different method is acceptable so long as it solves this issue; please suggest one.) The subroutine is:

sub remove_duplicate_data_and_file
{
    open(INFILE, "<" . $output_working_directory . $output_working_filename)
        or dienice("Can't open $output_working_filename : INFILE :$!");
    if ($test ne "YES") {
        flock(INFILE, 1);                  # 1 = LOCK_SH (shared lock)
    }
    @contents_of_the_file = <INFILE>;      # memory error occurs here on large files
    if ($test ne "YES") {
        flock(INFILE, 8);                  # 8 = LOCK_UN (unlock)
    }
    close(INFILE);
    ### TEST print "$#contents_of_the_file\n\n";

    @unique_contents_of_the_file = grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);

    open(OUTFILE, ">" . $output_restore_split_filename)
        or dienice("Can't open $output_restore_split_filename : OUTFILE :$!");
    if ($test ne "YES") {
        flock(OUTFILE, 1);
    }
    for ($element_number = 0; $element_number <= $#unique_contents_of_the_file; $element_number++) {
        print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
    }
    if ($test ne "YES") {
        flock(OUTFILE, 8);
    }
    close(OUTFILE);
}
+2  A: 

Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.

How much RAM do you have?

If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.

I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.

I'd rather not risk it. I hope the Perl cavalry will arrive soon.

In the meantime, here's a link.

pavium
Slurping a 2GB file is always a bad idea, whether the OS complains or otherwise.
Matthew Scharley
Can you suggest a modification to my code for that?
Sunny Rockzzs
pavium, don't worry about offending a Perl guru. It's a good way to learn, and if people comment, it's not you, it's your code. Not the same thing. One of the Perl mottos is "Have fun".
dland
+4  A: 

You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...

  • Don't slurp - Read one line at a time.
  • Hash the line.
  • Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
  • If the key already exists, append the duplicate line number to the list stored under that key.

At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
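
A minimal sketch of that two-pass idea, assuming Digest::MD5 for the line hashing (the file names below are illustrative, not taken from the question):

use strict;
use warnings;
use Digest::MD5 qw(md5);

my %line_nums_for;   # hash of lists: digest => [ line numbers with that content ]

open my $in, '<', 'data.dat' or die $!;
while (my $line = <$in>) {
    push @{ $line_nums_for{ md5($line) } }, $.;
}

# Every line number after the first in each list is a duplicate.
my %is_duplicate;
for my $nums (values %line_nums_for) {
    $is_duplicate{$_} = 1 for @{$nums}[1 .. $#$nums];
}

# Second pass: copy only the first occurrence of each line.
seek $in, 0, 0;
$. = 0;               # reset the input line counter after the seek
open my $out, '>', 'data_no_dups.dat' or die $!;
while (my $line = <$in>) {
    print $out $line unless $is_duplicate{$.};
}
close $in;
close $out;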

ire_and_curses
+1 for the general idea. But unless I'm overlooking something, storing the dup info as a hash of lists does not seem very convenient as far as the second pass over the data is concerned -- no quick way to know whether to print the line. Seems handier to build up a Perl hash with the wanted line numbers as the hash keys.
FM
@FM: Yes, I take your point. I was trying to avoid having a second hash of line numbers to keep the memory usage down, but the tradeoff is that reconstructing the file from my representation is quite complex compared to your solution. I like your approach more. ;)
ire_and_curses
+5  A: 

You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.

Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.

use strict;
use warnings;

use Digest::MD5 qw(md5);

my (%seen, %keep_line_nums);
my $in_file  = 'data.dat';
my $out_file = 'data_no_dups.dat';

open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;

while ( defined(my $line = <$in_handle>) ){
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}

seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
    print $out_handle $line if $keep_line_nums{$.};
}    

close $in_handle;
close $out_handle;
FM
+1 for actually constructing the code.
ire_and_curses
This will be a win as long as the lines being hashed are 16 characters or greater. If the line length is less than 16, use the line itself as the `%seen` key instead. `my $hashed_line = length($line) > 15 ? md5($line) : $line;` will do the trick. See also `Bit::Vector` as a replacement for `%keep_line_nums` to reduce the memory footprint.
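A rough sketch of that substitution (the vector size here is only a placeholder; it needs at least one more bit than the number of lines in the file):

use Bit::Vector;

# Pre-size to at least one more bit than the number of input lines.
my $keep = Bit::Vector->new(100_000_000);

# First pass, in place of  $keep_line_nums{$.} = 1:
#     $keep->Bit_On($.) unless $seen{$hashed_line};
#
# Second pass, in place of the %keep_line_nums lookup:
#     print $out_handle $line if $keep->bit_test($.);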
dland
A: 

In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):

cat infile | sort | uniq > outfile

This ought to work - no need for Perl at all - which may or may not solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).

EDIT: An alternative solution that's better able to deal with large files may be to use the following algorithm (a rough sketch of the first few steps follows below):

  1. Read INFILE line-by-line
  2. Hash each line to a small hash value (e.g. a hash number mod 10)
  3. Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
  4. Close INFILE
  5. Open and sort each tmp-# to a new file sortedtmp-#
  6. Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each result to the final output file

This will be safer, for very large files, than slurping.

Parts 2 & 3 could be changed to a random# instead of a hash number mod 10.
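
A rough Perl sketch of steps 1-4 of that algorithm, using Digest::MD5 to pick the bucket (the bucket count and file names are illustrative):

use strict;
use warnings;
use Digest::MD5 qw(md5);

my $num_buckets = 10;
my @bucket;
for my $i (0 .. $num_buckets - 1) {
    open $bucket[$i], '>', "tmp-$i" or die $!;       # step 3's per-bucket files
}

open my $in, '<', 'infile' or die $!;
while (my $line = <$in>) {                           # step 1: read line by line
    my $b = unpack('L', md5($line)) % $num_buckets;  # step 2: hash mod bucket count
    print { $bucket[$b] } $line;                     # step 3: append to that bucket
}
close $in;                                           # step 4
close $_ for @bucket;

# Steps 5 and 6 can then use the system sort, e.g. one "sort -u tmp-N -o sortedtmp-N"
# per bucket followed by "sort -m sortedtmp-*" to merge the results.
# (Identical lines always land in the same bucket, so the per-bucket sort -u
# already removes all duplicates before the merge.)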

Here's a script BigSort that may help (though I haven't tested it):

# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ   sort -t";" -k 1,1 $1 > $2

BigSort()
{
if [ -s $1 ]; then
  rm $1.split.* > /dev/null 2>&1
  split -l 2500 -a 5 $1 $1.split.
  rm $1.sort > /dev/null 2>&1
  touch $1.sort1
  for FILE in `ls $1.split.*`
  do
    echo "sort $FILE"
    sort -t";" -k 1,1 $FILE > $FILE.sort
    sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
    mv $1.sort2 $1.sort1
  done
  mv $1.sort1 $2
  rm $1.split.* > /dev/null 2>&1
else
  # work for empty file !
  cp $1 $2
fi
}
Brian M. Hunt
Sort can't work without having the entire file available to process, so this will suffer from the same memory issues as the OP's original example. No minus from me though, as it's a useful solution in a number of related situations.
ire_and_curses
A: 

Well, you could use the in-place edit mode of command-line perl.

perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
Scimon
You're still looking to store the entire contents of the file in RAM though, which is the original problem.
David Precious
Very good point.
Scimon
A: 

Here's a solution that works no matter how big the file is. But it doesn't use RAM exclusively, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.

The solution uses a temporary file that the program treats as a database with SQLite.

#!/usr/bin/perl

use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;

my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
my $cx= DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(FILE, $input) or die $!;
my ($line, $sha, $next_line_number);
my $line_number = 1;
while($line= <FILE>) {
  $line=~ s/\s+$//s;
  $sha= sha1_base64($line);
  unless($cx->selectrow_array($find, undef, $sha)) {
    $insert->execute($sha, $line_number);
  }
  $line_number++;
}
seek FILE, 0, 0;
$list->execute;
$line_number= 1;
$next_line_number= $list->fetchrow_array;
while($line= <FILE>) {
  $line=~ s/\s+$//s;
  if($next_line_number == $line_number) {
    say $line;
    $next_line_number= $list->fetchrow_array;
    last unless $next_line_number;
  }
  $line_number++;
}
close FILE;
Donnie