views: 549
answers: 6

I have two (very large) text files. What is the fastest way - in terms of run time - to create a third file containing all lines of file1 that do not appear in file2?

So if file1 contains:

Sally  
Joe  
Tom  
Suzie

And file2 contains:

Sally  
Suzie  
Harry  
Tom

Then the output file should contain:

Joe
+13  A: 

Create a hashmap containing each line from file 2. Then for each line in file 1, if it is not in the hashmap then output it. This will be O(N), which is the best efficiency class you can achieve given that you have to read the input.

Perl implementation:

#!/usr/bin/env perl
use warnings;
use strict;
use Carp ();

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';

# Build a set of every line in file2.
my %map;
{
    open my $in, '<', $file2 or Carp::croak("Can't open $file2: $!");
    while (<$in>) {
        $map{$_} = 1;
    }
    close($in) or Carp::carp("error closing $file2");
}

# Print each line of file1 that never appeared in file2.
{
    open my $in, '<', $file1 or Carp::croak("Can't open $file1: $!");
    while (<$in>) {
        if (!$map{$_}) {
            print $_;
        }
    }
    close($in) or Carp::carp("error closing $file1");
}

If file 2 is so large that the hashmap doesn't fit in memory, then we have a different problem at hand. A possible solution is to apply the above solution to chunks of file 2 (each small enough to fit into memory), outputting the results to temporary files. Provided there are sufficient matches between file 1 and file 2, the total output should be of reasonable size. To compute the final result, we take the intersection of the lines in the temporary files, i.e. for a line to be in the final result it must occur in every temporary file.
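
Not from the original answer, but roughly how that chunked fallback might look in Perl. This is only a sketch: it assumes the unique lines of each temporary file fit in memory during the intersection step, and the file names and chunk size are placeholders.

#!/usr/bin/env perl
# Sketch of the chunked fallback described above (illustrative, untuned).
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($file1, $file2) = ('file1.txt', 'file2.txt');
my $chunk_lines = 1_000_000;    # tune to the available RAM

# Pass 1: for every chunk of file2, write "file1 minus this chunk" to a temp file.
my @temp_names;
open my $f2, '<', $file2 or die "Can't open $file2: $!";
while (!eof($f2)) {
    my %chunk;
    while (my $line = <$f2>) {
        $chunk{$line} = 1;
        last if keys(%chunk) >= $chunk_lines;
    }

    my ($tmp, $tmp_name) = tempfile(UNLINK => 0);
    push @temp_names, $tmp_name;

    open my $f1, '<', $file1 or die "Can't open $file1: $!";
    while (my $line = <$f1>) {
        print {$tmp} $line unless $chunk{$line};
    }
    close $f1;
    close $tmp;
}
close $f2;

# Pass 2: intersect the temp files; a line survives only if no chunk of file2 contained it.
my %survivors;
for my $i (0 .. $#temp_names) {
    open my $tmp, '<', $temp_names[$i] or die "Can't open $temp_names[$i]: $!";
    my %in_this;
    while (my $line = <$tmp>) {
        $in_this{$line} = 1;
    }
    close $tmp;

    if ($i == 0) {
        %survivors = %in_this;
    }
    else {
        for my $line (keys %survivors) {
            delete $survivors{$line} unless $in_this{$line};
        }
    }
}
unlink @temp_names;

# Final pass over file1 preserves its order and duplicates in the output.
open my $f1, '<', $file1 or die "Can't open $file1: $!";
while (my $line = <$f1>) {
    print $line if !@temp_names || $survivors{$line};   # empty file2: keep everything
}
close $f1;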

marcog
What do you mean by hashmap? It sounds memory-based. I understood "(very large) files" to mean that it is not possible to load everything into memory.
Norbert Hartl
A Perl dictionary is a hashmap. Rini asked for the fastest method, not the most memory-efficient solution.
marcog
Added a perl implementation
marcog
If it does not fit into RAM, then it swaps, which is not very fast. Or it dies, and then it is fast, but... :)
Norbert Hartl
Yep, that would cause a problem. :)
marcog
Perl is not very memory efficient - you are very likely to run out of memory doing this
1800 INFORMATION
@Norbert Added a solution if we run out of memory. @1800 Rini asked for a Perl solution.
marcog
@Norbert Hartl In Perl you can tie a hash to a DBM file that resides on disk. It is slower than in-memory hashes, but still amortized O(1), so his/her runtime will still be O(n). The nice thing about it is you can decide whether to tie or not based on the file sizes without changing any of the code that does the work. See AnyDBM_File.
Chas. Owens
my %map = {}; is an error: it tries to assign an anonymous hash reference to a hash, and hash assignment requires an even number of items. You probably meant my %map = (); which is also a mistake, but not a bad one. People write that when they create a hash and want it to be empty, but a hash is empty upon creation, so the right thing to say is just my %map; Also, the idiomatic Perl name for that variable is %seen, because you have seen the value before.
Chas. Owens
Another optimization would be to say my $use_file1_first = (-s $file1 < -s $file2); to determine which file to read first. The smaller file should be in the hash (-s returns the file's size in bytes).
Chas. Owens
It's misleading to say that hashes are O(1). They are only O(1) provided that they are sufficiently sparsely populated -- any hash with n buckets that is loaded with n*n items has O(n) performance at best. They can also degrade to O(n) performance for certain (unlikely) pathological inputs that cause them to essentially become linked lists.
j_random_hacker
@Chas. Yes, you are right. I couldn't quite remember the infamous tie. I haven't done any Perl in the last couple of years :)
Norbert Hartl
+6  A: 

Not Perl but effective:

diff <(sort file1) <(sort file2)
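
One caveat, raised in the comments below: plain diff output includes hunk headers and "<" / ">" markers, so it needs a little postprocessing to keep only the lines unique to file1. A possible filter, assuming diff's default (normal) output format, where lines present only in the first file are prefixed with "< ":

diff <(sort file1) <(sort file2) | perl -ne 'print substr($_, 2) if /^< /'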

Ichorus
This can be run from Perl and it is probably the fastest solution for very large files. Perl is likely to run out of memory if you try to slurp large files
1800 INFORMATION
Doesn't diff try to do it in memory as well? If so, this is also not feasible for very large files. And either way you have to postprocess, because the diff output is the result you want, but only after the context lines and markers are removed.
Norbert Hartl
Diff can handle much larger files than Perl. Perl tends to trade memory for speed; normally this is a good tradeoff, but you will see memory usage that is very much higher than the size of the strings involved.
1800 INFORMATION
+2  A: 

What is very large to you? Larger than your RAM? Your basic answer is to use a hash; if the files are larger than your RAM, then you need to use a hash tied to a DBM file.
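
A minimal sketch of the tied-hash variant, assuming a DBM backend is available through AnyDBM_File; the backing file name seen.db is only a placeholder:

#!/usr/bin/perl
# Sketch of the tied-hash approach (illustrative, not from the original post).
use strict;
use warnings;
use AnyDBM_File;
use Fcntl;

my ($file1, $file2) = ('file1.txt', 'file2.txt');

# The hash lives on disk, so it can grow beyond the available RAM.
tie my %seen, 'AnyDBM_File', 'seen.db', O_RDWR | O_CREAT, 0644
    or die "Can't tie hash to seen.db: $!";

# Record every line of file2.
open my $in, '<', $file2 or die "Can't open $file2: $!";
$seen{$_} = 1 while <$in>;
close $in;

# Print the lines of file1 that never appeared in file2.
open $in, '<', $file1 or die "Can't open $file1: $!";
while (<$in>) {
    print unless $seen{$_};
}
close $in;

untie %seen;

One caveat with this sketch: the default SDBM_File backend limits the size of each key/value pair (roughly 1 kB), so very long lines may require a backend such as DB_File or GDBM_File.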

Chas. Owens
+4  A: 
#!/usr/bin/perl

use warnings;
use strict;

open(my $alpha, '<', 'file1') || die "Couldn't open file1: $!";
open(my $beta,  '<', 'file2') || die "Couldn't open file2: $!";

# Remember every line of file2, then print the lines of file1 that
# do not appear in it.
my %data;
map {$data{$_} = 1} <$beta>;
map {print $_ unless $data{$_}} <$alpha>;
slipset
although your code will work, be warned, with large files it might have a tendency to crash, as <> in list context will likely load the whole file into memory prior to performing the map operation. marcog's solution is better for this reason. Perl /might/ be smart and do the right thing, but it's a bit hairy.
Kent Fredric
@Kent: marcog's solution also loads all lines into memory -- both these solutions will have roughly the same memory usage.
j_random_hacker
j_random_hacker: in cases of significantly duplicated lines, the read-into-hash version will use less memory. As well as this, the top one *doesn't* load all lines of the *second* file into memory at once in *ADDITION* to the lines of the first.
Kent Fredric
slipset's variant will have up to 2 effective copies of one of the files in memory at one point, and later an effective copy of the other file alongside the hash; marcog's solution will have at most 1 effective file copy in memory at any one time.
Kent Fredric
@Kent: Good points, I didn't think about a mostly-duplicates scenario, or notice that slipset read the 2nd file into memory too -- my bad.
j_random_hacker
+4  A: 

I'm surprised no-one has yet suggested the most memory-efficient way, which is to sort both files separately, then perform a list merge. If you're on Unix, the following 3 simple commands will do what you need, quickly:

sort < file1 > file1.sorted
sort < file2 > file2.sorted
comm -2 -3 file1.sorted file2.sorted > differences

If the files are so large that Perl has to page VM to load all the lines into a hashtable, this technique will be much faster. Otherwise, a hashtable-based approach should be faster.

If you're on Unix, it's better to use your system's external sort command as it is intelligent about memory usage -- using Perl's sort() entails reading the entire contents of the file into memory.

If you're on Windows, note that the supplied sort command is case-insensitive and this can't be turned off! Also there is no comm command on Windows, so you'll need to roll your own -- replace the third line above with:

perl subtract_sets.pl file1.sorted file2.sorted > differences.txt

subtract_sets.pl

#!/usr/bin/perl
use strict;
use warnings;

open my $f1, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
open my $f2, '<', $ARGV[1] or die "Can't open $ARGV[1]: $!";

my $x = <$f1>;
my $y = <$f2>;

# Merge the two sorted files, printing only the lines unique to the
# first file (the equivalent of comm -2 -3).
while (defined $x && defined $y) {
    if ($x lt $y) {
        # $x cannot be in file2
        print $x;
        $x = <$f1>;
    } elsif ($y lt $x) {
        # $y is not in file1; skip it
        $y = <$f2>;
    } else {
        # Lines match
        $x = <$f1>;
        $y = <$f2>;
    }
}

# Anything left over in file1 cannot be in file2
while (defined $x) {
    print $x;
    $x = <$f1>;
}
j_random_hacker
+2  A: 

Just some efficiency benchmarks:

10k lines, 10-character random strings max per line.

          Rate slipset marcog
slipset 47.6/s      --   -16%
marcog  56.7/s     19%     --

100k lines, 10-character random strings max per line.

          Rate slipset marcog
slipset 3.02/s      --   -34%
marcog  4.60/s     52%     --

1000k lines, 10-character random strings max per line.

        s/iter slipset marcog
slipset   4.09      --   -33%
marcog    2.75     49%     --

1k lines, 100-character random strings max per line.

         Rate  slipset marcog
slipset 379/s      --   -12%
marcog  431/s     14%     --

100k lines, 100-character random strings max per line.

          Rate slipset  marcog
slipset 2.15/s      --    -30%
marcog  3.08/s     44%      --

1k lines, 1000-character random strings max per line.

         Rate slipset  marcog
slipset 133/s      --    -10%
marcog  148/s     11%      --

100k lines, 1000-character random strings max per line.

          Rate slipset  marcog
slipset 1.01/s      --    -18%
marcog  1.22/s     22%      --

Memory Efficiency

Marcog: 100k lines, 1000-character random strings max per line:

Memory usage summary: heap total: 163_259_635, heap peak: 61_536_800, stack peak: 17_648
         total calls     total memory   failed calls
 malloc|     307_425      162_378_090              0
realloc|       1_461           96_878              0  (nomove:1_218, dec:1_026, free:0)
 calloc|      12_762          784_667              0
   free|     307_598      155_133_460

Slipset: 100k lines, 1000-character random strings max per line:

Memory usage summary: heap total: 647_103_469, heap peak: 118_445_776, stack peak: 17_648
         total calls     total memory   failed calls
 malloc|     508_089      186_752_811              0
realloc|     399_907      459_553_775              0  (nomove:334_169, dec:196_380, free:0)
 calloc|      12_765          796_883              0
   free|     507_584      256_315_688
Kent Fredric