views: 549
answers: 6

I have two (very large) text files. What is the fastest way - in terms of run time - to create a third file containing all lines of file1 that do not appear in file2?

So if file1 contains:

Sally  
Joe  
Tom  
Suzie

And file2 contains:

Sally  
Suzie  
Harry  
Tom

Then the output file should contain:

Joe
+13  A: 

Create a hashmap containing each line from file 2. Then for each line in file 1, if it is not in the hashmap then output it. This will be O(N), which is the best efficiency class you can achieve given that you have to read the input.

Perl implementation:

#!/usr/bin/env perl
use warnings;
use strict;
use Carp ();

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';

# Build a set of every line in file2.
my %map;
{
    open my $in, '<', $file2 or Carp::croak("Can't open $file2: $!");
    while (<$in>) {
        $map{$_} = 1;
    }
    close($in) or Carp::carp("error closing $file2");
}

# Print each line of file1 that never appeared in file2.
{
    open my $in, '<', $file1 or Carp::croak("Can't open $file1: $!");
    while (<$in>) {
        if (!$map{$_}) {
            print $_;
        }
    }
    close($in) or Carp::carp("error closing $file1");
}

If file 2 is so large that the hashmap doesn't fit in memory, then we have a different problem at hand. A possible solution is to apply the above solution to chunks of file 2 (each small enough to fit into memory), outputting the results to temporary files. Provided there are sufficient matches between file 1 and file 2, the total output should be of reasonable size. To compute the final result, we take the intersection of the lines in the temporary files, i.e. for a line to be in the final result it must occur in every temporary file.
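
Not from the original answer, but roughly how that chunked fallback might look in Perl. This is only a sketch: it assumes the unique lines of each temporary file fit in memory during the intersection step, and the file names and chunk size are placeholders.

#!/usr/bin/env perl
# Sketch of the chunked fallback described above (illustrative, untuned).
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($file1, $file2) = ('file1.txt', 'file2.txt');
my $chunk_lines = 1_000_000;    # tune to the available RAM

# Pass 1: for every chunk of file2, write "file1 minus this chunk" to a temp file.
my @temp_names;
open my $f2, '<', $file2 or die "Can't open $file2: $!";
while (!eof($f2)) {
    my %chunk;
    while (my $line = <$f2>) {
        $chunk{$line} = 1;
        last if keys(%chunk) >= $chunk_lines;
    }

    my ($tmp, $tmp_name) = tempfile(UNLINK => 0);
    push @temp_names, $tmp_name;

    open my $f1, '<', $file1 or die "Can't open $file1: $!";
    while (my $line = <$f1>) {
        print {$tmp} $line unless $chunk{$line};
    }
    close $f1;
    close $tmp;
}
close $f2;

# Pass 2: intersect the temp files; a line survives only if no chunk of file2 contained it.
my %survivors;
for my $i (0 .. $#temp_names) {
    open my $tmp, '<', $temp_names[$i] or die "Can't open $temp_names[$i]: $!";
    my %in_this;
    while (my $line = <$tmp>) {
        $in_this{$line} = 1;
    }
    close $tmp;

    if ($i == 0) {
        %survivors = %in_this;
    }
    else {
        for my $line (keys %survivors) {
            delete $survivors{$line} unless $in_this{$line};
        }
    }
}
unlink @temp_names;

# Final pass over file1 preserves its order and duplicates in the output.
open my $f1, '<', $file1 or die "Can't open $file1: $!";
while (my $line = <$f1>) {
    print $line if !@temp_names || $survivors{$line};   # empty file2: keep everything
}
close $f1;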

marcog
What do you mean by hashmap? It sounds memory-based. I understood "(very large) files" to mean that it is not possible to load everything into memory.
Norbert Hartl
A Perl dictionary is a hashmap. Rini asked for the fastest method, not the most memory-efficient solution.
marcog
Added a perl implementation
marcog
If it does not fit into RAM, then it swaps, which is not very fast. Or it dies, and then it is fast, but... :)
Norbert Hartl
Yep, that would cause a problem. :)
marcog
Perl is not very memory efficient - you are very likely to run out of memory doing this
1800 INFORMATION
@Norbert Added a solution if we run out of memory. @1800 Rini asked for a Perl solution.
marcog
@Norbert Hartl In Perl you can tie a hash to a DBM file that resides on disk. It is slower than in-memory hashes, but still amortized O(1), so his/her runtime will still be O(n). The nice thing about it is you can decide whether to tie or not based on the file sizes without changing any of the code that does the work. See AnyDBM_File.
Chas. Owens
my %map = {}; is an error: it tries to assign an anonymous hash reference to a hash, and hash assignment requires an even number of items. You probably meant my %map = (); which is also a mistake, but not a bad one. People write that when they create a hash and want it to be empty, but a hash is empty upon creation, so the right thing to say is just my %map; Also, the idiomatic Perl name for that variable is %seen, because you have seen the value before.
Chas. Owens
Another optimization would be to say my $use_file1_first = (-s $file1 < -s $file2); to determine which file to read first. The smaller file should be in the hash (-s returns the file's size in bytes).
Chas. Owens
It's misleading to say that hashes are O(1). They are only O(1) provided that they are sufficiently sparsely populated -- any hash with n buckets that is loaded with n*n items has O(n) performance at best. They can also degrade to O(n) performance for certain (unlikely) pathological inputs that cause them to essentially become linked lists.
j_random_hacker
@Chas. Yes, you are right. I couldn't quite remember the infamous tie. I haven't done any Perl in the last couple of years :)
Norbert Hartl
+6  A: 

Not Perl but effective:

diff <(sort file1) <(sort file2)
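
One caveat, raised in the comments below: plain diff output includes hunk headers and "<" / ">" markers, so it needs a little postprocessing to keep only the lines unique to file1. A possible filter, assuming diff's default (normal) output format, where lines present only in the first file are prefixed with "< ":

diff <(sort file1) <(sort file2) | perl -ne 'print substr($_, 2) if /^< /'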

Ichorus
This can be run from Perl and it is probably the fastest solution for very large files. Perl is likely to run out of memory if you try to slurp large files
1800 INFORMATION
Doesn't diff try to do it in memory as well? If so, this is also not feasible for very large files. And either way you have to postprocess, because the diff output is the result you want, but only after the context lines and markers are removed.
Norbert Hartl
Diff can handle much larger files than Perl. Perl tends to trade memory for speed; normally this is a good tradeoff, but you will see memory usage that is very much higher than the size of the strings involved.
1800 INFORMATION
+2  A: 

What is very large to you? Larger than your RAM? Your basic answer is to use a hash; if the files are larger than your RAM, then you need to use a hash tied to a DBM file.
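
A minimal sketch of the tied-hash variant, assuming a DBM backend is available through AnyDBM_File; the backing file name seen.db is only a placeholder:

#!/usr/bin/perl
# Sketch of the tied-hash approach (illustrative, not from the original post).
use strict;
use warnings;
use AnyDBM_File;
use Fcntl;

my ($file1, $file2) = ('file1.txt', 'file2.txt');

# The hash lives on disk, so it can grow beyond the available RAM.
tie my %seen, 'AnyDBM_File', 'seen.db', O_RDWR | O_CREAT, 0644
    or die "Can't tie hash to seen.db: $!";

# Record every line of file2.
open my $in, '<', $file2 or die "Can't open $file2: $!";
$seen{$_} = 1 while <$in>;
close $in;

# Print the lines of file1 that never appeared in file2.
open $in, '<', $file1 or die "Can't open $file1: $!";
while (<$in>) {
    print unless $seen{$_};
}
close $in;

untie %seen;

One caveat with this sketch: the default SDBM_File backend limits the size of each key/value pair (roughly 1 kB), so very long lines may require a backend such as DB_File or GDBM_File.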

Chas. Owens
+4  A: 
#!/usr/bin/perl

use warnings;
use strict;

open(my $alpha, '<', 'file1') || die "Couldn't open file1: $!";
open(my $beta,  '<', 'file2') || die "Couldn't open file2: $!";

# Remember every line of file2, then print the lines of file1 that
# do not appear in it.
my %data;
map {$data{$_} = 1} <$beta>;
map {print $_ unless $data{$_}} <$alpha>;
slipset
although your code will work, be warned, with large files it might have a tendency to crash, as <> in list context will likely load the whole file into memory prior to performing the map operation. marcog's solution is better for this reason. Perl /might/ be smart and do the right thing, but it's a bit hairy.
Kent Fredric
@Kent: marcog's solution also loads all lines into memory -- both these solutions will have roughly the same memory usage.
j_random_hacker
j_random_hacker: in cases of significantly duplicated lines, the read-into-hash version will use less memory. As well as this, the top one *doesn't* load all lines of the *second* file into memory at once in *ADDITION* to the lines of the first.
Kent Fredric
slipset's variant will have up to 2 effective copies of one of the files in memory at one point, and later an effective copy of the other file alongside the hash; marcog's solution will have at most 1 effective file copy in memory at any one time.
Kent Fredric
@Kent: Good points, I didn't think about a mostly-duplicates scenario, or notice that slipset read the 2nd file into memory too -- my bad.
j_random_hacker
+4  A: 

I'm surprised no-one has yet suggested the most memory-efficient way, which is to sort both files separately, then perform a list merge. If you're on Unix, the following 3 simple commands will do what you need, quickly:

sort < file1 > file1.sorted
sort < file2 > file2.sorted
comm -2 -3 file1.sorted file2.sorted > differences

If the files are so large that Perl has to page VM to load all the lines into a hashtable, this technique will be much faster. Otherwise, a hashtable-based approach should be faster.

If you're on Unix, it's better to use your system's external sort command as it is intelligent about memory usage -- using Perl's sort() entails reading the entire contents of the file into memory.

If you're on Windows, note that the supplied sort command is case-insensitive and this can't be turned off! Also there is no comm command on Windows, so you'll need to roll your own -- replace the third line above with:

perl subtract_sets.pl file1.sorted file2.sorted > differences.txt

subtract_sets.pl

#!/usr/bin/perl
use strict;
use warnings;

open my $f1, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
open my $f2, '<', $ARGV[1] or die "Can't open $ARGV[1]: $!";

my $x = <$f1>;
my $y = <$f2>;

# Merge the two sorted files, printing only the lines unique to the
# first file (the equivalent of comm -2 -3).
while (defined $x && defined $y) {
    if ($x lt $y) {
        # $x cannot be in file2
        print $x;
        $x = <$f1>;
    } elsif ($y lt $x) {
        # $y is not in file1; skip it
        $y = <$f2>;
    } else {
        # Lines match
        $x = <$f1>;
        $y = <$f2>;
    }
}

# Anything left over in file1 cannot be in file2
while (defined $x) {
    print $x;
    $x = <$f1>;
}
j_random_hacker
+2  A: 

Just some efficiency benchmarks:

10k lines, 10-character random strings max per line.

          Rate slipset marcog
slipset 47.6/s      --   -16%
marcog  56.7/s     19%     --

100k lines, 10-character random strings max per line.

          Rate slipset marcog
slipset 3.02/s      --   -34%
marcog  4.60/s     52%     --

1000k lines, 10-character random strings max per line.

        s/iter slipset marcog
slipset   4.09      --   -33%
marcog    2.75     49%     --

1k lines, 100-character random strings max per line.

         Rate  slipset marcog
slipset 379/s      --   -12%
marcog  431/s     14%     --

100k lines, 100-character random strings max per line.

          Rate slipset  marcog
slipset 2.15/s      --    -30%
marcog  3.08/s     44%      --

1k lines, 1000-character random strings max per line.

         Rate slipset  marcog
slipset 133/s      --    -10%
marcog  148/s     11%      --

100k lines, 1000-character random strings max per line.

          Rate slipset  marcog
slipset 1.01/s      --    -18%
marcog  1.22/s     22%      --

Memory Efficiency

Marcog: 100k lines, 1000-character random strings max per line:

Memory usage summary: heap total: 163_259_635, heap peak: 61_536_800, stack peak: 17_648
         total calls     total memory   failed calls
 malloc|     307_425      162_378_090              0
realloc|       1_461           96_878              0  (nomove:1_218, dec:1_026, free:0)
 calloc|      12_762          784_667              0
   free|     307_598      155_133_460

Slipset: 100k lines, 1000-character random strings max per line:

Memory usage summary: heap total: 647_103_469, heap peak: 118_445_776, stack peak: 17_648
         total calls     total memory   failed calls
 malloc|     508_089      186_752_811              0
realloc|     399_907      459_553_775              0  (nomove:334_169, dec:196_380, free:0)
 calloc|      12_765          796_883              0
   free|     507_584      256_315_688
Kent Fredric