views: 487

answers: 3

I am new to Python and would like to know if someone would kindly convert an example of a fairly simple Perl script to Python?

The script takes 2 files and outputs only the unique lines from the second file by comparing hash keys. It also writes the duplicate lines to a file. I have found this method of deduping to be extremely fast in Perl, and would like to see how Python compares.

#!/usr/bin/perl
use strict;
use warnings;

## Compare file1 and file2 and output only the unique lines from file2.

## Open file1.txt and store its lines in a hash.
open my $file1, '<', "file1.txt" or die $!;
my %file1hash;
while ( <$file1> ) {
    $file1hash{$_} = $_;
}

## Open file2.txt and store its lines in a hash.
open my $file2, '<', "file2.txt" or die $!;
my %file2hash;
while ( <$file2> ) {
    $file2hash{$_} = $_;
}

open my $dfh, '>', "duplicate.txt" or die $!;

## Compare the keys and move the duplicates out of the file2 hash.
foreach ( keys %file1hash ) {
    if ( exists $file2hash{$_} ) {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}

open my $ofh, '>', "file2_clean.txt" or die $!;
print $ofh values %file2hash;

I have tested both the Perl and Python scripts on 2 files of over 1 million lines each, and the total time was less than 6 seconds. For the business purpose this served, the performance is outstanding!

I modified the script Kriss offered and I am very happy with both results: 1) The performance of the script and 2) the ease with which I modified the script to be more flexible:

#!/usr/bin/env python

import os

filename1 = raw_input("What is the first file name to compare? ")
filename2 = raw_input("What is the second file name to compare? ")

file1set = set(line for line in open(filename1))
file2set = set(line for line in open(filename2))

for name, results in [
    (os.path.join(os.getcwd(), "duplicate.txt"), file1set.intersection(file2set)),
    (os.path.join(os.getcwd(), filename2 + "_clean.txt"), file2set.difference(file1set))]:
    with open(name, 'w') as fh:
        for line in results:
            fh.write(line)
+7  A: 

You can use sets in Python if you don't care about order:

file1 = set(open("file1").readlines())
file2 = set(open("file2").readlines())
intersection = file1 & file2        # common lines
non_intersection = file2 - file1    # lines in file2 but not in file1
for item in intersection:
    print item,    # each line already ends in "\n"; the comma suppresses print's extra newline
for item in non_intersection:
    print item,

Other approaches include the difflib and filecmp standard-library modules.
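For illustration, here is a minimal sketch of extracting the lines unique to the second file with difflib (the line lists are placeholders standing in for file contents; note that difflib computes an order-sensitive diff, not a set difference):

```python
import difflib

# Two small line lists standing in for the contents of the two files.
lines1 = ["apple\n", "banana\n", "cherry\n"]
lines2 = ["banana\n", "cherry\n", "durian\n"]

# unified_diff prefixes lines present only in the second sequence with "+";
# skip the "+++" file header and strip the one-character prefix.
diff = difflib.unified_diff(lines1, lines2)
added = [line[1:] for line in diff
         if line.startswith("+") and not line.startswith("+++")]
# added == ["durian\n"]
```

Because the diff respects line order, a line that merely moved between the files would also show up as added, unlike with the set-based approach.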

Another way, using only list membership tests:

# lines in file2 common with file1
data1=map(str.rstrip,open("file1").readlines())
for line in open("file2"):
    line=line.rstrip()
    if line in data1:
        print line

# lines in file2 not in file1, use "not"
data1=map(str.rstrip,open("file1").readlines())
for line in open("file2"):
    line=line.rstrip()
    if not line in data1:
        print line
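One caveat with the list version above: `line in data1` scans the whole list for every line of file2. Building `data1` as a set instead makes each membership test constant-time. A sketch of that variant (the line lists are placeholders standing in for the file contents):

```python
# Stand-ins for the file contents (the original reads file1/file2 from disk).
file1_lines = ["apple\n", "banana\n", "cherry\n"]
file2_lines = ["banana\n", "durian\n"]

# Build the lookup structure once as a set: membership tests are O(1)
# on average instead of a linear scan per line.
data1 = set(line.rstrip() for line in file1_lines)

duplicates = [line for line in file2_lines if line.rstrip() in data1]
uniques    = [line for line in file2_lines if line.rstrip() not in data1]
# duplicates == ["banana\n"], uniques == ["durian\n"]
```

For the million-line files mentioned in the question, this difference dominates the running time.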
ghostdog74
+1 Short, concise, pythonic and to-the-point
Adam Matan
Better to use readlines() instead of read().split(). Unique lines in file2 are file2 - file1 (set difference). Using | yields the set union: all lines from both files as a set.
gimel
Thanks for this, but this script does not preserve the integrity of the complete line in each file. This splits the line on spaces and outputs individual words.
galaxywatcher
I have changed it to readlines(). Please use difflib (and look at filecmp if you are interested) to do this kind of thing if you want to preserve order. It's easier, and the modules have other options you might be interested in too.
ghostdog74
Your code and comments were quite helpful. Thanks.
galaxywatcher
+3  A: 

Here's a slightly different solution that's a little more memory friendly, should the files be very large. This only creates a set for the original file (as there doesn't seem to be a need to have all of file2 in memory at once):

with open("file1.txt", "r") as file1:
    file1set = set(line.rstrip() for line in file1)

with open("file2.txt", "r") as file2:
    with open("duplicate.txt", "w") as dfh:
        with open("file2_clean.txt", "w") as ofh:
            for line in file2:
                if line.rstrip() in file1set:
                    dfh.write(line)     # duplicate line
                else:
                    ofh.write(line)     # not duplicate

Note: if you want to include trailing whitespace and the end-of-line characters in the comparisons, you can replace the line.rstrip() in the loop with just line, and simplify the set creation to:

    file1set = set(file1)

Also, as of Python 3.1 (and 2.7), the with statement accepts multiple context managers, so the three with statements could be combined into one.
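A minimal sketch of that combined form (the tiny sample inputs are placeholders added here so the snippet runs standalone; the logic matches the answer above):

```python
# Tiny sample inputs so the sketch runs as-is.
with open("file1.txt", "w") as f:
    f.write("apple\nbanana\n")
with open("file2.txt", "w") as f:
    f.write("banana\ncherry\n")

with open("file1.txt") as file1:
    file1set = set(line.rstrip() for line in file1)

# One with statement managing all three files (Python 2.7 / 3.1 and later).
with open("file2.txt") as file2, \
        open("duplicate.txt", "w") as dfh, \
        open("file2_clean.txt", "w") as ofh:
    for line in file2:
        if line.rstrip() in file1set:
            dfh.write(line)     # duplicate line
        else:
            ofh.write(line)     # not duplicate
```

All three files are closed automatically when the single with block exits, even if an exception is raised mid-loop.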

Ned Deily
+4  A: 

Yet another variant (merely syntactic changes from the other proposals; there is more than one way to do it in Python, too).

file1set = set(line for line in open("file1.txt"))
file2set = set(line for line in open("file2.txt"))

for name, results in [
    ("duplicate.txt", file1set.intersection(file2set)),
    ("file2_clean.txt", file2set.difference(file1set))]:
    with open(name, 'w') as fh:
        for line in results:
            fh.write(line)

Side note: we should also contribute another Perl version, since the one proposed is not very Perlish... below is the Perl equivalent of my Python version. It does not look much like the initial one. What I want to point out is that the issue in the proposed answers is as much algorithmic and language-independent as it is Perl vs. Python.

use strict;

open my $file1, '<', "file1.txt" or die $!;
my %file1hash = map { $_ => 1 } <$file1>;

open my $file2, '<', "file2.txt" or die $!;
my %file2hash = map { $_ => 1 } <$file2>;

for (["duplicate.txt", [grep $file1hash{$_}, keys(%file2hash)]],
     ["file2_clean.txt", [grep !$file1hash{$_}, keys(%file2hash)]]){
    my ($name, $results) = @$_;
    open my $fh, ">$name" or die $!;
    print $fh @$results;
}
kriss
Thanks. I tested this with 2 large files and it works just like the Perl script. The speed is blazing: literally 3 seconds to compare 2 files, 150,000 records in file1 and 200,000 in file2. Looking at your 2 scripts, the Python just looks much cleaner.
galaxywatcher
What is the performance of the Perl version on your test files? My Perl version can easily be optimized (obviously it performs the same loop twice over file2hash). However, I bet most of the time is spent in IO, so there should not be much difference between versions.
kriss
tsee
I like your Python version, but I do want to point out that there is more than a syntactic difference among the versions presented so far. All but one version (including the Perl versions) build dicts/hashes for `both` files. Your solution nicely avoids first reading `all` of file2 into memory before creating the set (as the `readlines` solutions do). The one-set solution only requires enough memory to hold the set of lines from the original file. For files with very large numbers of records (and in this case 200,000 might be small), such performance characteristics may be important.
Ned Deily
@tsee: updated the Perl script to keep you happy; sorry, I can't easily remove the {} for map (it's a block map). I also tried a version with only one loop over file2hash and it's about 15% faster on my test set.
kriss
@Ned: I totally agree with you; there are also some algorithmic differences between the versions. With my test set yours is the fastest and mine the slowest, with ghostdog's version in between (as you could expect from the algorithmic differences). However, the difference between the 3 versions is quite small (20% between fastest and slowest). My Perl version is about two times slower, and I do not see any obvious way to optimize it at that level.
kriss
@kriss: You're right. Sorry about the map. If it were really about CPU time and not IO, "$hash{$_}=1 for @array" would be faster than any map, by the way.
tsee
@tsee: I wrongly thought map was faster, but I just tried your suggestion and got 20% more performance... it's still not as fast as the Python version, though.
kriss