views: 487

answers: 3

I am new to Python and would like to know if someone would kindly convert an example of a fairly simple Perl script to Python?

The script takes 2 files and outputs only the unique lines from the second file by comparing hash keys. It also writes the duplicate lines to a file. I have found this method of deduping to be extremely fast in Perl, and would like to see how Python compares.

#!/usr/bin/perl
use strict;
use warnings;

## Compare file1 and file2 and output only the unique lines from file2.

## Open file1.txt and store its lines in a hash.
open my $file1, '<', "file1.txt" or die $!;
my %file1hash;
while ( <$file1> ) {
    $file1hash{$_} = $_;
}

## Open file2.txt and store its lines in a hash.
open my $file2, '<', "file2.txt" or die $!;
my %file2hash;
while ( <$file2> ) {
    $file2hash{$_} = $_;
}

open my $dfh, '>', "duplicate.txt" or die $!;

## Compare the keys and move the duplicates out of the file2 hash.
foreach ( keys %file1hash ) {
    if ( exists $file2hash{$_} ) {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}

open my $ofh, '>', "file2_clean.txt" or die $!;
print $ofh values %file2hash;

I have tested both the Perl and Python scripts on 2 files of over 1 million lines each, and the total time was less than 6 seconds. For the business purpose this served, the performance is outstanding!

I modified the script Kriss offered and I am very happy with both results: 1) The performance of the script and 2) the ease with which I modified the script to be more flexible:

#!/usr/bin/env python

import os

filename1 = raw_input("What is the first file name to compare? ")
filename2 = raw_input("What is the second file name to compare? ")

file1set = set(line for line in open(filename1))
file2set = set(line for line in open(filename2))

for name, results in [
    (os.path.join(os.getcwd(), "duplicate.txt"), file1set.intersection(file2set)),
    (os.path.join(os.getcwd(), filename2 + "_clean.txt"), file2set.difference(file1set))]:
    with open(name, 'w') as fh:
        for line in results:
            fh.write(line)
+7  A: 

You can use sets in Python if you don't care about order:

file1 = set(open("file1").readlines())
file2 = set(open("file2").readlines())
intersection = file1 & file2        # common lines
non_intersection = file2 - file1    # lines in file2 but not in file1
for item in intersection:
    print item,    # each line already ends in "\n"; the comma suppresses print's extra newline
for item in non_intersection:
    print item,

Other approaches include the difflib and filecmp standard-library modules.
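For illustration, here is a minimal sketch of extracting the lines unique to the second file with difflib (the line lists are placeholders standing in for file contents; note that difflib computes an order-sensitive diff, not a set difference):

```python
import difflib

# Two small line lists standing in for the contents of the two files.
lines1 = ["apple\n", "banana\n", "cherry\n"]
lines2 = ["banana\n", "cherry\n", "durian\n"]

# unified_diff prefixes lines present only in the second sequence with "+";
# skip the "+++" file header and strip the one-character prefix.
diff = difflib.unified_diff(lines1, lines2)
added = [line[1:] for line in diff
         if line.startswith("+") and not line.startswith("+++")]
# added == ["durian\n"]
```

Because the diff respects line order, a line that merely moved between the files would also show up as added, unlike with the set-based approach.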

Another way, using only list membership tests:

# lines in file2 common with file1
data1=map(str.rstrip,open("file1").readlines())
for line in open("file2"):
    line=line.rstrip()
    if line in data1:
        print line

# lines in file2 not in file1, use "not"
data1=map(str.rstrip,open("file1").readlines())
for line in open("file2"):
    line=line.rstrip()
    if not line in data1:
        print line
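One caveat with the list version above: `line in data1` scans the whole list for every line of file2. Building `data1` as a set instead makes each membership test constant-time. A sketch of that variant (the line lists are placeholders standing in for the file contents):

```python
# Stand-ins for the file contents (the original reads file1/file2 from disk).
file1_lines = ["apple\n", "banana\n", "cherry\n"]
file2_lines = ["banana\n", "durian\n"]

# Build the lookup structure once as a set: membership tests are O(1)
# on average instead of a linear scan per line.
data1 = set(line.rstrip() for line in file1_lines)

duplicates = [line for line in file2_lines if line.rstrip() in data1]
uniques    = [line for line in file2_lines if line.rstrip() not in data1]
# duplicates == ["banana\n"], uniques == ["durian\n"]
```

For the million-line files mentioned in the question, this difference dominates the running time.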
ghostdog74
+1 Short, concise, pythonic and to-the-point
Adam Matan
Better to use readlines() instead of read().split(). Unique lines in file2 are file2 - file1 (set difference). Using | yields the set union: all lines from both files as a set.
gimel
Thanks for this, but this script does not preserve the integrity of the complete line in each file. This splits the line on spaces and outputs individual words.
galaxywatcher
I have changed it to readlines(). Please use difflib (and look at filecmp if you are interested) to do this kind of thing if you want to preserve order. It's easier, and the modules have other options you might be interested in too.
ghostdog74
Your code and comments were quite helpful. Thanks.
galaxywatcher
+3  A: 

Here's a slightly different solution that's a little more memory friendly, should the files be very large. This only creates a set for the original file (as there doesn't seem to be a need to have all of file2 in memory at once):

with open("file1.txt", "r") as file1:
    file1set = set(line.rstrip() for line in file1)

with open("file2.txt", "r") as file2:
    with open("duplicate.txt", "w") as dfh:
        with open("file2_clean.txt", "w") as ofh:
            for line in file2:
                if line.rstrip() in file1set:
                    dfh.write(line)     # duplicate line
                else:
                    ofh.write(line)     # not duplicate

Note: if you want to include trailing whitespace and the end-of-line characters in the comparisons, you can replace the line.rstrip() in the loop with just line, and simplify the set creation to:

    file1set = set(file1)

Also, as of Python 3.1 (and 2.7), the with statement accepts multiple context managers, so the three with statements could be combined into one.
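A minimal sketch of that combined form (the tiny sample inputs are placeholders added here so the snippet runs standalone; the logic matches the answer above):

```python
# Tiny sample inputs so the sketch runs as-is.
with open("file1.txt", "w") as f:
    f.write("apple\nbanana\n")
with open("file2.txt", "w") as f:
    f.write("banana\ncherry\n")

with open("file1.txt") as file1:
    file1set = set(line.rstrip() for line in file1)

# One with statement managing all three files (Python 2.7 / 3.1 and later).
with open("file2.txt") as file2, \
        open("duplicate.txt", "w") as dfh, \
        open("file2_clean.txt", "w") as ofh:
    for line in file2:
        if line.rstrip() in file1set:
            dfh.write(line)     # duplicate line
        else:
            ofh.write(line)     # not duplicate
```

All three files are closed automatically when the single with block exits, even if an exception is raised mid-loop.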

Ned Deily
+4  A: 

Yet another variant (merely syntactic changes from the other proposals; there is more than one way to do it in Python, too).

file1set = set(line for line in open("file1.txt"))
file2set = set(line for line in open("file2.txt"))

for name, results in [
    ("duplicate.txt", file1set.intersection(file2set)),
    ("file2_clean.txt", file2set.difference(file1set))]:
    with open(name, 'w') as fh:
        for line in results:
            fh.write(line)

Side note: we should also contribute another Perl version, since the one proposed is not very Perlish... below is the Perl equivalent of my Python version. It does not look much like the initial one. What I want to point out is that the issue in the proposed answers is as much algorithmic and language-independent as it is Perl vs. Python.

use strict;

open my $file1, '<', "file1.txt" or die $!;
my %file1hash = map { $_ => 1 } <$file1>;

open my $file2, '<', "file2.txt" or die $!;
my %file2hash = map { $_ => 1 } <$file2>;

for (["duplicate.txt", [grep $file1hash{$_}, keys(%file2hash)]],
     ["file2_clean.txt", [grep !$file1hash{$_}, keys(%file2hash)]]){
    my ($name, $results) = @$_;
    open my $fh, ">$name" or die $!;
    print $fh @$results;
}
kriss
Thanks. I tested this with 2 large files and it works just like the Perl script. The speed is blazing: literally 3 seconds to compare 2 files, 150,000 records in file1 and 200,000 in file2. Looking at your 2 scripts, the Python just looks much cleaner.
galaxywatcher
What is the performance of the Perl version on your test files? My Perl version can easily be optimized (obviously it performs the same loop twice over file2hash). However, I bet most of the time is spent in IO, so there should not be much difference between versions.
kriss
tsee
I like your Python version, but I do want to point out that there is more than a syntactic difference among the versions presented so far. All but one version (including the Perl versions) build dicts/hashes for `both` files. Your solution nicely avoids first reading `all` of file2 into memory before creating the set (as the `readlines` solutions do). The one-set solution only requires enough memory to hold the set of lines from the original file. For files with very large numbers of records (and in this case 200,000 might be small), such performance characteristics may be important.
Ned Deily
@tsee: updated the Perl script to keep you happy; sorry, I can't easily remove the {} for map (it's a block map). I also tried a version with only one loop over file2hash and it's about 15% faster on my test set.
kriss
@Ned: I totally agree with you; there are also some algorithmic differences between the versions. With my test set yours is the fastest and mine the slowest, with ghostdog's version in between (as you could expect from the algorithmic differences). However, the difference between the 3 versions is quite small (20% between fastest and slowest). My Perl version is about two times slower, and I do not see any obvious way to optimize it at that level.
kriss
@kriss: You're right. Sorry about the map. If it were really about CPU time and not IO, "$hash{$_}=1 for @array" would be faster than any map, by the way.
tsee
@tsee: I wrongly thought map was faster, but I just tried your suggestion and got 20% more performance... it's still not as fast as the Python version, though.
kriss