I am new to Python and would like to know if someone would kindly convert this fairly simple Perl script to Python.
The script takes two files and outputs only the unique lines from the second file by comparing hash keys. It also writes the duplicate lines to a file. I have found this method of deduping to be extremely fast in Perl, and I would like to see how Python compares.
#!/usr/bin/perl
use strict;
use warnings;

## Compare file1 and file2 and output only the unique lines from file2.

## Open file1.txt and store each line in a hash.
open my $file1, '<', 'file1.txt' or die $!;
my %file1hash;
while ( my $name = <$file1> ) {
    $file1hash{$name} = $name;
}
close $file1;

## Open file2.txt and store each line in a hash.
open my $file2, '<', 'file2.txt' or die $!;
my %file2hash;
while ( my $name = <$file2> ) {
    $file2hash{$name} = $name;
}
close $file2;

## Compare the keys and remove the duplicates from the file2 hash,
## writing them to duplicate.txt.
open my $dfh, '>', 'duplicate.txt' or die $!;
foreach ( keys %file1hash ) {
    if ( exists $file2hash{$_} ) {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}
close $dfh;

open my $ofh, '>', 'file2_clean.txt' or die $!;
print $ofh values %file2hash;
close $ofh;
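For comparison, here is a rough line-for-line translation of the Perl script into Python, using a dict in place of each Perl hash. The core logic is factored into a function (`dedupe` is a name chosen here for illustration, not something from the original script); the file names mirror the ones the Perl version assumes:

```python
def dedupe(file1_lines, file2_lines):
    """Return (duplicates, unique): lines of file2 that do / do not
    also appear in file1, mirroring the Perl hash-key comparison."""
    # Store each line as both key and value, like the Perl hashes.
    file1hash = {line: line for line in file1_lines}
    file2hash = {line: line for line in file2_lines}
    duplicates = []
    # Walk file1's keys; move any key also present in file2's hash
    # out of it, just as the Perl loop prints and deletes it.
    for key in file1hash:
        if key in file2hash:
            duplicates.append(file2hash.pop(key))
    return duplicates, list(file2hash.values())

# To mirror the Perl I/O exactly:
#
#   with open("file1.txt") as f1, open("file2.txt") as f2:
#       dups, clean = dedupe(f1, f2)
#   with open("duplicate.txt", "w") as dfh:
#       dfh.writelines(dups)
#   with open("file2_clean.txt", "w") as ofh:
#       ofh.writelines(clean)
```

Note that, like the Perl version, this keeps whole lines (including their newlines) as hash keys, so two lines must match exactly to count as duplicates.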
I have tested both the Perl and Python scripts on two files of over 1 million lines each, and the total time was less than 6 seconds. For the business purpose this served, the performance is outstanding!
I modified the script Kriss offered and I am very happy with both results: 1) the performance of the script, and 2) the ease with which I could modify it to be more flexible:
#!/usr/bin/env python
import os

filename1 = raw_input("What is the first file name to compare? ")
filename2 = raw_input("What is the second file name to compare? ")

file1set = set(line for line in file(filename1))
file2set = set(line for line in file(filename2))

for name, results in [
        (os.path.join(os.getcwd(), "duplicate.txt"),
         file1set.intersection(file2set)),
        (os.path.join(os.getcwd(), filename2 + "_clean.txt"),
         file2set.difference(file1set))]:
    with file(name, 'w') as fh:
        for line in results:
            fh.write(line)
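Since `raw_input` and the `file()` builtin are Python 2 only, the same set-based approach can be sketched for Python 3. This version factors the work into a function (`dedupe_files`, a name chosen here for illustration) so the file names can be passed in rather than prompted for; `input()` would replace `raw_input` for the interactive part:

```python
import os

def dedupe_files(filename1, filename2):
    """Write the lines of filename2 that also appear in filename1 to
    duplicate.txt, and the remaining lines to <filename2>_clean.txt."""
    # Iterating a file object yields its lines, newlines included.
    with open(filename1) as f1:
        file1set = set(f1)
    with open(filename2) as f2:
        file2set = set(f2)
    outdir = os.getcwd()
    for name, results in [
            (os.path.join(outdir, "duplicate.txt"),
             file1set & file2set),            # intersection: duplicates
            (os.path.join(outdir, filename2 + "_clean.txt"),
             file2set - file1set)]:           # difference: unique lines
        with open(name, "w") as fh:
            fh.writelines(results)

# Interactive use, mirroring the original prompts:
#   dedupe_files(input("What is the first file name to compare? "),
#                input("What is the second file name to compare? "))
```

One thing to keep in mind with either version: sets do not preserve order, so the output lines may come out in a different order than they appear in the input files.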