Edit: solution added.

Hi, I currently have some working, albeit slow, code.

It merges 2 CSV files line by line using a primary key. For example, if file 1 has the line:

"one,two,,four,42"

and file 2 has this line:

"one,,three,,42"

where the 0-indexed column $position = 4 holds the primary key, 42;

then the sub: merge_file($file1,$file2,$outputfile,$position);

will output a file with the line:

"one,two,three,four,42";

Every primary key is unique within each file, and a key might exist in one file but not in the other (and vice versa).

There are about 1 million lines in each file.

Going through every line in the first file, I use a hash keyed on the primary key, with the line number as the value. The line number indexes into an array (@file1array) that stores every line of the first file.

Then I go through every line in the second file and check whether the primary key is in the hash. If it is, I get the line from the file1 array, copy the columns I need from the first array into the second array, and concatenate the result onto the end of the output string. I then delete the hash entry, and at the very end dump the entire thing to file. (I am using an SSD, so I want to minimise file writes.)

It is probably best explained with code:

sub merge_file2{
 my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
 print "merging: \n$file1 and \n$file2, to: \n$out\n";
 my $OUTSTRING = undef;

 my %line_for;
 my @file1array;
 open FILE1, "<", $file1 or die "cannot open $file1: $!";
 print "$file1 opened\n";
 while (<FILE1>){
      chomp;
      $line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
      $file1array[$.] = $_; #store line in file1array.
 }
 close FILE1;
 print "$file2 opened - merging..\n";
 open FILE2, "<", $file2 or die "cannot open $file2: $!";
 my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
 while (<FILE2>){
      print "$.\n" if ($.%1000) == 0;
      chomp;
      my @array2 = split /,/, $_; #split 2nd csv line by commas

      my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
      #                            ^         ^                  ^
      # prev line  lookup line in 1st file,lookup hash,     pos of key
      #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)

      foreach(@from1to2){
           $array2[$_] = $array1[$_];
      }
      my $outstring = join ",", @array2;
      $OUTSTRING.=$outstring."\n";
      delete $line_for{$array2[$position]};
 }
 close FILE2;
 print "adding rest of lines\n";
 foreach my $key (sort { $a <=> $b } keys %line_for){
      $OUTSTRING.= $file1array[$line_for{$key}]."\n";
 }

 print "writing file $out\n\n\n";
 write_line($out,$OUTSTRING);
}

The first while loop is fine and takes less than a minute, but the second while loop takes about an hour to run, and I am wondering if I have taken the right approach. I think there is room for a lot of speedup? :) Thanks in advance.


Solution:

sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;

my (@file1,@file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
    if ($.==1){
        $header = $_;
        next;
    }
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file1, [split ',', $_];
}
close FILE1;

open FILE2, "<$file2" or die;
while (<FILE2>){
    next if $.==1;
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file2, [split ',', $_];
}
close FILE2;

print "sorting files\n";
my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;   
print "sorted\n";
@file1 = ();    # free the unsorted copies; assigning undef would leave a one-element array
@file2 = ();
#foreach my $line (@file1){print "\t [ @$line ],\n";    }

my ($i,$j) = (0,0);
while ($i <= $#sortedf1 and $j <= $#sortedf2){
    my $key1 = $sortedf1[$i][$position];
    my $key2 = $sortedf2[$j][$position];
    if ($key1 == $key2){
        foreach(0..$hsize){ #header size.
            $sortedf2[$j][$_] = $sortedf1[$i][$_] if defined $sortedf1[$i][$_] and length $sortedf1[$i][$_];
        }
        $i++;
        $j++;
    }
    elsif ( $key1 < $key2){
        push(@sortedf2,[@{$sortedf1[$i]}]);
        $i++;
    }
    elsif ( $key1 > $key2){ 
        $j++;
    }
}
# append any rows still left in file 1 once file 2 is exhausted
push(@sortedf2, [ @{$sortedf1[$_]} ]) for $i .. $#sortedf1;

#foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }

print "outputting to file\n";
open OUT, ">", $out or die "cannot open $out: $!";
print OUT $header;
foreach(@sortedf2){
    print OUT (join ",", @{$_})."\n";
}
close OUT;

}

Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)

A: 

Assuming around 20 bytes per line, each of your files would amount to about 20 MB, which isn't too big. Since you are using a hash, your time complexity doesn't seem to be a problem.

In your second loop you are printing to the console for each line, and that bit is slow. Removing it should help a lot. You can also avoid the delete in the second loop.

Reading multiple lines at a time should also help, but not by much, I think; there is always going to be read-ahead happening behind the scenes.

neal aise
Um, he's printing to the console only once every 1000 lines, and the "delete" is very important for what he does in the loop following that while statement.
Daniel Martin
oh right! i need some sleep :)
neal aise
20 bytes per line? LOL. You don't know much about Perl's memory efficiency. If you parse the lines and store them in a hash, it takes much more.
Hynek -Pichi- Vychodil
+3  A: 

Two techniques come to mind.

  1. Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the merged data back out to CSV. The database will use indexes to optimize the join (see the sketch after this list).

  2. First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
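
For the first approach, a minimal sketch (untested; the c0..c3/pk schema, the five-columns-with-key-last layout from the question's example, and the output file name are illustrative assumptions only) might look like this, using DBD::SQLite and Text::CSV and storing empty fields as NULL so COALESCE can fill them in from the other file:

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;
use Text::CSV;

my ($file1, $file2, $outfile) = @ARGV;

# An in-memory database is plenty for ~20 MB of data and leaves no file behind.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

# Load each CSV into its own table; empty fields are stored as NULL.
for my $spec ([t1 => $file1], [t2 => $file2]) {
    my ($table, $path) = @$spec;
    $dbh->do("CREATE TABLE $table (c0, c1, c2, c3, pk PRIMARY KEY)");
    my $csv = Text::CSV->new;
    open my $fh, '<', $path or die "$path: $!";
    my $ins = $dbh->prepare("INSERT INTO $table VALUES (?,?,?,?,?)");
    $dbh->begin_work;    # one transaction makes the bulk insert much faster
    while (my $row = $csv->getline($fh)) {
        $ins->execute(map { length($_ // '') ? $_ : undef } @$row);
    }
    $dbh->commit;
    close $fh;
}

# Emulate a full outer join: LEFT JOIN plus the rows that exist only in t2.
# COALESCE picks the first non-NULL value, so file 1 wins where both are set.
my $sel = $dbh->prepare(q{
    SELECT COALESCE(t1.c0, t2.c0), COALESCE(t1.c1, t2.c1),
           COALESCE(t1.c2, t2.c2), COALESCE(t1.c3, t2.c3), t1.pk
    FROM t1 LEFT JOIN t2 ON t1.pk = t2.pk
    UNION ALL
    SELECT c0, c1, c2, c3, pk FROM t2
    WHERE pk NOT IN (SELECT pk FROM t1)
});
$sel->execute;

my $out_csv = Text::CSV->new({ eol => "\n" });
open my $out, '>', $outfile or die "$outfile: $!";
while (my $row = $sel->fetchrow_arrayref) {
    $out_csv->print($out, [ map { $_ // '' } @$row ]);
}
close $out;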

hobbs
The 2nd idea is very good. Thanks!
Dave
What do you mean by O(n*m) ? He's not doing anything O(n*m) here. He's looping once over one file, and once over the second, and not doing anything silly like a sequential scan of the array inside the second loop.
Daniel Martin
@Daniel: If you think that hash lookup is O(1), then you are naive. That is only true in books; in reality it is not. First, hash-map lookup is proportional to the hash computation, which is proportional to the length of the key multiplied by the length of the hash, and the length of the hash is typically log N of the key space, so a lookup is in reality at least O(log N). (Yes, Perl uses adaptive hash length.) Second, there are additional effects such as CPU cache behaviour and so on. In reality it is much closer to O(N) than O(log N), and never O(1).
Hynek -Pichi- Vychodil
My original hash got slower and slower: the initial 20,000 lines were done in a few seconds, but towards the end each batch of 1000 lines took about 5 seconds.
Dave
+3  A: 

What's killing the performance is this code, which is concatenating millions of times.

$OUTSTRING.=$outstring."\n";

....

foreach my $key (sort { $a <=> $b } keys %line_for){
    $OUTSTRING.= $file1array[$line_for{$key}]."\n";
}

If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.

To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.

# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){         # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){        # Fast
    print $fh join('', @data);
}
else {                            # Fast
    print $fh @data;
}
FM
`join` would be just as slow, I think... but this would get around that: `foreach my $line (@outputarray) { print $line, "\n"; }`
Ether
@Ether No, `join` is very fast -- orders of magnitude faster than building up a giant string through repeated concatenation. Try it out: I modified my demo script.
FM
Thanks, in my solution I posted, the file is output from the array.
Dave
+1  A: 

I can't see anything that strikes me as obviously slow, but I would make these changes:

  • First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:

    while (<FILE1>){ chomp; $line_for{read_csv_string($_,$position)} = $_; }

  • Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.

  • If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.

  • Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
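
For reference, Devel::SmallProf needs no changes to the script itself; you run it via the -d switch (assuming the module is installed from CPAN, and with merge.pl standing in for whatever the script is called) and the per-line report lands in smallprof.out in the current directory:

    perl -d:SmallProf merge.pl
    less smallprof.out     # per-line execution counts and times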

Daniel Martin
Thanks for the tips! :)
Dave
1. Initially I stored the line in a hash, but I thought it was slowing things down, so I tried to minimise the key-value size to just the key and the line number to see if it would help (obviously it didn't). 2. Yes, the point is made; I will use arrays instead of concatenating everything into a big string. 3. Not using syswrite, advice taken. 4. Yep, will look into using SmallProf for future code.
Dave
3. Btw, I found that if I write line by line with a `print OUT $_` statement in a foreach() loop, it will crash/disconnect my SSD drive, whereas a single `print OUT $OUTSTRING;` works fine (maybe the controller for the SSD drive is bad). When I run the program on a mechanical rotating hard drive, I can do both with no problem.
Dave
A: 

I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.

use 5.10.0;  # for // ("defined-or")
use Carp;
use Text::CSV;

sub merge_csv {
  my($path,$record) = @_;

  open my $fh, "<", $path or croak "$0: open $path: $!";

  my $csv = Text::CSV->new;
  local $_;
  while (<$fh>) {
    if ($csv->parse($_)) {
      my @f = map length($_) ? $_ : undef, $csv->fields;
      next unless @f >= 1;

      my $primary = pop @f;
      if ($record->{$primary}) {
        $record->{$primary}[$_] //= $f[$_]
          for 0 .. $#{ $record->{$primary} };
      }
      else {
        $record->{$primary} = \@f;
      }
    }
    else {
      warn "$0: $path:$.: parse failed; skipping...\n";
      next;
    }
  }
}

Your main program will resemble

my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;

The Data::Dumper module shows that the resulting hash, given the simple inputs from your question, is

$VAR1 = {
  '42' => [
    'one',
    'two',
    'three',
    'four'
  ]
};
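
To get the merged records back out as CSV, a minimal sketch (not part of the code above; it re-appends the key as the last column, assumes numeric keys for the sort, and writes to a hypothetical merged.csv) could be:

my $csv_out = Text::CSV->new( { eol => "\n" } );
open my $out, ">", "merged.csv" or die "$0: open merged.csv: $!";
for my $key ( sort { $a <=> $b } keys %rec ) {
  # undef fields that were never filled in go back to being empty columns
  $csv_out->print( $out, [ map( $_ // '', @{ $rec{$key} } ), $key ] );
}
close $out;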
Greg Bacon
+1  A: 

If you want a merge, you should really merge. First of all you have to sort your data by key, and then merge! You will beat even MySQL in performance; I have a lot of experience with this.

You can write something along these lines:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV_XS;
use autodie;

use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);
}
close $file1;
close $file2;

Redirect the output to a file and measure.

If you would like more sanity around the sort arguments, you can replace the file-opening part with:

(open my $file1, '-|') || exec('sort',  '-n',  "-k$sortpos",  '-t,',  $ARGV[0]);
(open my $file2, '-|') || exec('sort',  '-n',  "-k$sortpos",  '-t,',  $ARGV[1]);
Hynek -Pichi- Vychodil
This code was really helpful, thanks!
Dave