Edit: solution added.

Hi, I currently have some working, albeit slow, code.

It merges 2 CSV files line by line using a primary key. For example, if file 1 has the line:

"one,two,,four,42"

and file 2 has this line:

"one,,three,,42"

where the 0-indexed column $position = 4 holds the primary key, 42;

then the sub: merge_file($file1,$file2,$outputfile,$position);

will output a file with the line:

"one,two,three,four,42";

Every primary key is unique within each file, and a key might exist in one file but not in the other (and vice versa).

There are about 1 million lines in each file.

Going through every line in the first file, I use a hash keyed on the primary key, with the line number as the value. The line number indexes into an array (@file1array) that stores every line of the first file.

Then I go through every line in the second file and check whether the primary key is in the hash. If it is, I get the line from the file1 array, copy the columns I need from the first array into the second array, and concatenate the result onto the end of the output string. I then delete the hash entry, and at the very end dump the entire thing to file. (I am using an SSD, so I want to minimise file writes.)

It is probably best explained with code:

sub merge_file2{
 my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
 print "merging: \n$file1 and \n$file2, to: \n$out\n";
 my $OUTSTRING = undef;

 my %line_for;
 my @file1array;
 open FILE1, "<", $file1 or die "cannot open $file1: $!";
 print "$file1 opened\n";
 while (<FILE1>){
      chomp;
      $line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
      $file1array[$.] = $_; #store line in file1array.
 }
 close FILE1;
 print "$file2 opened - merging..\n";
 open FILE2, "<", $file2 or die "cannot open $file2: $!";
 my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
 while (<FILE2>){
      print "$.\n" if ($.%1000) == 0;
      chomp;
      my @array2 = split /,/, $_; #split 2nd csv line by commas

      my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
      #                            ^         ^                  ^
      # prev line  lookup line in 1st file,lookup hash,     pos of key
      #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)

      foreach(@from1to2){
           $array2[$_] = $array1[$_];
      }
      my $outstring = join ",", @array2;
      $OUTSTRING.=$outstring."\n";
      delete $line_for{$array2[$position]};
 }
 close FILE2;
 print "adding rest of lines\n";
 foreach my $key (sort { $a <=> $b } keys %line_for){
      $OUTSTRING.= $file1array[$line_for{$key}]."\n";
 }

 print "writing file $out\n\n\n";
 write_line($out,$OUTSTRING);
}

The first while loop is fine and takes less than a minute, but the second while loop takes about an hour to run, and I am wondering if I have taken the right approach. I think there is room for a lot of speedup? :) Thanks in advance.


Solution:

sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;

my (@file1,@file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
    if ($.==1){
        $header = $_;
        next;
    }
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file1, [split ',', $_];
}
close FILE1;

open FILE2, "<$file2" or die;
while (<FILE2>){
    next if $.==1;
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file2, [split ',', $_];
}
close FILE2;

print "sorting files\n";
my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;   
print "sorted\n";
@file1 = ();    # free the unsorted copies; assigning undef would leave a one-element array
@file2 = ();
#foreach my $line (@file1){print "\t [ @$line ],\n";    }

my ($i,$j) = (0,0);
while ($i <= $#sortedf1 and $j <= $#sortedf2){
    my $key1 = $sortedf1[$i][$position];
    my $key2 = $sortedf2[$j][$position];
    if ($key1 == $key2){
        foreach(0..$hsize){ #header size.
            $sortedf2[$j][$_] = $sortedf1[$i][$_] if defined $sortedf1[$i][$_] and length $sortedf1[$i][$_];
        }
        $i++;
        $j++;
    }
    elsif ( $key1 < $key2){
        push(@sortedf2,[@{$sortedf1[$i]}]);
        $i++;
    }
    elsif ( $key1 > $key2){ 
        $j++;
    }
}
# append any rows still left in file 1 once file 2 is exhausted
push(@sortedf2, [ @{$sortedf1[$_]} ]) for $i .. $#sortedf1;

#foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }

print "outputting to file\n";
open OUT, ">", $out or die "cannot open $out: $!";
print OUT $header;
foreach(@sortedf2){
    print OUT (join ",", @{$_})."\n";
}
close OUT;

}

Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)

A: 

Assuming around 20 bytes per line, each of your files would amount to about 20 MB, which isn't too big. Since you are using a hash, your time complexity doesn't seem to be a problem.

In your second loop you are printing to the console for each line, and that bit is slow. Removing it should help a lot. You can also avoid the delete in the second loop.

Reading multiple lines at a time should also help, but not by much, I think; there is always going to be read-ahead happening behind the scenes.

neal aise
Um, he's printing to the console only once every 1000 lines, and the "delete" is very important for what he does in the loop following that while statement.
Daniel Martin
oh right! i need some sleep :)
neal aise
20 bytes per line? LOL. You don't know much about Perl's memory efficiency. If you parse the lines and store them in a hash, it takes much more.
Hynek -Pichi- Vychodil
+3  A: 

Two techniques come to mind.

  1. Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the merged data back out to CSV. The database will use indexes to optimize the join (see the sketch after this list).

  2. First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
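
For the first approach, a minimal sketch (untested; the c0..c3/pk schema, the five-columns-with-key-last layout from the question's example, and the output file name are illustrative assumptions only) might look like this, using DBD::SQLite and Text::CSV and storing empty fields as NULL so COALESCE can fill them in from the other file:

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;
use Text::CSV;

my ($file1, $file2, $outfile) = @ARGV;

# An in-memory database is plenty for ~20 MB of data and leaves no file behind.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });

# Load each CSV into its own table; empty fields are stored as NULL.
for my $spec ([t1 => $file1], [t2 => $file2]) {
    my ($table, $path) = @$spec;
    $dbh->do("CREATE TABLE $table (c0, c1, c2, c3, pk PRIMARY KEY)");
    my $csv = Text::CSV->new;
    open my $fh, '<', $path or die "$path: $!";
    my $ins = $dbh->prepare("INSERT INTO $table VALUES (?,?,?,?,?)");
    $dbh->begin_work;    # one transaction makes the bulk insert much faster
    while (my $row = $csv->getline($fh)) {
        $ins->execute(map { length($_ // '') ? $_ : undef } @$row);
    }
    $dbh->commit;
    close $fh;
}

# Emulate a full outer join: LEFT JOIN plus the rows that exist only in t2.
# COALESCE picks the first non-NULL value, so file 1 wins where both are set.
my $sel = $dbh->prepare(q{
    SELECT COALESCE(t1.c0, t2.c0), COALESCE(t1.c1, t2.c1),
           COALESCE(t1.c2, t2.c2), COALESCE(t1.c3, t2.c3), t1.pk
    FROM t1 LEFT JOIN t2 ON t1.pk = t2.pk
    UNION ALL
    SELECT c0, c1, c2, c3, pk FROM t2
    WHERE pk NOT IN (SELECT pk FROM t1)
});
$sel->execute;

my $out_csv = Text::CSV->new({ eol => "\n" });
open my $out, '>', $outfile or die "$outfile: $!";
while (my $row = $sel->fetchrow_arrayref) {
    $out_csv->print($out, [ map { $_ // '' } @$row ]);
}
close $out;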

hobbs
The 2nd idea is very good. Thanks!
Dave
What do you mean by O(n*m) ? He's not doing anything O(n*m) here. He's looping once over one file, and once over the second, and not doing anything silly like a sequential scan of the array inside the second loop.
Daniel Martin
@Daniel: If you think that hash lookup is O(1), then you are naive. That is only true in books; in reality it is not. First, hash-map lookup is proportional to the hash computation, which is proportional to the length of the key multiplied by the length of the hash, and the length of the hash is typically log N of the key space, so a lookup is in reality at least O(log N). (Yes, Perl uses adaptive hash length.) Second, there are additional effects such as CPU cache behaviour and so on. In reality it is much closer to O(N) than O(log N), and never O(1).
Hynek -Pichi- Vychodil
My original hash got slower and slower: the initial 20,000 lines were done in a few seconds, but towards the end each batch of 1000 lines took about 5 seconds.
Dave
+3  A: 

What's killing the performance is this code, which is concatenating millions of times.

$OUTSTRING.=$outstring."\n";

....

foreach my $key (sort { $a <=> $b } keys %line_for){
    $OUTSTRING.= $file1array[$line_for{$key}]."\n";
}

If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.

To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.

# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){         # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){        # Fast
    print $fh join('', @data);
}
else {                            # Fast
    print $fh @data;
}
FM
`join` would be just as slow, I think... but this would get around that: `foreach my $line (@outputarray) { print $line, "\n"; }`
Ether
@Ether No, `join` is very fast -- orders of magnitude faster than building up a giant string through repeated concatenation. Try it out: I modified my demo script.
FM
Thanks, in my solution I posted, the file is output from the array.
Dave
+1  A: 

I can't see anything that strikes me as obviously slow, but I would make these changes:

  • First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:

    while (<FILE1>){ chomp; $line_for{read_csv_string($_,$position)} = $_; }

  • Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.

  • If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.

  • Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
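
For reference, Devel::SmallProf needs no changes to the script itself; you run it via the -d switch (assuming the module is installed from CPAN, and with merge.pl standing in for whatever the script is called) and the per-line report lands in smallprof.out in the current directory:

    perl -d:SmallProf merge.pl
    less smallprof.out     # per-line execution counts and times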

Daniel Martin
Thanks for the tips! :)
Dave
1. Initially I stored the line in a hash, but I thought it was slowing things down, so I tried to minimise the key-value size to just the key and the line number to see if it would help (obviously it didn't). 2. Yes, the point is made; I will use arrays instead of concatenating everything into a big string. 3. Not using syswrite, advice taken. 4. Yep, will look into using SmallProf for future code.
Dave
3. Btw, I found that if I write line by line with a `print OUT $_` statement in a foreach() loop, it will crash/disconnect my SSD drive, whereas a single `print OUT $OUTSTRING;` works fine (maybe the controller for the SSD drive is bad). When I run the program on a mechanical rotating hard drive, I can do both with no problem.
Dave
A: 

I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.

use 5.10.0;  # for // ("defined-or")
use Carp;
use Text::CSV;

sub merge_csv {
  my($path,$record) = @_;

  open my $fh, "<", $path or croak "$0: open $path: $!";

  my $csv = Text::CSV->new;
  local $_;
  while (<$fh>) {
    if ($csv->parse($_)) {
      my @f = map length($_) ? $_ : undef, $csv->fields;
      next unless @f >= 1;

      my $primary = pop @f;
      if ($record->{$primary}) {
        $record->{$primary}[$_] //= $f[$_]
          for 0 .. $#{ $record->{$primary} };
      }
      else {
        $record->{$primary} = \@f;
      }
    }
    else {
      warn "$0: $path:$.: parse failed; skipping...\n";
      next;
    }
  }
}

Your main program will resemble

my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;

The Data::Dumper module shows that the resulting hash, given the simple inputs from your question, is

$VAR1 = {
  '42' => [
    'one',
    'two',
    'three',
    'four'
  ]
};
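
To get the merged records back out as CSV, a minimal sketch (not part of the code above; it re-appends the key as the last column, assumes numeric keys for the sort, and writes to a hypothetical merged.csv) could be:

my $csv_out = Text::CSV->new( { eol => "\n" } );
open my $out, ">", "merged.csv" or die "$0: open merged.csv: $!";
for my $key ( sort { $a <=> $b } keys %rec ) {
  # undef fields that were never filled in go back to being empty columns
  $csv_out->print( $out, [ map( $_ // '', @{ $rec{$key} } ), $key ] );
}
close $out;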
Greg Bacon
+1  A: 

If you want a merge, you should really merge. First of all you have to sort your data by key, and then merge! You will beat even MySQL in performance; I have a lot of experience with this.

You can write something along these lines:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV_XS;
use autodie;

use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);
}
close $file1;
close $file2;

Redirect the output to a file and measure.

If you would like more sanity around the sort arguments, you can replace the file-opening part with:

(open my $file1, '-|') || exec('sort',  '-n',  "-k$sortpos",  '-t,',  $ARGV[0]);
(open my $file2, '-|') || exec('sort',  '-n',  "-k$sortpos",  '-t,',  $ARGV[1]);
Hynek -Pichi- Vychodil
This code was really helpful, thanks!
Dave