I have two text files, each containing two columns of data (position and value), sorted by position.

Here is an example of the first file (file A):

100   1
101   1
102   0
103   2
104   1
...

Here is an example of the second file (B):

20    0
21    0
...
100   2
101   1
192   3
193   1
...

Instead of reading one of the two files into a hash table, which is prohibitive due to memory constraints, what I would like to do is walk through two files simultaneously, in a stepwise fashion.

What this means is that I would like to stream through lines of either A or B and compare position values.

If the two positions are equal, then I perform a calculation on the values associated with that position.

Otherwise, if the positions are not equal, I move through lines of file A or file B until the positions are equal (when I again perform my calculation) or I reach EOF of both files.

Is there a way to do this in Perl?

+1  A: 

For looping through files you can use the core Tie::File module. It represents a regular text file as an array, one line per element, without loading the whole file into memory.
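
As a rough sketch of how that could look for this problem (assuming whitespace-separated, numerically sorted positions; the `common_positions` name and the returned `[position, value_a, value_b]` triples are made up for illustration), two tied arrays can be walked with a pair of indices:

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper: walk two position-sorted files through Tie::File
# arrays and return [position, value_a, value_b] for each shared position.
sub common_positions {
    my ($path_a, $path_b) = @_;

    # Open both files read-only; each array element is one (chomped) line.
    tie my @lines_a, 'Tie::File', $path_a, mode => O_RDONLY
        or die "Cannot tie $path_a: $!";
    tie my @lines_b, 'Tie::File', $path_b, mode => O_RDONLY
        or die "Cannot tie $path_b: $!";

    my @matches;
    my ($i, $j) = (0, 0);
    while ($i < @lines_a and $j < @lines_b) {
        my ($pos_a, $val_a) = split ' ', $lines_a[$i];
        my ($pos_b, $val_b) = split ' ', $lines_b[$j];

        if    ($pos_a < $pos_b) { $i++ }    # A is behind: advance A
        elsif ($pos_b < $pos_a) { $j++ }    # B is behind: advance B
        else {                              # same position in both files
            push @matches, [ $pos_a, $val_a, $val_b ];
            $i++;
            $j++;
        }
    }

    untie @lines_a;
    untie @lines_b;
    return @matches;
}
```

Collecting the matches in a list is only for demonstration; for very large inputs the calculation would go directly in the final branch. Reading the two filehandles line by line is likely to be faster, since Tie::File does extra bookkeeping behind the array interface.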

eugene y
+4  A: 

If the files are sorted, step through them based on which one has the lower position.

Pseudocode:

read Apos, Aval from A # initial values
read Bpos, Bval from B
while both reads succeeded # stop once either file runs out of lines
  if Apos == Bpos then
    compare()
    read Apos, Aval from A # advance both files to get a new position
    read Bpos, Bval from B
  else if Apos < Bpos then
    read Apos, Aval from A # A is behind, so advance A only
  else
    read Bpos, Bval from B # B is behind, so advance B only
  fi
end

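You could also use join(1) to isolate the lines with common positions and process them at your leisure. Note that join expects both inputs to be sorted lexically on the join field, so numerically sorted positions of differing widths (99 before 100, say) may need re-sorting first.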

rjp
You're duplicating too much code in that pseudo code. :)
brian d foy
+3  A: 

This looks like a problem one is likely to run into, for example when matching up database table data by key. Here's an implementation of the pseudocode provided by rjp.

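#!/usr/bin/perl

use strict;
use warnings;

# Read the next line from $fh and return [position, value],
# or nothing at EOF (or if the handle is undefined).
sub read_file_line {
  my $fh = shift;

  # Test with defined() so that a final line of "0" is not mistaken for EOF.
  if ($fh and defined(my $line = <$fh>)) {
    chomp $line;
    # split ' ' splits on any run of whitespace, so tabs and spaces both work
    return [ split(' ', $line) ];
  }
  return;
}

sub compute {
  my ($value1, $value2) = @_;
  # do something with the 2 values
}

# Note: these opens are not checked for failure.
open(my $f1, '<', "file1");
open(my $f2, '<', "file2");

my $pair1 = read_file_line($f1);
my $pair2 = read_file_line($f2);

# Loop until either file is exhausted.
while ($pair1 and $pair2) {
  if ($pair1->[0] < $pair2->[0]) {
    # file1 is behind: advance it
    $pair1 = read_file_line($f1);
  } elsif ($pair2->[0] < $pair1->[0]) {
    # file2 is behind: advance it
    $pair2 = read_file_line($f2);
  } else {
    # same position in both files: compute, then advance both
    compute($pair1->[1], $pair2->[1]);
    $pair1 = read_file_line($f1);
    $pair2 = read_file_line($f2);
  }
}

close($f1);
close($f2);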

Hope this helps!

Terence
One assumes there's a `use autodie` in there as well to check those bare opens for errors. ;)
pjf
This worked well as a start, thanks! One complication is that the `while ($pair1 and $pair2)` test will cause the loop to finish as soon as either one of the files reaches EOF. My question, as framed, makes this a non-issue. However, I do need to handle the other two cases, where a position appears in only one of the files. So I modified `read_file_line` to return either the next line or the current line, and I keep a pair of booleans to check if the pair-line has changed. Instead of testing for EOF, I test whether both lines were left unchanged by the last calls to `read_file_line`. If so, then I can safely exit the `while` loop.
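
For illustration, here is a minimal sketch of one way to cover all three cases. It is not the modification described above: it loops while either file still has a record, rather than tracking unchanged lines with booleans. The names `next_record` and `merge_walk` and the callback keys `both`, `only_a` and `only_b` are hypothetical, and the input is assumed to be well-formed, whitespace-separated and numerically sorted.

```perl
use strict;
use warnings;

# Read the next "position value" record, or return nothing at EOF.
sub next_record {
    my $fh = shift;
    defined(my $line = <$fh>) or return;
    return [ split ' ', $line ];
}

# Walk both handles to the end, calling one callback per position:
#   $on->{both}->($pos, $val_a, $val_b)   position present in both files
#   $on->{only_a}->($pos, $val_a)         position present only in A
#   $on->{only_b}->($pos, $val_b)         position present only in B
sub merge_walk {
    my ($fh_a, $fh_b, $on) = @_;
    my $rec_a = next_record($fh_a);
    my $rec_b = next_record($fh_b);

    # Keep going until BOTH files are exhausted.
    while ($rec_a or $rec_b) {
        if (!$rec_b or ($rec_a and $rec_a->[0] < $rec_b->[0])) {
            # B is exhausted, or A is behind B
            $on->{only_a}->(@$rec_a);
            $rec_a = next_record($fh_a);
        }
        elsif (!$rec_a or $rec_b->[0] < $rec_a->[0]) {
            # A is exhausted, or B is behind A
            $on->{only_b}->(@$rec_b);
            $rec_b = next_record($fh_b);
        }
        else {
            # same position in both files
            $on->{both}->($rec_a->[0], $rec_a->[1], $rec_b->[1]);
            $rec_a = next_record($fh_a);
            $rec_b = next_record($fh_b);
        }
    }
    return;
}
```

It would be called with two open filehandles and a hash of three code references, for example `merge_walk($f1, $f2, { both => \&compute, only_a => sub { ... }, only_b => sub { ... } })`. Only one record per file is held in memory at any time.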
Alex Reynolds