My task is to create a parent-child hierarchy file using perl.
Sample Input file (tab delimited). The records would be arranged in the file in random order and the "parent" may appear after the "child".
S5 S3
S5 S8
ROOT S1
S1 S7
S2 S5
S3 S4
S1 S2
S4 77
S2 S9
S3 88
Sample Output File (tab delimited)
ROOT S1 S2 S5 S3 S4 77
ROOT S1 S2 S5 S3 88
ROOT S1 S7
ROOT S1 S2 S5 S8
ROOT S1 S2 S9
The code which produces the above output file
use strict;
# usage: perl parent_child_generator.pl input.txt output.txt
my $input0=$ARGV[0] or die "must provide input.txt as the first argument\n";
my $output1=$ARGV[1] or die "must provide output.txt as the second argument\n";
open(IN0,"<",$input0) || die "Cannot open $input0 for reading: $!";
open(OUT1,">",$output1) || die "Cannot open $output1 for writing: $!";
sub trim
{
my $string=shift;
$string=~s/\r?\n$//;
$string=~s/^\s+//;
$string=~s/\s+$//;
return $string;
}
sub connectByPrior
{
my $in_child=$_[0];
my %in_hash=%{$_[1]};
my @anscestor_arr;
for (sort keys %in_hash)
{
my $key=$_;
my @key_arr=split(/\t/,$key);
my $parent=$key_arr[0];
my $child=$key_arr[1];
if ($in_child eq $child)
{
push (@anscestor_arr,$parent);
@anscestor_arr=(@{connectByPrior($parent,\%in_hash)},@anscestor_arr);
last;
}
}
return \@anscestor_arr;
}
my %parent_hash;
my %child_hash;
my %unsorted_hash;
while(<IN0>)
{
my @cols=split(/\t/);
for (my $i=0; $i < scalar(@cols); $i++)
{
$cols[$i]= trim($cols[$i]);
}
my $parent=$cols[0];
my $child=$cols[1];
my $parent_child="$parent\t$child";
$parent_hash{$parent}=1;
$child_hash{$child}=1;
$unsorted_hash{$parent_child}=1;
}
close(IN0);
my @lev0_arr;
for (sort keys %child_hash)
{
my $rec=$_;
if (!exists($parent_hash{$rec}))
{
push (@lev0_arr,$rec);
}
}
for (@lev0_arr)
{
my $child=$_;
my @anscestor_arr=@{connectByPrior($child,\%unsorted_hash)};
push (@anscestor_arr,$child);
print OUT1 join("\t",@anscestor_arr)."\n";
}
Question: The code works fine if input file is not too big. The actual input file contains more than 200k lines and the code is taking too long time to process the output. What improvements / changes do you suggest so that it does not take too long to process?