I have a list of sorted coordinates (let's call it xycord.txt
) that looks like this:
chr1 10003486 10043713
chr1 10003507 10043106
chr2 10003486 10043713
chr2 10003507 10043162
chr2 10003532 10042759
In reality the this file is very2 large with 10^7 lines.
What I want to do is given another two-point coordinates I want to check if they
fall in between any coordinates in xycord.txt
file.
The current approach I have is super slow. Because
there are also many others two-point coordinates against this large xycord.txt
file.
Is there a fast way to do it?
#!/usr/bin/perl -w
my $point_to_check_x = $ARGV[0] || '10003488';
my $point_to_check_y = $ARGV[1] || '10003489';
my $chrid = $ARGV[2] || "chr1";
my %allxycordwithchr;
# skip file opening construct
while (<XYCORD_FILE>) {
my ($chr,$tx,$ty) = split(/\s+/,$_);
push @{$allxycordwithchr{$chr}},$tx."-".$ty;
}
my @chosenchr_cord = @{$allxycordwithchr{$chrid}};
for my $chro_cords (@chosenchr_cord){
my ($repox,$repoy) = split("-",$chro_cord);
my $stat = is_in_xycoordsfile($repox,$repoy,$point_to_check_x,$point_to_check_y);
if ($stat eq "IN"){
print "IN\n";
}
}
sub is_in_xycoordsfile {
my ($x,$y,$xp,$yp) = @_;
if ( $xp >= $x && $yp <= $y ) {
return "IN";
}
else {
return "OUT";
}
}
Update: I apologize for correcting this. In my earlier posting I oversimplified the problem.
Actually, there is one more query field (e.g. chromosome name). Hence the DB/RB-trees/SQL approaches maybe infeasible in this matter?