
I have a list of sorted coordinates (let's call it xycord.txt) that looks like this:

chr1    10003486        10043713
chr1    10003507        10043106
chr2    10003486        10043713
chr2    10003507        10043162
chr2    10003532        10042759

In reality this file is very, very large, with about 10^7 lines.

What I want to do is: given another pair of coordinates, check whether it falls within any of the ranges in the xycord.txt file.

My current approach is very slow, because there are also many such coordinate pairs to check against this large xycord.txt file.

Is there a fast way to do it?

#!/usr/bin/perl -w

my $point_to_check_x = $ARGV[0] || '10003488';
my $point_to_check_y = $ARGV[1] || '10003489';
my $chrid = $ARGV[2] || "chr1";

my %allxycordwithchr;   
# skip file opening construct
while (<XYCORD_FILE>) {
  my ($chr,$tx,$ty) = split(/\s+/,$_);
  push @{$allxycordwithchr{$chr}},$tx."-".$ty;
}


my @chosenchr_cord = @{$allxycordwithchr{$chrid}};

# linear scan: check the query range against every stored range for this chromosome
for my $chro_cord (@chosenchr_cord) {
  my ($repox,$repoy) = split("-",$chro_cord);
  my $stat = is_in_xycoordsfile($repox,$repoy,$point_to_check_x,$point_to_check_y);
  if ($stat eq "IN"){
     print "IN\n";
  }
}

sub is_in_xycoordsfile  {

    my ($x, $y, $xp, $yp) = @_;
    if ( $xp >= $x && $yp <= $y ) {
        return "IN";
    }
    else {
        return "OUT";
    }

}

Update: I apologize for the correction; in my earlier posting I oversimplified the problem.

Actually, there is one more query field (the chromosome name). Does that make the DB/RB-tree/SQL approaches infeasible here?

+4  A: 

Try a binary search rather than a sequential search. There are two apparent ways to do this:

  1. Split the file into smaller files (xycord001.txt, xycord002.txt and so on). Now you can easily determine which file to search, and each search is much faster. The big drawback is that adding data to a file later can get messy.

  2. Do a binary search over the file itself: start at the middle, splitting the file into two logical parts; decide in which part your coordinates might be, and look at the middle of that part. You'll rapidly (exponentially) reduce the size of the region you're searching, until you're down to a single line. Read more about seeking into files; there is a Perl example of binary searching a file here.
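A rough sketch of option 2 (not code from this answer): bisect byte offsets in the file and re-align to line boundaries. It assumes the file, or the per-chromosome slice being searched, is sorted numerically by its start coordinate; the target value and file name are illustrative.

#!/usr/bin/perl
use strict;
use warnings;

my $file   = 'xycord.txt';
my $target = 10003500;                  # start coordinate to locate (example)

open my $fh, '<', $file or die "$file: $!";
my ($lo, $hi) = (0, -s $file);

# Narrow [$lo, $hi) down to a small byte range around the target.
while ($hi - $lo > 1) {
    my $mid = int(($lo + $hi) / 2);
    seek $fh, $mid, 0;
    <$fh>;                              # discard the partial line we landed in
    my $line = <$fh>;
    if (!defined $line) { $hi = $mid; next }
    my (undef, $start) = split /\s+/, $line;
    if ($start < $target) { $lo = $mid } else { $hi = $mid }
}

# Re-align to a line boundary and scan forward to the first candidate line.
seek $fh, $lo, 0;
<$fh> if $lo > 0;
while (my $line = <$fh>) {
    my (undef, $start, undef) = split /\s+/, $line;
    next if $start < $target;
    print "first line at or after $target: $line";
    last;
}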

EDIT: Generally, using a database or DB file is preferred; however, a binary search over the file is the quick-and-dirty way, especially if the script should run on different files on different machines (thanks @MiniQuark, @Chas. Owens).

Adam Matan
I like your ideas, but I think your answer needs to be tweaked a bit because you do not seem to take into account the fact that we are searching for ALL ranges that INCLUDE the given range. A simple binary approach would not work: you need at least two binary trees (one for the x coordinate and one for the y coordinate), or use another index structure, such as R-trees.
MiniQuark
Or perhaps, compare according to the distance between the two points - that might do the trick.
Adam Matan
+9  A: 

A few suggestions:

  1. You could store your data in a database such as MySQL or SQLite. You could then run a simple query such as:

    "SELECT * FROM coordinates WHERE x<"+xp+" AND y>"+yp
    

    Provided you have indexes on x and y, this should be super fast.

  2. You could also take a look at R-Trees. I used R-trees a few years ago to store tens of thousands of city coordinates, and I could find the closest city from a given point in a fraction of a second. In your example, you are storing 1D ranges but I'm pretty sure R-trees would work fine too. You might find R-tree implementations for Perl here. Or you can use RectanglesContainingDot, which seems to do what you need.

  3. You could cache the coordinates in memory: each number looks like it would take 4 bytes to store, so this would lead to about 80 MB of memory usage if you have 10^7 pairs of numbers. That's what Firefox uses on my machine! Of course, if you do this you need some sort of daemon running to avoid reloading the whole file every time you need to check coordinates.

  4. You can mix solutions 2 & 3.

My preference is for solution 1: it has a good efficiency/complexity ratio.
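As a hedged sketch of solution 1 (not code from this answer): a minimal DBI example, assuming DBD::SQLite is installed and the file has already been loaded into a table coordinates(chr, x, y). The table and column names are illustrative, the chromosome column follows the question's update, and placeholders are used as the comment below recommends.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my ($chrid, $xp, $yp) = ('chr1', 10003488, 10003489);

my $dbh = DBI->connect("dbi:SQLite:dbname=coords.sqlite", "", "",
                       { RaiseError => 1 });

# Find every stored range on this chromosome that contains the query range.
my $sth = $dbh->prepare(
    "SELECT chr, x, y FROM coordinates WHERE chr = ? AND x <= ? AND y >= ?"
);
$sth->execute($chrid, $xp, $yp);

while (my ($chr, $x, $y) = $sth->fetchrow_array) {
    print "IN: $chr\t$x\t$y\n";
}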

MiniQuark
For number 1, you should be using SQL placeholders (my $sth = $dbh->prepare("SELECT * FROM coordinates WHERE x<? AND y>?"); $sth->execute($xp, $yp);) instead of interpolating values directly into the SQL. If the query is run more than once, this is more efficient, since the statement only needs to be prepared once. If xp/yp come from an untrusted source (e.g., user input), it also protects against SQL injection attacks.
Dave Sherohman
+7  A: 

In addition to Udi Pasmon's good advice, you could also convert your large file to a DBM file and then tie it to a hash for easy lookups.

Convert the file:

#!/usr/bin/perl

use strict;
use warnings;

use DB_File;

my $dbfile = "coords.db";

tie my %coords, "DB_File", $dbfile
    or die "could not open $dbfile: $!";

while (<>) {
    my ($x, $y) = split;
    $coords{"$x-$y"} = 1;
}

Check to see if arguments are members of the file:

#!/usr/bin/perl

use strict;
use warnings;

use DB_File;

my ($x, $y) = @ARGV;

tie my %coords, "DB_File", "coords.db"
    or die "could not open coords.db: $!";

print "IN\n" if $coords{"$x-$y"};
Chas. Owens
Nice! Didn't know that trick. Would it work for very large files?
Adam Matan
Yes, the DBM is stored on disk, not in memory. Converting the file will take a long time, but once it is converted look ups will be very fast (it uses a disk based hash table by default).
Chas. Owens
Just make sure you don't use the `keys` function on the tied hash (you will get back every key in the database). If you want to iterate over the entries you will need to use `each` instead.
Chas. Owens
I'll make the same comment as I did for Udi Pasmon's answer: I like your idea, but you need to tweak your algorithm to make it search ALL ranges that INCLUDE the given range. If I am not mistaken, your algorithm will only answer "IN" if the exact same coordinates are found in the file. It should answer "IN" if any range (x..y) in the file INCLUDES the given range (xp..yp), that is: x <= xp <= yp <= y.
MiniQuark
@Chas: I updated my OP, truly sorry for that. Based on my new update, maybe the DB approach is too costly?
neversaint
+1  A: 

Restating your question: do you want to print all ranges in the file that contain the (x, y) pair and also have the same id? If that's the case, you don't need to parse the file and store it in memory.

# $chrid, $point_to_check_x and $point_to_check_y are set as in the question's code
while (<DATA>) {
    my ($chr, $tx, $ty) = split /\s+/;
    print "IN : $chr, $tx, $ty\n"
        if $chr eq $chrid
            && $point_to_check_x >= $tx
            && $point_to_check_y <= $ty;
}
Leonardo Herrera
+2  A: 

If both inputs, or at least the large one, are sorted you can try a variation of a merge join between them.

If the lookup file (the smaller one) isn't too large, the easiest approach is to read it in and put it in a hash keyed by name, with sorted arrays of start-end pairs as values. Then go through each row of the large file and look up, by its name, the array of lookup ranges that could match it. Go through each pair in that array: if the lookup start is less than the input pair's start, discard that value, as it can no longer match anything; if the lookup start is past the input end, break out of the loop, as no further lookup values can match; if the lookup end is before the input end, you have a match, and you can add the input and lookup to the list of matches.

My Perl is rusty, so no Perl example code, but I threw together a quick-and-dirty Python implementation. On an arbitrary, randomly generated dataset, matching 10M rows against 10k lookup rows (14k matches) took 22s, against 100k lookups (145k matches) took 24s, and against 1M lookups (1.47M matches) took 35s.

If the smaller file is too big to fit in memory at once, it can be loaded in batches of keys as the keys are encountered in the input file.
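Since the answer gives no Perl, here is a rough Perl sketch of the merge join described above (not the poster's code). The query file queries.txt and the output format are illustrative assumptions; both files are assumed to be sorted by start coordinate within each chromosome.

#!/usr/bin/perl
use strict;
use warnings;

# Load the small file: chromosome name => sorted list of [start, end] query ranges.
my %lookup;
open my $qfh, '<', 'queries.txt' or die "queries.txt: $!";
while (<$qfh>) {
    my ($chr, $xp, $yp) = split;
    push @{ $lookup{$chr} }, [ $xp, $yp ];
}
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %lookup;

# Stream the big file once, pruning queries that can never match again.
open my $fh, '<', 'xycord.txt' or die "xycord.txt: $!";
while (<$fh>) {
    my ($chr, $tx, $ty) = split;
    my $queries = $lookup{$chr} or next;

    # Drop queries whose start is already behind this (and every later) range.
    shift @$queries while @$queries && $queries->[0][0] < $tx;

    for my $q (@$queries) {
        last if $q->[0] > $ty;                       # later queries start even later
        print "IN: $chr $tx $ty contains @$q\n" if $q->[1] <= $ty;
    }
}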

Ants Aasma
+1  A: 

OK, so let me clarify the problem, based on my understanding of your code. You have a file with a very large number of entries in it. Each entry includes a label "chr1", "chr2", etc. and two numbers, the first of which is less than the second. You then have a query which comprises a label and a number, and you wish to know whether there is a record in the large file which has the same label as the query, and has the two values such that one is less than the query number and the other is greater than it. Essentially, whether the number in the query is within the interval specified by the two numbers in the record.

Provided my understanding is correct, the first thing to notice is that any of the records which do not have the same label as the query have no part to play in the problem. So you can ignore them. Your program reads them all in, puts them in a hash and then doesn't look at most of the data. Sure, if you have several queries to do, you're going to need to keep data for each of the labels you're interested in, but you can throw the rest away as early as possible. This will keep your memory requirement down.

I would be tempted to take this a little further. Is there a possibility of breaking the huge file up into smaller files? It would seem to be a good idea to break it into files which have only certain labels in them. For example, you might have one file per label, or one file for all data with labels beginning with "a", and so on. This way you open only the files you know you're going to be interested in, and save yourself a lot of time and effort.
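A tiny sketch of that per-label split, assuming the label is the first column as in the question's sample data (the output file names are illustrative):

#!/usr/bin/perl
use strict;
use warnings;

my %out;    # label => output filehandle
open my $in, '<', 'xycord.txt' or die "xycord.txt: $!";
while (<$in>) {
    my ($chr) = split;
    unless ($out{$chr}) {
        open $out{$chr}, '>', "$chr.txt" or die "$chr.txt: $!";
    }
    print { $out{$chr} } $_;    # copy the line into its label's file
}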

These changes alone may make enough of a difference for you. If you still need more performance, I would start thinking about how you store the records you're going to need. Storing them ordered by the lower (or higher) of the two numbers should cut down the time taken to find what you're after, particularly if you store them in a binary search tree or a trie.

That should give you enough to work on.

Tim
A: 

PDL for genomic data processing

We processed a lot of files in the same format as the one in your question and found that PDL (documentation) is a very good tool for doing this. You need some time to learn it --- but it's definitely worth the effort (if you do genomic data processing): PDL can process huge files a few thousand times faster than MySQL.

Here are some hints where to go:

First of all, PDL is a language somewhat like Matlab --- but fully integrated with Perl. Read the documentation and do some examples. Consult a mathematician for advice on which features to use for which purpose.

PDL stores its data in plain C arrays. Learn about Inline::C and access this data directly from C if PDL doesn't do the job for you. To me, PDL and Inline::C seem like a perfect match: PDL for high-level operations, Inline::C for anything missing. Still, PDL is as fast as your best C because it does its work in C.

Use PDL::IO::FastRaw to store and access data in files on disk. I often write the files "by hand" (see below) and read them as memory-mapped files (using PDL::IO::FastRaw::mapfraw, often with the flag ReadOnly=>1). This is the most efficient way to read data from disk on Linux.

The format of the data files is trivial: just a sequence of C numbers. You can easily write such files in Perl with 'print FileHandle pack "i*",@data;'. Check 'perldoc -f pack'.
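For illustration, a small sketch of writing such a raw file from xycord.txt (the output file name is an assumption; reading the result back with PDL::IO::FastRaw, including its header file, is left to the PDL documentation):

#!/usr/bin/perl
use strict;
use warnings;

open my $in,  '<',     'xycord.txt' or die "xycord.txt: $!";
open my $out, '>:raw', 'coords.bin' or die "coords.bin: $!";

while (<$in>) {
    my (undef, $tx, $ty) = split;
    # two native C ints per record, as suggested above with pack "i*"
    print {$out} pack "i*", $tx, $ty;
}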

In my experience, just reading the input files line by line and writing them out in binary format is the slowest part of the processing: once the data is ready for PDL to mmap, processing is much faster.

I hope this advice helps --- even though not much code is given.

Yaakov Belch