tags:

views:

99

answers:

3

I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop.

Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.

rangesA:

chrom   start   stop
 5       100     105
 1       200     250
 9       275     300

rangesB:

chrom    start    stop
  1       200      265
  5       99       106
  9       275      290

for each row in rangesA:

matches <- which((rangesB[,'chrom']  == rangesA[row,'chrom']) &&
                 (rangesB[,'start'] <= rangesA[row, 'start']) &&
                 (rangesB[,'stop'] >= rangesA[row, 'stop']))

I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?

+4  A: 

This would be a lot easier / faster if you can merge the two objects first.

ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
#  chrom startA stopA startB stopB
#1     1    200   250    200   265
#2     5    100   105     99   106
Joshua Ulrich
A: 

For your example data:

rangesA <- data.frame(
    chrom = c(5, 1, 9),
    start = c(100, 200, 275),
    stop = c(105, 250, 300)
)
rangesB <- data.frame(
    chrom = c(1, 5, 9),
    start = c(200, 99, 275),
    stop = c(265, 106, 290)
)

This will do it with sapply, such that each column is one row in rangesA and each row is corresponding row in rangesB:

> sapply(rangesA$stop, '>=', rangesB$start) & sapply(rangesA$start, '<=', rangesB$stop)
      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE FALSE
[3,] FALSE FALSE  TRUE
eyjo
+2  A: 

Use the IRanges/GenomicRanges packages from Bioconductor, which is made for dealing with these exact problems (and scales massively)

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")

There are a few appropriate containers for ranges on different chromosomes, one is RangesList

library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
#which rangesB wholly contain at least one rangesA?
ov <- countOverlaps(rangesB, rangesA, type="within")>0
Aaron Statham
Good pointer on IRanges, forgot about that. I didn't end up going with this as it didn't fit my own situation for a variety of reasons, but still good to know. My toy example left out a couple of key bits (my own fault) that made working w/ IRanges difficult for me to figure out, and the merge() solution provided a massive speedup as it was. Also, while it scales massively we also see lots of very small cases and what I'm assuming was the overhead of the S4 was slowing it down in those cases.
geoffjentry