I'm examining some biological data which is basically a long list (a few million values) of integers, each saying how well the corresponding position in the genome is covered. Here is a graphical example for a data set: [example coverage plot]

I would like to look for "valleys" in this data, that is, regions which are significantly lower than their surrounding environment.

Note that the size of the valleys I'm looking for is not really known - it may range from 50 bases to a few thousand. Defining what counts as a valley is of course one of the questions I'm struggling with, but the examples above are relatively easy for me: [plot with example valleys marked]

What kind of paradigms would you recommend using to find those valleys? I mainly program using Perl and R.

Thanks!

+1  A: 

You can define a valley by different criteria:

  • depth
  • width
  • volume (depth*width)

You might also have a valley within a big mountain; do you want those too?

For example, there is a valley here: 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3

Try to explain in more detail how YOU (or any expert in your field) would choose the valleys, given the data.

You might also want to look at the watershed transform: http://en.wikipedia.org/wiki/Watershed_(image_processing)

Loïc Février
Generally, by volume. The wider it is, the less deep I require it to be to still consider it a valley. But there are two depths we can think of: an absolute depth and a relative one (with respect to the surroundings). I'm considering relative depth.
David B
Find points that have higher but close neighbors (choose the criterion you want, for example a neighbor higher by at least 300 within a 100-position neighborhood). These points will be your starting points: for each value h, compute the width/height/volume of the valley of height h that has this point as a minimum. With a criterion on width/height, stop at some point, for a given h.
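
As an illustration only (not part of the comment above), here is a minimal R sketch of that starting-point criterion applied to the toy series: the half-width of 3 stands in for the 100-position neighborhood, and the required drop of 300 is taken from the comment.

# flag positions that lie at least `drop` below the maximum of their neighborhood
x    <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)
k    <- 3     # neighborhood half-width (scaled down from 100 for the toy series)
drop <- 300   # required relative depth

candidates <- sapply(seq_along(x), function(i) {
    nb <- x[max(1, i - k):min(length(x), i + k)]
    max(nb) - x[i] >= drop
})
which(candidates)   # starting points from which a valley can be grown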
Loïc Février
A: 

You might want to try a peak detection function to identify the regions of interest. The desired minimum width of the valleys can then be specified with its span parameter.

It might be a good idea to smooth the data first, to get rid of noise peaks like the one in the right "valley" of the blue graph. A simple stats::filter call should be enough.

The final step would be to check the depth of the found "valleys". This really depends on your requirements. As a first approximation, you can simply compare the value at each detected minimum with the median level of the data.
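
A rough sketch of that pipeline, under my own assumptions (a plain moving-average smoother, a simple rolling-window minimum in place of a specific peak detection function, and an arbitrary 2-MAD depth threshold); the span and thresholds would need tuning to the actual coverage data:

# hypothetical illustration: smooth, flag windowed minima, keep the deep ones
coverage <- as.numeric(diffinv(rnorm(5000)))                  # toy coverage data
span     <- 101                                               # assumed minimum valley width
smoothed <- stats::filter(coverage, rep(1/51, 51), sides = 2) # moving-average smoothing

s <- span %/% 2
is.valley <- vapply(seq_along(smoothed), function(i) {
    lo <- max(1, i - s); hi <- min(length(smoothed), i + s)
    !is.na(smoothed[i]) && smoothed[i] == min(smoothed[lo:hi], na.rm = TRUE)
}, logical(1))

# depth check: at least 2 MADs below the overall median (threshold is arbitrary)
deep    <- smoothed < median(smoothed, na.rm = TRUE) - 2 * mad(coverage)
valleys <- which(is.valley & !is.na(deep) & deep)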

Andrei
+4  A: 

We do peak detection (and valley detection) using running medians and median absolute deviation. You can specify how much deviation from the running median means a peak.

As a next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.

The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function on it.

The application of the weighted median:

center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)

Here tmp is the temporary data vector and wdw the window size (which has to be odd). tmp is constructed by adding (wdw-1)/2 NA values on each side of the data vector. The weights are constructed using a customized function. For the MAD we use the same procedure, but on diff(data) instead of the data itself.

Running sample code:

require(aroma.light)
# make.weights : function to make weights on the basis of a normal distribution
# n is window size !!!!!!
make.weights <- function(n,
      type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
    type <- match.arg(type)
    x <- seq(-1,1,length.out=n)
    out <- switch(type,
          gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
          epanechnikov=0.75*(1-x^2),
          biweight=15/16*(1-x^2)^2,
          triweight=35/32*(1-x^2)^3,
          cosinus=pi/4*cos(x*pi/2))
    out <- out/sum(out)*n
    return(out)
}

# score.test : function to obtain a p-value based on the score test.
# It uses a normal approximation, but is still quite accurate when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
    n <- length(x)
    if(missing(w)) w<-rep(1,n)
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]

    if(sum(w)!=n) w <- w/sum(w)*n

    phat <- sum(x*w)/n
    z <- (phat-p0)/sqrt(p0*(1-p0)/n)
    p <- 1-pnorm(z)
    return(p)
}

# embed.na is a modification of embed, adding NA values
# to the beginning and end of x. Window size = 2n+1.
embed.na <- function(x,n){
    extra <- rep(NA,n)
    x <- c(extra,x,extra)
    out <- embed(x,2*n+1)
    return(out)
}

# running.score : function to calculate the weighted p-value for the chance of being
# in a run of peaks. This chance is based on the weighted proportion of peaks in the
# neighbourhood.
# The null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# For a correct calculation, the weights have to sum to n.

running.score <- function(sel,n=20,w,p0){
    if(missing(w)) w<- rep(1,2*n+1)
    if(missing(p0))p0 <- sum(sel,na.rm=T)/length(sel[!is.na(sel)])   # null hypothesis
    out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
    return(out)
}

# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
    wdw <- 2*n+1
    if(missing(w)) w <- rep(1,wdw)

    center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
    mad <- median(abs(x-center))*cte
    return(list(med=center,mad=mad))
}

##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)

peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))

pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)

plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")

The sample code above was developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series larger than 20,000 values you need a whole lot of memory; I can't run it on my computer any more.

Alternatively, you could work with an approximation of the numerical first and second derivatives to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative:

# first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
# second derivative
f.sec.deriv <- diff(f.deriv)
# minima and maxima of the first derivative: where f.sec.deriv changes sign
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)

op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
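
As a small follow-up sketch (my own addition, reusing series and n from the sample code above): valley centres can be read off as the points where the smoothed first derivative crosses zero from negative to positive.

sm <- lowess(series, f = n/length(series), delta = 1)$y
d1 <- diff(sm)
# a strict sign change of the first derivative from - to + marks a local minimum
valley.centres <- which(diff(sign(d1)) == 2) + 1
plot(series, type = "l")
points(valley.centres, series[valley.centres], col = "red")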
Joris Meys
+1 Hi Joris and thanks for your reply. This seems interesting and very relevant. Would you mind sharing a working example I could play with?
David B
@David: extra code and another option added.
Joris Meys
@David: Could you let me know how your experiences were? I'd like to check whether we could generalize our approach, but then I need to know if putting in the effort is worth it.
Joris Meys
Hi Joris, I'm still working on it and examining different approaches. I will contact you offline, if that's OK with you.
David B
@David: That's very much OK with me. I'll add my contact info to my profile page.
Joris Meys