I'm examining some biological data which is basically a long list (a few million values) of integers, each saying how well the corresponding position in the genome is covered. Here is a graphical example for a data set: [example coverage plot]

I would like to look for "valleys" in this data, that is, regions which are significantly lower than their surrounding environment.

Note that the size of the valleys I'm looking for is not really known - it may range from 50 bases to a few thousand. Defining what counts as a valley is of course one of the questions I'm struggling with, but the examples above are relatively easy for me: [plot with example valleys marked]

What kind of paradigms would you recommend using to find those valleys? I mainly program using Perl and R.

Thanks!

+1  A: 

You can define a valley by different criteria:

  • depth
  • width
  • volume (depth*width)

You might also have a valley within a big mountain; do you want those too?

For example, there is a valley here: 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3

Try to explain in more detail how YOU (or any expert in your field) would choose the valleys, given the data.

You might also want to look at the watershed transform: http://en.wikipedia.org/wiki/Watershed_(image_processing)

Loïc Février
Generally, by volume. The wider it is, the less deep I require it to be to still consider it a valley. But there are two depths we can think of: an absolute depth and a relative one (with respect to the surroundings). I'm considering relative depth.
David B
Find points that have higher but close neighbors (choose the criterion you want, for example a neighbor higher by at least 300 within a 100-position neighborhood). These points will be your starting points: for each value h, compute the width/height/volume of the valley of height h that has this point as a minimum. With a criterion on width/height, stop at some point, for a given h.
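
As an illustration only (not part of the comment above), here is a minimal R sketch of that starting-point criterion applied to the toy series: the half-width of 3 stands in for the 100-position neighborhood, and the required drop of 300 is taken from the comment.

# flag positions that lie at least `drop` below the maximum of their neighborhood
x    <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)
k    <- 3     # neighborhood half-width (scaled down from 100 for the toy series)
drop <- 300   # required relative depth

candidates <- sapply(seq_along(x), function(i) {
    nb <- x[max(1, i - k):min(length(x), i + k)]
    max(nb) - x[i] >= drop
})
which(candidates)   # starting points from which a valley can be grown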
Loïc Février
A: 

You might want to try a peak detection function to identify the regions of interest. The desired minimum width of the valleys can then be specified with its span parameter.

It might be a good idea to smooth the data first, to get rid of noise peaks like the one in the right "valley" of the blue graph. A simple stats::filter call should be enough.

The final step would be to check the depth of the found "valleys". This really depends on your requirements. As a first approximation, you can simply compare the value at each detected minimum with the median level of the data.
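
A rough sketch of that pipeline, under my own assumptions (a plain moving-average smoother, a simple rolling-window minimum in place of a specific peak detection function, and an arbitrary 2-MAD depth threshold); the span and thresholds would need tuning to the actual coverage data:

# hypothetical illustration: smooth, flag windowed minima, keep the deep ones
coverage <- as.numeric(diffinv(rnorm(5000)))                  # toy coverage data
span     <- 101                                               # assumed minimum valley width
smoothed <- stats::filter(coverage, rep(1/51, 51), sides = 2) # moving-average smoothing

s <- span %/% 2
is.valley <- vapply(seq_along(smoothed), function(i) {
    lo <- max(1, i - s); hi <- min(length(smoothed), i + s)
    !is.na(smoothed[i]) && smoothed[i] == min(smoothed[lo:hi], na.rm = TRUE)
}, logical(1))

# depth check: at least 2 MADs below the overall median (threshold is arbitrary)
deep    <- smoothed < median(smoothed, na.rm = TRUE) - 2 * mad(coverage)
valleys <- which(is.valley & !is.na(deep) & deep)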

Andrei
+4  A: 

We do peak detection (and valley detection) using running medians and median absolute deviation. You can specify how much deviation from the running median means a peak.

As a next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.

The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function on it.

The application of the weighted median:

center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)

Here tmp is the temporary data vector and wdw the window size (which has to be odd). tmp is constructed by adding (wdw-1)/2 NA values on each side of the data vector. The weights are constructed using a customized function. For the MAD we use the same procedure, but on diff(data) instead of the data itself.

Running sample code:

require(aroma.light)
# make.weights : function to make weights on the basis of a normal distribution
# n is window size !!!!!!
make.weights <- function(n,
      type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
    type <- match.arg(type)
    x <- seq(-1,1,length.out=n)
    out <- switch(type,
          gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
          epanechnikov=0.75*(1-x^2),
          biweight=15/16*(1-x^2)^2,
          triweight=35/32*(1-x^2)^3,
          cosinus=pi/4*cos(x*pi/2))
    out <- out/sum(out)*n
    return(out)
}

# score.test : function to obtain a p-value based on the score test.
# It uses a normal approximation, but is still quite accurate when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
    n <- length(x)
    if(missing(w)) w<-rep(1,n)
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]

    if(sum(w)!=n) w <- w/sum(w)*n

    phat <- sum(x*w)/n
    z <- (phat-p0)/sqrt(p0*(1-p0)/n)
    p <- 1-pnorm(z)
    return(p)
}

# embed.na is a modification of embed, adding NA values
# to the beginning and end of x. Window size = 2n+1.
embed.na <- function(x,n){
    extra <- rep(NA,n)
    x <- c(extra,x,extra)
    out <- embed(x,2*n+1)
    return(out)
}

# running.score : function to calculate the weighted p-value for the chance of being
# in a run of peaks. This chance is based on the weighted proportion of peaks in the
# neighbourhood.
# The null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# For a correct calculation, the weights have to sum to n.

running.score <- function(sel,n=20,w,p0){
    if(missing(w)) w<- rep(1,2*n+1)
    if(missing(p0))p0 <- sum(sel,na.rm=T)/length(sel[!is.na(sel)])   # null hypothesis
    out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
    return(out)
}

# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
    wdw <- 2*n+1
    if(missing(w)) w <- rep(1,wdw)

    center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
    mad <- median(abs(x-center))*cte
    return(list(med=center,mad=mad))
}

##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)

peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))

pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)

plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")

The sample code above was developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series larger than 20,000 values you need a whole lot of memory; I can't run it on my computer any more.

Alternatively, you could work with an approximation of the numerical first and second derivatives to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative:

# first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
# second derivative
f.sec.deriv <- diff(f.deriv)
# minima and maxima of the first derivative: where f.sec.deriv changes sign
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)

op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
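
As a small follow-up sketch (my own addition, reusing series and n from the sample code above): valley centres can be read off as the points where the smoothed first derivative crosses zero from negative to positive.

sm <- lowess(series, f = n/length(series), delta = 1)$y
d1 <- diff(sm)
# a strict sign change of the first derivative from - to + marks a local minimum
valley.centres <- which(diff(sign(d1)) == 2) + 1
plot(series, type = "l")
points(valley.centres, series[valley.centres], col = "red")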
Joris Meys
+1 Hi Joris and thanks for your reply. This seems interesting and very relevant. Would you mind sharing a working example I could play with?
David B
@David: extra code and another option added.
Joris Meys
@David: Could you let me know how your experiences were? I'd like to check whether we could generalize our approach, but then I need to know if putting in the effort is worth it.
Joris Meys
Hi Joris, I'm still working on it and examining different approaches. I will contact you offline, if that's OK with you.
David B
@David: That's very much OK with me. I'll add my contact info to my profile page.
Joris Meys