We do peak detection (and valley detection) using running medians and median absolute deviation. You can specify how much deviation from the running median means a peak.
In a next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.
The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function on it.
The application of the weighted median :
center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)
Here tmp is the temporary data vector and wdw the window size (has to be uneven). tmp is constructed by adding (wdw-1)/2 NA values at every side of the data vector. the weights are constructed using a customized function. For the mad we use the same procedure, but then on diff(data) instead of the data itself.
Running Sample code :
require(aroma.light)
# make.weights : function to make weights on basis of a normal distribution
# n is window size !!!!!!
make.weights <- function(n,
type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
type <- match.arg(type)
x <- seq(-1,1,length.out=n)
out <-switch(type,
gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
epanechnikov=0.75*(1-x^2),
biweight=15/16*(1-x^2)^2,
triweight=35/32*(1-x^2)^3,
cosinus=pi/4*cos(x*pi/2),
)
out <- out/sum(out)*n
return(out)
}
# score.test : function to become a p-value based on the score test
# uses normal approximation, but is still quite correct when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
n <- length(x)
if(missing(w)) w<-rep(1,n)
w <- w[!is.na(x)]
x <- x[!is.na(x)]
if(sum(w)!=n) w <- w/sum(w)*n
phat <- sum(x*w)/n
z <- (phat-p0)/sqrt(p0*(1-p0)/n)
p <- 1-pnorm(z)
return(p)
}
# embed.na is a modification of embed, adding NA strings
# to the beginning and end of x. window size= 2n+1
embed.na <- function(x,n){
extra <- rep(NA,n)
x <- c(extra,x,extra)
out <- embed(x,2*n+1)
return(out)
}
# running.score : function to calculate the weighted p-value for the chance of being in
# a run of peaks. This chance is based on the weighted proportion of the neighbourhood
# the null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# for a correct calculation, the weights have to sum up to n
running.score <- function(sel,n=20,w,p0){
if(missing(w)) w<- rep(1,2*n+1)
if(missing(p0))p0 <- sum(sel,na.rm=T)/length(sel[!is.na(sel)]) # null hypothesis
out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
return(out)
}
# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
wdw <- 2*n+1
if(missing(w)) w <- rep(1,wdw)
center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
mad <- median(abs(x-center))*cte
return(list(med=center,mad=mad))
}
##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)
peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))
pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)
plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")
The sample code above is developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series larger than 20000 values you need a whole lot of memory, I can't run it on my computer any more.
Alternatively, you could work with an approximation of the numerical derivative and second derivative to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative :
#first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
#second derivative
f.sec.deriv <- diff(f.deriv)
#minima and maxima defined by where f.sec.deriv changes sign :
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)
op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)