ansaurus

Question

How do I sample n values at random nearest to value y when the data aren't continuous?

Answer 1

+1 A:

How about something like the following:

day = 1:1000

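# values to search around; they are excluded from the candidates in x
search = seq(from=5, to=max(day), by=30)
x = sort(setdiff(day, search))
# findInterval gives the index of the largest x below each search value;
# seq(..., length.out=2) adds the next index, i.e. one neighbour on each side
pos = match(x[unlist(lapply(findInterval(search, x), seq, length.out=2))], day)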

day[pos]
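
To see what each step does, here is a small sketch on a made-up vector with gaps. The values of day and search are assumptions chosen for illustration, not the asker's data:

```r
# Illustrative data with gaps (assumed values)
day <- c(1, 2, 4, 7, 8, 11, 12, 15)
search <- c(5, 10)

# Candidates: every observed day that is not itself a search value
x <- sort(setdiff(day, search))

# Index of the largest candidate <= each search value
idx <- findInterval(search, x)
idx
# [1] 3 5

# Extend each index to a run of two: the neighbour below and the one above
nbr <- x[unlist(lapply(idx, seq, length.out = 2))]

# Map the chosen values back to positions in the original vector
pos <- match(nbr, day)
day[pos]
# [1]  4  7  8 11
```

So 5 is bracketed by 4 and 7, and 10 by 8 and 11.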

To get the rows from your data.frame just subset it:

rows = data[pos, ]

This is maybe slightly cleaner than the unlist/lapply/seq combo:

pos = match(x[outer(c(0, 1), findInterval(search, x), `+`)], day)

Also note that if you want a larger window (say 4), it's just a matter of going back a bit:

pos = match(x[outer(-1:2, findInterval(search, x), `+`)], day)
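
The offsets generalise to any even window size. The following is only a sketch: the helper name window_pos and its argument n are assumptions, and it does not guard against indices that fall outside 1..length(x) near the ends of the data:

```r
# Hypothetical helper: positions of the n nearest candidates around each
# search value, n/2 below and n/2 above (n assumed even).
window_pos <- function(day, search, n = 4) {
  stopifnot(n %% 2 == 0)
  x <- sort(setdiff(day, search))
  offsets <- seq(-(n / 2 - 1), n / 2)   # n = 2 gives 0:1, n = 4 gives -1:2
  match(x[outer(offsets, findInterval(search, x), `+`)], day)
}

day <- 1:20
pos <- window_pos(day, search = c(5, 15), n = 4)
day[pos]
# [1]  3  4  6  7 13 14 16 17
```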

Charles 2010-10-16 22:03:23

much appreciated, Charles! I learned a lot from your example. cheers.

Maiasaura 2010-10-16 22:19:12

darn nice solution!

Joris Meys 2010-10-16 22:24:29

Thanks, glad it helped. Actually it seems more complex than I'd anticipated - there's probably a simpler way...

Charles 2010-10-16 22:39:35

Oh, and one thing I didn't mention is that this is just for a single species. You can just split your data.frame by species, or use tapply to work on a per-species basis.

Charles 2010-10-16 22:42:31
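
A minimal sketch of the per-species idea, assuming a data.frame with hypothetical columns species and day. The data and the helper name nearest_pos are made up for illustration:

```r
# Hypothetical data: two species observed on irregular days
obs <- data.frame(
  species = rep(c("a", "b"), each = 6),
  day     = c(1, 2, 4, 7, 8, 11,   2, 3, 6, 9, 10, 12)
)
search <- c(5, 10)

# The n = 2 lookup from the answer, wrapped in a helper (name is an assumption)
nearest_pos <- function(day, search) {
  x <- sort(setdiff(day, search))
  match(x[outer(c(0, 1), findInterval(search, x), `+`)], day)
}

# Split by species and pick the neighbouring rows within each group
picked <- lapply(split(obs, obs$species),
                 function(d) d[nearest_pos(d$day, search), ])

picked$a$day
# [1]  4  7  8 11
picked$b$day
# [1]  3  6  9 12
```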

I did forget one thing though. The lowest value from each sample should become the starting value for the next round. That throws a monkey wrench into the nice search sequence. But I should be able to work this out.

Maiasaura 2010-10-16 22:44:55

Answer 2

A:

I loved Charles's solution, which works perfectly for the case n=2. Alas, it does not extend well to larger windows: it still has the problem described by the OP that, with larger windows, the selection is not centered on the search value. Assuming n is even, I came up with the following solution, heavily based on Charles's idea.

The function handles the borders. If there are 100 days and the next midpoint is, say, the second-to-last day, a window of 4 would select index 101, which gives NA. The function shifts the window so that all selected indices lie within the original data. As a side effect, depending on the start (st), the length (l), i.e. the spacing between midpoints, and the window size (n), values near the start and the end have a higher chance of being selected twice. The length should always be at least twice the window size.

The function returns the indices of the bootstrap sample. They can be used like Charles's pos variable, on vectors and data frames.

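bboot <- function(day, st, l, n) {
  # n (the window size) is assumed to be even
  mid <- seq(st, max(day), by = l)   # midpoints to sample around
  x <- sort(setdiff(day, mid))       # candidates, midpoints excluded
  lx <- length(x)

  id <- sapply(mid,
          function(y) {
            # index of the first candidate above y; lx + 1 if there is none
            m <- match(TRUE, x > y, nomatch = lx + 1)
            # clamp the start so that all n indices stay within 1..lx
            # (lx - n + 1, not lx - n, keeps the last window at exactly n values)
            from <- min(lx - n + 1, max(1, m - n/2))
            seq(from = from, length.out = n)
          }
        )

  pos <- match(x[id], day)
  return(pos)
}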

Then

> day <- sample(1:100, 50)
> sample.rownr <- bboot(day, 10, 20, 6)
> sort(day)
 [1]  3  4  5  7  9 10 13 15 16 18 19 21 22 24 25 26 27 28 29 
[20] 30 31 32 35 36 38 40 45 49 51 52 54 55 58 59 62 65 69 72 73
[40] 74 80 84 87 88 91 92 94 97 98 99
> day[sample.rownr]
 [1]  5  7  9 13 15 16 27 28 29 31 32 35 40 45 49 51 52 54 62 
[20] 65 69 72 73 74 84 87 88 91 92 94
>
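
The border handling can also be checked in isolation. This is only a sketch: clamp_window is a hypothetical helper, not part of bboot, and it assumes m is the index of the first candidate above the midpoint, n is even, and lx is the number of candidates:

```r
# Hypothetical helper: first and last index of a window of n values around
# position m, shifted so that it stays within 1..lx.
clamp_window <- function(m, n, lx) {
  from <- min(lx - n + 1, max(1, m - n / 2))
  c(from = from, to = from + n - 1)
}

clamp_window(50, 4, 100)    # interior: indices 48 to 51
clamp_window(2, 4, 100)     # near the start: shifted to 1 to 4
clamp_window(100, 4, 100)   # near the end: shifted to 97 to 100
```

In the two border cases the window is no longer centered on the midpoint, which is why values near the ends can be picked more often.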

Edit: regarding bootstrapping for time series, have a look at the CRAN Task View on time series, especially the section on resampling. For irregular time series, the zoo package also offers further functionality that can come in handy.

Joris Meys 2010-10-17 00:17:59
