I have a dataset that includes a list of species, their counts, and the day count from when the survey began. Since many days were not sampled, day is not continuous. So for example, there could be birds counted on day 5,6,9,10,15,34,39 and so on. I set the earliest date to be day 0.

Example data:

species     counts      day
Blue tit    234         0
Blue tit    24          5
Blue tit    45          6
Blue tit    32          9
Blue tit    6           10
Blue tit    98          15
Blue tit    40          34
Blue tit    57          39
Blue tit    81          43
..................

I need to bootstrap this data and get a resulting dataset where I specify when to start, what interval to proceed in and number of points to sample.

Example: Let's say I randomly pick day 5 as the start day, the interval as 30, and number of rows to sample as 2. It means that I will start at 5, add 30 to it, and look for 2 rows around 35 days (but not day 35 itself). In this case I will grab the two rows where day is 34 and 39.

Next I add 30 to 35 and look for two points around 65. Rinse, repeat till I get to the end of the dataset.

I've written this function to do the sampling but it has flaws (see below):

resample <- function(x, ...) x[sample.int(length(x), ...)]

# l is the interval, n is # points to sample.
# This is called by another function that specifies start time among other info.
locate_points <- function(dataz, l, n)
{
    tlength = 0
    i = 1
    while (tlength < n)
    {
        low = l - i
        high = l + i
        if (low <= min(dataz$day)) { low = min(dataz$day) }
        if (high >= max(dataz$day)) { high = max(dataz$day) }
        test = resample(dataz$day[dataz$day > low & dataz$day < high & dataz$day != l])
        tlength = length(test)
        i = i + 1
    }
    test = sort(test)
    k = test[1:n]
    return(k)
}

Two issues I need help with:

  1. While my function does return the desired number of points, they are not centered around my search value. That makes sense: as the window gets wider I get more points, and when I sort those and pick the first n, I tend to get the lowest values rather than the ones nearest the search value.

  2. Second, how do I get the actual rows out? For now I have another function that locates these rows using which, then rbinds them together. It seems like there should be a better way.

thanks!

+1  A: 

How about something like the following:

day = 1:1000

search = seq(from=5, to=max(day), by=30)
x = sort(setdiff(day, search))
pos = match(x[unlist(lapply(findInterval(search, x), seq, len=2))], day)

day[pos]

To get the rows from your data.frame just subset it:

rows = data[pos, ]
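
As a minimal sketch of that subsetting step, here is the same approach applied to the example rows from the question. The data frame name `birds` is an assumption made for illustration; the start day 5 and interval 30 are taken from the question's example.

```r
# Example rows from the question; `birds` is a hypothetical name.
birds <- data.frame(
  species = "Blue tit",
  counts  = c(234, 24, 45, 32, 6, 98, 40, 57, 81),
  day     = c(0, 5, 6, 9, 10, 15, 34, 39, 43)
)

search <- seq(from = 5, to = max(birds$day), by = 30)   # search values: 5, 35
x <- sort(setdiff(birds$day, search))                   # days that are not search values
idx <- unlist(lapply(findInterval(search, x), seq, length.out = 2))
pos <- match(x[idx], birds$day)

rows <- birds[pos, ]   # whole rows in one step, no which()/rbind() needed
rows$day
```

For this data `rows$day` should come out as 0, 6, 34, 39. Note that the first pair brackets the start day 5 itself, because `search` begins at the start day.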

This is maybe slightly cleaner than the unlist/lapply/seq combo:

pos = match(x[outer(c(0, 1), findInterval(search, x), `+`)], day)

Also note that if you want a larger window (e.g. 4), it's just a matter of going back a bit:

pos = match(x[outer(-1:2, findInterval(search, x), `+`)], day)
Charles
much appreciated, Charles! I learned a lot from your example. cheers.
Maiasaura
darn nice solution!
Joris Meys
Thanks, glad it helped. Actually it seems more complex than I'd anticipated - there's probably a simpler way...
Charles
Oh, and one thing I didn't mention is that this is just for a single species. You can just split your data.frame by species, or use tapply to work on a per-species basis.
Charles
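
A rough sketch of that per-species idea, using split and lapply. The helper name `pick_rows` and the data frame name `birds` are hypothetical, and the "Great tit" rows are made-up values added only so that there is a second group to split on.

```r
# Toy data: the Blue tit rows come from the question; the "Great tit" rows
# are made-up values, added only so there is a second group.
birds <- data.frame(
  species = rep(c("Blue tit", "Great tit"), each = 5),
  counts  = c(234, 24, 45, 40, 57, 12, 7, 30, 18, 9),
  day     = c(0, 5, 6, 34, 39, 0, 4, 8, 33, 40)
)

# Hypothetical helper: the findInterval approach applied to one species.
pick_rows <- function(d, start = 5, by = 30) {
  search <- seq(from = start, to = max(d$day), by = by)
  x <- sort(setdiff(d$day, search))
  idx <- unlist(lapply(findInterval(search, x), seq, length.out = 2))
  idx <- idx[idx >= 1 & idx <= length(x)]   # drop indices that fall off either end
  d[match(x[idx], d$day), ]
}

# Split by species, sample each piece, and bind the pieces back together.
picked <- do.call(rbind, lapply(split(birds, birds$species), pick_rows))
picked$day
```

Each species is sampled against its own set of days, so gaps in one species' survey do not affect another's.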
I did forget one thing though. The lowest value from each sample should become the starting value for the next round. That throws a monkey wrench into the nice search sequence. But I should be able to work this out.
Maiasaura
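
One way to read that requirement is as a loop in which each round's lowest sampled day becomes the next start. The sketch below is an assumption about the intended behavior, not the poster's code: the function name `chain_points` is hypothetical, it picks the n days nearest the target by absolute distance (a different technique from the findInterval lookup above), and it stops if a round makes no forward progress.

```r
chain_points <- function(day, start, by, n) {
  day <- sort(unique(day))
  picks <- list()
  repeat {
    target <- start + by
    if (target > max(day)) break              # ran past the end of the data
    cand <- day[day != target]                # never pick the target day itself
    k <- min(n, length(cand))
    nearest <- sort(cand[order(abs(cand - target))][seq_len(k)])
    picks[[length(picks) + 1]] <- nearest
    if (min(nearest) <= start) break          # no forward progress: stop
    start <- min(nearest)                     # lowest pick seeds the next round
  }
  picks
}

day <- c(0, 5, 6, 9, 10, 15, 34, 39, 43)
chain_points(day, start = 5, by = 30, n = 2)
```

With the example days, start 5 and interval 30, the only round targets day 35 and picks 34 and 39; the next target (34 + 30 = 64) lies beyond the data, so the loop ends.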
A: 

Loved Charles's solution, which works perfectly for the case n=2. Alas, it does not extend well to larger windows: it still has the problem described by the OP, namely that with larger windows the selection is not centered around the search value. Assuming n is even, I came up with the following solution, heavily based on Charles's idea.

The function controls the borders. If there are 100 days and the next midpoint is, say, the second-to-last day, a window of 4 would select index 101, which gives NA. The function shifts the window so that all selected indices lie within the original data. As a side effect, depending on the values of the start (st), the length (l) and the window (n), values near the start and the end have a higher chance of being selected twice. The length l should always be at least twice the window size.

The output of the function is the vector of indices of the bootstrap sample. It can be used like Charles's pos variable, on vectors and data frames alike.

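bboot <- function(day,st,l,n){
  # midpoints to search around: st, st+l, st+2l, ...
  mid <- seq(st,max(day),by=l)
  # candidate days, excluding the midpoints themselves
  x <- sort(setdiff(day,mid))
  lx <- length(x)

  id <- sapply(mid,
          function(y){
            # index of the first candidate above the midpoint
            # (lx+1 if there is none, so the window is shifted to the end)
            m <- match(TRUE,x>y,nomatch=lx+1)
            seq(
              # lx-n+1 rather than lx-n, so that a window clamped
              # at the upper border still has exactly n elements
              from=min( lx-n+1, max(1,m+(-n/2)) ),
              to=min( lx, max(n,m+(n/2-1)) )
            )
          }
        )

  pos <- match(x[id],day)
  return(pos)
}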

Then

>   day <- sample(1:100,50)
> sample.rownr <- bboot(day,10,20,6)
> sort(day)
 [1]  3  4  5  7  9 10 13 15 16 18 19 21 22 24 25 26 27 28 29 
[20] 30 31 32 35 36 38 40 45 49 51 52 54 55 58 59 62 65 69 72 73
[40] 74 80 84 87 88 91 92 94 97 98 99
> day[sample.rownr]
 [1]  5  7  9 13 15 16 27 28 29 31 32 35 40 45 49 51 52 54 62 
[20] 65 69 72 73 74 84 87 88 91 92 94
> 
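
The border handling can also be checked in isolation. The sketch below is a hypothetical helper (`shift_window` is not part of the function above) that takes the index m of the first value above a midpoint, the window size n and the vector length lx, and slides the window back inside 1..lx whenever it would overrun either end. It assumes n is even and lx >= n.

```r
shift_window <- function(m, n, lx) {
  from <- m - n / 2                 # a centered window would start here
  from <- max(1, from)              # do not start before the first element
  from <- min(from, lx - n + 1)     # do not run past the last element
  seq(from, length.out = n)
}

shift_window(50, 4, 100)    # 48 49 50 51
shift_window(100, 4, 100)   # 97 98 99 100
shift_window(1, 4, 100)     # 1 2 3 4
```

Every call returns exactly n indices, which is what keeps the result of sapply a regular matrix that can be used to index the vector of days.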

Edit: regarding bootstrapping for time series, you should go through the CRAN Task View on time series, especially the section about resampling. For irregular time series, the zoo package also offers quite a few other functions that can come in handy.

Joris Meys