A function I wrote extracts timestamps from an XML document. The timestamps belong to events, which are repeated child elements of a series element.

Series elements have a variable number of events, so my function returns a data.frame only if all series have the same length. In general it returns a plain list, and I want my code to work with matrices as well. It was pointed out to me (thanks, Eduardo) that 'list' is the generic type, but I still have trouble with functions that work on generic lists but not on more specific types such as data.frame or matrix.

What I need to do with the data at the moment is find the most common distance between consecutive timestamps (I expect it to account for well over 50% of all steps). I have written and rewritten a function that does this:

R> mostCommonStep( list(a=cumsum(c(1,3,3,2,3,3,4,3,2,3,3)), b=cumsum(c(2,3,2,3))) )
[1] 3
R> mostCommonStep( data.frame(a=c(2,4,6,8,12,14,18), b=c(12,14,16,18,22,24,28)) )
[1] 2
R> mostCommonStep( matrix(c(2,4,6,8,12,14,18, 12,14,16,18,22,24,28), 7, 2) )
[1] 2

but I would like to see a more idiomatic R version.

+2  A: 

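Data frames are lists. Suppose the distances between timestamps are in the vector "x" in the list/data.frame y. You could do sort(-table(y[["x"]]))[1] to get the mode: the name of the returned element is the most common value, and its value is the negated count.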
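
Since a data.frame is a list, the same idea can be applied to every column at once with sapply(). The following is only a sketch: stepMode is a hypothetical helper name (not part of any package), and y reuses the example data from the question, where each column holds raw timestamps rather than precomputed distances.

```r
# Example data from the question: every column is a series of timestamps
y <- data.frame(a = c(2, 4, 6, 8, 12, 14, 18),
                b = c(12, 14, 16, 18, 22, 24, 28))

is.list(y)  # TRUE: a data.frame is a list of equal-length columns

# Hypothetical helper: most frequent difference between consecutive elements
stepMode <- function(v) {
  counts <- table(diff(as.numeric(v)))
  # names(counts) holds the distinct step values as character strings
  as.numeric(names(counts)[which.max(counts)])
}

# sapply() visits each column of a data.frame, or each element of a list
perColumn <- sapply(y, stepMode)
perColumn
```

The same call should also work on a plain list whose elements have different lengths, because sapply() does not require equal-length elements.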

Eduardo Leoni
My data contains only timestamps. That is, all columns contain timestamps and I want to examine all of them.
mariotomo
+2  A: 

The best way to approach this is probably to use an irregular time series object (see the time series view on CRAN). You have several good options (e.g. timeSeries, its, fts, xts), but the most popular of these is the zoo package. You can create a time series like so:

library(zoo)
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
x <- zoo(rnorm(5), x.Date)

Then, to see the difference in time between each event, you can just use the diff function to create a difftime object:

> diff(index(x))
Time differences in days
[1] 2 4 2 5

You can analyze these time differences just like you would any other variable, for instance:

> summary(diff(index(x)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    2.00    3.00    3.25    4.25    5.00

Similarly, to find the most common time difference, you can use any other standard approach such as table():

> table(diff(index(x)))
2 4 5 
2 1 1
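
To go from that table to a single number, one option is which.max() on the counts. This is a minimal sketch that uses only base R (a plain Date vector instead of the zoo index); the variable names steps, counts and mostCommon are illustrative.

```r
# Same dates as in the zoo example, built with base R only
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1

# diff() on Dates gives a difftime; as.numeric() turns it into plain day counts
steps <- as.numeric(diff(x.Date))

counts <- table(steps)

# which.max() picks the first maximum, so ties resolve to the smallest step
mostCommon <- as.numeric(names(counts)[which.max(counts)])
mostCommon
```

With these dates the steps are 2, 4, 2 and 5 days, so the result is 2.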
Shane
I'm afraid my problem is that I am not (yet) confident with the "any other case" and with the "other standard approach[es]".
mariotomo
A: 

I think I will settle on this one. It takes the median of the positive steps, so it only works if the most common step really accounts for more than 50% of the cases.

mostCommonStep <- function(L) {
  ## returns the value of the most common difference between
  ## subsequent elements.

  ## takes into account only forward steps, all negative steps are
  ## discarded.  works with list, data.frame, matrix.
  L <- diff(unlist(sapply(as.list(L), as.numeric)))
  as.numeric(quantile(L[L>0], 0.5))
}
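
For comparison, here is a sketch of a variant that counts steps with table() instead of taking the median, so it does not depend on the 50% assumption. It also takes diff() inside each series, so no step is measured across the boundary between two series. The name mostCommonStepByMode is made up for this example, and the code assumes that a matrix holds one series per column.

```r
mostCommonStepByMode <- function(L) {
  # Assumption: a matrix holds one series per column, so split it by column
  if (is.matrix(L)) L <- as.data.frame(L)

  # diff() within each series; unlist() then pools the steps of all series
  steps <- unlist(lapply(as.list(L), function(s) diff(as.numeric(s))))

  # Keep forward steps only, as in the function above
  steps <- steps[steps > 0]

  counts <- table(steps)
  # The names of the table are the distinct step values as strings
  as.numeric(names(counts)[which.max(counts)])
}

mostCommonStepByMode(list(a = cumsum(c(1,3,3,2,3,3,4,3,2,3,3)), b = cumsum(c(2,3,2,3))))
mostCommonStepByMode(data.frame(a = c(2,4,6,8,12,14,18), b = c(12,14,16,18,22,24,28)))
mostCommonStepByMode(matrix(c(2,4,6,8,12,14,18, 12,14,16,18,22,24,28), 7, 2))
```

On the three examples from the question this should give 3, 2 and 2. Ties go to the smallest step, because table() sorts its values and which.max() returns the first maximum.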
mariotomo