+4  A: 

Use a time series package. The xts package has functions designed specifically to do that. Or look at the aggregate and rollapply functions in the zoo package.

The rmetrics ebook has a useful discussion, including a performance comparison of the various packages: https://www.rmetrics.org/files/freepdf/TimeSeriesFAQ.pdf

Edit: Look at my answer to this question. Basically you need to truncate every timestamp into a specific interval and then do the aggregation using those new truncated timestamps as your grouping vector.
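A minimal base-R sketch of that truncation idea (the data frame `df` and its `value` column are made up for illustration): floor each timestamp down to the start of its 15-minute interval, then aggregate on the truncated times.

```r
# Made-up example: irregular timestamps with a value column
df <- data.frame(
  timestamp = as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                           "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                           "2010-01-13 03:22:19"), tz = "UTC"),
  value = c(1, 2, 3, 4, 5)
)

# Truncate each timestamp to the start of its 15-minute (900 s) interval
bin <- as.POSIXct(floor(as.numeric(df$timestamp) / 900) * 900,
                  origin = "1970-01-01", tz = "UTC")

# Aggregate within each interval, using the truncated times as the grouping vector
aggregate(df$value, by = list(interval = bin), FUN = sum)
```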

Shane
This looks promising, thank you! I may not have been clear, though: every row is timestamped, but rows that should fall within the same interval will have different exact times. For example, two timestamps that differ by only a few minutes should belong to the same 15-minute interval. I've updated the question with some example data.
mattrepl
@mattrepl: Start with the `zoo` package and its vignettes -- as Shane says, these are made for this task.
Dirk Eddelbuettel
I'm trying to use `xts` now. I've noticed that there are some duplicate timestamps (events occurring simultaneously) in the data, but I've read in the Time Series FAQ ebook that xts and zoo should be able to handle that. At present I get an error when trying to create a time series object with `xts`: "order.by requires an appropriate time-based object". I've tried POSIXct, timeDate, etc.; e.g., using a vector of POSIXct timestamps gives me that error. Any thoughts? I can provide a small test case if the problem isn't obvious.
mattrepl
A: 

This is an interesting question; with the proliferation of time series packages and methods, there ought to be an approach for binning irregular time series other than the brute-force one the OP suggests. Here is one "high-level" way to get the intervals, which you can then pass to aggregate et al., using the version of cut defined for chron objects.

require(chron)
require(timeSeries)

my.times <- "
2010-01-13 03:02:38 UTC
2010-01-13 03:08:14 UTC
2010-01-13 03:14:52 UTC
2010-01-13 03:20:42 UTC
2010-01-13 03:22:19 UTC
"

time.df <- read.delim(textConnection(my.times), header = FALSE, sep = "\n", strip.white = FALSE)
# 15-minute breakpoints starting from the top of the hour
time.seq <- seq(trunc(timeDate(time.df[1, 1]), units = "hours"), by = 15 * 60, length = nrow(time.df))
# cut the chron times at those breakpoints and convert to interval ids
intervals <- as.numeric(cut(as.chron(as.character(time.df$V1)), breaks = as.chron(as.character(time.seq))))

You get

intervals  
[1] 1 1 1 2 2

which you can now append to the data frame and aggregate.
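For completeness, a self-contained sketch of that final step, with a hypothetical `value` column attached to the example timestamps and the interval ids from above:

```r
# Hypothetical data frame: the example timestamps, made-up values,
# and the interval ids computed above
binned <- data.frame(
  V1 = c("2010-01-13 03:02:38", "2010-01-13 03:08:14", "2010-01-13 03:14:52",
         "2010-01-13 03:20:42", "2010-01-13 03:22:19"),
  value = c(10, 20, 30, 40, 50),
  interval = c(1, 1, 1, 2, 2)
)

# Sum the values within each interval
aggregate(value ~ interval, data = binned, FUN = sum)
```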

The coercion acrobatics above (from character to timeDate to character to chron) are a little unfortunate, so if there are cleaner solutions for binning irregular time data using xts or any of the other time series packages, I'd love to hear about them as well!

I am also curious to know what would be the most efficient approach for binning large high-frequency irregular time series, e.g. creating 1-minute volume bars on tick data for a very liquid stock.
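As one rough base-R sketch of such binning (simulated tick data here; for serious volumes xts or data.table would likely be faster), findInterval plus tapply builds 1-minute volume bars without an explicit loop:

```r
set.seed(1)

# Simulated tick data: irregular times within one hour, with trade volumes
ticks <- data.frame(
  time = as.POSIXct("2010-01-13 09:30:00", tz = "UTC") +
    sort(runif(1000, 0, 3600)),
  volume = sample(1:500, 1000, replace = TRUE)
)

# 1-minute breakpoints spanning the hour
breaks <- seq(as.POSIXct("2010-01-13 09:30:00", tz = "UTC"),
              by = 60, length.out = 61)

# Assign each tick to a bar, then sum volume per bar
bar <- findInterval(ticks$time, breaks)
volume.bars <- tapply(ticks$volume, bar, sum)
```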

Leo Alekseyev
+3  A: 

Standard functions to split vectors are cut and findInterval:

v <- as.POSIXct(c(
  "2010-01-13 03:02:38 UTC",
  "2010-01-13 03:08:14 UTC",
  "2010-01-13 03:14:52 UTC",
  "2010-01-13 03:20:42 UTC",
  "2010-01-13 03:22:19 UTC"
))

# Your function returns a list:
interv(v, as.POSIXlt("2010-01-13 03:00:00 UTC"), 900)
# [[1]]
# [1] "2010-01-13 03:00:00"
# [[2]]
# [1] "2010-01-13 03:00:00"
# [[3]]
# [1] "2010-01-13 03:00:00"
# [[4]]
# [1] "2010-01-13 03:15:00 CET"
# [[5]]
# [1] "2010-01-13 03:15:00 CET"

# cut returns a factor; you must provide proper breaks:
cut(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 2010-01-13 03:00:00 2010-01-13 03:00:00 2010-01-13 03:00:00
# [4] 2010-01-13 03:15:00 2010-01-13 03:15:00
# Levels: 2010-01-13 03:00:00 2010-01-13 03:15:00

# findInterval returns a vector of interval ids (breaks as in cut)
findInterval(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 1 1 1 2 2

For the record: cut has a method for the POSIXt type, but unfortunately there is no way to provide a start argument; the effect is:

cut(v,"15 min")
# [1] 2010-01-13 03:02:00 2010-01-13 03:02:00 2010-01-13 03:02:00
# [4] 2010-01-13 03:17:00 2010-01-13 03:17:00
# Levels: 2010-01-13 03:02:00 2010-01-13 03:17:00

As you can see, the intervals start at 03:02:00. You could massage the labels of the output factor (convert the labels to times, round them somehow, and convert back to character).
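A sketch of that label massaging: parse the factor levels back into times, floor them to 15-minute boundaries, and reassign them as the new labels.

```r
v <- as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                  "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                  "2010-01-13 03:22:19"), tz = "UTC")

f <- cut(v, "15 min")                    # levels start at 03:02 here
lv <- as.POSIXct(levels(f), tz = "UTC")  # labels back to times
# Floor each level to the nearest 15-minute boundary and relabel
levels(f) <- format(as.POSIXct(floor(as.numeric(lv) / 900) * 900,
                               origin = "1970-01-01", tz = "UTC"))
f
```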

Marek
Thanks, this looks like it'll get me a little farther down the path of hacking up a simple time-series aggregation. I'm still keeping this open because I think we're close to a good solution using `xts` with irregular, possibly non-unique timestamps.
mattrepl