tags:

views:

157

answers:

3

I have a time series of data in TSV like so:

ID \t Date \t Value
-------------------------------
1234567 \t 2009-01-01T00:00:00.000Z \t 121
12131 \t 2009-06-01T00:00:00.000Z \t 151
12131 \t 2009-07-01T00:00:00.000Z \t 15153
...

It easily fits in RAM, but is too big for Excel.

There is one value per month per ID, but not all IDs have entries for all 12 months.

The data spans 12 months, but not all IDs have all 12 months. I want to go through the data for each ID, and if there is an entry for the previous month, take the current month minus the previous month and store it in a new column to get a delta. If there is no entry for the previous month, then return 0. Then, for each month, I want the top 100 positive and negative of those deltas, along with the ID.

I'd like to do this in R, because its hard in Excel and it keeps crashing. I have R, Rattle, etc. installed and I've worked through basic examples, but ... the learning curve is steep. I would really appreciate some help :)

Thanks in advance!

+1  A: 

Pseudo Code to start with:

For Each ID
  If Previous month data Exists 
    compute Diff
  Else diff = 0
return diff

For Each Month
  Max 100 (Positive)
  Min 100 (Negative)


#Realish Code
dataset$diff <- lappply(dataset,function(ID,month,value){IF dataset[month-1] = TRUE{value-(value[month-1]})})
#This gets tricky since you need to know the month and what the previous month is in a format you can test
+4  A: 

Ok, first some code to generate some test data. This makes 100 random IDs and for each one chooses 20 months from a 2 year period along with random values. The order is then shuffled for extra fun.

## Generate some IDs
ids <- sample(1000, 100)

## Generate the data
data <- do.call(rbind,
                lapply(ids,
                       function(id)
                       data.frame(ID = id,
                                  Date = sample(as.Date(paste(rep(c(2008:2009), each=12),
                                    1:12, 1, sep="-")),
                                    20),
                                  Value = sample(1000, 20))))

## Shuffle
data <- data[sample(nrow(data), nrow(data)),]

Here's what it looks like for me:

> head(data)
      ID       Date Value
1007 205 2008-07-01   235
1391 840 2008-12-01   509
918  278 2009-12-01   951
1213 945 2009-03-01   842
1369 766 2009-07-01   555
798  662 2008-12-01   531

Ok, now let's iterate over IDs and find the diff for each month for each ID. Before that, let's convert the month to a number so it'll be easier to take differences (this is a bit unclean, does anyone know a better way to do arithmetic on Date objects?). This just does year * 12 + month so that normal arithmetic works:

data$Month <- as.POSIXlt(data$Date)$mon + as.POSIXlt(data$Date)$year * 12

Now compute the differences:

by.id <- by(data, data$ID, function(x) {
  ## Sort by month.
  x <- x[order(x$Month),]
  ## Compute the month and value differences, taking care to pad the edge case.
  data.frame(ID=x$ID,
             Date = x$Date,             
             Month.diff=c(0, diff(x$Month)),
             Value.diff=c(0,diff(x$Value)))
})
by.id <- do.call(rbind, by.id)

Here's what the result looks like:

> head(by.id)
    ID       Date Month.diff Value.diff
4.1  4 2008-02-01          0          0
4.2  4 2008-03-01          1        123
4.3  4 2008-05-01          2        -94
4.4  4 2008-06-01          1       -243
4.5  4 2008-08-01          2       -327
4.6  4 2008-10-01          2        656

If the difference between consecutive months was greater than 1, then the months were not adjacent and we should set their values to zero.

by.id$Value.diff <- ifelse(by.id$Month.diff == 1,
                           by.id$Value.diff,
                           0)

Finally, we iterate by month and take the top and bottom N differences (I'll set N to 10 here rather than 100 since my test data set is rather small).

by.month <- by(by.id, by.id$Date, function(x) {
  ## Sort the data in each month
  x <- x[order(x$Value.diff),]
  ## Take the top and bottom and label them accordingly.
  cbind(rbind(head(x, 10), tail(x, 10)),
        type=rep(c("min", "max"), each=10))
})

And there we have it. Here's an example result:

> by.month[[24]]
        ID       Date Month.diff Value.diff type
130.20 130 2009-12-01          1       -951  min
415.20 415 2009-12-01          1       -895  min
662.20 662 2009-12-01          1       -878  min
107.20 107 2009-12-01          1       -744  min
824.20 824 2009-12-01          1       -731  min
170.20 170 2009-12-01          1       -719  min
502.20 502 2009-12-01          1       -714  min
247.20 247 2009-12-01          1       -697  min
789.20 789 2009-12-01          1       -667  min
132.20 132 2009-12-01          1       -653  min
64.20   64 2009-12-01          1        622  max
82.20   82 2009-12-01          1        647  max
381.20 381 2009-12-01          1        698  max
303.20 303 2009-12-01          1        700  max
131.20 131 2009-12-01          1        751  max
221.20 221 2009-12-01          1        765  max
833.20 833 2009-12-01          1        791  max
806.20 806 2009-12-01          1        806  max
780.20 780 2009-12-01          1        843  max
912.20 912 2009-12-01          1        929  max
Jonathan Chang
+7  A: 

Start by adding in all missing months:

all_combs <- expand.grid(
  ID = unique(data$ID),
  Date = unique(data$Date))

data <- merge(data, all_combs, by = c("ID", "Date"), all = T)
# Ensure data ordered by date
data <- data[with(data, order(ID, Date)), ]

Then add a column of deltas (calculated with diff)

library(plyr)
data <- ddply(data, "ID", transform, delta = c(NA, diff(Value)))

Finally, remove missing deltas, order by their value and extract the top and bottom 10 within each group.

changed <- subset(data, !is.na(delta))
changed <- changed[with(changed, order(ID, delta)), ]

# Select top 100 for each
top10 <- ddply(changed, "ID", function(df) {
 rbind(head(df, 10), tail(df, 10))
})
hadley
Thanks, running this now. The step: data <- ddply is taking a long time, hopefully it will finish soon :) The original data was 50MB or so, but it was probably greatly expanded by adding the missing months.
rjurney
i ♡ expand.grid
Brendan OConnor