ansaurus

Question

Help me replace a for loop with an "apply" function

Answer 1

+3 A:

The apply functions are not always (or even generally) faster than a for loop. That belief is a remnant of R's association with S-Plus (in the latter, apply is faster than for). One exception is lapply, which is frequently faster than for (because it uses C code). See this related question.

So you should use apply primarily to improve the clarity of code, not to improve performance.
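
To make the clarity point concrete, here is a minimal sketch that computes the same per-element sums with a for loop and with vapply. The list `x` and the variable names are made up for illustration and are not taken from the question. Both forms give the same result; the apply form is just shorter:

```r
# Made-up input: a named list of numeric vectors
x <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# for loop: preallocate the result, then fill it element by element
sums_for <- numeric(length(x))
names(sums_for) <- names(x)
for (i in seq_along(x)) {
    sums_for[i] <- sum(x[[i]])
}

# vapply: same computation, with the result type declared up front
sums_apply <- vapply(x, sum, numeric(1))
```

Timing the two on your own data with system.time() is the only reliable way to tell whether either form is actually faster.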

You might find Dirk's presentation on high-performance computing useful. One other brute-force approach is "just-in-time compilation" with Ra, which is optimized to handle for loops, instead of the normal R version.

[Edit:] There are clearly many ways to achieve this, and this is by no means better even if it's more compact. Just working with your code, here's another approach:

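# Tabulate day x user_id; keep the user_id and Freq columns (Freq is 0 on days without use)
dt <- data.frame(table(dat))[,2:3]
# Run-length encode each user's usage sequence
dt.b <- by(dt[,2], dt[,1], rle)
# Longest run of days with use per user (runs of zeros are excluded)
t(data.frame(lapply(dt.b, function(x) max(x$lengths[x$values > 0]))))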

You would probably need to manipulate the output a little further.

Shane 2009-10-01 16:06:48

Answer 2

+3 A:

EDIT: Fixed. I originally assumed that I would have to modify most of rle(), but it turns out only a few tweaks were needed.

This isn't an answer about an *apply method, but I wonder if this might not be a faster approach to the process overall. As Shane says, loops aren't so bad. And... I rarely get to show my code to anyone, so I'd be happy to hear some critique of this.

#Shane, I told you this was awesome
dat <- getSOTable("http://stackoverflow.com/questions/1504832/help-me-replace-a-for-loop-with-an-apply-function", 1)
colnames(dat) <- c("day", "user_id")
#Convert to dates so that arithmetic works properly on them
dat$day <- as.Date(dat$day)

#Custom rle for dates
rle.date <- function (x)
{
    #Accept only dates
    if (!inherits(x, "Date"))
        stop("'x' must be an object of class \"Date\"")
    n <- length(x)
    #No dates means no run at all
    if (n == 0L)
        return(0L)
    #Dates need to be sorted
    x.sort <- sort(x)
    #y is a vector indicating at which indices the date is not consecutive with its predecessor
    y <- x.sort[-1L] != (x.sort + 1)[-n]
    #i returns the indices of y that are TRUE, and appends the index of the last value
    i <- c(which(y | is.na(y)), n)
    #diff tells you the distances in between TRUE/non-consecutive dates. max gets the largest of these.
    max(diff(c(0L, i)))
}

#Loop
max.consec.use <- matrix(nrow = length(unique(dat$user_id)), ncol = 1)
rownames(max.consec.use) <- unique(dat$user_id)

users <- unique(dat$user_id)
for(i in seq_along(users)){
    user <- users[i]
    uses <- subset(dat, user_id %in% user)
    max.consec.use[paste(user), 1] <- rle.date(uses$day)
}

max.consec.use

Matt Parker 2009-10-01 19:40:00

Forgot to add: the getSOTable function is from Shane's answer here: http://stackoverflow.com/questions/1434897/how-do-i-load-example-datasets-in-r/1434927#1434927

Matt Parker 2009-10-01 19:53:03

oh that's Sweet. And thanks to Shane.

kpierce8 2009-10-01 20:05:57

Answer 3

A:

If you've got a really long list of data, then this sounds like it may be a clustering problem. Each cluster would be defined by a user and dates with a maximum separation distance of one. Then retrieve the largest cluster by user. I'll edit this if I think of a specific method.
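
One possible reading of this idea, as a rough sketch rather than a definitive method: single-linkage hierarchical clustering on the day numbers, cut at a height of 1, puts dates in the same cluster exactly when they are chained together by gaps of at most one day. The helper name `largest_cluster` and the small sample data are assumptions made for illustration; hclust, dist and cutree come from R's stats package.

```r
# Made-up sample data in the same shape as the question's table
dat <- data.frame(
    day = as.Date(c("2008-11-01", "2008-11-01", "2008-11-02", "2008-11-03",
                    "2008-11-01", "2008-11-03", "2008-11-04")),
    user_id = c(2001, 2002, 2001, 2001, 2003, 2003, 2003)
)

# Hypothetical helper: size of the largest cluster of dates in which
# neighbouring members are at most one day apart
largest_cluster <- function(days) {
    days <- unique(days)
    # hclust needs at least two points, so handle 0 or 1 dates directly
    if (length(days) < 2L)
        return(length(days))
    tree <- hclust(dist(as.numeric(days)), method = "single")
    # Cutting at h = 1 keeps every merge made at a distance of one day or less
    max(table(cutree(tree, h = 1)))
}

# One value per user: the size of that user's largest cluster
largest <- sapply(split(dat$day, dat$user_id), largest_cluster)
largest
```

Note that dist() builds all pairwise distances for each user, so this is likely to be slow for users with many dates.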

kpierce8 2009-10-01 19:48:03

Answer 4

A:

This was Chris's suggestion for how to get the data:

dat <- read.table(textConnection(
 "day      user_id
 2008/11/01    2001
 2008/11/01    2002
 2008/11/01    2003
 2008/11/01    2004
 2008/11/01    2005
 2008/11/02    2001
 2008/11/02    2005
 2008/11/03    2001
 2008/11/03    2003
 2008/11/03    2004
 2008/11/03    2005
 2008/11/04    2001
 2008/11/04    2003
 2008/11/04    2004
 2008/11/04    2005
 "), header=TRUE)

Shane 2009-10-01 20:01:03

... yeah, that's probably a bit more sensible. But I like a little magic in my programming from time to time.

Matt Parker 2009-10-01 20:18:08

Answer 5

+1 A:

Another option:

# convert to Date
day_table$day <- as.Date(day_table$day, format="%Y/%m/%d")
# split by user and then look for contiguous days
contig <- sapply(split(day_table$day, day_table$user_id), function(.days){
    .diff <- cumsum(c(TRUE, diff(.days) != 1))
    max(table(.diff))
})
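
To see what the cumsum step in the code above is doing, here is a small walk-through on a made-up vector of dates (the `days` values are an assumption, not data from the question). Each gap other than one day starts a new group id, so the largest group is the longest run of consecutive days:

```r
# Made-up dates: a run of two days, a gap, then a run of three days
days <- as.Date(c("2008-11-01", "2008-11-02", "2008-11-04",
                  "2008-11-05", "2008-11-06"))

gaps <- diff(days)               # gaps in days: 1 2 1 1
new_run <- c(TRUE, gaps != 1)    # TRUE FALSE TRUE FALSE FALSE
run_id <- cumsum(new_run)        # 1 1 2 2 2
longest <- max(table(run_id))    # 3
```

The approach assumes each user's dates are already sorted and unique; calling sort(unique(.days)) inside the function would cover data that is not.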

gd047 2010-01-12 07:34:25
