views: 257 · answers: 6

Hi all, I'm new to R.

I have two panel data files with columns "id", "date" and "ret".

File A has a lot more data than file B, but I'm primarily working with file B's data.

The combination of "id" and "date" is a unique identifier.

Is there an elegant way, for each (id, date) in B, to look up the past 10 days of ret from file A and store the result back into B?

My naive way of doing it is to loop over all rows in B:

for (i in 1:nrow(B)) {
    B$past10d[i] <- prod(1 + A$ret[A$id == B$id[i] & A$date > B$date[i] - 10 & A$date < B$date[i]]) - 1
}

but the loop takes forever.
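For reference, here is a minimal reproducible sketch of the setup with made-up data (two ids, integer dates, small random returns), using the corrected loop:

```r
# Made-up toy data with the same columns as the question
set.seed(42)
A <- data.frame(id   = rep(c("x", "y"), each = 30),
                date = rep(1:30, times = 2),
                ret  = rnorm(60, 0, 0.01))
B <- data.frame(id = c("x", "y"), date = c(20, 25))

# The naive loop: for each row of B, compound A's returns from the
# previous 10 days for the same id (nrow(), not length(), over a data frame)
B$past10d <- NA_real_
for (i in 1:nrow(B)) {
  r <- A$ret[A$id == B$id[i] & A$date > B$date[i] - 10 & A$date < B$date[i]]
  B$past10d[i] <- prod(1 + r) - 1
}
```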

Really appreciate your thoughts.

Thank you very much.

A: 

Is this any faster? (I am assuming the combination of B$id and B$date is a unique identifier not replicated anywhere, as implied by your code.)

B$idDate <- factor(B$id):factor(B$date)
B$past10 <- sapply(B$idDate, function(x){with(B[B$idDate == x,], 
    prod(1+A$ret[A$id == id & A$date > date-10 & A$date < date])-1)})
John
Thanks for your reply, John. It looks very neat. But when I tried it on the data, I got: Error: cannot allocate vector of size 6.8 Mb. Too much work for the B$past10 guy?
jason K
This shouldn't be much more memory intensive than your method. Try clearing out some stuff in memory with the ls() and rm() commands. Or, try restarting R and executing again.
John
Thanks! I'll try that. But I forgot to mention that my naive method never finished looping. :S
jason K
If A is very large with lots of dates, and especially ids, not in B then you might want to look at my other answer.
John
A: 

Did you try ?merge ?

"Merge two data frames by common columns or row names, or do other versions of database join operations. "

Besides, I suggest using a small local MySQL/PostgreSQL database (RMySQL/RPostgreSQL) if you regularly work with composite primary keys or similar unique identifiers. To me, rearranging the data in SQL and then using data frames from a view is a lot easier than looping.
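As a minimal sketch of what merge does here, assuming made-up data with the question's column names, a composite-key join on (id, date) looks like:

```r
# Made-up data frames with the question's column names
A <- data.frame(id   = rep(1:2, each = 5),
                date = rep(1:5, times = 2),
                ret  = runif(10))
B <- data.frame(id = c(1, 2), date = c(3, 4))

# Keep every row of B; matching (id, date) rows of A supply the ret column,
# with NA where B has no match in A
merged <- merge(B, A, by = c("id", "date"), all.x = TRUE)
```

Note this looks up the single matching day, not the 10-day window; it shows the join mechanics only.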

ran2
Hm, I might need to look into SQL. Thanks!
jason K
RMySQL was easier to get started with…
ran2
A: 

If you haven't got data that is replicated in both A and B, then rbind is the simplest solution.

#Sample data
A <- data.frame(
  id = rep(letters[1:3], each = 13),
  date = Sys.Date() + -12:0,
  ret = runif(39)
)

B <- data.frame(
  id = rep(letters[5:6], each = 5),
  date = Sys.Date() + -4:0,
  ret = runif(10)
)

#Only take the last ten days from A
A_past_10_days <- A[A$date > Sys.Date() - 10,]

#Bind by rows
rbind(A_past_10_days, B)
Richie Cotton
You're taking the past 10 days from today, but his code takes the past 10 days from each possible date in B, qualified by the identifier that goes with that date.
John
@John: Well spotted. I seem to have answered the question's text, but not the question hidden in the code. **Sigh**
Richie Cotton
A: 

In general, you ought to avoid looping in R. It's much quicker if your code operates on vectors.

I would use merge, as suggested by ran2. You can set all.x = TRUE (or all.y, or all) to get all the rows from one (or the other, or both) data frames. This is quick and will typically work out which fields to match by itself. Otherwise you'll need to specify by.x (and by.y, or by) as a lookup field. By the sounds of it you may need to create this field yourself (as per John's comment).

You can then filter by date.
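A sketch of that merge-then-filter idea on made-up data (one id, integer dates): join on id alone, keep only A rows inside each B row's 10-day window, then compound the returns per group. aggregate here plays the summarising role mentioned in the comments below.

```r
# Made-up data with the question's column names
set.seed(1)
A <- data.frame(id = rep("x", 30), date = 1:30, ret = rnorm(30, 0, 0.01))
B <- data.frame(id = "x", date = c(15, 25))

# Join on id only; the shared "date" column gets disambiguating suffixes
m <- merge(B, A, by = "id", suffixes = c(".B", ".A"))

# Keep A's rows that fall in each B row's past-10-day window
m <- m[m$date.A > m$date.B - 10 & m$date.A < m$date.B, ]

# Compound the windowed returns per (id, date) pair in B
past10 <- aggregate(ret ~ id + date.B, data = m,
                    FUN = function(r) prod(1 + r) - 1)
```

On large data the id-only join can explode in size, which is the memory concern raised in the comments, so this is only viable once A has been pared down.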

RobinGower
Thanks! B is actually a much smaller subset of A, so if I understand merge correctly, then A would be the result of the merge.
jason K
This will take up a lot more memory than your original solution. Given that you were hitting memory limits with my suggestion this will be unlikely to work. Furthermore, all this does is get you the data into one place. It doesn't solve your problem of turning those past 10 days into a single value... which your code suggests is necessary.
John
Ah! My mistake. I thought you were getting a set of rows from A into B. I hadn't realised you actually wanted to summarise the product. You may also be interested in aggregate (the convenience function for sapply). If you're having memory trouble then you may want to take a small sample that you can practice on until you're sure that the code works (see also: http://www.r-bloggers.com/memory-management-in-r-a-few-tips-and-tricks/).
RobinGower
A: 

I think the key is to vectorize and use the %in% operator to subset data frame A. And, I know, prices are not random numbers, but I didn't want to code a random walk... I created a stock-date index using paste, but I'm sure you could use the index from pdata.frame in the plm library, which is the best I've found for panel data.

# Made-up panel: 10 stocks, 100 consecutive dates each
A <- data.frame(stock = rep(1:10, each = 100), date = rep(Sys.Date() - 99:0, 10), price = rnorm(1000))
B <- A[seq(from = 100, to = 1000, by = 100), ]
# Build a stock-date composite key for both frames
A <- cbind(paste(A$stock, A$date, sep = "-"), A)
B <- cbind(paste(B$stock, B$date, sep = "-"), B)
colnames(A) <- colnames(B) <- c("index", "stock", "date", "price")
# Rows of A that appear in B; index - 10 is the observation 10 rows earlier
index <- which(A[, 1] %in% B[, 1])
returns <- (A$price[index] - A$price[index - 10]) / A$price[index - 10]
B <- cbind(B, returns)
richardh
A: 

Given that you're having memory issues perhaps paring down A first might help. First, get rid of extraneous ids.

A <- A[A$id %in% B$id,]

Reducing the A dataset completely would take more memory still, since it can't be done without storing some intermediate variables. Nevertheless, we can hopefully get rid of a big chunk of it by lopping off every date below our absolute minimum and above our absolute maximum.

A <- A[A$date > (min(B$date) - 10) & A$date <= max(B$date),]

Of course, by not qualifying this by id we haven't got the smallest possible version of A, but hopefully it's enough smaller.

Now run the code I first proposed and see if you still have a memory error:

B$idDate <- factor(B$id):factor(B$date)
B$past10 <- sapply(B$idDate, function(x){with(B[B$idDate == x,], 
    prod(1+A$ret[A$id == id & A$date > date-10 & A$date < date])-1)})
John
Thanks John, the two lines help cut A down a lot. But after I tried with just 500 rows, it still gave me the error: cannot allocate vector of size 6.8 Mb. I think the problem could be in sapply: somehow it is taking the id and date stuff as vectors and trying to compare A$id against B$id, etc., instead of element by element.
jason K
Looks like it's sapply(B$idDate) that's creating the problem. I changed B$idDate to characters and the code has been running... smells like a loop.
jason K
Try not putting past10 in B (remove B$ at the beginning of the second line). Does that have a memory issue too? sapply() is taking each element of B and then using it to subselect A and get the product, as in your original code. I demoed it on my data, but on my computer 6.8 MB is a trivial amount of memory (and I'm only using about 200 lines). And if it does report an error message, what is the exact message?
John
So after I changed B$idDate to characters, it worked with 500 lines. Now I'm running on the full data; it has just been running... no error message, but no sign of stopping either.
jason K