views: 257 · answers: 6

Hi all, I'm new to R.

I have two panel data files with columns "id", "date" and "ret".

File A has a lot more data than file B, but I'm primarily working with file B's data.

The combination of "id" and "date" is a unique identifier.

Is there an elegant way, for each (id, date) in B, to look up the past 10 days of ret from file A and store the result back into B?

My naive way of doing it is to loop over all rows in B:

for (i in 1:nrow(B)) {
    B$past10d[i] <- prod(1 + A$ret[A$id == B$id[i] & A$date > B$date[i] - 10 & A$date < B$date[i]]) - 1
}

but the loop takes forever.
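For reference, here is a minimal reproducible sketch of the setup with made-up data (two ids, integer dates, small random returns), using the corrected loop:

```r
# Made-up toy data with the same columns as the question
set.seed(42)
A <- data.frame(id   = rep(c("x", "y"), each = 30),
                date = rep(1:30, times = 2),
                ret  = rnorm(60, 0, 0.01))
B <- data.frame(id = c("x", "y"), date = c(20, 25))

# The naive loop: for each row of B, compound A's returns from the
# previous 10 days for the same id (nrow(), not length(), over a data frame)
B$past10d <- NA_real_
for (i in 1:nrow(B)) {
  r <- A$ret[A$id == B$id[i] & A$date > B$date[i] - 10 & A$date < B$date[i]]
  B$past10d[i] <- prod(1 + r) - 1
}
```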

Really appreciate your thoughts.

Thank you very much.

A: 

Is this any faster? (I am assuming the combination of B$id and B$date is a unique identifier not replicated anywhere, as implied by your code.)

B$idDate <- factor(B$id):factor(B$date)
B$past10 <- sapply(B$idDate, function(x){with(B[B$idDate == x,], 
    prod(1+A$ret[A$id == id & A$date > date-10 & A$date < date])-1)})
John
Thanks for your reply, John. It looks very neat. But when I tried it on the data, I got: Error: cannot allocate vector of size 6.8 Mb. Too much work for the B$past10 guy?
jason K
This shouldn't be much more memory intensive than your method. Try clearing out some stuff in memory with the ls() and rm() commands. Or, try restarting R and executing again.
John
Thanks! I'll try that. But I forgot to mention that my naive method never finished looping. :S
jason K
If A is very large with lots of dates, and especially ids, not in B then you might want to look at my other answer.
John
A: 

Did you try ?merge ?

"Merge two data frames by common columns or row names, or do other versions of database join operations. "

Besides, I suggest using a small local MySQL/PostgreSQL database (RMySQL/RPostgreSQL) if you regularly work with composite primary keys or similar unique identifiers. To me, rearranging the data in SQL and then using data frames from a view is a lot easier than looping.
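As a minimal sketch of what merge does here, assuming made-up data with the question's column names, a composite-key join on (id, date) looks like:

```r
# Made-up data frames with the question's column names
A <- data.frame(id   = rep(1:2, each = 5),
                date = rep(1:5, times = 2),
                ret  = runif(10))
B <- data.frame(id = c(1, 2), date = c(3, 4))

# Keep every row of B; matching (id, date) rows of A supply the ret column,
# with NA where B has no match in A
merged <- merge(B, A, by = c("id", "date"), all.x = TRUE)
```

Note this looks up the single matching day, not the 10-day window; it shows the join mechanics only.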

ran2
Hm, I might need to look into SQL. Thanks!
jason K
RMySQL was easier to get started with…
ran2
A: 

If you haven't got data that is replicated in both A and B, then rbind is the simplest solution.

#Sample data
A <- data.frame(
  id = rep(letters[1:3], each = 13),
  date = Sys.Date() + -12:0,
  ret = runif(39)
)

B <- data.frame(
  id = rep(letters[5:6], each = 5),
  date = Sys.Date() + -4:0,
  ret = runif(10)
)

#Only take the last ten days from A
A_past_10_days <- A[A$date > Sys.Date() - 10,]

#Bind by rows
rbind(A_past_10_days, B)
Richie Cotton
You're taking the past 10 days from today, but his code takes the past 10 days from each possible date in B, qualified by the identifier that goes with that date.
John
@John: Well spotted. I seem to have answered the question's text, but not the question hidden in the code. **Sigh**
Richie Cotton
A: 

In general, you ought to avoid looping in R. It's much quicker if your code operates on vectors.

I would use merge, as suggested by ran2. You can set all.x = TRUE (or all.y, or all) to get all the rows from one (or the other, or both) data frames. This is quick and will typically work out which fields to match by itself. Otherwise you'll need to specify by.x (and by.y, or by) as a lookup field. By the sounds of it you may need to create this field yourself (as per John's comment).

You can then filter by date.
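A sketch of that merge-then-filter idea on made-up data (one id, integer dates): join on id alone, keep only A rows inside each B row's 10-day window, then compound the returns per group. aggregate here plays the summarising role mentioned in the comments below.

```r
# Made-up data with the question's column names
set.seed(1)
A <- data.frame(id = rep("x", 30), date = 1:30, ret = rnorm(30, 0, 0.01))
B <- data.frame(id = "x", date = c(15, 25))

# Join on id only; the shared "date" column gets disambiguating suffixes
m <- merge(B, A, by = "id", suffixes = c(".B", ".A"))

# Keep A's rows that fall in each B row's past-10-day window
m <- m[m$date.A > m$date.B - 10 & m$date.A < m$date.B, ]

# Compound the windowed returns per (id, date) pair in B
past10 <- aggregate(ret ~ id + date.B, data = m,
                    FUN = function(r) prod(1 + r) - 1)
```

On large data the id-only join can explode in size, which is the memory concern raised in the comments, so this is only viable once A has been pared down.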

RobinGower
Thanks! B is actually a much smaller subset of A, so if I understand merge correctly, then A would be the result of the merge.
jason K
This will take up a lot more memory than your original solution. Given that you were hitting memory limits with my suggestion this will be unlikely to work. Furthermore, all this does is get you the data into one place. It doesn't solve your problem of turning those past 10 days into a single value... which your code suggests is necessary.
John
Ah! My mistake. I thought you were getting a set of rows from A into B. I hadn't realised you actually wanted to summarise the product. You may also be interested in aggregate (the convenience function for sapply). If you're having memory trouble then you may want to take a small sample that you can practice on until you're sure that the code works (see also: http://www.r-bloggers.com/memory-management-in-r-a-few-tips-and-tricks/).
RobinGower
A: 

I think the key is to vectorize and use the %in% operator to subset data frame A. And, I know, prices are not random numbers, but I didn't want to code a random walk... I created a stock-date index using paste, but I'm sure you could use the index from pdata.frame in the plm library, which is the best I've found for panel data.

# Made-up panel: 10 stocks, 100 consecutive dates each
A <- data.frame(stock = rep(1:10, each = 100), date = rep(Sys.Date() - 99:0, 10), price = rnorm(1000))
B <- A[seq(from = 100, to = 1000, by = 100), ]
# Build a stock-date composite key for both frames
A <- cbind(paste(A$stock, A$date, sep = "-"), A)
B <- cbind(paste(B$stock, B$date, sep = "-"), B)
colnames(A) <- colnames(B) <- c("index", "stock", "date", "price")
# Rows of A that appear in B; index - 10 is the observation 10 rows earlier
index <- which(A[, 1] %in% B[, 1])
returns <- (A$price[index] - A$price[index - 10]) / A$price[index - 10]
B <- cbind(B, returns)
richardh
A: 

Given that you're having memory issues perhaps paring down A first might help. First, get rid of extraneous ids.

A <- A[A$id %in% B$id,]

Reducing the A dataset completely would take more memory still, since it can't be done without storing some intermediate variables. Nevertheless, we can hopefully get rid of a big chunk of it by lopping off every date below our absolute minimum and above our absolute maximum.

A <- A[A$date > (min(B$date) - 10) & A$date <= max(B$date),]

Of course, by not qualifying this by id we haven't got the smallest possible version of A, but hopefully it's enough smaller.

Now run the code I first proposed and see if you still have a memory error:

B$idDate <- factor(B$id):factor(B$date)
B$past10 <- sapply(B$idDate, function(x){with(B[B$idDate == x,], 
    prod(1+A$ret[A$id == id & A$date > date-10 & A$date < date])-1)})
John
Thanks John, the two lines help cut A down a lot. But after I tried with just 500 rows, it still gave me the error: cannot allocate vector of size 6.8 Mb. I think the problem could be in sapply: somehow it is taking the id and date stuff as vectors and trying to compare A$id against B$id, etc., instead of element by element.
jason K
Looks like it's sapply(B$idDate) that's creating the problem. I changed B$idDate to characters and the code has been running... smells like a loop.
jason K
Try not putting past10 in B (remove B$ at the beginning of the second line). Does that have a memory issue too? sapply() is taking each element of B and then using it to subselect A and get the product, as in your original code. I demoed it on my data, but on my computer 6.8 MB is a trivial amount of memory (and I'm only using about 200 lines). And if it does report an error message, what is the exact message?
John
So after I changed B$idDate to characters, it worked with 500 lines. Now I'm running on the full data; it has just been running... no error message, but no sign of stopping either.
jason K