A very simple question:

I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.

This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row data sets) whose results I'd rather cache in the R environment than recompute every time I run the script (which is many times, as I progress and test new lines of code).

Is there a way to cache what a script does up to a certain point, so that each run only executes the incremental lines of code (just as I would when running R interactively)?

Thanks,

Roberto

+2  A: 

I want to do this too when I'm using Sweave. I'd suggest putting all of your expensive functions (loading and reshaping data) at the beginning of your code. Run that code, then save the workspace. Then, comment out the expensive functions, and load the workspace file with load(). This is, of course, riskier if you make unwanted changes to the workspace file, but in that event, you still have the code in comments if you want to start over from scratch.
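
A minimal sketch of that workflow (the file names and reshape arguments are illustrative, not from the original post):

## First run: do the expensive steps, then snapshot the workspace
long.data <- read.csv("big_file.csv")                 # expensive load
wide.data <- reshape(long.data, direction = "wide",
                     idvar = "id", timevar = "time")  # expensive reshape
save.image(file = "myproject.RData")

## Later runs: comment out the lines above and restore the snapshot
# load("myproject.RData")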

JoFrhwld
Jo, thanks for the answer. Do you know how to save the workspace using TextMate?
Roberto
but what if I don't work in Sweave?
Roberto
the load() approach is a good one to use and it is not dependent on Sweave.
JD Long
+2  A: 

Some simple approaches can be built from a combination of

  • exists("foo") to test whether a variable already exists, and re-load or re-compute it if not
  • file.info("foo.RData")$ctime, which you can compare to Sys.time(): if the cached file is newer than a given age you can load it, else recompute (a sketch combining both checks follows below).

There are also caching packages on CRAN that may be useful.
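
A minimal sketch combining both checks (the cache file name, age limit, and data source are assumptions for illustration):

## reuse foo if it exists; otherwise load a fresh-enough cache or recompute
if (!exists("foo")) {
  cache   <- "foo.RData"   # hypothetical cache file
  max.age <- 60 * 60       # accept caches up to one hour old, in seconds
  if (file.exists(cache) &&
      as.numeric(Sys.time()) - as.numeric(file.info(cache)$ctime) < max.age) {
    load(cache)            # restores foo into the workspace
  } else {
    foo <- read.csv("big_file.csv")  # hypothetical expensive step
    save(foo, file = cache)
  }
}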

Dirk Eddelbuettel
Dirk, but doesn't every object have to be recreated anyway every time the script is re-run? So foo will never exist and will always be recomputed, right?
Roberto
It depends. Sometimes one gets data from, say, a database which may be extensive. You could then cache this in a file and use the timestamp (as I described) to see whether you need a new db access or not. It all depends on the particulars of your situation.
Dirk Eddelbuettel
+1  A: 

Without going into too much detail, I usually follow one of three approaches:

  1. Use assign to give each important object a unique name throughout my execution. Then include an if (exists(...)) get(...) at the top of each function to fetch the value or else recompute it (same as Dirk's suggestion; sketched after this list).
  2. Use cacheSweave with my Sweave documents. This does all the work for you of caching computations and retrieves them automatically. It's really trivial to use: just use the cacheSweave driver and add this flag to each block: <<..., cache=true>>=
  3. Use save and load to save the environment at crucial moments, again making sure that all names are unique.
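
As a minimal sketch of the first approach (the helper name and data source are made up for illustration):

## fetch a cached object by name, recomputing only on a miss
get.or.compute <- function(name, compute) {
  if (exists(name, envir = .GlobalEnv)) {
    get(name, envir = .GlobalEnv)
  } else {
    value <- compute()
    assign(name, value, envir = .GlobalEnv)  # cache under a unique name
    value
  }
}

big.table <- get.or.compute("big.table",
                            function() read.csv("big_file.csv"))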
Shane
+2  A: 
## load the file from disk only if it 
## hasn't already been read into a variable
if (!exists("mytable")) {
  mytable <- read.csv(...)
}

Edit: fixed typo - thanks Dirk.

chrisamiller
thanks chris, but how do I make sure the table is kept in the workspace when using TextMate (or another editor)?
Roberto
if you're running non-interactively, use the save.image(file="mydata.Rdata") command to save your workspace. Then load the workspace with load() at the beginning of each run. There's still going to be some grinding involved, as R needs to get all that data back into memory, but it'll save you the expensive computational steps.
chrisamiller
Also, consider leaving an R session open, editing your scripts in TextMate, saving them, then loading the new code into R like so: source("~/pathto/myRscript.R"). This way you don't have to reload the data every time. Combine it with some exists() checks and it'll speed things up considerably.
chrisamiller
chris: that explains it very well, thanks!
Roberto
`mytable` must be given as a character string. As posted, the code does not work.
Dirk Eddelbuettel
+1  A: 

When you discover that a step is costly, save the results of that step in an R data file.

For example, if you loaded a CSV into a data frame called myVeryLargeDataFrame and then created summary stats from it in a data frame called VLDFSummary, you could do this:

save(myVeryLargeDataFrame, VLDFSummary, 
  file="~/myProject/cachedData/VLDF.RData", 
  compress="bzip2")

The compress argument is optional; use it if you want to compress the file written to disk. See ?save for more details.

After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:

load("~/myProject/cachedData/VLDF.RData")

This answer is not editor-dependent: it works the same for Emacs, TextMate, etc., and you can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you always know where your RData file came from and can recreate it from the source data if needed.

JD Long