A very simple question:

I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.

This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row data sets) whose results I'd rather cache in the R environment than recompute every time I run the script (which is many times, as I progress and test new lines of code).

Is there a way to cache what a script does up to a certain point, so that each run only executes the incremental lines of code (just as I would when running R interactively)?

Thanks,

Roberto

+2  A: 

I want to do this too when I'm using Sweave. I'd suggest putting all of your expensive functions (loading and reshaping data) at the beginning of your code. Run that code, then save the workspace. Then, comment out the expensive functions, and load the workspace file with load(). This is, of course, riskier if you make unwanted changes to the workspace file, but in that event, you still have the code in comments if you want to start over from scratch.
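
A minimal sketch of that workflow (the file names and reshape arguments are illustrative, not from the original post):

## First run: do the expensive steps, then snapshot the workspace
long.data <- read.csv("big_file.csv")                 # expensive load
wide.data <- reshape(long.data, direction = "wide",
                     idvar = "id", timevar = "time")  # expensive reshape
save.image(file = "myproject.RData")

## Later runs: comment out the lines above and restore the snapshot
# load("myproject.RData")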

JoFrhwld
Jo, thanks for the answer. Do you know how to save the workspace using TextMate?
Roberto
but what if I don't work in Sweave?
Roberto
the load() approach is a good one to use and it is not dependent on Sweave.
JD Long
+2  A: 

Some simple approaches can be built from a combination of

  • exists("foo") to test whether a variable already exists, and re-load or re-compute it if not
  • file.info("foo.RData")$ctime, which you can compare to Sys.time(): if the cached file is newer than a given age you can load it, else recompute (a sketch combining both checks follows below).

There are also caching packages on CRAN that may be useful.
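
A minimal sketch combining both checks (the cache file name, age limit, and data source are assumptions for illustration):

## reuse foo if it exists; otherwise load a fresh-enough cache or recompute
if (!exists("foo")) {
  cache   <- "foo.RData"   # hypothetical cache file
  max.age <- 60 * 60       # accept caches up to one hour old, in seconds
  if (file.exists(cache) &&
      as.numeric(Sys.time()) - as.numeric(file.info(cache)$ctime) < max.age) {
    load(cache)            # restores foo into the workspace
  } else {
    foo <- read.csv("big_file.csv")  # hypothetical expensive step
    save(foo, file = cache)
  }
}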

Dirk Eddelbuettel
Dirk, but doesn't every object have to be recreated anyway every time the script is re-run? So foo will never exist and will always be recomputed, right?
Roberto
It depends. Sometimes one gets data from, say, a database which may be extensive. You could then cache this in a file and use the timestamp (as I described) to see whether you need a new db access or not. It all depends on the particulars of your situation.
Dirk Eddelbuettel
+1  A: 

Without going into too much detail, I usually follow one of three approaches:

  1. Use assign to give each important object a unique name throughout my execution. Then include an if (exists(...)) get(...) at the top of each function to fetch the value or else recompute it (same as Dirk's suggestion; sketched after this list).
  2. Use cacheSweave with my Sweave documents. This does all the work for you of caching computations and retrieves them automatically. It's really trivial to use: just use the cacheSweave driver and add this flag to each block: <<..., cache=true>>=
  3. Use save and load to save the environment at crucial moments, again making sure that all names are unique.
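
As a minimal sketch of the first approach (the helper name and data source are made up for illustration):

## fetch a cached object by name, recomputing only on a miss
get.or.compute <- function(name, compute) {
  if (exists(name, envir = .GlobalEnv)) {
    get(name, envir = .GlobalEnv)
  } else {
    value <- compute()
    assign(name, value, envir = .GlobalEnv)  # cache under a unique name
    value
  }
}

big.table <- get.or.compute("big.table",
                            function() read.csv("big_file.csv"))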
Shane
+2  A: 
## load the file from disk only if it 
## hasn't already been read into a variable
if (!exists("mytable")) {
  mytable <- read.csv(...)
}

Edit: fixed typo - thanks Dirk.

chrisamiller
thanks chris, but how do I make sure the table is kept in the workspace when using TextMate (or another editor)?
Roberto
if you're running non-interactively, use the save.image(file="mydata.Rdata") command to save your workspace. Then load the workspace with load() at the beginning of each run. There's still going to be some grinding involved, as R needs to get all that data back into memory, but it'll save you the expensive computational steps.
chrisamiller
Also, consider leaving an R session open, editing your scripts in TextMate, saving them, then loading the new code into R like so: source("~/pathto/myRscript.R"). This way you don't have to reload the data every time. Combine it with some exists() checks and it'll speed things up considerably.
chrisamiller
chris: that explains it very well, thanks!
Roberto
`mytable` must be given as a character string. As posted, the code does not work.
Dirk Eddelbuettel
+1  A: 

When you discover that a step is costly, save the results of that step in an R data file.

For example, if you loaded a CSV into a data frame called myVeryLargeDataFrame and then created summary stats from it in a data frame called VLDFSummary, you could do this:

save(myVeryLargeDataFrame, VLDFSummary, 
  file="~/myProject/cachedData/VLDF.RData", 
  compress="bzip2")

The compress argument is optional; use it if you want to compress the file written to disk. See ?save for more details.

After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:

load("~/myProject/cachedData/VLDF.RData")

This answer is not editor-dependent: it works the same for Emacs, TextMate, etc., and you can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you always know where your RData file came from and can recreate it from the source data if needed.

JD Long