When I undertake an R project of any complexity, my scripts quickly get long and confusing.

What are some practices I can adopt so that my code will always be a pleasure to work with? I'm thinking about things like

  • Placement of functions in source files
  • When to break something out to another source file
  • What should be in the master file
  • Using functions as organizational units (whether this is worthwhile given that R makes it hard to access global state)
  • Indentation / line break practices.
    • Treat ( like {?
    • Put things like )} on 1 or 2 lines?

Basically, what are your rules of thumb for organizing large R scripts?

+18  A: 

The standard answer is to use packages -- see the Writing R Extensions manual as well as different tutorials on the web.

It gives you

  • a quasi-automatic way to organize your code by topic
  • a strong nudge to write a help file, which makes you think about the interface
  • a lot of sanity checks via R CMD check
  • a chance to add regression tests
  • a means of using namespaces.

Just running source() over code works for really short snippets. Everything else should be in a package -- even if you do not plan to publish it, since you can write internal packages for internal repositories.

As for the 'how to edit' part, the R Internals manual has excellent R coding standards in Section 6. Otherwise, I tend to use defaults in Emacs' ESS mode.

Update 2008-Aug-13: David Smith just blogged about the Google R Style Guide.

Dirk Eddelbuettel
If you're growing your source-tree/analysis "organically," don't you find this hard to do/cumbersome? If you notice a mistake in your code (common while exploring a new problem space), you have to (i) fix source; (ii) reinstall package; (iii) reload it into your workspace? Is there a way to call library(...) in order to reload a package that's already loaded (step iii above)? Don't you have to kill your workspace, restart R, then reload your library/package in order to see if it's right?
Steve Lianoglou
Dirk, are you aware of any coding style guidelines similar to the Python PEP: http://www.python.org/dev/peps/pep-0008/ ... I've read the coding standards you mentioned, and they cover slightly different ground. I thought I'd ask since you seem pretty knowledgeable and I've never managed to come across anything similar in R. Thanks.
ars
Try googling for R coding style.
hadley
@hadley: I could have been clearer; I meant something roughly analogous to a Python PEP: an official/semi-official R document like the one linked to by Dirk. Perhaps I'm missing something obvious here. Thanks though.
ars
Ars, the manuals by R Core are as formal as it gets. The rest is contributed as e.g. Henrik's RCC.
Dirk Eddelbuettel
Thanks, Dirk. I guess this is the norm in most OSS projects. Python's an outlier because of Guido's "benevolent dictator" status (which accounts for the remarkable consistency of style across many Python libraries). I hadn't thought of it that way before, but it seems obvious now; yes, I'm slow. :)
ars
Ars, it so happens that a link just came through with a Google R Style Guide, I amended the post to show it.
Dirk Eddelbuettel
Oooh, very nice. Thank you, Dirk! (I know it's a commentary on lame corp culture, but I can use the google name to sell this to the few folk here who might be willing to experiment with R, but so far hold out with really silly excuses.)
ars
+5  A: 

I agree with Dirk's advice! IMHO, moving your programs from simple scripts to documented packages is, for programming in R, like switching from Word to TeX/LaTeX for writing. I recommend taking a look at the very useful Creating R Packages: A Tutorial by Friedrich Leisch.

Paolo
Packages look compelling. However, I worry they might be overkill. I'm not writing general-purpose code. Most of what I'm doing is test this hypothesis, test that hypothesis, plot this, adjust plot parameters, plot that, reshape the data, plot that. I'm doing stuff that, once finished, will probably never be re-run.
Dan Goldstein
In that case you should have a look at Sweave. It combines the R code with LaTeX. So you have the analysis and the report source together.
Thierry
+2  A: 

I also agree. Use the package.skeleton() function to get started. Even if you think your code may never be run again, it may help motivate you to create more general code that could save you time later.
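A minimal sketch of that first step, assuming a hypothetical helper double_it() and a made-up package name "mytools". It writes the skeleton into a temporary directory so nothing lands in your working directory:

```r
# Hypothetical helper to package; the names here are for illustration only.
funs <- new.env()
funs$double_it <- function(x) x * 2

# Scratch directory for the demo.
pkg_root <- tempfile("pkgdemo")
dir.create(pkg_root)

# Generate DESCRIPTION, NAMESPACE, R/ and man/ stubs for every object in 'funs'.
package.skeleton(name = "mytools", environment = funs, path = pkg_root)

# Inspect what was created.
list.files(file.path(pkg_root, "mytools"))
```

The generated Read-and-delete-me file lists the remaining manual steps, such as filling in the help stubs.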

As for accessing the global environment, that is easy with the <<- operator, though it is discouraged.
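A small sketch of the difference, using made-up names (counter, bump_global, bump). The first function reaches outside itself with <<-; the second gets the same effect by taking a value and returning a new one, which is usually easier to reason about:

```r
counter <- 0

# Discouraged style: <<- looks for an existing 'counter' in the enclosing
# environments and modifies it there instead of creating a local variable.
bump_global <- function() {
  counter <<- counter + 1
  invisible(counter)
}
bump_global()
bump_global()
counter

# Preferred style: pass the value in, return the new value, reassign.
bump <- function(n) n + 1
n <- 0
n <- bump(n)
n
```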

cameron.bracken
+3  A: 

I've been meaning to figure out how to write packages but haven't invested the time. For each of my mini-projects I keep all of my low-level functions in a folder called 'functions/', and source them into a separate namespace that I explicitly create.

The following lines of code will create an environment named "myfuncs" on the search path if it doesn't already exist (using attach), and populate it with the functions contained in the .r files in my 'functions/' directory (using sys.source). I usually put these lines at the top of my main script meant for the "user interface" from which high-level functions (invoking the low-level functions) are called.

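# Create an empty environment named "myfuncs" on the search path if it is not there yet.
# (attach("myfuncs") would try to load a file called "myfuncs", so attach NULL and name it.)
if (!("myfuncs" %in% search()))
  attach(NULL, pos = 2, name = "myfuncs")
# Source every .r file in functions/ into that environment.
for (f in list.files("functions", "\\.r$", full.names = TRUE))
  sys.source(f, envir = as.environment("myfuncs"))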

When you make changes you can always re-source it with the same lines, or use something like

evalq(f <- function(x) x * 2, pos.to.env(grep("^myfuncs$",search())))

to evaluate additions/modifications in the environment you created.

It's kludgey I know, but avoids having to be too formal about it (but if you get the chance I do encourage the package system - hopefully I will migrate that way in the future).

As for coding conventions, this is the only thing I've seen regarding aesthetics (I like them and loosely follow but I don't use too many curly braces in R):

http://www1.maths.lth.se/help/R/RCC/

There are other "conventions" regarding the use of [,drop=FALSE] and <- as the assignment operator suggested in various presentations (usually keynote) at the useR! conferences, but I don't think any of these are strict (though the [,drop=FALSE] is useful for programs in which you are not sure of the input you expect).
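To illustrate the drop=FALSE point with a toy matrix (the variable names are only for this sketch): selecting a single row silently turns the result into a plain vector unless you ask R to keep the dimensions.

```r
m <- matrix(1:6, nrow = 2)

row1_default <- m[1, ]                 # dimensions dropped: a plain vector
row1_kept    <- m[1, , drop = FALSE]   # still a matrix with one row

is.null(dim(row1_default))   # TRUE
dim(row1_kept)               # 1 3
```

Code that later calls nrow() or apply() on the result only works reliably in the second form.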

Stephen
A: 

R is OK for interactive use and small scripts, but I wouldn't use it for a large program. I'd use a mainstream language for most of the programming and wrap it in an R interface.

John D. Cook
There are seriously large packages (i.e. programs) out there. Are you seriously suggesting they should be rewritten in some other language? Why???
Eduardo Leoni
One consideration is efficiency. I've often rewritten R code as C++ code and made it 100x faster. Another is tool support. R has nothing comparable to IDEs like Eclipse or Visual Studio. Finally, if a program is very large, it's likely to be doing non-statistical tasks that R is not well suited to.
John D. Cook
There is a plugin (Stat-ET) available that allows Eclipse to interact with R. I agree that C++ can work much faster than R. But how much time do you need to recode the R stuff into C++? Unless you can reuse the code frequently, the benefit of the faster code is not worth much compared to the effort to recode it in C++.
Thierry
Yes, there's a trade-off (productivity vs. performance). And for purely data analysis/statistical work, R often wins. But for other tasks, e.g. GUI, web, etc., I'm not sure that's the case. We often prototype and work in R but deploy production code in Python/C++. With the latter you get performance *and* very mature and reusable libraries/frameworks for various tasks. But this is a fluid situation and the R ecosystem is constantly evolving.
ars
+1  A: 

Having not learned how to write packages yet, I have always organized by sourcing sub scripts. It's similar to writing classes but not as involved. It's not programmatically elegant, but I find I build up analyses over time. Once I have a big section that works, I often move it to a different script and just source it, since it will use the workspace objects. Perhaps I need to import data from several sources, sort all of them and find the intersections. I might put that section into an additional script. However, if you want to distribute your "application" to other people, or it uses some interactive input, a package is probably a good route. As a researcher I rarely need to distribute my analysis code but I OFTEN need to augment or tweak it.

kpierce8
+11  A: 

This might sound a little obvious especially if you're a programmer, but here's how I think about logical and physical units of code.

I don't know if this is your case, but when I'm working in R, I rarely start out with a large complex program in mind. I usually start in one script and separate code into logically separable units, often using functions. Data manipulation and visualization code get placed in their own functions, etc. And such functions are grouped together in one section of the file (data manipulation at the top, then visualization, etc). Ultimately you want to think about how to make it easier for you to maintain your script and lower the defect rate.

How fine/coarse grained you make your functions will vary and there are various rules of thumb: e.g. 15 lines of code, or "a function should be responsible for doing one task which is identified by its name", etc. Your mileage will vary. Since R doesn't support call-by-reference, I'm usually wary of making my functions too fine grained when it involves passing data frames or similar structures around. But this may be overcompensation for some silly performance mistakes when I first started out with R.
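A short sketch of what the lack of call-by-reference means in practice, with a made-up function add_flag(). A function works on its own copy of a data frame, so the caller only sees the change if the result is returned and reassigned:

```r
add_flag <- function(df) {
  df$flag <- df$x > 1   # modifies the function's own copy only
  df
}

d <- data.frame(x = 1:3)

add_flag(d)                  # result discarded
names_before <- names(d)     # still just "x"

d <- add_flag(d)             # reassign to keep the change
names_after <- names(d)      # "x" "flag"
```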

When to extract logical units into their own physical units (like source files and bigger groupings like packages)? I have two cases. First, if the file gets too large and scrolling around among logically unrelated units is an annoyance. Second, if I have functions that can be reused by other programs. I usually start out by placing some grouped unit, say data manipulation functions, into a separate file. I can then source this file from any other script.
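As a rough sketch of that last step, assume a hypothetical file data_manip.R holding one helper, drop_missing(). The example writes the file to a temporary directory only so that it runs on its own; in a real project the file would already exist next to your scripts.

```r
# Stand-in for a real helper file (file and function names are hypothetical).
helper_file <- file.path(tempdir(), "data_manip.R")
writeLines(
  "drop_missing <- function(df) df[complete.cases(df), , drop = FALSE]",
  helper_file
)

# In any analysis script: load the shared helpers, then use them.
source(helper_file)

raw   <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))
clean <- drop_missing(raw)
nrow(clean)
```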

If you're going to deploy your functions, then you need to start thinking about packages. I don't deploy R code in production or for re-use by others for various reasons (briefly: org culture prefers other languages, concerns about performance, GPL, etc). Also, I tend to constantly refine and add to my collections of sourced files, and I'd rather not deal with packages when I make a change. So you should check out the other package related answers, like Dirk's, for more details on this front.

Finally, I think your question isn't necessarily particular to R. I would really recommend reading Code Complete by Steve McConnell which contains a lot of wisdom about such issues and coding practices at large.

ars
Very helpful comment, ars, thank you. I am a programmer, but it's good to check in with others. When you say "Since R doesn't support call-by-reference, I'm usually wary of making my functions too fine grained", I hear you. I'm used to writing functions like ReadData(); CleanData(); AnalyzeData(); GraphData(); and R makes that cumbersome. I'm waking up to the idea that I need to be using "source" the way I use functions in other languages.
Dan Goldstein
You're right, Dan. I find myself using "source" that way for dataset preparation tasks, so I can just use a prepared data.frame across other scripts where the real analysis is done. I was never sure if this was good practice because it just feels weird relative to other languages -- more like shell scripting really. It's good to compare notes. :)
ars
+2  A: 

My concise answer:

  1. Write your functions carefully, identifying general enough outputs and inputs;
  2. Limit the use of global variables;
  3. Use S3 objects and, where appropriate, S4 objects;
  4. Put the functions in packages, especially when your functions are calling C/Fortran.
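For point 3, a minimal S3 sketch under assumed names (the class "fit_summary" and its constructor are invented for illustration): a constructor that validates its inputs, plus a print method that R dispatches to automatically.

```r
# Constructor: check inputs, then tag a plain list with a class attribute.
new_fit_summary <- function(estimate, se) {
  stopifnot(is.numeric(estimate), is.numeric(se))
  structure(list(estimate = estimate, se = se), class = "fit_summary")
}

# print() dispatches here for objects of class "fit_summary".
print.fit_summary <- function(x, ...) {
  cat("estimate:", format(x$estimate), "+/-", format(x$se), "\n")
  invisible(x)
}

s <- new_fit_summary(1.5, 0.2)
print(s)
```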

I believe R is more and more used in production, so the need for reusable code is greater than before. I find the interpreter much more robust than before. There is no doubt that R is 100-300x slower than C, but usually the bottleneck is concentrated around a few lines of code, which can be delegated to C/C++. I think it would be a mistake to delegate the strengths of R in data manipulation and statistical analysis to another language. In these instances, the performance penalty is low, and in any case well worth the savings in development effort. If execution time alone were the matter, we'd all be writing assembler.

gappy
+1  A: 

Count me as another person in favor of packages. I'll admit to being pretty poor at writing man pages and vignettes until I have to (i.e. when the package is being released), but it makes for a really handy way to bundle source code. Plus, if you get serious about maintaining your code, the points that Dirk brings up all come into play.

geoffjentry
+6  A: 

I like putting different functionality in their own files.

But I don't like R's package system. It's rather hard to use.

I prefer a lightweight alternative, to place a file's functions inside an environment (what every other language calls a "namespace") and attach it. For example, I made a 'util' group of functions like so:

util = new.env()

util$bgrep = function [...]

util$timeit = function [...]

while("util" %in% search())
  detach("util")
attach(util)

This is all in a file util.R. When you source it, you get the environment 'util' so you can call util$bgrep() and such; but furthermore, the attach() call makes it so just bgrep() and such work directly. If you didn't put all those functions in their own environment, they'd pollute the interpreter's top-level namespace (the one that ls() shows).
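A self-contained version of the same pattern, with two made-up helpers (clamp and rescale01) standing in for the elided ones, might look like this:

```r
util <- new.env()

# Hypothetical helpers, for illustration only.
util$clamp     <- function(x, lo, hi) pmin(pmax(x, lo), hi)
util$rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Detach any earlier copy so that re-sourcing the file does not stack duplicates.
while ("util" %in% search())
  detach("util")
attach(util)

qualified   <- util$clamp(c(-1, 0.5, 2), 0, 1)   # call through the environment
unqualified <- clamp(c(-1, 0.5, 2), 0, 1)        # found via the search path
```

The guard matters because detach() signals an error when the name is not on the search path, which is what happens the first time the file is sourced.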

I was trying to simulate Python's system, where every file is a module. That would be better to have, but this seems OK.

Brendan OConnor
Thanks, Brendan. That is very useful. What is up with the while loop? What is wrong with if (!("util" %in% search())) attach(util)
Dan Goldstein
so you can do source("util.R") again and again if you want to tweak it and such.
Brendan OConnor
you don't need a while loop really -- all you need is detach(util). i can't remember if it gives an error or not if it's not already loaded, but this is safest and does work. suggestions welcome.
Brendan OConnor