ansaurus

Question

Answer 1

+18 A:

The standard answer is to use packages -- see the Writing R Extensions manual as well as different tutorials on the web.

It gives you

a quasi-automatic way to organize your code by topic
strongly encourages you to write a help file, making you think about the interface
a lot of sanity checks via R CMD check
a chance to add regression tests
as well as a means for namespaces.

Just running source() over code works for really short snippets. Everything else should be in a package -- even if you do not plan to publish it as you can write internal packages for internal repositories.

As for the 'how to edit' part, the R Internals manual has excellent R coding standards in Section 6. Otherwise, I tend to use defaults in Emacs' ESS mode.

Update 2009-Aug-13: David Smith just blogged about the Google R Style Guide.

Dirk Eddelbuettel 2009-08-12 14:09:44

If you're growing your source-tree/analysis "organically," don't you find this hard to do/cumbersome? If you notice a mistake in your code (common while exploring a new problem space), you have to (i) fix source; (ii) reinstall package; (iii) reload it into your workspace? Is there a way to call library(...) in order to reload a package that's already loaded (step iii above)? Don't you have to kill your workspace, restart R then reload your library/package in order to see if it's right?

Steve Lianoglou 2009-08-13 04:18:10

Dirk, are you aware of any coding style guidelines similar to the Python PEP: http://www.python.org/dev/peps/pep-0008/ ... I've read the coding standards you mentioned, and they cover slightly different ground. I thought I'd ask since you seem pretty knowledgeable and I've never managed to come across anything similar in R. Thanks.

ars 2009-08-13 07:51:54

Try googling for R coding style.

hadley 2009-08-13 13:35:47

@hadley: I could have been clearer; I meant something roughly analogous to a Python PEP: an official/semi-official R document like the one linked to by Dirk. Perhaps I'm missing something obvious here. Thanks though.

ars 2009-08-13 15:30:22

Ars, the manuals by R Core are as formal as it gets. The rest is contributed as e.g. Henrik's RCC.

Dirk Eddelbuettel 2009-08-13 15:59:17

Thanks, Dirk. I guess this is the norm in most OSS projects. Python's an outlier because of Guido's "benevolent dictator" status (which accounts for the remarkable consistency of style across many Python libraries). I hadn't thought of it that way before, but it seems obvious now; yes, I'm slow. :)

ars 2009-08-13 17:51:52

Ars, it so happens that a link just came through with a Google R Style Guide, I amended the post to show it.

Dirk Eddelbuettel 2009-08-13 19:00:14

Oooh, very nice. Thank you, Dirk! (I know it's a commentary on lame corp culture, but I can use the google name to sell this to the few folk here who might be willing to experiment with R, but so far hold out with really silly excuses.)

ars 2009-08-13 19:19:53

Answer 2

+5 A:

I agree with Dirk's advice! IMHO, moving your programs from simple scripts to documented packages is, for programming in R, like switching from Word to TeX/LaTeX for writing. I recommend taking a look at the very useful Creating R Packages: A Tutorial by Friedrich Leisch.

Paolo 2009-08-12 15:39:14

Packages look compelling. However, I worried they might be overkill. I'm not writing general-purpose code. Most of what I'm doing is test this hypothesis, test that hypothesis, plot this, adjust plot parameters, plot that, reshape the data, plot that. I'm doing stuff that, once finished, will probably never be re-run.

Dan Goldstein 2009-08-12 16:31:20

In that case you should have a look at Sweave. It combines the R code with LaTeX. So you have the analysis and the report source together.

Thierry 2009-08-12 22:58:36

Answer 3

+2 A:

I also agree. Use the package.skeleton() function to get started. Even if you think your code may never be run again, it may help motivate you to create more general code that could save you time later.
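For anyone who has not tried it, here is a minimal sketch of what a package.skeleton() call can look like. The helper functions, the package name "myanalysis" and the temporary output directory are all made up for illustration; only package.skeleton() itself comes from the answer above.

```r
# Hypothetical helper functions standing in for real analysis code.
helpers <- new.env()
helpers$drop_incomplete <- function(df) df[complete.cases(df), , drop = FALSE]
helpers$column_means <- function(df) colMeans(df)

# Build the skeleton in a temporary directory so the working
# directory is left untouched.
pkg_root <- tempfile("pkgdemo")
dir.create(pkg_root)
package.skeleton(
  name = "myanalysis",                          # hypothetical package name
  list = c("drop_incomplete", "column_means"),  # objects to include
  environment = helpers,                        # where those objects live
  path = pkg_root
)

# Inspect the generated tree (DESCRIPTION, R/, man/, ...).
list.files(file.path(pkg_root, "myanalysis"))
```

The generated man/ stubs still need to be filled in by hand before R CMD check will pass.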

As for accessing the global environment, that is easy with the <<- operator, though it is discouraged.
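A small sketch of how <<- behaves, with invented names (make_counter, set_flag, analysis_done). It assigns to an existing variable in an enclosing environment if there is one, and only falls back to creating a global when nothing is found, which is part of why it is easy to misuse.

```r
# <<- updates `count` in the enclosing function's environment,
# not in the global environment.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
first <- counter()
second <- counter()

# With no `analysis_done` in any enclosing function environment,
# <<- creates the variable in the global environment.
set_flag <- function() {
  analysis_done <<- TRUE
  invisible(NULL)
}
set_flag()
exists("analysis_done", envir = globalenv())
```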

cameron.bracken 2009-08-12 17:31:32

Answer 4

+3 A:

I've been meaning to figure out how to write packages but haven't invested the time. For each of my mini-projects I keep all of my low-level functions in a folder called 'functions/', and source them into a separate namespace that I explicitly create.

The following lines of code will create an environment named "myfuncs" on the search path if it doesn't already exist (using attach), and populate it with the functions contained in the .r files in my 'functions/' directory (using sys.source). I usually put these lines at the top of my main script meant for the "user interface" from which high-level functions (invoking the low-level functions) are called.

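# Create an empty environment named "myfuncs" on the search path if needed
if (!("myfuncs" %in% search()))
  attach(NULL, name = "myfuncs", pos = 2)
# Source every .r file in functions/ into that environment
for (f in list.files("functions", "\\.r$", full.names = TRUE))
  sys.source(f, envir = as.environment("myfuncs"))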

When you make changes you can always re-source it with the same lines, or use something like

evalq(f <- function(x) x * 2, pos.to.env(grep("^myfuncs$",search())))

to evaluate additions/modifications in the environment you created.

It's kludgey I know, but avoids having to be too formal about it (but if you get the chance I do encourage the package system - hopefully I will migrate that way in the future).

As for coding conventions, this is the only thing I've seen regarding aesthetics (I like them and loosely follow but I don't use too many curly braces in R):

http://www1.maths.lth.se/help/R/RCC/

There are other "conventions" regarding the use of [,drop=FALSE] and <- as the assignment operator suggested in various presentations (usually keynote) at the useR! conferences, but I don't think any of these are strict (though the [,drop=FALSE] is useful for programs in which you are not sure of the input you expect).
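To illustrate the [,drop=FALSE] point, here is a tiny example with a made-up matrix: without drop = FALSE a single-row subset silently becomes a plain vector, which can break code that expects a matrix.

```r
m <- matrix(1:6, nrow = 2)

row_default <- m[1, ]                # dimensions are dropped: plain vector
row_kept    <- m[1, , drop = FALSE]  # still a 1 x 3 matrix

dim(row_default)  # NULL
dim(row_kept)     # 1 3
```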

Stephen 2009-08-12 19:41:03

Answer 5

A:

R is OK for interactive use and small scripts, but I wouldn't use it for a large program. I'd use a mainstream language for most of the programming and wrap it in an R interface.

John D. Cook 2009-08-12 19:42:45

There are seriously large packages (i.e. programs) out there. Are you seriously suggesting they should be rewritten in some other language? Why???

Eduardo Leoni 2009-08-12 21:11:35

One consideration is efficiency. I've often rewritten R code as C++ code and made it 100x faster. Another is tool support. R has nothing comparable to IDEs like Eclipse or Visual Studio. Finally, if a program is very large, it's likely to be doing non-statistical tasks that R is not well suited to.

John D. Cook 2009-08-12 22:03:27

There is a plugin (StatET) available that allows Eclipse to interact with R. I agree that C++ can work much faster than R. But how much time do you need to recode the R stuff into C++? Unless you can reuse the code frequently, the benefit of the faster code is not worth much compared to the effort to recode it in C++.

Thierry 2009-08-12 22:55:06

Yes, there's a trade off (productivity v performance). And for purely data analysis/statistical work, R often wins. But for writing other tasks, e.g. GUI, web, etc, I'm not sure that's the case. We often prototype and work in R but deploy production code in Python/C++. With the latter you get performance *and* very mature and reusable libraries/frameworks for various tasks. But, this is a fluid situation and the R ecosystem is constantly evolving.

ars 2009-08-13 04:32:22

Answer 6

+1 A:

Having not learned how to write packages yet, I have always organized by sourcing sub-scripts. It's similar to writing classes but not as involved. It's not programmatically elegant but I find I build up analyses over time. Once I have a big section that works I often move it to a different script and just source it since it will use the workspace objects. Perhaps I need to import data from several sources, sort all of them and find the intersections. I might put that section into an additional script. However, if you want to distribute your "application" for other people, or it uses some interactive input, a package is probably a good route. As a researcher I rarely need to distribute my analysis code but I OFTEN need to augment or tweak it.

kpierce8 2009-08-12 20:19:17

Answer 7

+11 A:

This might sound a little obvious especially if you're a programmer, but here's how I think about logical and physical units of code.

I don't know if this is your case, but when I'm working in R, I rarely start out with a large complex program in mind. I usually start in one script and separate code into logically separable units, often using functions. Data manipulation and visualization code get placed in their own functions, etc. And such functions are grouped together in one section of the file (data manipulation at the top, then visualization, etc). Ultimately you want to think about how to make it easier for you to maintain your script and lower the defect rate.

How fine/coarse grained you make your functions will vary and there are various rules of thumb: e.g. 15 lines of code, or "a function should be responsible for doing one task which is identified by its name", etc. Your mileage will vary. Since R doesn't support call-by-reference, I'm usually wary of making my functions too fine grained when it involves passing data frames or similar structures around. But this may be overcompensation for some silly performance mistakes when I first started out with R.
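A quick sketch of what the lack of call-by-reference means in practice (add_flag and the data frame are invented for illustration): a function that modifies its argument works on its own copy, so the result has to be returned and reassigned.

```r
# Modifying `df` inside the function does not touch the caller's object.
add_flag <- function(df) {
  df$flag <- TRUE
  df
}

original <- data.frame(x = 1:3)
modified <- add_flag(original)

names(original)  # "x"
names(modified)  # "x" "flag"
```

R only makes the copy when the object is actually modified, so merely passing a data frame into a function is cheap.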

When to extract logical units into their own physical units (like source files and bigger groupings like packages)? I have two cases. First, if the file gets too large and scrolling around among logically unrelated units is an annoyance. Second, if I have functions that can be reused by other programs. I usually start out by placing some grouped unit, say data manipulation functions, into a separate file. I can then source this file from any other script.
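As a rough sketch of that workflow, the snippet below writes a one-function helper file and sources it from a "main" script. The file name, the function drop_incomplete and the temporary directory are assumptions made so the example is self-contained; in practice the helper file would just sit next to your scripts.

```r
# Stand-in for a separate file of data manipulation functions.
helper_file <- file.path(tempdir(), "data_manipulation.R")
writeLines(
  "drop_incomplete <- function(df) df[complete.cases(df), , drop = FALSE]",
  helper_file
)

# Any other script can now pull those functions in with source().
source(helper_file)

raw <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
clean <- drop_incomplete(raw)
nrow(clean)
```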

If you're going to deploy your functions, then you need to start thinking about packages. I don't deploy R code in production or for re-use by others for various reasons (briefly: org culture prefers other languages, concerns about performance, GPL, etc). Also, I tend to constantly refine and add to my collections of sourced files, and I'd rather not deal with packages when I make a change. So you should check out the other package-related answers, like Dirk's, for more details on this front.

Finally, I think your question isn't necessarily particular to R. I would really recommend reading Code Complete by Steve McConnell which contains a lot of wisdom about such issues and coding practices at large.

ars 2009-08-13 03:02:17

Very helpful comment, ars, thank you. I am a programmer, but it's good to check in with others. When you say "Since R doesn't support call-by-reference, I'm usually wary of making my functions too fine grained", I hear you. I'm used to writing functions like ReadData(); CleanData(); AnalyzeData(); GraphData(); and R makes that cumbersome. I'm waking up to the idea that I need to be using "source" the way I use functions in other languages.

Dan Goldstein 2009-08-13 07:47:41

You're right, Dan. I find myself using "source" that way for dataset preparation tasks, so I can just use a prepared data.frame across other scripts where the real analysis is done. I was never sure if this was good practice because it just feels weird relative to other languages -- more like shell scripting really. It's good to compare notes. :)

ars 2009-08-13 19:04:34

Answer 8

+2 A:

My concise answer:

Write your functions carefully, identifying general enough outputs and inputs;
Limit the use of global variables;
Use S3 objects and, where appropriate, S4 objects;
Put the functions in packages, especially when your functions are calling C/Fortran.
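For readers who have not used S3, a minimal sketch of the idea follows. The class name fit_summary and its fields are hypothetical; the point is only that a constructor plus a couple of methods gives functions a well-defined output.

```r
# Constructor: validates inputs and tags the result with a class.
fit_summary <- function(estimate, se) {
  stopifnot(is.numeric(estimate), is.numeric(se))
  structure(list(estimate = estimate, se = se), class = "fit_summary")
}

# Methods are ordinary functions named generic.class.
format.fit_summary <- function(x, ...) {
  sprintf("%.2f (SE %.2f)", x$estimate, x$se)
}
print.fit_summary <- function(x, ...) {
  cat(format(x), "\n")
  invisible(x)
}

s <- fit_summary(1.5, 0.2)
print(s)
```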

I believe R is more and more used in production, so the need for reusable code is greater than before. I find the interpreter much more robust than before. There is no doubt that R is 100-300x slower than C, but usually the bottleneck is concentrated around a few lines of code, which can be delegated to C/C++. I think it would be a mistake to delegate the strengths of R in data manipulation and statistical analysis to another language. In these instances, the performance penalty is low, and in any case well worth the savings in development effort. If execution time alone were the matter, we'd all be writing assembler.

gappy 2009-08-16 12:32:24

Answer 9

+1 A:

Count me as another person in favor of packages. I'll admit to being pretty poor on writing man pages and vignettes until if/when I have to (i.e. being released), but it makes for a real handy way to bundle source code. Plus, if you get serious about maintaining your code, the points that Dirk brings up all come into play.

geoffjentry 2009-08-19 01:56:32

Answer 10

+6 A:

I like putting different functionality in their own files.

But I don't like R's package system. It's rather hard to use.

I prefer a lightweight alternative, to place a file's functions inside an environment (what every other language calls a "namespace") and attach it. For example, I made a 'util' group of functions like so:

util = new.env()

util$bgrep = function [...]

util$timeit = function [...]

while("util" %in% search())
  detach("util")
attach(util)

This is all in a file util.R. When you source it, you get the environment 'util' so you can call util$bgrep() and such; but furthermore, the attach() call makes it so just bgrep() and such work directly. If you didn't put all those functions in their own environment, they'd pollute the interpreter's top-level namespace (the one that ls() shows).
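Since the function bodies above are elided, here is a self-contained sketch of the same pattern with two made-up functions (double_it and elapsed_seconds are placeholders, not the original bgrep/timeit).

```r
util <- new.env()

# Hypothetical utility functions stored in the environment.
util$double_it <- function(x) x * 2
util$elapsed_seconds <- function(expr) system.time(expr)[["elapsed"]]

# Remove any previously attached copy so re-sourcing picks up edits,
# then attach the fresh one to the search path.
while ("util" %in% search())
  detach("util")
attach(util)

util$double_it(21)  # qualified call
double_it(21)       # works directly thanks to attach()
```

Note that attach() puts a copy of the environment on the search path, so functions added to util afterwards are not visible unqualified until it is detached and attached again.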

I was trying to simulate Python's system, where every file is a module. That would be better to have, but this seems OK.

Brendan OConnor 2009-08-23 22:42:54

Thanks, Brendan. That is very useful. What is up with the while loop? What is wrong with if (!("util" %in% search())) attach(util)?

Dan Goldstein 2009-08-25 12:12:26

so you can do source("util.R") again and again if you want to tweak it and such.

Brendan OConnor 2009-08-28 08:09:24

you don't need a while loop really -- all you need is detach(util). i can't remember if it gives an error or not if it's not already loaded, but this is safest and does work. suggestions welcome.

Brendan OConnor 2009-08-28 08:10:09


How to organize large R programs?
