Hi,

My question is about avoiding namespace pollution when writing modules in R.

Right now, in my R project, I have functions1.R with doFoo() and doBar(), functions2.R with other functions, and main.R with the main program in it, which first does source('functions1.R'); source('functions2.R'), and then calls the other functions.

I've been starting the program from the R GUI in Mac OS X, with source('main.R'). This is fine the first time, but after that, the variables that were defined the first time through the program are defined for the second time functions*.R are sourced, and so the functions get a whole bunch of extra variables defined.

I don't want that! I want an "undefined variable" error when my function uses a variable it shouldn't! Twice this has given me very late nights of debugging!

So how do other people deal with this sort of problem? Is there something like source(), but that makes an independent namespace that doesn't fall through to the main one? Making a package seems like one solution, but it seems like a big pain in the butt compared to e.g. Python, where a source file is automatically a separate namespace.

Any tips? Thank you!

+3  A: 

If you switch to using packages, you get namespaces as a side-benefit (provided you use a NAMESPACE file). There are other advantages to using packages.

If you were really trying to avoid packages (which you shouldn't), then you could try assigning your variables in specific environments.
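
A minimal sketch of that second idea, using only base R. The names `helpers`, `counter`, and `bump` are made up for illustration and are not from the question.

```r
# Hypothetical names throughout; only base R is used.
helpers <- new.env()

# Store a variable and a function in 'helpers' rather than in the global environment.
assign("counter", 0, envir = helpers)
assign("bump", function() {
  # Update the counter that lives inside 'helpers'.
  helpers$counter <- helpers$counter + 1
  invisible(helpers$counter)
}, envir = helpers)

helpers$bump()
helpers$bump()

get("counter", envir = helpers)                           # 2
exists("bump", envir = globalenv(), inherits = FALSE)     # FALSE: nothing leaked
```

Only `helpers` itself sits in the workspace; everything else has to be reached through it, so a stray reference to `counter` elsewhere fails instead of silently picking up an old value.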

Dirk Eddelbuettel
I don't really have anything against packages, except that it takes many steps to make them. As far as I can tell, you have to do package.skeleton(), then R CMD build, then R CMD INSTALL. Do I have to do this each time I make a change to my code? It seems like a clumsy way to do development.
rescdsk
There are numerous upsides to packaging, and they have been discussed here often enough that we need not rehash them. If you don't like it, you don't use it. Your loss, not mine :) That said, for small incremental changes you have other options (fix(), edit(), source(), ...) but for anything with some structure, somewhat frequent use, larger than minuscule size, ... I tend to like packages. A lot.
Dirk Eddelbuettel
+5  A: 

I would explore two possible solutions to this.

a) Think in a more functional manner. Don't create any variables outside of a function. So, for example, main.R should contain one function, main(), which sources the other files and does the work. When main() returns, none of the clutter will remain.

b) Clean things up manually:

#main.R
prior_variables <- ls()
source('functions1.R')
source('functions2.R')

#stuff happens

rm(list = setdiff(ls(),prior_variables))
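
A sketch of option a), assuming functions1.R defines doFoo() and doBar() as in the question; the file contents below are invented stand-ins written to a temporary file. One detail worth noting: source() evaluates in the global environment by default, so `local = TRUE` is used here to keep the sourced definitions inside main().

```r
# Stand-in for functions1.R (hypothetical contents), written to a temporary
# file so the sketch is self-contained.
f1 <- tempfile(fileext = ".R")
writeLines(c("doFoo <- function(x) x + 1",
             "doBar <- function(x) doFoo(x) * 2"), f1)

main <- function() {
  # local = TRUE evaluates the file in main()'s own frame; the default
  # (local = FALSE) would define doFoo/doBar in the global environment.
  source(f1, local = TRUE)
  doBar(1)
}

result <- main()
result                                                    # 4
exists("doFoo", envir = globalenv(), inherits = FALSE)    # FALSE
```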
Ian Fellows
I do agree that most variables should be kept in a local environment (e.g., within functions).
Stephen
+2  A: 

Well, avoiding namespace pollution, as you put it, is just a matter of diligently partitioning the namespace and keeping your global namespace uncluttered.

Here are the essential functions for those two kinds of tasks:

Understanding/Navigating the Namespace Structure

At start-up, R creates a new environment to store all objects created during that session--this is the "global environment".

# to get the name of that environment:
globalenv()

But this isn't the root environment. The root is an environment called "the empty environment"--all environments chain back to it:

emptyenv()
returns: <environment: R_EmptyEnv>

# to view all of the chained parent environments (which includes '.GlobalEnv'):
search()
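
The chain can also be walked by hand with parent.env(). This is only a sketch; the variable names `env` and `chain` are arbitrary.

```r
# Walk from the global environment up to the empty environment,
# collecting the name of each environment on the way.
env <- globalenv()
chain <- character(0)
while (!identical(env, emptyenv())) {
  chain <- c(chain, environmentName(env))
  env <- parent.env(env)
}
chain   # starts with "R_GlobalEnv" and ends with "base"
```

The entries correspond one-to-one with what search() reports, which is why anything attached to the search path becomes visible from the global environment.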

Creating New Environments:

workspace1 = new.env()

is.environment(workspace1)
returns: [1] TRUE

class(workspace1)
returns: [1] "environment"

# add an object to this new environment
# (assumes the .RData file contains an object named test_obj):
load("/Users/doug/Documents/test_obj.RData", envir=workspace1)

# verify that it's there:
exists("test_obj", where=workspace1, inherits=FALSE)
returns: [1] TRUE

# to see which environment encloses the new one:
parent.env(workspace1)
returns: <environment: R_GlobalEnv>

objects(workspace1)
returns: [1] "test_obj"

Coming from Python, et al., this system (at first) seemed to me like a room full of carnival mirrors. The R Gurus on the other hand seem to be quite comfortable with it. I'm sure there are a number of reasons why, but my intuition is that they don't let environments persist. I notice that R beginners use 'attach', as in attach(this_dataframe); I've noticed that experienced R users don't do that; they use 'with' instead, e.g.,

with(this_dataframe, tapply(etc....))

(I suppose they would achieve the same thing if they used 'attach' then 'detach' but 'with' is faster and you don't have to remember the second step.) In other words, namespace collisions are avoided in part by limiting the objects visible from the global namespace.
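
A small sketch of that pattern; the data frame and its column names (`grp`, `score`) are invented for illustration.

```r
# Hypothetical data frame for illustration.
this_dataframe <- data.frame(grp   = c("a", "a", "b"),
                             score = c(1, 2, 10))

# The columns are visible only while this one expression is evaluated.
means <- with(this_dataframe, tapply(score, grp, mean))
means

# Nothing was added to the search path or to the global environment.
"this_dataframe" %in% search()                            # FALSE
exists("score", envir = globalenv(), inherits = FALSE)    # FALSE
```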

doug
+3  A: 

The main function you want to use is sys.source(), which loads your functions/variables into a namespace ("environment" in R) other than the global one. One other thing you can do in R that is fantastic is to attach namespaces to your search() path so that you need not reference the namespace directly. That is, if "namespace1" is on your search path, a function within it, say "fun1", need not be called as namespace1.fun1() as in Python, but as fun1().

Lookup order: if there are several functions with the same name, the one in the environment that appears first in the search() list is called. To call a function in a particular namespace explicitly, one of many possible syntaxes, albeit a bit ugly, is get("fun1","namespace1")(...), where ... are the arguments to fun1(). This also works with variables, using the syntax get("var1","namespace1").

I do this all the time (I usually load just functions, but the distinction between functions and variables in R is small), so I've written a few convenience functions that load from my ~/.Rprofile.

  name.to.env <- function(env.name)
    ## returns named environment on search() path (exact name match)
    pos.to.env(match(env.name,search()))

  attach.env <- function(env.name)
    ## creates and attaches environment to search path if it doesn't already exist
    if( !(env.name %in% search()) ) attach(NULL,name=env.name,pos=2)

  populate.env <- function(env.name,path,...) {
    ## populates environment with functions in file or directory
    ## creates and attaches named environment to search() path 
    ##        if it doesn't already exist
    attach.env(env.name)
    if( file.info(path[1])$isdir )
      lapply(list.files(path,full.names=TRUE,...),
             sys.source,name.to.env(env.name)) else
    lapply(path,sys.source,name.to.env(env.name))
    invisible()
  }

Example usage:

populate.env("fun1","pathtofile/functions1.R")
populate.env("fun2","pathtofile/functions2.R")

and so on, which will create two separate namespaces: "fun1" and "fun2", which are attached to the search() path ("fun2" will be higher on the search() list in this case). This is akin to doing something like

attach(NULL,name="fun1")
sys.source("pathtofile/functions1.R",pos.to.env(2))

manually for each file ("2" is the default position on the search() path). The way that populate.env() is written, if a directory, say "functions/", contains many R files without conflicting function names, you can call it as

populate.env("myfunctions","functions/")

to load all functions (and variables) into a single namespace. With name.to.env(), you can also do something like

with(name.to.env("fun1"), doStuff(var1))

or

evalq(doStuff(var1), name.to.env("fun1"))

Of course, if your project grows big and you have lots and lots of functions (and variables), writing a package is the way to go.
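
For readers who want to try the idea without the helper functions, here is a self-contained sketch. The contents of the temporary file stand in for functions1.R and are invented; `via_search`, `via_get`, and `leaked` are just illustrative variable names.

```r
# Stand-in for functions1.R (hypothetical contents), written to a temporary file.
f1 <- tempfile(fileext = ".R")
writeLines("doFoo <- function(x) x + 1", f1)

# Create an empty environment at position 2 of the search path, then fill it.
attach(NULL, name = "fun1")
sys.source(f1, envir = as.environment("fun1"))

via_search <- doFoo(1)                   # found through the search path
via_get    <- get("doFoo", "fun1")(1)    # explicit lookup by environment name
leaked     <- exists("doFoo", envir = globalenv(), inherits = FALSE)

c(via_search, via_get)                   # 2 2
leaked                                   # FALSE: doFoo is not in the global environment

# Remove the environment from the search path again.
detach("fun1")
```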

Stephen
In my function I have a logical `reload` argument which, if set to `TRUE`, causes a `detach` when an environment with that exact name is already attached. It helps when your code is modified often during analysis.
Marek
In the `attach(NULL, ...)` call, it should be `attach(NULL, name="fun1")`. And thanks, this is really great!
rescdsk
doh! @rescdsk - thanks for pointing that out. I made the change above for others looking to do something similar. and @Marek - I also thought about what to do when the environment is already attached... mine just loads the modifications into the existing environment, but I agree that your way might be cleaner as it does not retain deleted functions or variables if they are deleted from the sourced code.
Stephen
@rescdsk - you can also create a new environment not on the search() list with something like myenv <- new.env(), and use myenv as the argument to 'envir' (rather than using pos.to.env() or name.to.env()) for the functions get(), assign(), with(), within(), or evalq() to set and get variable/function values in or from your environment(s).
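
A sketch of that variant, with one addition that is an assumption on my part rather than something stated above: the `parent` argument of new.env() decides whether sourced functions can still fall through to the global environment. The names `myenv`, `strict_env`, `usesStray`, and `stray_value` are hypothetical.

```r
# Stand-in source file (hypothetical contents): a function that uses a
# variable it never defines.
f1 <- tempfile(fileext = ".R")
writeLines("usesStray <- function() stray_value", f1)

# A leftover variable in the global environment.
assign("stray_value", 42, envir = globalenv())

# Parent is the global environment: lookups fall through to the workspace.
myenv <- new.env(parent = globalenv())
sys.source(f1, envir = myenv)

# Parent is the base environment: the workspace is not on the lookup chain.
strict_env <- new.env(parent = baseenv())
sys.source(f1, envir = strict_env)

leaky_result  <- myenv$usesStray()       # 42, silently picks up the leftover
strict_result <- tryCatch(strict_env$usesStray(),
                          error = function(e) "error: object not found")
strict_result                            # "error: object not found"

# Clean up the leftover variable.
rm("stray_value", envir = globalenv())
```

With `parent = baseenv()`, functions from other attached packages (stats, utils, and so on) are not visible either unless called with `::`, so this is a trade-off rather than a free win.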
Stephen