tags:

views:

329

answers:

2

I have 12 data frames to work with. They are similar and I have to do the same processing to each one, so I wrote a function that takes a data frame, processes it, and then returns a data frame. This works. But I am afraid that I am passing around a very big structure. I may be making temporary copies (am I?) This can't be efficient. What is the best way to avoid passing a data frame around? Thank you.

doSomething <- function(df) {
  // do something with the data frame, df
  return(df)
}
+3  A: 

You are, indeed, passing the object around and using some memory. But I don't think you can do an operation on an object in R without passing the object around. Even if you didn't create a function and did your operations outside of the function, R would behave basically the same.

The best way to see this is to set up an example. If you are in Windows open Windows Task Manager. If you are in Linux open a terminal window and run the top command. I'm going to assume Windows in this example. In R run the following:

col1<-rnorm(1000000,0,1)
col2<-rnorm(1000000,1,2)
myframe<-data.frame(col1,col2)

rm(col1)
rm(col2)
gc()

this creates a couple of vectors called col1 and col2 then combines them into a data frame called myframe. It then drops the vectors and forces garbage collection to run. Watch in your windows task manager at the mem usage for the Rgui.exe task. When I start R it uses about 19 meg of mem. After I run the above commands my machine is using just under 35 meg for R.

Now try this:

myframe<-myframe+1

your memory usage for R should jump to over 144 meg. If you force garbage collection using gc() you will see it drop back to around 35 meg. To try this using a function, you can do the following:

doSomething <- function(df) {
    df<-df+1-1
return(df)
}
myframe<-doSomething(myframe)

when you run the code above, memory usage will jump up to 160 meg or so. Running gc() will drop it back to 35 meg.

So what to make of all this? Well, doing an operation outside of a function is not that much more efficient (in terms of memory) than doing it in a function. Garbage collection cleans things up real nice. Should you force gc() to run? Probably not as it will run automatically as needed, I just ran it above to show how it impacts memory usage.

I hope that helps!

JD Long
+3  A: 

I'm no R expert, but most languages use a reference counting scheme for big objects. A copy of the object data will not be made until you modify the copy of the object. If your functions only read the data (i.e. for analysis) then no copy should be made.

anon
This is how R works as well, Neil. Good point.
JD Long