I have a function inside a loop inside a function. The inner function acquires and stores a large vector of data in memory (as a global variable... I'm using "R" which is like "S-Plus"). The loop loops through a long list of data to be acquired. The outer function starts the process and passes in the list of datasets to be acquired.

I programmed the inner function to store each dataset before moving to the next, so all the work of the outer function occurs as side effects on global variables... a big no-no. Is this better or worse than collecting and returning a giant, memory-hogging vector of vectors? Is there a superior third approach?

Would the answer change if I were storing the data vectors in a database rather than in memory? Ideally, I'd like to be able to terminate the function (or have it fail due to network timeouts) without losing all the information processed prior to termination.

A: 

It's tough to say definitively without knowing the language/compiler used. However, if you can simply pass a pointer/reference to the object that you're creating, then the size of the object itself has nothing to do with the speed of the function calls. Manipulating this data down the road could be a different story.
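
In R specifically, arguments are passed with copy-on-modify semantics, so handing a large vector to a function is cheap as long as the function only reads it. Here is a minimal sketch for checking this yourself with tracemem (base R, though it needs a build with memory profiling enabled, which the CRAN binaries have); the object and function names are just for illustration:

big <- rnorm(1e7)                       # roughly 80 MB of doubles
tracemem(big)                           # print a message whenever R duplicates this object

just_read <- function(x) sum(x)         # reading x does not copy it
modify <- function(x) { x[1] <- 0; x }  # writing to x forces a duplicate

just_read(big)   # no tracemem message: nothing was copied
modify(big)      # tracemem reports a copy at the point of modification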

Jeffrey
The language he's using is R: http://r-project.org/
Allen
+4  A: 

Use variables in the outer function instead of global variables. This gets you the best of both approaches: you're not mutating global state, and you're not copying a big wad of data. If you have to exit early, just return the partial results.

(See the "Scope" section in the R manual: http://cran.r-project.org/doc/manuals/R-intro.html#Scope)
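
A minimal sketch of that idea, assuming the inner function is something like fetch_one (the names here are made up, not from the question):

fetch_all <- function(datasets) {
  results <- list()                       # local to the outer function, not the global env
  for (name in datasets) {
    value <- tryCatch(fetch_one(name), error = function(e) NULL)
    if (is.null(value)) break             # e.g. a network timeout: stop early
    results[[name]] <- value
  }
  results                                 # partial results survive an early exit
}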

Allen
A: 

Third approach: inner function returns a reference to the large array, which the next statement inside the loop then dereferences and stores wherever it's needed (ideally with a single pointer store and not by having to memcopy the entire array).

This gets rid of both the side effect and the passing of large data structures.
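
R doesn't expose pointers directly, but environments do have reference semantics, so a rough R analogue of this approach could look like the sketch below (the file-reading inner function and file names are assumptions):

store <- new.env()                        # behaves like a reference: never copied on assignment

fetch_into <- function(name, env) {
  env[[name]] <- read.table(name, header = TRUE)  # assumed: each dataset is a file on disk
  invisible(NULL)
}

for (name in c("one.dat", "two.dat")) fetch_into(name, store)
ls(store)                                 # the datasets are in place, with no large copies made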

pjz
+3  A: 

It's not going to make much difference to memory use, so you might as well make the code clean.

Since R has copy-on-modify for variables, modifying the global object will have the same memory implications as passing something up in return values.

If you store the outputs in a database (or even in a file) you won't have the memory-use issues, and the data will be incrementally available as it is created, rather than just at the end. Whether the database version is faster depends primarily on how much memory you are using: does the reduction in garbage collection pay for the cost of writing to disk?
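
For example, here is a hedged sketch of the file-based version, writing each dataset out as soon as it is acquired so that nothing is lost if the run dies partway through (saveRDS/readRDS are base R; the cache directory and fetch_one are assumptions):

fetch_and_cache <- function(datasets, dir = "cache") {
  dir.create(dir, showWarnings = FALSE)
  for (name in datasets) {
    out <- file.path(dir, paste0(name, ".rds"))
    if (file.exists(out)) next            # already acquired on an earlier run
    saveRDS(fetch_one(name), out)         # fetch_one stands in for the inner function
  }
}

# After a crash or timeout, reload whatever finished:
# readRDS(file.path("cache", "one.rds"))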

There are both time and memory profilers in R, so you can see empirically what the impacts are.
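
For instance, a rough way to measure both at once (Rprof, summaryRprof and gc are base R; memory profiling in Rprof needs R compiled with that support, which the standard binaries have; outerfunc is just a placeholder for whichever version you are comparing):

gc(reset = TRUE)                          # reset the "max used" counters
Rprof("prof.out", memory.profiling = TRUE)
results <- outerfunc(datasets)            # the code being measured
Rprof(NULL)                               # stop profiling
summaryRprof("prof.out", memory = "both") # time plus memory per call
gc()                                      # "max used" columns show peak memory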

A: 

I'm not sure I understand the question, but I have a couple of solutions.

  1. Inside the function, create a list of the vectors and return that.

  2. Inside the function, create an environment and store all the vectors inside of that. Just make sure that you return the environment in case of errors.

in R:

help(environment)

# You might do something like this:



outer <- function(datasets) {
  # create the return environment
  ret.env <- new.env()
  for(set in datasets) {
    tmp <- inner(set)
    # check for errors however you like here.  You might have inner return a list, and
    # have the list contain an error component
    assign(set, tmp, envir=ret.env)
  }
  return(ret.env)
}

# The inner function might be defined like this

inner <- function(dataset) {
  # I don't know what you are doing here, but let's pretend you are reading a data file
  # that is named by dataset
  filedata <- read.table(dataset, header=TRUE)
  return(filedata)
}

leif
+1  A: 

Remember your Knuth: "Premature optimization is the root of all evil."

Try the side-effect-free version. See if it meets your performance goals. If it does, great: you don't have a problem in the first place; if it doesn't, then use the side effects, and make a note for the next programmer that your hand was forced.

Rob Hansen
A: 

Thank you all for your informative and helpful answers!

Thanks also for the (unintentionally?) humorous line "I'm not sure I understand the question, but I have a couple of solutions"! Put a smile on my face.

A: 

FYI, here's a full sample toy solution that avoids side effects:

outerfunc <- function(names) {
  templist <- list()
  for (aname in names) {
    templist[[aname]] <- innerfunc(aname)
  }
  templist
}

innerfunc <- function(aname) {
  retval <- NULL
  if ("one" %in% aname) retval <- c(1)
  if ("two" %in% aname) retval <- c(1,2)
  if ("three" %in% aname) retval <- c(1,2,3)
  retval
}

names <- c("one","two","three")

name_vals <- outerfunc(names)

# If you really want each dataset as its own top-level variable, do it explicitly here at the call site:
for (name in names) assign(name, name_vals[[name]])