I would like to save a whole bunch of relatively large data frames while minimizing the space that the files take up. When opening the files, I need to be able to control what names they are given in the workspace.

Basically I'm looking for the semantics of dput and dget but with binary files.

Example:

n<-10000

for(i in 1:100){
    dat<-data.frame(a=rep(c("Item 1","Item 2"),n/2),b=rnorm(n),
     c=rnorm(n),d=rnorm(n),e=rnorm(n))
    dput(dat,paste("data",i,sep=""))
}


##much later


##extract 3 random data sets and bind them
for(i in 1:10){
    nums<-sample(1:100,3)
    comb<-rbind(dget(paste("data",nums[1],sep="")),
      dget(paste("data",nums[2],sep="")),
      dget(paste("data",nums[3],sep="")))
    ##do stuff here
}
+7  A: 

Your best bet is to use rda files. You can use the save() and load() commands to write and read:

set.seed(101)
a = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))

save(a, file="test.rda")
load("test.rda")
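Since the question is about keeping files small, it may be worth comparing the compression settings of save(). The sketch below assumes a data frame shaped like the one in the question and writes it to temporary files (the file names are placeholders); xz often gives smaller files than the default gzip at the cost of slower writes, but the actual gain depends on the data.

```r
# Sketch: compare on-disk size for two compression settings of save().
set.seed(101)
n <- 10000
dat <- data.frame(a = rep(c("Item 1", "Item 2"), n / 2),
                  b = rnorm(n), c = rnorm(n), d = rnorm(n), e = rnorm(n))

f_gz <- tempfile(fileext = ".rda")   # placeholder file names
f_xz <- tempfile(fileext = ".rda")

save(dat, file = f_gz)                    # default: gzip compression
save(dat, file = f_xz, compress = "xz")   # alternative: xz compression

# Sizes in bytes; the exact numbers depend on the data
sizes <- file.info(c(f_gz, f_xz))$size
print(sizes)
```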

Edit: For completeness, just to cover what Harlan's suggestion might look like (i.e. wrapping the load command to return the data frame):

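loadx <- function(name, file) {
  env <- new.env()
  load(file, envir = env)   # load into a private environment, not the workspace
  get(name, envir = env)    # look the object up by its saved name
}

b <- loadx("a", "test.rda")  # the caller chooses the new name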
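Applied to the workflow in the question, a wrapper along these lines might look like the following. This is only a sketch: load_one is a hypothetical helper, it assumes exactly one object per file, and the data frames are kept small and written to a temporary directory for illustration.

```r
# Hypothetical helper: load an .rda file into a private environment and
# return the single object stored in it, whatever name it was saved under.
load_one <- function(file) {
  env <- new.env()
  nm <- load(file, envir = env)   # load() returns the names it restored
  stopifnot(length(nm) == 1)      # assumption: one object per file
  get(nm, envir = env)
}

dir <- tempdir()
n <- 100
for (i in 1:5) {
  dat <- data.frame(a = rep(c("Item 1", "Item 2"), n / 2), b = rnorm(n))
  save(dat, file = file.path(dir, paste("data", i, ".rda", sep = "")))
}

## much later: pick three files at random and bind them
nums <- sample(1:5, 3)
parts <- lapply(nums, function(i) {
  load_one(file.path(dir, paste("data", i, ".rda", sep = "")))
})
comb <- do.call(rbind, parts)
```

Because the helper returns the object instead of injecting it into the workspace, the name used in save() no longer matters to the caller.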


Alternatively, have a look at the hdf5, RNetCDF and ncdf packages. I've experimented with the hdf5 package in the past; this uses the NCSA HDF5 library. It's very simple:

hdf5save(fileout, ...)
hdf5load(file, load = TRUE, verbosity = 0, tidy = FALSE)

A last option is to use binary file connections, but that won't work well in your case because readBin and writeBin only support vectors.

Here's a trivial example. First open the connection in binary write mode ("wb") and write some data:

zz <- file("testbin", "wb")
writeBin(1:10, zz)
close(zz)

Then reopen the file in binary read mode ("rb") and read the data back:

zz <- file("testbin", "rb")
readBin(zz, integer(), 4)
close(zz)
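If you do want to go through a binary connection with a whole data frame, one possible workaround (not part of the approach above) is base R's serialize() and unserialize(), which accept arbitrary objects. A minimal sketch, assuming a small data frame and a temporary file:

```r
dat <- data.frame(x1 = runif(10), x2 = runif(10))

path <- tempfile(fileext = ".bin")   # placeholder file name

# Write the whole data frame to a binary connection
con <- file(path, "wb")
serialize(dat, con)
close(con)

# Read it back under whatever name you like
con <- file(path, "rb")
dat2 <- unserialize(con)
close(con)

same <- identical(dat, dat2)
```

Opening the connection with gzfile() instead of file() should give compressed output with the same calls.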
Shane
Nice answer Shane. I'd like to use 'save', but don't like the fact that I can't control the name of the data on loading
Ian Fellows
You could wrap the load() function in a new function that knows the name of the data in the file and renames it for a return value. The load function will insert the variables into the environment/namespace of the function.
Harlan
You can do what Harlan suggested, or you can just save one dataframe per file, and give both the file and dataframe the same name. Then you will have the same behavior as what you described above with dput and dget, right?
Shane
thanks Harlan, wrapping the call is a good idea.
Ian Fellows
+1  A: 

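You may have a look at saveRDS and readRDS. These serialize a single object, and readRDS returns it so you can assign it to any name. (Before R 2.13.0 they were the internal functions .saveRDS and .readRDS; the dotted names have since been removed.)

x = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))

saveRDS(x, file="myDataFile.rds")
x <- readRDS(file="myDataFile.rds")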
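For the workflow in the question, a sketch using the public saveRDS()/readRDS() names (available from R 2.13.0) could look like this. The sizes are reduced and the files go to a temporary directory purely for illustration.

```r
dir <- tempdir()
n <- 100
for (i in 1:10) {
  dat <- data.frame(a = rep(c("Item 1", "Item 2"), n / 2),
                    b = rnorm(n), c = rnorm(n))
  # one object per file; no object name is stored with it
  saveRDS(dat, file.path(dir, paste("data", i, ".rds", sep = "")))
}

## much later: extract 3 random data sets and bind them
nums <- sample(1:10, 3)
comb <- do.call(rbind, lapply(nums, function(i) {
  readRDS(file.path(dir, paste("data", i, ".rds", sep = "")))
}))
dim(comb)
```

This mirrors dput and dget: the reader, not the file, decides what the object is called.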
wind
Out of curiosity: why would someone use these over save/load? Is there some particular benefit?
Shane