Hi R users,

I am working on a project where many analysts create statistical models in R. They usually give me the model objects (.RData files), and I automate running them against various datasets.

My questions are:

  • Can I save these .RData files in a database? Any hints on doing this? (I currently store the .RData files on disk and use a database to store their locations.)

  • I get a lot of R scripts from other analysts who have pre-processed the data before creating the models. Does anyone have experience using PMML to make this process repeatable without manual intervention? PMML stores the pre-processing and modeling steps as markup tags, and would repeat the same steps on a new dataset.

Thank you for the suggestions and feedback.

-Harsh

+2  A: 

Yes, this is possible using e.g. MySQL linked to R through the RMySQL and DBI packages, or via the RODBC or RJDBC packages. I'm not 100% sure they all support blobs, but in the worst case you could use the ASCII representation and put it in a text field.

The trick is to use the function serialize():

> x <- rnorm(100)
> y <- 5*x+4+rnorm(100,0,0.3)
> tt <- lm(y~x)
> obj <- serialize(tt,NULL,ascii=TRUE)

Now you can store obj in a database and retrieve it later. It is just a raw vector of bytes: ASCII codes with ascii=TRUE, or a binary representation with ascii=FALSE. After retrieving it, use:

> unserialize(obj)
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      4.033        4.992  
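
If the database column is text rather than a blob, the raw vector has to become a single string first. A minimal sketch of that round trip in base R, assuming the same kind of model as above (set.seed() and the names txt and tt2 are only for illustration):

```r
# Fit the same kind of model as in the example above
set.seed(1)
x <- rnorm(100)
y <- 5 * x + 4 + rnorm(100, 0, 0.3)
tt <- lm(y ~ x)

# serialize() returns a raw vector; with ascii = TRUE its bytes are plain text
obj <- serialize(tt, NULL, ascii = TRUE)

# Collapse the bytes into one character string for a TEXT/VARCHAR column
txt <- rawToChar(obj)

# ... store txt in the database, read it back later ...

# Convert the string back to raw bytes and rebuild the model
tt2 <- unserialize(charToRaw(txt))
all.equal(coef(tt), coef(tt2))
```

The comparison uses all.equal() rather than identical() because the ASCII format writes numbers as decimal text, so the last bit of a double is not guaranteed to survive.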

Edit: regarding PMML, there's a pmml package on CRAN. Maybe that one gets you somewhere?

Joris Meys
Even if blobs are not supported, you can serialize/unserialize to and from ascii (as you even do in your example) and then store the ascii string.
Dirk Eddelbuettel
I thought I said so? Did I word it wrong?
Joris Meys
Note that the ASCII-serialized obj is about 16k bytes (length(obj)) and the binary version is about 11k bytes, but if you save("tt",file="tt.RData") you get a file of only about 5k.
Spacedman
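
The size gap is largely because save() compresses its output by default. A rough sketch of getting a similar effect on the serialized bytes with base R's memCompress(), assuming the same kind of model as in the answer (exact sizes will vary, so none are quoted here):

```r
set.seed(1)
x <- rnorm(100)
y <- 5 * x + 4 + rnorm(100, 0, 0.3)
tt <- lm(y ~ x)

ascii_obj  <- serialize(tt, NULL, ascii = TRUE)
binary_obj <- serialize(tt, NULL, ascii = FALSE)

# Compress the binary form in memory before storing it as a blob
gz_obj <- memCompress(binary_obj, type = "gzip")

# Compare the three sizes in bytes
c(ascii = length(ascii_obj), binary = length(binary_obj), gzip = length(gz_obj))

# Decompress and rebuild the model after reading the blob back
tt2 <- unserialize(memDecompress(gz_obj, type = "gzip"))
identical(coef(tt), coef(tt2))
```

The compressed bytes are binary, so this variant needs a blob column rather than a text field.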
Thanks a ton. I was looking for something similar.
harshsinghal
+1  A: 

R can serialize and deserialize any object; that is how my digest package creates so-called 'hash digests', by running a hash function over the serialized object.

So once you have the serialized object (which can be serialized to character), store it. Any relational database will support this, as will the NoSQL key/value stores -- and for either backend you could even use the 'hash digest' as a key, or some other meta-information.
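
A small sketch of the hash-as-key idea, assuming the digest package is installed. The environment named store is only a hypothetical stand-in for a real key/value backend:

```r
library(digest)  # assumption: the digest package is installed

set.seed(1)
x <- rnorm(100)
y <- 5 * x + 4 + rnorm(100, 0, 0.3)
tt <- lm(y ~ x)

# Hypothetical in-memory key/value store; a real backend would replace this
store <- new.env()

# digest() serializes the object internally and hashes the result
key <- digest(tt, algo = "sha256")

# Store the serialized model under its hash
assign(key, serialize(tt, NULL), envir = store)

# Later: look the model up by key and rebuild it
tt2 <- unserialize(get(key, envir = store))
all.equal(coef(tt), coef(tt2))
```

Using the hash as the key means the same object always maps to the same entry, so storing it twice does not create a duplicate.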

Other alternatives are for example RProtoBuf which can also serialize and de-serialize very efficiently (but you'd have to write the .proto files first).

Dirk Eddelbuettel
The NoSQL idea seems appealing. The new Tokyo Cabinet package in R could help here.
harshsinghal
A: 

Note that a .RData file can contain many R objects, so you need to decide how to deal with that. If you attach the .RData file you can get the objects in it with ls() with a pos argument:

> attach("three.RData")
> ls(pos=2)
[1] "x" "y" "z"

Then you can loop over them, get() them by name from that position, and serialize them into a list (p is my list index):

> obnames = ls(pos=2)   # names of the objects in the attached file
> s=list()
> p=1
> for(obn in obnames){
+ s[[p]] = serialize(get(obn,pos=2),NULL,ascii=TRUE)
+ p=p+1
+ }

Now you'll have to squirt the elements of s to your DB, probably in a table of Name (some kind of char) and Value (the serialized data, a BLOB or varchar I guess).
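
As a rough sketch of that last step, assuming the DBI and RSQLite packages are installed: the table name r_objects, its column names, and the example objects are all made up for illustration, and an in-memory SQLite database stands in for the real one.

```r
library(DBI)  # assumption: DBI and RSQLite are installed

# Stand-ins for the objects found in an .RData file
objs <- list(x = 1:3, y = letters, z = lm(dist ~ speed, data = cars))

# Serialize each object to a single ASCII string
vals <- vapply(objs,
               function(o) rawToChar(serialize(o, NULL, ascii = TRUE)),
               character(1))

# One row per object: Name and Value
tab <- data.frame(name = names(vals), value = unname(vals),
                  stringsAsFactors = FALSE)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "r_objects", tab)

# Read one object back by name and rebuild it
row <- dbGetQuery(con, "SELECT value FROM r_objects WHERE name = 'z'")
z2 <- unserialize(charToRaw(row$value))
dbDisconnect(con)

all.equal(coef(objs$z), coef(z2))
```

With another backend only the dbConnect() call should need to change, though column types for long strings differ between databases.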

Spacedman
A: 

As others have mentioned, yes, you can store the outputs from models as text in your database. I'm not convinced that it will be very useful to you, though.

If you want to be able to recreate those models at a later date, then you need to store the input dataset and code that created the models, rather than the output.

Of course, you could store the model output as well, in which case you need to think about its format in the database. If you want to be able to find particular model results and filter or order them, then it will be much easier if you add them to the database with some structure (and some metadata).

For example, you might want to retrieve all models where there was a significant gender response. In that case you need to add that information as a separate field in the database rather than having to search through the chunks of ascii. Adding other information like the model creator and date of creation will also help you later on.
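
To make that concrete, here is a sketch of one such record built in base R. The data are simulated, and the field names (model_id, creator, gender_significant, payload) are hypothetical rather than any standard schema:

```r
# Simulated data with a clear gender effect
set.seed(1)
dat <- data.frame(gender = factor(rep(c("F", "M"), each = 50)))
dat$score <- 10 + 2 * (dat$gender == "M") + rnorm(100)
fit <- lm(score ~ gender, data = dat)

# Pull the p-value for the gender term out of the model summary
p_gender <- summary(fit)$coefficients["genderM", "Pr(>|t|)"]

# One row per model: searchable metadata next to the serialized payload
record <- data.frame(
  model_id = "m001",
  creator = "analyst_a",
  created = as.character(Sys.Date()),
  formula = paste(deparse(formula(fit)), collapse = " "),
  gender_significant = p_gender < 0.05,
  payload = rawToChar(serialize(fit, NULL, ascii = TRUE)),
  stringsAsFactors = FALSE
)

# Filtering now works on plain columns, without touching the payload
subset(record, gender_significant)$model_id
```

The same columns map directly onto a database table, so the filter becomes a simple WHERE clause instead of a search through serialized text.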

Richie Cotton
You seem to have touched upon all aspects of my problem. I am trying to create a way to "mark up" the independent variables in a glm model object and record whether any of them were derived from source data columns (and through which transformations). Currently I save the model and the R script that created it, but I want a more generic structure for retracing the path from data to model object.
harshsinghal