I am using R to run simulations with time series data. I have been using arrays to store data, but I need a less memory-intensive solution for storing data at intermediate steps in order to document the process. I am not a programmer, so I am looking for something relatively easy to set up on multiple platforms if possible (Windows, Mac, Linux). I also need to be able to call the database directly from R, since learning another language is not feasible right now. Ideally, I would like to read from and write to the database frequently, in a manner similar to an array, though I don't know if that is realistic. I will gladly sacrifice speed for ease of use, but I am willing to work to learn open-source solutions. Any suggestions would be appreciated.

+2  A: 

I also need to be able to directly call the database from R

I suggest setting up MySQL with the RMySQL interface.

Once the DB connection is open, you can query the database and get the data into R. For example:

# load the DBI-compliant MySQL driver
library(RMySQL)

# open a connection -- user, password, dbname and host are placeholders;
# substitute your own server details
con <- dbConnect(MySQL(), user = "user", password = "password",
                 dbname = "mydb", host = "localhost")

# Run an SQL statement by first creating a result set object
rs <- dbSendQuery(con, statement = paste(
                      "SELECT w.laser_id, w.wavelength, p.cut_off",
                      "FROM WL w, PURGE p",
                      "WHERE w.laser_id = p.laser_id",
                      "ORDER BY w.laser_id"))
# we now fetch records from the result set into a data.frame
data <- fetch(rs, n = -1)   # extract all rows

RMySQL: R interface to the MySQL database

Database interface and MySQL driver for R. This version complies with the database interface definition as implemented in the package DBI 0.2-2.

MySQL Database:

Available for all the platforms you cited in the question, and more; downloads are available from the MySQL site.

Bakkal
For a single user, RSQLite is easier to set up and quite fast.
Eduardo Leoni
Thank you for the code examples, Bakkal. This will be used by a single user. Is there an advantage to using MySQL rather than SQLite for this application?
ProbablePattern
You can use the simplest DB for the current task, and then move to a more complete one later if you are missing a useful feature :)
Bakkal
It would be the same code for RSQLite -- see my answer regarding the DBI interface both packages use.
Dirk Eddelbuettel
Nthing MySQL. I use R and MySQL every day for my simulation work. It may seem like a lot of work at first, but the learning curve is entirely worth it in the long run, as the size and complexity of your data are bound to grow over time.
Maiasaura
+3  A: 

Quick comments:

  • R is good at this: as a language for programming with data, it has plenty of interfaces
  • There is an entire manual devoted to data import/export, and it has a section on relational databases, so start there.
  • R has the widely-used DBI package which provides a unified interface for many backends, among them SQLite, MySQL, PostgreSQL, Oracle, ... Use that, maybe with RSQLite to get something going quickly (see the sketch after this list). You can still switch backends afterwards.
  • There is also RODBC but I find ODBC tedious to work with.
  • R also has a specialised variant in the TSdbi package by Paul Gilbert which brings the DBI-like abstraction to time series databases. It also supports multiple backends.
  • The data.table package was written for this and is very fast on indexing.
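
To illustrate the DBI point, here is a minimal sketch using RSQLite; the table and column names are made up for the example, and moving to MySQL later would only require changing the driver and connection arguments:

library(DBI)
library(RSQLite)

# open (or create) a single-file database on disk -- no server needed
con <- dbConnect(SQLite(), dbname = "simulations.db")

# append an intermediate result (a data.frame) to a table
dbWriteTable(con, "results",
             data.frame(run = 1L, step = 1:10, value = rnorm(10)),
             append = TRUE)

# read it back later with an ordinary SQL query
res <- dbGetQuery(con, "SELECT * FROM results WHERE run = 1")

dbDisconnect(con)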
Dirk Eddelbuettel
Thank you for the resources. So far, SQLite looks like the most appropriate option for my skill level.
ProbablePattern
+1  A: 

Do you really need a database solution for your purpose? You say you want a "solution for storing data at intermediate steps" -- how about simply saving the data array to disk at the required time points?

Edit: to make it possible to retrieve the information, you can embed meta-information, e.g. trial index and/or timestamp, in the filename. Then later you can locate and load the file using the correct filename.
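
A minimal sketch of that idea, assuming saveRDS/readRDS for serialization; run_simulation and the array dimensions are placeholders:

# each run of the simulation produces a three-dimensional array
run_simulation <- function(i) array(rnorm(10 * 10 * 5), dim = c(10, 10, 5))

for (i in 1:50) {
  result <- run_simulation(i)
  # encode the trial index in the filename with fixed-width padding,
  # so the files sort correctly and can be located systematically later
  saveRDS(result, file = sprintf("sim_run_%05d.rds", i))
}

# later: reload a particular run by reconstructing its filename
run_17 <- readRDS("sim_run_00017.rds")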

Amnon
I'm relatively new to R but it seems like this would generate several hundred files and make it difficult to audit when I'm done. Each run of the simulation requires creating a few three dimensional arrays and I can only run about 5,000 simulations at once. My current goal is to do 50,000 runs. If there is a better way of storing three dimensional arrays to disk and reading their results later in a systematic way, I would certainly appreciate the guidance.
ProbablePattern
+1  A: 

You can also take a look at the ff package.
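
For the three-dimensional arrays mentioned above, a minimal sketch, assuming ff's constructor with a dim argument (the dimensions here are arbitrary):

library(ff)

# a three-dimensional array backed by a file on disk,
# so only the chunks actually accessed are held in RAM
a <- ff(vmode = "double", dim = c(100, 100, 50))

# read and write with ordinary array indexing
a[1, 1, 1] <- 3.14
mean(a[, , 1])   # pull one slice into memory and summarise it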

Jyotirmoy Bhattacharya