tags:
views: 79
answers: 3

In a problem, I have a set of vectors. Each vector holds sensor readings, but the vectors are of different lengths. I'd like to compute the same descriptive statistics on each of these vectors. My question is: how should I store them in R? Using c() concatenates the vectors. Using list() seems to cause functions like mean() to misbehave. Is a data frame the right object?

What is the best practice for applying the same function to vectors of different sizes? Supposing the data resides on a SQL server, how should it be imported?

+6  A: 

Vectors of different sizes should be combined into a list: a data.frame expects each column to be the same length.

Use lapply to fetch your data. Then use lapply again to get the descriptive statistics.

x <- lapply(ids, sqlfunction)
stats <- lapply(x, summary)

Where sqlfunction is some function you created to query your database (a minimal sketch is shown below). You can collapse the stats list into a data.frame by calling do.call(rbind, stats) or by using plyr:

library(plyr)
x <- llply(ids, sqlfunction)
stats <- ldply(x, summary)
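
For concreteness, here is a minimal sketch of what sqlfunction might look like using the DBI package; the connection con, the table readings, and its sensor_id/value columns are all made-up names for illustration.

library(DBI)

# Hypothetical helper: pull one sensor's readings as a numeric vector.
# Assumes an open DBI connection `con` and a table `readings` with
# columns `sensor_id` and `value` (names invented for this example).
sqlfunction <- function(id) {
  dbGetQuery(con,
             "SELECT value FROM readings WHERE sensor_id = ?",
             params = list(id))$value
}

With something like that in place, lapply(ids, sqlfunction) gives you the list of vectors, and do.call(rbind, stats) (or ldply) collapses the per-vector summaries into one data.frame.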
Shane
Shane, what is: `?"<~"`
Tal Galili
Nice looking `<-` ;-)
mbq
Data frames have columns of the same length, but that doesn't mean you shouldn't group vectors of differing lengths into one (in long format). They allow you to attach more meaning to the individual vectors than a list can.
John
@Tal: Oops! That's what happens when you get up with the baby in the middle of the night and answer a question on SO.
Shane
@John: in general, I disagree with you (as per my answer), especially as advice to a new R user. Trying to create a data.frame from varying-length vectors can be problematic if you don't have a reason to associate each row with specific observations. In other words, unless you have a good reason to create a data.frame here, it may make life more confusing (rather than attaching more meaning). And working with lists is more efficient for this kind of analysis (thanks to things like lapply).
Shane
I've never taught anyone R whose data made keeping vectors in list form generally more useful than long-format tabular data (a data frame). We obviously come from different data worlds, and neither of us knows the poster's world. A list is only simpler if the vectors are differentiated from each other by a single variable; in that case a list is easier. If a data point is defined by two or more variables, then long-format tabular data is much more useful.
John
+1  A: 

"What is the best practice for applying the same function to vectors if different sizes? Supposing the data resides in a SQL server, how should it be imported?"

As suggested by Shane, lapply is definitely your choice here. You can, of course, use it with custom functions as well, in case you feel summary() does not provide enough information.

For the SQL part: there are packages for most relational DBMSs: RPostgreSQL, RMySQL, ROracle, and RODBC as a general one. If you mean MS SQL Server, I am not sure whether there is a specific package, but RODBC should do the job. I don't know if you are married to MS SQL Server, but if running your own local database for R is an option, RMySQL is really easy to set up.

In general, the database packages give you wrappers like dbListTables() or dbReadTable(), which simply turn a table into an R data.frame.
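
For example, with RODBC the pattern looks roughly like this; the DSN, credentials, and table name are placeholders, not real values:

library(RODBC)

# Hypothetical DSN, credentials, and table name; substitute your own.
channel  <- odbcConnect("my_dsn", uid = "user", pwd = "password")
readings <- sqlFetch(channel, "readings")   # whole table as an R data.frame
one_id   <- sqlQuery(channel, "SELECT value FROM readings WHERE sensor_id = 1")
odbcClose(channel)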

If you really want to import the data as files, you could use .csv exports of your database and read.table() or read.csv(), depending on what fits your needs. But I suggest connecting to the database directly; it's not that difficult even if you haven't done it before, and it's more fun.
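
If you do go the .csv-export route, you can still end up with the same list-of-vectors structure. A small sketch, assuming one export per sensor in an exports/ directory, each with a value column (both names invented here):

# Read every export into a list of numeric vectors, then summarise each.
files <- list.files("exports", pattern = "\\.csv$", full.names = TRUE)
x     <- lapply(files, function(f) read.csv(f)$value)
stats <- lapply(x, summary)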

EDIT: I don't use MS SQL Server myself, but others have done this before; maybe the mailing list post helps.

ran2
+1  A: 

I would tend to import this into a data frame and not a list. Each of your individual vectors is likely differentiated by one or more meaningful variables. Let's say you wanted to keep track of the time the data was collected and the location it was collected from. In a data frame you would have one column containing all of the vectors concatenated together, but each would be differentiated by values in the time and location columns. To get each individual vector's mean, tapply() might be the tool of choice.

tapply(df$y, list(df$time, df$location), mean)

Or, perhaps aggregate() would be even better, depending on the number of variables and your future needs.
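
For instance, assuming the same hypothetical df with columns y, time, and location, the aggregate() equivalent would be:

aggregate(y ~ time + location, data = df, FUN = mean)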

John