plyr

renaming the output column with the plyr package in R

Hadley turned me on to the plyr package and I find myself using it all the time to do 'group by' sort of stuff. But I find myself having to always rename the resulting columns since they default to V1, V2, etc. Here's an example: mydata<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24))) colnames(...
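One way to avoid the V1, V2 defaults is to name the columns when computing them. The sketch below assumes plyr is installed; the data frame `d` and its columns `group` and `x` are made up for illustration and are not the truncated `mydata` from the excerpt.

```r
library(plyr)

# Hypothetical data: three groups of 24 values each
set.seed(1)
d <- data.frame(group = rep(c("A", "B", "C"), each = 24),
                x = rnorm(72, mean = 2, sd = 2))

# Passing summarise to ddply lets each output column be named directly,
# so there is no V1 to rename afterwards
res <- ddply(d, .(group), summarise, mean_x = mean(x), sd_x = sd(x))
```

Here `res` has one row per group and the columns `group`, `mean_x` and `sd_x`.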

for each group summarise means for all variables in dataframe (ddply? split?)

A week ago I would have done this manually: subset the dataframe by group into new dataframes, compute the means of each variable for each dataframe, then rbind. Very clunky ... Now I have learned about split and plyr, and I guess there must be an easier way using these tools. Please don't prove me wrong. test_data <- data.frame(cbind( var0...

from absolute numbers to proportion in two level data (R! SAC? plyr?)

I have data nested in two levels: L1 L2 x1 x2 x3 x4 A This 20 14 12 15 A That 11 NA 8 16 A Bat Na 22 13 9 B This 10 9 11 6 B That 3 3 1 NA B Bat 4 10 2 8 Now I want something simple - and I feel I was able to do this just last month. But something has gone missing in my head: I want percentages (ignoring NA), ...

multiple transform on df with plyr

I have a df and I want to do multiple transforms on it with plyr: idplot / idtree / species / condition / dbh_cm / h_m / hblc_m CalcG <- function (df) transform(df, g_m2 = pi * (dbh_cm^2)/40000) CalcHD <- function (df) transform(df, hd = h_m / dbh_cm) ... Can this be done in one function? Many thanks. ...
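A minimal sketch of combining the two visible transforms into one function, using base R's `transform`. The `trees` data and the name `CalcBoth` are hypothetical; only `dbh_cm`, `h_m`, `g_m2` and `hd` come from the excerpt, and any further transforms hidden behind the "..." are not covered.

```r
# Hypothetical tree measurements using column names from the question
trees <- data.frame(idtree = 1:3,
                    dbh_cm = c(20, 35, 50),
                    h_m    = c(15, 22, 28))

# One function that adds both derived columns in a single transform() call
CalcBoth <- function(df) {
  transform(df,
            g_m2 = pi * (dbh_cm^2) / 40000,  # basal area in square metres
            hd   = h_m / dbh_cm)             # height to diameter ratio
}

out <- CalcBoth(trees)
```

This works in a single call because neither new column depends on the other. If a later column needed an earlier new one, plyr's `mutate`, which evaluates its arguments in order, would be one option.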

doing a plyr operation on every row of a data frame in R

I like the plyr syntax. Any time I have to use one of the *apply() commands I end up kicking the dog and going on a 3 day bender. So for the sake of my dog and my liver, what's concise syntax for doing a ddply operation on every row of a data frame? Here's an example that works well for a simple case: x <- rnorm(10) y <- rnorm(10) df <...

How do I use plyr to number rows?

Basically I want an autoincremented id column based on my cohorts - in this case .(kmer, cvCut) > myDataFrame size kmer cvCut cumsum 1 8132 23 10 8132 10000 778 23 10 13789274 30000 324 23 10 23658740 50000 182 23 10 28534840 100000 65 23 10 33943283 200000 25 23 10 37954383 ...

How to better create stacked bar graphs with multiple variables from ggplot2?

I often have to make stacked barplots to compare variables, and because I do all my stats in R, I prefer to do all my graphics in R with ggplot2. I would like to learn how to do two things: First, I would like to be able to add proper percentage tick marks for each variable rather than tick marks by count. Counts would be confusing, whi...

Specifying column names from a list in the data.frame command.

I have a list called cols with column names in it: cols <- c('Column1','Column2','Column3') I'd like to reproduce this command, but with a call to the list: data.frame(Column1=rnorm(10)) Here's what happens when I try it: > data.frame(cols[1]=rnorm(10)) Error: unexpected '=' in "data.frame(I(cols[1])=" The same thing happens if I ...
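One possible approach, sketched with base R only: build the data frame first, then assign the names from `cols` with `setNames`. The names `df1` and `df3` are illustrative.

```r
cols <- c("Column1", "Column2", "Column3")

# Single column named from the first element of cols
df1 <- setNames(data.frame(rnorm(10)), cols[1])

# All three columns at once, one name per column
df3 <- setNames(data.frame(matrix(rnorm(30), ncol = 3)), cols)
```

The names are applied after construction because the left-hand side of `=` in a function call has to be a literal argument name, not an expression such as `cols[1]`.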

How can I structure and recode messy categorical data in R?

I'm struggling with how to best structure categorical data that's messy, and comes from a dataset I'll need to clean. The Coding Scheme I'm analyzing data from a university science course exam. We're looking at patterns in student responses, and we developed a coding scheme to represent the kinds of things students are doing in their ...

Repeat elements of vector in R

Hi, I'm trying to repeat the elements of vector a, b number of times. That is, a="abc" should be "aabbcc" if b = 2. Why does neither of the following code examples work? sapply(a, function (x) rep(x,b)) and, from the plyr package, aaply(a, function (x) rep(x,b)) I know I'm missing something very obvious ... ...
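A short base R sketch, assuming `a` really is the single string "abc" and the repeat count is 2. In that case `a` has length one, so it has to be split into characters before `rep` can repeat each of them.

```r
a <- "abc"
b <- 2

# Split the string into its individual characters
chars <- strsplit(a, "")[[1]]

# Repeat each character b times, then glue the pieces back together
result <- paste(rep(chars, each = b), collapse = "")
result
# "aabbcc"

# If a were already a character vector, rep() alone would be enough
rep(c("a", "b", "c"), each = b)
```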

How do I sub sample data by group using ddply?

I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset. I wrote a function to subset a data frame as follows: samp <- function(dataf) { dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),] } Now I want to ap...

break dataframe into subsets by factor values, send to function that returns glm class, how to recombine?

Thanks to Hadley's plyr package ddply function we can take a dataframe, break it down into subdataframes by factors, send each to a function, and then combine the function results for each subdataframe into a new dataframe. But what if the function returns an object of a class like glm or in my case, a c("glm", "lm"). Then, these can't ...

ddply run in a function looks in the environment outside the function ?

Hello. I'm trying to write a function to do some often repeated analysis, and one part of this is to count the number of groups and number of members within each group, so ddply to the rescue! However, my code has a problem.... Here is some example data > dput(BGBottles) structure(list(Machine = structure(c(1L, 1L, 1L, 2L, 2L, 2L, ...

Assigning group ID with ddply

Hi all, Pretty basic performance question from an R newbie. I'd like to assign a group ID to each row in a data frame by unique combinations of fields. Here's my current approach: > # An example data frame > df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"), st.num=c("101", "102", "105", "102", "150"), st.name=c("Main", "E...

converting uneven hierarchical list to a data frame

I don't think this has been asked yet, but is there a way to combine information of a list with multiple levels and uneven structure into a data frame of "long" format? Specifically: library(XML) library(plyr) xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5....

Reshape data based on column in dataframe

I need to take a data.frame in the format of: id1 id2 mean start end 1 A D 4 12 15 2 B E 5 14 15 3 C F 6 8 10 and generate duplicate rows based on the difference in start - end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in...
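The row duplication step can be sketched in base R by indexing with repeated row numbers. This assumes the number of copies is `end - start`, which matches the 3, 1 and 2 rows mentioned in the excerpt; what should happen to the start and end fields afterwards is cut off in the excerpt and is not attempted here.

```r
df <- data.frame(id1 = c("A", "B", "C"),
                 id2 = c("D", "E", "F"),
                 mean = c(4, 5, 6),
                 start = c(12, 14, 8),
                 end = c(15, 15, 10))

# Number of copies wanted for each row: 3, 1, 2
n <- df$end - df$start

# Repeat each row index n times and use the result to index the rows
expanded <- df[rep(seq_len(nrow(df)), times = n), ]
```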

l_ply: how to pass the list's name attribute into the function?

Say I have an R list like this: > summary(data.list) Length Class Mode aug9104AP 18 data.frame list Aug17-10_acon_7pt_dil_series_01 18 data.frame list Aug17-10_Picro_7pt_dil_series_01 18 data.frame list Aug17-10_PTZ_7pt_dil_series_01 18 data.frame list Aug17...

Does the 'summarise' function in plyr still exist?

When using plyr, I often want to 1) perform an operation on only a subset of the variables and 2) name the output of the operation. For example: d = data.frame(sex=c("m","f","m","m","f","f"), age=c(30,20,15,50,10,40), weight=c(130,120,115,150,90,180)) ddply(d, .(sex), function(df) data.frame(age_mu = mean(df$age))) But this seems kind...

R: speeding up "group by" operations

I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr's ddply() function which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel but if this one...

How to use string variables to create variables list for ddply?

Using R's builtin ToothGrowth example dataset, this works: ddply(ToothGrowth, .(supp,dose), function(df) mean(df$len)) But I would like to have the subsetting factors be variables, something like factor1 = 'supp' factor2 = 'dose' ddply(ToothGrowth, .(factor1,factor2), function(df) mean(df$len)) That doesn't work. How should this ...
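One option, sketched under the assumption that plyr is installed: `ddply` also accepts a character vector for its grouping variables, so the strings can be passed with `c()` instead of `.()`.

```r
library(plyr)

factor1 <- "supp"
factor2 <- "dose"

# .() quotes its arguments literally, so it looks for columns named
# factor1 and factor2; a character vector passes the stored names instead
res <- ddply(ToothGrowth, c(factor1, factor2), function(df) mean(df$len))
```

`res` has one row per combination of `supp` and `dose`.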