Hadley turned me on to the plyr package and I find myself using it all the time to do 'group by' sort of stuff. But I find myself having to always rename the resulting columns since they default to V1, V2, etc.
Here's an example:
mydata<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(...
A week ago I would have done this manually: subset the data frame by group into new data frames, compute the mean of each variable for each one, then rbind. Very clunky ...
Now I have learned about split and plyr, and I guess there must be an easier way using these tools. Please don't prove me wrong.
test_data <- data.frame(cbind(
var0...
I have data nested in two levels:
L1 L2 x1 x2 x3 x4
A This 20 14 12 15
A That 11 NA 8 16
A Bat NA 22 13 9
B This 10 9 11 6
B That 3 3 1 NA
B Bat 4 10 2 8
Now I want something simple, and I feel I was able to do this just last month, but something has gone missing in my head: I want percentages (ignoring NA), ...
I have a df and I want to apply multiple transforms to it with plyr:
idplot / idtree / species / condition / dbh_cm / h_m / hblc_m
CalcG <- function (df) transform(df, g_m2 = pi * (dbh_cm^2)/40000)
CalcHD <- function (df) transform(df, hd = h_m / dbh_cm)
...
Can this be done in one function?
Many thanks.
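One possible way to combine the two steps, sketched in base R: transform() accepts several new columns in a single call, so both formulas can live in one function. The sample data and the name CalcBoth are hypothetical; only the column names and formulas come from the question.

```r
# Hypothetical sample data using the column names from the question
trees <- data.frame(
  idplot = c(1, 1, 2),
  idtree = c(1, 2, 1),
  dbh_cm = c(20, 35, 50),
  h_m    = c(15, 22, 30)
)

# One function (hypothetical name) that adds both derived columns at once
CalcBoth <- function(df) {
  transform(df,
            g_m2 = pi * (dbh_cm^2) / 40000,  # basal area in square metres
            hd   = h_m / dbh_cm)             # height-to-diameter ratio
}

result <- CalcBoth(trees)
print(result)
```

Note that transform() evaluates every expression against the original columns, so a new column cannot refer to another column created in the same call. If that is needed, plyr's mutate() evaluates its arguments sequentially.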
...
I like the plyr syntax. Any time I have to use one of the *apply() commands I end up kicking the dog and going on a 3 day bender. So for the sake of my dog and my liver, what's concise syntax for doing a ddply operation on every row of a data frame?
Here's an example that works well for a simple case:
x <- rnorm(10)
y <- rnorm(10)
df <...
Basically I want an autoincremented id column based on my cohorts - in this case .(kmer, cvCut)
> myDataFrame
size kmer cvCut cumsum
1 8132 23 10 8132
10000 778 23 10 13789274
30000 324 23 10 23658740
50000 182 23 10 28534840
100000 65 23 10 33943283
200000 25 23 10 37954383
...
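A minimal sketch of one reading of this request, assuming the id should be a counter that restarts at 1 within each (kmer, cvCut) cohort. It uses base R's ave(); the data frame below is hypothetical and only borrows the column names from the question.

```r
# Hypothetical data with two cohorts: (kmer = 23, cvCut = 10) and (kmer = 25, cvCut = 10)
cohorts <- data.frame(
  size  = c(1, 10000, 30000, 1, 10000),
  kmer  = c(23, 23, 23, 25, 25),
  cvCut = c(10, 10, 10, 10, 10)
)

# ave() applies FUN within each combination of the grouping variables,
# so seq_along() yields a counter that restarts for every cohort
cohorts$id <- ave(seq_len(nrow(cohorts)), cohorts$kmer, cohorts$cvCut,
                  FUN = seq_along)
print(cohorts)
```

If the goal is instead a single id per cohort rather than per row, a different construction is needed; this sketch does not cover that case.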
I often have to make stacked barplots to compare variables, and because I do all my stats in R, I prefer to do all my graphics in R with ggplot2. I would like to learn how to do two things:
First, I would like to be able to add proper percentage tick marks for each variable rather than tick marks by count. Counts would be confusing, whi...
I have a list called cols with column names in it:
cols <- c('Column1','Column2','Column3')
I'd like to reproduce this command, but with a call to the list:
data.frame(Column1=rnorm(10))
Here's what happens when I try it:
> data.frame(cols[1]=rnorm(10))
Error: unexpected '=' in "data.frame(I(cols[1])="
The same thing happens if I ...
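One way around the parse error, as a sketch: argument names in a call must be literal, so the name cannot be computed on the left of `=`. Building the data frame first and naming it afterwards with setNames() avoids that. The variable names df1 and df3 are illustrative only.

```r
cols <- c("Column1", "Column2", "Column3")

# Build an unnamed one-column data frame, then assign the name from cols
df1 <- setNames(data.frame(rnorm(10)), cols[1])

# The same idea for all three names at once
df3 <- setNames(data.frame(matrix(rnorm(30), ncol = 3)), cols)

print(names(df1))
print(names(df3))
```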
I'm struggling with how to best structure categorical data that's messy, and comes from a dataset I'll need to clean.
The Coding Scheme
I'm analyzing data from a university science course exam. We're looking at patterns in
student responses, and we developed a coding scheme to represent the kinds of things
students are doing in their ...
Hi,
I'm trying to repeat the elements of vector a, b number of times. That is, a="abc" should become "aabbcc" if b = 2.
Why does neither of the following code examples work?
sapply(a, function (x) rep(x,b))
and from the plyr package,
aaply(a, function (x) rep(x,b))
I know I'm missing something very obvious ...
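A likely explanation, assuming a is the single string "abc": it is a character vector of length one, so rep(x, b) repeats the whole string rather than its letters. In addition, aaply() expects an array and a margins argument before the function. A base R sketch that splits the string into characters first:

```r
a <- "abc"
b <- 2

# Split the single string into its individual characters
chars <- strsplit(a, "")[[1]]

# Repeat each character b times, then glue the pieces back together
result <- paste(rep(chars, each = b), collapse = "")
print(result)  # "aabbcc"
```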
...
I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset.
I wrote a function to subset a data frame as follows:
samp <- function(dataf)
{
dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),]
}
Now I want to ap...
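A sketch of the split-apply-combine step in base R, assuming a grouping column named species. The data frame obs and the extra argument n are hypothetical. With plyr loaded, ddply(obs, .(species), samp) should give an equivalent result.

```r
set.seed(1)

# Hypothetical data: 100 rows for each of three species
obs <- data.frame(
  species = rep(c("sp1", "sp2", "sp3"), each = 100),
  value   = rnorm(300)
)

# Same idea as the function in the question, with the sample size as an argument
samp <- function(dataf, n = 40) {
  dataf[sample(seq_len(nrow(dataf)), size = n, replace = FALSE), ]
}

# Split by species, sample each piece, and bind the pieces back together
sampled <- do.call(rbind, lapply(split(obs, obs$species), samp))
print(table(sampled$species))
```

As written, samp() fails for any species with fewer than 40 rows; capping the size with min(n, nrow(dataf)) is one way to handle that.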
Thanks to Hadley's plyr package ddply function we can take a dataframe, break it down into subdataframes by factors, send each to a function, and then combine the function results for each subdataframe into a new dataframe.
But what if the function returns an object of a class like glm or, in my case, c("glm", "lm")? Then, these can't ...
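One option within plyr, as a sketch: dlply() takes a data frame and returns a list, so each piece can hold an arbitrary object such as a fitted model. The example uses the built-in ToothGrowth data and a Gaussian glm purely for illustration; neither comes from the question.

```r
library(plyr)

# Fit one model per level of supp; dlply() returns a list of fitted models
models <- dlply(ToothGrowth, .(supp), function(df) glm(len ~ dose, data = df))

# Each element keeps its model class
print(class(models[[1]]))

# A list of models can then be reduced to a data frame, e.g. of coefficients
coefs <- ldply(models, coef)
print(coefs)
```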
Hello.
I'm trying to write a function to do some often-repeated analysis, and one part of this is to count the number of groups and the number of members within each group, so ddply to the rescue! However, my code has a problem.
Here is some example data
> dput(BGBottles)
structure(list(Machine = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
...
Hi all,
Pretty basic performance question from an R newbie. I'd like to assign a group ID to each row in a data frame by unique combinations of fields. Here's my current approach:
> # An example data frame
> df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"), st.num=c("101", "102", "105", "102", "150"), st.name=c("Main", "E...
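A vectorised sketch of one way to assign such ids without looping over groups: build a key per row from the fields of interest and number the distinct keys in order of first appearance. The data frame addresses and its values are hypothetical, since the example in the question is cut off.

```r
# Hypothetical data: two rows share the same (num, street) combination
addresses <- data.frame(
  num    = c("101", "102", "105", "102", "150"),
  street = c("Main", "Elm", "Main", "Elm", "Main")
)

# Build one key per row; the separator is assumed not to occur in the data
key <- paste(addresses$num, addresses$street, sep = "\r")

# match() against the unique keys gives an integer id per combination
addresses$group <- match(key, unique(key))
print(addresses)
```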
I don't think this has been asked yet, but is there a way to combine information of a list with multiple levels and uneven structure into a data frame of "long" format?
Specifically:
library(XML)
library(plyr)
xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5....
I need to take a data.frame in the format of:
id1 id2 mean start end
1 A D 4 12 15
2 B E 5 14 15
3 C F 6 8 10
and generate duplicate rows based on the difference between start and end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in...
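A sketch of the row-duplication step only, in base R, assuming the number of copies is end - start (which gives 3, 1, and 2 for the rows shown). How the start and end fields should change in each copy is cut off above, so that part is not attempted. The names spans, n_copies, and expanded are illustrative.

```r
spans <- data.frame(
  id1   = c("A", "B", "C"),
  id2   = c("D", "E", "F"),
  mean  = c(4, 5, 6),
  start = c(12, 14, 8),
  end   = c(15, 15, 10)
)

# Number of copies per row: 3, 1, 2
n_copies <- spans$end - spans$start

# Repeat each row index the required number of times and subset by it
expanded <- spans[rep(seq_len(nrow(spans)), times = n_copies), ]
rownames(expanded) <- NULL
print(expanded)
```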
Say I have an R list like this:
> summary(data.list)
Length Class Mode
aug9104AP 18 data.frame list
Aug17-10_acon_7pt_dil_series_01 18 data.frame list
Aug17-10_Picro_7pt_dil_series_01 18 data.frame list
Aug17-10_PTZ_7pt_dil_series_01 18 data.frame list
Aug17...
When using plyr, I often want to 1) perform an operation on only a subset of the variables and 2) name the output of the operation. For example:
d = data.frame(sex=c("m","f","m","m","f","f"), age=c(30,20,15,50,10,40), weight=c(130,120,115,150,90,180))
ddply(d, .(sex), function(df) data.frame(age_mu = mean(df$age)))
But this seems kind...
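A shorter form that may be what is wanted here: ddply() can be given summarise as the function, followed by named expressions that are evaluated within each group. This is a sketch using the data from the question; the name res is illustrative.

```r
library(plyr)

d <- data.frame(sex    = c("m", "f", "m", "m", "f", "f"),
                age    = c(30, 20, 15, 50, 10, 40),
                weight = c(130, 120, 115, 150, 90, 180))

# summarise builds one named output column per expression, computed per group
res <- ddply(d, .(sex), summarise, age_mu = mean(age))
print(res)
```

Only the columns mentioned in the expressions are used, so this covers both operating on a subset of the variables and naming the output.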
I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr's ddply() function which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel but if this one...
Using R's builtin ToothGrowth example dataset, this works:
ddply(ToothGrowth, .(supp,dose), function(df) mean(df$len))
But I would like to have the subsetting factors be variables, something like
factor1 = 'supp'
factor2 = 'dose'
ddply(ToothGrowth, .(factor1,factor2), function(df) mean(df$len))
That doesn't work. How should this ...
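One approach that should work, as a sketch: .() quotes its arguments, so .(factor1, factor2) looks for columns literally named factor1 and factor2. ddply() also accepts a character vector of column names for its grouping argument, which lets the names come from variables.

```r
library(plyr)

factor1 <- "supp"
factor2 <- "dose"

# Pass the grouping columns as a character vector instead of .()
res <- ddply(ToothGrowth, c(factor1, factor2), function(df) mean(df$len))
print(res)
```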