Whenever I want to do something "map"py in R, I usually try to use a function in the apply family. (Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?)

However, I've never quite understood the differences between them [how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be], so I often just go through them all until I get what I want.

Can someone explain how to use which one when?

[My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]) [giving you a matrix if f has a multi-element output]
  2. lapply(vec, f): same as sapply, but output is a list?
  3. apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
  5. by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.]
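For a concrete reference point, the calls in the list above can be tried on small inputs. This is only a sketch: the data and variable names (vec, m, scores, group, df) are made up for illustration, and the comments note what base R returns for these particular inputs.

```r
# Toy inputs, invented for illustration
vec    <- c(1, 4, 9)
m      <- matrix(1:6, nrow = 2)           # 2 rows, 3 columns (filled column-wise)
scores <- c(10, 20, 30, 40)
group  <- c("a", "b", "a", "b")
df     <- data.frame(g = group, x = scores)

# sapply: one scalar per element gives a vector
s1 <- sapply(vec, sqrt)
# sapply: a length-2 result per element gives a 2 x 3 matrix (one column per element)
s2 <- sapply(vec, function(x) c(x, x^2))

# lapply: same calls as sapply, but the result is always a list
l1 <- lapply(vec, sqrt)

# apply: margin 1 = rows, margin 2 = columns
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

# tapply: split a vector by a grouping, apply f to each piece;
# the group labels become the names of the result
t1 <- tapply(scores, group, mean)

# by: f receives the sub-data-frame for each group; result has class "by"
b1 <- by(df, df$g, function(d) mean(d$x))

# aggregate: like by, but the result is an ordinary data frame
a1 <- aggregate(x ~ g, data = df, FUN = mean)
```

Printing each object (s1, s2, l1, and so on) shows the output shapes side by side, which is usually the quickest way to see which function fits.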
+7  A: 

On the side question, here is how the various plyr functions correspond to the base *apply functions (from the Intro to Plyr document on the plyr webpage, http://had.co.nz/plyr/):

Base function   Input   Output   plyr function
-----------------------------------------------
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply
One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name: the first letter gives the input type and the second the output type (a = array, d = data frame, l = list). It also provides consistency in output: for example, the output of dlply() can be passed directly to ldply() to produce a data frame.
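As a rough illustration of the naming scheme, here is a minimal sketch. It assumes the plyr package is installed; the data frame df and the column name mean_x are made up for the example.

```r
library(plyr)  # assumes the plyr package is installed

# Toy data frame, invented for illustration
df <- data.frame(g = c("a", "b", "a", "b"), x = c(10, 20, 30, 40))

# l -> l: list (or vector) in, list out, like lapply
squares_list <- llply(1:3, function(i) i^2)

# l -> a: list (or vector) in, array/vector out, like sapply
squares_vec <- laply(1:3, function(i) i^2)

# d -> d: data frame in, data frame out, like aggregate
group_means <- ddply(df, .(g), summarise, mean_x = mean(x))

# d -> l, then l -> d: the list produced by dlply() goes straight into ldply()
pieces     <- dlply(df, .(g), function(d) data.frame(mean_x = mean(d$x)))
recombined <- ldply(pieces)
```

Here group_means and recombined should hold the same per-group means, reached by two different routes.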

Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my everyday use. However, as the Intro to Plyr document also notes:

Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.
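To make that concrete, here is a small base R sketch of sweep and merge. The matrix, data frame and column names are invented for illustration.

```r
# sweep: apply a summary statistic back across a margin of a matrix.
# Here each column's mean is subtracted from that column (the default FUN is "-").
m         <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
col_means <- apply(m, 2, mean)
centered  <- sweep(m, 2, col_means)

# merge: attach a per-group summary back onto the original rows
df         <- data.frame(g = c("a", "b", "a", "b"), x = c(10, 20, 30, 40))
summary_df <- aggregate(x ~ g, data = df, FUN = mean)
names(summary_df)[2] <- "mean_x"          # rename so it does not clash with x
merged     <- merge(df, summary_df, by = "g")
```

After the merge, every original row carries its group's mean alongside its own value, which is handy for computing things like deviations from the group mean.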

JoFrhwld
When I started learning R from scratch I found plyr MUCH easier to learn than the `*apply()` family of functions. For me, `ddply()` was very intuitive as I was familiar with SQL aggregation functions. `ddply()` became my hammer for solving many problems, some of which could have been better solved with other commands.
JD Long
I guess I figured that the concept behind `plyr` functions is similar to `*apply` functions, so if you can do one, you can do the other, but `plyr` functions are easier to remember. But I totally agree on the `ddply()` hammer!
JoFrhwld
Got it, I'll have to finally pick up plyr soon! Its prefix naming alone is gold...
grautur
Couldn't have said it better myself. Thanks!
hadley
A: 

The plyr documentation is very clear and easy to follow, and I recommend reading it from start to finish, because you'll learn lots of extra things that will help you later. In my view, this is one of those cases where taking the time to read the details in full pays off far more than skimming a help page for the names of arguments.

dan