I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, I want to keep only the first row from each group.

For example, I'd like to convert this

> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
  x  y  z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17

Into this:

    x  y  z
1   1 11 19
2   2 12 18
3   4 13 17

I'm using aggregate to do this currently, but the performance is unacceptable with more data:

> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})

I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.

Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
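For illustration, here is a minimal sketch of the rle idea, assuming the rows are already arranged so that each value of x forms one contiguous run (rle only sees consecutive repeats). The names runs, ends and starts are made up for this example. The cumulative sum of the run lengths gives the index of the last row of each run, and subtracting the length and adding one gives the first.

```r
# Example data from the question
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# Run-length encode the key column; this assumes equal keys are adjacent
runs <- rle(d$x)

# Index of the last row of each run, then the first row of each run
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

first_rows <- d[starts, ]  # first row of every run
last_rows  <- d[ends, ]    # last row of every run
```

If the data frame is not grouped by x beforehand, the same key can show up in several runs, so sorting with order(d$x) first would be needed for this to pick one row per key.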

+2  A: 

Maybe duplicated() can help:

R> d[ !duplicated(d$x), ]
  x  y  z
1 1 10 20
3 2 12 18
4 4 13 17
R> 

Edit: Shucks, never mind. This picks the first row in each block of repetitions, but you wanted the last. So here is another attempt using plyr:

R> library(plyr)
R> ddply(d, "x", function(z) tail(z,1))
  x  y  z
1 1 11 19
2 2 12 18
3 4 13 17
R> 

Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
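As a rough illustration of the split, apply and combine steps described above, the same result can be sketched in base R. This is not how plyr is implemented internally, only an equivalent under the assumption that d is the example data frame from the question. The names pieces, last_rows and result are made up for this example.

```r
# Example data from the question
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# Split: one sub data frame per distinct value of x
pieces <- split(d, d$x)

# Apply: keep the last row of each piece
last_rows <- lapply(pieces, function(z) tail(z, 1))

# Combine: stack the one-row pieces back into a single data frame
result <- do.call(rbind, last_rows)
rownames(result) <- NULL  # drop the group labels that rbind uses as row names
```

Using rbind instead of unsplit sidesteps the row-number complaint, because the pieces are simply stacked rather than mapped back onto the original row positions.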

Dirk Eddelbuettel
I'd prefer all the columns, thanks
jkebinger
So then you need to simply add a 'processing step' to create a factor variable over which plyr can loop. It can all be done with indexing commands, give it a try. And by the way, you are inconsistent between your text (saying first row selected) and example (showing second row).
Dirk Eddelbuettel
By the way, cross-posting between r-help and here is also somewhat poor style. You got good answers at r-help, so why don't you study them?
Dirk Eddelbuettel
Sorry about the cross posting, and thanks for the solutions
jkebinger
My pleasure. As a matter of common best practice here on StackOverflow, you should accept one post as the solution (if you feel it provides one) and vote each helpful post up by clicking on the up arrow. That is how the scoring works here.
Dirk Eddelbuettel
+2  A: 

Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:

d[ !duplicated(d$x,fromLast=TRUE), ]
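
To make the first-versus-last choice concrete, here is a small sketch on the example data from the question. It also shows that sorting first and then taking the first occurrence picks the row with the largest y per group, which is what the aggregate call in the question does. The variable names are made up for this example.

```r
# Example data from the question
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# Keep the last row of each x group, all columns retained
last_per_x <- d[!duplicated(d$x, fromLast = TRUE), ]

# Alternative: sort by descending y, then keep the first row of each x group,
# which selects the row with the largest y for every x
d_ordered   <- d[order(-d$y), ]
max_y_per_x <- d_ordered[!duplicated(d_ordered$x), ]
```

On this data both approaches select the same three rows; max_y_per_x just lists them in descending order of y.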
Ian Fellows
Hi Ian -- unfortunately James never really made a clear case as to whether he wanted first or last and contradicts himself in the post ... but your hint about fromLast is a good one!
Dirk Eddelbuettel
Thanks, that works like a charm. Whether it's first or last I needed was really up to the ordering, and with fromLast I can attack it either way.
jkebinger
I suggested the same thing and you shot it down on the grounds of 'prefer all columns'. How come that no longer matters?
Dirk Eddelbuettel
Sorry, Dirk, I misunderstood how duplicated works at the time
jkebinger