ansaurus

Question

Percentile for Each Observation w/r/t Grouping Variable

Answer 1

+3 A:

I may be misunderstanding, but I think it can be done this way:

> years = c(2006, 2006, 2006, 2006, 2001, 2001, 2001, 2001, 2001)
> scores = c(13, 65, 23, 34, 78, 56, 89, 98, 100)
> tapply(scores, years, quantile)
$`2001`
  0%  25%  50%  75% 100% 
  56   78   89   98  100 

$`2006`
   0%   25%   50%   75%  100% 
13.00 20.50 28.50 41.75 65.00

Is this right?

I mean the actual percentile of each observation. – Ryan Rosario

Edit:

I think this may do it then:

> tapply(scores, years, function(x) { f = ecdf(x); sapply(x, f) })
$`2001`
[1] 0.4 0.2 0.6 0.8 1.0

$`2006`
[1] 0.25 1.00 0.50 0.75

With your data:

> tapply(scores, years, function(x) { f = ecdf(x); sapply(x, f) })
$`2000`
[1] 0.3333333 0.6666667 1.0000000

$`2008`
[1] 0.5 1.0

Edit 2:

This is probably faster:

tapply(scores, years, function(x) { f = ecdf(x); f(x) })

f() is vectorized :-)

Last, modification, I promise :-). If you want names:

> tapply(scores, years, function(x) { f = ecdf(x); r = f(x); names(r) <- x; r })
$`2000`
     1000      1700      2000 
0.3333333 0.6666667 1.0000000 

$`2008`
1500 2000 
 0.5  1.0

Vince 2010-01-29 06:19:54

Answer 2

A:

I found a method, but it requires a loop.

group.pctiles <- function(group.var, comparable) {
    unique.vals <- unique(group.var)
    pctiles <- vector(length = length(group.var))
    for (i in 1:length(unique.vals)) {
        slice <- which(group.var == unique.vals[i])
        F <- ecdf(comparable[slice])
        group.pctiles <- F(comparable[slice])
        pctiles[slice] <- group.pctiles
    }
    return(pctiles)
}

group.var is the variable that groups the data. In my example in my question, it is Year. comparable contains the values we want to find the percentiles for. In my question, comparable would be Score.

For the following data, I get the result below:

Year,School,Fees
2000,10,1000
2008,1,1050
2008,4,2000
2000,3,1700
2000,1,2000

> group.pctiles(dat, dat$Year, dat$Fees)
[1] 0.3333333 0.5000000 1.0000000 0.6666667 1.0000000

Then, I can cbind these percentiles back into the original data.frame for analysis, reporting, etc.

Anyone have a solution that doesn't require a loop?

Ryan Rosario 2010-01-29 06:43:10

Answer 3

+9 A:

Following up on Vince's solution, you can also do this with plyr or by:

ddply(df, .(years), function(x) transform(x, percentile=ecdf(x$scores)(x$scores)))

Jonathan Chang 2010-01-29 07:15:07

+1 because I should really spend some time with plyr

Ian Fellows 2010-01-29 07:22:29

This is beautiful. I knew there had to be a one-line way to do this, even if it requires loading a package.

Ryan Rosario 2010-01-29 08:29:50

Or a little more simply: `ddply(df, .(years), transform, percentile = ecdf(scores)(scores))`

hadley 2010-01-29 13:42:42

Answer 4

A:

Hi Ryan,

How about something like:

Year <- c(2000,2008,2008,2000,2000)
Fees <- c(1000,1050,2000,1700,2000)
dat <- data.frame(Fees,Year,result=NA)
res <- tapply(Fees,Year,function(x) rank(x,ties.method="max")/length(x))
for(i in 1:length(res))
   dat[Year==as.numeric(names(res)[i]),"result"] <-res[[i]]

which yields:

  Fees Year    result
1 1000 2000 0.3333333
2 1050 2008 0.5000000
3 2000 2008 1.0000000
4 1700 2000 0.6666667
5 2000 2000 1.0000000

Ian Fellows 2010-01-29 07:15:38

Answer 5

A:

You can also do something like this:

# first I'll create two dummy variables (Year, Score)
year <- rep(2001:2005, 2)
score <- round(rnorm(10, 35, 3))

# then coerce variables to data frame
d <- data.frame(year, score)

# then you can use split() function to apply
# function to each stratum of grouping variable
sapply(split(score, year), function(x) quantile(x, probs=seq(.1, .9, .1)))

Output will go something like this:

     2001 2002 2003 2004 2005
10%  34.3 32.1 34.3 29.6 36.1
20%  34.6 32.2 34.6 30.2 36.2
30%  34.9 32.3 34.9 30.8 36.3
40%  35.2 32.4 35.2 31.4 36.4
50%  35.5 32.5 35.5 32.0 36.5
60%  35.8 32.6 35.8 32.6 36.6
70%  36.1 32.7 36.1 33.2 36.7
80%  36.4 32.8 36.4 33.8 36.8
90%  36.7 32.9 36.7 34.4 36.9

You can utilize t() function to transpose rows and columns if you prefer. Writing a function will be a good way to tackle this kind of problems. I strongly recommend plyr package written by Hadley Wickam.

Hope this helps! All the best!

aL3xa 2010-01-29 20:24:01

Answer 6

+4 A:

Using ave

ave(d1$scores, d1$year, FUN=function(x) ecdf(x)(x))

Eduardo Leoni 2010-01-29 21:11:02

I think this is the best solution - it uses a function from stats and is oneline. Well done!

Vince 2010-02-02 01:35:08

ansaurus

tags:

views:

answers:

Percentile for Each Observation w/r/t Grouping Variable

related questions