I am trying to iteratively sort data within columns to extract N maximum values.

My data is set up with the first and second columns containing occupation titles and codes, and the remaining columns containing comparative values for those occupations in various cities (in this case location quotients that were calculated beforehand for each city):

    occ_code   city1   ...   city300
    occ1       5       ...   7
    occ2       20      ...   22
    .          .        .    .
    .          .        .    .
    occ800     20      ...   25

For each city I want to sort the values in descending order and select a subset of the largest values, matched with their respective occupation titles and codes. I thought it would be relatively trivial but...

Edit for clarification: I want to end up with a sorted subset of the data for analysis.

     occ_code   city1
     occ200     10
     occ90      8
     occ20      2
     occ95      1.5

At the same time I want to be able to repeat the sort column-wise, so that I can run the same analysis functions over the entire dataset. To that end I've tried lots of order commands that call columns directly, e.g. data[,2].

I've been messing with plyr for the past 3 days and I feel like the setup of my dataset is just not conducive to how plyr was meant to be used.

A: 

One way would be to use order with ddply from the package plyr, with the data in long format (one row per occupation and city):

    library(plyr)
    d <- data.frame(occu=rep(letters[1:5],2), city=rep(c('A','B'),each=5), val=1:10)
    ddply(d, .(city), function(x) x[order(x$val, decreasing=TRUE)[1:3],])

order can sort on multiple columns if you want that.
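
As a minimal sketch of sorting on multiple columns with order (the toy data frame below is hypothetical and only there to show tie-breaking; it is not the asker's data), later arguments to order break ties left by the earlier ones, and negating a numeric column reverses its direction:

```r
# Hypothetical toy data: two occupations tie on val within city A
d <- data.frame(occu = c("a", "b", "c", "d"),
                city = c("A", "A", "A", "B"),
                val  = c(5, 9, 5, 7),
                stringsAsFactors = FALSE)

# Sort by city (ascending), then val (descending via negation),
# then occu (ascending) to break ties on val
d_sorted <- d[order(d$city, -d$val, d$occu), ]
print(d_sorted)
```

The negation trick only works for numeric columns; for mixed directions on character columns some other approach would be needed.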

Jyotirmoy Bhattacharya
A: 

This will output the occupation code with the maximum value for each city. Similar results can be obtained using sort or order.

    # Generate some fake data: 100 occupation codes, 20 cities
    codes <- paste("Code", 1:100, sep="")
    values <- matrix(0, ncol=20, nrow=100)
    for (i in 1:20)
        values[,i] <- sample(0:100, 100, replace=TRUE)

    df <- data.frame(codes, values)

    names(df) <- c("Code", paste("City", 1:20, sep=""))

    # Now for each city we get the row index of the maximum
    maxval <- apply(df[2:21], 2, which.max)
    # Output the code of the max for each city
    print(cbind(paste("City", 1:20), codes[maxval]))
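
To go from the single maximum to the N largest values per city with order, something along these lines might work on the wide layout from the question. This is a sketch in base R: the helper name top_n_by_city, its arguments, and the small example data frame are all hypothetical.

```r
# Hypothetical wide data: one id column, then one column per city
wide <- data.frame(Code  = paste0("occ", 1:6),
                   city1 = c(5, 20, 1.5, 10, 8, 2),
                   city2 = c(7, 22, 3, 1, 30, 4),
                   stringsAsFactors = FALSE)

# Hypothetical helper: for every non-id column, return the n rows with
# the largest values, keeping only the id column and that city column
top_n_by_city <- function(data, n = 4, id_col = 1) {
  city_cols <- setdiff(seq_along(data), id_col)
  out <- lapply(city_cols, function(j) {
    # row indices sorted by this city's values, largest first
    idx <- order(data[[j]], decreasing = TRUE)[seq_len(min(n, nrow(data)))]
    data[idx, c(id_col, j)]
  })
  names(out) <- names(data)[city_cols]
  out
}

top <- top_n_by_city(wide, n = 3)
print(top$city1)
```

The result is a named list with one small data frame per city, so the same analysis function can then be run over every city with lapply.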
nico
A: 

I'm not exactly sure what your desired output is from your example snippet. Here's how you could get a data frame like that for every city using plyr and reshape:

    # using the same df from nico's answer
    library(reshape)
    df.m <- melt(df, id = 1)
    # the id column of df was renamed to "Code", so the formula refers to Code
    a.cities <- cast(df.m, Code ~ . | variable)

    library(plyr)
    a.cities.max <- aaply(a.cities, 1, function(x) arrange(x, desc(`(all)`))[1:4,])

Now, a.cities.max is an array of data frames, with the 4 largest values for each city in each data frame. To get one of these data frames, you can index it with

a.cities.max$X13

I don't know exactly what you'll be doing with this data, but you might want it back in data frame format.

df.cities.max <- adply(a.cities.max, 1)
JoFrhwld
I think that's it!
AzadA