From the (simplified) data below, which represents users choosing between three options, I want to create a set of boxplots of the percentage of times each user chose a value, with one boxplot per value. So I want three boxplots: the percentage of times users chose 0, 1 and 2.

I'm sure I'm missing something obvious, as I often do with R. I can get the percentages using by(dat, dat$user, function(user) {table(user$value)/length(user$value)*100}), but don't know how to turn that into boxplots.

Hope that makes sense.

user|value
1|2
1|1
1|0
1|2
1|0
2|2
2|2
2|2
2|0
2|2
3|2
3|0
3|1
3|0
3|1
4|2
4|0
4|1
4|0
4|1
5|2
5|0
5|1
5|0
5|1
6|2
6|0
6|0
6|1
6|2
7|0
7|0
7|1
7|0
7|1
8|2
8|2
8|1
8|1
8|2
9|1
9|0
9|0
9|0
9|0
10|1
10|2
10|0
10|2
10|1
A: 

I would approach creating the summary using the plyr package. First, you should convert value to a factor, so that when some user never picked some value, that value will have 0%.

dat$value <- factor(dat$value)
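
To see why the conversion matters, here is a small sketch (the vector picks is made up for illustration and is not part of your data): table() on a plain numeric vector silently drops values that never occur, while a factor keeps a zero count for every level.

```r
# Hypothetical picks for one user who never chose 1
picks <- c(2, 2, 0, 2, 2)

# Plain numeric vector: the value 1 is missing from the table entirely
names(table(picks))

# Factor with explicit levels: 1 is kept, with a proportion of 0
picks.f <- factor(picks, levels = 0:2)
props <- prop.table(table(picks.f))
props
```

In your data, factor(dat$value) picks up all three levels because each value occurs somewhere in the full data set, so a user who never chose 1 still gets a 0 for it.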

Now, you write your summary function that takes a data frame (technically this step can be smushed into the next step, but this way it's more legible).

p.by.user <- function(df){
  data.frame(prop.table(table(df$value)))
}

Then, apply this function to every subset of dat defined by user.

library(plyr)  # provides ddply() and .()
dat.summary <- ddply(dat, .(user), p.by.user)

A base graphics boxplot of this data would be done like this.

with(dat.summary, boxplot(Freq ~ Var1, ylim = c(0,1)))

If you don't mind my two cents, I'm not sure boxplots are the right way to go with this kind of data. This isn't very dense data (if your sample is realistic), and boxplots don't capture the dependency between decisions. That is, if some user chose 1 very frequently, then they must have chosen the other values less frequently, because each user's proportions sum to 1.

You could try a filled bar chart for each user, and it wouldn't require any pre-summarization if you use ggplot2. The code would look like this

library(ggplot2)
ggplot(dat, aes(factor(user), fill = value)) + geom_bar()
# or, to show proportions (range 0 to 1) instead of counts:
# ggplot(dat, aes(factor(user), fill = value)) + geom_bar(position = "fill")
JoFrhwld
I welcome your two cents! I'm interested in outliers, to see if any users chose a value substantially more than other users.
michaeltwofish
A: 

Is something like this what you're looking for?

user <- rep(1:10, each = 5)
value <- sample(0:2, 50, replace = TRUE)
dat <- data.frame(user, value)

# percentage of each value, computed per user
percent <- by(dat, dat$user,
    function(d) {
        table(d$value) / length(d$value) * 100
    }
)

# make a vector with all percentages
percent <- unlist(percent)
# extract the value from names of the form "user.value", e.g. "3.2"
value <- gsub("\\d+\\.(\\d)", "\\1", names(percent))

boxplot(percent ~ value)
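
One caveat: if some user never picks a value, the code above leaves that value out of their table rather than counting it as 0%, which shifts the boxes upward. A minimal alternative sketch, assuming you want those zeros included (the seed, the name pct and the axis labels are my own choices for illustration):

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
user <- rep(1:10, each = 5)
value <- sample(0:2, 50, replace = TRUE)
dat <- data.frame(user, value)

# Fix the levels so every user gets an entry for 0, 1 and 2
dat$value <- factor(dat$value, levels = 0:2)

# One row per user, one column per value, in percent
pct <- prop.table(table(dat$user, dat$value), margin = 1) * 100

# boxplot() on a data frame draws one box per column
boxplot(as.data.frame.matrix(pct), xlab = "value", ylab = "% of choices")
```

Building the user-by-value table first also avoids having to parse the value back out of the names with a regular expression.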
Joris Meys