I have been struggling with how to make a Pareto Chart in R using the ggplot2 package. In many cases when making a bar chart or histogram we want items sorted by the X axis. In a Pareto Chart we want the items ordered descending by the value in the Y axis. Is there a way to get ggplot to plot items ordered by the value in the Y axis? I tried sorting the data frame first but it seems ggplot reorders them.

Example:

val <- read.csv("http://www.cerebralmastication.com/wp-content/uploads/2009/11/val.txt")
val<-with(val, val[order(-Value), ])
p <- ggplot(val)
p + geom_bar(aes(State, Value, fill=variable), stat = "identity", position="dodge") + scale_fill_brewer(palette = "Set1")

The data frame val is sorted, but the output looks like this:

[image: the resulting dodged bar chart]

Hadley correctly pointed out that this produces a much better graphic for showing actuals vs. predicted:

ggplot(val, aes(State, Value)) +
  geom_bar(data = subset(val, variable == "estimate"), stat = "identity", fill = "grey70") +
  geom_crossbar(data = subset(val, variable == "actual"), aes(ymin = Value, ymax = Value))

which returns:

[image: estimates as grey bars with actuals overlaid as crossbars]

But it's still not a Pareto Chart. Any tips?

+4  A: 

Start by subsetting and sorting your data:

valact <- subset(val, variable=='actual')
valsort <- valact[ order(-valact[,"Value"]),]

From there it's just a standard barplot() with a very manual cumulative function on top:

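op <- par(mar = c(3, 3, 3, 3))  # save the old margins, leave room for the right-hand axis
bp <- barplot(valsort[, "Value"], ylab = "", xlab = "", ylim = c(0, 1),
              names.arg = as.character(valsort[, "State"]), main = "How's that?")
# overlay the cumulative share at the bar midpoints returned by barplot()
lines(bp, cumsum(valsort[, "Value"]) / sum(valsort[, "Value"]), col = "red")
axis(4)
box()
par(op)  # restore the original graphical parameters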

which should look like this

[image: bar plot with a red cumulative line on top]

and it doesn't even need the overplotting trick as lines() happily annotates the initial plot.

Dirk Eddelbuettel
I accepted Chang's answer because I really wanted to do this with ggplot. But I still owe you a beer for giving such a kick ass answer.
JD Long
Well I did miss the ggplot2 requirements...
Dirk Eddelbuettel
You gave a far more thorough answer to the Pareto part than I was expecting! My question was grossly stylized and I had coded myself into a corner where using ggplot2 was the easiest way out. What you did with base graphics was really cool. Thanks again.
JD Long
You're very welcome.
Dirk Eddelbuettel
+1  A: 

Also, see the package qcc which has a function pareto.chart(). Looks like it uses base graphics too, so start your bounty for a ggplot2-solution :-)

Dirk Eddelbuettel
+5  A: 

The bars in ggplot2 are ordered by the ordering of the levels in the factor.

val$State <- with(val, factor(State, levels = unique(as.character(State[order(-Value)]))))
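
A minimal sketch of the same idea on a toy data frame, in case val.txt is not at hand. The `toy` data and its values are made up for illustration. Because each state appears once per variable, the sorted State column contains duplicates, so the sketch passes it through unique() before using it as the level order:

```r
# Toy data, made up for illustration: two rows per state, like val
toy <- data.frame(
  State    = c("AL", "AL", "CA", "CA", "TX", "TX"),
  variable = rep(c("actual", "estimate"), times = 3),
  Value    = c(0.20, 0.25, 0.90, 0.85, 0.50, 0.55),
  stringsAsFactors = FALSE
)

# Order rows by descending Value, then keep the first appearance of each state
lev <- unique(toy$State[order(-toy$Value)])
toy$State <- factor(toy$State, levels = lev)

levels(toy$State)
# [1] "CA" "TX" "AL"
```

Any ggplot2 layer that maps State to x should then draw the bars in that level order.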
Jonathan Chang
That is awesome! That's exactly what I could not figure out how to do. Thank you!
JD Long
Or a little more succinctly, change your first aes call to: ` aes(reorder(State, Value), Value)`
hadley
I think you need aes(reorder(State, Value, mean), Value) - since there are two values for each state?
Andreas
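
The reorder() suggestions in the comments above can be checked on a small example. This is only a sketch with made-up values (the `toy` data frame is hypothetical). It relies on two properties of stats::reorder(): its default FUN is already mean, and it sorts levels in ascending order of the summary, so a Pareto-style descending order needs the values negated:

```r
# Toy data, made up for illustration: two values per state
toy <- data.frame(
  State = c("AL", "AL", "CA", "CA", "TX", "TX"),
  Value = c(0.20, 0.25, 0.90, 0.85, 0.50, 0.55)
)

# reorder() summarises Value per State with FUN and sorts the levels ascending
asc <- with(toy, reorder(State, Value, FUN = mean))
# Negating the values flips the order to descending
desc <- with(toy, reorder(State, -Value, FUN = mean))

levels(asc)
# [1] "AL" "TX" "CA"
levels(desc)
# [1] "CA" "TX" "AL"
```

Inside a plot, the descending variant would be written as aes(reorder(State, -Value), Value).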
+2  A: 

With a simple example:

 > data
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10 
0.29056 0.23833 0.11003 0.05549 0.04678 0.03788 0.02770 0.02323 0.02211 0.01925 

barplot(data) does things correctly.

The ggplot equivalent "should be": qplot(x=names(data), y=data, geom='bar')

But that reorders the bars alphabetically, because that's how levels(factor(names(data))) would be ordered.

Solution: qplot(x=factor(names(data), levels=names(data)), y=data, geom='bar')

Phew!

Yannick Wurm
A: 

To simplify things, let's just consider only the estimates.

estimates <- subset(val, variable == "estimate")

First we reorder the factor levels, so that States are plotted in decreasing order of Value.

estimates$State <- with(estimates, reorder(State, -Value))

Similarly, we reorder the dataset and calculate a cumulative value.

estimates <- estimates[order(estimates$Value, decreasing = TRUE),]
estimates$cumulative <- cumsum(estimates$Value)
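
Pareto charts often show the running total as a share of the grand total rather than as a raw sum. A minimal sketch of that variation, using a made-up data frame (`est` and the `cum_pct` column are hypothetical names, not part of the original data):

```r
# Made-up values for illustration, one row per state
est <- data.frame(
  State = c("AL", "CA", "TX", "NY"),
  Value = c(10, 40, 30, 20)
)

# Sort descending, then accumulate
est <- est[order(est$Value, decreasing = TRUE), ]
est$cumulative <- cumsum(est$Value)

# Cumulative share of the total, in percent
est$cum_pct <- 100 * est$cumulative / sum(est$Value)

est$cum_pct
# [1]  40  70  90 100
```

The last entry is always 100, which makes a convenient sanity check.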

Now we are ready to draw the plot. The trick to get a line and bar on the same axes is to convert the State variable (a factor) to be numeric.

p <- ggplot(estimates, aes(State, Value)) + 
  geom_bar(stat = "identity") +  # bar heights come from Value, not from counts
  geom_line(aes(as.numeric(State), cumulative))
p

As mentioned in the question, drawing Pareto plots of two variable groups right next to each other isn't very easy. You'd probably be better off using faceting if you want multiple Pareto plots.

Richie Cotton