tags:

views:

42

answers:

2

I am trying to produce some example graphics using ggplot2, and one of the examples I picked was the birthday problem, here using code 'borrowed' from a Revolution computing presentation at Oscon.

birthday<-function(n){
    ntests<-1000
    pop<-1:365
    anydup<-function(i){
        any(duplicated(sample(pop,n,replace=TRUE)))
        }
    sum(sapply(seq(ntests), anydup))/ntests
    }

x<-data.frame(x=rep(1:100, each=5)) 
x<-ddply(x, .(x), function(df) {return(data.frame(x=df$x, prob=birthday(df$x)))})
birthdayplot<-ggplot(x, aes(x, prob))+
        geom_point()+geom_smooth()+
        theme_bw()+
        opts(title = "Probability that at least two people share a birthday in a random group")+
        labs(x="Size of Group", y="Probability")

Here my graph is what I would describe as exponential, but the geom_smooth doesn't fit the data particularly well. I've tried the loess method but this didn't change things much. Can anyone suggest how to add a better smooth ?

Thanks

Paul.

A: 

The problem is that the probabilities follow a logistic curve. You could fit a proper smoothing line if you change the birthday function to return the raw successes and failures instead of the probabilities.

birthday<-function(n){
  ntests<-1000
  pop<-1:365
  anydup<-function(i){
    any(duplicated(sample(pop,n,replace=TRUE)))
  }
  data.frame(Dups = sapply(seq(ntests), anydup) * 1, n = n)
}
x<-ddply(x, .(x),function(df) birthday(df$x))

Now, you'll have to add the points as a summary, and specify a logistic regression as the smoothing type.

ggplot(x, aes(n, Dups)) +
  stat_summary(fun.y = mean, geom = "point") +
  stat_smooth(method = "glm", family = binomial)
JoFrhwld
The curve is not really logistic, even if though it is S-shaped. You can see it using `scale_y_logit()` on the original plot
Aniko
hm, right. I'm not sure what the appropriate regression is then, but having the raw numbers will still let you fit that line with `stat_smooth()`.
JoFrhwld
A: 

The smoothing routine does not react to the sudden change for low values of x fast enough (and it has no way of knowing that the values of prob are restricted to a 0-1 range). Since you have so low variability, a quick solution is to reduce the span of values over which smoothing at each point is done. Check out the red line in this plot:

birthdayplot + geom_smooth(span=0.1, colour="red")
Aniko