Using position_jitter creates random jitter to prevent overplotting of data points.

I have used baseball statistics below to illustrate my problem. When I plot the same data in two layers, the same position_jitter call jitters each geom slightly differently. This makes sense, because the random jitter is presumably generated independently in each call, but it leads to the problem you can see in the graph below.

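library(ggplot2)
library(plyr)  # provides the baseball data set; mean_cl_normal also needs Hmisc installed
p <- ggplot(baseball, aes(x = round(year, -1), y = sb, color = factor(lg)))
p <- p + stat_summary(fun.data = "mean_cl_normal", position = position_jitter(width = 3, height = 0)) + coord_cartesian(ylim = c(0, 40))
p + stat_summary(fun.y = mean, geom = "line", position = position_jitter(width = 3, height = 0))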

Although the error-bar points and the line refer to the same data, they are disjointed: the lines and points do not connect.

Is there a workaround for this? I thought position_dodge might be the answer, but it doesn't seem to work with these kinds of plots. Alternatively, is there some way to get the mean_cl_normal call to also add the lines?

A: 

I think so, by setting the seed to be the same in the two instances:

p=ggplot(baseball,aes(x=round(year,-1),y=sb,color=factor(lg)))
myseed = 2010
set.seed(myseed)
p=p+stat_summary(fun.data="mean_cl_normal",
  position=position_jitter(width=3,height=0))+coord_cartesian(ylim=c(0,40))
set.seed(myseed)
p+stat_summary(fun.y=mean,geom="line",
           position=position_jitter(width=3,height=0))

This ensures that the random number generator is reset to the same starting state as in the first call. However, I don't know how you could extract the random increments added to the values.

nullglob
Good idea, but it didn't work! I thought it would, because it looks like position_jitter uses the base package's jitter, which I expected to use the same random number generator seeded by set.seed. I suppose a general workaround would be to create my own jittered version of x, but hopefully there's a better way.
Alex Holcombe
That won't work because the jittering is done at plot time, not at creation time.
hadley
+1  A: 

This is a weakness in the current ggplot2 syntax - there's no way to work around it except to add the jitter yourself.
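As a rough sketch of what "adding the jitter yourself" might look like: draw one random horizontal offset per decade/league cell, store it in the data, and let both layers read the same column. The column names decade and x_jit, the seed value, and the uniform draw on [-3, 3] (chosen to mimic width=3) are assumptions for illustration, not part of the original code. Note also that recent ggplot2 releases spell the summary argument fun rather than fun.y.

```r
library(ggplot2)
library(plyr)  # assumed source of the baseball data set used in the question

set.seed(2010)  # makes the offsets reproducible
baseball$decade <- round(baseball$year, -1)

# One random horizontal offset per decade/league cell, drawn a single time.
# (Hypothetical helper columns: decade, x_jit.)
cell <- interaction(baseball$decade, baseball$lg, drop = TRUE)
offsets <- runif(nlevels(cell), min = -3, max = 3)
baseball$x_jit <- baseball$decade + offsets[as.integer(cell)]

# Both layers read the same precomputed x, so points and lines coincide.
# mean_cl_normal requires the Hmisc package to be installed.
ggplot(baseball, aes(x = x_jit, y = sb, color = factor(lg))) +
  stat_summary(fun.data = "mean_cl_normal") +
  stat_summary(fun = mean, geom = "line") +
  coord_cartesian(ylim = c(0, 40))
```

Because every row in a given decade/league cell shares one x_jit value, stat_summary still groups the rows together; jittering each observation independently would instead give every row its own x and leave nothing to summarise.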

Or you could do something like this:

ggplot(baseball, aes(round(year,-1) + as.numeric(factor(lg)), sb, color = factor(lg))) +
  stat_summary(fun.data="mean_cl_normal") +
  stat_summary(fun.y=mean,geom="line") +
  coord_cartesian(ylim=c(0,40))
hadley