views:

190

answers:

3

I am running an A/B test and I am seeing Simpson's paradox in my results (daily vs. monthly vs. the total duration of the test).

  1. Does this mean that my A/B test is not correct/representative? (Did some external factor affect the testing?)
  2. If it is a sign of a problem, what directions should I follow?

Thanks for your great help.

Further reading: http://en.wikipedia.org/wiki/Simpson%27s_paradox

+1  A: 

Simpson's paradox only occurs when your group sizes are different. The final result is a weighted average of the results from each group, and it is in this weighting that the paradox can arise.

It's not necessarily caused by external factors. It can happen simply because one group carries much more weight (since it has more elements in it).
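To see the weighting effect concretely, here is a small sketch with hypothetical counts (classic textbook-style numbers, not the asker's data): A wins within every segment, yet B wins in aggregate, purely because of how the trials are distributed across segments.

```python
# Hypothetical (successes, trials) per variant and segment, chosen so
# that the weighting flips the aggregate result.
data = {
    "segment 1": {"A": (81, 87),   "B": (234, 270)},
    "segment 2": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Within each segment, A converts better than B...
for seg, variants in data.items():
    assert rate(*variants["A"]) > rate(*variants["B"])

# ...but aggregated, B converts better: each total is a weighted average
# of the segment rates, and most of A's trials fell in the harder segment.
a_succ = sum(data[seg]["A"][0] for seg in data)
a_tri  = sum(data[seg]["A"][1] for seg in data)
b_succ = sum(data[seg]["B"][0] for seg in data)
b_tri  = sum(data[seg]["B"][1] for seg in data)

print(f"A overall: {rate(a_succ, a_tri):.1%}")  # lower than B
print(f"B overall: {rate(b_succ, b_tri):.1%}")  # higher than A
assert rate(a_succ, a_tri) < rate(b_succ, b_tri)
```

Note that equalizing the weights (giving each variant the same number of trials per segment) removes the reversal.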

If you provide some more info, we could probably help better.

Samuel Carrijo
+7  A: 

It's a little difficult to say without seeing the exact data & the dimensions you are testing, but generally speaking you want to make decisions based on the uncombined data. This article from Microsoft gives a pretty clear example of Simpson's paradox in software testing.

Can you provide a clean example of your combined and uncombined data and a brief summary of the test?

Chris Clark
+1 for good link
BlueRaja - Danny Pflughoeft
The key word is: uncombined data. :) Thanks!!!
Toto
+1  A: 

If A is clearly, significantly better in individual A/B tests, while B scores better in aggregate, then the main implication is that you can't aggregate those data sets that way. A is better.

If the test had produced the same results every day, you wouldn't see this reversal, even with varying sample sizes per day. So I think it additionally implies that something has changed. It could be anything, though. Maybe what you tested each day changed (perhaps in some very subtle way, like server speed). Or maybe the people you're testing it on changed (perhaps demographically, perhaps just in terms of their mood). That doesn't mean your testing is bad or invalid. It just means you're measuring something that's moving, and that makes things tricky.

And I might be miscalculating or misunderstanding the situation, but I think it is also necessarily true that you haven't been testing A and B the same number of times. That is, if on Monday you tested A 50 times and B 50 times, and on Tuesday you tested A 600 times and B 600 times, and so on, and A outscored B each day, then I don't see how you could get an aggregate result where B beats A. If this is true of your test setup, it certainly seems like something you could fix to make your data easier to reason about.
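The claim above can be checked with a quick sketch (the daily numbers are hypothetical, just for illustration): if A and B get the same number of trials each day, then both aggregate rates share the same denominator, so daily wins for A must add up to an aggregate win for A.

```python
# Hypothetical daily results, assuming A and B get the SAME number of
# trials each day. Each tuple is (a_successes, b_successes, trials_each).
days = [
    (45, 40, 50),     # day 1: A 90% vs B 80%
    (480, 420, 600),  # day 2: A 80% vs B 70%
]

# A wins every day...
assert all(a > b for a, b, _ in days)

# ...and because both aggregates divide by the same total N, the daily
# wins simply add: sum(a_i)/N > sum(b_i)/N whenever each a_i > b_i.
n = sum(t for _, _, t in days)
assert sum(a for a, _, _ in days) / n > sum(b for _, b, _ in days) / n
```

With unequal per-day allocations the denominators differ between A and B, and that is exactly the opening Simpson's paradox needs.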

Jason Orendorff