I am trying to do some statistical analysis of different A/B tests to determine which alternative is better, and I have found conflicting information on how to do it.
First, I am interested in a few different scenarios:
- Tests that measure success by counting events, such as conversions or emails sent
- Tests that measure success by revenue
- Tests that have only two alternatives (control and new)
- Tests that have multiple alternatives (control and multiple new)
I was hoping to find a simple set of formulae or rules for doing this analysis but have found more questions than answers.
This site says that you can't directly compare all the alternatives in a multi-alternative test; you can only do pairwise comparisons, plus a chi-squared analysis to see whether the test as a whole is statistically significant.
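If I understand that chi-squared approach correctly, it would look something like the sketch below in Python/SciPy (the conversion counts are made up for illustration, and I'm not sure this is the right way to structure it):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows are alternatives, columns are [conversions, non-conversions].
observed = np.array([
    [200, 1800],   # control: 200 conversions out of 2000 visitors
    [230, 1770],   # variant B
    [255, 1745],   # variant C
])

# Omnibus test: is there any difference among the alternatives at all?
chi2, p, dof, expected = chi2_contingency(observed)
print(f"overall: chi2={chi2:.2f}, p={p:.4f}, dof={dof}")

# Pairwise follow-up: each variant against the control.
for i, name in enumerate(["B", "C"], start=1):
    chi2_pair, p_pair, _, _ = chi2_contingency(observed[[0, i]])
    print(f"control vs {name}: chi2={chi2_pair:.2f}, p={p_pair:.4f}")
```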
This site suggests a way to do A/B/C/D testing (starting on slide 74), analysing the results with the G-test (which it says is related to the chi-squared test), but isn't clear on the details of the "fudge factor" it uses. It also suggests that you can only use the A/B/C/D approach to eliminate alternatives until you end up with a clear winner in an A/B comparison.
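From what I can tell, SciPy can compute the G-test through the same contingency-table function by selecting the log-likelihood-ratio statistic, something like this (same kind of made-up counts as above; I'm guessing the "fudge factor" is some sort of correction, but I don't know which one the slides mean):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows are alternatives, columns are [conversions, non-conversions].
observed = np.array([
    [200, 1800],  # control
    [230, 1770],  # B
    [255, 1745],  # C
    [190, 1810],  # D
])

# lambda_="log-likelihood" turns the chi-squared statistic into the G statistic.
g, p, dof, expected = chi2_contingency(observed, lambda_="log-likelihood")
print(f"G={g:.2f}, p={p:.4f}, dof={dof}")
```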
This site gives an example of an A/B/C/D test (including a control) and shows how to compare the conversion rates to determine a winner. Unlike the previous approach, it does not recommend eliminating alternatives but rather picks a winner right off the bat (assuming statistically significant results).
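For the final pairwise comparison of conversion rates, I assume something like a two-proportion z-test is what's intended; here's a sketch with statsmodels (again, the numbers are made up, and the choice of test is my assumption rather than something the site spells out):

```python
from statsmodels.stats.proportion import proportions_ztest

# Best variant vs. control, using illustrative counts.
conversions = [255, 200]
visitors = [2000, 2000]

z, p = proportions_ztest(conversions, visitors)
rates = [c / n for c, n in zip(conversions, visitors)]

print(f"conversion rates: variant={rates[0]:.3f}, control={rates[1]:.3f}")
print(f"z={z:.2f}, p={p:.4f}")
```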
Perhaps I'm naive, but I would think that by now a statistical analysis library would exist to deal with this very problem. I would also appreciate more information about which algorithms/equations are needed to solve these problems; it's been a long time since my university stats class.
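In particular, the sources above all seem to focus on conversion counts. For the revenue-based tests, my guess is that you compare per-user revenue directly with a two-sample test rather than a count-based test, roughly like this (synthetic data, and I'd appreciate confirmation that this is even the right direction):

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Synthetic per-user revenue: mostly zeros with occasional purchases,
# which is why a nonparametric test is shown alongside Welch's t-test.
rng = np.random.default_rng(0)
control = rng.exponential(scale=5.0, size=2000) * (rng.random(2000) < 0.10)
variant = rng.exponential(scale=5.5, size=2000) * (rng.random(2000) < 0.11)

t, p_t = ttest_ind(variant, control, equal_var=False)  # Welch's t-test
u, p_u = mannwhitneyu(variant, control, alternative="two-sided")

print(f"Welch t-test:  t={t:.2f}, p={p_t:.4f}")
print(f"Mann-Whitney:  U={u:.0f}, p={p_u:.4f}")
```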