We recently launched Push Experiments, which lets you conduct A/B tests on push notification campaigns to identify the most engaging message variant. With Push Experiments, we wanted to make it easy to run successful A/B tests. In this post, we’ll discuss some of the statistical techniques we’re using behind the scenes, and you can catch a screencast below showing the ins and outs of the new tool.
Parse Push Experiments in Action
Ready to try Parse Push Experiments? Watch the screencast here, or read on for A/B testing best practices.
What makes an A/B test successful?
We can say that an A/B test succeeds whenever we get a precise, correct answer to the question that originally motivated us to run the test. In other words, a good platform for A/B testing should try to prevent two kinds of failure:
(1): We should rarely get a result that leaves the answer to our question in doubt.
(2): We should rarely get an answer that seems precise, but is actually incorrect.
Parse Push Experiments uses three strategies to prevent these two kinds of failure:
- Encourage developers to ask precise questions that can be answered unambiguously.
- Prevent developers from reaching wrong conclusions by always reporting results along with a margin of error.
- Ensure that most A/B tests will give a precise answer by suggesting the minimum number of users that must be included in an A/B test in order to reasonably expect accurate results.
Step 1: Asking Precise Questions
Here’s one of the most important things you can do while running A/B tests: Commit to the metric you’re testing before gathering any data. Instead of asking questions like, “Is A better than B?”, the Push Experiments platform encourages you to ask a much more precise question: “Does A have a higher open rate than B?”
The distinction between these two questions may seem trivial, but asking the more precise question prevents a common pitfall that can occur in A/B testing. If you allow yourself to choose metrics post hoc, it’s almost always possible to find a metric that makes A look better than B. By committing up front to using open rates as the definitive metric of success, you can rest assured that Push Experiments will produce precise answers.
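To make this concrete, here’s a minimal Python sketch of what committing to a metric looks like. The variant counts are hypothetical, and the snippet is purely illustrative rather than part of the Parse SDK:

```python
def open_rate(opens: int, sends: int) -> float:
    """The single metric we commit to before gathering any data."""
    return opens / sends

# Hypothetical results after the experiment runs; we evaluate ONLY the
# pre-committed metric, never a metric picked after seeing the data.
rate_a = open_rate(opens=540, sends=10_000)  # variant A
rate_b = open_rate(opens=480, sends=10_000)  # variant B
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  lift: {rate_a - rate_b:+.2%}")
```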
Step 2: Acknowledging Margins of Error
Once you’ve chosen the question you’d like to answer, you can start gathering data. But the data you get might not be entirely representative of the range of results you’d get if you repeated the same test multiple times. For example, you might find that A seems to be better than B in 25% of your tests, but that B seems to be better than A in the other 75%.
As such, when reporting the difference between the A and B groups (we’ll call this difference the lift), it’s important to emphasize the potential for variability in future results by supplementing the raw result with a margin of error. If an A/B test shows a lift of +1% with a margin of error of 2% (so the true lift could plausibly fall anywhere between -1% and +3%), you should report that the test’s results were inconclusive. If you simply reported a +1% change, your results would be misleading and might set up unrealistic expectations about the success of your push strategy in the future. By reporting a range of values that should contain the true answer to your question (this range is what a statistician would call a 95% confidence interval), you can help ensure that anyone reading a report about your A/B test will not reach premature conclusions.
At Parse, we determine margins of error for open rate data using a calculation called the Agresti-Caffo method. Push notification open rates tend to be small proportions, and in that regime the Agresti-Caffo method produces much more reliable margins of error than naive approaches like the standard normal (Wald) approximation.
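In rough terms, the Agresti-Caffo adjustment adds one imaginary success and one imaginary failure to each group before applying the usual normal approximation. Here’s a short Python sketch of the calculation, reusing the hypothetical counts from above; it illustrates the published method, not Parse’s internal code:

```python
import math

def agresti_caffo_interval(opens_a, sends_a, opens_b, sends_b, z=1.96):
    """95% confidence interval for the lift (A's open rate minus B's)
    using the Agresti-Caffo adjustment: add one success and one failure
    to each group, then apply the normal approximation."""
    p_a = (opens_a + 1) / (sends_a + 2)
    p_b = (opens_b + 1) / (sends_b + 2)
    se = math.sqrt(p_a * (1 - p_a) / (sends_a + 2) +
                   p_b * (1 - p_b) / (sends_b + 2))
    lift = p_a - p_b
    return lift - z * se, lift + z * se

# Hypothetical counts: A opened 540 of 10,000 sends, B opened 480 of 10,000.
low, high = agresti_caffo_interval(540, 10_000, 480, 10_000)
print(f"lift is between {low:+.2%} and {high:+.2%} with 95% confidence")
```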
In addition to automatically calculating margins of error using the Agresti-Caffo method, the Push Experiments platform only reports results after it’s become clear that either A offers a lift over B or that B offers a lift over A — helping to further protect you from reaching premature conclusions. Until there’s enough data to determine a clear winner, the Push Experiments dashboard will report that there’s still uncertainty about whether A or B is more successful.
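One plausible way to implement that kind of reporting rule, continuing the sketch above (this is our illustration, not necessarily the exact rule the dashboard applies):

```python
def report(low: float, high: float) -> str:
    """Declare a winner only when the entire confidence interval for the
    lift (A minus B) sits on one side of zero; otherwise stay inconclusive."""
    if low > 0:
        return "A offers a lift over B"
    if high < 0:
        return "B offers a lift over A"
    return "Still uncertain whether A or B is more successful"

# With the hypothetical interval from the previous sketch:
print(report(-0.0001, 0.0121))  # -> still uncertain; keep collecting data
```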
Step 3: Choosing the Right Sample Size
Given that the Push Experiments platform will always report results with a margin of error, you’ll want to try to make that margin smaller in order to draw definite conclusions from more of your tests. For example, if you think that your A group will show a lift of 1% over your B group, you’ll want to make sure you gather enough data to ensure your margin of error will be smaller than 1%.
The process of picking a sample size that ensures that your margin of error will be small enough to justify a definite conclusion is called power analysis. The Push Experiments platform automatically performs a power analysis for your A/B test based on the historical open rates for your previous push notifications. To simplify the process, we provide suggested sample sizes based on the assumption that you’ll be trying to measure lifts at least as large as 1% with your A/B tests.
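For intuition, here’s a sketch of a standard two-proportion power calculation in Python. The 5% baseline open rate is hypothetical, and this is a textbook approximation rather than Parse’s exact calculation, which is driven by your real historical open rates:

```python
import math

def sample_size_per_group(baseline_rate: float, lift: float = 0.01) -> int:
    """Approximate users needed in each group to detect `lift` with 80%
    power at a two-sided 5% significance level, using the standard
    two-proportion normal approximation."""
    z_alpha = 1.96    # two-sided 5% significance
    z_power = 0.8416  # 80% power
    p1, p2 = baseline_rate, baseline_rate + lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / lift ** 2)

# With a hypothetical historical open rate of 5%, detecting a 1% lift
# takes roughly eight thousand users per group:
print(sample_size_per_group(0.05))  # -> 8156
```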
Running your A/B tests only with carefully chosen sample sizes makes it much more likely that they will succeed. If you select a sample size that’s much smaller than the size we suggest, you should expect that many of your A/B tests will lead to inconclusive results.
Putting It All Together
We believe the combination of precise questions, clean statistics, and careful choice of sample size is essential for running a successful A/B test. You can achieve all three with Parse Push Experiments, and we hope this look into the statistical methods behind our platform will help you do it.