Math is hard.
I get it.
Many of us continue to suffer from the false belief that we are incapable of doing math.
I’m going to leave that alone, as this is a product blog and not a blog about the sad state of K-12 education or more specifically STEM education.
Don’t worry, I’m not going to make you do any math that a 5th grader can’t do.
But as a product leader, if you want to survive in a data-driven product world, you do need a basic understanding of statistics.
Rather than bore you with concepts that will go in one ear and out the other, I’m going to start with one of the most common questions you might face.
Can you trust your A/B test results?
Most teams are now running at least some split tests (more commonly referred to as A/B tests). The problem is that many folks are making mistakes that prevent them from getting any benefit from this testing.
It’s not enough to run the tests. You have to know how to run them properly. You have to know when you can and when you can’t trust the results you get from them.
Despite a growth in 3rd party tools, it’s still not easy to integrate A/B testing into your development process. Why do all that work and not get the benefit?
Read on to make sure you are getting the most from your effort.
Understand What Statistical Significance Means and Why Your Testing Tool Might Not Be Helping
If you aren’t checking for statistical significance on your A/B tests, you might as well give up A/B testing.
That might sound extreme, but it’s true.
Let’s start at the beginning. Statistical significance is a fancy term for a very simple concept that many people don’t understand.
It represents the likelihood that the difference you see between your two variations is an actual difference and not due to chance.
That was a mouthful. Let’s look at an example. Suppose you are testing two designs, Design A and Design B, and you get the following results.
Design A: 1000 views, 12 conversions
Design B: 1000 views, 24 conversions
Always start by plugging these numbers into a statistical significance calculator. I like this one. Using that calculator, I get the following result:
Test “B” converted 100% better than Test “A.” We are 98% certain that the changes in Test “B” will improve your conversion rate.
Your A/B test is statistically significant!
What does that mean? Let’s break it down into its three parts.
First, Design B has twice as many conversions as Design A, so that’s a 100% improvement.
Second, “we are 98% certain that the changes in Test “B” will improve your conversion rate.” 98% is the probability that the difference in conversion rates between Test A and Test B is real and not due to chance.
Note, this also means there is a 2% chance that this difference in conversion rate was due to chance and that Design B won’t actually outperform Design A over the long term.
Let’s look at why. We all know that when you flip a coin, you have a 50% chance of getting heads and a 50% chance of getting tails. We also know that when you flip a coin ten times, you don’t always get 5 heads and 5 tails. Why is this?
It’s because the probability plays out over a very large sample. The more times you flip a coin the closer to 50 / 50 you will get. If you happen to get 10 heads in a row, you might conclude that you will always get heads. But as we know, this is a false conclusion.
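You can watch this play out in a few lines of Python. This is a toy simulation (the seed is arbitrary, chosen only so the run is repeatable): the more flips, the closer the observed share of heads tends to get to 50%.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
for flips in (10, 100, 10_000):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    print(f"{flips:>6} flips: {heads / flips:.0%} heads")
```

With only 10 flips, lopsided results are common; at 10,000 flips the share of heads settles near 50%.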
The same problem occurs with our A/B tests. Because we are only looking at a sample and not all occurrences, the conversion rate of the sample might be the equivalent of 10 heads and we might draw incorrect conclusions from our data.
Fortunately, this second number, 98% in our example, tells us how confident we can be that our results represent an actual difference and not just chance. In the coin-flipping example, it’s the equivalent of knowing we’ve flipped the coin enough times to trust the split we’re seeing.
Finally, the third sentence, “Your A/B test is statistically significant,” is also important. The tool is telling you that you can trust your results: Design B converts better than Design A.
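Under the hood, calculators like this typically run a two-proportion z-test. Here’s a minimal Python sketch of that math (my own illustration of the standard test, not the calculator’s actual code):

```python
import math

def significance(conv_a, views_a, conv_b, views_b):
    """Confidence that B's higher conversion rate is real (two-proportion z-test)."""
    rate_a, rate_b = conv_a / views_a, conv_b / views_b
    pooled = (conv_a + conv_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF of z

print(f"{significance(12, 1000, 24, 1000):.0%}")  # prints 98%
```

Plugging in our example numbers (12 and 24 conversions out of 1,000 views each) reproduces the calculator’s 98% confidence.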
However, there are some things to be aware of. This particular tool will indicate any result with at least 95% confidence is significant. This is the generally accepted standard of statistical significance. Most researchers want to see 95% confidence before concluding the results are significant.
Unfortunately, many of the web-based A/B testing tools don’t follow this standard. I’ve seen some that draw the line as low as 80%. This means they will tell you the results are significant when they only have 80% confidence that the results aren’t due to chance.
Why is that a problem?
Remember, with 95% confidence, you still have a 5% chance that your results were due to chance (like flipping a coin and getting 10 heads) and don’t represent a real difference.
With 80% confidence, you have a 20% chance that your results are due to chance. That means that, on average, for every five A/B tests you run, one will indicate that Test B is better than Test A when the difference isn’t real but is instead due to chance. This will lead to many false positives.
It means that you run the risk of releasing product changes that actually have no impact. Or worse, have a negative impact that isn’t being correctly measured.
With a 20% chance of false positives, it means you will have a very hard time distinguishing between changes that have a real impact and changes that look like they are having an impact but are just statistical errors.
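The arithmetic is simple but sobering. Treating the confidence threshold as a false positive rate, for tests where no real difference actually exists (the numbers here are illustrative only):

```python
# Expected false positives among 20 A/B tests where, in reality,
# neither variation is better than the other.
for threshold in (0.80, 0.95):
    expected = 20 * (1 - threshold)
    print(f"{threshold:.0%} confidence: ~{expected:.0f} false positives per 20 tests")
```

At an 80% threshold you should expect roughly four spurious “winners” out of every 20 null tests; at 95%, roughly one.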
This is a huge problem in my view. 80% confidence is not nearly good enough. I have no idea why these tools draw the line at 80%.
Unfortunately, it can be hard to track down how some 3rd party tools measure significance. So rather than assuming they are doing it correctly, I calculate it myself using a tool that makes it very clear that it is drawing the line at 95% confidence.
Let’s take a step back. If your head is spinning, this is all you need to know.
- Use a statistical significance calculator that requires at least 95% confidence. I like this one.
- Only trust results that are statistically significant.
When You Stop the Test Matters and Why Your Testing Tool Might Be Doing It Wrong
I can’t tell you how many people I talk to who say something along the lines of, “I run the test until the results are statistically significant.”
They have good intentions. At least they are checking for statistical significance.
It seems perfectly reasonable to let your test run until you get statistically significant results.
Unfortunately, this is dead wrong.
Have you ever noticed that your results will be statistically significant one day and not the next day?
Statistical confidence will vary from day to day, even hour to hour, as your views and conversions fluctuate.
If you want to reduce the chance of getting false positives, you can’t stop an A/B test whenever you want.
Remember, a false positive is a result that looks like one option is performing better than the other, but the difference isn’t real, it is just due to chance.
You need to decide ahead of time how long the test will run. You can define this in many ways: for example, for 7 days, until there are 1,000 views, or until you get 100 conversions.
The key is to set a fixed period of time and then to ignore your results until that time period ends. Consider your results meaningless until you hit this limit.
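To see why peeking is dangerous, here’s a toy A/A simulation. Both variations are identical, so every “significant” result is by definition a false positive; all the rates, traffic numbers, and the seed are made up for illustration. A peeker who checks daily and stops at the first significant reading gets fooled far more often than the 5% the threshold promises:

```python
import math
import random

def confidence(conv_a, views_a, conv_b, views_b):
    """Two-sided confidence from a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if se == 0:
        return 0.0
    z = abs(conv_b / views_b - conv_a / views_a) / se
    return math.erf(z / math.sqrt(2))

random.seed(7)
RATE = 0.05                    # both variations convert at 5%: any "winner" is noise
DAYS, VIEWS_PER_DAY, TRIALS = 14, 200, 300
stopped_early = full_duration = 0
for _ in range(TRIALS):
    conv_a = views_a = conv_b = views_b = 0
    peeker_fooled = False
    for _ in range(DAYS):
        for _ in range(VIEWS_PER_DAY):
            views_a += 1; conv_a += random.random() < RATE
            views_b += 1; conv_b += random.random() < RATE
        if confidence(conv_a, views_a, conv_b, views_b) >= 0.95:
            peeker_fooled = True       # a daily peeker stops here and "ships"
    stopped_early += peeker_fooled
    full_duration += confidence(conv_a, views_a, conv_b, views_b) >= 0.95
print(f"stop at first significant peek: {stopped_early / TRIALS:.0%} false positives")
print(f"wait the full duration:         {full_duration / TRIALS:.0%} false positives")
```

Checking only at the end keeps the false positive rate near the promised 5%; checking every day and stopping early inflates it several times over.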
Naturally, you might be asking yourself: how do I decide how long to run the test?
Here’s what I do. Whenever I run an A/B test, I start with a hypothesis. Continuing with our previous example, let’s say my hypothesis is:
Test B will increase conversions over Test A by 200%.
Notice that my hypothesis wasn’t just:
Test B will increase conversions over Test A.
I actually specified by how much. Now in reality, I have no idea what this number should be. But it’s critical that I make the best guess possible, because this is going to determine for how long I run the test.
So what I do is, I ask myself, what is the smallest improvement that I need to see that still makes this test worth implementing? And I use that number.
You might think any improvement is a good improvement. But that’s rarely true. Changes take time to implement. Change is disruptive to your users. Some options might be better than others for reasons other than conversion rate. So the smallest improvement that makes a change worthwhile is rarely just “anything greater than zero.” If you think it through, it’s usually not too hard to determine what improvement you would need to see to make the change worthwhile. Use that number.
But keep in mind, the smaller the improvement, the bigger sample size you’ll need.
I then use this calculator to calculate the duration.
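If you’d rather see the math those duration calculators are doing, here’s a standard sample-size approximation for comparing two proportions. This is my own sketch, assuming a hypothetical 1.2% baseline conversion rate, 95% confidence, and 80% power; any particular calculator may use a different formula:

```python
import math

def sample_size_per_variant(base_rate, lift, z_alpha=1.96, z_power=0.8416):
    """Approximate views needed per variant to detect `lift` over `base_rate`
    with 95% confidence (z_alpha) and 80% power (z_power)."""
    test_rate = base_rate * (1 + lift)
    variance = base_rate * (1 - base_rate) + test_rate * (1 - test_rate)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (test_rate - base_rate) ** 2)

# Hypothetical numbers: 1.2% baseline conversion, hypothesis of a 200% lift
print(sample_size_per_variant(0.012, 2.0))
```

Divide the per-variant sample size by your daily traffic per variant to turn it into a duration in days. Notice how the formula confirms the earlier point: the smaller the lift you want to detect, the larger the sample you need.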
Once you’ve calculated the duration, let the test run for that duration. Don’t call it early. Don’t trust mid-test results, even if the confidence level is high enough to suggest they are significant.
Set a reminder to check in when the duration ends and forget about the test in the meantime. There’s no sense in tempting yourself to cheat.
More Variations Aren’t Better, Test Your Best Options Instead
It happens time and time again. A team adds A/B testing to their repertoire and suddenly they are testing anything and everything.
Instead of testing their two best headlines, they are testing 25 options.
Instead of testing their two best calls-to-action, they test every idea anyone has ever had.
They conclude that the data will tell them the truth.
Again, this sounds logical. And it is absolutely wrong.
Remember, even with 95% confidence, your test still has a 5% chance of being a false positive.
If you test 25 headlines, odds are at least one of them is going to be a false positive. It’s going to look better than the rest, but only due to chance.
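A quick back-of-the-envelope shows how fast the risk compounds as you add variations. Assume each challenger is compared to the control at a 95% threshold and, in reality, none of them is actually better:

```python
# Probability of at least one spurious "winner" when no variation
# is actually better, at a 95% confidence threshold per comparison.
for variants in (2, 5, 25):
    comparisons = variants - 1          # each challenger vs. the control
    p_any_false = 1 - 0.95 ** comparisons
    print(f"{variants} variants: {p_any_false:.0%} chance of a false winner")
```

With 2 variants you carry the familiar 5% risk; with 25 variants it climbs to roughly a 7-in-10 chance that at least one headline looks like a winner purely by chance.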
A/B testing isn’t magical. You still have to use judgment. Test your best options, not every option.
This Stuff is Genuinely Hard. But It Matters.
We’ve covered a lot of ground. And I’ll be the first to admit that this stuff makes my brain hurt.
But it is critically important to understanding what’s working and what isn’t.
You have to get it right.
So do me a favor, if something wasn’t clear, ask a question in the comments.
If it looks like I got something wrong, please share your thoughts in the comments.
We all need to get better at this stuff. Let’s help each other out.
Up next, we’ll take a look at root-cause analysis. Why it matters and how to get better at it. Don’t miss out, subscribe to the Product Talk mailing list.