A/B Testing Charts: The validity threat from oversampling

SUMMARY:

If you’re an experienced A/B tester, you know you can’t simply put up a control and treatment and pick a winner based on the results, because the winner’s better numbers may simply be due to random chance.

So you need a big enough sample size to ensure statistical validity.

However, sometimes too many samples can also deceive. This is the validity threat from oversampling.

Read on to learn more about the occasional downside of big numbers.

by Daniel Burstein, Senior Director, Content & Marketing, MarketingSherpa and MECLABS Institute

A/B testing is an extremely helpful practice for the customer-first marketer. You’re letting customers decide, with their actions, which messages, designs, products and offers are most appealing to them. You randomly split your audience between Headline A (or email, product, offer, landing page, etc.) and Headline B (or email, etc.) and see which performs better.

But to be certain you’re truly measuring customer behavior, you need to make sure you have tracked enough instances so the difference you’re seeing between, say, two different headlines is because one performs better than the other, not just due to random chance.

For example, if you flipped a coin twice, you might get heads twice. That doesn’t mean it’s a two-headed coin. However, if you flipped the coin 100 times, you’d be much more likely to get results from your sampling (say, 51% of the flips resulted in heads and 49% resulted in tails) that accurately reflect reality (50% heads and 50% tails).
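The coin-flip intuition is easy to check with a small simulation. The sketch below is illustrative only: the seed, the number of repeated trials and the helper name heads_share are assumptions chosen for the example, not anything from the MECLABS analysis.

```python
import random

def heads_share(flips, rng):
    """Fraction of heads in `flips` tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(flips)) / flips

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
TRIALS = 200            # assumed number of repeated experiments per sample size

for flips in (2, 100, 10_000):
    shares = [heads_share(flips, rng) for _ in range(TRIALS)]
    # The gap between the luckiest and unluckiest experiment shows how far
    # a single sample can stray from the true 50%.
    print(f"{flips:>6} flips: heads share ranged from "
          f"{min(shares):.3f} to {max(shares):.3f}")
```

The range of observed heads shares should narrow sharply as the number of flips grows, which is the reason a test needs enough samples before its numbers can be trusted.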

Now, you’ll never be 100% sure, because what you’re trying to do is predict future customer behavior based on a subset of all customers. But a good benchmark to shoot for is a 95% level of confidence.
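The article doesn’t spell out how a level of confidence is calculated, and testing platforms differ. One common approach, assumed here, is to take one minus the two-sided p-value of a pooled two-proportion z-test. The function name and the sample numbers below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def level_of_confidence(conv_a, n_a, conv_b, n_b):
    """Two-sided level of confidence (1 - p-value) that two conversion
    rates differ, using a pooled two-proportion z-test (an assumption;
    your platform may use a different method)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no conversions at all, or everyone converted
        return 0.0
    z = abs(p_a - p_b) / se
    return 2 * NormalDist().cdf(z) - 1

# Hypothetical test: 100 of 2,000 visitors convert on A, 130 of 2,000 on B.
print(f"LoC: {level_of_confidence(100, 2000, 130, 2000):.1%}")
```

With these made-up numbers the result lands just above the 95% benchmark. Identical results on both sides return 0%.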

What’s the validity threat from oversampling?

So one might think that more is always better when it comes to statistical validity in your marketing tests. But that is not the case.

“Given enough samples, even random ‘noise’ (random variance) can be statistically significant,” said Taylor Bartlinski, Senior Manager, Data Analytics, MECLABS Institute (parent research organization of MarketingSherpa).

An example of results appearing to indicate a difference when there really isn’t one

To test this principle, Bartlinski ran a simulation of a randomly split A/A (dual control) test to help visualize the issue of oversampling. A dual control test means that both the control and the treatment are the exact same thing. Traffic is evenly split between them. So, assuming no validity threats, both headlines, landing pages or whatever you’re testing should perform the exact same way.

The first chart, entitled “Conversion Rates,” shows the daily fluctuation between the conversion rates for Control A and Control B with the y-axis showing the conversion rate and the x-axis counting the number of days. Remember, this is not meant to represent any specific company’s conversion rate but merely to show an example scenario.

The second chart, “Aggregate Conversion Rates,” shows the aggregated relative differences between the two control groups’ conversion rates for 63 days.

The third chart shows the relative difference of conversion rates between the two control groups along with the levels of confidence (LoC) of that relative difference between the controls. The purple line represents the LoC, and the red is the relative difference.
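If you want to reproduce charts like these yourself, a rough sketch of a dual control simulation follows. It is not Bartlinski’s actual simulation. The 5% true conversion rate, the 1,000 daily visitors per control, the seed and the z-test used for LoC are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

TRUE_RATE = 0.05        # assumed: identical for both controls
DAILY_VISITORS = 1_000  # assumed: visitors per control per day
DAYS = 63               # same length as the charts described above

def loc(conv_a, conv_b, n):
    """1 - two-sided p-value from a pooled z-test, equal traffic per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 0.0
    return 2 * NormalDist().cdf(abs(conv_a - conv_b) / n / se) - 1

def simulate_aa(seed):
    """Return (day, aggregate relative difference, LoC) for each day."""
    rng = random.Random(seed)
    conv_a = conv_b = visitors = 0
    rows = []
    for day in range(1, DAYS + 1):
        visitors += DAILY_VISITORS
        conv_a += sum(rng.random() < TRUE_RATE for _ in range(DAILY_VISITORS))
        conv_b += sum(rng.random() < TRUE_RATE for _ in range(DAILY_VISITORS))
        rel_diff = (conv_b - conv_a) / conv_a if conv_a else 0.0
        rows.append((day, rel_diff, loc(conv_a, conv_b, visitors)))
    return rows

# Print one row per week; change the seed to see a different run.
for day, rel_diff, confidence in simulate_aa(seed=42)[::7]:
    print(f"day {day:>2}: relative difference {rel_diff:+.2%}, LoC {confidence:.1%}")
```

Each seed produces a different path. Some runs never approach 95%, while others cross it at some point even though the two controls are identical by construction.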

A larger relative difference between treatments requires a smaller sample size

As you can see, the difference in conversion rates between the two identical treatments starts out above the desired 95% level of confidence in the very beginning. That’s because there is a relationship between the size of a difference and the level of confidence. The bigger the difference between two things, the fewer samples (i.e., observations) you need to confidently say those things are different.

This is why, as most experienced A/B testers know, you shouldn’t turn a test off within a day even if your platform says it has a 95% level of confidence. As you can see in the chart, there is a drastic change in the difference between conversion rates in the first few days.

And then, as the relative difference in conversion rates levels off, the level of confidence drops well below 95%.
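The relationship between the size of a difference and the samples needed to detect it can be made concrete with a standard sample size approximation for comparing two proportions. This is a textbook formula, not necessarily the one MECLABS uses, and the 80% statistical power and 5% baseline conversion rate are assumptions introduced for the example.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(base_rate, rel_lift, confidence=0.95, power=0.80):
    """Approximate visitors needed per arm to detect a relative lift.

    Uses the normal-approximation formula for two proportions;
    `power` is an assumed planning parameter, not from the article.
    """
    p1 = base_rate
    p2 = base_rate * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.20, 0.10, 0.02):
    n = sample_size_per_arm(base_rate=0.05, rel_lift=lift)
    print(f"{lift:.0%} relative lift: about {n:,} visitors per arm")
```

Under these assumptions, a 20% relative lift needs on the order of 8,000 visitors per arm, while a 2% lift needs many times more, because the required sample grows roughly with the inverse square of the difference.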

A smaller relative difference between treatments requires a larger sample size

But then something interesting happens. The level of confidence creeps back up. This is because even a very small difference can reach a certain level of confidence with a really large sample size.

“The relative difference of 2% we see is just ‘noise’ because this is a dual control simulation. This 2% relative difference is statistically significant (above 95%) after Day 44 even though there is no true difference. This demonstrates that with enough samples even ‘noise’ can be conclusive,” Bartlinski said.

That noise is the natural variance between two things, even when they’re identical. For example, you may set your toaster to the exact same setting every morning to toast your bagel. But sometimes the bagel will be a little more toasted, sometimes less, and maybe, even on very rare occasions, burned. Subtle changes in the atmosphere, ambient temperature, wattage flowing through your home’s wall socket, cleanliness of the toaster’s elements and a million other minor random fluctuations can have an effect even though the toaster setting was identical.
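To see why a small gap eventually registers as significant, the sketch below holds a 2% relative difference fixed and only increases the sample size. That is a simplification: in a real dual control test the observed gap moves around from day to day. The 5% baseline rate and the z-test behind the LoC figure are assumptions.

```python
from math import sqrt
from statistics import NormalDist

BASE_RATE = 0.05  # assumed conversion rate for Control A
REL_DIFF = 0.02   # the 2% relative difference from the quote above

def loc_at(n_per_arm, p_a=BASE_RATE, rel_diff=REL_DIFF):
    """LoC (1 - two-sided p-value) if the observed gap stayed at rel_diff."""
    p_b = p_a * (1 + rel_diff)
    pooled = (p_a + p_b) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    return 2 * NormalDist().cdf(abs(p_b - p_a) / se) - 1

for n in (10_000, 100_000, 400_000, 1_000_000):
    print(f"{n:>9,} visitors per arm: LoC = {loc_at(n):.1%}")
```

The same 2% gap that looks like nothing at 10,000 visitors per arm clears 95% once each arm has a few hundred thousand visitors. The gap did not get any more real. Only the sample got bigger.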

The challenge is when this random noise is put into a beautiful chart or reported as specific numbers in an analytics platform, it can seem very real and impactful. “Random events often look like non-random events, and in interpreting human affairs we must take care not to confuse the two,” Leonard Mlodinow wrote in the book, The Drunkard's Walk: How Randomness Rules Our Lives.

“A key takeaway from this example is the importance of determining the correct sample size at the start of a test. There is a tendency for people to wait until statistical significance is achieved, but this shows that in doing so, you risk observing a significant difference that doesn’t really exist,” said Steffi Renninger, Data Analyst, MECLABS Institute.

Avoiding oversampling

“This oversampling issue is something we try to avoid, especially with the clickthrough rate (CTR) for high-traffic websites. When we see relative differences less than 0.4% reach 95% LoC, we cannot be sure that this is not ‘noise.’ The precautions we have taken for large Research Partners are to establish a limit for the relative difference we look for (For CTR, we try to look for differences close to or above 1%.), make sure daily lifts are consistent, and when oversampling could be a concern, look at the LoC for smaller time periods to make sure the lift is ‘real,’” Bartlinski said.
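The precautions Bartlinski describes could be expressed as a simple guardrail check. The sketch below is one hypothetical way to encode them, not MECLABS’ actual process. The function name and the 70% consistency threshold are invented for illustration; only the roughly 1% minimum relative difference for CTR and the 95% LoC come from the article.

```python
def credible_lift(rel_diff, loc, daily_lifts,
                  min_rel_diff=0.01,     # "close to or above 1%" for CTR
                  min_loc=0.95,          # the usual 95% benchmark
                  min_consistency=0.7):  # assumed share of days that must agree
    """Return True only if the lift is large enough, statistically
    significant, and points the same way on most individual days."""
    if not daily_lifts:
        return False
    same_direction = sum(1 for d in daily_lifts if d * rel_diff > 0)
    consistency = same_direction / len(daily_lifts)
    return (abs(rel_diff) >= min_rel_diff
            and loc >= min_loc
            and consistency >= min_consistency)

# A 0.4% relative difference at 96% LoC with lifts that flip sign daily.
print(credible_lift(0.004, 0.96, [0.010, -0.020, 0.005, -0.001]))
# A 3% relative difference at 97% LoC, positive on four days out of five.
print(credible_lift(0.03, 0.97, [0.02, 0.04, 0.03, 0.01, -0.01]))
```

The first case fails the check despite clearing 95% LoC, because the difference is too small and the daily lifts are inconsistent. The second passes all three conditions.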

The timeframe issue in testing is important. If you don’t set the timeframe before the test runs, you could end up turning the test off when the results look good rather than when they are valid. One of our data scientists once gave me this analogy: “To run a real experiment, you can’t simply throw a rock into a forest and, once it hits a tree, say, ‘Yep, that was a tree I meant to hit.’ You must first decide on which tree you’re trying to hit.”

So prior test planning to achieve statistical validity is integral to running a successful marketing experiment. “You should turn the test off once you've hit your sample size you’ve determined before the test,” Bartlinski said.

Tests don’t validate on their own

Understanding the risk of oversampling separates true, scientifically valid marketing experimentation from simply putting two headlines or call-to-action buttons in a test splitting tool and keeping it on until you get that desired 95% level of confidence.

For example, let’s say you’re at an 87% level of confidence after you’ve left the test running for at least a week (to overcome any day-of-the-week variations), and you’ve reached the sample size you need from your pretest planning.

You could feel an urge to just keep the test running until it hits 95% level of confidence and let the test validate as statistically significant on its own, something Bartlinski refers to as “reaching for 95%.”

But as you can see from the above example, this would put you at risk for oversampling because everything will validate eventually if you collect enough samples. Tests don’t validate on their own. They validate (or don’t) once they’ve met your predetermined qualifications.
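The risk of “reaching for 95%” can be illustrated by simulating many dual control tests and comparing two stopping rules: declaring a winner the first time any daily check shows 95% LoC, versus checking once at a sample size fixed in advance. Every parameter below (10% conversion rate, 200 daily visitors per arm, 20 days, 200 simulated tests, the seed) is an assumption chosen to keep the example quick to run.

```python
import random
from math import sqrt
from statistics import NormalDist

RATE, DAILY, DAYS, TESTS = 0.10, 200, 20, 200  # all assumed values

def loc(conv_a, conv_b, n):
    """1 - two-sided p-value from a pooled z-test, equal traffic per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 0.0
    return 2 * NormalDist().cdf(abs(conv_a - conv_b) / n / se) - 1

def run(seed=1):
    """Share of A/A tests wrongly called significant under each rule."""
    rng = random.Random(seed)
    peeking_hits = planned_hits = 0
    for _ in range(TESTS):
        conv_a = conv_b = n = 0
        crossed = False
        for _day in range(DAYS):
            n += DAILY
            conv_a += sum(rng.random() < RATE for _ in range(DAILY))
            conv_b += sum(rng.random() < RATE for _ in range(DAILY))
            level = loc(conv_a, conv_b, n)
            crossed = crossed or level >= 0.95  # rule 1: stop at first 95%
        peeking_hits += crossed
        planned_hits += level >= 0.95           # rule 2: check only at the end
    return peeking_hits / TESTS, planned_hits / TESTS

peeking, planned = run()
print(f"False winners when stopping at the first 95%: {peeking:.1%}")
print(f"False winners when checking at the planned end: {planned:.1%}")
```

Because both arms are identical, every “winner” here is a false one. The first rule can never produce fewer false winners than the second, and in a typical run it produces noticeably more, which is why the stopping point should be decided before the test begins.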

A better strategy is to learn from what the experiment is telling you. An 87% level of confidence doesn’t necessarily mean you shouldn’t make a change and start using the winning treatment for all your traffic. It just means you must make a business decision and understand there is a higher level of risk than if you reached 95% level of confidence.

There’s nothing magic about the number 95%. It’s a good benchmark for level of confidence. But if you overdo your sample collection in your quest to hit that number, you will ultimately torpedo your testing efforts.

Related resources

Online Testing online certification course: Learn a proven methodology for executing effective and valid experiments from MECLABS Institute

Two Factors that Affect the Validity of Your Test Estimation

Validity Threats: How We Could Have Missed A 31% Increase In Conversions
