Email Optimization and Testing: 800,000 emails sent, 1 major lesson learned

SUMMARY: What happens when you send 800,000 email messages to your audience and let more than 200 marketers optimize the landing page for the email campaign? Well, MarketingSherpa, MarketingExperiments and HubSpot teamed up to find out by running a live email campaign test during the first-ever Optimization Summit.

In this unique case study, we hear from Austin McCraw, a lead analyst on the test, as he gives a play-by-play through the test’s two-day duration. You’ll see testing isn't as easy as it looks. Yet, despite all the challenges, he learns the true value of a real-world experiment.

by Austin McCraw, Senior Research Analyst, MECLABS

Two weeks ago I had the privilege of being in the tallest hotel in the Western Hemisphere, joined by more than 200 other evidence-based marketers, for the first-ever MECLABS conference on the topic of optimization and testing. Overall, it was a high-value week.

[Note: MECLABS is the parent company of MarketingSherpa and sister company MarketingExperiments.]

But what I found most uniquely valuable about Optimization Summit was a live experiment in which the audience was asked to optimize and test a marketing campaign during the course of the conference. I had a backstage pass to what would become a thriller of an experiment, with many ups and downs, bends and turns.

All in all, attendees learned a lot about testing in the process, and in this MarketingSherpa article, I’d like to highlight some of the key insights shared at the Summit.

The original page

The subject of the test was a real offer from the Premier Sponsor of the event, HubSpot. They had partnered with MarketingSherpa to offer a free chapter of the 2011 MarketingSherpa Landing Page Optimization Benchmark Report.

The overarching goal for the campaign was to increase HubSpot’s knowledge of their current email list as well as to grow it. So, the audience was charged with increasing the form submission rate for the email landing page.

Here’s the catch. They could only optimize five elements:

1. The headline
2. The product image
3. The call-to-action
4. The copy
5. The page layout

Democratic optimization

If you've ever been in a contentious meeting with your team, you know it can be extremely difficult to get two marketers to agree on the most effective approach. So, how do you get more than 200 marketers to agree on any one of these elements?

We broke them into small groups of three or four, gave them a limited number of design options, and then forced them to come to a consensus in 15 minutes or less. The groups’ choices were then tallied, and whichever options got the most group votes made it into the treatment.

The treatment page

After all the votes were cast and counted, the audience had decided that a single-column layout with short copy, a stronger headline, and a benefit-centric call-to-action would perform better than the control.

From my perspective, the two major optimization advantages behind this new design were:

1. It strengthened the value proposition through specificity in the headline and image, highlighting the nature of the free content for download.

2. It reduced the overall friction on the page by shortening the copy and providing a clear, singular eye-path.

From here we launched the test, sending out 800,000 email messages (to the MarketingExperiments, MarketingSherpa and HubSpot lists), posting four blog posts (including this one here), and Tweeting to more than 140,000 followers. Overall, we received so much traffic that we could statistically validate a 1% change in conversion. And believe it or not, it would come down to that.

The results

According to the final numbers, our focus group of 200+ marketers was able to increase the number of leads generated from this page by a whopping 0.7% at a 90% confidence level. This seems like a modest gain, to say the least, but in this case, we were using our best email lists to drive the highest quality people to this page, so even a small gain could potentially generate impressive ROI.

With that said, it is also very interesting to note here the incredibly high conversion rate for this campaign, 47% (meaning nearly one out of every two visitors completes the form). This means that incoming traffic is incredibly motivated, and therefore any gains obtained through testing will most likely be modest.

As taught in the MarketingExperiments Conversion Heuristic, visitor motivation has the greatest influence on the likelihood of conversion. If your visitors are highly motivated, they will put up with a bad Web page in order to get what they want. Not to say that either one of these pages is bad, but with such a motivated segment, it will be hard to tell any difference in conversion performance between the designs.
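For readers who want to see how a lift and a confidence level like the ones above can be computed, here is a minimal Python sketch using a two-proportion z-test. The visitor and conversion counts are hypothetical (this article does not report raw counts), and the z-test is a common textbook approach, not necessarily the method used to produce the figures in this test.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lift_and_confidence(control_conv, control_visits, treat_conv, treat_visits):
    """Return (relative lift, two-sided confidence) for treatment vs. control."""
    p_control = control_conv / control_visits
    p_treat = treat_conv / treat_visits

    # Pooled rate under the assumption that both pages convert equally well.
    pooled = (control_conv + treat_conv) / (control_visits + treat_visits)
    std_err = math.sqrt(
        pooled * (1.0 - pooled) * (1.0 / control_visits + 1.0 / treat_visits)
    )

    z = (p_treat - p_control) / std_err
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

    relative_lift = (p_treat - p_control) / p_control
    return relative_lift, 1.0 - p_value


# Hypothetical counts: 47% vs. 48% conversion on 10,000 visitors per page.
lift, confidence = lift_and_confidence(4700, 10000, 4800, 10000)
print(f"relative lift: {lift:.1%}, confidence: {confidence:.1%}")
```

In this made-up example, a one-point difference on 10,000 visitors per page falls short of a 90% confidence level, which gives a feel for why small gains on a highly motivated segment need a lot of traffic to validate.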

But something still didn’t seem right

Despite the incremental success, something seemed "off" during the course of the test. We collected data every hour (though some hours are consolidated in the charts featured in the useful links below), and by the end of the first day of testing it looked as if we had a clear winner. At 11:50 p.m. of the first day, the treatment was outperforming the control by 5% and had reached a 93% confidence level. The test was in the bag, and attendees were breaking out the bottles of Champagne (okay… a little stretch there).

However, in the morning, everything had shifted.

The treatment, which had averaged a 51% conversion rate throughout the previous day, was now reporting a comparatively dismal 34% conversion rate. This completely changed the results: the control and treatment were virtually tied, and the confidence level was now under 70%.

What happened? Why the shift? Did visitor behavior change drastically overnight? Did some extraneous factor compromise the traffic?

A major validity threat

As we looked deeper into the numbers, we noticed an interesting anomaly in the overnight traffic. In the course of seven hours, the treatment received 44% more traffic than the control, even though the test was designed to split the traffic in half. At certain periods of time, the traffic amount would swing significantly (as noted in the chart featured in the useful links below), and one of those times was overnight.
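An imbalance like this can also be flagged automatically. The sketch below is one hedged way to do it in Python: a chi-square goodness-of-fit check of the observed visit counts against the intended 50/50 split. The visit counts are hypothetical (only the 44% figure comes from this test), and the function name is made up for illustration.

```python
import math


def split_imbalance_p_value(control_visits, treatment_visits):
    """P-value for the observed split under an intended 50/50 allocation.

    Uses a chi-square goodness-of-fit statistic with one degree of freedom.
    A very small p-value means the split is unlikely to be chance alone.
    """
    expected = (control_visits + treatment_visits) / 2.0
    chi2 = (
        (control_visits - expected) ** 2 + (treatment_visits - expected) ** 2
    ) / expected
    # For one degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical overnight counts: treatment gets 44% more visits than control.
overnight_p = split_imbalance_p_value(1000, 1440)
# Hypothetical daytime counts: a roughly even split.
daytime_p = split_imbalance_p_value(1000, 1020)

print(f"overnight p-value: {overnight_p:.3g}")
print(f"daytime p-value:   {daytime_p:.3g}")
```

Running a check like this on each hourly batch would separate the periods where the split looks healthy from the periods that deserve a closer look.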

What was causing the traffic to swing so much? The answer to this question would reveal a major validity threat.

At first, we thought it was the splitter. Maybe for some reason the traffic splitter was just not working properly. But that didn’t make much sense, given the periods of time in which the traffic split was relatively even. If a splitter is broken, it is usually broken all the time. It doesn’t go back and forth between working and not working.

“When we see results like this, it often indicates that another online campaign was launched and directed unauthorized traffic to one of the test pages,” said Bob Kemper, Director of Sciences, MECLABS. So we checked with the marketing departments involved, but nothing new had been launched, and all the original campaigns were only sending to the splitter page.

We had two thoughts, and now, two dead ends. So, what else could it be?

You, o' email list, were our validity threat

Then it hit me. Seems obvious now, but for at least an hour, I felt like a genius. It wasn’t us that drove unauthorized traffic to the treatment or the control; it was you. That’s right, our own email subscribers began telling their friends (via Twitter and other social media outlets) about the offer, only instead of pointing them to the splitter URL, they directed them straight to either the treatment page or the control page.
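One way to picture the effect of those shared links is to separate visits that arrived through the splitter from visits that landed directly on a page. The Python sketch below is purely illustrative and is not how this test was analyzed: the record fields (`page`, `via_splitter`, `converted`) and the sample data are assumptions, and it presumes your analytics can tell the two kinds of visits apart.

```python
from collections import defaultdict


def conversion_by_source(visits):
    """Conversion rate per (page, via_splitter) group.

    `visits` is an iterable of dicts with the hypothetical keys
    'page', 'via_splitter' and 'converted'.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [visits, conversions]
    for visit in visits:
        group = (visit["page"], visit["via_splitter"])
        counts[group][0] += 1
        counts[group][1] += int(visit["converted"])
    return {group: conv / n for group, (n, conv) in counts.items()}


# Made-up records: two randomized visits per page, plus two direct
# visits to the treatment page that came from a shared link.
sample_visits = [
    {"page": "control", "via_splitter": True, "converted": True},
    {"page": "control", "via_splitter": True, "converted": False},
    {"page": "treatment", "via_splitter": True, "converted": True},
    {"page": "treatment", "via_splitter": True, "converted": False},
    {"page": "treatment", "via_splitter": False, "converted": False},
    {"page": "treatment", "via_splitter": False, "converted": False},
]

for group, rate in sorted(conversion_by_source(sample_visits).items()):
    print(group, f"{rate:.0%}")
```

Only the splitter-assigned groups are a fair comparison, because those visitors were randomly allocated. The direct visitors chose their page by following a friend's link.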

A test design flaw exposed

This validity threat exposed a problem with our own test design, and illustrates how easily validity threats can occur for the average online test. Working within the average technological constraints of a marketing team, we hit one of the most common validity threats in online testing – the selection effect.

Our data could have said we had a 175% gain with a 95% confidence level, and we could still potentially run off to make the biggest mistake of our marketing careers because of an undetected validity threat. Even the best test designs are subject to real-world extraneous factors that can threaten the validity of a test.

Does that not cause you to tremble in fear just a little? Good.

A little fear is healthy in email campaign testing. However, this fear should evoke prudence, not paralysis. We don’t want to stop testing just because it can be messy. In fact, this test should encourage you to do more testing. If I, a content guy with little statistical and data analysis background, could find the problem, then anyone reading this article has hope. Testing anomalies are often obvious (if you are watching the data), even if you can’t exactly pinpoint the cause at first.

Simple in principle, messy in practice

In conclusion, what I learned most from this live experiment is that though testing is simple in principle, it can be messy in practice. When you get real people interacting with a real offer, the results are often unpredictable. There are potentially real-world extraneous factors that can completely invalidate your results. In this test, the aggregate data claimed an increase with a fair level of confidence, but something was interfering with the results. We would have to dig down deep to really figure it out.

Maybe it’s just the thrill seeker inside of me, but it was the unexpected messiness that made this test so exciting. Throughout Optimization Summit, I eagerly watched the data come in, not knowing what would happen next. I felt like a kid on the edge of his seat watching a good sci-fi movie.

Overall, this experience opened my eyes to a whole new world, and though it may take some time to get good at testing, at this point, I’m more than ready for the sequel.

Useful links related to this article

CREATIVE SAMPLES:

1. Test control
2. Test treatment
3. Test results
4. Time graph
5. Metrics
6. Live Tweet
7. Test design flaw

2011 MarketingSherpa Landing Page Optimization Benchmark Report

Members Library -- Landing Page Testing and Optimization: How Intuit grew revenue per visitor 23%

New Marketing Research: 3 profitable traffic sources most marketers are ignoring

Evidence-based Marketing: Marketers should channel their inner math wiz…not cheerleader

MarketingSherpa B2B Summit 2011 – September in Boston, October in San Francisco

Subscribe to the MarketingSherpa Inbound Marketing newsletter

