Your A/B Test 'Winner' Is Lying to You
How Aggregate Data Hides Your Most Profitable Insights
Written by Hafiz Dhanani · Topics: A/B Testing, CRO, Google Ads (SEM), Google Optimize / Visual Website Optimizer, advanced statistical significance
When I was the first growth hire at Rocket Doctor, we ran an experiment that highlights some of the subtleties of A/B testing.
The results looked clear-cut at first — we had a winner with statistical significance.
But when we dug deeper, we discovered something far more valuable than a simple conversion rate improvement.
Here's what happened, and why it matters for anyone running experiments.
The Setup
We were spending heavily on Google Ads (SEM), which was basically our entire marketing budget. Any improvement in Cost Per Lead meant we could acquire more patients and extend our runway.
Since marketing owned the signup forms post-click, we could iterate quickly without engineering resources.
Our experiment's core hypothesis was that users would respond better to seeing fewer fields at once.
More formally:
"By redistributing the patient intake fields from a single long page into a sequential, multi-step process, we will reduce the user's perceived cognitive load and visual friction. This change is expected to result in a statistically significant increase in form completion rates (conversion rate) and a subsequent decrease in Cost Per Lead."
We decided to test two form variations:
Version A (Control): A single-page form collecting all patient information at once.
Version B (Experimental): A 3-part form that broke the same fields across multiple screens, asking patients to click "next" between each section.

The Initial Results
After 71 days and 23,273 sessions, the numbers came in:
- Original Form: 2,887 conversions / 11,683 sessions = 24.71% conversion rate
- 3-Part Form: 3,008 conversions / 11,590 sessions = 25.95% conversion rate (+5.02% relative lift)

Google Optimize (RIP) reported a 91% probability that the experimental condition was better. That figure is a Bayesian estimate, so it is not directly comparable to a p-value. When we ran the same counts through Evan Miller's statistical significance calculator, we got a p-value of 0.0293: significant at the standard 95% confidence level, and at any level up to about 97%.
Not bad for a quick form change, right?
We could have declared victory, rolled out the 3-part form to 100% of traffic, and moved on to the next test.
But we didn't.
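The p-value reported above can be reproduced, at least approximately, with a pooled two-proportion z-test. The sketch below uses only the Python standard library. The function name `two_proportion_p_value` is our own, and we are assuming the calculator's chi-squared test behaves like this two-sided z-test (the two are equivalent for a 2x2 comparison when no continuity correction is applied).

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pool both arms to estimate the shared rate under the null hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Aggregate counts from the results above.
p_value = two_proportion_p_value(2887, 11683, 3008, 11590)
relative_lift = (3008 / 11590) / (2887 / 11683) - 1

print(f"relative lift: {relative_lift:+.2%}")
print(f"p-value: {p_value:.4f}")
```

Because the lift here is computed from unrounded rates, its last digit may differ slightly from a figure derived from the rounded percentages.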
The Question That Changed Everything
Instead of stopping there, we asked two questions: (1) Why did the split-up form work better? (2) For whom did the split-up form work better?
Was it really true that people universally respond better to seeing fewer fields at once? Or was there a specific segment of users driving these results?
Segmenting by Device Type
The simplest segmentation available in Google Analytics is device type. When we looked at our traffic breakdown, we discovered that 73.28% of sessions came from mobile devices — a critical insight for a telemedicine company.
Here's what the mobile segment revealed:
Mobile Traffic (17,082 sessions):
- Original Form: 2,106 conversions / 8,575 sessions = 24.56% conversion rate
- 3-Part Form: 2,253 conversions / 8,507 sessions = 26.48% conversion rate (+7.82%)

The relative lift on mobile wasn't just higher than the overall figure; it was about 56% larger (7.82% vs. 5.02%). Even better, the p-value dropped to 0.00392, which is significant even at the 99% confidence level.
But what about desktop and tablet users?
Non-Mobile Traffic (6,191 sessions):
- Original Form: 781 conversions / 3,108 sessions = 25.13% conversion rate
- 3-Part Form: 755 conversions / 3,083 sessions = 24.49% conversion rate (-2.55%)

Wait: the experimental form actually performed worse on non-mobile traffic. Was this statistically significant? No. The p-value of 0.56 was an order of magnitude above the standard 0.05 threshold, meaning we couldn't draw any conclusion about which form was better for desktop and tablet users.
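The segment comparison above can be sketched as a small loop that applies the same test to each slice of traffic. This is an illustration rather than the analysis we actually ran: the helper `p_value` and the `segments` dictionary are names chosen for this example, and the test is the pooled two-proportion z-test, which we assume matches the calculator's output closely.

```python
import math

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# (control conversions, control sessions, variant conversions, variant sessions)
segments = {
    "mobile": (2106, 8575, 2253, 8507),
    "non-mobile": (781, 3108, 755, 3083),
}

# Sanity check: the segments should add back up to the aggregate counts.
totals = tuple(sum(column) for column in zip(*segments.values()))
assert totals == (2887, 11683, 3008, 11590)

results = {}
for name, counts in {"all traffic": totals, **segments}.items():
    conv_a, n_a, conv_b, n_b = counts
    lift = (conv_b / n_b) / (conv_a / n_a) - 1
    results[name] = (lift, p_value(*counts))
    print(f"{name:>11}: lift {lift:+.2%}, p = {results[name][1]:.3g}")
```

One caution when slicing after the fact: every extra segment you inspect is another chance for a false positive. A conservative guard is to divide the 0.05 threshold by the number of segments checked (0.025 for two segments), which the mobile result still clears.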
What This Actually Tells Us
Let's recap what we learned:
- Aggregate level: The 3-part form won with 97% confidence (p = 0.0293)
- Mobile segment: The 3-part form crushed it with 99%+ confidence (p = 0.00392)
- Non-mobile segment: No significant difference between forms (p = 0.56)
The real story wasn't "multi-step forms are better than single-page forms." The real story was "mobile users convert better on multi-step forms, while for desktop and tablet users, the test was inconclusive."
Why This Matters
Here's the crucial insight: If the aggregate test had shown no significant difference, we would have missed the mobile win entirely.
Imagine a scenario where the traffic mix shifted slightly: perhaps desktop volume increased, or the negative trend on desktop strengthened just enough to lower the overall average. That dilution could push the aggregate p-value above the 0.05 significance threshold. You would be left with a 'null result' (a statistical false negative), abandon the experiment, and unknowingly forfeit a 7.82% relative conversion lift for the 73% of your users on mobile.
This happens more often than you'd think.
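To make the dilution scenario concrete, here is a hypothetical what-if, not something we observed. It keeps the mobile counts as measured, multiplies the non-mobile counts by a made-up factor (`non_mobile_scale`) while holding the non-mobile conversion rates fixed, and recomputes the aggregate p-value with a pooled two-proportion z-test. All function and variable names are assumptions for this sketch.

```python
import math

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# (control conversions, control sessions, variant conversions, variant sessions)
MOBILE = (2106, 8575, 2253, 8507)
NON_MOBILE = (781, 3108, 755, 3083)

def aggregate_p_value(non_mobile_scale):
    """Aggregate p-value if non-mobile volume were scaled (hypothetical)."""
    # Scaling every count by the same factor keeps the segment's rates
    # roughly unchanged (up to rounding) while changing its share of traffic.
    scaled = [round(count * non_mobile_scale) for count in NON_MOBILE]
    combined = [m + s for m, s in zip(MOBILE, scaled)]
    return p_value(*combined)

for scale in (1, 1.5, 2):
    print(f"non-mobile volume x{scale}: aggregate p = {aggregate_p_value(scale):.3f}")
```

With the observed mix the aggregate result is significant, but in this sketch simply doubling the non-mobile volume at the same rates pushes the aggregate p-value above 0.05 (to roughly 0.09), even though nothing about the mobile win has changed.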
The Lesson: Always Segment Your Tests
Aggregate data can obscure powerful insights hiding in your user segments. The more granular you can get, the better you understand the mechanisms driving your results — and that understanding becomes the foundation for your next wave of optimizations.
When you run your next A/B test, don't just look at the top-line numbers. Ask yourself:
- What user segments make up my traffic?
- Could different segments be responding differently to this change?
- What's the simplest segmentation I can check first? (Hint: usually device type or traffic source)
In our case, segmentation revealed that we weren't just improving conversion rates — we were solving a real usability problem for mobile users struggling with long forms on small screens. That insight informed everything we built afterward.
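If you can export session-level data, the questions above translate into a simple group-by. The sketch below assumes a hypothetical record shape (dictionaries with `variant`, `device`, `source`, and `converted` keys); real analytics exports will use different field names, and the sample rows are made up purely to show the mechanics.

```python
from collections import defaultdict

def conversion_by_segment(sessions, segment_key):
    """Return {(segment, variant): (conversions, sessions, rate)}."""
    counts = defaultdict(lambda: [0, 0])  # [conversions, sessions]
    for session in sessions:
        bucket = counts[(session[segment_key], session["variant"])]
        bucket[0] += int(session["converted"])
        bucket[1] += 1
    return {key: (conv, n, conv / n) for key, (conv, n) in counts.items()}

# Made-up rows, only to illustrate the expected input shape.
sample = [
    {"variant": "A", "device": "mobile", "source": "google_ads", "converted": True},
    {"variant": "A", "device": "mobile", "source": "google_ads", "converted": False},
    {"variant": "B", "device": "mobile", "source": "google_ads", "converted": True},
    {"variant": "A", "device": "desktop", "source": "organic", "converted": False},
    {"variant": "B", "device": "desktop", "source": "organic", "converted": True},
]

by_device = conversion_by_segment(sample, "device")
by_source = conversion_by_segment(sample, "source")

for (segment, variant), (conv, n, rate) in sorted(by_device.items()):
    print(f"{segment:>8} / {variant}: {conv}/{n} = {rate:.0%}")
```

Passing a different `segment_key` (device first, then traffic source) lets you reuse the same function for each layer of segmentation before running a significance test on the resulting counts.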
Key Takeaways
- Statistical significance at the aggregate level isn't the end of your analysis; it's the beginning.
- Aggregate data hides heterogeneity. Top-line metrics average out distinct user behaviors, often masking the true drivers of performance.
- Prioritize device segmentation. Given the drastic usability differences between mobile and desktop interfaces, this should be the first layer of any post-test analysis.
- Understanding why something works is more valuable than knowing that it works — it compounds into future wins.
- Null results are not always failures. A test that looks flat at the aggregate level may hide a significant win in a specific, high-value segment.
- Stop optimizing for the "Average User." They don't exist. Instead of hypothesizing that "Form B will convert better," refine your hypothesis to: "Form B will convert better for mobile users due to reduced vertical scrolling."
Remember, if you stop at the aggregate p-value, you aren't optimizing for your users — you are optimizing for an 'average' user who doesn't exist.