A/B Test Analysis Prompt Templates
AI prompt templates for A/B test analysis. Interpret experiment results and make data-driven decisions.
Overview
Running an A/B test is easy. Interpreting the results correctly is hard. These prompts help you design tests that answer the right questions, analyze results without common mistakes, and make decisions when results are ambiguous. Use them to avoid the pitfalls that lead to false positives and bad decisions.
Best Practices
Define your success metric and sample size before starting the test
Let tests run to completion even if early results look decisive
Check for novelty effects and segment differences before declaring a winner
Document what you tested and what you learned, even for failed tests
Consider practical significance, not just statistical significance
Prompt Templates
1. A/B Test Design Review
Review my A/B test design before I launch. Hypothesis: [WHAT_YOU_EXPECT_TO_HAPPEN] Variant A (control): [CURRENT_EXPERIENCE] Variant B (test): [NEW_EXPERIENCE] Primary metric: [WHAT_YOU'LL_MEASURE] Secondary metrics: [OTHER_THINGS_TO_WATCH] Traffic allocation: [% SPLIT] Expected duration: [HOW_LONG] Is this test set up correctly? What could go wrong?
Hypothesis: Adding social proof (customer count) to pricing page will increase conversions. Variant A: Current pricing page. Variant B: Same page with '10,000+ customers trust us' badge. Primary metric: Pricing page to checkout conversion rate. Secondary metrics: Time on pricing page, plan mix (which plan they choose). Traffic allocation: 50/50. Expected duration: 2 weeks.
**Design Review: Social Proof Test**

**Looks good:**
- Clear, testable hypothesis
- Single variable change (good for clean interpretation)
- Appropriate primary metric (conversion rate)
- Smart secondary metric (plan mix) to check for unintended effects

**Potential issues:**
1. **Sample size check needed**: 2 weeks might not be enough. What's your baseline conversion rate, and how big a lift are you looking for?
   - If the baseline is 5% and you want to detect a 10% relative lift, you need ~30,000 visitors per variant
   - Do the math: 2 weeks × daily visitors ÷ 2 = sample per variant
2. **Day-of-week effects**: 2 weeks is good (it captures full weekly cycles), but make sure you start and end on the same day of the week
3. **Audience mismatch risk**: Social proof works differently on different audiences
   - First-time visitors vs. returning visitors might respond differently
   - Segment your analysis afterward
4. **Novelty effect**: The social proof badge might get initial attention that fades. Consider extending to 3-4 weeks to check for decay
5. **Missing guardrail metric**: What if the badge increases conversions but shifts them to lower-tier plans? Add average revenue per conversion as a guardrail

**Pre-launch checklist:**
- [ ] Sample size calculation confirms 2 weeks is enough
- [ ] Analytics tracking verified on both variants
- [ ] Test starts at a specific time (not mid-day)
- [ ] Decision criteria defined: what lift makes this worth shipping?
Before launching any A/B test
- Write down your prediction and decision criteria before seeing results. This prevents post-hoc rationalization
- If you don't know your baseline conversion rate, run a measurement period first
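The back-of-envelope sample-size math in the review above can be made concrete. Below is a minimal Python sketch of the standard normal-approximation formula for comparing two conversion rates; the function name and the hardcoded z-values (two-sided α = 0.05, 80% power) are assumptions of this sketch, not something prescribed by the template:

```python
import math

def sample_size_per_variant(baseline, relative_lift):
    """Visitors needed per variant to detect a relative lift in conversion rate.

    Normal-approximation formula for a two-proportion test; the z-values
    below assume a two-sided alpha of 0.05 and 80% power.
    """
    z_alpha = 1.96    # two-sided 5% significance level
    z_beta = 0.8416   # 80% power
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    delta = p2 - p1
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# The worked example from the review: 5% baseline, 10% relative lift
n = sample_size_per_variant(0.05, 0.10)
print(n)  # roughly 31,000 visitors per variant
```

Multiply by two for total traffic, then divide by expected daily visitors to sanity-check whether two weeks is actually enough.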
2. A/B Test Results Interpreter
Help me interpret these A/B test results. Test description: [WHAT_YOU_TESTED] Duration: [HOW_LONG_IT_RAN] Results: - Control: [N], [CONVERSION_RATE_OR_METRIC] - Variant: [N], [CONVERSION_RATE_OR_METRIC] - Statistical significance: [P_VALUE_OR_CONFIDENCE] Should I ship the variant? What else should I check?
Test: Simplified checkout form (5 fields instead of 8). Duration: 3 weeks. Results: Control: 15,234 visitors, 4.2% conversion. Variant: 15,198 visitors, 4.5% conversion. Statistical significance: p = 0.08. Should I ship the variant?
**Results Analysis**

**Quick stats:**
- Relative lift: 7.1% (4.5% vs. 4.2%)
- Absolute lift: 0.3 percentage points
- p-value: 0.08 (not significant at the 0.05 threshold)

**Verdict: Don't ship yet. Here's why, and what to do.**

**The stats don't quite support it:** With p = 0.08, there's an 8% chance you'd see a difference this large (or larger) even if the variant has no real effect. That's suggestive, but above the conventional 0.05 bar.

**What this actually means:**
- The evidence leans toward the variant being better, but it isn't conclusive
- The effect might be real but smaller than it appears

**Options:**
1. **Extend the test** (recommended): You're close. Another 1-2 weeks might push you over the significance threshold, or reveal that the effect is smaller than hoped.
2. **Accept the risk and ship**: If the cost of a wrong decision is low and the variant doesn't hurt anything, some teams would ship on this level of evidence. But document that you're doing so.
3. **Segment analysis**: Check whether specific segments show stronger results:
   - Mobile vs. desktop
   - New vs. returning visitors
   - Geographic regions
   A segment with clear significance might justify a targeted rollout.

**Additional checks before any decision:**
1. **Secondary metrics**: Did anything get worse? Check:
   - Form completion rate (did fewer people finish?)
   - Post-purchase satisfaction (did removing fields cause issues later?)
   - Revenue per visitor (did you lose upsells with fewer fields?)
2. **Time pattern**: Plot conversion rate by week. Is the variant consistently better, or did it spike early and decay?
3. **Sample ratio mismatch**: Are the sample sizes close to 50/50? (They are: 15,234 vs. 15,198.) A big imbalance could indicate a tracking bug.
When your A/B test finishes and you need to make a decision
- p = 0.05 is a convention, not magic. Understand the tradeoffs of your threshold
- A non-significant result is not 'no effect.' It's 'we can't tell from this data'
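The arithmetic behind a verdict like the one above is a two-proportion z-test. The sketch below is stdlib-only Python; the conversion counts are back-calculated from the quoted rates (4.2% of 15,234 ≈ 640; 4.5% of 15,198 ≈ 684), so the resulting p-value is illustrative and won't exactly match the example's p = 0.08:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_* are conversion counts, n_* are visitor counts.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Checkout-form example, with counts back-calculated from the quoted rates
z, p = two_proportion_ztest(640, 15234, 684, 15198)
print(round(z, 2), round(p, 2))
```

The same helper also makes sample-ratio and segment checks cheap to script rather than eyeball.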
3. Test Failure Post-Mortem
My A/B test showed no significant difference. Help me understand why and what to do next. What we tested: [HYPOTHESIS_AND_CHANGE] Results: [WHAT_HAPPENED] Our expectation: [WHAT_WE_THOUGHT_WOULD_HAPPEN] Was the test flawed, or is the hypothesis wrong?
What we tested: Adding urgency messaging ('Only 3 left in stock!') to product pages. Results: Control 2.8% add-to-cart, Variant 2.9% add-to-cart, p = 0.67 (not significant). Our expectation: 10-15% lift based on industry benchmarks.

**Post-Mortem: Urgency Messaging Test**
**Observation**: Essentially no effect detected. 0.1 percentage point difference with high p-value.
**Possible explanations:**
**1. The hypothesis is wrong for your context**
- Urgency messaging works best for commodities with alternatives ("buy now or someone else will")
- If your products are unique or customers are researching, fake urgency might be ignored or feel manipulative
- Question: What's your product category? How price-sensitive are buyers?
**2. The execution didn't work**
- Was the message visible enough? Check heatmaps for attention
- Was '3 left' believable? Customers notice if it always says 3
- Did it appear on all product pages or just some?
- Check: Did customers actually see the message? Look at scroll depth data
**3. The test was underpowered**
- Calculate: With a 2.8% baseline, detecting a 10% relative lift at α = 0.05 and 80% power needs roughly 55,000 visitors per variant
- If you had 10,000 per variant, you couldn't have detected the effect you expected
- Check: What was your actual sample size?
**4. There's a segment effect hiding in the average**
- Urgency might work on some visitors and backfire on others, netting to zero
- Segment by: new vs. returning, mobile vs. desktop, traffic source
- A pattern here could save the hypothesis
**5. Wrong metric**
- Add-to-cart might be too early in the funnel
- Urgency might affect purchase completion, not browsing behavior
- Check: Any movement in downstream metrics?
**Next steps:**
1. Run the segment analysis before concluding it failed
2. Check heatmaps to verify the treatment was noticed
3. Survey customers: Does urgency messaging feel authentic for your brand?
4. If truly no effect: document and move on. Not every idea works, and that's okay.
**Learning to capture:**
Even negative results are valuable. Add to your test repository: 'Urgency messaging (scarcity) showed no effect on [product category] pages. May not suit our customer base or may require stronger execution.'

When a test you expected to win shows no difference
- Failed tests are valuable if you learn something. Document them well
- Most tests fail. The goal is learning, not winning
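The "underpowered test" explanation in the post-mortem above can be checked by inverting the sample-size formula: given the traffic you actually had, what is the smallest lift the test could reliably detect? A rough stdlib-only sketch, where the function name and the fixed z-values (two-sided α = 0.05, 80% power) are assumptions of this sketch:

```python
import math

def minimum_detectable_lift(baseline, n_per_variant):
    """Smallest relative lift detectable with ~80% power (normal approximation).

    Rearranges the two-proportion sample-size formula; z-values assume a
    two-sided alpha of 0.05 and 80% power, and both variants' variance is
    approximated with the baseline rate.
    """
    z_total = 1.96 + 0.8416
    delta = z_total * math.sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return delta / baseline  # express as a relative lift

# Urgency-test example: 2.8% baseline with 10,000 visitors per variant
mdl = minimum_detectable_lift(0.028, 10_000)
print(round(mdl, 2))  # ~0.23: only a ~23% relative lift was detectable
```

If the minimum detectable lift comes out well above the lift you hypothesized, the test could not have confirmed your hypothesis no matter what, and "no significant difference" tells you very little.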
4. Multi-Variant Analysis
Help me analyze a test with multiple variants. Variants tested: [LIST_ALL_VARIANTS] Results: [RESULTS_FOR_EACH] Questions: Which variant wins? Is there a clear winner or should I test further?
Variants tested: A (control - blue button), B (green button), C (orange button), D (red button). Results: A: 3.1% conversion (n=8,000). B: 3.4% conversion (n=8,100). C: 3.3% conversion (n=7,900). D: 2.9% conversion (n=8,050). All compared to A: B p=0.12, C p=0.34, D p=0.45.
**Multi-Variant Analysis: Button Color Test**

**Summary table:**

| Variant | Conv Rate | vs Control | p-value | Verdict |
|---------|-----------|------------|---------|---------|
| A (Blue/Control) | 3.1% | — | — | Baseline |
| B (Green) | 3.4% | +9.7% | 0.12 | Promising, not conclusive |
| C (Orange) | 3.3% | +6.5% | 0.34 | No evidence of difference |
| D (Red) | 2.9% | -6.5% | 0.45 | No evidence of difference |

**Key issue: the multiple comparison problem**

When you test 4 variants against a control, your chance of a false positive increases. With 3 comparisons at α = 0.05, your actual false positive rate is ~14%, not 5%. Bonferroni correction: divide α by the number of comparisons (0.05/3 ≈ 0.017). None of your results pass this threshold.

**My interpretation:**
- There's no clear winner yet
- B (green) is the best performer, but the evidence isn't strong enough to declare victory
- D (red) might be worse than control, but that's also not conclusive

**Recommendations:**

**Option 1: Run a focused follow-up test (recommended)**
Drop C and D. Run A vs. B with a full 50/50 traffic split. You'll reach significance faster with only 2 variants. Sample size needed: to detect a 10% lift from a 3.1% baseline at α = 0.05 and power = 0.80, you need roughly 50,000 visitors per variant; with all traffic going to just A and B, you'll get there faster.

**Option 2: Ship green with monitoring**
If the decision isn't high stakes and green has other benefits (brand consistency, accessibility), ship it and monitor. The directional evidence supports it.

**Option 3: Question the test**
Button color tests often show small or no effects because button color isn't what drives conversion. Consider testing more impactful things: copy, layout, offer.

**Don't do:**
- Don't pick green as the winner just because it's highest. The p-values don't support it
- Don't rerun all 4 variants. You'll likely get the same inconclusive result with more data
When you've tested more than two options and need to pick a winner
- More variants means you need more traffic to find a winner. Consider sequential testing instead
- If testing multiple variants, adjust your significance threshold for multiple comparisons
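The Bonferroni bookkeeping described above is simple enough to automate. A minimal sketch (the function name is an invention of this example) that flags which variant-vs-control p-values survive the corrected threshold:

```python
def bonferroni_verdicts(p_values, alpha=0.05):
    """Return {variant: passes} under a Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)  # 0.05 / 3 ≈ 0.017 for three comparisons
    return {name: p < threshold for name, p in p_values.items()}

# p-values from the button-color example, each variant vs. control
results = bonferroni_verdicts({"B": 0.12, "C": 0.34, "D": 0.45})
print(results)  # no variant clears the corrected threshold
```

Bonferroni is deliberately conservative; if you run many comparisons routinely, a less strict procedure such as Holm-Bonferroni controls the same error rate with slightly more power.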
Common Mistakes to Avoid
Stopping tests early when results look significant. Early significance is unreliable and inflates false positive rates
Running multiple tests on the same page without accounting for interaction effects
Ignoring practical significance. A 0.1% lift might be statistically significant but not worth the engineering effort to maintain
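The first mistake above (stopping early on a significant-looking peek) is easy to demonstrate with a simulation: run many A/A tests where there is no true difference, and compare the false positive rate of checking once at the end against checking every day and stopping at the first p < 0.05. This is a stdlib-only sketch with made-up traffic numbers; the exact rates will vary with the parameters:

```python
import math
import random

def z_pvalue(c_a, n_a, c_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    pooled = (c_a + c_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def false_positive_rate(peek_daily, trials=400, days=14, daily=300,
                        rate=0.04, seed=7):
    """Share of A/A tests (no real difference) that get declared significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        c_a = c_b = n = 0
        significant = False
        for _ in range(days):
            n += daily
            c_a += sum(rng.random() < rate for _ in range(daily))
            c_b += sum(rng.random() < rate for _ in range(daily))
            if peek_daily and z_pvalue(c_a, n, c_b, n) < 0.05:
                significant = True  # stopped early on a "significant" peek
                break
        if not peek_daily:
            significant = z_pvalue(c_a, n, c_b, n) < 0.05
        hits += significant
    return hits / trials

fixed = false_positive_rate(peek_daily=False)
peeked = false_positive_rate(peek_daily=True)
print(f"check once: {fixed:.2f}, peek daily: {peeked:.2f}")
```

With daily peeking the nominal 5% error rate typically inflates several-fold, which is why decision thresholds belong in the test plan and results should be read once at the planned end (or with a proper sequential testing procedure).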
Related Templates
SQL Query Prompt Templates
AI prompt templates for writing SQL queries. Create SELECT, JOIN, aggregate, and complex queries.
Data Analysis Prompt Templates
AI prompt templates for data analysis. Extract insights, identify patterns, and interpret results.
Data Visualization Prompt Templates
AI prompt templates for data visualization. Create effective charts, dashboards, and visual reports.