A/B Testing Intuition Busters (KDD Talk)


Paper at http://bit.ly/ABTestingIntuitionBusters

@RonnyK © 2022 Ron Kohavi


Motivation
• “6 hours of debugging can save you 5 minutes of reading documentation”
-- Tweet by Jakob Cosoroabă
• Two weeks of deep-diving into experiment results can save you an hour of
reading this paper
• Make sure you understand p-values and statistical power
• Be skeptical of many published results

© 2022 Ron Kohavi #2


Tough Paper to Write
• I’ve been involved in A/B tests at Amazon (Weblab), Microsoft (led
ExP, the experimentation platform), and Airbnb (ERF).
I co-authored a book on experimentation that is usually in the top 10 in
Data Mining on Amazon, and I teach a quarterly Zoom class on A/B testing.
Alex was at Microsoft ExP and Airbnb.
Lukas was director of experimentation at Booking.com and is now at Vista
• We called out mistakes by multiple vendors and by book authors, and we
shared (really, shredded) an example from GuessTheTest where several
mistakes were made (with the owner’s permission)
• The meta-reviewer wrote
I would very much encourage the authors to reread it and tone it down in parts…
If not done so, the paper would in fact embarrass KDD for years…

© 2022 Ron Kohavi #3


P-Values
Misinterpretation and abuse of statistical tests,
confidence intervals, and statistical power have
been decried for decades, yet remain rampant.
A key problem is that there are no interpretations of these
concepts that are at once simple, intuitive, correct, and foolproof
-- Greenland et al (2016)
• Vendors (e.g., Optimizely) try to hide the complexity by calling 1-(p-value)
“confidence,” but this is misleading, as confidence is NOT the probability that
the result is a true positive. Documentation is often wrong.
• Book authors frequently get it wrong (see paper for examples)

© 2022 Ron Kohavi #4


What We Want vs. What We Get
• We run an A/B test, and we want the probability that
B is better than A on some metric, say conversion rate,
given the data we observed in the test
• Technically, we define 𝐻0 , the null hypothesis, as the hypothesis that there
is no difference, and we want:
1-P(𝐻0 | data observed)
• What p-value gives us is the opposite conditional probability:
P(Δ observed or more extreme | 𝐻0 is true)
• It makes no sense to associate the p-value with the probability of 𝐻0 because
its definition assumes 𝐻0 is true

© 2022 Ron Kohavi #5


We Need a Prior for 𝐻0
• To get the probability that we want, we need to apply
Bayes Rule, which requires a prior 𝑃(𝐻0 )
• FPR, or False Positive Risk, is a proxy: the probability that 𝐻0 is true (no
real effect) given that the test was statistically significant
• Assuming the experiment was properly run with 80% power (discussed
later), here is a useful table relating companies’ reported success rates to FPR,
using a p-value threshold (alpha) of 0.05 for statistical significance
(a small sketch of the computation appears after the table)
Company / Source                    Success Rate    FPR
Microsoft                           33%             5.9%
Avinash Kaushik                     20%             11.1%
Bing                                15%             15.0%
Booking.com, Google Ads, Netflix    10%             22.0%
Airbnb Search                       8%              26.4%

© 2022 Ron Kohavi #6
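
The FPR values in the table can be reproduced with the standard formula
FPR = α·(1−π) / (α·(1−π) + power·π), where π is the prior success rate. Here is a
minimal Python sketch, assuming 80% power and assuming that only the favorable tail
of the two-sided 0.05 test (i.e., α = 0.025) counts as a win:

def fpr(success_rate, alpha_one_tail=0.025, power=0.80):
    # P(H0 is true | the result was stat-sig)
    true_null = 1.0 - success_rate
    stat_sig = alpha_one_tail * true_null + power * success_rate
    return alpha_one_tail * true_null / stat_sig

for name, rate in [("Microsoft", 0.33), ("Avinash Kaushik", 0.20), ("Bing", 0.15),
                   ("Booking.com/Google Ads/Netflix", 0.10), ("Airbnb Search", 0.08)]:
    print(f"{name:32s} {fpr(rate):.1%}")   # 5.9%, 11.1%, 15.0%, 22.0%, 26.4%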
Key Point: Surprising Results Require
Strong Evidence – Lower P-values
• Surprising results have low probability by definition
• To override a low probability, you want stronger evidence,
that is, lower p-value
• Twyman’s law—any figure that looks interesting or different is usually wrong
• “Extraordinary claims require extraordinary evidence” (ECREE) -- Carl Sagan
• Daryl Bem (2011) presented a paper with experiments on Extra-Sensory
Perception (ESP) and time reversal.
Most scientists assign a very low probability (e.g., 10⁻²⁰) to ESP or time reversal
(despite a standing $1M prize offer), so to accept such a claim, we need a very low p-value
• Best tool: replication
By rerunning the experiment once or twice more, the combined p-value can be
much lower (this also helps with other problems, such as multiple testing); see the sketch below
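
A minimal sketch of combining p-values from replication runs with Fisher’s method
(the two p-values below are made-up illustrations, not from the paper):

from scipy.stats import combine_pvalues

p_values = [0.03, 0.04]                  # two independent runs, each mildly significant
stat, combined_p = combine_pvalues(p_values, method="fisher")
print(combined_p)                        # ~0.009, much lower than either run alone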

© 2022 Ron Kohavi #7


Statistical Power
When I finally stumbled onto power analysis…
it was as if I had died and gone to heaven
-- Jacob Cohen (1990)

• Statistical power is the probability of detecting a meaningful difference
between the variants when there really is one, that is,
rejecting the null hypothesis when there is a true difference of δ
• For the recommended 80% power and a p-value threshold of 0.05, the simple formula is
𝑛 = 16 𝜎² / δ², where
• 𝑛 is the number of users in each variant
• 𝜎² is the variance of the metric, and
• δ is the sensitivity, the minimum change you want to detect
© 2022 Ron Kohavi #8
Example of Power Calculation
• For a website, you are interested in users purchasing something
• The conversion rate (user to purchaser) is 3.7%
• You are interested in ideas that improve the conversion rate by
(relative) 10% or more
• A lower percentage (e.g., 5%) would require more users, so we go aggressive
• That said, very few ideas generate a 10% improvement to a key metric like this.
At Bing, perhaps 1 in 10,000 experiments did (but your startup isn’t as optimized, so maybe)
• Plug this into the power formula, and you need >41,642 users in each variant (see the sketch below)
• GuessTheTest shared this example on 16 Dec 2021, but with…
only 80 users in each variant, and the result was stat-sig, showing a 337% improvement
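
A minimal sketch of the calculation above (3.7% conversion rate, 10% relative delta,
with the variance taken as p·(1−p) for a conversion metric):

p = 0.037                 # baseline conversion rate (user to purchaser)
delta = 0.10 * p          # minimum relative change to detect: 10% relative = 0.0037 absolute
variance = p * (1 - p)    # Bernoulli variance of the conversion metric
n = 16 * variance / delta ** 2
print(round(n))           # ~41,643 users needed in each variant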

© 2022 Ron Kohavi #9


Winner’s Curse
• A stat-sig result from a low-powered experiment has a high probability
of greatly exaggerating the actual effect
(Gelman and Carlin 2014)

• The power to detect a 10% relative delta in the prior example was 3%
• With such low power, the False Positive Risk is at least 63%,
so a stat-sig result is more likely to be wrong than right!
• To trust such a result, a p-value threshold of 0.002 should be used.
The actual p-value was 0.013 (paper has detailed computations); see the simulation sketch below
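
A minimal simulation sketch in the spirit of Gelman and Carlin’s “retrodesign”
analysis (the true effect and standard error below are illustrative, not the
GuessTheTest numbers): conditioned on reaching stat-sig, a low-powered estimate
greatly exaggerates the true effect.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
true_effect, se, alpha = 1.0, 3.0, 0.05       # tiny true effect relative to the noise
z_crit = norm.ppf(1 - alpha / 2)

estimates = rng.normal(true_effect, se, size=1_000_000)   # sampling distribution of the estimate
significant = np.abs(estimates / se) > z_crit              # which estimates reach stat-sig

power = significant.mean()                                 # ~6% with these numbers
exaggeration = np.abs(estimates[significant]).mean() / true_effect
print(f"power ~ {power:.0%}; stat-sig estimates exaggerate the true effect ~{exaggeration:.1f}x")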

© 2022 Ron Kohavi #10


Post-hoc Power Calculations are Noisy and Misleading
• Pre-experiment power allows one to say that the treatment effect is unlikely
to be large if we got a non-statistically significant result
(e.g., with 80% probability, it was under 10%)
• After an experiment is run, some compute the “post-hoc” power based on
the observed delta, a terrible idea
• This is like giving odds on a horse race after seeing the outcome (Greenland)
• The post-hoc power is simply a 1-1 mapping from the p-value and alpha
(see the sketch below)
• Anything stat-sig would map to power >= 50%
• Post-hoc power leads to a paradox of power reversal (Hoenig and Heisey)
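
A minimal sketch of that mapping for a two-sided z-test, treating the observed
delta as if it were the true effect (which is exactly why post-hoc power adds no
information beyond the p-value):

from scipy.stats import norm

def posthoc_power(p_value, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_obs = norm.ppf(1 - p_value / 2)      # observed |z| implied by the p-value
    return (1 - norm.cdf(z_alpha - z_obs)) + norm.cdf(-z_alpha - z_obs)

print(posthoc_power(0.05))    # ~0.50: a barely stat-sig result maps to ~50% "power"
print(posthoc_power(0.013))   # ~0.70: the p-value from the earlier GuessTheTest example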

© 2022 Ron Kohavi #11


Minimize Data Processing Options
Statistician: you have already calculated the p-value?
Surgeon: yes, I used multinomial logistic regression.
Statistician: Really? How did you come up with that?
Surgeon: I tried each analysis on the statistical software
dropdown menus, and that was the one
that gave the smallest p-value
-- Andrew Vickers (2009)

• Flexibility in data processing increases the false positive rate due to multiple
hypothesis testing
• Examples:
• Allowing users to stop experiments when stat-sig (as opposed to a fixed horizon); see the simulation sketch after this list
• Outlier removal (yes/no, or worse, some %)
• Selecting segments (stat-sig for males, or in country X)
• With additional degrees-of-freedom, the p-value threshold needs to be adjusted
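
A minimal simulation sketch of that first example, optional stopping (the batch size
and number of peeks below are illustrative): peeking at an A/A test after every batch
of users and stopping at the first stat-sig result inflates the false positive rate
well above the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_peeks, batch = 2_000, 20, 500
false_positives = 0
for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_peeks):                              # peek after every batch of users
        a = np.concatenate([a, rng.normal(size=batch)])   # A/A test: no true difference
        b = np.concatenate([b, rng.normal(size=batch)])
        if stats.ttest_ind(a, b).pvalue < 0.05:           # stop at the first stat-sig peek
            false_positives += 1
            break
print(false_positives / n_sims)   # far above 0.05 (several times the nominal rate with 20 peeks)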
© 2022 Ron Kohavi #12
Beware of Unequal Variants
• In theory, a shared control can be larger than the treatments
and you gain statistical power when comparing multiple
treatments to the control
• In practice, we shared cautionary notes:
• Triggering gets very complicated, as the control has to compute whether the user
would have triggered in each of the multiple treatments
• Cookie churn causes SRMs (Sample Ratio Mismatches) if the variants are of different sizes
• Shared resources, such as LRU caches, can give performance advantages to a larger variant

• The paper shares another reason: convergence to a normal distribution is faster
when variants are equal. Unequal variants caused material over-estimation of the
type-I error on one tail and under-estimation on the other (see the simulation sketch below)
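
A minimal simulation sketch of that last point (the skewed metric and the 90/10 split
below are illustrative): in A/A tests on a heavily skewed metric, the unequal split
rejects too often on one tail and too rarely on the other, while the equal split stays
close to the nominal 2.5% per tail.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 10_000
for n_control, n_treatment in [(1_000, 1_000), (1_800, 200)]:   # equal vs. 90/10 split
    upper = lower = 0
    for _ in range(n_sims):
        c = rng.lognormal(0.0, 2.0, n_control)     # A/A test: same skewed distribution
        t = rng.lognormal(0.0, 2.0, n_treatment)
        res = stats.ttest_ind(t, c, equal_var=False)   # Welch's t-test
        if res.pvalue < 0.05:
            upper += int(res.statistic > 0)
            lower += int(res.statistic < 0)
    print(n_control, n_treatment, upper / n_sims, lower / n_sims)   # nominal is 0.025 per tail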

© 2022 Ron Kohavi #13


Summary (1 of 2)
We shared five intuition busters and made recommendations on how to
address the issues
1. Surprising results require strong evidence—lower p-values.
Share FPR and apply Twyman’s law to surprising results.
Do replication runs and combine them for a lower p-value
2. Experiments with low statistical power are NOT trustworthy
Avoid running underpowered experiments.
Do not share “interesting” results from such experiments
3. Post-hoc power calculations are noisy and misleading
Do not show them.
If you see others using them, explain how misleading they are
© 2022 Ron Kohavi #14
Summary (2 of 2)
4. Minimize data processing options in experimentation platforms
Standard processing by default.
Optional processing should be specified pre-experiment
(or in replication run)
5. Beware of unequal variants
Make sure you check off all the concerns above before you use unequal variants

© 2022 Ron Kohavi #15


To learn more about A/B tests and controlled experiments,
I teach a 10-hour Zoom class (next one Aug 22).
See https://bit.ly/ABClassRKLI
Paper at https://bit.ly/ABTestingIntuitionBusters
These slides at https://bit.ly/ABTestingIntuitionBustersTalk

© 2022 Ron Kohavi #16
