A/B Testing Intuition Busters (KDD Talk)


Paper at http://bit.ly/ABTestingIntuitionBusters

@RonnyK © 2022 Ron Kohavi


Motivation
• “6 hours of debugging can save you 5 minutes of reading documentation”
-- Tweet by Jakob Cosoroabă
• Two weeks of deep-diving into experiment results can save you an hour of
reading this paper
• Make sure you understand p-values and statistical power
• Be skeptical of many published results

© 2022 Ron Kohavi #2


Tough Paper to Write
• I’ve been involved in A/B tests at Amazon (Weblab), Microsoft (led
ExP, the experimentation platform), and Airbnb (ERF).
I co-authored a book on experimentation that is usually in the top 10 in
Data Mining on Amazon, and I teach a quarterly Zoom class on A/B testing.
Alex was at Microsoft ExP and Airbnb.
Lukas was director of experimentation at Booking.com and is now at Vista
• We called out mistakes by multiple vendors and by book authors, and we
shared (really, shredded) an example from GuessTheTest where several
mistakes were made (with the owner’s permission)
• The meta-reviewer wrote
I would very much encourage the authors to reread it and tone it down in parts…
If not done so, the paper would in fact embarrass KDD for years…

© 2022 Ron Kohavi #3


P-Values
Misinterpretation and abuse of statistical tests,
confidence intervals, and statistical power have
been decried for decades, yet remain rampant.
A key problem is that there are no interpretations of these
concepts that are at once simple, intuitive, correct, and foolproof
-- Greenland et al (2016)
• Vendors (e.g., Optimizely) try to hide the complexity by calling 1-(p-value)
“confidence,” but this is misleading, as confidence is NOT the probability that
the result is a true positive. Documentation is often wrong.
• Book authors frequently get it wrong (see paper for examples)

© 2022 Ron Kohavi #4


What We Want vs. What We Get
• We run an A/B test, and we want the probability that
B is better than A on some metric, say conversion rate,
given the data we observed in the test
• Technically, we define 𝐻0 , the null hypothesis, as the hypothesis that there
is no difference, and we want:
1-P(𝐻0 | data observed)
• What p-value gives us is the opposite conditional probability:
P(Δ observed or more extreme | 𝐻0 is true)
• It makes no sense to associate the p-value with the probability of 𝐻0 because
its definition assumes 𝐻0 is true

© 2022 Ron Kohavi #5


We Need a Prior for 𝐻0
• To get the probability that we want, we need to apply
Bayes Rule, which requires a prior 𝑃(𝐻0 )
• FPR, or False Positive Risk, is a proxy: the probability that 𝐻0 is true (no
real effect) given that the test was statistically significant
• Assuming the experiment was properly run with 80% power (discussed
later), here is a useful table relating companies’ reported success rates to FPR,
using a p-value threshold (alpha) of 0.05 for statistical significance
(a small sketch of the computation appears after the table)
Company / Source                    Success Rate    FPR
Microsoft                           33%             5.9%
Avinash Kaushik                     20%             11.1%
Bing                                15%             15.0%
Booking.com, Google Ads, Netflix    10%             22.0%
Airbnb Search                       8%              26.4%

© 2022 Ron Kohavi #6
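
The FPR values in the table can be reproduced with the standard formula
FPR = α·(1−π) / (α·(1−π) + power·π), where π is the prior success rate. Here is a
minimal Python sketch, assuming 80% power and assuming that only the favorable tail
of the two-sided 0.05 test (i.e., α = 0.025) counts as a win:

def fpr(success_rate, alpha_one_tail=0.025, power=0.80):
    # P(H0 is true | the result was stat-sig)
    true_null = 1.0 - success_rate
    stat_sig = alpha_one_tail * true_null + power * success_rate
    return alpha_one_tail * true_null / stat_sig

for name, rate in [("Microsoft", 0.33), ("Avinash Kaushik", 0.20), ("Bing", 0.15),
                   ("Booking.com/Google Ads/Netflix", 0.10), ("Airbnb Search", 0.08)]:
    print(f"{name:32s} {fpr(rate):.1%}")   # 5.9%, 11.1%, 15.0%, 22.0%, 26.4%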
Key Point: Surprising Results Require
Strong Evidence – Lower P-values
• Surprising results have low probability by definition
• To override a low probability, you want stronger evidence,
that is, lower p-value
• Twyman’s law—any figure that looks interesting or different is usually wrong
• “Extraordinary claims require extraordinary evidence” (ECREE) -- Carl Sagan
• Daryl Bem (2011) presented a paper with experiments on Extra-Sensory
Perception (ESP) and time reversal.
Most scientists assign a very low probability (e.g., 10⁻²⁰) to ESP or time reversal
(despite a standing $1M prize offer), so to accept such a claim, we need a very low p-value
• Best tool: replication
By rerunning the experiment once or twice more, the combined p-value can be
much lower (this also helps with other problems, such as multiple testing); see the sketch below
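
A minimal sketch of combining p-values from replication runs with Fisher’s method
(the two p-values below are made-up illustrations, not from the paper):

from scipy.stats import combine_pvalues

p_values = [0.03, 0.04]                  # two independent runs, each mildly significant
stat, combined_p = combine_pvalues(p_values, method="fisher")
print(combined_p)                        # ~0.009, much lower than either run alone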

© 2022 Ron Kohavi #7


Statistical Power
When I finally stumbled onto power analysis…
it was as if I had died and gone to heaven
-- Jacob Cohen (1990)

• Statistical power is the probability of detecting a meaningful difference
between the variants when there really is one, that is,
rejecting the null hypothesis when there is a true difference of δ
• For the recommended 80% power and a p-value threshold of 0.05, the simple formula is
𝑛 = 16 𝜎² / δ², where
• 𝑛 is the number of users in each variant
• 𝜎² is the variance of the metric, and
• δ is the sensitivity, the minimum change you want to detect
© 2022 Ron Kohavi #8
Example of Power Calculation
• For a website, you are interested in users purchasing something
• The conversion rate (user to purchaser) is 3.7%
• You are interested in ideas that improve the conversion rate by
(relative) 10% or more
• A lower percentage (e.g., 5%) would require more users, so we go aggressive
• That said, very few ideas generate a 10% improvement to a key metric like this.
At Bing, perhaps 1 in 10,000 experiments did (but your startup isn’t as optimized, so maybe)
• Plug this into the power formula, and you need >41,642 users in each variant (see the sketch below)
• GuessTheTest shared this example on 16 Dec 2021, but with…
only 80 users in each variant, and the result was stat-sig, showing a 337% improvement
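
A minimal sketch of the calculation above (3.7% conversion rate, 10% relative delta,
with the variance taken as p·(1−p) for a conversion metric):

p = 0.037                 # baseline conversion rate (user to purchaser)
delta = 0.10 * p          # minimum relative change to detect: 10% relative = 0.0037 absolute
variance = p * (1 - p)    # Bernoulli variance of the conversion metric
n = 16 * variance / delta ** 2
print(round(n))           # ~41,643 users needed in each variant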

© 2022 Ron Kohavi #9


Winner’s Curse
• A stat-sig result from a low-powered experiment has a high probability
of greatly exaggerating the actual effect
(Gelman and Carlin 2014)

• The power to detect a 10% relative delta in the prior example was 3%
• With such low power, the False Positive Risk is at least 63%,
so a stat-sig result is more likely to be wrong than right!
• To trust such a result, a p-value threshold of 0.002 should be used.
The actual p-value was 0.013 (paper has detailed computations); see the simulation sketch below
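
A minimal simulation sketch in the spirit of Gelman and Carlin’s “retrodesign”
analysis (the true effect and standard error below are illustrative, not the
GuessTheTest numbers): conditioned on reaching stat-sig, a low-powered estimate
greatly exaggerates the true effect.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
true_effect, se, alpha = 1.0, 3.0, 0.05       # tiny true effect relative to the noise
z_crit = norm.ppf(1 - alpha / 2)

estimates = rng.normal(true_effect, se, size=1_000_000)   # sampling distribution of the estimate
significant = np.abs(estimates / se) > z_crit              # which estimates reach stat-sig

power = significant.mean()                                 # ~6% with these numbers
exaggeration = np.abs(estimates[significant]).mean() / true_effect
print(f"power ~ {power:.0%}; stat-sig estimates exaggerate the true effect ~{exaggeration:.1f}x")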

© 2022 Ron Kohavi #10


Post-hoc Power Calculations are Noisy and Misleading
• Pre-experiment power allows one to say that the treatment effect is unlikely
to be large if we got a non-statistically significant result
(e.g., with 80% probability, it was under 10%)
• After an experiment is run, some compute the “post-hoc” power based on
the observed delta, a terrible idea
• This is like giving odds on a horse race after seeing the outcome (Greenland)
• The post-hoc power is simply a 1-1 mapping from the p-value and alpha
(see the sketch below)
• Anything stat-sig would map to power >= 50%
• Post-hoc power leads to a paradox of power reversal (Hoenig and Heisey)
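
A minimal sketch of that mapping for a two-sided z-test, treating the observed
delta as if it were the true effect (which is exactly why post-hoc power adds no
information beyond the p-value):

from scipy.stats import norm

def posthoc_power(p_value, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_obs = norm.ppf(1 - p_value / 2)      # observed |z| implied by the p-value
    return (1 - norm.cdf(z_alpha - z_obs)) + norm.cdf(-z_alpha - z_obs)

print(posthoc_power(0.05))    # ~0.50: a barely stat-sig result maps to ~50% "power"
print(posthoc_power(0.013))   # ~0.70: the p-value from the earlier GuessTheTest example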

© 2022 Ron Kohavi #11


Minimize Data Processing Options
Statistician: you have already calculated the p-value?
Surgeon: yes, I used multinomial logistic regression.
Statistician: Really? How did you come up with that?
Surgeon: I tried each analysis on the statistical software
dropdown menus, and that was the one
that gave the smallest p-value
-- Andrew Vickers (2009)

• Flexibility in data processing increases the false positive rate due to multiple
hypothesis testing
• Examples:
• Allowing users to stop experiments when stat-sig (as opposed to a fixed horizon); see the simulation sketch after this list
• Outlier removal (yes/no, or worse, some %)
• Selecting segments (stat-sig for males, or in country X)
• With additional degrees-of-freedom, the p-value threshold needs to be adjusted
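
A minimal simulation sketch of that first example, optional stopping (the batch size
and number of peeks below are illustrative): peeking at an A/A test after every batch
of users and stopping at the first stat-sig result inflates the false positive rate
well above the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_peeks, batch = 2_000, 20, 500
false_positives = 0
for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_peeks):                              # peek after every batch of users
        a = np.concatenate([a, rng.normal(size=batch)])   # A/A test: no true difference
        b = np.concatenate([b, rng.normal(size=batch)])
        if stats.ttest_ind(a, b).pvalue < 0.05:           # stop at the first stat-sig peek
            false_positives += 1
            break
print(false_positives / n_sims)   # far above 0.05 (several times the nominal rate with 20 peeks)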
© 2022 Ron Kohavi #12
Beware of Unequal Variants
• In theory, a shared control can be larger than the treatments
and you gain statistical power when comparing multiple
treatments to the control
• In practice, we shared cautionary notes:
• Triggering gets very complicated, as the control has to compute whether the user
would have triggered in each of the multiple treatments
• Cookie churn causes SRMs (Sample Ratio Mismatches) if the variants are of different sizes
• Shared resources, such as LRU caches, can give performance advantages to a larger variant

• The paper shares another reason: convergence to a normal distribution is faster
when variants are equal. Unequal variants caused material over-estimation of the
type-I error on one tail and under-estimation on the other (see the simulation sketch below)
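
A minimal simulation sketch of that last point (the skewed metric and the 90/10 split
below are illustrative): in A/A tests on a heavily skewed metric, the unequal split
rejects too often on one tail and too rarely on the other, while the equal split stays
close to the nominal 2.5% per tail.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 10_000
for n_control, n_treatment in [(1_000, 1_000), (1_800, 200)]:   # equal vs. 90/10 split
    upper = lower = 0
    for _ in range(n_sims):
        c = rng.lognormal(0.0, 2.0, n_control)     # A/A test: same skewed distribution
        t = rng.lognormal(0.0, 2.0, n_treatment)
        res = stats.ttest_ind(t, c, equal_var=False)   # Welch's t-test
        if res.pvalue < 0.05:
            upper += int(res.statistic > 0)
            lower += int(res.statistic < 0)
    print(n_control, n_treatment, upper / n_sims, lower / n_sims)   # nominal is 0.025 per tail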

© 2022 Ron Kohavi #13


Summary (1 of 2)
We shared five intuition busters and made recommendations on how to
address the issues
1. Surprising results require strong evidence—lower p-values.
Share FPR and apply Twyman’s law to surprising results.
Do replication runs and combine them for a lower p-value
2. Experiments with low statistical power are NOT trustworthy
Avoid running underpowered experiments.
Do not share “interesting” results from such experiments
3. Post-hoc power calculations are noisy and misleading
Do not show them.
If you see others using them, explain how misleading they are
© 2022 Ron Kohavi #14
Summary (2 of 2)
4. Minimize data processing options in experimentation platforms
Standard processing by default.
Optional processing should be specified pre-experiment
(or in replication run)
5. Beware of unequal variants
Make sure you check off all the concerns above before you use unequal variants

© 2022 Ron Kohavi #15


To learn more about A/B tests and controlled experiments,
I teach a 10-hour Zoom class (next one Aug 22).
See https://bit.ly/ABClassRKLI
Paper at https://bit.ly/ABTestingIntuitionBusters
These slides at https://bit.ly/ABTestingIntuitionBustersTalk

© 2022 Ron Kohavi #16
