Applying The Delta Method in Metric Analytics: A Practical Guide With Novel Ideas


Applying the Delta Method in Metric Analytics:

A Practical Guide with Novel Ideas


Alex Deng Ulf Knoblich Jiannan Lu∗
Microsoft Corporation Microsoft Corporation Microsoft Corporation
Redmond, WA Redmond, WA Redmond, WA
alexdeng@microsoft.com ulfk@microsoft.com jiannl@microsoft.com

ABSTRACT variance of a normal distribution from independent and identically


arXiv:1803.06336v4 [stat.AP] 12 Sep 2018

During the last decade, the information technology industry has distributed (i.i.d.) observations, we only need to obtain their sum
adopted a data-driven culture, relying on online metrics to measure and sum of squares, which are the corresponding summary statis-
and monitor business performance. Under the setting of big data, tics1 and can be trivially computed in a distributed fashion. In data-
the majority of such metrics approximately follow normal distribu- driven businesses such as information technology, these summary
tions, opening up potential opportunities to model them directly statistics are often referred to as metrics, and used for measuring
without extra model assumptions and solve big data problems via and monitoring key performance indicators [15, 18]. In practice, it
closed-form formulas using distributed algorithms at a fraction of is often the changes or differences between metrics, rather than
the cost of simulation-based procedures like bootstrap. However, measurements at the most granular level, that are of greater inter-
certain attributes of the metrics, such as their corresponding data est. In the context of randomized controlled experimentation (or
generating processes and aggregation levels, pose numerous chal- A/B testing), inferring the changes of metrics can establish causal-
lenges for constructing trustworthy estimation and inference pro- ity [36, 44, 50], e.g., whether a new user interface design causes
cedures. Motivated by four real-life examples in metric develop- longer view times and more clicks.
ment and analytics for large-scale A/B testing, we provide a prac-
tical guide to applying the Delta method, one of the most impor- 1.2 Central limit theorem and Delta method
tant tools from the classic statistics literature, to address the afore- We advocate directly modeling the metrics rather than the origi-
mentioned challenges. We emphasize the central role of the Delta nal data-set. When analyzing data at the most granular level (e.g.,
method in metric analytics by highlighting both its classic and user), we need some basic assumptions of the underlying proba-
novel applications. bilistic model, such as i.i.d. observations. When looking at the met-
rics level, we also need to know their joint distribution. This is
KEYWORDS where the blessing of big data comes into play. Given large sample
A/B testing, big data, distributed algorithm, large sample theory, sizes, the metrics often possess desirable asymptotic properties due
online metrics, randomization, quantile inference, longitudinal study to the central limit theorem [45]. To ensure that the paper is self-
contained, we first review the central limit theorem in its most
1 INTRODUCTION well-known form. Let X 1 , . . . , X n be n i.i.d. observations with fi-
nite mean µ and variance σ 2 > 0. We let X̄ denote the sample
1.1 Background average, then as the sample size n → ∞, in distribution
The era of big data brings both blessings and curses [21]. On one
√n(X̄ − µ)/σ → N (0, 1).
hand, it allows us to measure more accurately and efficiently, to
study smaller and more subtle effects, and to tackle problems with A common application of the central limit theorem is to√construct
smaller signal-to-noise-ratio. On the other hand, it demands larger the 100(1 − α)% confidence interval of µ as X̄ ± z α /2σ / n, where
storage and more intensive computations, forcing the data science z α /2 is the (α/2)th quantile for N (0, 1). This is arguably one of the
community to strive for efficient algorithms which can run in par- most influential results of asymptotic statistics used in almost all
allel in a distributed system. Many existing algorithms and meth- scientific fields whenever an estimate with error bars is presented.
ods (e.g., support vector machines) that are known to work well While an influential result, the central limit theorem in its basic
in small data scenarios do not scale well in a big data setting [12, form only applies to the average of i.i.d. random variables, and in
25, 47]. Recently, there has been an increasing amount of research practice our metrics are often more complex. To enjoy the blessing
interest in meta-algorithms, which can extend algorithms that are of big data, we employ the Delta method, which extends the nor-
difficult to parallelize into distributed algorithms [6, 52], and ideas mal approximations from the central limit theorem broadly. For
that resemble the divide-and-conquer algorithm [31, 34]. illustration we only review the uni-variate case. For any random
At the same time, there is a class of algorithms which are triv- variable Tn (the subscript indicates its dependency on n, e.g., sam-
ially parallelizable and therefore can downsize big data problems ple average) and constant θ such that √n(Tn − θ) → N (0, 1) in
into smaller ones. The key idea behind them, which dates back to distribution as n → ∞, the Delta method allows us to extend its
Fisher [24], is to summarize the original data set by a low-dimensional asymptotic normality to any continuous transformation ϕ(Tn ). To
vector of summary statistics, which can often be computed in a par- be more specific, by using the fact that Tn − θ = O(1/√n) and the
allel and distributed way. For example, to estimate the mean and
1 In fact, they are sufficient statistics [10], i.e., they can represent the original data-set
∗ The authors are listed in alphabetical order. perfectly without losing any information.
first order Taylor expansion [43] to alternative methods that need to put up a model for in-
dividual subjects, our method requires less model assump-
ϕ(Tn ) − ϕ(θ) = ϕ ′ (θ)(Tn − θ) + O {(Tn − θ)2 }, (1)
tions.
we have in distribution The main purpose of this paper is to promote the Delta method
√n {ϕ(Tn ) − ϕ(θ)} → N (0, ϕ ′ (θ)2 ). as a general, memorable and efficient tool for metric analytics. In
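As a quick numerical check of this result (our own addition, not part of the original derivation), the following Python sketch compares the Delta method standard error for the transformation ϕ(x) = log x against a Monte Carlo estimate; all names and constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 5000, 2.0, 1.0

# Delta-method standard error of phi(Xbar) with phi(x) = log(x):
# Var{log(Xbar)} ~ phi'(mu)^2 * sigma^2 / n = sigma^2 / (n * mu^2).
se_delta = sigma / (np.sqrt(n) * mu)

# Monte Carlo check: empirical standard deviation of log(Xbar) over many samples.
reps = np.array([np.log(rng.normal(mu, sigma, n).mean()) for _ in range(2000)])
print(se_delta, reps.std(ddof=1))   # the two numbers should be close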
particular, our contributions to the existing data mining literature
This is the Delta method. Without relying on any assumptions
include:
other than “big data,” the Delta method is general. It is also mem-
orable – anyone with basic knowledge of calculus can derive it. • Practical guidance for scenarios such as inferring ratio met-
Moreover, the calculation is trivially parallelizable and can be eas- rics, where the Delta method offers scalable and easier-to-
ily implemented in a distributed system. Nevertheless, although implement solutions;
conceptually and theoretically straightforward, the practical diffi- • A novel and computationally efficient solution to estimate
culty is to find the right “link” function ϕ that transforms the sim- the sampling variance of quantile metrics;
ple average to our desired metric. Because of different attributes • A novel data augmentation technique that employs the Delta
of the metrics, such as the underlying data generating process and method to model metrics resulting from within-subject or
aggregation levels, the process of discovering the corresponding longitudinal studies with unspecified missing data patterns.
transformation can be challenging. However, unfortunately, although For reproducibility and knowledge transfer, we provide all relevant
various applications of the Delta method have previously appeared computer programs at https://aka.ms/exp/deltamethod.
in the data mining literature [16, 36, 39, 44], the method itself and
the discovery of ϕ were often deemed technical details and only 2 INFERRING PERCENT CHANGES
briefly mentioned or relegated to appendices. Motivated by this
gap, we aim to provide a practical guide that highlights the ef- 2.1 Percent change and Fieller interval
fectiveness and importance of the Delta method, hoping to help Measuring change is a common theme in applied data science. In
fellow data scientists, developers and applied researchers conduct online A/B testing [15, 17, 36, 37, 44], we estimate the average
trustworthy metric analytics. treatment effect (ATE) by the difference of the same metric mea-
sured from treatment and control groups, respectively. In time se-
1.3 Scope and contributions ries analyses and longitudinal studies, we often track a metric over
As a practical guide, this paper presents four applications of the time and monitor changes between different time points. For il-
Delta method in real-life scenarios, all of which have been deployed lustration, let X 1 , . . . , X n be i.i.d. observations from the control
in Microsoft’s online A/B testing platform ExP [35, 36] and em- group with mean µ x and variance σx2 , Y1 , . . . , Yn i.i.d. observations
ployed to analyze experimental results on a daily basis: from the treatment group with mean µy and variance σy2 , and σxy
the covariance4 between X i ’s and Yj ’s. Let X̄ = Σ_{i=1}^n X i /n and
• In Section 2, we derive the asymptotic distributions of ra-
tio metrics2 . Compared with standard approaches by Fieller Ȳ = Σ_{i=1}^n Yi /n be two measurements of the same metric from
[22, 23], the Delta method provides a much simpler and yet the treatment and control groups, respectively, and their difference
∆ˆ = Ȳ − X̄ is an unbiased estimate of the ATE ∆ = µy − µ x . Be-
almost equally accurate and effective solution;
• In Section 3, we analyze cluster randomized experiments, cause both X̄ and Ȳ are approximately normally distributed, their
difference ∆ ˆ also follows an approximate normal distribution with
where the Delta method offers an efficient alternative algo-
rithm to standard statistical machinery known as the mixed mean ∆ and variance
effect model [4, 26], and provides unbiased estimates; ˆ = (σy2 + σx2 − 2σxy )/n.
Var(∆)
• In Section 4, by extending the Delta method to outer con-
fidence intervals [42], we propose a novel hybrid method Consequently, the well-known 100(1-α)% confidence interval of ∆
to construct confidence intervals for quantile metrics with is ∆ˆ ± z α × Var( ˆ where Var(
c ∆), c ∆) ˆ is the finite-sample analogue
almost no extra computational cost3 . Unlike most existing of Var(∆), ˆ and can be computed using the sample variances and
methods, our proposal does not require repeated simulations covariance of the treatment and control observations, denoted as
as in bootstrap, nor does it require estimating an unknown sy2 , s x2 and s xy respectively.
density function, which itself is often a theoretically chal- In practice, however, absolute differences as defined above are of-
lenging and computationally intensive task, and has been a ten hard to interpret because they are not scale-invariant. Instead,
center of criticism [7]; we focus on the relative difference or percent change ∆% = (µy −
• In Section 5, we handle missing data in within-subject stud- µ x )/µ x , estimated by ∆%ˆ = (Ȳ − X̄ )/X̄ . The key problem of this sec-
ies by combining the Delta method with data augmentation. tion is constructing a 100(1-α)% confidence interval for ∆%. ˆ For this
We demonstrate the effectiveness of “big data small prob- classic problem, Fieller [22, 23] seems to be the first to provide a so-
lem” approach when directly modeling metrics. Comparing lution. To be specific, let tr (α) denote the (1 − α)th quantile for the
2 Pedantically, ratio of two measurements of the same metric, from the treatment and
t−distribution with r degrees of freedom, and д = t α /2 (r )2 s x2 /(n X̄ 2 ),
control groups.
3 Our procedure computes two more quantiles for confidence interval. Since the main
4 For
A/B testing where the treatment and control groups are independently sampled
cost of quantile computing is sorting the data, computing three and one quantiles cost
almost the same. from a super population, σx y = 0.
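As a minimal sketch of the absolute-difference interval described in this subsection (our own illustration; the function name is ours, and s_xy is set to zero when the two groups are independently sampled, as noted in footnote 4):

import numpy as np
from scipy import stats

def abs_change_ci(x, y, alpha=0.05, sxy=0.0):
    """Sketch: 100(1 - alpha)% CI for the ATE Delta = mu_y - mu_x (Section 2.1).
    x, y: equally sized samples from control and treatment; sxy is the sample
    covariance, zero when the two groups are independently sampled."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    delta_hat = y.mean() - x.mean()
    var_hat = (y.var(ddof=1) + x.var(ddof=1) - 2 * sxy) / n
    z = stats.norm.ppf(1 - alpha / 2)
    return delta_hat - z * np.sqrt(var_hat), delta_hat + z * np.sqrt(var_hat)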
then Fieller’s interval of ∆% is

1/(1 − д) { Ȳ /X̄ − 1 − д s xy /s x2 ± (t α /2 (r )/(√n X̄ )) √[ sy2 − 2(Ȳ /X̄ )s xy + (Ȳ 2 /X̄ 2 )s x2 − д(sy2 − s xy 2 /s x2 ) ] } (2)

2.3 Numerical examples
To illustrate the performance of Fieller’s interval in (2), the Delta
method interval in (4), and the Edgeworth interval with and with-
out bias correction under different scenarios, we let the sample
size n = 20, 50, 200, 2000. For each fixed n, we assume that the
treatment and control groups are independent, and consider three
simulation models for i.i.d. experimental units i = 1, . . . , n :
(1) Normal: X i ∼ N (µ = 1, σ = 0.1), Yi ∼ N (µ = 1.1, σ = 0.1);
Although widely considered as the standard approach for estimat- (2) Poisson: X i ∼ Pois(λ = 1), Yi ∼ Pois(λ = 1.1);
ing variances of percent changes [38], deriving (2) is cumbersome (3) Bernoulli: X i ∼ Bern(p = 0.5), Yi ∼ Bern(p = 0.6).
(see [46] for a modern revisit). The greater problem is that, even The above models aim to mimic the behaviors of our most com-
when given this formula, applying it often requires quite some ef- mon metrics, discrete or continuous. For each case, we repeatedly
fort. According to (2) we need to not only estimate the sample vari- sample M = 10, 000 data sets, and for each data set we construct
ances and covariance, but also the parameter д. Fieller’s and the Delta method intervals, respectively, and then add
correction to the Delta method result. We also construct the Edge-
2.2 Delta method and Edgeworth correction worth expansion interval without and with the bias correction.
The Delta method provides a more intuitive alternative solution. Table 1: Simulated examples: The first two columns contain simulation mod-
els and sample sizes. The next five columns present the coverage rates of various
Although they can be found in classic textbooks such as [10], this 95% confidence intervals – Fieller’s, Delta method based (w/o and w/ bias correc-
paper (as a practical guide) still provides all the relevant technical tion) and Edgeworth expansion based (w/o and w/ bias correction).
details. We let Tn = (Ȳ , X̄ ), θ = (µy , µ x ) and ϕ(x, y) = y/x. A multi- Method n Fieller Delta Delta(BC) Edgeworth Edgeworth(BC)
variate analogue of (1) suggests that ϕ(Tn )−ϕ(θ) ≈ ∇ϕ(θ)·(Tn −θ), Normal 20 0.9563 0.9421 0.9422 0.9426 0.9426
which implies that Normal 50 0.9529 0.9477 0.9477 0.9478 0.9477
Normal 200 0.9505 0.9490 0.9491 0.9490 0.9490
Normal 2000 0.9504 0.9503 0.9503 0.9503 0.9503
Ȳ /X̄ − µy /µ x ≈ (1/µ x )(Ȳ − µy ) − (µy /µ x2 )(X̄ − µ x ) (3)
Poisson 20 0.9400 0.9322 0.9370 0.9341 0.9396
Poisson 50 0.9481 0.9448 0.9464 0.9464 0.9478
Poisson 200 0.9500 0.9491 0.9493 0.9496 0.9498
For i = 1, . . . , n, let Wi = Yi /µ x − µy X i /µ x2 , which are also i.i.d. Poisson 2000 0.9494 0.9494 0.9495 0.9494 0.9495
observations. Consequently, we can re-write (3) as Ȳ /X̄ − µy /µ x ≈
Bernoulli 20 0.9539 0.9403 0.9490 0.9476 0.9521
Σ_{i=1}^n Wi /n, leading to the Delta method based confidence interval Bernoulli 50 0.9547 0.9507 0.9484 0.9513 0.9539
Bernoulli 200 0.9525 0.9513 0.9509 0.9517 0.9513
s Bernoulli 2000 0.9502 0.9500 0.9499 0.9501 0.9500
Ȳ /X̄ − 1 ± (z α /2 /(√n X̄ )) √[ sy2 − 2(Ȳ /X̄ )s xy + (Ȳ 2 /X̄ 2 )s x2 ] (4)
|{z} | {z } which we can draw several conclusions. For big data (n ≥ 200),
point estimate uncertainty quantification all methods achieve nominal (i.e., ≈ 95%) coverage rates for all
simulation models. For small data (n ≤ 50), although Fieller’s in-
(4) is easier to implement than (2), and is in fact the limit of (2) in terval seems more accurate for some simulation models (e.g., nor-
the large sample case, because as n → ∞, we have t α /2 (r ) → z α /2, mal), other methods perform comparably, especially after the bias
д → 0, and Var(X̄ ) ∼ O(1/n). Although Fieller’s confidence inter- correction. For simplicity in implementation and transparency in
val can be more accurate for small samples [30], this benefit ap- applications, we recommend Algorithm 1, which uses the Delta
pears rather limited for big data. Moreover, the Delta method can method based interval (4) with the bias correction.
also be easily extended for a better approximation by using Edge- Algorithm 1 Confidence interval for ratio: Delta method + bias correc-
worth expansion [5, 29]. To be more specific, (3) suggests that in tion
distribution √n (Ȳ /X̄ − µy /µ x )/σw → ν, whose cumulative dis- 1: function deltaci(X = X 1, . . . , X n ; Y1, . . . , Yn ; α = 0.05)
tribution function 2: X̄ = mean(X ); Ȳ = mean(Y );
3: s x2 = var(X ); sy2 = var(Y ); s xy = cov(X, Y );
F (t) = Φ(t) − κw (t 2 − 1)ϕ(t)/(6√n) 4: bc = Ȳ /X̄ 3 × s x2 /n − 1/X̄ 2 × s xy /n ⊲ bias correction term
5: pest = Ȳ /X̄ − 1 + bc ⊲ point estimate
contains a correction term in addition to Φ(t), the cumulative dis- 6: vest = sy2 /X̄ 2 − 2 × Ȳ /X̄ 3 ∗ s xy + Ȳ 2 /X̄ 4 ∗ s x2
tribution function of the standard normal, and κw , the skewness 7: return: pest ± z1−α /2 × √(vest/n) ⊲ 100(1 − α ) confidence interval
of the Wi ’s. By simply replacing the terms “±z α /2” in (4) with ν α /2
and ν 1−α /2, the (α/2)th and (1 − α/2)th quantiles of ν, respectively,
3 DECODING CLUSTER RANDOMIZATION
we obtain the corresponding Edgeworth expansion based confi-
dence interval. Finally, we can add a second-order bias correction 3.1 The variance estimation problem
term (Ȳs x2 /X̄ − s xy )/(nX̄ )2 to the point estimate in (4); the same Two key concepts in a typical A/B test are the randomization unit
correct term can be applied to the Edgeworth expansion based in- – the granularity level where sampling or randomization is per-
terval. formed, and the analysis unit – the aggregation level of metric
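The following NumPy sketch mirrors Algorithm 1. The paper’s reference programs are available at https://aka.ms/exp/deltamethod; this stand-alone version is our own illustration and assumes paired samples of equal length so that the covariance term can be estimated (set it to zero for independently sampled treatment and control groups).

import numpy as np
from scipy import stats

def delta_ci(x, y, alpha=0.05):
    """Sketch of Algorithm 1: Delta-method CI for Ybar/Xbar - 1 with bias correction."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]                      # use 0.0 for independent groups
    bc = ybar / xbar**3 * sx2 / n - sxy / (n * xbar**2)   # bias correction term
    pest = ybar / xbar - 1 + bc                           # point estimate
    vest = sy2 / xbar**2 - 2 * ybar / xbar**3 * sxy + ybar**2 / xbar**4 * sx2
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(vest / n)
    return pest - half, pest + half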
computation. Analysis is straightforward when the randomization the “fixed” effect for the treatment indicator term6 . Stan [9] offers
and analysis units agree [14], e.g., when randomizing by user while a Markov Chain Monte Carlo (MCMC) implementation aiming to
also computing the average revenue per user. However, often the infer the posterior distribution of those parameters, but this needs
randomization unit is a cluster of analysis units (it cannot be more significant computational effort for big data. Moreover, the esti-
granular than the analysis unit, otherwise the analysis unit would mated ATE, i.e., the coefficient for the treatment assignment indica-
contain observations under both treatment and control, nullifying tor, is for the randomization unit (i.e., cluster) but not the analysis
the purpose of differentiating the two groups). Such cases, some- unit level, because it treats all clusters with equal weights and can
times referred to as cluster randomized experiments in the econo- be viewed as the effect on the double average Σ_i (Σ_j Yi j /N j )/K,
metrics and statistics literature [1, 33], are quite common in prac- which is usually different than the population average Ȳ [15]. This
tice, e.g., enterprise policy prohibiting users within the same orga- distinction doesn’t make a systematic difference when effects across
nization from different experiences, or the need to reduce bias in clusters are homogeneous. However, in practice the treatment ef-
the presence of network interference [2, 20, 27]. Perhaps more ubiq- fects are often heterogeneous, and using mixed effect model esti-
uitously, for the same experiment we usually have metrics with dif- mates without further adjustment steps could lead to severe biases.
ferent analysis units. For example, to meet different business needs, On the contrary, the Delta method solves the problem directly
Í Í Í
most user-randomized experiments run by ExP contain both user- from the metric definition. Re-write Ȳ into i ( j Yi j )/ i Ni . Let
Í
level and page-level metrics. Si = j Yi j , and divide both the numerator and denominator by
We consider two average metrics5 of the treatment and con- K, Í
trol groups, assumed to be independent. Without loss of gener- Si /K
Ȳ = Í i ¯ N̄ .
= S/
ality, we only focus on the treatment group with K clusters. For i i /K
N
i = 1, . . . , K, the ith cluster contains Ni analysis unit level obser- Albeit not an average of i.i.d. random variables, Ȳ is a ratio of two
vations Yi j (j =1, . . . , Ni ). Then the corresponding average metric averages of i.i.d. randomization unit level quantities [14]. There-
Í Í
is Ȳ = i, j Yi j i N i . We assume that within each cluster the ob- fore, by following (3) in Section 2.2,
servations Yi j ’s are i.i.d. with mean µ i and variance σi2 , and across !
clusters (µ i , σi , Ni ) are also i.i.d. 1 2 µS µ S2 2
Var(Ȳ ) ≈ (1/(K µ N 2 )) { σS 2 − 2 (µ S /µ N ) σS N + (µ S 2 /µ N 2 ) σ N 2 }. (6)
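A compact sketch of equation (6) as a function of per-cluster summary statistics (our own illustration; the function name and interface are ours):

import numpy as np

def clustered_mean_var(S, N):
    """Sketch of equation (6): Delta-method variance of Ybar = sum(S_i)/sum(N_i),
    where S_i is the within-cluster sum and N_i the cluster size, i = 1..K."""
    S, N = np.asarray(S, float), np.asarray(N, float)
    K = len(S)
    mu_s, mu_n = S.mean(), N.mean()
    cov = np.cov(S, N, ddof=1)            # 2x2 matrix: var_S, cov_SN, var_N
    return (cov[0, 0] - 2 * mu_s / mu_n * cov[0, 1]
            + mu_s**2 / mu_n**2 * cov[1, 1]) / (K * mu_n**2)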
3.2 Existing and Delta method based solutions
Therefore, the variance of Ȳ depends on the variance of a centered
Ȳ is not an average of i.i.d. random variables, and the crux of our version of Si (by a multiple of Ni , not the constant E(Si ) as typi-
analysis is to estimate its variance. Under strict assumptions, closed- cally done). Intuitively, both the variance of the cluster size Ni , and
form solutions for this problem exist [19, 33]. For example, when Í
of the within-cluster sum of observations Si = j Yi j , contribute
Ni = m and σi2 = σ 2 for all i, to (6). In particular, σ N 2 = Var(Ni ) is an important contributor of the
variance of Ȳ , leading to a practical recommendation for the exper-
Var(Ȳ ) = {(σ 2 + τ 2 )/(Km)} {1 + (m − 1)ρ}, (5) imenters – it is desirable to make the cluster sizes homogeneous;
otherwise it can be difficult to detect small treatment effects due
where τ 2 = Var(µ i ) is the between-cluster variance and ρ = τ 2 /(σ 2 +
to low statistical power.
τ 2 ) is the coefficient of intra-cluster correlation, which quantifies the
contribution of between-cluster variance to the total variance. To
3.3 Numerical examples
facilitate understanding of the variance formula (5), two extreme
cases are worth mentioning: Because the coverage property of (6) has been extensively covered
in Section 2.3, we only focus on comparing it with the mixed effect
(1) If σ = 0, then for each i = 1, . . . , K and all j = 1, . . . , Ni ,
model here. We first describe the data generating process, which
Yi j = µ i . In this case, ρ = 1 and Var(Ȳ ) = τ 2 /K;
consists of a two-level hierarchy as described in the previous sec-
(2) If τ = 0, then µ i = µ for all i = 1, . . . , K, and therefore the
tion. First, at the randomization unit level, let the total number of
observations Yi j ’s are in fact i.i.d. In this case, ρ = 0 and (5)
clusters K = 1000. To mimic cluster heterogeneity, we divide clus-
reduces to Var(Ȳ ) = σ 2 /(Km). ters into three categories: small, medium and large. We generate
However, unfortunately, although theoretically sound, (5) has only the numbers of clusters in the three categories by the following
limited practical value because the assumptions it makes are unre- multinomial distribution:
alistic. In reality, the cluster sizes Ni and distributions of Yi j for
(M 1 , M 2 , M 3 )′ ∼ Multi-nomial{n = K;p = (1/3, 1/2, 1/6)}.
each cluster i are different, which means that µ i and σi2 are differ-
ent. For the ith cluster, depending on which category it belongs to, we
Another common approach is the mixed effect model, also known generate Ni , µ i and σi in the following way7 :
as multi-level/hierarchical regression [26], where Yi j depends on • Small: Ni ∼ Poisson(2), µ i ∼ N (µ = 0.3, σi = 0.05);
µ i and σi2 , while the parameters themselves follow a “higher order” • Medium: Ni ∼ Poisson(5), µ i ∼ N (µ = 0.5, σi = 0.1);
distribution. Under this setting, we can infer the treatment effect as • High: Ni ∼ Poisson(30), µ i ∼ N (µ = 0.8, σi = 0.05);
5 Pedantically, they are two measurements of the same metric. We often use metric to 6 Y ∼ Treatment + (1|User) in lme4 notation. A detailed discussion is beyond the scope
refer to both the metric itself (e.g. revenue-per-user) and measurements of the metric of this paper; see Bates et al. [3], Gelman and Hill [26].
(e.g. revenue-per-user of the treatment group) and this distinction would be clear in 7 The positive correlation between µ and N is not important, and reader can try out
i i
the context. code with different configuration.
Second, for each fixed i, let Yi j ∼ Bernoulli(p = µ i ) for all j = Delta method can be seen as the ultimate simplification of GEE’s
1, . . . , Ni . This choice is motivated by binary metrics such as page- sandwich variance estimator after summarizing data points into
click-rate, and because of it we can determine the ground truth sufficient statistics. But the derivation of GEE is much more in-
E(Ȳ ) = 0.667 by computing the weighted average of µ i weighted volved than the central limit theorem, while we can explain the
by the cluster sizes and the mixture of small, medium and large Delta method in a few lines and it is not only more memorable but
clusters. also provides more insights in (6).
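A condensed sketch of this data generating process and the resulting variance estimates (our own illustration, reusing the hypothetical clustered_mean_var helper sketched after equation (6)); the constants follow the description above:

import numpy as np
rng = np.random.default_rng(1)

K = 1000
counts = rng.multinomial(K, [1/3, 1/2, 1/6])           # small / medium / large clusters
lams   = [2, 5, 30]
mu_sd  = [(0.3, 0.05), (0.5, 0.1), (0.8, 0.05)]

S, N = [], []
for m, lam, (mu, sd) in zip(counts, lams, mu_sd):
    n_i  = rng.poisson(lam, m)
    mu_i = rng.normal(mu, sd, m)
    for ni, mi in zip(n_i, mu_i):
        y = rng.binomial(1, np.clip(mi, 0, 1), max(ni, 1))   # guard empty clusters / p in [0,1]
        S.append(y.sum()); N.append(len(y))

ybar = sum(S) / sum(N)                                  # point estimate of E(Ybar)
se_naive = np.sqrt(ybar * (1 - ybar) / sum(N))          # naive i.i.d. standard error
se_delta = np.sqrt(clustered_mean_var(S, N))            # clustered, via equation (6)
print(ybar, se_naive, se_delta)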
Our goal is to infer E(Ȳ ) and we compare the following three
methods: 4 EFFICIENT VARIANCE ESTIMATION FOR
(1) A naive estimator Ȳ , pretending all observations Yi j are i.i.d; QUANTILE METRICS
(2) Fitting a mixed effect model with a cluster random effect µ i ; 4.1 Sample quantiles and their asymptotics
(3) Using the metric Ȳ as in the first method, but using the Delta
method for variance estimation. Although the vast majority of metrics are averages of user teleme-
try data, quantile metrics form another category that is widely
Based on the aforementioned data generating mechanism, we re- used to focus on the tails of distributions. In particular, this is of-
peatedly and independently generate 1000 data sets. For each data ten the case for performance measurements, where we not only
set, we compute the point and variance estimates of E(Ȳ ) using care about an average user’s experience, but even more so about
the naive, mixed effect, and delta methods. We then compute em- those that suffer from the slowest responses. Within the web per-
pirical variances for the three estimators, and compare them to the formance community, quantiles (of, for example, page loading time)
average of estimated variances. We report the results in Table 2. at 75%, 95% or 99% often take the spotlight. In addition, the 50%
Table 2: Simulated examples: The first three columns contain the chosen quantile (median) is sometimes used to replace the mean, because
method, the true value of E(Ȳ ) and the true standard deviation of the corre- it is more robust to outlier observations (e.g., errors in instrumenta-
sponding methods. The last two columns contain the point estimates and aver-
age estimated standard errors. tion). This section focuses on estimating the variances of quantile
Method Ground Truth SD(True) Estimate Avg. SE(Model) estimates.
Suppose we have n i.i.d. observations X 1 , . . . , X n , generated by
Naive 0.667 0.00895 0.667 0.00522
a cumulative distribution function F (x) = P(X ≤ x) and a density
Mixed effect 0.667 0.00977 0.547 0.00956
Delta method 0.667 0.00895 0.667 0.00908 function f (x)8 . The theoretical pth quantile for the distribution F
is defined as F −1 (p). Let X (1) , . . . , X (n) be the ascending ordering of
Not surprisingly, the naive method under-estimates the true vari- the original observations. The sample quantile at p is X (np) if np
ance – we had treated the correlated observations as independent. is an integer. Otherwise, let ⌊np⌋ be the floor of np, then the sam-
Both the Delta method and the mixed effect model produced satis- ple quantile can be defined as any number between X ( ⌊np ⌋) and
factory variance estimates. However, although both the naive and X ( ⌊np ⌋+1) or a linear interpolation of the two9 . For simplicity here
the Delta method correctly estimated E(Y ), the mixed effect estima- we use X ( ⌊np ⌋) , which will not affect any asymptotic results. It is a
tor is severely biased. This shouldn’t be a big surprise if we look well-known fact that, if X 1 , . . . , X n are i.i.d. observations, follow-
deeper into the model Yi j = α + βi + ϵi j and E(ϵi j ) = 0, where ing the central limit theorem and a rather straightforward appli-
the random effects βi are centered so E(βi ) = 0. The sum of the cation of the Delta method, the sample quantile is approximately
intercept terms α and βi stands for the per-cluster mean µ i , and normal [10, 45]:
α represents the average of per-cluster mean, where we compute " #
√ n −1
o σ2
the mean within each cluster first, and then average over clusters. √n { X ( ⌊np ⌋) − F −1 (p) } → N( 0, σ 2 /f (F −1 (p))2 ), (7)
2
This is different from the metrics defined as simple average of Yi j f F −1 (p)
in the way that in the former all clusters are equally weighted and
where σ 2 = p(1 −p). However, unfortunately, in practice we rarely
in the latter case bigger clusters have more weight. The two defini-
have i.i.d. observations. A common scenario is in search engine per-
tions will be the same if and only if either there is no heterogeneity,
formance telemetry, where we receive an observation (e.g., page
i.e. per-cluster means µ i are all the same, or all clusters have the
loading time) for each server request or page-view, while random-
same size. We can still use the mixed effect model to get a unbiased
ization is done at a higher level such as device or user. This is the
estimate. This requires us to first estimate every βi (thus µ i ), and
Í same situation we have seen in Section 3, where X i are clustered.
then compute (α + βi )Ni / i Ni by applying the correct weight Ni .
To simplify future notations, we let Yi = I {X i ≤ F −1 (p)}, where I
The mixed effect model with the above formula gave a new esti-
is the indicator function. Then (6) can be used to compute Var(Ȳ ),
mate 0.662, much closer to the ground truth. Unfortunately, it is
and (7) holds in the clustered case with σ 2 = nVar(Ȳ ). This gener-
still hard to get the variance of this new estimator.
alizes the i.i.d. case where nVar(Ȳ ) = p(1 − p). Note that the Delta
In this study we didn’t consider the treatment effect. In ATE es-
method is instrumental in proving (7) itself, but a rigorous proof
timation, the mixed effect model will similarly result in a biased
involves a rather technical last step that is beyond our scope. A
estimate for the ATE for the same reason, as long as per-cluster
formal proof can be found in [45].
treatment effects vary and cluster sizes are different. The fact that
the mixed effect model provides a double average type estimate 8 We do not consider cases when X has discrete mass, and F will have jumps. In this

and the Delta method estimates the “population” mean is analo- case the quantile can take many values and is not well defined. In practice this case
can be seen as continuous case with some discrete correction.
gous to the comparison of the mixed effect model with GEE (gener- 9 When p is 0.5, the 50% quantile, or median, is often defined as the average of the
alized estimating equations) [41]. In fact, in the Gaussian case, the middle two numbers if we have even number of observations.
4.2 A Delta method solution to a practical issue However, in this algorithm the ranks depends on σ , whose compu-
Although theoretically sound, the difficulty of applying (7) in prac- tation depends on the quantile estimates (more specifically the Yi
tice lies in the denominator f {F −1 (p)}, whose computation requires requires a pass through the data after quantile is estimated). This
the unknown density function f at the unknown quantile F −1 (p). means that this algorithm requires a first ‘ntile’ retrieval, and then
A common approach is to estimate f from the observed X i using a pass through the data for σ computation, and then another two
non-parametric methods such as kernel density estimation [48]. ‘ntile’ retrievals. Turns out, computing all three ‘ntiles’ in one stage
However, any non-parametric density estimation method is trad- is much more efficient than splitting into two stages. This is be-
ing off between bias and variance. To reduce variance, more ag- cause retrieving ‘ntiles’ can be optimized in the following way: if
gressive smoothing and hence larger bias need to be introduced to we only need to fetch tail ranks, it is pointless to sort data that are
the procedure. This issue is less critical for quantiles at the body not at the tail; we can use sketching algorithm to narrow down the
of the distribution, e.g. median, where density is high and more possible range where our ranks reside and only sort in that range,
data exists around the quantile to make the variance smaller. As making it even more efficient to retrieve multiple ‘ntiles’ at once.
we move to the tail, e.g. 90%, 95% or 99%, however, the noise of Along this line of thoughts, to make the algorithm more efficient,
the density estimation gets bigger, so we have to introduce more we noticed that (7) also implies that the change from X i being i.i.d.
smoothing and more bias. Because the density shows up in the de- to clustered only requires an adjustment to the numerator σ , which
nominator and density in the tail often decays to 0, a small bias is a simple re-scaling step, and the correction factor does not de-
in estimated density can lead to a big bias for the estimated vari- pend on the unknown density function f in the denominator. If the
ance (Brown and Wolfe [7] raised similar criticisms with their sim- outer CI were to provide a good confidence interval in i.i.d. case, a
ulation study). A second approach is to bootstrap, re-sampling the re-scaled outer CI with the same correction term should also work
whole dataset many times and computing quantiles repeatedly. Un- for the clustered case, at least when n is large. This leads to the
like an average, computing quantiles requires sorting, and sorting outer CI with post-adjustment algorithm:
in distributed systems (data is distributed in the network) requires p
(1) compute L,U = n(p ± z α /2 p(1 − p)/n)
data shuffling between nodes, which incurs costly network I/O.
(2) fetch X ( ⌊np ⌋) , X (L) and X (U )
Thus, bootstrap works well for small scale data but tends to be
too expensive in large scale in its original form (efficient bootstrap (3) compute Yi = I {X i ≤ X ( ⌊np ⌋) }
on massive data is a research area of its own [34]). (4) compute µ S , µ N , σS2 , σS N , σ N2

An alternative method without the necessity for density estima- (5) compute σ by setting σ 2 /n equal topthe result of equation (6)
tion is more desirable, especially from a more practical perspec- (6) compute the correction factor σ / p(1 − p) and apply it to
tive. One such method is called outer confidence interval (outer X (L) and X (U )
CI) [40, 42], which produces a closed-form formula for quantile
confidence intervals using combinatorial arguments. Recall that We implemented this latter method in ExP using Apache Spark
Í [51] and SCOPE [11].
Yi = I {X i ≤ F −1 (p)} and Yi is the count of observations no
Í
greater than the quantile. In the aforementioned i.i.d. case, √Yi fol-
lows a binomial distribution. Consequently, when n is large n(Ȳ − 4.3 Numerical examples
p) ≈ N (0, σ 2 ) where σ 2 = p(1 − p). If the quantile value F −1 (p) ∈
To test the validity and performance of the adjusted outer CI method,
[X (r ) , X (r +1) ), then Ȳ = r /n. The above equation can be inverted
into a 100(1 − α)% confidence interval for r /n : p ± z α /2 σ /√n. (N B = 1000 replicates). The simulation setup consists of Nu =
This means with 95% probability the true percentile is between 100, . . . , 10000 users (clusters) with Niu = 1, . . . , 10 observations
the lower rank L = n(p − z α /2 σ /√n) and upper rank U = n(p + each (Niu are uniformly distributed). Each observation is the sum
z α /2 σ /√n) + 1! of two i.i.d. random variables X iu = X i + Xu , where Xu is con-
z α /2σ / n) + 1! of two i.i.d. random variables X iu = X i + Xu , where Xu is con-
The traditional outer CI depends on X i being i.i.d. But when stant for each user. We consider two cases, one symmetric and one
there are clusters, σ 2 /n instead of being p(1 − p)/n simply takes a heavy-tailed distribution:
different formula (6) by the Delta method and the result above still
holds. Hence the confidence interval for a quantile can be com- iid
• Normal: X i , Xu ∼ N (µ = 0, σ = 1);
puted in the following steps: iid
• Log-Normal: X i , Xu ∼ Log-normal(µ = 0, σ = 1).
(1) fetch the quantile X ( ⌊np ⌋) First, we find the "true" 95th percentile value of these distribution
(2) compute Yi = I {X i ≤ X ( ⌊np ⌋) } by computing its value for a very large sample (Nu = 107 ). Second,
(3) compute µ S , µ N , σS2 , σS N , σ N
2
we compute the confidence intervals for M = 10000 simulation
(4) 2
compute σ by setting σ /n equal to the result of equation (6) runs using bootstrap and outer CI with pre- and post-adjustment
(5) compute L,U = n(p ± z α /2σ ) and compare their coverage estimates (≈0.002 standard error), shown
(6) fetch the two ranks X (L) and X (U ) in Table 3. We found that when the sample contains 1000 or more
clusters, all methods provide good coverage. Pre- and post-adjustment
We call this outer CI with pre-adjustment. This method reduces outer CI results are both very close to the much more computa-
the complexity of computing a quantile and its confidence interval tionally expensive bootstrap (in our un-optimized simulations, the
into a Delta method step and subsequently fetching three “ntiles”. outer CI method was ≈20 times faster than bootstrap). When the
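A sketch of the pre-adjusted outer confidence interval for clustered data (our own illustration; it assumes the hypothetical clustered_mean_var helper from Section 3, and the rank clipping at the boundaries is ours):

import numpy as np
from scipy import stats

def quantile_ci_preadj(x, cluster_ids, p, alpha=0.05):
    """Sketch of the outer CI with pre-adjustment for a clustered sample."""
    x = np.asarray(x, float)
    n = len(x)
    xs = np.sort(x)
    q = xs[max(int(np.floor(n * p)) - 1, 0)]                  # X_(floor(np))
    y = (x <= q).astype(float)                                # Y_i = I{X_i <= q}
    _, idx = np.unique(cluster_ids, return_inverse=True)
    S = np.bincount(idx, weights=y)                           # per-cluster sums of Y
    N = np.bincount(idx).astype(float)                        # cluster sizes
    sigma = np.sqrt(n * clustered_mean_var(S, N))             # sigma^2 / n = eq. (6)
    z = stats.norm.ppf(1 - alpha / 2)
    L = int(np.floor(n * (p - z * sigma / np.sqrt(n))))       # lower rank
    U = int(np.ceil(n * (p + z * sigma / np.sqrt(n)))) + 1    # upper rank
    L, U = max(L, 1), min(U, n)                               # guard the boundaries
    return q, xs[L - 1], xs[U - 1]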
sample size was smaller than 1000 clusters, bootstrap was notice- In this section we show how the Delta method can be very use-
ably inferior to outer CI. For all sample sizes, pre-adjustment pro- ful for estimating Cov(X̄ t , X̄ t ′ ) for any two time points t and t ′ .
vided slightly larger coverage than post-adjustment, and this dif- We then use this to show how we can analyze within-subject stud-
ference increased for smaller samples. In addition, because adjust- ies, also known as pre-post, repeated measurement, or longitudinal
ment tends to result in increased confidence intervals, unadjusted studies. Our method starts with metrics directly, highly contrast-
ranks are more likely to have the same value as the quantile value, ing with traditional methods such as mixed effect models which
and thus post-adjustment is more likely to underestimate the vari- build models from individual user’s data. Because we study metrics
ance in that case. In conclusion, post-adjustment outer CI works directly, our model is small and easy to solve with its complexity
very well for large sample sizes Nu ≥ 1000 and reasonably well constant to the scale of data.10
for smaller samples, but has slightly inferior coverage compared
to pre-adjustment. For big data, outer CI with post-adjustment is 5.2 Methodology
recommended due its efficiency, while for median to small sample How do we estimate covariance with a completely unknown miss-
sizes, outer CI with pre-adjustment is preferred if accuracy is para- ing data mechanism? There are many existing works on handling
mount. missing data. One approach is to model the propensity of a data-
Table 3: Simulated examples: The first two columns contain the simulated point being missing using other observed predictors [13]. This re-
models and sample sizes. The last three columns contain the coverage rates for quires additional covariates/predictors to be observed, plus a rather
the Bootstrap, pre-adjusted and post-adjusted outer confidence intervals.
strong assumption that conditioned on these predictors, data are
Distribution Nu Bootstrap Outer CI Outer CI
missing completely at random. We present a novel idea using the
(pre-adj.) (post-adj.)
Delta method after data augmentation without the need of model-
Normal 100 0.9039 0.9465 0.9369
ing the missing data mechanism. Specifically, we use an additional
Normal 1000 0.9500 0.9549 0.9506
indicator for the presence/absence status of a user in each period t.
Normal 10000 0.9500 0.9500 0.9482
For user i in period t, let Iit = 1 if user i appears in period t, and 0
Log-normal 100 0.8551 0.9198 0.9049 otherwise. For each user i in period t, instead of one scalar metric
Log-normal 1000 0.9403 0.9474 0.9421 value (X it ), we augment it to a vector (Iit , X it ). When Iit = 0, i.e.
Log-normal 10000 0.9458 0.9482 0.9479 user is missing, we define X it = 0. Under this simple augmentation,
the metric value X̄ t , taking the average over those non-missing
Í Í
measurements in period t, is the same as i X it / i Iit ! In this
5 MISSING DATA AND WITHIN-SUBJECT connection,
ANALYSES Í Í   
X it X it ′ X̄ t X̄ t ′
Cov(X̄ t , X̄ t ′ ) = Cov Íi , Íi = Cov ,
5.1 Background i I it i I it ′ I¯t I¯t ′
Consider the case that we are tracking a metric over time. For ex- where the last equality is by dividing both numerator and denom-
ample, businesses need to track key performance indicators like inator by the same total number of users who have appeared in
user engagement and revenue daily or weekly for a given audi- any of the two periods.11 Thanks to the central limit theorem, the
ence. To simplify, let us say we are tracking weekly visits-per-user. vector (I¯t , X̄ t , I¯t ′ , X̄ t ′ ) is also asymptotically (multivariate) normal
Visits-per-user is defined as a simple average X̄ t , t = 1, . . . for each with covariance matrix Σ, which can be estimated using sample
week. If there is a drop between X̄ t and X̄ t −1 , how do we set up an variance and covariance because there is no missing data after aug-
alert so that we can avoid triggering it due to random noise and mentation. In Section 2 we already applied the Delta method to
control the false alarm rate? Looking at the variance, we see compute the variance of a ratio of metrics by taking the first order
Var(X̄ t − X̄ t −1 ) = Var(X̄ t ) + Var(X̄ t −1 ) − 2Cov(X̄ t , X̄ t −1 ) Taylor expansion. Here, we can expand the two ratios to their first
order linear form (X̄ t − µ X t )/µ I t − µ X t (I¯t − µ I t )/µ I t 2 , where µ X t
and we need to have a good estimate of the covariance term be- and µ I are the means of X and I and the expansion for t ′ is similar.
cause we know it is non-zero. Cov(X̄ t , X̄ t ′ ) can then be computed using Σ.
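A sketch of this covariance computation after data augmentation (our own illustration; array and function names are ours):

import numpy as np

def ratio_metric_cov(i_t, x_t, i_s, x_s):
    """Sketch: Delta-method Cov(Xbar_t, Xbar_s) after data augmentation.
    Arrays are aligned over the union of users seen in either period;
    i_* are 0/1 presence indicators and x_* are zero-filled when absent."""
    m = np.vstack([i_t, x_t, i_s, x_s]).astype(float)
    n = m.shape[1]
    mu = m.mean(axis=1)                   # (mu_It, mu_Xt, mu_Is, mu_Xs)
    sigma = np.cov(m, ddof=1)             # 4x4 sample covariance matrix
    # Gradients of the two ratios Xbar/Ibar at their means (first-order expansion).
    g_t = np.array([-mu[1] / mu[0]**2, 1 / mu[0], 0.0, 0.0])
    g_s = np.array([0.0, 0.0, -mu[3] / mu[2]**2, 1 / mu[2]])
    return g_t @ sigma @ g_s / n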
Normally, estimating the covariance of two sample averages is We can apply this to the general case of a within-subject study.
trivial and the procedure is very similar to the estimation of the Without loss of any generality and with the benefit of a concrete
sample variance. But there is something special about this case — solution, we only discuss a cross-over design here. Other designs
missing data. Not everyone uses the product every week. For any including more periods are the same in principle. In a cross-over
online and offline metric, we are only able to define the metric us- design, we have two groups I and II. For group I, we assign them
ing observed data. If for every user we observe its value in week to the treatment for the first period and control for the second. For
t − 1 and t, the covariance can be estimated using the sample co-
variance formula. But if there are many users who appear in week 10 Part of the work in this section has previously appeared in a technical report [28].
t −1 but not in t, it is unclear how to proceed. The naive approach is Since authoring the technical report, the authors developed a better understanding
of the differences between mixed effect models and the proposed method, and we
to estimate the covariance using complete observations, i.e. users present our new findings here. The technical report has more details about other types
who appear in both weeks. However, this only works if the data is of repeated measurement models.
11 Actually, if there are more than two period, we can either use only users appeared in
missing completely at random. In reality, active users will show up
any of the two periods, or users appeared in any of all the periods. It is mathematically
more often and are more likely to appear in both weeks, meaning the same thing if we added more users and then treat them as not appeared in I and
the missing data are obviously not random. X , i.e. X̄ t remains the same.
group II, the assignment order is reversed. We often pick the two We simulate the following 1000 times: each time we run both the
groups using random selection, so each period is an A/B test by mixed effect model X it ∼ IsTreatment + Time + (1|User) as well as
itself. the additive cross-over model (8) and record their estimates of the
Let X̄ it , i = 1, 2, t = 1, 2 be the metric for group i in period t. Let ATE and the corresponding estimated variance. We then estimate
X = (X̄ 11 , X̄ 12 , X̄ 21 , X̄ 22 ). We know X ∼ N (µ, ΣX ) for a mean vector the true estimator variance using the sample variance of those es-
µ and covariance matrix ΣX . ΣX has the form diag(Σ, Σ) since the timates among 1000 trials and compare that to the mean of the
two groups are independent with the same distribution. With the estimated variance to evaluate the quality of variance estimations.
help of the Delta method, we can estimate Σ from the data and treat We also compute the average of ATE estimates and compare to the
it as a constant covariance matrix (hence ΣX ). Our model concerns true ATE to assess the bias.
the mean vector µ(θ ) using other parameters which represent our Table 4: Simulated examples: The first three columns contain the method,
interest. In a cross-over design, θ = (θ 1 , θ 2 , ∆) where the first two true ATE and standard deviations of the corresponding methods. The last two
columns contain the point estimates and average estimated standard errors.
are baseline means for the two periods and our main interest is
Ground Truth SD(True) Estimate Avg. SE(Model)
the treatment effect ∆. (We assume there is no treatment effect
for group I carried over to the 2nd period. To study carry over mixed effect 6.592 0.1295 7.129 0.1261
effect, a more complicated design needs to be employed.) In this Delta method 6.592 0.1573 6.593 0.1568
parameterization,
Table 4 summarizes the results. We found both methods pro-
vide good variance estimation and the mixed effect model shows
µ(θ ) = (θ 1 + ∆, θ 2 , θ 1 , θ 2 + ∆) . (8)
smaller variance. However, mixed effect also displays an upward
bias in the ATE estimate while the Delta method closely tracks the
The maximum likelihood estimator and Fisher information the-
true effect. To further understand the difference between the two,
ory [45] paved a general way for us to estimate θ as well as the
we separate the data into two groups: users with complete data,
variance of the estimators for various mean vector models. For ex-
and users who only appear in one period (incomplete group). We
ample, if we want to model the relative change directly, we just
run the mixed effect model for the the two groups separately. Note
need to change the addition of ∆ into multiplication of (1 + ∆).
that in the second group each user only appears once in the data,
Notice the model we are fitting here is very small, almost a toy
so the model is essentially a linear model. Our working hypothesis
example from a text book. All the heavy lifting is in computing
is the following: because the mixed effect model assumes a fixed
ΣX which is dealt with by the Delta method where all computa-
treatment effect, the effect for the complete and incomplete groups
tions are trivially distributed. When µ(θ ) = Mθ is linear as in the
must be the same. The mixed effect model can take advantage of
additive cross-over model above, θ̂ = (M T Σ−1 −1 T −1
X M) M ΣX X and this assumption and construct an estimator by weighted average of
Var(b
θ ) = I where the Fisher Information I = M Σ−1 M.
−1 T
X the two estimates from the two groups, with the optimal weight in-
versely proportional to their variances. Table 5 shows the weighted
5.3 Numerical examples average estimator is indeed very close to mixed effect model for
We simulate two groups with 1000 users each. The first group re- both estimate and variance. The weighted average is closer to the
ceives treatment in period 1, then control in period 2, while the complete group because the variance there is much smaller than
second group receives the inverse. For each user, we impose an that of the incomplete group since within-subject comparison sig-
independent user effect ui that follows a normal distribution with nificantly reduces noise. This explains why the mixed effect model
mean 10 and standard deviation 3, and independent noises ϵit with can produce misleading results in within-subject analysis when-
mean 0 and standard deviation 2. Each user’s base observations (be- ever missing data patterns can be correlated with the treatment
fore treatment effect) for the two periods are (ui +ϵi 1 , ui +ϵi 2 ). We effect. The Delta method, on the other hand, offers a flexible and
then model the missing data and treatment effect such that they robust solution.
are correlated. We define a user’s engagement level li by its user Table 5: Simulated examples: Point and variance estimates using the mixed
effect vs. weighted average estimators.
effect ui through P(U < ui ), i.e. the probability that a user’s u is
bigger than another random user. We model the treatment effect Estimate Var
∆i as an additive normal with mean 10 and standard deviation 0.3, mixed effect model 7.1290 0.00161
multiplied by li . For each user and each of the two periods, there mixed effect model on complete group 7.3876 0.00180
is a 1 - max(0.1, li ) probability of this user being missing. We can linear model on incomplete group 5.1174 0.01133
interpret the missing data as a user not showing up, or as a failure weighted avg estimator 7.0766 0.00155
to observe. In this model, without missing data, every user has two
observations and the average treatment effect should be E(∆i ) = 5
because E(li ) = 0.5. Since we have missing data and it is more 6 CONCLUDING REMARKS
likely for lower engagement levels to be missing, we expect the av- Measure everything is not only an inspiring slogan, but also a cru-
erage treatment effect for the population of all observed users to be cial step towards the holy grail of data-driven decision making. In
between 5 to 10. In the current model we didn’t add a time period the big data era, innovative technologies have tremendously ad-
effect such that the second period could have a different mean θ 2 vanced user telemetry and feedback collection, and distilling in-
from the first period’s mean θ 1 , but in analysis we always assume sights and knowledge from them is an imperative task for business
that this effect could exist. success. To do so, we typically apply certain statistical models at
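To make the estimation step concrete, a small sketch of the GLS computation θ̂ = (Mᵀ Σ_X⁻¹ M)⁻¹ Mᵀ Σ_X⁻¹ X for the additive cross-over model (8), assuming Σ_X has already been estimated via the Delta method (our own illustration; the function name is ours):

import numpy as np

def crossover_gls(xbar, sigma_x):
    """Sketch: GLS estimate of theta = (theta1, theta2, Delta) for model (8),
    given the metric vector xbar = (X11, X12, X21, X22) and its (Delta-method)
    covariance matrix sigma_x. Returns the estimate and its covariance I^{-1}."""
    M = np.array([[1, 0, 1],    # mu(theta): group I, period 1 = theta1 + Delta
                  [0, 1, 0],    #            group I, period 2 = theta2
                  [1, 0, 0],    #            group II, period 1 = theta1
                  [0, 1, 1]])   #            group II, period 2 = theta2 + Delta
    prec = np.linalg.inv(sigma_x)
    info = M.T @ prec @ M                     # Fisher information
    theta_hat = np.linalg.solve(info, M.T @ prec @ np.asarray(xbar, float))
    return theta_hat, np.linalg.inv(info)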
the level of individual observations, fit the model using numerical [4] Douglas Bates, Martin Maechler, Ben Bolker, Steven Walker, et al. 2014. lme4:
procedures such as solving optimization problems, and draw con- Linear mixed-effects models using Eigen and S4. R package version 1, 7 (2014),
1–23.
clusions and assess uncertainties from the fitted models. However, [5] Dennis D Boos and Jacqueline M Hughes-Oliver. 2000. How large does n have
for big data such an analytical procedure can be very challenging. to be for Z and t intervals? The American Statistician 54, 2 (2000), 121–128.
[6] Léon Bottou. 2010. Large-scale machine learning with stochastic gradient de-
Based on the key observation that metrics are approximately nor- scent. In Proceedings of COMPSTAT’2010. Springer, 177–186.
mal due to the central limit theorem, this paper offers an alterna- [7] Morton B Brown and Robert A Wolfe. 1983. Estimation of the variance of per-
tive perspective by advocating modeling metrics, i.e., summaries or centile estimates. Computational Statistics & Data Analysis 1 (1983), 167–174.
[8] Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent
aggregations of the original data sets, in a direct manner. By doing Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In
so, we can decompose big data into small problems. However, al- Proceedings of the Eleventh ACM International Conference on Web Search and Data
though conceptually sound, in practice these metric level models Mining. ACM, 55–63.
[9] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich,
often involve nonlinear transformations of data or complex data Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Rid-
generating mechanisms, posing several challenges for trustworthy dell. 2016. Stan: A probabilistic programming language. Journal of Statistical
Software 20 (2016), 1–37.
and efficient metric analytics. [10] George Casella and Roger L Berger. 2002. Statistical Inference, Second Edition.
To address these issues, we promoted the Delta method’s cen- Duxbury Press: Pacific Grove, CA.
tral role in making it possible to extend the central limit theorem to [11] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Si-
mon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel pro-
new territories. We demonstrated how to apply the Delta method cessing of massive data sets. Proceedings of the VLDB Endowment 1 (2008), 1265–
in four real-life applications, and illustrated how this approach 1276.
naturally leads to trivially parallelizable and highly efficient im- [12] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine
Learning 20 (1995), 273–297.
plementations. Among these applications, ratio metrics, clustered [13] M. Davidian, A.A. Tsiatis, and S. Leon. 2005. Semiparametric Estimation of Treat-
randomizations and quantile metrics are all common and impor- ment Effect in a Pretest-Posttest Study with Missing Data. Statist. Sci. 20 (2005),
295–301. Issue 3.
tant scenarios for A/B testing, and business analytics in general. [14] A. Deng, J. Lu, and J. Litz. 2017. Trustworthy analysis of online A/B tests: Pitfalls,
Within-subject studies are becoming more popular for superior challenges and solutions. In Proceedings of the Tenth ACM International Confer-
sensitivities, and missing data with an unknown mechanism is ence on Web Search and Data Mining. 641–649.
[15] A. Deng and X. Shi. 2016. Data-driven metric development for online controlled
ubiquitous in both the online and offline worlds. Our contribution experiments: Seven lessons learned. In Proceedings of the 22nd ACM SIGKDD
to technical improvements, novel ideas and new understandings in- International Conference on Knowledge Discovery and Data Mining.
cludes bias correction for ratio metric estimation, combination of [16] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity
of online controlled experiments by utilizing pre-experiment data. In Proceedings
the Delta method and outer confidence intervals for quantile met- of the 6th ACM WSDM Conference. 123–132.
ric variance estimation, and the idea of data augmentation for gen- [17] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty
Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Ex-
eral missing data problems in within-subject studies. We also re- periments. In Proceedings of the 23rd ACM SIGKDD International Conference on
vealed the connection between the Delta method and mixed effect Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA,
models, and explained their differences. In addition, we pointed 1427–1436. https://doi.org/10.1145/3097983.3098024
[18] Pavel Dmitriev and Xian Wu. 2016. Measuring Metrics. In Proceedings of the 25th
out the advantage of the Delta method in the presence of an un- ACM International on Conference on Information and Knowledge Management.
known missing data mechanism. Overall speaking, we hope this ACM, 429–437.
paper can serve as a practical guide of applying the Delta method [19] Allan Donner. 1987. Statistical methodology for paired cluster designs. American
Journal of Epidemiology 126, 5 (1987), 972–979.
in large-scale metric analyses and A/B tests, so that it is no longer [20] Dean Eckles, Brian Karrer, and Johan Ugander. 2017. Design and analysis of
just a technical detail, but a starting point and a central piece of experiments in networks: Reducing bias from interference. Journal of Causal
Inference 5, 1 (2017).
analytical thinking. [21] Jianqing Fan, Fang Han, and Han Liu. 2014. Challenges of big data analysis.
Although the Delta method can help tackle big data problems, National Science Review 1 (2014), 293–314.
it does not replace the need for rigorous experimental designs and [22] Edgar C Fieller. 1940. The biological standardization of insulin. Supplement to
the Journal of the Royal Statistical Society 7, 1 (1940), 1–64.
probabilistic modeling. For example, “optimal” choices for cluster [23] Edgar C Fieller. 1954. Some problems in interval estimation. Journal of the Royal
configurations, randomization mechanisms and data transforma- Statistical Society. Series B (Methodological) (1954), 175–185.
tions are known to increase the sensitivities of metrics [8, 32, 49]. [24] Ronald Aylmer Fisher. 1922. On the mathematical foundations of theoretical sta-
tistics. Philosophical Transactions of the Royal Society of London. Series A, Con-
We leave them as future research directions. taining Papers of a Mathematical or Physical Character 222 (1922), 309–368.
[25] Pedro A Forero, Alfonso Cano, and Georgios B Giannakis. 2010. Consensus-
ACKNOWLEDGMENTS based distributed support vector machines. Journal of Machine Learning Re-
search 11, May (2010), 1663–1707.
We benefited from insightful discussions with Ya Xu and Daniel [26] Andrew Gelman and Jennifer Hill. 2006. Data analysis using regression and mul-
tilevel/hierarchical models. Cambridge University Press.
Ting on quantile metrics and outer confidence intervals. We thank [27] Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B testing:
Jong Ho Lee for implementing the efficient quantile metric esti- From sampling to estimation. In Proceedings of the 24th International Conference
mation and confidence interval construction in ExP using Apache on World Wide Web. International World Wide Web Conferences Steering Com-
mittee, 399–409.
Spark, and Yu Guo for previous work on repeated measurements. [28] Yu Guo and Alex Deng. 2015. Flexible Online Repeated Measures Experiment.
arXiv preprint arXiv:1501.00450 (2015).
REFERENCES [29] Peter Hall. 2013. The bootstrap and Edgeworth expansion. Springer Science &
Business Media.
[1] Susan Athey and Guido W Imbens. 2017. The econometrics of randomized ex- [30] Joe Hirschberg and Jenny Lye. 2010. A geometric comparison of the delta and
periments. Handbook of Economic Field Experiments 1 (2017), 73–140. Fieller confidence intervals. The American Statistician 64 (2010), 234–241.
[2] Lars Backstrom and Jon Kleinberg. 2011. Network bucket testing. In Proceedings [31] Michael I Jordan, Jason D Lee, and Yun Yang. 2018. Communication-efficient
of the 20th international conference on World wide web. ACM, 615–624. distributed statistical inference. J. Amer. Statist. Assoc. in press (2018).
[3] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2014. Fitting doi:10.1080/01621459.2018.1429274.
linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823 (2014).

[32] Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning sensi- [43] Walter Rudin et al. 1964. Principles of mathematical analysis. Vol. 3. McGraw-hill
tive combinations of a/b test metrics. In Proceedings of the Tenth ACM Interna- New York.
tional Conference on Web Search and Data Mining. ACM, 651–659. [44] Diane Tang, Ashish Agarwal, Deirdre O’Brien, and Mike Meyer. 2010. Overlap-
[33] Neil Klar and Allan Donner. 2001. Current and future challenges in the design ping Experiment Infrastructure: More, Better, Faster Experimentation. Proceed-
and analysis of cluster randomization trials. Statistics in medicine 20, 24 (2001), ings of the 16th ACM SIGKDD Conference (2010).
3729–3740. [45] Aad W Van der Vaart. 2000. Asymptotic statistics. Vol. 3. Cambridge university
[34] Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I Jordan. 2014. press.
A scalable bootstrap for massive data. Journal of the Royal Statistical Society: [46] Ulrike Von Luxburg and Volker H Franz. 2009. A geometric approach to confi-
Series B (Statistical Methodology) 76, 4 (2014), 795–816. dence sets for ratios: Fieller’s theorem, generalizations and bootstrap. Statistica
[35] Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Sinica (2009), 1095–1117.
Juan Lavista Ferres, and Tamir Melamed. 2009. Online experimentation at Mi- [47] Dongli Wang and Yan Zhou. 2012. Distributed support vector machines: An
crosoft. In Proceedings of the Third International Workshop on Data Mining Case overview. In Control and Decision Conference (CCDC), 2012 24th Chinese. IEEE,
Studies, held at the 5th ACM SIGKDD Conference. 11–23. 3897–3901.
[36] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. [48] Larry Wasserman. 2003. All of Statistics: A Concise Course in Statistical Inference.
2013. Online Controlled Experiments at Large Scale. Proceedings of the 19th Springer.
ACM SIGKDD Conference (2013). [49] Huizhi Xie and Juliette Aurisset. 2016. Improving the sensitivity of online con-
[37] Ron Kohavi, Randal M Henne, and Dan Sommerfield. 2007. Practical guide to trolled experiments: Case studies at netflix. In Proceedings of the 22nd ACM
controlled experiments on the web: listen to your customers not to the hippo. In SIGKDD International Conference on Knowledge Discovery and Data Mining.
Proceedings of the 13th ACM SIGKDD Conference. 959–967. ACM, 645–654.
[38] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009. [50] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015.
Controlled experiments on the web: survey and practical guide. Data mining From infrastructure to culture: A/B testing challenges in large scale social net-
and knowledge discovery 18, 1 (2009), 140–181. works. In Proceedings of the 21th ACM SIGKDD International Conference on
[39] R. Kohavi, R. Longbotham, and T. Walker. 2010. Online Experiments: Practical Knowledge Discovery and Data Mining. ACM, 2227–2236.
Lessons. Computer 43, 9 (Sept 2010), 82–85. [51] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Arm-
[40] Daniel Krewski. 1976. Distribution-free confidence intervals for quantile inter- brust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman,
vals. J. Amer. Statist. Assoc. 71, 354 (1976), 420–422. Michael J Franklin, et al. 2016. Apache Spark: A unified engine for big data
[41] Kung-Yee Liang and Scott L Zeger. 1986. Longitudinal data analysis using gen- processing. Commun. ACM 59, 11 (2016), 56–65.
eralized linear models. Biometrika 73, 1 (1986), 13–22. [52] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. 2010. Paral-
[42] John S Meyer. 1987. Outer and inner confidence intervals for finite population lelized stochastic gradient descent. In Advances in Neural Information Processing
quantile intervals. J. Amer. Statist. Assoc. 82, 397 (1987), 201–204. Systems. 2595–2603.

