In probability theory, the '''Chernoff bound''', named after [[Herman Chernoff]] but due to Herman Rubin,[1] gives exponentially decreasing bounds on tail distributions of sums of independent random variables. It is a sharper bound than the known first- or second-moment-based tail bounds such as [[Markov's inequality]] or [[Chebyshev's inequality]], which only yield power-law bounds on tail decay. However, the Chernoff bound requires that the variates be independent – a condition that neither Markov's inequality nor Chebyshev's inequality requires.

It is related to the (historically prior) [[Bernstein inequalities (probability theory)|Bernstein inequalities]] and to [[Hoeffding's inequality]].


==Example==
Let {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} be independent [[Bernoulli random variable]]s, each equal to 1 with probability ''p'' > 1/2. Then the probability of simultaneous occurrence of more than ''n''/2 of the events {{math|{''X<sub>k</sub>'' {{=}} 1} }} has an exact value {{mvar|S}}, where

:<math> S=\sum_{i = \lfloor \tfrac{n}{2} \rfloor + 1}^n \binom{n}{i}p^i (1 - p)^{n - i} .</math>

The Chernoff bound shows that {{mvar|S}} has the following lower bound:

:<math> S \ge 1 - e^{-\frac{1}{2p}n \left(p - \frac{1}{2} \right)^2} .</math>
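
For example, for {{math|''p'' {{=}} 0.6}} and {{math|''n'' {{=}} 500}} the exponent equals {{math|500 × (0.1)<sup>2</sup>/1.2 ≈ 4.17}}, so the bound guarantees {{math|''S'' ≥ 1 − ''e''<sup>−4.17</sup> ≈ 0.98}}.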

Indeed, noticing that {{math|''μ'' {{=}} ''np''}}, we get by the multiplicative form of the Chernoff bound (see below or Corollary 13.3 in Sinclair's class notes)

:<math> \Pr\left [ \sum_{k} X_k \le \frac{n}{2} \right ] = 1 - S \le e^{-\frac{1}{2p}n \left(p - \frac{1}{2} \right)^2} .</math>

This result admits various generalizations as outlined below. One can encounter many flavours of Chernoff bounds: the original additive form (which gives a bound on the absolute error) or the more practical multiplicative form (which bounds the error relative to the mean).

== A motivating example ==
[[Image:Chernoff bound.png|right]]
The simplest case of Chernoff bounds is used to bound the success probability of majority agreement for {{mvar|n}} independent, equally likely events.

A simple motivating example is to consider a biased coin. One side (say, Heads) is more likely to come up than the other, but you don't know which and would like to find out. The obvious solution is to flip it many times and then choose the side that comes up the most. But how many times do you have to flip it to be confident that you've chosen correctly?

In our example, let {{mvar|X<sub>i</sub>}} denote the event that the ''i''th coin flip comes up Heads; suppose that we want to ensure we choose the wrong side with at most a small probability {{mvar|ε}}. Then, rearranging the above, we must have:

:<math> n \geq \frac{1}{(p -\frac{1}{2})^2} \ln \frac{1}{\sqrt{\varepsilon}}.</math>

If the coin is noticeably biased, say coming up on one side 60% of the time ({{math|''p'' {{=}} 0.6}}), then we can guess that side with 95% ({{math|''ε'' {{=}} 0.05}}) accuracy after 150 flips ({{math|''n'' {{=}} 150}}). If it is 90% biased, then a mere 10 flips suffices. If the coin is only biased a tiny amount, like most real coins are, the number of necessary flips becomes much larger.
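
For instance, substituting {{math|''p'' {{=}} 0.6}} and {{math|''ε'' {{=}} 0.05}} into the inequality above gives

:<math> n \ge \frac{1}{(0.6 - 0.5)^2} \ln \frac{1}{\sqrt{0.05}} = 100 \ln \frac{1}{\sqrt{0.05}} \approx 149.8 ,</math>

so 150 flips suffice, while the same calculation with {{math|''p'' {{=}} 0.9}} gives {{math|''n'' ≈ 10}}.
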
More practically, the Chernoff bound is used in [[randomized algorithm]]s (or in computational devices such as [[quantum computer]]s) to determine a bound on the number of runs necessary to determine a value by majority agreement, up to a specified probability. For example, suppose an algorithm (or machine) ''A'' computes the correct value of a function ''f'' with probability ''p'' > 1/2. If we choose ''n'' satisfying the inequality above, the probability that a majority exists and is equal to the correct value is at least 1 − ε, which for small enough ε is quite reliable. If ''p'' is a constant, ε diminishes exponentially with growing ''n'', which is what makes algorithms in the complexity class [[BPP (complexity)|BPP]] efficient.
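
For illustration, the following short Python simulation (an illustrative sketch; the parameters are those of the coin example above) estimates the probability that the majority vote picks the wrong side and compares it with the error bound <math>e^{-2n(p - 1/2)^2}</math> implicit in the inequality above:

<syntaxhighlight lang="python">
import math
import random

def majority_error_rate(n: int, p: float, trials: int = 20000) -> float:
    """Estimate the probability that the p-biased side fails to win a strict
    majority of n independent coin flips."""
    errors = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n))
        if heads <= n // 2:  # the biased side did not obtain a strict majority
            errors += 1
    return errors / trials

n, p = 150, 0.6
bound = math.exp(-2 * n * (p - 0.5) ** 2)  # e^{-2n(p - 1/2)^2} = e^{-3}
print(f"simulated error rate ~ {majority_error_rate(n, p):.4f}")
print(f"Chernoff bound         {bound:.4f}")  # about 0.0498
</syntaxhighlight>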


Notice that if ''p'' is very close to 1/2, the necessary {{mvar|n}} can become very large. For example, if ''p'' = 1/2 + 1/2<sup>''m''</sup>, as it might be in some [[PP (complexity)|PP]] algorithms, the result is that {{mvar|n}} is bounded below by an exponential function in ''m'':


:<math> n \geq 2^{2m} \ln \frac{1}{\sqrt{\varepsilon}}.</math>
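
For instance, already for {{math|''m'' {{=}} 10}} and {{math|''ε'' {{=}} 0.05}} this requires

:<math> n \geq 2^{20} \ln \frac{1}{\sqrt{0.05}} \approx 1.6 \times 10^{6} </math>

runs.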


== The first step in the proof of Chernoff bounds ==
The Chernoff bound for a random variable {{mvar|X}}, which is the sum of {{mvar|n}} independent random variables {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}}, is obtained by applying Markov's inequality to {{mvar|e<sup>tX</sup>}} for some well-chosen value of {{mvar|t}}. This method was first applied by [[Sergei Bernstein]] to prove the related [[Bernstein inequalities (probability theory)|Bernstein inequalities]].


From [[Markov's inequality]] and using independence we can derive the following useful inequality:

For any {{math|''t'' > 0}},

:<math>\Pr(X \ge a) = \Pr \left (e^{tX} \ge e^{ta} \right ) \le \frac{\mathrm{E} \left [e^{tX} \right ]}{e^{ta}}.</math>

In particular, optimizing over {{mvar|t}} and using independence of the {{mvar|X<sub>i</sub>}} we obtain,

{{NumBlk|:|<math>\Pr (X \ge a) \le \inf_{t > 0} \frac{\mathrm{E} \left [\prod_i e^{t X_i} \right ]}{e^{ta}} = \inf_{t > 0} \frac{\prod_i \mathrm{E} \left [e^{t X_i} \right ]}{e^{ta}}.</math>|{{EquationRef|1}}}}

Similarly,

:<math>\Pr(X \le a) = \Pr \left (e^{-tX} \ge e^{-ta} \right ) \le \frac{\mathrm{E} \left [e^{-tX} \right ]}{e^{-ta}}</math>

and so,

:<math>\Pr (X \le a) \le \inf_{t > 0} \frac{\prod_i \mathrm{E} \left [e^{-t X_i} \right ]}{e^{-ta}}.</math>

== Precise statements and proofs ==

=== Theorem for additive form (absolute error) ===
The following theorem is due to [[Wassily Hoeffding]] and hence is called the Chernoff–Hoeffding theorem.

:'''Chernoff–Hoeffding Theorem.''' Suppose {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} are [[i.i.d.]] random variables, taking values in {{math|{0, 1}.}} Let {{math|''p'' {{=}} E[''X<sub>i</sub>'']}} and {{math|''ε'' > 0}}. Then
::<math>\begin{align}
\Pr \left (\frac{1}{n} \sum X_i \geq p + \varepsilon \right ) \leq \left (\left (\frac{p}{p + \varepsilon}\right )^{p+\varepsilon} {\left (\frac{1 - p}{1-p- \varepsilon}\right )}^{1 - p- \varepsilon}\right )^n &= e^{-D(p+\varepsilon\|p) n} \\
\Pr \left (\frac{1}{n} \sum X_i \leq p - \varepsilon \right ) \leq \left (\left (\frac{p}{p - \varepsilon}\right )^{p-\varepsilon} {\left (\frac{1 - p}{1-p+ \varepsilon}\right )}^{1 - p+ \varepsilon}\right )^n &= e^{-D(p-\varepsilon\|p) n}
\end{align}</math>
:where
::<math> D(x\|y) = x \log \frac{x}{y} + (1-x) \log \left (\frac{1-x}{1-y} \right )</math>
:is the [[Kullback–Leibler divergence]] between [[Bernoulli distribution|Bernoulli distributed]] random variables with parameters ''x'' and ''y'' respectively. If {{math|''p'' ≥ {{sfrac|1|2}},}} then
::<math> \Pr\left ( X>np+x \right ) \leq \exp \left (-\frac{x^2}{2np(1-p)} \right ).</math>
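
For example, for a fair coin ({{math|''p'' {{=}} 1/2}}) the last bound reads {{math|Pr(''X'' > ''n''/2 + ''x'') ≤ exp(−2''x''<sup>2</sup>/''n'')}}, matching the special-case bounds given further below.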


==== Proof ====
Let {{math|''q'' {{=}} ''p'' + ''ε''}}. Taking {{math|''a'' {{=}} ''nq''}} in ({{EquationNote|1}}), we obtain:

:<math>\Pr\left ( \frac{1}{n} \sum X_i \ge q\right )\le \inf_{t>0} \frac{E \left[\prod e^{t X_i}\right]}{e^{tnq}} = \inf_{t>0} \left ( \frac{ E\left[e^{tX_i} \right] }{e^{tq}}\right )^n.</math>

Now, knowing that {{math|Pr(''X<sub>i</sub>'' {{=}} 1) {{=}} ''p'', Pr(''X<sub>i</sub>'' {{=}} 0) {{=}} 1 − ''p''}}, we have

:<math>\left (\frac{\mathrm{E}\left[e^{tX_i} \right] }{e^{tq}}\right )^n = \left (\frac{p e^t + (1-p)}{e^{tq} }\right )^n = \left ( pe^{(1-q)t} + (1-p)e^{-qt} \right )^n.</math>

Therefore we can easily compute the infimum, using calculus:

:<math>\frac{d}{dt} \left (pe^{(1-q)t} + (1-p)e^{-qt} \right) = (1-q)pe^{(1-q)t}-q(1-p)e^{-qt}</math>

Setting the derivative to zero and solving, we have


:<math>(1-q)pe^{(1-q)t} = q(1-p)e^{-qt},</math>

so that

:<math>e^t = \frac{(1-p)q}{(1-q)p}.</math>

Thus,
:<math>t = \log\left(\frac{(1-p)q}{(1-q)p}\right).</math>

As {{math|''q'' {{=}} ''p'' + ''ε'' > ''p''}}, we see that {{math|''t'' > 0}}, so the required constraint on {{mvar|t}} is satisfied. Having solved for {{mvar|t}}, we can plug back into the equations above to find that

:<math>\begin{align}
\log \left (pe^{(1-q)t} + (1-p)e^{-qt} \right ) &= \log \left ( e^{-qt}(1-p+pe^t) \right ) \\
&= \log\left (e^{-q \log\left(\frac{(1-p)q}{(1-q)p}\right)}\right) + \log\left(1-p+pe^{\log\left(\frac{1-p}{1-q}\right)}e^{\log\frac{q}{p}}\right ) \\
&= -q\log\frac{1-p}{1-q} -q \log\frac{q}{p} + \log\left(1-p+ p\left(\frac{1-p}{1-q}\right)\frac{q}{p}\right) \\
&= -q\log\frac{1-p}{1-q} -q \log\frac{q}{p} + \log\left(\frac{(1-p)(1-q)}{1-q}+\frac{(1-p)q}{1-q}\right) \\
&= -q \log\frac{q}{p} + \left ( -q\log\frac{1-p}{1-q} + \log\frac{1-p}{1-q} \right ) \\
&= -q\log\frac{q}{p} + (1-q)\log\frac{1-p}{1-q} \\
&= -D(q \| p).
\end{align}</math>

We now have our desired result, that


:<math>\Pr \left (\tfrac{1}{n}\sum X_i \ge p + \varepsilon\right ) \le e^{-D(p+\varepsilon\|p) n}.</math>


To complete the proof for the symmetric case, we simply define the random variable {{math|''Y<sub>i</sub>'' {{=}} 1 − ''X<sub>i</sub>''}}, apply the same proof, and plug it into our bound.
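
For a concrete sense of how tight the bound is, it can be compared with the exact binomial tail probability; the following Python snippet (parameter values chosen only for illustration) does this for {{math|''n'' {{=}} 100}}, {{math|''p'' {{=}} 0.5}} and {{math|''ε'' {{=}} 0.1}}:

<syntaxhighlight lang="python">
import math

def kl(q: float, p: float) -> float:
    """Kullback-Leibler divergence D(q || p) between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def exact_upper_tail(n: int, p: float, q: float) -> float:
    """Pr((1/n) * sum(X_i) >= q) for i.i.d. X_i ~ Bernoulli(p), computed exactly."""
    k0 = math.ceil(n * q)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

n, p, eps = 100, 0.5, 0.1
q = p + eps
print(f"exact tail probability   = {exact_upper_tail(n, p, q):.6f}")  # ~ 0.028
print(f"Chernoff-Hoeffding bound = {math.exp(-kl(q, p) * n):.6f}")    # ~ 0.133
</syntaxhighlight>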


==== Simpler bounds ====
A simpler bound follows by relaxing the theorem using {{math|''D''(''p'' + ''x'' {{!!}} ''p'') ≥ 2''x''<sup>2</sup>}}, which follows from the [[Convex function|convexity]] of {{math|''D''(''p'' + ''x'' {{!!}} ''p'')}} and the fact that

:<math>\frac{d^2}{dx^2} D(p+x\|p) = \frac{1}{(p+x)(1-p-x)}\geq 4=\frac{d^2}{dx^2}(2x^2).</math>
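
Combined with the theorem above, this relaxation yields the Hoeffding-type estimates

:<math>\Pr \left (\tfrac{1}{n}\sum X_i \ge p + \varepsilon\right ) \le e^{-2 \varepsilon^2 n}, \qquad \Pr \left (\tfrac{1}{n}\sum X_i \le p - \varepsilon\right ) \le e^{-2 \varepsilon^2 n}.</math>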


This result is a special case of [[Hoeffding's inequality]]. Sometimes, the bound

:<math>D( (1+x) p \| p) \geq \tfrac{1}{4} x^2 p, \qquad -\tfrac{1}{2} \leq x \leq \tfrac{1}{2},</math>

which is stronger for {{math|''p'' < {{sfrac|1|8}}}}, is also used.
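
In the same way, combining the latter bound with the theorem gives, for {{math|0 < ''x'' ≤ 1/2}},

:<math>\Pr \left (\tfrac{1}{n}\sum X_i \ge (1+x) p \right ) \le e^{-\frac{x^2 \mu}{4}}, \qquad \mu = np ,</math>

a multiplicative-error bound of the kind discussed in the next section.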

=== Theorem for multiplicative form of Chernoff bound (relative error) ===
:'''Multiplicative Chernoff Bound.''' Suppose {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} are [[Statistical independence|independent]] random variables taking values in {{math|{0, 1}.}} Let {{mvar|X}} denote their sum and set {{math|Pr(''X<sub>i</sub>'' {{=}} 1) {{=}} ''p<sub>i</sub>''}}. If {{math|''μ'' {{=}} E[''X'']}}, then for any {{math|''δ'' > 0}}:
::<math>\Pr ( X > (1+\delta)\mu) < \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^\mu.</math>
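
For example, if {{math|''μ'' {{=}} 100}} and {{math|''δ'' {{=}} 0.5}}, the theorem gives {{math|Pr(''X'' > 150) < (''e''<sup>0.5</sup>/1.5<sup>1.5</sup>)<sup>100</sup> ≈ 2 × 10<sup>−5</sup>}}.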


==== Proof ====
According to ({{EquationNote|1}}),


::<math>\begin{align}
\Pr (X > (1 + \delta)\mu) &\le \inf_{t > 0} \frac{\mathrm{E}\left[\prod_{i=1}^n\exp(tX_i)\right]}{\exp(t(1+\delta)\mu)}\\
& = \inf_{t > 0} \frac{\prod_{i=1}^n\mathrm{E}\left [e^{tX_i} \right]}{\exp(t(1+\delta)\mu)} \\
& = \inf_{t > 0} \frac{\prod_{i=1}^n\left[p_ie^t + (1-p_i)\right]}{\exp(t(1+\delta)\mu)}
\end{align}</math>


The third line above follows because <math>e^{tX_i}</math> takes the value {{mvar|e<sup>t</sup>}} with probability {{mvar|p<sub>i</sub>}} and the value 1 with probability {{math|1 − ''p<sub>i</sub>''}}. This is identical to the calculation above in the proof of the [[#Theorem for additive form (absolute error)|Theorem for additive form (absolute error)]].


Rewriting <math>p_ie^t + (1-p_i)</math> as <math>p_i(e^t-1) + 1</math> and recalling that <math>1+x \le e^x</math> (with strict inequality if {{math|''x'' > 0}}), we set <math>x = p_i(e^t-1)</math>. The same result can be obtained by directly replacing {{mvar|a}} in the equation for the Chernoff bound with {{math|(1 + ''δ'')''μ''}}.<ref>Refer to the proof above</ref>


Thus,

:<math>\Pr(X > (1+\delta)\mu) < \frac{\prod_{i=1}^n\exp(p_i(e^t-1))}{\exp(t(1+\delta)\mu)} = \frac{\exp\left((e^t-1)\sum_{i=1}^n p_i\right)}{\exp(t(1+\delta)\mu)} = \frac{\exp((e^t-1)\mu)}{\exp(t(1+\delta)\mu)}.</math>


If we simply set {{math|''t'' {{=}} log(1 + ''δ'')}} so that {{math|''t'' > 0}} for {{math|''δ'' > 0}}, we can substitute and find


:<math>\frac{\exp((e^t-1)\mu)}{\exp(t(1+\delta)\mu)} = \frac{\exp((1+\delta - 1)\mu)}{(1+\delta)^{(1+\delta)\mu}} = \left[\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right]^\mu</math>


This proves the desired result. A similar proof strategy can be used to show that

:<math>\Pr(X < (1-\delta)\mu) < \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^\mu, \qquad 0 < \delta < 1.</math>

The above formula is often unwieldy in practice,[3] so the following looser but more convenient bounds are often used:

:<math>\Pr( X \ge (1+\delta)\mu) \le e^{-\frac{\delta^2\mu}{3}}, \qquad 0 < \delta \le 1,</math>
:<math>\Pr( X \le (1-\delta)\mu) \le e^{-\frac{\delta^2\mu}{2}}, \qquad 0 < \delta < 1.</math>

=== Better Chernoff bounds for some special cases ===
We can obtain stronger bounds using simpler proof techniques for some special cases of symmetric random variables.

Suppose {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} are independent random variables, and let {{mvar|X}} denote their sum.

* If <math>\Pr(X_i = 1) = \Pr(X_i = -1) = \tfrac{1}{2}</math>, then for any {{math|''a'' > 0}},
::<math>\Pr( X \ge a) \le e^{-\frac{a^2}{2n}}</math>
:and therefore also
::<math>\Pr( |X| \ge a) \le 2e^{-\frac{a^2}{2n}}.</math>


* If <math>\Pr(X_i = 1) = \Pr(X_i = 0) = \tfrac{1}{2}</math>, so that <math>\mathrm{E}[X] = \mu = \tfrac{n}{2}</math>, then
::<math>\Pr( X \ge \mu+a) \le e^{\frac{-2a^2}{n}}, \qquad a > 0,</math>
::<math>\Pr( X \le \mu-a) \le e^{\frac{-2a^2}{n}}, \qquad 0 < a < \mu.</math>
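
For example, for {{math|''n'' {{=}} 100}} tosses of a fair coin ({{math|''μ'' {{=}} 50}}), the probability of seeing at least 60 heads is at most {{math|''e''<sup>−2</sup> ≈ 0.14}}.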


==Applications of Chernoff bound==
Chernoff bounds have very useful applications in set balancing and packet routing in sparse networks.

The set balancing problem arises while designing statistical experiments. Typically, while designing a statistical experiment, given the features of each participant in the experiment, we need to know how to divide the participants into two disjoint groups such that each feature is roughly as balanced as possible between the two groups. Refer to this book section for more info on the problem.

Chernoff bounds are also used to obtain tight bounds for permutation routing problems which reduce network congestion while routing packets in sparse networks. Refer to this book section for a thorough treatment of the problem.

==Matrix Chernoff bound==
[[Rudolf Ahlswede]] and [[Andreas Winter]] introduced {{Harv| Ahlswede|Winter|2003}} a Chernoff bound for matrix-valued random variables.


If {{mvar|M}} is distributed according to some distribution over {{math|''d'' × ''d''}} matrices with zero mean, and if {{math|''M''<sub>1</sub>, ..., ''M<sub>t</sub>''}} are independent copies of {{mvar|M}}, then for any {{math|''ε'' > 0}},


:<math>\Pr\left( \left\| \frac{1}{t} \sum_{i=1}^t M_i - \mathrm{E}[M] \right\|_2 > \varepsilon \right) \leq d \exp \left( -C \frac{\varepsilon^2 t}{\gamma^2} \right).</math>

where <math>\| M \|_2 \leq \gamma</math> holds almost surely and {{math|''C'' > 0}} is an absolute constant.

Notice that the number of samples in the inequality depends logarithmically on {{mvar|d}}. In general, unfortunately, such a dependency is inevitable: take for example a diagonal random sign matrix of dimension {{mvar|d}}. The operator norm of the sum of {{mvar|t}} independent samples is precisely the maximum deviation among {{mvar|d}} independent random walks of length {{mvar|t}}. In order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that {{mvar|t}} should grow logarithmically with {{mvar|d}} in this scenario.[4]

The following theorem can be obtained by assuming {{mvar|M}} has low rank, in order to avoid the dependency on the dimensions.


===Theorem without the dependency on the dimensions===
Let {{math|0 < ''ε'' < 1}} and ''M'' be a random symmetric real matrix with <math>\| \mathrm{E}[M] \|_2 \leq 1 </math> and <math>\| M\|_2 \leq \gamma </math> almost surely. Assume that each element on the support of ''M'' has at most rank ''r''. Set
:<math> t = \Omega \left( \frac{\gamma\log (\gamma/\varepsilon^2)}{\varepsilon^2} \right).</math>
If <math> r \leq t </math> holds almost surely, then

:<math>\Pr\left( \left\| \frac{1}{t} \sum_{i=1}^t M_i - \mathrm{E}[M] \right\|_2 > \varepsilon \right) \leq \frac{1}{\mathbf{poly}(t)},</math>

where {{math|''M''<sub>1</sub>, ..., ''M<sub>t</sub>''}} are i.i.d. copies of {{mvar|M}}.

==References==

# Chernoff, Herman (2014). "A career in statistics". In Lin, Xihong; Genest, Christian; Banks, David L.; Molenberghs, Geert; Scott, David W.; Wang, Jane-Ling (eds.). ''Past, Present, and Future of Statistics''. CRC Press. p. 35. ISBN 9781482204964.
# Refer to the proof above.
# Mitzenmacher, Michael; Upfal, Eli (2005). ''Probability and Computing: Randomized Algorithms and Probabilistic Analysis''. Cambridge University Press. ISBN 0-521-83540-2.
# Magen, A.; Zouzias, A. (2011). "Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication". arXiv:1005.2724 [cs.DM].