Jump to content

Shannon's source coding theorem: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
improved latex formulas, also converted some to {{math|...}}
Line 8: Line 8:


== Statements ==
== Statements ==

''Source coding'' is a mapping from (a sequence of) symbols from an information [[Information theory#Source theory|source]] to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the binary bits (lossless source coding) or recovered within some distortion (lossy source coding). This is the concept behind [[data compression]].
''Source coding'' is a mapping from (a sequence of) symbols from an information [[Information theory#Source theory|source]] to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the binary bits (lossless source coding) or recovered within some distortion (lossy source coding). This is the concept behind [[data compression]].


=== Source coding theorem ===
=== Source coding theorem ===
In information theory, the '''source coding theorem''' (Shannon 1948)<ref name="Shannon"/> informally states that (MacKay 2003, pg. 81,<ref name="MacKay"/> Cover:Chapter 5<ref name="Cover"/>):


<blockquote>{{mvar|N}} [[Independent and identically distributed random variables|i.i.d.]] random variables each with [[Entropy (information theory)|entropy]] {{math|''H''(''X'')}} can be compressed into more than {{math|''N&thinsp;H''(''X'')}} [[bit]]s with negligible risk of information loss, as {{math|''N'' ∞}}; but conversely, if they are compressed into fewer than {{math|''N&thinsp;H''(''X'')}} bits it is virtually certain that information will be lost.</blockquote>
In information theory, the '''source coding theorem''' (Shannon 1948)<ref name="Shannon"/> informally states that:

:"''N'' [[Independent and identically distributed random variables|i.i.d.]] random variables each with [[Entropy (information theory)|entropy]] ''H(X)'' can be compressed into more than ''N H(X)'' [[bit]]s with negligible risk of information loss, as ''N'' tends to infinity; but conversely, if they are compressed into fewer than ''N H(X)'' bits it is virtually certain that information will be lost." (MacKay 2003, pg. 81,<ref name="MacKay"/> Cover:Chapter 5<ref name="Cover"/>).


=== Source coding theorem for symbol codes ===
=== Source coding theorem for symbol codes ===
Let {{math|Σ<sub>1</sub>, Σ<sub>2</sub>}} denote two finite alphabets and let {{math|Σ{{su|b=1|p=∗}}}} and {{math|Σ{{su|b=2|p=∗}}}} denote the [[Kleene star|set of all finite words]] from those alphabets (respectively).


Suppose that {{mvar|X}} is a random variable taking values in {{math|Σ<sub>1</sub>}} and let {{math|&thinsp;''f''&thinsp;}} be a [[Variable-length code#Uniquely decodable codes|uniquely decodable]] code from {{math|Σ{{su|b=1|p=∗}}}} to {{math|Σ{{su|b=2|p=∗}}}} where {{math|{{!}}Σ<sub>2</sub>{{!}} {{=}} ''a''}}. Let {{mvar|S}} denote the random variable given by the word length {{math|&thinsp;''f''&thinsp;(''X'')}}.
Let <math>\Sigma_1</math>, <math>\Sigma_2</math> denote two finite alphabets and let <math>\Sigma_1^*</math> and <math>\Sigma_2^*</math> denote the [[Kleene star|set of all finite words]] from those alphabets (respectively).


If {{math|&thinsp;''f''&thinsp;}} is optimal in the sense that it has the minimal expected word length for {{mvar|X}}, then (Shannon 1948):
Suppose that ''X'' is a random variable taking values in <math>\Sigma_1</math> and let ''f'' be a [[Variable-length code#Uniquely decodable codes|uniquely decodable]] code from <math>\Sigma_1^*</math> to <math>\Sigma_2^*</math> where <math>|\Sigma_2|=a</math>. Let ''S'' denote the random variable given by the wordlength ''f(X)''.

If ''f'' is optimal in the sense that it has the minimal expected wordlength for ''X'', then


:<math> \frac{H(X)}{\log_2 a} \leq \mathbb{E}S < \frac{H(X)}{\log_2 a} +1 </math>
:<math> \frac{H(X)}{\log_2 a} \leq \mathbb{E}S < \frac{H(X)}{\log_2 a} +1 </math>
(Shannon 1948)


== Proof: Source coding theorem ==
== Proof: Source coding theorem ==
Given {{mvar|X}} is an [[independent identically-distributed random variables|i.i.d.]] source, its [[time series]] {{math|''X''<sub>1</sub>, ..., ''X<sub>n</sub>''}} is i.i.d. with [[entropy]] {{math|''H''(''X'')}} in the discrete-valued case and [[differential entropy]] in the continuous-valued case. The Source coding theorem states that for any {{math|''ε'' > 0}} for any [[information theory#Rate|rate]] larger than the [[entropy]] of the source, there is large enough {{mvar|n}} and an encoder that takes {{mvar|n}} i.i.d. repetition of the source, {{math|''X''<sup>1:''n''</sup>}}, and maps it to {{math|''n''(''H''(''X'') + ''ε'')}} binary bits such that the source symbols {{math|''X''<sup>1:''n''</sup>}} are recoverable from the binary bits with probability at least {{math|1 ''ε''}}.
Given <math>X</math> is an [[independent identically-distributed random variables|i.i.d.]] source, its [[time series]]
''X''<sub>1</sub>, ..., ''X''<sub>''n''</sub> is i.i.d. with [[entropy]] ''H''(''X'') in the discrete-valued case and [[differential entropy]] in the continuous-valued case. The Source coding theorem states that for any <math> \epsilon > 0 </math> for any [[information theory#Rate|rate]] larger than the [[entropy]] of the source, there is large enough <math> n </math> and an encoder that takes <math> n </math> i.i.d. repetition of the source, <math> X^{1:n} </math>, and maps it to <math> n.(H(X)+\epsilon) </math> binary bits such that the source symbols <math> X^{1:n} </math> are recoverable from the binary bits with probability at least <math> 1 - \epsilon </math>.


'''Proof of Achievability.''' Fix some {{math|''ε'' > 0}}, and let
''' Proof for achievability '''


Fix some <math> \epsilon > 0 </math>, and let <math>p(x_1, \ldots, x_n) = \Pr[X_1 = x_1, \ldots, X_n = x_n]</math>. The typical set,<math>A_n^\epsilon</math>, is defined as follows:
:<math>p(x_1, \ldots, x_n) = \Pr \left[X_1 = x_1, \cdots, X_n = x_n \right].</math>


The typical set, {{math|''A''{{su|b=''n''|p=''ε''}}}}, is defined as follows:
<math> A_n^\epsilon =\; \left\{(x_1, \ldots, x_n) : \left|-\frac{1}{n} \log p(x_1, \ldots, x_n) - H_n(X)\right|<\epsilon \right\}.</math>


:<math>A_n^\varepsilon =\left\{(x_1, \cdots, x_n) \ : \ \left|-\frac{1}{n} \log p(x_1, \cdots, x_n) - H_n(X)\right| < \varepsilon \right\}.</math>
The [[Asymptotic equipartition property#AEP for discrete-time i.i.d. sources|Asymptotic Equipartition Property]] (AEP) shows that for large enough ''n'', the probability that a sequence generated by the source lies in the typical set,<math>A_n^\epsilon</math>, as defined approaches one. In particular there for large enough ''n'', <math>P(A_n^\epsilon)>1-\epsilon</math> (See

The [[Asymptotic equipartition property#AEP for discrete-time i.i.d. sources|Asymptotic Equipartition Property]] (AEP) shows that for large enough {{mvar|n}}, the probability that a sequence generated by the source lies in the typical set, {{math|''A''{{su|b=''n''|p=''ε''}}}}, as defined approaches one. In particular there for large enough {{mvar|n}}, <math>P(A_n^\varepsilon)>1-\varepsilon</math> (See
[[Asymptotic equipartition property#AEP for discrete-time i.i.d. sources|AEP]] for a proof):
[[Asymptotic equipartition property#AEP for discrete-time i.i.d. sources|AEP]] for a proof):


The definition of typical sets implies that those sequences that lie in the typical set satisfy:
The definition of typical sets implies that those sequences that lie in the typical set satisfy:


:<math>2^{-n(H(X)+\varepsilon)} \leq p \left (x_1, \cdots, x_n \right ) \leq 2^{-n(H(X)-\varepsilon)}</math>
:<math>
2^{-n(H(X)+\epsilon)} \leq p(x_1, x_2, ..., x_n) \leq 2^{-n(H(X)-\epsilon)}
</math>


Note that:
Note that:


*The probability of a sequence from <math>X</math> being drawn from <math>A_n^\epsilon</math> is greater than <math>1-\epsilon</math>
*The probability of a sequence from {{mvar|X}} being drawn from {{math|''A''{{su|b=''n''|p=''ε''}}}} is greater than {{math|1 − ''ε''}}.


*<math>\left| A_n^\epsilon \right| \leq 2^{n(H(X)+\epsilon)}</math> since the probability of the whole set <math>A_n^\epsilon</math> is at most one.
*<math>\left| A_n^\varepsilon \right| \leq 2^{n(H(X)+\varepsilon)}</math> since the probability of the whole set {{math|''A''{{su|b=''n''|p=''ε''}}}} is at most one.


*<math>\left| A_n^\epsilon \right| \geq (1-\epsilon)2^{n(H(X)-\epsilon)}</math>. For the proof, use the upper bound on the probability of each term in typical set and the lower bound on the probability of the whole set <math>A_n^\epsilon</math>.
*<math>\left| A_n^\varepsilon \right| \geq (1-\varepsilon) 2^{n(H(X)-\varepsilon)}</math>. For the proof, use the upper bound on the probability of each term in typical set and the lower bound on the probability of the whole set {{math|''A''{{su|b=''n''|p=''ε''}}}}.


Since <math>\left| A_n^\epsilon \right| \leq 2^{n(H(X)+\epsilon)}, n.(H(X)+\epsilon)\; </math> bits are enough to point to any string in this set.
Since <math>\left| A_n^\varepsilon \right| \leq 2^{n(H(X)+\varepsilon)}, n.(H(X)+\varepsilon)</math> bits are enough to point to any string in this set.


The encoding algorithm: The encoder checks if the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary <math> n.(H(X)+\epsilon) </math> digit number. As long as the input sequence lies within the typical set (with probability at least <math>1-\epsilon</math>), the encoder doesn't make any error. So, the probability of error of the encoder is bounded above by <math>\epsilon</math>
The encoding algorithm: The encoder checks if the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary {{math|''n''(''H''(''X'') + ''ε'')}} digit number. As long as the input sequence lies within the typical set (with probability at least {{math|1 − ''ε''}}), the encoder doesn't make any error. So, the probability of error of the encoder is bounded above by {{mvar|ε}}.


'''Proof of Converse.''' The converse is proved by showing that any set of size smaller than {{math|''A''{{su|b=''n''|p=''ε''}}}} (in the sense of exponent) would cover a set of probability bounded away from {{math|1}}.
''' Proof for converse '''
The converse is proved by showing that any set of size smaller than <math>A_n^\epsilon</math> (in the sense of exponent) would cover a set of probability bounded away from 1.


== Proof: Source coding theorem for symbol codes ==
== Proof: Source coding theorem for symbol codes ==
Let <math>s_i</math> denote the wordlength of each possible <math>x_i</math> (<math>i=1,\ldots,n</math>). Define <math>q_i = a^{-s_i}/C</math>, where ''C'' is chosen so that <math> \sum q_i = 1</math>.
For {{math|1 ≤ ''i'' ≤ ''n''}} let {{math|''s<sub>i</sub>''}} denote the word length of each possible {{math|''x<sub>i</sub>''}}. Define <math>q_i = a^{-s_i}/C</math>, where {{mvar|C}} is chosen so that {{math|''q''<sub>1</sub> + ... + ''q<sub>n</sub>'' {{=}} 1}}. Then


:<math>\begin{align}
Then
H(X) &= -\sum_{i=1}^n p_i \log_2 p_i \\
&\leq -\sum_{i=1}^n p_i \log_2 q_i \\
&= -\sum_{i=1}^n p_i \log_2 a^{-s_i} + \sum_{i=1}^n p_i \log_2 C \\
&= -\sum_{i=1}^n p_i \log_2 a^{-s_i} + \log_2 C \\
&\leq -\sum_{i=1}^n - s_i p_i \log_2 a \\
&\leq \mathbb{E} S \log_2 a \\
\end{align}</math>


where the second line follows from [[Gibbs' inequality]] and the fifth line follows from [[Kraft's inequality]]:
:<math>
\begin{align}
H(X) &= - \sum_{i=1}^n p_i \log_2 p_i \\
&\leq - \sum_{i=1}^n p_i \log_2 q_i \\
&= - \sum_{i=1}^n p_i \log_2 a^{-s_i} + \sum_{i=1}^n p_i \log_2 C \\
&= - \sum_{i=1}^n p_i \log_2 a^{-s_i} + \log_2 C \\
&\leq - \sum_{i=1}^n - s_i p_i \log_2 a \\
&\leq \mathbb{E}S \log_2 a \\
\end{align}
</math>


where the second line follows from [[Gibbs' inequality]] and the fifth line follows from [[Kraft's inequality]]: <math>C = \sum_{i=1}^n a^{-s_i} \leq 1</math> so <math>\log C \leq 0</math>.
:<math>C = \sum_{i=1}^n a^{-s_i} \leq 1</math>

so {{math|log ''C'' ≤ 0}}.


For the second inequality we may set
For the second inequality we may set
Line 96: Line 90:
:<math> \sum a^{-s_i} \leq \sum p_i = 1</math>
:<math> \sum a^{-s_i} \leq \sum p_i = 1</math>


and so by Kraft's inequality there exists a prefix-free code having those wordlengths. Thus the minimal ''S'' satisfies
and so by Kraft's inequality there exists a prefix-free code having those word lengths. Thus the minimal {{mvar|S}} satisfies


:<math>
:<math>\begin{align}
\mathbb{E} S & = \sum p_i s_i \\
\begin{align}
\mathbb{E}S & = \sum p_i s_i \\
& < \sum p_i \left( -\log_a p_i +1 \right) \\
& < \sum p_i \left( -\log_a p_i +1 \right) \\
& = \sum - p_i \frac{\log_2 p_i}{\log_2 a} +1 \\
& = \sum - p_i \frac{\log_2 p_i}{\log_2 a} +1 \\
& = \frac{H(X)}{\log_2 a} +1. \\
& = \frac{H(X)}{\log_2 a} +1 \\
\end{align}
\end{align}</math>
</math>


==Extension to non-stationary independent sources ==
==Extension to non-stationary independent sources ==

=== Fixed Rate lossless source coding for discrete time non-stationary independent sources===
=== Fixed Rate lossless source coding for discrete time non-stationary independent sources===
Define typical set <math>A_n^\epsilon</math> as:
Define typical set {{math|''A''{{su|b=''n''|p=''ε''}}}} as:


: <math> A_n^\epsilon =\; \{x_1^n : \left|-\frac{1}{n} \log p(X_1, X_2, ..., X_n) - \bar{H_n}(X)\right|<\epsilon \}.</math>
:<math>A_n^\varepsilon = \left \{x_1^n \ : \ \left|-\frac{1}{n} \log p \left (X_1, \cdots, X_n \right ) - \overline{H_n}(X)\right| < \varepsilon \right \}.</math>


Then, for given <math>\delta>0</math>, for n large enough, <math>\Pr(A_n^\epsilon)> 1-\delta </math>. Now we just encode the sequences in the typical set, and usual methods in source coding show that the cardinality of this set is smaller than <math>2^{n(\bar{H_n}(X)+\epsilon)}</math>. Thus, on an average, <math>\bar{H_n}(X)+\epsilon</math> bits suffice for encoding with probability greater than <math>1-\delta</math>, where <math>\epsilon\; \mbox{ and }\; \delta</math> can be made arbitrarily small, by making n larger.
Then, for given {{math|''δ'' > 0}}, for {{mvar|n}} large enough, {{math|Pr(''A''{{su|b=''n''|p=''ε''}}) > 1 − ''δ''}}. Now we just encode the sequences in the typical set, and usual methods in source coding show that the cardinality of this set is smaller than <math>2^{n(\overline{H_n}(X)+\varepsilon)}</math>. Thus, on an average, {{math|{{overline|''H<sub>n</sub>''}}(''X'') + ''ε''}} bits suffice for encoding with probability greater than {{math|1 − ''δ''}}, where {{mvar|ε}} and {{mvar|δ}} can be made arbitrarily small, by making {{mvar|n}} larger.


==See also==
==See also==
Line 126: Line 117:
<ref name="Shannon">[[C.E. Shannon]], "[http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf A Mathematical Theory of Communication]", ''[[Bell System Technical Journal]]'', vol. 27, pp.&nbsp;379–423, 623-656, July, October, 1948</ref>
<ref name="Shannon">[[C.E. Shannon]], "[http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf A Mathematical Theory of Communication]", ''[[Bell System Technical Journal]]'', vol. 27, pp.&nbsp;379–423, 623-656, July, October, 1948</ref>
<ref name="MacKay">David J. C. MacKay. ''[http://www.inference.phy.cam.ac.uk/mackay/itila/book.html Information Theory, Inference, and Learning Algorithms]'' Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1</ref>
<ref name="MacKay">David J. C. MacKay. ''[http://www.inference.phy.cam.ac.uk/mackay/itila/book.html Information Theory, Inference, and Learning Algorithms]'' Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1</ref>
<ref name="Cover">{{cite book | last = Cover | first = Thomas M. | title = Elements of Information Theory | chapter = Chapter 5: Data Compression | year = 2006 | publisher = John Wiley & Sons | isbn = 0-471-24195-4 }}</ref>}}
<ref name="Cover">{{cite book
| last = Cover
| first = Thomas M.
| title = Elements of Information Theory
| chapter = Chapter 5: Data Compression
| year = 2006
| publisher = John Wiley & Sons
| isbn = 0-471-24195-4 }}</ref>
}}


[[Category:Information theory]]
[[Category:Information theory]]

Revision as of 05:38, 24 April 2014

In information theory, Shannon's source coding theorem (or noiseless coding theorem) establishes the limits to possible data compression, and the operational meaning of the Shannon entropy.

The source coding theorem shows that (in the limit, as the length of a stream of independent and identically-distributed random variable (i.i.d.) data tends to infinity) it is impossible to compress the data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source, without it being virtually certain that information will be lost. However it is possible to get the code rate arbitrarily close to the Shannon entropy, with negligible probability of loss.

The source coding theorem for symbol codes places an upper and a lower bound on the minimal possible expected length of codewords as a function of the entropy of the input word (which is viewed as a random variable) and of the size of the target alphabet.

Statements

Source coding is a mapping from (a sequence of) symbols from an information source to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the binary bits (lossless source coding) or recovered within some distortion (lossy source coding). This is the concept behind data compression.

Source coding theorem

In information theory, the source coding theorem (Shannon 1948)[1] informally states that (MacKay 2003, pg. 81,[2] Cover:Chapter 5[3]):

N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; but conversely, if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.

Source coding theorem for symbol codes

Let Σ1, Σ2 denote two finite alphabets and let Σ
1
and Σ
2
denote the set of all finite words from those alphabets (respectively).

Suppose that X is a random variable taking values in Σ1 and let f be a uniquely decodable code from Σ
1
to Σ
2
where 2| = a. Let S denote the random variable given by the word length f (X).

If f is optimal in the sense that it has the minimal expected word length for X, then (Shannon 1948):

Proof: Source coding theorem

Given X is an i.i.d. source, its time series X1, ..., Xn is i.i.d. with entropy H(X) in the discrete-valued case and differential entropy in the continuous-valued case. The Source coding theorem states that for any ε > 0 for any rate larger than the entropy of the source, there is large enough n and an encoder that takes n i.i.d. repetition of the source, X1:n, and maps it to n(H(X) + ε) binary bits such that the source symbols X1:n are recoverable from the binary bits with probability at least 1 − ε.

Proof of Achievability. Fix some ε > 0, and let

The typical set, Aε
n
, is defined as follows:

The Asymptotic Equipartition Property (AEP) shows that for large enough n, the probability that a sequence generated by the source lies in the typical set, Aε
n
, as defined approaches one. In particular there for large enough n, (See AEP for a proof):

The definition of typical sets implies that those sequences that lie in the typical set satisfy:

Note that:

  • The probability of a sequence from X being drawn from Aε
    n
    is greater than 1 − ε.
  • since the probability of the whole set Aε
    n
    is at most one.
  • . For the proof, use the upper bound on the probability of each term in typical set and the lower bound on the probability of the whole set Aε
    n
    .

Since bits are enough to point to any string in this set.

The encoding algorithm: The encoder checks if the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary n(H(X) + ε) digit number. As long as the input sequence lies within the typical set (with probability at least 1 − ε), the encoder doesn't make any error. So, the probability of error of the encoder is bounded above by ε.

Proof of Converse. The converse is proved by showing that any set of size smaller than Aε
n
(in the sense of exponent) would cover a set of probability bounded away from 1.

Proof: Source coding theorem for symbol codes

For 1 ≤ in let si denote the word length of each possible xi. Define , where C is chosen so that q1 + ... + qn = 1. Then

where the second line follows from Gibbs' inequality and the fifth line follows from Kraft's inequality:

so log C ≤ 0.

For the second inequality we may set

so that

and so

and

and so by Kraft's inequality there exists a prefix-free code having those word lengths. Thus the minimal S satisfies

Extension to non-stationary independent sources

Fixed Rate lossless source coding for discrete time non-stationary independent sources

Define typical set Aε
n
as:

Then, for given δ > 0, for n large enough, Pr(Aε
n
) > 1 − δ
. Now we just encode the sequences in the typical set, and usual methods in source coding show that the cardinality of this set is smaller than . Thus, on an average, Hn(X) + ε bits suffice for encoding with probability greater than 1 − δ, where ε and δ can be made arbitrarily small, by making n larger.

See also

References

  1. ^ C.E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, vol. 27, pp. 379–423, 623-656, July, October, 1948
  2. ^ David J. C. MacKay. Information Theory, Inference, and Learning Algorithms Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1
  3. ^ Cover, Thomas M. (2006). "Chapter 5: Data Compression". Elements of Information Theory. John Wiley & Sons. ISBN 0-471-24195-4.