
Channel Coding: The Road to Channel Capacity

2007, Proceedings of The IEEE


Channel Coding: The Road to Channel Capacity

Daniel J. Costello, Jr., Fellow, IEEE, and G. David Forney, Jr., Fellow, IEEE

arXiv:cs/0611112v1 [cs.IT] 22 Nov 2006. Submitted to the Proceedings of the IEEE; first revision, November 2006.

(Daniel J. Costello, Jr., is with the University of Notre Dame, IN 46556, USA, e-mail: costello.2@nd.edu. G. David Forney, Jr., is with the Massachusetts Institute of Technology, Cambridge, MA 02139, USA, e-mail: forney@mit.edu. This work was supported in part by NSF Grant CCR02-05310 and NASA Grant NNGO5GH73G.)

Abstract: Starting from Shannon's celebrated 1948 channel coding theorem, we trace the evolution of channel coding from Hamming codes to capacity-approaching codes. We focus on the contributions that have led to the most significant improvements in performance vs. complexity for practical applications, particularly on the additive white Gaussian noise (AWGN) channel. We discuss algebraic block codes, and why they did not prove to be the way to get to the Shannon limit. We trace the antecedents of today's capacity-approaching codes: convolutional codes, concatenated codes, and other probabilistic coding schemes. Finally, we sketch some of the practical applications of these codes.

Index Terms: Channel coding, algebraic block codes, convolutional codes, concatenated codes, turbo codes, low-density parity-check codes, codes on graphs.

I. INTRODUCTION

The field of channel coding started with Claude Shannon's 1948 landmark paper [1]. For the next half century, its central objective was to find practical coding schemes that could approach channel capacity (hereafter called "the Shannon limit") on well-understood channels such as the additive white Gaussian noise (AWGN) channel. This goal proved to be challenging, but not impossible. In the past decade, with the advent of turbo codes and the rebirth of low-density parity-check codes, it has finally been achieved, at least in many cases of practical interest.

As Bob McEliece observed in his 2004 Shannon Lecture [2], the extraordinary efforts that were required to achieve this objective may not be fully appreciated by future historians. McEliece imagined a biographical note in the 166th edition of the Encyclopedia Galactica along the following lines:

    Claude Shannon: Born on the planet Earth (Sol III) in the year 1916 A.D. Generally regarded as the father of the Information Age, he formulated the notion of channel capacity in 1948 A.D. Within several decades, mathematicians and engineers had devised practical ways to communicate reliably at data rates within 1% of the Shannon limit . . .

The purpose of this paper is to tell the story of how Shannon's challenge was met, at least as it appeared to us, before the details of this story are lost to memory. We focus on the AWGN channel, which was the target for many of these efforts.

In Section II, we review various definitions of the Shannon limit for this channel.

In Section III, we discuss the subfield of algebraic coding, which dominated the channel coding field for its first couple of decades. We will discuss both the achievements of algebraic coding, and also the reasons why it did not prove to be the way to approach the Shannon limit.

In Section IV, we discuss the alternative line of development that was inspired more directly by Shannon's random coding approach, which is sometimes called "probabilistic coding." The first major contribution to this area after Shannon was Elias' invention of convolutional codes.
This line of development includes product codes, concatenated codes, trellis decoding of block codes, and ultimately modern capacity-approaching codes.

In Section V, we discuss codes for bandwidth-limited channels, namely lattice codes and trellis-coded modulation.

Finally, in Section VI, we discuss the development of capacity-approaching codes, principally turbo codes and low-density parity-check (LDPC) codes.

II. CODING FOR THE AWGN CHANNEL

A coding scheme for the AWGN channel may be characterized by two simple parameters: its signal-to-noise ratio (SNR) and its spectral efficiency η in bits per second per Hertz (b/s/Hz). The SNR is the ratio of average signal power to average noise power, a dimensionless quantity. The spectral efficiency of a coding scheme that transmits R bits per second (b/s) over an AWGN channel of bandwidth W Hz is simply η = R/W b/s/Hz.

Coding schemes for the AWGN channel typically map a sequence of bits at a rate R b/s to a sequence of real symbols at a rate of 2B symbols per second; the discrete-time code rate is then r = R/2B bits per symbol. The sequence of real symbols is then modulated via pulse amplitude modulation (PAM) or quadrature amplitude modulation (QAM) for transmission over an AWGN channel of bandwidth W. By Nyquist theory, B (sometimes called the "Shannon bandwidth" [3]) cannot exceed the actual bandwidth W. If B ≈ W, then the spectral efficiency is η = R/W ≈ R/B = 2r. We therefore say that the nominal spectral efficiency of a discrete-time coding scheme is 2r, the discrete-time code rate in bits per two symbols. The actual spectral efficiency η = R/W of the corresponding continuous-time scheme is upperbounded by the nominal spectral efficiency 2r, and approaches 2r as B → W. Thus, for discrete-time codes, we will often denote 2r by η, implicitly assuming B ≈ W.

Shannon showed that on an AWGN channel with signal-to-noise ratio SNR and bandwidth W Hz, the rate of reliable transmission is upperbounded by

   R < W log2(1 + SNR).

Moreover, if a long code with rate R < W log2(1 + SNR) is chosen at random, then there exists a decoding scheme such that with high probability the code and decoder will achieve highly reliable transmission (i.e., low probability of decoding error).

Equivalently, Shannon's result shows that the spectral efficiency is upperbounded by η < log2(1 + SNR); or, given a spectral efficiency η, that the SNR needed for reliable transmission is lowerbounded by SNR > 2^η − 1. So we may say that the Shannon limit on rate (i.e., the channel capacity) is W log2(1 + SNR) b/s, or equivalently that the Shannon limit on spectral efficiency is log2(1 + SNR) b/s/Hz, or equivalently that the Shannon limit on SNR for a given spectral efficiency η is 2^η − 1. Note that the Shannon limit on SNR is a lower bound rather than an upper bound.

These bounds suggest that we define a normalized SNR parameter SNRnorm as follows:

   SNRnorm = SNR / (2^η − 1).

Then for any reliable coding scheme, SNRnorm > 1; i.e., the Shannon limit (lower bound) on SNRnorm is 1 (0 dB), independent of η. Moreover, SNRnorm measures the "gap to capacity"; i.e., 10 log10 SNRnorm is the difference in decibels (dB) between the SNR actually used and the Shannon limit on SNR given η, namely 2^η − 1. (Footnote 1: In decibels, a multiplicative factor of α is expressed as 10 log10 α dB.)
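To make these definitions concrete, the following minimal Python sketch (not from the paper; the function names are illustrative) computes the Shannon limit on SNR for a given spectral efficiency and the resulting gap to capacity 10 log10 SNRnorm.

    import math

    def shannon_limit_snr(eta):
        """Minimum SNR (linear) for reliable transmission at spectral efficiency eta (b/s/Hz)."""
        return 2.0 ** eta - 1.0

    def snr_norm_db(snr_db, eta):
        """Gap to capacity in dB: 10*log10(SNRnorm) = SNR (dB) minus the Shannon limit on SNR (dB)."""
        snr = 10.0 ** (snr_db / 10.0)
        return 10.0 * math.log10(snr / shannon_limit_snr(eta))

    # Example: a scheme operating at SNR = 15 dB with spectral efficiency 4 b/s/Hz.
    # The Shannon limit on SNR is 2^4 - 1 = 15 (11.76 dB), so the gap to capacity is about 3.2 dB.
    print(round(snr_norm_db(15.0, 4.0), 2))   # 3.24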
If the desired spectral efficiency is less than 1 b/s/Hz (the so-called power-limited regime), then it can be shown that binary codes can be used on the AWGN channel with a cost in the Shannon limit on SNR of less than 0.2 dB. On the other hand, since for a binary coding scheme the discrete-time code rate is bounded by r ≤ 1 bit per symbol, the spectral efficiency of a binary coding scheme is limited to η ≤ 2r ≤ 2 b/s/Hz, so multilevel coding schemes must be used if the desired spectral efficiency is greater than 2 b/s/Hz (the so-called bandwidth-limited regime). In practice, coding schemes for the power-limited and bandwidth-limited regimes differ considerably.

A closely related normalized SNR parameter that has been traditionally used in the power-limited regime is Eb/N0, which may be defined as

   Eb/N0 = SNR/η = SNRnorm (2^η − 1)/η.

For a given spectral efficiency η, Eb/N0 is thus lowerbounded by

   Eb/N0 > (2^η − 1)/η,

so we may say that the Shannon limit (lower bound) on Eb/N0 as a function of η is (2^η − 1)/η. This function decreases monotonically with η, and approaches ln 2 as η → 0, so we may say that the ultimate Shannon limit (lower bound) on Eb/N0 for any η is ln 2 (−1.59 dB).

We see that as η → 0, Eb/N0 → SNRnorm ln 2, so Eb/N0 and SNRnorm become equivalent parameters in the severely power-limited regime. In the power-limited regime, we will therefore use the traditional parameter Eb/N0. However, in the bandwidth-limited regime, we will use SNRnorm, which is more informative in this regime.

III. ALGEBRAIC CODING

The algebraic coding paradigm dominated the first several decades of the field of channel coding. Indeed, most of the textbooks on coding of this period (including Peterson [4], Berlekamp [5], Lin [6], Peterson and Weldon [7], MacWilliams and Sloane [8], and Blahut [9]) covered only algebraic coding theory.

Algebraic coding theory is primarily concerned with linear (n, k, d) block codes over the binary field F2. A binary linear (n, k, d) block code consists of 2^k binary n-tuples, called codewords, which have the group property: i.e., the componentwise mod-2 sum of any two codewords is another codeword. The parameter d denotes the minimum Hamming distance between any two distinct codewords, i.e., the minimum number of coordinates in which any two codewords differ. The theory generalizes to linear (n, k, d) block codes over nonbinary fields Fq.

The principal objective of algebraic coding theory is to maximize the minimum distance d for a given (n, k). The motivation for this objective is to maximize error-correction power. Over a binary symmetric channel (BSC: a binary-input, binary-output channel with statistically independent binary errors), the optimum decoding rule is to decode to the codeword closest in Hamming distance to the received n-tuple. With this rule, a code with minimum distance d can correct all patterns of (d − 1)/2 or fewer channel errors (assuming that d is odd), but cannot correct some patterns containing a greater number of errors.

The field of algebraic coding theory has had many successes, which we will briefly survey below. However, even though binary algebraic block codes can be used on the AWGN channel, they have not proved to be the way to approach channel capacity on this channel, even in the power-limited regime. Indeed, they have not proved to be the way to approach channel capacity even on the BSC. As we proceed, we will discuss some of the fundamental reasons for this failure.
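As a concrete illustration of the minimum-distance objective, here is a minimal Python sketch (not from the paper) that computes the minimum Hamming distance of a small binary linear code by brute force, and the guaranteed error-correction radius t = (d − 1)/2. The generator matrix shown is one standard systematic form of the (7, 4, 3) Hamming code discussed in Section III-B below; the function names are illustrative.

    from itertools import product

    # One systematic generator matrix for the (7, 4, 3) Hamming code of Section III-B.
    G = [
        [1, 0, 0, 0, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 1],
        [0, 0, 1, 0, 1, 1, 0],
        [0, 0, 0, 1, 1, 1, 1],
    ]

    def encode(msg, G):
        """Mod-2 matrix multiply: k message bits -> n code bits."""
        n = len(G[0])
        return [sum(m * row[j] for m, row in zip(msg, G)) % 2 for j in range(n)]

    # For a linear code, the minimum distance equals the minimum weight of a nonzero codeword.
    weights = [sum(encode(msg, G)) for msg in product([0, 1], repeat=4) if any(msg)]
    d = min(weights)
    t = (d - 1) // 2
    print(d, t)   # 3, 1: the code corrects all single errors, but not all double errors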
A. Binary coding on the power-limited AWGN channel

A binary linear (n, k, d) block code may be used on a Gaussian channel as follows.

To transmit a codeword, each of its n binary symbols may be mapped into the two symbols {±α} of a binary pulse-amplitude-modulation (2-PAM) alphabet, yielding a two-valued real n-tuple x. This n-tuple may then be sent through a channel of bandwidth W at a symbol rate 2B up to the Nyquist limit of 2W binary symbols per second, using standard pulse amplitude modulation (PAM) for baseband channels, or quadrature amplitude modulation (QAM) for passband channels.

At the receiver, an optimum PAM or QAM detector can produce a real-valued n-tuple y = x + n, where x is the transmitted sequence and n is a discrete-time white Gaussian noise sequence. The optimum (maximum likelihood) decision rule is then to choose the one of the 2^k possible transmitted sequences x that is closest to the received sequence y in Euclidean distance.

If the symbol rate 2B approaches the Nyquist limit of 2W symbols per second, then the transmitted data rate can approach R = (k/n)2W b/s, so the spectral efficiency of such a binary coding scheme can approach η = 2k/n b/s/Hz. As mentioned previously, since k/n ≤ 1, we have η ≤ 2 b/s/Hz; i.e., binary coding cannot be used in the bandwidth-limited regime.

With no coding (independent transmission of random bits via PAM or QAM), the transmitted data rate is 2W b/s, so the nominal spectral efficiency is η = 2 b/s/Hz. It is straightforward to show that with optimum modulation and detection the probability of error per bit is

   Pb(E) = Q(√SNR) = Q(√(2Eb/N0)),

where

   Q(x) = (1/√(2π)) ∫_x^∞ e^(−y²/2) dy

is the Gaussian probability of error function. This baseline performance curve of Pb(E) vs. Eb/N0 for uncoded transmission is plotted in Figure 1. For example, in order to achieve a bit error probability of Pb(E) ≈ 10^−5, we must have Eb/N0 ≈ 9.1 (9.6 dB) for uncoded transmission.

On the other hand, the Shannon limit on Eb/N0 for η = 2 is Eb/N0 = 1.5 (1.76 dB), so the gap to capacity at the uncoded binary-PAM spectral efficiency of η = 2 is SNRnorm ≈ 7.8 dB. If a coding scheme with unlimited bandwidth expansion were allowed, i.e., η → 0, then a further gain of 3.35 dB to the ultimate Shannon limit on Eb/N0 of −1.59 dB would be achievable. These two limits are also shown on Figure 1.

The performance curve of any practical coding scheme that improves on uncoded transmission must lie between the relevant Shannon limit and the uncoded performance curve. Thus Figure 1 defines the "playing field" for channel coding. The real coding gain of a coding scheme at a given probability of error per bit Pb(E) will be defined as the difference (in dB) between the Eb/N0 required to obtain that Pb(E) with coding vs. without coding. Thus the maximum possible real coding gain at Pb(E) ≈ 10^−5 is about 11.2 dB.

For moderate-complexity binary linear (n, k, d) codes, it can often be assumed that the decoding error probability is dominated by the probability of making an error to one of the nearest-neighbor codewords. If this assumption holds, then it is easy to show that with optimum (minimum-Euclidean-distance) decoding, the decoding error probability PB(E) per block is well approximated by the union bound estimate

   PB(E) ≈ Nd Q(√(d · SNR)) = Nd Q(√((dk/n) 2Eb/N0)),

where Nd denotes the number of codewords of Hamming weight d. (Footnote 2: The probability of error per information bit is not in general the same as the bit error probability, i.e., the average number of bit errors per transmitted bit, although both normally have the same exponent, i.e., the argument of the Q function.)
The probability of decoding error per information bit Pb(E) is then given by

   Pb(E) = PB(E)/k ≈ (Nd/k) Q(√((dk/n) 2Eb/N0)) = (Nd/k) Q(√(γc 2Eb/N0)),

where the quantity γc = dk/n is called the nominal coding gain of the code. The real coding gain is less than the nominal coding gain γc if the "error coefficient" Nd/k is greater than 1. A rule of thumb that is valid when the error coefficient Nd/k is not too large and Pb(E) is on the order of 10^−6 is that a factor of 2 increase in the error coefficient costs about 0.2 dB of real coding gain. As Pb(E) → 0, the real coding gain approaches the nominal coding gain γc, so γc is also called the asymptotic coding gain.

(Footnote 3: An (n, k, d) code with odd minimum distance d may be extended by addition of an overall parity-check to an (n + 1, k, d + 1) even-minimum-distance code. For error correction, such an extension is of no use, since the extended code corrects no more errors but has a lower code rate; but for an AWGN channel, such an extension always helps (unless k = 1 and d = n), since the nominal coding gain γc = dk/n increases. Thus an author who discusses odd-distance codes is probably thinking about minimum-Hamming-distance decoding, whereas an author who discusses even-distance codes is probably thinking about minimum-Euclidean-distance decoding.)

Fig. 1. Pb(E) vs. Eb/N0 for uncoded binary PAM, compared to Shannon limits on Eb/N0 for η = 2 and η → 0.

For example, consider the binary linear (32, 6, 16) "biorthogonal" block code, so called because the Euclidean images of the 64 codewords consist of 32 orthogonal vectors and their negatives. With this code, every codeword has Nd = 62 nearest neighbors at minimum Hamming distance d = 16. Its nominal spectral efficiency is η = 3/8, its nominal coding gain is γc = 3 (4.77 dB), and its probability of decoding error per information bit is

   Pb(E) ≈ (62/6) Q(√(6 Eb/N0)),

which is plotted in Figure 2. We see that this code requires Eb/N0 ≈ 5.8 dB to achieve Pb(E) ≈ 10^−5, so its real coding gain at this error probability is about 3.8 dB.

At this point, we can already identify two issues that must be addressed to approach the Shannon limit on AWGN channels. First, in order to obtain optimum performance, the decoder must operate on the real-valued received sequence y ("soft decisions") and minimize Euclidean distance, rather than quantize to a two-level received sequence ("hard decisions") and minimize Hamming distance. It can be shown that hard decisions (two-level quantization) generally cost 2 to 3 dB in decoding performance. Thus, in order to approach the Shannon limit on an AWGN channel, the error-correction paradigm of algebraic coding must be modified to accommodate soft decisions.

Second, we can see already that decoding complexity is going to be an issue. For optimum decoding, soft-decision or hard, the decoder must choose the best of 2^k codewords, so a straightforward exhaustive optimum decoding algorithm will require on the order of 2^k computations. Thus, as codes become large, lower-complexity decoding algorithms that approach optimum performance must be devised.
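The numbers quoted above can be reproduced directly from the union bound estimate. The following Python sketch (not from the paper; names are illustrative, and Eb/N0 is simply scanned in 0.01 dB steps) recovers the roughly 9.6 dB required by uncoded binary PAM and the roughly 5.8 dB required by the (32, 6, 16) biorthogonal code at Pb(E) ≈ 10^−5.

    import math

    def Q(x):
        """Gaussian tail probability Q(x)."""
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    def pb_uncoded(ebno_db):
        ebno = 10.0 ** (ebno_db / 10.0)
        return Q(math.sqrt(2.0 * ebno))

    def pb_ube(ebno_db, n, k, d, Nd):
        """Union bound estimate Pb(E) ~ (Nd/k) Q(sqrt(gamma_c * 2 Eb/N0)), gamma_c = d*k/n."""
        ebno = 10.0 ** (ebno_db / 10.0)
        gamma_c = d * k / n
        return (Nd / k) * Q(math.sqrt(gamma_c * 2.0 * ebno))

    def required_ebno_db(pb_func, target=1e-5):
        """Smallest Eb/N0 (dB, 0.01 dB steps) at which pb_func drops to the target error rate."""
        x = -2.0
        while pb_func(x) > target:
            x += 0.01
        return x

    print(required_ebno_db(pb_uncoded))                            # ~9.6 dB
    print(required_ebno_db(lambda x: pb_ube(x, 32, 6, 16, 62)))    # ~5.8 dB, i.e. ~3.8 dB real coding gain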
Fig. 2. Pb(E) vs. Eb/N0 for the (32, 6, 16) biorthogonal block code, compared to uncoded PAM and Shannon limits.

B. The earliest codes: Hamming, Golay, and Reed-Muller

The first nontrivial code to appear in the literature was the (7, 4, 3) Hamming code, mentioned by Shannon in his original paper [1]. Richard Hamming, a colleague of Shannon at Bell Labs, developed an infinite class of single-error-correcting (d = 3) binary linear codes, with parameters (n = 2^m − 1, k = 2^m − 1 − m, d = 3) for m ≥ 2 [10]. Thus k/n → 1 and η → 2 as m → ∞, while γc → 3 (4.77 dB). However, even with optimum soft-decision decoding, the real coding gain of Hamming codes on the AWGN channel never exceeds about 3 dB.

The Hamming codes are "perfect," in the sense that the spheres of Hamming radius 1 about each of the 2^k codewords contain 2^m binary n-tuples and thus form a "perfect" (exhaustive) partition of binary n-space (F2)^n.

Shortly after the publication of Shannon's paper, the Swiss mathematician Marcel Golay published a half-page paper [11] with a "perfect" binary linear (23, 12, 7) triple-error-correcting code, in which the spheres of Hamming radius 3 about each of the 2^12 codewords (containing C(23,0) + C(23,1) + C(23,2) + C(23,3) = 2^11 binary n-tuples) form an exhaustive partition of (F2)^23, and also a similar "perfect" (11, 6, 5) double-error-correcting ternary code. These binary and ternary Golay codes have come to be considered probably the most remarkable of all algebraic block codes, and it is now known that no other nontrivial "perfect" linear codes exist. Berlekamp [12] characterized Golay's paper as the "best single published page" in coding theory during 1948–1973.

Another early class of error-correcting codes was the Reed-Muller (RM) codes, which were introduced in 1954 by David Muller [13], and then reintroduced shortly thereafter with an efficient decoding algorithm by Irving Reed [14]. The RM(r, m) codes are a class of multiple-error-correcting (n, k, d) codes parametrized by two integers r and m, 0 ≤ r ≤ m, such that n = 2^m and d = 2^(m−r). The RM(0, m) code is the (2^m, 1, 2^m) binary repetition code (consisting of two codewords, the all-zero and all-one words), and the RM(m, m) code is the (2^m, 2^m, 1) binary code consisting of all binary 2^m-tuples (i.e., uncoded transmission).

Starting with RM(0, 1) = (2, 1, 2) and RM(1, 1) = (2, 2, 1), the RM codes may be constructed recursively by the length-doubling |u|u + v| (Plotkin, squaring) construction as follows:

   RM(r, m) = {(u, u + v) | u ∈ RM(r, m − 1), v ∈ RM(r − 1, m − 1)}.

From this construction it follows that the dimension k of RM(r, m) is given recursively by

   k(r, m) = k(r, m − 1) + k(r − 1, m − 1),

or nonrecursively by

   k(r, m) = Σ_{i=0}^{r} C(m, i).

Figure 3 shows the parameters of the RM codes of lengths ≤ 32 in a tableau that reflects this length-doubling construction. For example, the RM(2, 5) code is a (32, 16, 8) code that can be constructed from the RM(2, 4) = (16, 11, 4) code and the RM(1, 4) = (16, 5, 8) code.
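The parameters shown in Figure 3 follow directly from this recursion. As a check, here is a minimal Python sketch (not from the paper; the function name is illustrative) that reproduces the (n, k, d) parameters of the RM codes from the formulas just given.

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def rm_params(r, m):
        """(n, k, d) parameters of RM(r, m): n = 2^m, d = 2^(m-r), k from the dimension recursion."""
        n, d = 2 ** m, 2 ** (m - r)
        if r == 0:
            return (n, 1, n)           # repetition code
        if r == m:
            return (n, n, 1)           # all binary n-tuples
        k = rm_params(r, m - 1)[1] + rm_params(r - 1, m - 1)[1]
        assert k == sum(comb(m, i) for i in range(r + 1))   # the nonrecursive formula agrees
        return (n, k, d)

    print(rm_params(2, 5))   # (32, 16, 8), built from RM(2,4) = (16, 11, 4) and RM(1,4) = (16, 5, 8)
    print(rm_params(1, 5))   # (32, 6, 16), the biorthogonal code of Figure 2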
Fig. 3. Tableau of Reed-Muller codes. (The tableau lists the (n, k, d) parameters of the RM codes of lengths ≤ 32, organized by the length-doubling construction, with the following labeled families: r = m, d = 1: all binary n-tuples; r = m − 1, d = 2: SPC codes; r = m − 2, d = 4: extended Hamming codes; k = n/2: self-dual codes; r = 1, d = n/2: biorthogonal codes; r = 0, d = n: repetition codes.)

RM codes include several important subclasses of codes. We have already mentioned the (2^m, 2^m, 1) codes consisting of all binary 2^m-tuples and the (2^m, 1, 2^m) repetition codes. RM codes also include the (2^m, 2^m − 1, 2) single-parity-check (SPC) codes, the (2^m, 2^m − m − 1, 4) extended Hamming codes, the (2^m, m + 1, 2^(m−1)) biorthogonal codes, and, for odd m, a class of (2^m, 2^(m−1), 2^((m+1)/2)) self-dual codes.

(Footnote 4: An (n, k, d) code can be extended by adding code symbols or shortened by deleting code symbols; see footnote 3. Footnote 5: An (n, k, d) binary linear code forms a k-dimensional subspace of the vector space (F2)^n. The dual of an (n, k, d) code is the (n − k)-dimensional orthogonal subspace. A code that equals its dual is called self-dual. For self-dual codes, it follows that k = n/2.)

Reed [14] introduced a low-complexity hard-decision error-correction algorithm for RM codes based on a simple majority-logic decoding rule. This simple decoding rule is able to correct all hard-decision error patterns of weight ⌊(d − 1)/2⌋ or less, which is the maximum possible (i.e., it is a bounded-distance decoding rule). This simple majority-logic, hard-decision decoder was attractive for the technology of the 50s and 60s.

RM codes are thus an infinite class of codes with flexible parameters that can achieve near-optimal decoding on a BSC with a simple decoding algorithm. This was an important advance over the Hamming and Golay codes, whose parameters are much more restrictive.

Performance curves with optimum hard-decision decoding are shown in Figure 4 for the (31, 26, 3) Hamming code, the (23, 12, 7) Golay code, and the (31, 16, 7) shortened RM code. We see that they achieve real coding gains at Pb(E) ≈ 10^−5 of only 0.9 dB, 2.3 dB, and 1.6 dB, respectively. The reasons for this poor performance are the use of hard decisions, which costs roughly 2 dB, and the fact that by modern standards these codes are very short.

Fig. 4. Pb(E) vs. Eb/N0 for the (31, 26, 3) Hamming code, the (23, 12, 7) Golay code, and the (31, 16, 7) shortened RM code with optimum hard-decision decoding, compared to uncoded binary PAM.

It is clear from the tableau of Figure 3 that RM codes are not asymptotically "good"; that is, there is no sequence of (n, k, d) RM codes of increasing length n such that both k/n and d/n are bounded away from 0 as n → ∞. Since asymptotic goodness was the Holy Grail of algebraic coding theory (it is easy to show that typical random binary codes are asymptotically good), and since codes with somewhat better (n, k, d) (e.g., BCH codes) were found subsequently, theoretical attention soon turned away from RM codes.

However, in recent years it has been recognized that "RM codes are not so bad." RM codes are particularly good in terms of performance vs. complexity with trellis-based decoding and other soft-decision decoding algorithms, as we note in Section IV-E. Finally, they are almost as good in terms of (n, k, d) as the best binary codes known for lengths less than 128, which is the principal application domain of algebraic block codes.
Indeed, with optimum decoding, RM codes may be “good enough” to reach the Shannon limit on the AWGN channel. Notice that the nominal coding gains of the self-dual RM codes and the biorthogonal codes become infinite as m → ∞. It is known that with optimum (minimum-Euclidean-distance) decoding, the real coding gain of the biorthogonal codes does asymptotically approach the ultimate Shannon limit, albeit with exponentially increasing complexity and vanishing spectral efficiency. It seems likely that the real coding gains of the self-dual RM codes with optimum decoding approach the Shannon limit at the nonzero spectral efficiency of η = 1, albeit with exponential complexity, but to our knowledge this has never been proved. C. Soft decisions: Wagner decoding On the road to modern capacity-approaching codes for AWGN channels, an essential step has been to replace hard-decision with soft-decision decoding; i.e., decoding that takes into account the reliability of received channel outputs. CHANNEL CODING: THE ROAD TO CHANNEL CAPACITY 9 The earliest soft-decision decoding algorithm known to us is Wagner decoding, described in [15] and attributed to C. A. Wagner, which is an optimum decoding rule for the special class of (n, n − 1, 2) single-parity-check (SPC) codes. Each received real-valued symbol rk from an AWGN channel may be represented in sign-magnitude form, where the sign sgn(rk ) indicates the “hard decision,” and the magnitude |rk | indicates the “reliability” of rk . The Wagner rule is: first check whether the hard-decision binary n-tuple is a codeword. If so, accept it. If not, then flip the hard decision corresponding to the output rk that has the minimum reliability |rk |. It is easy to show that the Wagner rule finds the minimum-Euclidean-distance codeword; i.e., that Wagner decoding is optimum for an (n, n − 1, 2) SPC code. Moreover, Wagner decoding is much simpler than exhaustive minimumdistance decoding, which requires on the order of 2n−1 computations. D. BCH and Reed-Solomon codes In the 1960s, research in channel coding was dominated by the development of algebraic block codes, particularly cyclic codes. The algebraic coding paradigm used the structure of finite-field algebra to design efficient encoding and error-correction procedures for linear block codes operating on a hard-decision channel. The emphasis was on constructing codes with a guaranteed minimum distance d, and then using the algebraic structure of the codes to design bounded-distance error-correction algorithms whose complexity grows only as a small power of d. In particular, the goal was to develop flexible classes of easily-implementable codes with better performance than RM codes. Cyclic codes are codes that are invariant under cyclic (“end-around”) shifts of n-tuple codewords. They were first investigated by Eugene Prange in 1957 [16], and became the primary focus of research after the publication of Wesley Peterson’s pioneering text in 1961 [4]. Cyclic codes have a nice algebraic theory, and attractively simple encoding and decoding procedures based on cyclic shift-register implementations. Hamming, Golay, and shortened RM codes can be put into cyclic form. The “big bang” in this field was the invention of Bose-Chaudhuri-Hocquenghem (BCH) and Reed-Solomon (RS) codes in three independent papers in 1959 and 1960 [17], [18], [19]. It was shortly recognized that RS codes are a class of nonbinary BCH codes, or alternatively that BCH codes are subfield subcodes of RS codes. 
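Returning briefly to the Wagner rule described in Section III-C above: because the (n, n − 1, 2) SPC code has only one parity check, the rule fits in a few lines. The following Python sketch (not from the paper) assumes the binary-antipodal mapping 0 → +1, 1 → −1 over the AWGN channel; the function name and example values are illustrative only.

    def wagner_decode(r):
        """Optimum soft-decision decoding of an (n, n-1, 2) single-parity-check code.

        r: received real values from the AWGN channel, with bit 0 sent as +1 and bit 1 as -1.
        Returns the decoded binary codeword (even overall parity).
        """
        hard = [0 if x >= 0 else 1 for x in r]                    # signs give the hard decisions
        if sum(hard) % 2 == 1:                                    # parity check fails:
            k = min(range(len(r)), key=lambda i: abs(r[i]))       # least reliable position
            hard[k] ^= 1                                          # flip it
        return hard

    # Example: codeword 0 1 1 0 (even parity) received with noise; the weak +0.2 symbol is flipped.
    print(wagner_decode([+0.9, -1.1, +0.2, +0.8]))                # -> [0, 1, 1, 0]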
Binary BCH codes include a large class of t-error-correcting cyclic codes of length n = 2m − 1, odd minimum distance d = 2t + 1, and dimension k ≥ n − mt. Compared to shortened RM codes of a given length n = 2m − 1, there are more codes from which to choose, and for n ≥ 63 the BCH codes can have a somewhat larger dimension k for a given minimum distance d. However, BCH codes are still not asymptotically “good.” Although they are the premier class of binary algebraic block codes, they have not been used much in practice, except as “cyclic redundancy check” (CRC) codes for error detection in automatic-repeat-request (ARQ) systems. In contrast, the nonbinary Reed-Solomon codes have proved to be highly useful in practice (although not necessarily in cyclic form). An (extended or shortened) RS code over the finite field Fq , q = 2m , can have any block length up to n = q + 1, any minimum distance d ≤ n (where Hamming distance is defined in terms of q -ary symbols), and dimension k = n − d + 1, which meets an elementary upper bound called the Singleton bound [20]. In this sense, RS codes are optimum. An important property of RS and BCH codes is that they can be efficiently decoded by algebraic decoding algorithms using finite-field arithmetic. A glance at the tables of contents of the IEEE T RANSACTIONS ON I NFORMATION T HEORY shows that the development of such algorithms was one of the most active research fields of the 1960s. Already by 1960, Peterson had developed an error-correction algorithm with complexity on the order of d3 [21]. In 1968, Elwyn Berlekamp [5] devised an error-correction algorithm with complexity on the order of d2 , which was interpreted by Jim Massey [22] as an algorithm for finding the shortest linear feedback shift register that can generate a certain sequence. This Berlekamp-Massey algorithm became the standard for the next decade. Finally, it was shown that these algorithms could be straightforwardly extended to correct both erasures and errors [23], and even to correct soft decisions [24], [25] (suboptimally, but in some cases asymptotically optimally). 10 The fact that RS codes are inherently nonbinary (the longest binary RS code has length 3) may cause difficulties in using them over binary channels. If the 2m -ary RS code symbols are simply represented as binary m-tuples and sent over a binary channel, then a single binary error can cause an entire 2m -ary symbol to be incorrect; this causes RS codes to be inferior to BCH codes as binary-error-correcting codes. However, in this mode RS codes are inherently good burst-error-correcting codes, since the effect of an m-bit burst that is concentrated in a single RS code symbol is only a single symbol error. In fact, it can be shown that RS codes are effectively optimal binary burst-error-correcting codes [26]. The ability of RS codes to correct both random and burst errors makes them particularly well suited for applications such as magnetic tape and disk storage, where imperfections in the storage media sometimes cause bursty errors. They are also useful as outer codes in concatenated coding schemes, to be discussed in Section IV-D. For these reasons, RS codes are probably the most widely deployed codes in practice. E. Reed-Solomon code implementations The first major application of RS codes was as outer codes in concatenated coding systems for deep-space communications. 
For the 1977 Voyager mission, the Jet Propulsion Laboratory (JPL) used a (255, 223, 33), 16error-correcting RS code over F256 as an outer code, with a rate-1/2, 64-state convolutional inner code (see also Section IV-D). The RS decoder used special-purpose hardware for decoding, and was capable of running up to about 1 Mb/s [27]. This concatenated convolutional/RS coding system became a NASA standard. 1980 saw the first major commercial application of RS codes in the compact disc (CD) standard. This system used two short RS codes over F256 , namely (32, 28, 5) and (28, 24, 5) RS codes, and operated at bit rates of the order of 4 Mb/s [28]. All subsequent audio and video magnetic storage systems have used RS codes for error correction, nowadays at much higher rates. Cyclotomics, Inc. built a prototype “hypersystolic” RS decoder in 1986–88 that was capable of decoding a (63, 53, 11) RS code over F64 at bit rates approaching 1 Gb/s [29]. This decoder may still hold the RS decoding speed record. Reed-Solomon codes continue to be preferred for error correction when the raw channel error rate is not too large, because they can provide substantial error-correction power with relatively small redundancy at data rates up to tens or hundreds of Mb/s. They also work well against bursty errors. In these respects, they complement modern capacity-approaching codes. F. The “coding is dead” workshop The first IEEE Communication Theory Workshop in St. Petersburg, Florida in April 1971 became famous as the “coding is dead” workshop. No written record of this workshop seems to have survived. However, Bob Lucky wrote a column about it many years later in IEEE S PECTRUM [30]. Lucky recalls: A small group of us in the communications field will always remember a workshop held in Florida about 20 years ago . . . One of my friends [Ned Weldon] gave a talk that has lived in infamy as the “coding is dead” talk. His thesis was that he and the other coding theorists formed a small, inbred group that had been isolated from reality for too long. He illustrated this talk with a single slide showing a pen of rats that psychologists had penned in a confined space for an extensive period of time. I cannot tell you what those rats were doing, but suffice it to say that the slide has since been borrowed many times to depict the depths of depravity into which such a disconnected group can fall . . . Of course, as Lucky goes on to say, the irony is that since 1971 coding has flourished and become embedded in practically all communications applications. He asks plaintively, “Why are we technologists so bad at predicting the future of technology?” CHANNEL CODING: THE ROAD TO CHANNEL CAPACITY 11 From today’s perspective, one answer to this question could be that what Weldon was really asserting was that “algebraic coding is dead” (or at least had reached the point of diminishing returns). Another answer was given on the spot by Irwin Jacobs, who stood up in the back row, flourished a medium-scaleintegrated circuit (perhaps a 4-bit shift register), and asserted that “This is the future of coding.” Elwyn Berlekamp said much the same thing. Interestingly, Jacobs and Berlekamp went on to lead the two principal coding companies of the 1970s, Linkabit and Cyclotomics, the one championing convolutional codes, and the other, block codes. History has shown that both answers were right. 
Coding has moved from theory to practice in the past 35 years because (a) other classes of coding schemes have supplanted the algebraic coding paradigm, and (b) advances in integrated circuit technology have ultimately allowed designers to implement any (polynomial-complexity) algorithm that they can think of. Today’s technology is on the order of a million times faster than that of 1971. Even though Moore’s Law had already been propounded in 1971, it seems to be hard for the human mind to grasp what a factor of 106 can make possible. G. Further developments in algebraic coding theory Of course algebraic coding theory has not died; it continues to be an active research area. A recent text in this area is Roth [31]. A new class of block codes based on algebraic geometry (AG) was introduced by Goppa in the late 1970s [32], [33]. Tsfasman, Vladut, and Zink [34] constructed AG codes over nonbinary fields Fq with q ≥ 49 whose minimum distance as n → ∞ surpasses the Gilbert-Varshamov bound (the best known lower bound on the minimum distance of block codes), which is perhaps the most notable achievement of AG codes. AG codes are generally much longer than RS codes, and can usually be decoded by extensions of RS decoding algorithms. However, AG codes have not been adopted yet for practical applications. For a nice survey of this field, see [35]. In 1997, Sudan [36] introduced a list decoding algorithm based on polynomial interpolation for decoding beyond the guaranteed error-correction distance of RS and related codes.6 Although in principle there may be more than one codeword within such an expanded distance, in fact with high probability only one will occur. Guruswami and Sudan [38] further improved the algorithm and its decoding radius, and Koetter and Vardy [39] extended it to handle soft decisions. There is currently some hope that algorithms of this type will be used in practice. Other approaches to soft-decision decoding algorithms have continued to be developed, notably the orderedstatistics approach of Fossorier and Lin (see, e.g., [40]) whose roots can be traced back to Wagner decoding. IV. P ROBABILISTIC CODING “Probabilistic coding” is a name for an alternative line of development that was more directly inspired by Shannon’s probabilistic approach to coding. Whereas algebraic coding theory aims to find specific codes that maximize the minimum distance d for a given (n, k), probabilistic coding is more concerned with finding classes of codes that optimize average performance as a function of coding and decoding complexity. Probabilistic decoders typically use soft-decision (reliability) information, both as inputs (from the channel outputs), and at intermediate stages of the decoding process. Classical coding schemes that fall into this class include convolutional codes, product codes, concatenated codes, trellis-coded modulation, and trellis decoding of block codes. Popular textbooks that emphasize the probabilistic view of coding include Wozencraft and Jacobs [41], Gallager [42], Clark and Cain [43], Lin and Costello [44], Johannesson and Zigangirov [45], and the forthcoming book by Richardson and Urbanke [46]. For many years, the competition between the algebraic and probabilistic approaches was cast as a competition between block codes and convolutional codes. Convolutional coding was motivated from the start by the objective of optimizing the tradeoff of performance vs. complexity, which on the binary-input AWGN channel necessarily implies soft decisions and quasi-optimal decoding. 
In practice, most channel coding systems have used convolutional codes. Modern capacity-approaching codes are the ultimate fruit of this line of development. 6 List decoding was an (unpublished) invention of Elias— see [37]. 12 A. Elias’ invention of convolutional codes Convolutional codes were invented by Peter Elias in 1955 [47]. Elias’ goal was to find classes of codes for the binary symmetric channel (BSC) with as much structure as possible, without loss of performance. Elias’ several contributions have been nicely summarized by Bob Gallager, who was Elias’ student [48]: [Elias’] 1955 paper . . . was perhaps the most influential early paper in information theory after Shannon’s. This paper takes several important steps toward realizing the promise held out by Shannon’s paper . . . . The first major result of the paper is a derivation of upper and lower bounds on the smallest achievable error probability on a BSC using codes of a given block length n. These bounds decrease exponentially with n for any data rate R less than the capacity C . Moreover, the upper and lower bounds are substantially the same over a significant range of rates up to capacity. This result shows that: (a) achieving a small error probability at any error rate near capacity necessarily requires a code with a long block length; and (b) almost all randomly chosen codes perform essentially as well as the best codes; that is, most codes are good codes. Consequently, Elias turned his attention to finding classes of codes that have some special structure, so as to simplify implementation, without sacrificing average performance over the class. His second major result is that the special class of linear codes has the same average performance as the class of completely random codes. Encoding of linear codes is fairly simple, and the symmetry and special structure of these codes led to a promise of simplified decoding strategies . . . . In practice, practically all codes are linear. Elias’ third major result was the invention of [linear time-varying] convolutional codes . . . . These codes are even simpler to encode than general linear codes, and they have many other useful qualities. Elias showed that convolutional codes also have the same average performance as randomly chosen codes. We may mention at this point that Gallager’s doctoral thesis on low-density parity-check (LDPC) codes, supervised by Elias, was similarly motivated by the problem of finding a class of “random-like” codes that could be decoded near capacity with quasi-optimal performance and feasible complexity [49]. Linearity is the only algebraic property that is shared by convolutional codes and algebraic block codes.7 The additional structure introduced by Elias was later understood as the dynamical structure of a discrete-time, k-input, n-output finite-state Markov process. A convolutional code is characterized by its code rate k/n, where k and n are typically small integers, and by the number of its states, which is often closely related to decoding complexity. In more recent terms, Elias’ and Gallager’s codes can be represented as “codes on graphs,” in which the complexity of the graph increases only linearly with the code block length. This is why convolutional codes are useful as components of turbo coding systems. In this light, there is a fairly straight line of development from Elias’ invention to modern capacity-approaching codes. Nonetheless, this development actually took the better part of a half-century. B. 
Convolutional codes in the 1960s and 1970s Shortly after Elias’ paper, Jack Wozencraft recognized that the tree structure of convolutional codes permits decoding by a sequential search algorithm [51]. Sequential decoding became the subject of intense research at MIT, culminating in the development of the fast, storage-free Fano sequential decoding algorithm [52], and an analytical proof that the rate of a sequential decoding system is bounded by the computational cut-off rate R0 [53]. Subsequently, Jim Massey proposed a very simple decoding method for convolutional codes, called threshold decoding [54]. Burst-error-correcting variants of threshold decoding developed by Massey and Gallager proved to be 7 Linear convolutional codes have the algebraic structure of discrete-time multi-input, multi-output linear dynamical systems [50], but this is rather different from the algebraic structure of linear block codes. CHANNEL CODING: THE ROAD TO CHANNEL CAPACITY 13 quite suitable for practical error correction [26]. Codex Corp. was founded in 1962 around the Massey and Gallager codes (including LDPC codes, which were never seriously considered for practical implementation). Codex built hundreds of burst-error-correcting threshold decoders during the 1960s, but the business never grew very large, and Codex left it in 1970. In 1967, Andy Viterbi introduced what became known as the Viterbi algorithm (VA) as an “asymptotically optimal” decoding algorithm for convolutional codes, in order to prove exponential error bounds [55]. It was quickly recognized [56], [57] that the VA was actually an optimum decoding algorithm. More importantly, Jerry Heller at the Jet Propulsion Laboratory (JPL) [58], [59] realized that relatively short convolutional codes decoded by the VA were potentially quite practical— e.g., a 64-state code could obtain a sizable real coding gain, on the order of 6 dB. Linkabit Corp. was founded by Irwin Jacobs, Len Kleinrock, and Andy Viterbi in 1968 as a consulting company. In 1969, Jerry Heller was hired as Linkabit’s first full-time employee. Shortly thereafter, Linkabit built a prototype 64-state Viterbi algorithm decoder (“a big monster filling a rack” [60]), capable of running at 2 Mb/s [61]. During the 1970s, through the leadership of Linkabit and JPL, the VA became part of the NASA standard for deep-space communication. Around 1975, Linkabit developed a relatively inexpensive, flexible, and fast VA chip. The VA soon began to be incorporated into many other communications applications. Meanwhile, although a convolutional code with sequential decoding was the first code in space (for the 1968 Pioneer 9 mission [56]), and a few prototype sequential decoding systems were built, sequential decoding never took off in practice. By the time electronics technology could support sequential decoding, the VA had become a more attractive alternative. However, there seems to be a current resurgence of interest in sequential decoding for specialized applications [62]. C. Soft decisions: APP decoding Part of the attraction of convolutional codes is that all of these convolutional decoding algorithms are inherently capable of using soft decisions, without any essential increase in complexity. In particular, the VA implements minimum-Euclidean-distance sequence detection on an AWGN channel. 
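To make the add-compare-select structure of the VA concrete, here is a minimal soft-decision Viterbi decoder for a small 4-state, rate-1/2 convolutional code with generators (7, 5) in octal. This is an illustrative Python sketch, not any of the codes discussed in the text (in particular, not the 64-state NASA code); the encoder, the 0 → +1, 1 → −1 mapping, and all names are assumptions for the example.

    def encode(bits):
        """Rate-1/2, 4-state convolutional encoder (generators 7, 5), with two zero tail bits."""
        s1 = s2 = 0
        out = []
        for u in bits + [0, 0]:
            out += [u ^ s1 ^ s2, u ^ s2]      # one output bit per generator
            s1, s2 = u, s1
        return out

    def viterbi_soft(r):
        """Soft-decision Viterbi decoding of the code above.

        r: received real values (one per code bit), transmitted as 0 -> +1, 1 -> -1 over AWGN.
        Returns the information bits (tail bits stripped).
        """
        n_steps = len(r) // 2
        INF = float("inf")
        metric = [0.0] + [INF] * 3            # start in state 0; state = (s1, s2) packed as s1*2 + s2
        paths = [[] for _ in range(4)]
        for t in range(n_steps):
            r0, r1 = r[2 * t], r[2 * t + 1]
            new_metric = [INF] * 4
            new_paths = [None] * 4
            for s in range(4):
                if metric[s] == INF:
                    continue
                s1, s2 = s >> 1, s & 1
                for u in (0, 1):
                    c0, c1 = u ^ s1 ^ s2, u ^ s2
                    # branch metric: squared Euclidean distance to the antipodal symbols for (c0, c1)
                    m = metric[s] + (r0 - (1 - 2 * c0)) ** 2 + (r1 - (1 - 2 * c1)) ** 2
                    ns = (u << 1) | s1
                    if m < new_metric[ns]:
                        new_metric[ns] = m
                        new_paths[ns] = paths[s] + [u]
            metric, paths = new_metric, new_paths
        return paths[0][:-2]                  # terminated trellis: take the survivor ending in state 0

    msg = [1, 0, 1, 1, 0, 0, 1]
    tx = [1 - 2 * b for b in encode(msg)]     # BPSK mapping 0 -> +1, 1 -> -1
    noise = [0.3, -0.4, 0.2, -0.1, 0.5, -0.3, 0.1, 0.2, -0.2, 0.4, 0.1, -0.5, 0.3, 0.2, -0.1, 0.4, 0.2, -0.3]
    rx = [x + n for x, n in zip(tx, noise)]
    print(viterbi_soft(rx) == msg)            # True for this mild noise realization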
An alternative approach to using reliability information is to try to compute (exactly or approximately) the a posteriori probability (APP) of each transmitted bit being a 0 or a 1, given the APPs of each received symbol. In his thesis, Gallager [49] developed an iterative message-passing APP decoding algorithm for LDPC codes, which seems to have been the first appearance in any literature of the now-ubiquitous "sum-product algorithm" (also called "belief propagation"). At about the same time, Massey [54] developed an APP version of threshold decoding.

In 1974, Bahl, Cocke, Jelinek, and Raviv [63] published an algorithm for APP decoding of convolutional codes, now called the BCJR algorithm. Because this algorithm is more complicated than the VA (for one thing, it is a forward-backward rather than a forward-only algorithm) and its performance is more or less the same, it did not supplant the VA for decoding convolutional codes. However, because it is a soft-input, soft-output (SISO) algorithm (i.e., APPs in, APPs out), it became a key element of iterative turbo decoding (see Section VI). Theoretically, it is now recognized as an implementation of the sum-product algorithm on a trellis.

D. Product codes and concatenated codes

Before inventing convolutional codes, Elias had invented another class of codes now known as product codes [64]. The product of an (n1, k1, d1) with an (n2, k2, d2) binary linear block code is an (n1 n2, k1 k2, d1 d2) binary linear block code. A product code may be decoded, simply but suboptimally, by independent decoding of the component codes. Elias showed that with a repeated product of extended Hamming codes, an arbitrarily low error probability could be achieved at a nonzero code rate, albeit at a code rate well below the Shannon limit.

In 1966, Dave Forney introduced concatenated codes [65]. As originally conceived, a concatenated code involves a serial cascade of two linear block codes: an outer (n2, k2, d2) nonbinary Reed-Solomon code over a finite field Fq with q = 2^k1 elements, and an inner (n1, k1, d1) binary code with q = 2^k1 codewords (see Figure 5). The resulting concatenated code is an (n1 n2, k1 k2, d1 d2) binary linear block code. The key idea is that the inner and outer codes may be relatively short codes that are easy to encode and decode, whereas the concatenated code is a longer, more powerful code. For example, if the outer code is a (15, 11, 5) RS code over F16 and the inner code is a (7, 4, 3) binary Hamming code, then the concatenated code is a much more powerful (105, 44, 15) code.

Fig. 5. A concatenated code. (Block diagram: Outer Encoder, (n2, k2) code over GF(2^k1) → Inner Encoder, (n1, k1) binary code → Channel → Inner Decoder, (n1, k1) binary code → Outer Decoder, (n2, k2) code over GF(2^k1).)

The two-stage decoder shown in Figure 5 is not optimum, but is capable of correcting a wide variety of error patterns. For example, any error pattern that causes at most one error in each of the inner codewords will be corrected. In addition, if bursty errors cause one or two inner codewords to be decoded incorrectly, they will appear as correctable symbol errors to the outer decoder. The overall result is a long, powerful code with a simple, suboptimum decoder that can correct many combinations of burst and random errors. Forney showed that with a proper choice of the constituent codes, concatenated coding schemes could operate at any code rate up to the Shannon limit with exponentially decreasing error probability, but only polynomial decoding complexity.
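The parameter arithmetic of these constructions is simple enough to state as a one-line helper. The following trivial Python sketch (not from the paper; the helper name is illustrative) reproduces the (105, 44, 15) example just given.

    def concatenated_params(inner, outer):
        """Compose (n, k, d) of a product or concatenated construction: the parameters multiply."""
        (n1, k1, d1), (n2, k2, d2) = inner, outer
        return (n1 * n2, k1 * k2, d1 * d2)

    # Outer (15, 11, 5) RS code over F16, inner (7, 4, 3) binary Hamming code:
    print(concatenated_params((7, 4, 3), (15, 11, 5)))   # (105, 44, 15)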
Concatenation can also be applied to convolutional codes. In fact, the most common concatenated code used in practice is one developed in the 1970s as a NASA standard (mentioned in Section III-E). It consists of an inner rate-1/2, 64-state convolutional code with minimum distance8 d = 10 along with an outer (255, 223, 33) RS code over F256 . The inner decoder uses soft-decision Viterbi decoding, while the outer decoder uses the hard-decision Berlekamp-Massey algorithm. Also, since the decoding errors made by the Viterbi algorithm tend to be bursty, a symbol interleaver is inserted between the two encoders, and a de-interleaver between the two decoders. In the late 1980s, a more complex concatenated coding scheme with iterative decoding was proposed by Erik Paaske [66], and independently by Oliver Collins [67], to improve the performance of the NASA concatenated coding standard. Instead of a single outer Reed-Solomon (RS) code, Paaske and Collins proposed to use several outer RS codes of different rates. After one round of decoding, the outputs of the strongest (lowest-rate) RS decoders may be deemed to be reliable, and thus may be fed back to the inner (Viterbi) convolutional decoder as known bits for another round of decoding. Performance improvements of about 1.0 dB were achieved after a few iterations. This scheme was used to rescue the 1992 Galileo mission (see also Section IV-F). Also, in retrospect, its use of iterative decoding with a concatenated code may be seen as a precursor of turbo codes (see, for example, the paper by Hagenauer et al. [68]). E. Trellis decoding of block codes A convolutional code may be viewed as the output sequence of a discrete-time, finite-state system. By rolling out the state-transition diagram of such a system in time, we get a picture called a trellis diagram, which explicitly displays every possible state sequence, and also every possible output sequence (if state transitions are labelled by the corresponding outputs). With such a trellis representation of a convolutional code, it becomes obvious that on a memoryless channel the Viterbi algorithm is a maximum-likelihood sequence detection algorithm [56]. The success of VA decoding of convolutional codes led to the idea of representing a block code by a (necessarily time-varying) trellis diagram with as few states as possible, and then using the VA to decode it. Another fundamental contribution of the BCJR paper [63] was to show that every (n, k, d) binary linear code may be represented by a trellis diagram with at most min{2k , 2n−k } states.9 The subject of minimal trellis representations of block codes became an active research area during the 1990s. Given a linear block code with a fixed coordinate ordering, it turns out that there is a unique minimal trellis 8 9 The minimum distance between infinite code sequences in a convolutional code is also known as the free distance. This result is usually attributed to a subsequent paper by Wolf [69]. CHANNEL CODING: THE ROAD TO CHANNEL CAPACITY 15 representation; however, finding the best coordinate ordering is an NP-hard problem. Nonetheless, optimal coordinate orderings for many classes of linear block codes have been found. In particular, the optimum coordinate ordering for Golay and Reed-Muller codes is known, and the resulting trellis diagrams are rather nice. On the other hand, the state complexity of any class of “good” block codes must increase exponentially as n → ∞. An excellent summary of this field by Vardy appears in [70]. 
In practice, this approach was superseded by the advent of turbo and LDPC codes, to be discussed in Section VI.

F. History of coding for deep-space applications

The deep-space communications application is the arena in which the most powerful coding schemes for the power-limited AWGN channel have been first deployed, because:

• The only noise is AWGN in the receiver front end;
• Bandwidth is effectively unlimited;
• Fractions of a dB have huge scientific and economic value;
• Receiver (decoding) complexity is effectively unlimited.

As we have already noted, for power-limited AWGN channels, there is a negligible penalty to using binary codes with binary modulation rather than more general modulation schemes.

The first coded scheme to be designed for space applications was a simple (32, 6, 16) biorthogonal code for the Mariner missions (1969), which can be optimally soft-decision decoded using a fast Hadamard transform. Such a scheme can achieve a nominal coding gain of 3 (4.8 dB). At a target bit error probability of Pb(E) ≈ 5 · 10^−3, the real coding gain achieved was only about 2.2 dB.

The first coded scheme actually to be launched into space was a rate-1/2 convolutional code with constraint length ν = 20 (2^20 states) for the Pioneer 1968 mission [3]. (Footnote 10: The constraint length ν is the dimension of the state space of a convolutional encoder; the number of states is thus 2^ν.) The receiver used 3-bit-quantized soft decisions and sequential decoding implemented on a general-purpose 16-bit minicomputer with a 1 MHz clock rate. At a rate of 512 b/s, the real coding gain achieved at Pb(E) ≈ 5 · 10^−3 was about 3.3 dB.

During the 1970s, as noted in Sections III-E and IV-D, the NASA standard became a concatenated coding scheme based on a rate-1/2, 64-state inner convolutional code and a (255, 223, 33) Reed-Solomon outer code over F256. The overall rate of this code is 0.437, and it achieves an impressive 7.3 dB real coding gain at Pb(E) ≈ 10^−5; i.e., its gap to capacity (SNRnorm) is only about 2.5 dB (see Figure 6).

When the primary antenna failed to deploy on the Galileo mission (circa 1992), an elaborate concatenated coding scheme using a rate-1/6, 2^14-state inner convolutional code with a Big Viterbi Decoder (BVD) and a set of variable-strength RS outer codes was reprogrammed into the spacecraft computers (see Section IV-D). This scheme was able to achieve Pb(E) ≈ 2 · 10^−7 at Eb/N0 ≈ 0.8 dB, for a real coding gain of about 10.2 dB.

Finally, within the last decade, turbo codes and LDPC codes for deep-space communications have been developed to get within 1 dB of the Shannon limit, and these are now becoming industry standards (see Section VI-H).

For a more comprehensive history of coding for deep-space channels, see [71].

V. CODES FOR BANDWIDTH-LIMITED CHANNELS

Most work on channel coding has focussed on binary codes. However, on a bandwidth-limited AWGN channel, in order to obtain a spectral efficiency η > 2 b/s/Hz, some kind of nonbinary coding must be used. Early work, primarily theoretical, focussed on lattice codes, which in many respects are analogous to binary linear block codes. The practical breakthrough in this field came with Ungerboeck's invention of trellis-coded modulation, which is similarly analogous to convolutional coding.

Fig. 6. Pb(E) vs. Eb/N0 for the NASA standard concatenated code, compared to uncoded PAM and the Shannon limit for η = 0.874.
A. Coding for the bandwidth-limited AWGN channel

Coding schemes for a bandwidth-limited AWGN channel typically use two-dimensional quadrature amplitude modulation (QAM). A sequence of QAM symbols may be sent through a channel of bandwidth W at a symbol rate up to the Nyquist limit of W QAM symbols per second. If the information rate is η bits per QAM symbol, then the nominal spectral efficiency is also η b/s/Hz.

An uncoded baseline scheme is simply to use a square M × M QAM constellation, where M is even, typically a power of two. The information rate is then η = log2 M^2 bits per QAM symbol. The average energy of such a constellation is easily shown to be

   Es = (M^2 − 1) d^2 / 6 = (2^η − 1) d^2 / 6,

where d is the minimum Euclidean distance between constellation points. Since SNR = Es/N0, it is then straightforward to show that with optimum modulation and detection the probability of error per QAM symbol is

   Ps(E) ≈ 4 Q(√(3 · SNRnorm)),

where Q(x) is again the Gaussian probability of error function.

This baseline performance curve of Ps(E) vs. SNRnorm for uncoded QAM transmission is plotted in Figure 7. For example, in order to achieve a symbol error probability of Ps(E) ≈ 10^−5, we must have SNRnorm ≈ 7 (8.5 dB) for uncoded QAM transmission. We recall from Section II that the Shannon limit on SNRnorm is 1 (0 dB), so the gap to capacity is about 8.5 dB at Ps(E) ≈ 10^−5. Thus the maximum possible coding gain is somewhat smaller in the bandwidth-limited regime than in the power-limited regime. Furthermore, as we will discuss next, in the bandwidth-limited regime the Shannon limit on SNRnorm with no shaping is πe/6 (1.53 dB), so the maximum possible coding gain with no shaping at Ps(E) ≈ 10^−5 is only about 7 dB. These two limits are also shown on Figure 7.

Fig. 7. Ps(E) vs. SNRnorm for uncoded QAM, compared to Shannon limits on SNRnorm with and without shaping.

We now briefly discuss shaping. The set of all n-tuples of constellation points from a square QAM constellation is the set of all points on a d-spaced rectangular grid that lie within a 2n-cube in real 2n-space R^2n. The average energy of this 2n-dimensional constellation could be reduced if instead the constellation consisted of all points on the same grid that lie within a 2n-sphere of the same volume, which would comprise approximately the same number of points. The reduction in average energy of a 2n-sphere relative to a 2n-cube of the same volume is called the shaping gain γs(S2n) of a 2n-sphere. As n → ∞, γs(Sn) → πe/6 (1.53 dB).

For large signal constellations, shaping can be implemented more or less independently of coding, and shaping gain is more or less independent of coding gain. The Shannon limit essentially assumes n-sphere shaping with n → ∞, and therefore incorporates 1.53 dB of shaping gain (over an uncoded square QAM constellation). In the bandwidth-limited regime, coding without shaping can therefore get only to within 1.53 dB of the Shannon limit; the remaining 1.53 dB can be obtained by shaping and only by shaping. We do not have space to discuss shaping schemes in this paper.
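The two baseline numbers just quoted can be checked with a short computation. The following Python sketch (not from the paper; names are illustrative, and SNRnorm is simply scanned in 0.01 dB steps) evaluates the uncoded-QAM symbol error probability and the ultimate n-sphere shaping gain.

    import math

    def Q(x):
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    def ps_uncoded_qam(snr_norm_db):
        """Baseline symbol error probability of square QAM: Ps(E) ~ 4 Q(sqrt(3 SNRnorm))."""
        snr_norm = 10.0 ** (snr_norm_db / 10.0)
        return 4.0 * Q(math.sqrt(3.0 * snr_norm))

    # SNRnorm (dB) at which uncoded QAM reaches Ps(E) = 1e-5:
    x = 0.0
    while ps_uncoded_qam(x) > 1e-5:
        x += 0.01
    print(round(x, 1))                                    # roughly 8.4-8.5 dB, i.e. SNRnorm ~ 7

    # Ultimate shaping gain of an n-sphere over an n-cube as n -> infinity: pi*e/6.
    print(10.0 * math.log10(math.pi * math.e / 6.0))      # ~1.53 dB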
It turns out that obtaining shaping gains on the order of 1 dB is not very hard, so nowadays most practical schemes for the bandwidth-limited Gaussian channel incorporate shaping. For example, the V.34 modem (see Section V-D) incorporates a 16-dimensional “shell mapping” shaping scheme whose shaping gain is about 0.8 dB.
The performance curve of any practical coding scheme that improves on uncoded QAM must lie between the relevant Shannon limit and the uncoded QAM curve. Thus Figure 7 defines the “playing field” for coding and shaping in the bandwidth-limited regime. The real coding gain of a coding scheme at a given symbol error probability Ps(E) will be defined as the difference (in dB) between the SNRnorm required to obtain that Ps(E) with coding, but no shaping, vs. without coding (uncoded QAM). Thus the maximum possible real coding gain at Ps(E) ≈ 10^−5 is about 7 dB.
Again, for moderate-complexity coding, it can often be assumed that the error probability is dominated by the probability of making an error to one of the nearest-neighbor codewords. Under this assumption, using a union bound estimate [75], [76], it is easily shown that with optimum decoding, the probability of decoding error per QAM symbol is well approximated by
Ps(E) ≈ (2Nd/n) Q(√(3 d² 2^−ρ SNRnorm)) = (2Nd/n) Q(√(3 γc SNRnorm)),
where d² is the minimum squared Euclidean distance between code sequences (assuming an underlying QAM constellation with minimum distance 1 between signal points), 2Nd/n is the number of code sequences at the minimum distance per QAM symbol, and ρ is the redundancy of the coding scheme (the difference between the actual and maximum possible data rates with the underlying QAM constellation) in bits per two dimensions. The quantity γc = d² 2^−ρ is called the nominal coding gain of the coding scheme. The real coding gain is usually slightly less than the nominal coding gain, due to the effect of the “error coefficient” 2Nd/n.

B. Spherical lattice codes

It is clear from the proof of Shannon’s capacity theorem for the AWGN channel that an optimal code for a bandwidth-limited AWGN channel consists of a dense packing of signal points within an n-sphere in a high-dimensional Euclidean space R^n. Finding the densest packings in R^n is a longstanding mathematical problem. Most of the densest known packings are lattices [72], i.e., packings that have a group property. Notable lattice packings include the integer lattice Z in one dimension, the hexagonal lattice A2 in two dimensions, the Gosset lattice E8 in eight dimensions, and the Leech lattice Λ24 in 24 dimensions.
Therefore, from the very earliest days, there have been proposals to use spherical lattice codes as codes for the bandwidth-limited AWGN channel, notably by de Buda [73] and Lang in Canada. Lang proposed an E8 lattice code for telephone-line modems to a CCITT international standards committee in the mid-70s, and actually built a Leech lattice modem in the late 1980s [74].
By the union bound estimate, the probability of error per two-dimensional symbol of a spherical lattice code based on an n-dimensional lattice Λ on an AWGN channel with minimum-distance decoding may be estimated as
Ps(E) ≈ (2Kmin(Λ)/n) Q(√(3 γc(Λ) γs(Sn) SNRnorm)),
where Kmin(Λ) is the kissing number (number of nearest neighbors) of the lattice Λ, γc(Λ) is the nominal coding gain (Hermite parameter) of Λ, and γs(Sn) is the shaping gain of an n-sphere.
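Before turning to specific lattices, it may help to see how the union bound estimate above converts a nominal coding gain and an error coefficient into a real coding gain. The short Python sketch below is ours, not from the paper, and the values γc = 4 (6 dB) and error coefficient 32 are purely illustrative.

# Sketch (ours, not from the paper): the union-bound estimate
# Ps(E) ~ (2*Nd/n) * Q(sqrt(3 * gamma_c * SNRnorm)), and the real coding gain
# it implies at Ps(E) ~ 1e-5 relative to the uncoded-QAM baseline.
# gamma_c = 4 and err_coeff = 32 are illustrative values only.
from math import erfc, sqrt, log10

def Q(x):
    return 0.5 * erfc(x / sqrt(2))

def ps_union_bound(snr_norm, gamma_c=1.0, err_coeff=4.0):
    # defaults reproduce the uncoded-QAM baseline Ps(E) ~ 4*Q(sqrt(3*SNRnorm))
    return err_coeff * Q(sqrt(3.0 * gamma_c * snr_norm))

def snrnorm_db_for(target_ps, **code):
    snr_db = 0.0
    while ps_union_bound(10 ** (snr_db / 10), **code) > target_ps:
        snr_db += 0.01
    return snr_db

baseline_db = snrnorm_db_for(1e-5)                              # uncoded QAM
coded_db = snrnorm_db_for(1e-5, gamma_c=4.0, err_coeff=32.0)    # illustrative code
print(f"real coding gain ~ {baseline_db - coded_db:.2f} dB "
      f"vs. nominal {10 * log10(4.0):.2f} dB")

For these illustrative numbers the real coding gain at Ps(E) ≈ 10^−5 comes out about 3/4 dB below the nominal 6 dB, which is precisely the “error coefficient” effect noted above.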
Since Ps(E) ≈ 4 Q(√(3 SNRnorm)) for a square two-dimensional QAM constellation, the real coding gain of a spherical lattice code over a square QAM constellation is the combination of the nominal coding gain γc(Λ) and the shaping gain γs(Sn), minus a Ps(E)-dependent factor due to the larger “error coefficient” 2Kmin(Λ)/n.
For example (see [76]), the Gosset lattice E8 has a nominal coding gain of 2 (3 dB); however, Kmin(E8) = 240, so with no shaping
Ps(E) ≈ 60 Q(√(6 SNRnorm)),
which is plotted in Figure 8. We see that the real coding gain of E8 is only about 2.2 dB at Ps(E) ≈ 10^−5. The Leech lattice Λ24 has a nominal coding gain of 4 (6 dB); however, Kmin(Λ24) = 196560, so with no shaping
Ps(E) ≈ 16380 Q(√(12 SNRnorm)),
also plotted in Figure 8. We see that the real coding gain of Λ24 is only about 3.6 dB at Ps(E) ≈ 10^−5. Spherical shaping in 8 or 24 dimensions would contribute a shaping gain of about 0.75 dB or 1.1 dB, respectively. For a detailed discussion of lattices and lattice codes, see the book by Conway and Sloane [72].
Fig. 8. Ps(E) vs. SNRnorm for Gosset lattice E8 and Leech lattice Λ24 with no shaping, compared to uncoded QAM and the Shannon limit on SNRnorm without shaping.

C. Trellis-coded modulation

The big breakthrough in practical coding for bandwidth-limited channels was Gottfried Ungerboeck’s invention of trellis-coded modulation (TCM), originally conceived in the 1970s, but not published until 1982 [77].
Ungerboeck realized that in the bandwidth-limited regime, the redundancy needed for coding should be obtained by expanding the signal constellation while keeping the bandwidth fixed, rather than by increasing the bandwidth while keeping a fixed signal constellation, as is done in the power-limited regime. From capacity calculations, he showed that doubling the signal constellation should suffice, e.g., using a 32-QAM rather than a 16-QAM constellation. Ungerboeck invented clever trellis codes for such expanded constellations, using minimum Euclidean distance rather than Hamming distance as the design criterion. As with convolutional codes, trellis codes may be optimally decoded by a VA decoder, whose decoding complexity is proportional to the number of states in the encoder. Ungerboeck showed that effective coding gains of 3 to 4 dB could be obtained with simple 4- to 8-state trellis codes, with no bandwidth expansion.
An 8-state two-dimensional (2D) QAM trellis code due to Lee-Fang Wei [79] (with a nonlinear twist to make it “rotationally invariant”) was soon incorporated into the V.32 voice-grade telephone-line modem standard (see Section V-D). The nominal (and real) coding gain of this 8-state 2D code is γc = 5/2 = 2.5 (3.97 dB); its performance curve is approximately Ps(E) ≈ 4 Q(√(7.5 SNRnorm)), plotted in Figure 9. Later standards such as V.34 have used a 16-state 4D trellis code of Wei [80] (see Section V-D), which has less redundancy (ρ = 1/2 vs. ρ = 1), a nominal coding gain of γc = 4/√2 = 2.82 (4.52 dB), and performance Ps(E) ≈ 12 Q(√(8.49 SNRnorm)), also plotted in Figure 9. We see that its real coding gain at Ps(E) ≈ 10^−5 is about 4.2 dB.
Trellis codes have proved to be more attractive than lattice codes in terms of performance vs. complexity, just as convolutional codes have been preferred to block codes.
Fig. 9. Ps(E) vs. SNRnorm for 8-state 2D and 16-state 4D Wei trellis codes with no shaping, compared to uncoded QAM and the Shannon limit on SNRnorm without shaping.
Nonetheless, the signal constellations used for trellis codes have generally been based on simple lattices, and their “subset partitioning” is often best understood as being based on a sublattice chain. For example, the V.32 code uses a QAM constellation based on the two-dimensional integer lattice Z², with an 8-way partition based on the sublattice chain Z²/R₂Z²/2Z²/2R₂Z², where R₂ is a scaled rotation operator. The Wei 4D 16-state code uses a constellation based on the 4-dimensional integer lattice Z⁴, with an 8-way partition based on the sublattice chain Z⁴/D₄/R₄Z⁴/R₄D₄, where D₄ is the 4-dimensional “checkerboard lattice,” and R₄ is a 4D extension of R₂.
In 1977, Imai and Hirakawa introduced a related concept, called multilevel coding [78]. In this approach, an independent binary code is used at each stage of a chain of 2-way partitions, such as Z²/R₂Z²/2Z²/2R₂Z². By information-theoretic arguments, it can be shown that multilevel coding suffices to approach the Shannon limit [124]. However, TCM has been the preferred approach in practice.

D. History of coding for modem applications

For several decades, the telephone channel was the arena in which the most powerful coding and modulation schemes for the bandwidth-limited AWGN channel were first developed and deployed, because:
• At that time, the telephone channel was fairly well modeled as a bandwidth-limited AWGN channel;
• One dB had significant commercial value;
• Data rates were low enough that a considerable amount of processing could be done per bit.
The first international standard to use coding was the V.32 standard (1986) for 9600 b/s transmission over the public switched telephone network (PSTN) (later raised to 14.4 kb/s in V.32bis). This modem used an 8-state, 2D
Currently, coding techniques similar to those of V.34 are used in higher-speed wireline modems, such as digital subscriber line (DSL) modems, as well as on digital cellular wireless channels. Capacity-approaching coding schemes are now normally included in new wireless standards. In other words, bandwidth-limited coding has moved to these newer, higher-bandwidth settings. VI. T HE TURBO REVOLUTION In 1993, at the IEEE International Conference on Communications (ICC) in Geneva, Berrou, Glavieux, and Thitimajshima [81] stunned the coding research community by introducing a new class of “turbo codes” that purportedly could achieve near-Shannon-limit performance with modest decoding complexity. Comments to the effect of “It can’t be true; they must have made a 3 dB error” were widespread.11 However, within the next year various laboratories confirmed these astonishing results, and the “turbo revolution” was launched. Shortly thereafter, codes similar to Gallager’s LDPC codes were discovered independently by MacKay at Cambridge [82], [83] and by Spielman at MIT [84], [85], along with low-complexity iterative decoding algorithms. MacKay showed that in practice moderate-length LDPC codes (103 -104 bits) could attain near-Shannon-limit performance, whereas Spielman showed that in theory, as n → ∞, they could approach the Shannon limit with linear decoding complexity. These results kicked off a similar explosion of research on LDPC codes, which are currently seen as competitors to turbo codes in practice. In 1995, Wiberg showed in his doctoral thesis at Linköping [86], [87] that both of these classes of codes could be understood as instances of “codes on sparse graphs,” and that their decoding algorithms could be understood as instances of a general iterative APP decoding algorithm called the “sum-product algorithm.” Late in his thesis work, Wiberg discovered that many of his results had previously been found by Tanner [88], in a largely forgotten 1981 paper. Wiberg’s rediscovery of Tanner’s work opened up a new field, called “codes on graphs.” In this section we will discuss the various historical threads leading to and springing from these watershed events of the mid-90’s, which have proved effectively to answer the challenge laid down by Shannon in 1948. A. Precursors As we have discussed in previous sections, certain elements of the turbo revolution had been appreciated for a long time. It had been known since the early work of Elias that linear codes were as good as general codes. Information theorists also understood that maximizing the minimum distance was not the key to getting to capacity; rather, codes should be “random-like,” in the sense that the distribution of distances from a typical codeword to all other codewords should resemble the distance distribution in a random code. These principles were already evident 11 Although both were professors, neither Berrou nor Glavieux had completed a doctoral degree. 22 in Gallager’s monograph on LDPC codes [49]. Gérard Battail, whose work inspired Berrou and Glavieux, was a long-time advocate of seeking “random-like” codes (see, e.g., [89]). Another element of the turbo revolution whose roots go far back is the use of soft decisions (reliability information) not only as input to a decoder, but also in the internal workings of an iterative decoder. 
Indeed, by 1962 Gallager had already developed the modern APP decoder for decoding LDPC codes and had shown that retaining soft-decision (APP) information in iterative decoding was useful even on a hard-decision channel such as a BSC. The idea of using soft-input, soft-output (SISO) decoding in a concatenated coding scheme originated in papers by Battail [90] and by Joachim Hagenauer and Peter Hoeher [91]. They proposed a SISO version of the Viterbi algorithm, called the soft-output Viterbi algorithm (SOVA). In collaboration with John Lodge, Hoeher and Hagenauer extended their ideas to iterating separate SISO APP decoders [92]. Moreover, at the same 1993 ICC at which Berrou et al. introduced turbo codes and first used the term “extrinsic information” (see discussion in next section), a paper by Lodge et al. [93] also included the idea of “extrinsic information”. By this time the benefits of retaining soft information throughout the decoding process had been clearly appreciated; see, for example, Battail [90] and Hagenauer [94]. We have already noted in Section IV-D that similar ideas had been developed at about the same time in the context of NASA’s iterative decoding scheme for concatenated codes. B. The turbo code breakthrough The invention of turbo codes began with Alain Glavieux’s suggestion to his colleague Claude Berrou, a professor of VLSI circuit design, that it would be interesting to implement the SOVA decoder in silicon. While studying the principles underlying the SOVA decoder, Berrou was struck by Hagenauer’s statement that “a SISO decoder is a kind of SNR amplifier.” As a physicist, Berrou wondered whether the SNR could be further improved by repeated decoding, using some sort of “turbo-type” iterative feedback. As they say, the rest is history. The original turbo encoder design introduced in [81] is shown in Figure 10. An information sequence u is encoded by an ordinary rate-1/2, 16-state, systematic recursive convolutional encoder to generate a first parity bit sequence; the same information bit sequence is then scrambled by a large pseudorandom interleaver π and encoded by a second, identical rate-1/2 systematic convolutional encoder to generate a second parity bit sequence. The encoder transmits all three sequences, so the overall encoder has rate 1/3. (This is now called the “parallel concatenation” of two codes, in contrast with the original kind of concatenation, now called “serial.”) u (0) v (1) - v π u (2) v Fig. 10. A parallel concatenated rate-1/3 turbo encoder. The use of recursive (feedback) convolutional encoders and an interleaver turn out to be critical for making a turbo code somewhat “random-like.” If a nonrecursive encoder were used, then a single nonzero information bit would necessarily generate a low-weight code sequence. It was soon shown by Benedetto et al. [95] and by Perez et al. [96] that the use of a length-N interleaver effectively reduces the number of low-weight codewords by a factor of N . However, turbo codes nevertheless have relatively poor minimum distance. Indeed, Breiling has shown that the minimum distance of turbo codes grows only logarithmically with the interleaver length N [97]. CHANNEL CODING: THE ROAD TO CHANNEL CAPACITY 23 The iterative turbo decoding system is shown in Figure 11. Decoders 1 and 2 are APP (BCJR) decoders for the two constituent convolutional codes, Π is the same permutation as in the encoder, and Π−1 is the inverse permutation. Berrou et al. 
discovered that the key to achieving good iterative decoding performance is the removal of the “intrinsic information” from the output APPs L^(i)(u_l), resulting in “extrinsic” APPs L_e^(i)(u_l), which are then passed as a priori inputs to the other decoder. “Intrinsic information” represents the soft channel outputs L_c r_l^(i) and the a priori inputs already known prior to decoding, while “extrinsic information” represents additional knowledge learned about an information bit during an iteration. The removal of “intrinsic information” has the effect of reducing correlations from one decoding iteration to the next, thus allowing improved performance with an increasing number of iterations. (See Chapter 16 of [44] for more details.) The iterative feedback of the “extrinsic” APPs recalls the feedback of exhaust gases in a turbo-charged engine.
Fig. 11. Iterative decoder for a parallel concatenated turbo code.
The performance on an AWGN channel of the turbo code and decoder of Figures 10 and 11, with an interleaver length of N = 2^16, after “puncturing” (deleting symbols) to raise the code rate to 1/2 (η = 1 b/s/Hz), is shown in Figure 12. At Pb(E) ≈ 10^−5, performance is about 0.7 dB from the Shannon limit for η = 1, or only 0.5 dB from the Shannon limit for binary codes at η = 1. In contrast, the real coding gain of the NASA standard concatenated code is about 1.6 dB less, even though its rate is lower and its decoding complexity is about the same.
With randomly constructed interleavers, at values of Pb(E) somewhat below 10^−5, the performance curve of turbo codes typically flattens out, resulting in what has become known as an “error floor” (as seen in Figure 12, for example). This happens because turbo codes do not have large minimum distances, so ultimately performance is limited by the probability of confusing the transmitted codeword with a near neighbor. Several approaches have been suggested to mitigate the error-floor effect. These include using serial concatenation rather than parallel concatenation (see, for example, [98] or [99]), the design of structured interleavers to improve the minimum distance (see, for example, [100] or [101]), or the use of multiple interleavers to eliminate low-weight codewords (see, for example, [102] or [103]). However, the fact that the minimum distance of turbo codes cannot grow linearly with block length implies that the ensemble of turbo codes is not asymptotically good, and that “error floors” cannot be totally avoided.

C. Rediscovery of LDPC codes

Gallager’s invention of LDPC codes and the iterative APP decoding algorithm was long before its time (“a bit of 21st-century coding that happened to fall in the 20th century”). His work was largely forgotten for more than 30 years. It is easy to understand why there was little interest in LDPC codes in the 60s and 70s, because these codes were much too complex for the technology of that time. It is not so easy to explain why they continued to be ignored by the coding community up to the mid-90s.
Shortly after the turbo code breakthrough, several researchers with backgrounds in computer science and physics rather than in coding rediscovered the power and efficiency of LDPC codes.
12 We note, however, that correlations do build up with iterations and that a saturation effect is eventually observed, where no further improvement is possible.
Fig. 12. Performance of a rate-1/2 turbo code with interleaver length N = 2^16, compared to the NASA standard concatenated code and the relevant Shannon limits for η = 1.
In his thesis, Dan Spielman [84], [85] used LDPC codes based on expander graphs to devise codes with linear-time encoding and decoding algorithms and with respectable error performance. At about the same time and independently, David MacKay [83], [104] showed empirically that near-Shannon-limit performance could be obtained with long LDPC-type codes and iterative decoding. Given that turbo codes were already a hot topic, the rediscovery of LDPC codes kindled an explosion of interest in this field that has continued to this day.
An LDPC code is commonly represented by a bipartite graph as in Figure 13, introduced by Michael Tanner in 1981 [88], and now called a “Tanner graph.” Each code symbol yk is represented by a node of one type, and each parity check by a node of a second type. A symbol node and a check node are connected by an edge if the corresponding symbol is involved in the corresponding check. In an LDPC code, the edges are sparse, in the sense that their number increases linearly with the block length n, rather than as n².
The impressive complexity results of Spielman were quickly applied by Alon and Luby [105] to the Internet problem of reconstructing large files in the presence of packet erasures. This work exploits the fact that on an erasure channel,13 decoding linear codes is essentially a matter of solving linear equations, and becomes very efficient if it can be reduced to solving a series of equations, each of which involves a single unknown variable.
13 On an erasure channel, transmitted symbols or packets are either received correctly or not at all; i.e., there are no “channel errors.”
An important general discovery that arose from this work was the superiority of irregular LDPC codes. In a regular LDPC code, such as the one shown in Figure 13, all symbol nodes have the same degree (number of incident edges), and so do all check nodes. Luby et al. [106], [107] found that by using irregular graphs and optimizing the degree sequences (numbers of symbol and check nodes of each degree), they could approach the capacity of the erasure channel, i.e., achieve small error probabilities at code rates of nearly 1 − p, where p is the erasure probability. For example, a rate-1/2 LDPC code capable of correcting up to a fraction p = 0.4955 of erasures is described in [108]. “Tornado codes” of this type were commercialized by Digital Fountain, Inc. [109].
Fig. 13. Tanner graph of the (8, 4, 4) extended Hamming code.
More recently, it has been shown [110] that on any erasure channel, binary or nonbinary, it is possible to design LDPC codes that can approach capacity arbitrarily closely, in the limit as n → ∞. The erasure channel is the only channel for which such a result has been proved.
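The single-unknown-per-equation idea behind these erasure-decoding results is easy to make concrete. The following sketch (ours, not from the paper) runs the basic “peeling” procedure on an (8, 4, 4) extended Hamming code like the one in Figure 13, using one standard choice of parity-check matrix: it repeatedly finds a check equation with exactly one erased symbol and solves it.

# Sketch (ours, not from the paper) of iterative erasure decoding: repeatedly
# find a parity check with exactly one erased symbol and solve that single
# equation. H is one standard parity-check matrix for an (8, 4, 4) extended
# Hamming code (self-dual, so these rows also generate the code).
H = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]

def peel_erasures(y):
    """y is a list of 0, 1, or None (erased). Returns the filled-in word,
    or a partially decoded word if the peeling process gets stuck."""
    y = list(y)
    progress = True
    while progress and None in y:
        progress = False
        for row in H:
            erased = [j for j, h in enumerate(row) if h and y[j] is None]
            if len(erased) == 1:                       # one unknown: solve it
                j = erased[0]
                known = sum(y[k] for k, h in enumerate(row) if h and k != j) % 2
                y[j] = known                           # restore even parity
                progress = True
    return y

# Example: the all-zero codeword with three symbols erased.
print(peel_erasures([0, None, 0, 0, None, 0, None, 0]))   # -> all zeros

For a long LDPC code on an erasure channel this is essentially all that iterative decoding does; the degree-sequence optimization described above determines how large a fraction of erasures the procedure can absorb before it gets stuck.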
Building on the analytical techniques developed for Tornado codes, Richardson, Urbanke et al. [111], [112] used a technique called “density evolution” to design long irregular LDPC codes that for all practical purposes achieve the Shannon limit on binary AWGN channels. Given an irregular binary LDPC code with arbitrary degree sequences, they showed that the evolution of probability densities on a binary-input memoryless symmetric (BMS) channel using an iterative sum-product (or similar) decoder can be analyzed precisely. They proved that error-free performance could be achieved below a certain threshold, for very long codes and large numbers of iterations. Degree sequences may then be chosen to optimize the threshold. By simulations, they showed that codes designed in this way could clearly outperform turbo codes for block lengths of the order of 10^5 or more.
Using this approach, Chung et al. [113] designed several rate-1/2 codes for the AWGN channel, including one whose theoretical threshold approached the Shannon limit within 0.0045 dB, and another whose simulated performance with a block length of 10^7 approached the Shannon limit within 0.040 dB at an error rate of Pb(E) ≈ 10^−6, as shown in Figure 14. It is rather surprising that this close approach to the Shannon limit required no extension of Gallager’s LDPC codes beyond irregularity. The former (threshold-optimized) code had symbol node degrees {2, 3, 6, 7, 15, 20, 50, 70, 100, 150, 400, 900, 2000, 3000, 6000, 8000}, with average degree dλ = 9.25, and check node degrees {18, 19}, with average degree dρ = 18.5. The latter (simulated) code had symbol degrees {2, 3, 6, 7, 18, 19, 55, 56, 200}, with dλ = 6, and all check degrees equal to 12.
In current research, more structured LDPC codes are being sought for shorter block lengths, of the order of 1000. The original work of Tanner [88] included several algebraic constructions of codes on graphs. Algebraic structure may be preferable to a pseudo-random structure for implementation and may allow control over important code parameters such as minimum distance, as well as graph-theoretic variables such as expansion and girth.14 The most impressive results are perhaps those of [114], in which it is shown that certain classical finite-geometry codes and their extensions can produce good LDPC codes. High-rate codes with lengths up to 524,256 have been constructed and shown to perform within 0.3 dB of the Shannon limit.
14 Expansion and girth are properties of a graph that relate to its suitability for iterative decoding.
Fig. 14. Performance of optimized rate-1/2 irregular LDPC codes: asymptotic analysis with maximum symbol degree dl = 100, 200, 8000, and simulations with maximum symbol degree dl = 100, 200 and n = 10^7 [113].

D. RA codes and other variants

Divsalar, McEliece et al. [115] proposed “repeat-accumulate” (RA) codes in 1998 as simple “turbo-like” codes for which one could prove coding theorems. An RA code is generated by the serial concatenation of a simple (n, 1, n) repetition code, a large pseudo-random interleaver Π, and a simple 2-state rate-1/1 convolutional “accumulator” code with input-output equation y_k = x_k + y_{k−1}, as shown in Figure 15.
Fig. 15. A repeat-accumulate (RA) encoder.
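As a concrete illustration (our sketch, not from the paper), the fragment below encodes a short data block with a rate-1/3 RA code of exactly this form: a threefold repetition, a fixed-seed stand-in for the large pseudo-random interleaver Π, and the two-state accumulator y_k = x_k + y_{k−1} (mod 2).

# Minimal sketch (ours) of a rate-1/3 repeat-accumulate encoder: repeat each
# data bit three times, pass the stream through a pseudo-random interleaver,
# and feed it to the 2-state accumulator y_k = x_k + y_{k-1} (mod 2).
# The fixed-seed permutation is only a stand-in for the large interleaver Pi
# that an actual RA code would use.
import random

def ra_encode(data_bits, repeat=3, seed=0):
    repeated = [b for b in data_bits for _ in range(repeat)]   # repetition code
    perm = list(range(len(repeated)))
    random.Random(seed).shuffle(perm)                          # interleaver Pi
    interleaved = [repeated[i] for i in perm]
    out, y = [], 0
    for x in interleaved:                                      # accumulator
        y = (x + y) % 2                                        # y_k = x_k + y_{k-1}
        out.append(y)
    return out

print(ra_encode([1, 0, 1, 1]))   # 12 coded bits for 4 data bits (rate 1/3)

A real RA code would of course use a much longer block and a carefully chosen interleaver; the point here is only how little machinery the encoder requires.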
The performance of RA codes turned out to be remarkably good, within about 1.5 dB of the Shannon limit— i.e., better than that of the best coding schemes known prior to turbo codes. Other authors have proposed equally simple codes with similar or even better performance. For example, RA codes have been extended to “accumulate-repeat-accumulate” (ARA) codes [116], which have even better performance. Ping and Wu [117] proposed “concatenated tree codes” comprising M two-state trellis codes interconnected by interleavers, which exhibit performance almost identical to turbo codes of equal block length, but with an order of magnitude less complexity (see also Massey and Costello [118]). It seems that there are many ways that simple codes can be interconnected by large pseudo-random interleavers and decoded with the sum-product algorithm so as to yield near-Shannon-limit performance. E. Fountain (rateless) codes Fountain codes, or “rateless codes,” are a new class of codes designed for channels without feedback whose statistics are not known a priori; e.g., Internet packet channels where the probability p of packet erasure is unknown. CHANNEL CODING: THE ROAD TO CHANNEL CAPACITY 27 The “fountain” idea is that the transmitter encodes a finite length information sequence into a potentially infinite stream of encoded symbols; the receiver then accumulates received symbols (possibly noisy) until it finds that it has enough for successful decoding. The first codes of this type were the LT (“Luby Transform”) codes of Luby [119], in which each encoded symbol is a parity check on a randomly chosen subset of the information symbols. These were extended to the “Raptor codes” of Shokrollahi [120], in which an inner LT code is concatenated with an outer fixed-length, high-rate LDPC code. Raptor codes permit linear-time decoding and clean up error floors, with a slightly greater coding overhead than LT codes. Both types of codes work well on erasure channels, and both have been implemented for Internet applications by Digital Fountain, Inc.. Raptor codes also appear to work well over more general noisy channels, such as the AWGN channel [121]. F. Approaching the capacity of bandwidth-limited channels In Section V, we discussed coding for bandwidth-limited channels. Following the introduction of capacityapproaching codes, researchers turned their attention to applying these new techniques to bandwidth-limited channels. Much of the early research followed the approach of Ungerboeck’s trellis-coded modulation [77] and the related work of Imai and Hirakawa on multilevel coding [78]. In two variations, turbo TCM due to Robertson and Woerz [122] and parallel concatenated TCM due to Benedetto et al. [123], Ungerboeck’s set partitioning rules were applied to turbo codes with TCM constituent encoders. In another variation, Wachsmann and Huber [124] adapted the multilevel coding technique to work with turbo constituent codes. In each case, performance approaching the Shannon limit was demonstrated at spectral efficiencies η > 2 b/s/Hz with large pseudorandom interleavers. Even earlier, a somewhat different approach had been introduced by LeGoff, Glavieux, and Berrou [125]. They employed turbo codes in combination with bit-interleaved coded modulation (BICM), a technique originally proposed by Zehavi [126] for bandwidth efficient convolutional coding on fading channels. 
In this arrangement, the output sequence of a turbo encoder is bit-interleaved and then Gray-mapped directly onto a signal constellation, without any attention to set partitioning or multilevel coding rules. However, because turbo codes are so powerful, this seeming neglect of efficient signal mapping design rules costs only a small fraction of a dB for most practical constellation sizes, and capacity-approaching performance can still be achieved. In more recent years, many variations of this basic scheme have appeared in the literature. A number of researchers have also investigated the use of LDPC codes in BICM systems. Because of its simplicity and the fact that coding and signal mapping can be considered separately, combining turbo or LDPC codes with BICM has become the most common capacity-approaching coding scheme for bandwidth-limited channels.

G. Codes on graphs

The field of “codes on graphs” has been developed to provide a common conceptual foundation for all known classes of capacity-approaching codes and their iterative decoding algorithms. Inspired partly by Gallager, Michael Tanner founded this field in a landmark paper nearly 25 years ago [88]. Tanner introduced the Tanner graph bipartite graphical model for LDPC codes, as shown in Figure 13. Tanner also generalized the parity-check constraints of LDPC codes to arbitrary linear code constraints. He observed that this model included product codes, or more generally codes constructed “recursively” from simpler component codes. He derived the generic sum-product decoding algorithm, and introduced what is now called the “min-sum” (or “max-product”) algorithm. Finally, his grasp of the architectural advantages of “codes on graphs” was clear:
“The decoding by iteration of a fairly simple basic operation makes the suggested decoders naturally adapted to parallel implementation with large-scale-integrated circuit technology. Since the decoders can use soft decisions effectively, and because of their low computational complexity and parallelism can decode large blocks very quickly, these codes may well compete with current convolutional techniques in some applications.”
Like Gallager’s, Tanner’s work was largely forgotten for many years, until Niclas Wiberg’s seminal thesis [86], [87]. Wiberg based his thesis on LDPC codes, Tanner’s paper, and the field of “trellis complexity of block codes,” discussed in Section IV-E above. Wiberg’s most important contribution may have been to extend Tanner graphs to include state variables as well as symbol variables, as shown in Figure 16(a). A Wiberg-type graph, now called a “factor graph” [127], is still bipartite; however, in addition to symbol variables, which are external, observable, and determined a priori, a factor graph may include state variables, which are internal, unobservable, and introduced at will by the code designer.
Fig. 16. (a) Generic bipartite factor graph, with symbol variables (filled circles), state variables (open circles), and constraints (squares). (b) Equivalent normal graph, with equality constraints replacing variables, and observed variables indicated by “half-edges.”
Subsequently Forney proposed a refinement of factor graphs, namely “normal graphs” [128]. Figure 16(b) shows a normal graph that is equivalent to the generic factor graph of Figure 16(a). In a normal graph, state variables are associated with edges and symbol variables with “half-edges;” state nodes are replaced by repetition constraints that constrain all incident state edges to be equal, while symbol nodes are replaced by repetition constraints and symbol half-edges. This conversion thus causes no change in graph topology or complexity. Both styles of graphical realization are in current use, as are “Forney-style factor graphs.”
By introducing states, Wiberg showed how turbo codes and trellis codes are related to LDPC codes. Figure 17(a) illustrates the factor graph of a conventional trellis code, where each constraint determines the possible combinations of (state, symbol, next state) that can occur. Figure 17(b) is an equivalent normal graph, with state variables represented simply by edges. Note that the graph of a trellis code has no cycles (loops).
Fig. 17. (a) Factor graph of a trellis code, (b) equivalent normal graph of a trellis code.
Perhaps the key result following from the unification of trellis codes and general codes on graphs is the “cut-set bound,” which we now briefly describe. If a code graph is disconnected into two components by deletion of a cut set (a minimal set of edges whose removal partitions the graph into two disconnected components), then the code constraints require a certain minimum amount of information to pass between the two components. In a trellis, this establishes a lower bound on state space size. In a general graph, it establishes a lower bound on the product of the sizes of the state spaces corresponding to a cut set. The cut-set bound implies that cycle-free graphs cannot have state spaces much smaller than those of conventional trellises, since cut sets in cycle-free graphs are single edges; dramatic reductions in complexity can occur only in graphs with cycles, such as the graphs of turbo and LDPC codes.
In this light turbo codes, LDPC codes, and RA codes can all be seen as codes whose graphs are made up of simple codes with linear-complexity graph realizations, connected by a long, pseudo-random interleaver Π, as shown in Figures 18, 19, and 20.
Fig. 18. Normal graph of a Berrou-type turbo code. A data sequence is encoded by two low-complexity trellis codes, in one case after interleaving by a pseudo-random permutation Π.
Fig. 19. Normal graph of a regular dλ = 3, dρ = 6 LDPC code. Code bits satisfy single-parity-check constraints (indicated by “+”), with connections specified by a pseudo-random permutation Π.
Fig. 20. Normal graph of a rate-1/3 RA code. Data bits are repeated three times, permuted by a pseudo-random permutation Π, and encoded by a rate-1/1 convolutional encoder.
Wiberg made equally significant conceptual contributions on the decoding side. Like Tanner, he gave clean characterizations of the min-sum and sum-product algorithms, showing that they were essentially identical except for the substitution of “min” for “sum” and “sum” for “product” (and even giving the further “semi-ring” generalization [129]).
He showed that on cycle-free graphs they perform exact ML and APP decoding, respectively.15 This result strongly motivates the heuristic extension of iterative sum-product decoding to graphs with cycles. Wiberg showed that the turbo and LDPC decoding algorithms may be understood as instances of iterative sum-product decoding applied to their respective graphs. While these graphs necessarily contain cycles, the probability of short cycles is low, and consequently iterative sum-product decoding works well.
15 Indeed, the “extrinsic” APPs passed in a turbo decoder are exactly the messages produced by the sum-product algorithm.
Forney [128] showed that with a normal graph representation, there is a clean separation of functions in iterative sum-product decoding:
• All computations occur at constraint nodes, not at states;
• State edges are used for internal communication (message-passing);
• Symbol edges are used for external communication (I/O).
Connections shortly began to be made to a variety of related work in various other fields, notably in [130], [131], [129], [127]. In addition to the Viterbi and BCJR algorithms for decoding trellis codes and the turbo and LDPC decoding algorithms, the following algorithms have all been shown to be special cases of the sum-product algorithm operating on appropriate graphs:
• the “belief propagation” and “belief revision” algorithms of Pearl [132], used for statistical inference on “Bayesian networks;”
• the “forward-backward” (Baum-Welch) algorithm [133], used for detection of hidden Markov models in signal processing, especially for speech recognition;
• “junction tree” algorithms used with Markov random fields [129]; and
• Kalman filters and smoothers for general Gaussian graphs [127].
In summary, the principles of all known capacity-approaching codes and a wide variety of message-passing algorithms used not only in coding but also in computer science and signal processing can be understood within the framework of “codes on graphs.”

TABLE I
APPLICATIONS OF TURBO CODES (COURTESY OF C. BERROU)

Application                                 Turbo code           Termination  Polynomials     Rates
CCSDS (deep space)                          binary, 16-state     tail bits    23, 33, 25, 37  1/6, 1/4, 1/3, 1/2
UMTS, CDMA2000 (3G Mobile)                  binary, 8-state      tail bits    13, 15, 17      1/4, 1/3, 1/2
DVB-RCS (Return Channel over Satellite)     duo-binary, 8-state  circular     15, 13          1/3 up to 6/7
DVB-RCT (Return Channel over Terrestrial)   duo-binary, 8-state  circular     15, 13          1/2, 3/4
Inmarsat (Aero-H)                           binary, 16-state     no           23, 35          1/2
Eutelsat (Skyplex)                          duo-binary, 8-state  circular     15, 13          4/5, 6/7
IEEE 802.16 (WiMAX)                         duo-binary, 8-state  circular     15, 13          1/2 up to 7/8

Notes: 1) “duo-binary” refers to a turbo code with rate-2/3 constituent codes; 2) “termination” refers to the method of forcing the encoder back to a known state following encoding; 3) polynomials, given in octal notation, specify encoder connections.

H. The impact of the turbo revolution

Even though it has been less than 15 years since the introduction of turbo codes, these codes and the related class of LDPC codes have already had a significant impact in practice.
In particular, almost all digital communication and storage system standards that involve coding are being upgraded to include these new capacity-approaching techniques. Known applications of turbo codes as of this writing are summarized in Table I. LDPC codes have been adopted for the DVB-S2 (Digital Video Broadcasting) and 10GBASE-T or IEEE 802.3an (Ethernet) standards, and are currently also being considered for the IEEE 802.16e (WiMax) and 802.11n (WiFi) standards, as well as for various storage system applications. It is evident from this explosion of activity that capacity-approaching codes are revolutionizing the way that information is transmitted and stored. VII. C ONCLUSIONS It took only 50 years, but the Shannon limit is now routinely being approached within 1 dB on AWGN channels, both power-limited and bandwidth-limited. Similar gains are being achieved in other important applications, such as wireless channels and Internet (packet erasure) channels. So is coding theory finally dead? The Shannon limit guarantees that on memoryless channels such as the AWGN channel, there is little more to be gained in terms of performance. Therefore channel coding for classical applications has certainly reached the point of diminishing returns, just as algebraic coding theory had by 1971. However, this does not mean that research in coding will dry up, any more than research in algebraic coding theory has disappeared. There will always be a place for discipline-driven research that fills out our understanding. Research motivated by issues of performance vs. complexity will always be in fashion, and measures of “complexity” are sure to be redefined by future generations of technology. Coding for non-classical channels, such as multi-user channels, networks, and channels with memory, are hot areas today that seem likely to remain active for a long time. The world of coding research thus continues to be an expanding universe. ACKNOWLEDGMENTS The authors would like to thank Mr. Ali Pusane for his help in the preparation of this manuscript. Comments on earlier drafts by C. Berrou, J. L. Massey, and R. Urbanke were very helpful. 32 R EFERENCES [1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423 and 623–656, 1948. [2] R. W. McEliece, “Are there turbo codes on Mars?” (2004 Shannon Lecture), Proc. 2004 Intl. Symp. Inform. Theory (Chicago, IL), June 30, 2004. [3] J. L. Massey, ”Deep-space communications and coding: A marriage made in heaven,” in Advanced Methods for Satellite and Deep Space Communications, (J. Hagenauer, ed.). New York: Springer, 1992. [4] W. W. Peterson, Error-Correcting Codes. Cambridge, MA: MIT Press, 1961. [5] E. R. Berlekamp, Algebraic Coding Theory. New York: McGraw-Hill, 1968. [6] S. Lin, An introduction to error-correcting codes. Englewood Cliffs, NJ: Prentice-Hall, 1970. [7] W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes. Cambridge, MA: MIT Press, 1972. [8] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes. New York, NY: Elsevier, 1977. [9] R. E. Blahut, Theory and Practice of Error Correcting Codes. Reading, MA: Addison-Wesley, 1983. [10] R. W. Hamming, “Error detecting and error correcting codes,” Bell Syst. Tech. J., vol. 29, pp. 147–160, 1950. [11] M. J. E. Golay, “Notes on digital coding,” Proc. IRE, vol. 37, p. 657, June 1949. [12] E. R. Berlekamp (ed.), Key Papers in the Development of Coding Theory. New York: IEEE Press, 1974. [13] D. E. 
Muller, “Application of Boolean algebra to switching circuit design and to error detection,” IRE Trans. Electron. Comput., vol. EC-3, pp. 6–12, Sept. 1954. [14] I. S. Reed, “A class of multiple-error-correcting codes and the decoding scheme,” IRE Trans. Inform. Theory, vol. IT-4, pp. 38–49, Sept. 1954. [15] R. A. Silverman and M. Balser, “Coding for constant-data-rate systems,” IRE Trans. Inform. Theory, vol. PGIT–4, pp. 50–63, Sept. 1954. [16] E. Prange, “Cyclic error-correcting codes in two symbols,” Tech. Note AFCRC-TN-57-103, Air Force Cambridge Research Center, Cambridge, MA, Sept. 1957. [17] A. Hocquenghem, “Codes correcteurs d’erreurs,” Chiffres, vol. 2, pp. 147–156, 1959. [18] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error-correcting binary group codes,” Inform. Contr., vol. 3, pp. 68–79, Mar. 1960. [19] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” J. SIAM, vol. 8, pp. 300–304, June 1960. [20] R. C. Singleton, “Maximum distance Q-nary codes,” IEEE Trans. Inform. Theory, vol. IT-10, pp. 116–118, Apr. 1964. [21] W. W. Peterson, “Encoding and error-correction procedures for the Bose-Chaudhuri codes,” IRE Trans. Inform. Theory, vol. IT-6, pp. 459–470, Sept. 1960. [22] J. L. Massey, “Shift-register synthesis and BCH decoding,” IEEE Trans. Inform. Theory, vol. IT-15, pp. 122–127, Jan. 1969. [23] G. D. Forney, Jr., “On decoding BCH codes,” IEEE Trans. Inform. Theory, vol. IT–11, pp. 549–557, Oct. 1965. [24] G. D. Forney, Jr., “Generalized minimum distance decoding,” IEEE Trans. Inform. Theory, vol. IT–12, pp. 125–131, Apr. 1966. [25] D. Chase, “A class of algorithms for decoding block codes with channel measurement information,” IEEE Trans. Inform. Theory, vol. IT-18, pp. 170–182, Jan. 1972. [26] G. D. Forney, Jr., “Burst-correcting codes for the classic bursty channel,” IEEE Trans. Commun. Technol., vol. COM–19, pp. 772–781, Oct. 1971. [27] R. W. McEliece and L. Swanson, “Reed-Solomon codes and the exploration of the solar system,” in Reed-Solomon Codes and Their Applications (S. B. Wicker and V. K. Bhargava, eds.), pp. 25–40. Piscataway, NJ: IEEE Press, 1994. [28] K. A. S. Immink, “Reed-Solomon codes and the compact disc,” in Reed-Solomon Codes and Their Applications (S. B. Wicker and V. K. Bhargava, eds.), pp. 41–59. Piscataway, NJ: IEEE Press, 1994. [29] E. R. Berlekamp, G. Seroussi and P. Tong, “A hypersystolic Reed-Solomon decoder,” in Reed-Solomon Codes and Their Applications (S. B. Wicker and V. K. Bhargava, eds.), pp. 205–241. Piscataway, NJ: IEEE Press, 1994. [30] R. W. Lucky, ”Coding is dead,” IEEE Spectrum, c. 1991. Reprinted in R. W. Lucky, Lucky Strikes . . . Again, pp. 243-245. Piscataway, NJ: IEEE Press, 1993. [31] R. M. Roth, Introduction to Coding Theory. Cambridge, UK: Cambridge U. Press, 2006. [32] V. D. Goppa, “Codes associated with divisors,” Probl. Inform. Transm., vol. 13, pp. 22–27, 1977. [33] V. D. Goppa, “Codes on algebraic curves,” Sov. Math. Dokl., vol. 24, pp. 170–172, 1981. [34] M. A. Tsfasman, S. G. Vladut and T. Zink, “Modular codes, Shimura curves and Goppa codes better than the Varshamov-Gilbert bound,” Math. Nachr., vol. 109, pp. 21–28, 1982. [35] I. Blake, C. Heegard, T. Høholdt and V. Wei, “Algebraic-geometry codes,” IEEE Trans. Inform. Theory, vol. 44, pp. 2596–2618, Oct. 1998. [36] M. Sudan, “Decoding of Reed-Solomon codes beyond the error-correction bound,” J. Complexity, vol. 13, pp. 180–193, 1997. [37] P. Elias, “Error-correcting codes for list decoding,” IEEE Trans. Inform. Theory, vol. 37, pp. 5–12, Jan. 
1991. [38] V. Guruswami and M. Sudan, “Improved decoding of Reed-Solomon and algebraic-geometry codes,” IEEE Trans. Inform. Theory, vol. 45, pp. 1757–1767, Sept. 1999. [39] R. Koetter and A. Vardy, “Algebraic soft-decision decoding of Reed-Solomon codes,” IEEE Trans. Inform. Theory, vol. 49, pp. 2809– 2825, Nov. 2003. [40] M. Fossorier and S. Lin, “Computationally efficient soft-decision decoding of linear block codes based on ordered statistics,” IEEE Trans. Inform. Theory, vol. 42, pp. 738–750, May 1996. [41] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965. [42] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968. [43] G. C. Clark, Jr. and J. B. Cain, Error-Correction Coding for Digital Communications. New York: Plenum, 1981. [44] S. Lin and D. J. Costello, Jr., Error Correcting Coding: Fundamentals and Applications, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2004. CHANNEL CODING: THE ROAD TO CHANNEL CAPACITY 33 [45] R. Johannesson and K. S. Zigangirov, Fundamentals of Convolutional Coding. Piscataway, NJ: IEEE Press, 1999. [46] T. Richardson and R. Urbanke, Modern Coding Theory. To appear, 2006. [47] P. Elias, “Coding for noisy channels,” IRE Conv. Rec., pt. 4, pp. 37–46, Mar. 1955. Reprinted in Key Papers in the Development of Information Theory (D. Slepian, ed.), IEEE Press, 1973; Key Papers in the Development of Coding Theory (E. R. Berlekamp, ed.), IEEE Press, 1974; The Electron and the Bit (J. V. Guttag, ed.), EECS Dept., MIT, Cambridge, MA, 2005. [48] R. G. Gallager, “Introduction to ‘Coding for noisy channels,’ by Peter Elias,” in The Electron and the Bit (J. V. Guttag, ed.), pp. 91–94. Cambridge, MA: EECS Dept., MIT, 2005. [49] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963. [50] G. D. Forney, Jr., “Convolutional codes I: Algebraic structure,” IEEE Trans. Inform. Theory, vol. IT–16, pp. 720–738, Nov. 1970. [51] J. M. Wozencraft and B. Reiffen, Sequential Decoding. Cambridge, MA: MIT Press, 1961. [52] R. M. Fano, “A heuristic discussion of probabilistic decoding,” IEEE Trans. Inform. Theory, vol. IT–9, pp. 64–74, Jan. 1963. [53] I. M. Jacobs and E. R. Berlekamp, “A lower bound to the distribution of computation for sequential decoding,” IEEE Trans. Inform. Theory, vol. IT–13, pp. 167–174, 1967. [54] J. L. Massey, Threshold Decoding. Cambridge, MA: MIT Press, 1963. [55] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inform. Theory, vol. IT–13, pp. 260–269, April 1967. [56] G. D. Forney, Jr., “Review of random tree codes,” Appendix A, Final Report, Contract NAS2-3637, NASA CR73176, NASA Ames Res. Ctr., Moffett Field, CA, Dec. 1967. [57] J. K. Omura, “On the Viterbi decoding algorithm,” IEEE Trans. Inform. Theory, vol. IT–15, pp. 177–179, 1969. [58] J. A. Heller, “Short constraint length convolutional codes,” Jet Prop. Lab., Space Prog. Summary 37–54, vol. III, pp. 171–177, 1968. [59] J. A. Heller, “Improved performance of short constraint length convolutional codes,” Jet Prop. Lab., Space Prog. Summary 37–56, vol. III, pp. 83–84, 1969. [60] D. Morton, “Andrew Viterbi, electrical engineer: An oral history,” IEEE History Center, Rutgers U., New Brunswick, NJ, Oct. 1999. [61] J. A. Heller and I. M. Jacobs, “Viterbi decoding for satellite and space communication,” IEEE Trans. Commun. Tech., vol. COM–19, pp. 835–848, Oct. 1971. 
[62] The Revival of Sequential Decoding, a workshop held at the Munich University of Technology, J. Hagenauer and D. Costello, general chairs, Munich, Germany, June 2006.
[63] L. R. Bahl, J. Cocke, F. Jelinek and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Trans. Inform. Theory, vol. IT-20, pp. 284–287, Mar. 1974.
[64] P. Elias, “Error-free coding,” IRE Trans. Inform. Theory, vol. IT-4, pp. 29–37, Sept. 1954.
[65] G. D. Forney, Jr., Concatenated Codes. Cambridge, MA: MIT Press, 1966.
[66] E. Paaske, “Improved decoding for a concatenated coding system recommended by CCSDS,” IEEE Trans. Commun., vol. 38, pp. 1138–1144, Aug. 1990.
[67] O. Collins and M. Hizlan, “Determinate-state convolutional codes,” IEEE Trans. Commun., vol. 41, pp. 1785–1794, Dec. 1993.
[68] J. Hagenauer, E. Offer and L. Papke, “Matching Viterbi decoders and Reed–Solomon decoders in concatenated systems,” in Reed–Solomon Codes and Their Applications (S. B. Wicker and V. K. Bhargava, eds.), pp. 242–271. Piscataway, NJ: IEEE Press, 1994.
[69] J. K. Wolf, “Efficient maximum-likelihood decoding of linear block codes using a trellis,” IEEE Trans. Inform. Theory, vol. IT-24, pp. 76–80, Jan. 1978.
[70] A. Vardy, “Trellis structure of codes,” in Handbook of Coding Theory (V. Pless and W. C. Huffman, eds.). Amsterdam: Elsevier, 1998.
[71] D. J. Costello, Jr., J. Hagenauer, H. Imai and S. B. Wicker, “Applications of error-control coding,” IEEE Trans. Inform. Theory, vol. 44, pp. 2531–2560, Oct. 1998.
[72] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups. New York: Springer, 1988.
[73] R. de Buda, “The upper error bound of a new near-optimal code,” IEEE Trans. Inform. Theory, vol. IT-21, pp. 441–445, July 1975.
[74] G. R. Lang and F. M. Longstaff, “A Leech lattice modem,” IEEE J. Select. Areas Commun., vol. 7, pp. 968–973, Aug. 1989.
[75] G. D. Forney, Jr. and L.-F. Wei, “Multidimensional constellations—Part I: Introduction, figures of merit, and generalized cross constellations,” IEEE J. Select. Areas Commun., vol. 7, pp. 877–892, Aug. 1989.
[76] G. D. Forney, Jr. and G. Ungerboeck, “Modulation and coding for linear Gaussian channels,” IEEE Trans. Inform. Theory, vol. 44, pp. 2384–2415, Oct. 1998.
[77] G. Ungerboeck, “Channel coding with multilevel/phase signals,” IEEE Trans. Inform. Theory, vol. IT-28, pp. 55–67, Jan. 1982.
[78] H. Imai and S. Hirakawa, “A new multilevel coding method using error-correcting codes,” IEEE Trans. Inform. Theory, vol. IT-23, pp. 371–377, May 1977.
[79] L.-F. Wei, “Rotationally invariant convolutional channel encoding with expanded signal space—Part II: Nonlinear codes,” IEEE J. Select. Areas Commun., vol. 2, pp. 672–686, Sept. 1984.
[80] L.-F. Wei, “Trellis-coded modulation using multidimensional constellations,” IEEE Trans. Inform. Theory, vol. IT-33, pp. 483–501, July 1987.
[81] C. Berrou, A. Glavieux and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo codes,” Proc. 1993 Int. Conf. Commun. (Geneva), pp. 1064–1070, May 1993.
[82] D. J. C. MacKay and R. M. Neal, “Good codes based on very sparse matrices,” in Cryptography and Coding: 5th IMA Conference (C. Boyd, ed.), pp. 100–111. Berlin: Springer, 1995.
[83] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance of low-density parity-check codes,” IEE Electron. Lett., vol. 32, pp. 1645–1646, Aug. 1996 (reprinted Electron. Lett., vol. 33, pp. 457–458, Mar. 1997).
[84] M. Sipser and D. A. Spielman, “Expander codes,” IEEE Trans. Inform. Theory, vol. 42, pp. 1710–1722, Nov. 1996. Also Proc. 35th Symp. Found. Comp. Sci., pp. 566–576, 1994.
[85] D. A. Spielman, “Linear-time encodable and decodable error-correcting codes,” IEEE Trans. Inform. Theory, vol. 42, pp. 1723–1731, Nov. 1996. Also Proc. 27th ACM Symp. Theory Comp., pp. 388–397, 1995.
[86] N. Wiberg, “Codes and decoding on general graphs,” Ph.D. dissertation, Linköping Univ., Linköping, Sweden, 1996.
[87] N. Wiberg, H.-A. Loeliger and R. Kötter, “Codes and iterative decoding on general graphs,” Eur. Trans. Telecomm., vol. 6, pp. 513–525, Sept./Oct. 1995.
[88] R. M. Tanner, “A recursive approach to low complexity codes,” IEEE Trans. Inform. Theory, vol. IT-27, pp. 533–547, Sept. 1981.
[89] G. Battail, “Construction explicite de bons codes longs,” Annales des Télécommunications, vol. 44, pp. 392–404, 1989.
[90] G. Battail, “Pondération des symboles décodés par l’algorithme de Viterbi,” Annales des Télécommunications, vol. 42, pp. 31–38, Jan.-Feb. 1987.
[91] J. Hagenauer and P. Hoeher, “A Viterbi algorithm with soft-decision outputs and its applications,” in Proceedings of GLOBECOM ’89, vol. 3, pp. 1680–1686, 1989.
[92] J. Lodge, P. Hoeher and J. Hagenauer, “The decoding of multidimensional codes using separable MAP ‘filters’,” in Proc. 16th Biennial Symposium on Communications, Kingston, Ontario, Canada, pp. 343–346, May 27-29, 1992.
[93] J. Lodge, R. Young, P. Hoeher and J. Hagenauer, “Separable MAP ‘filters’ for the decoding of product and concatenated codes,” in Proc. Intl. Conference on Communications (ICC ’93), Geneva, pp. 1740–1745, May 23-26, 1993.
[94] J. Hagenauer, “‘Soft-in/soft-out,’ the benefits of using soft values in all stages of digital receivers,” in Proceedings of the 3rd International Workshop on Digital Signal Processing Techniques Applied to Space Communications (ESTEC), Noordwijk, pp. 7.1–7.15, Sept. 1992.
[95] S. Benedetto and G. Montorsi, “Unveiling turbo codes: Some results on parallel concatenated coding schemes,” IEEE Trans. Inform. Theory, vol. 42, pp. 409–428, Mar. 1996.
[96] L. C. Perez, J. Seghers and D. J. Costello, Jr., “A distance spectrum interpretation of turbo codes,” IEEE Trans. Inform. Theory, vol. 42, pp. 1698–1709, Nov. 1996.
[97] M. Breiling, “A logarithmic upper bound on the minimum distance of turbo codes,” IEEE Trans. Inform. Theory, vol. 50, pp. 1692–1710, 2004.
[98] J. D. Anderson, “Turbo codes extended with outer BCH code,” IEE Electron. Lett., vol. 32, pp. 2059–2060, Oct. 1996.
[99] S. Benedetto, D. Divsalar, G. Montorsi and F. Pollara, “Serial concatenation of interleaved codes: Performance analysis, design, and iterative decoding,” IEEE Trans. Inform. Theory, vol. 44, pp. 909–926, May 1998.
[100] S. Crozier and P. Guinand, “Distance upper bounds and true minimum distance results for turbo codes designed with DRP interleavers,” in Proc. 3rd International Symposium on Turbo Codes and Related Topics, Brest, France, Sept. 1-5, 2003.
[101] C. Douillard and C. Berrou, “Turbo codes with rate-m/(m + 1) constituent convolutional codes,” IEEE Trans. Commun., vol. 53, pp. 1630–1638, Oct. 2005.
[102] E. Boutillon and D. Gnaedig, “Maximum spread of d-dimensional multiple turbo codes,” IEEE Trans. Commun., vol. 53, pp. 1237–1242, Aug. 2005.
[103] C. He, M. Lentmaier, D. J. Costello, Jr. and K. Sh. Zigangirov, “Joint permutor analysis and design for multiple turbo codes,” IEEE Trans. Inform. Theory, vol. 52, pp. 4068–4083, Sept. 2006.
[104] D. J. C. MacKay, “Good error-correcting codes based on very sparse matrices,” IEEE Trans. Inform. Theory, vol. 45, pp. 399–431, Mar. 1999. Also Proc. 5th IMA Conf. Crypto. Coding, pp. 100–111, 1995.
[105] N. Alon and M. Luby, “A linear-time erasure-resilient code with nearly optimal recovery,” IEEE Trans. Inform. Theory, vol. 42, pp. 1732–1736, Nov. 1996.
[106] M. Luby, M. Mitzenmacher, M. A. Shokrollahi, D. A. Spielman and V. Stemann, “Practical loss-resilient codes,” Proc. 29th Symp. Theory Computing, pp. 150–159, 1997.
[107] M. Luby, M. Mitzenmacher, M. A. Shokrollahi and D. A. Spielman, “Improved low-density parity-check codes using irregular graphs,” IEEE Trans. Inform. Theory, vol. 47, pp. 585–598, Feb. 2001.
[108] A. Shokrollahi and R. Storn, “Design of efficient erasure codes with differential evolution,” Proc. 2000 IEEE Intl. Symp. Inform. Theory (Sorrento, Italy), p. 5, June 2000.
[109] J. W. Byers, M. Luby, M. Mitzenmacher and A. Rege, “A digital fountain approach to reliable distribution of bulk data,” Proc. ACM SIGCOMM ’98 (Vancouver), 1998.
[110] M. Luby, M. Mitzenmacher, M. A. Shokrollahi and D. A. Spielman, “Efficient erasure-correcting codes,” IEEE Trans. Inform. Theory, vol. 47, pp. 569–584, Feb. 2001.
[111] T. J. Richardson and R. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inform. Theory, vol. 47, pp. 599–618, Feb. 2001.
[112] T. J. Richardson, A. Shokrollahi and R. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Trans. Inform. Theory, vol. 47, pp. 619–637, Feb. 2001.
[113] S.-Y. Chung, G. D. Forney, Jr., T. J. Richardson and R. Urbanke, “On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit,” IEEE Commun. Letters, vol. 5, pp. 58–60, Feb. 2001.
[114] Y. Kou, S. Lin and M. P. C. Fossorier, “Low-density parity-check codes based on finite geometries: A rediscovery,” Proc. 2000 IEEE Intl. Symp. Inform. Theory (Sorrento, Italy), p. 200, June 2000.
[115] D. Divsalar, H. Jin and R. J. McEliece, “Coding theorems for ‘turbo-like’ codes,” Proc. 1998 Allerton Conf. (Allerton, IL), pp. 201–210, Sept. 1998.
[116] A. Abbasfar, D. Divsalar and K. Yao, “Accumulate-repeat-accumulate codes,” in Proc. IEEE Global Telecommunications Conference (GLOBECOM 2004), pp. 509–513, Dallas, TX, USA, Dec. 2004.
[117] L. Ping and K. Y. Wu, “Concatenated tree codes: A low-complexity, high-performance approach,” IEEE Trans. Inform. Theory, vol. 47, pp. 791–799, Feb. 2001.
[118] P. C. Massey and D. J. Costello, Jr., “New low-complexity turbo-like codes,” in Proc. IEEE Inform. Theory Workshop, pp. 70–72, Cairns, Australia, Sept. 2001.
[119] M. Luby, “LT codes,” Proc. 43rd Annual IEEE Symp. Foundations of Computer Science (FOCS) (Vancouver, Canada), pp. 271–280, Nov. 2002.
[120] A. Shokrollahi, “Raptor codes,” IEEE Trans. Inform. Theory, vol. 52, pp. 2551–2567, June 2006.
[121] O. Etesami and A. Shokrollahi, “Raptor codes on binary memoryless symmetric channels,” IEEE Trans. Inform. Theory, vol. 52, pp. 2033–2051, May 2006.
[122] P. Robertson and T. Wörz, “Coded modulation scheme employing turbo codes,” IEE Electron. Lett., vol. 31, pp. 1546–1547, Aug. 1995.
[123] S. Benedetto, D. Divsalar, G. Montorsi and F. Pollara, “Bandwidth-efficient parallel concatenated coding schemes,” IEE Electron. Lett., vol. 31, pp. 2067–2069, Nov. 1995.
[124] U. Wachsmann, R. F. H. Fischer and J. B. Huber, “Multilevel codes: Theoretical concepts and practical design rules,” IEEE Trans. Inform. Theory, vol. 45, pp. 1361–1391, July 1999.
[125] S. Le Goff, A. Glavieux and C. Berrou, “Turbo codes and high spectral efficiency modulation,” in Proc. IEEE Intl. Conf. Commun. (ICC 1994), pp. 645–649, New Orleans, LA, May 1994.
[126] E. Zehavi, “8-PSK trellis codes for a Rayleigh channel,” IEEE Trans. Commun., vol. COM-40, pp. 873–884, May 1992.
[127] F. R. Kschischang, B. J. Frey and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. Inform. Theory, vol. 47, pp. 498–519, Feb. 2001.
[128] G. D. Forney, Jr., “Codes on graphs: Normal realizations,” IEEE Trans. Inform. Theory, vol. 47, pp. 520–548, Feb. 2001.
[129] S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE Trans. Inform. Theory, vol. 46, pp. 325–343, Mar. 2000.
[130] R. J. McEliece, D. J. C. MacKay and J.-F. Cheng, “Turbo decoding as an instance of Pearl’s ‘belief propagation’ algorithm,” IEEE J. Selected Areas Commun., vol. 16, pp. 140–152, Feb. 1998.
[131] F. R. Kschischang and B. J. Frey, “Iterative decoding of compound codes by probability propagation in graphical models,” IEEE J. Selected Areas Commun., vol. 16, pp. 219–230, Feb. 1998.
[132] J. Pearl, Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann, 1988.
[133] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite-state Markov chains,” Ann. Math. Stat., vol. 37, pp. 1554–1563, Dec. 1966.