Cs-2 Nptel Notes


INDEX

S. No Topic Page No.


Week 1
1 Introduction to Digital Communication Systems 1
Spectrum of Transmitted Digital Communication Signal and Wide Sense
2 Stationarity 11
Spectrum of Transmitted Digital Communication Signal, Autocorrelation
3 Function and Power Spectral Density 20
Spectrum of Transmitted Digital Communication Signal, Relation to Energy
4 Spectral Density and Introduction to AWGN Channel 34
Additive White Gaussian Noise (AWGN) Properties, Gaussian Noise and
5 White Noise 47
Structure of Digital Communication Receiver, Receiver Filter and Signal-to-
6 Noise Power Ratio (SNR) 62
7 Digital Communication Receiver, Noise Properties and Output Noise Power 71
Week 2
8 Digital Communication Receiver, Optimal SNR and Matched Filter 82
Probability of Error in Digital Communication and Probability Density
9 Functions of Output 93
Probability of Error in Digital Communication, Optimal Decision Rule and
10 Gaussian Q function 109
Introduction to Binary Phase Shift Keying (BPSK) Modulation, Optimal
11 Decision Rule and Probability of Bit-Error or Bit-Error Rate (BER) 117
12 Introduction to Amplitude Shift Keying (ASK) Modulation 134
Optimal Decision Rule for Amplitude Shift Keying (ASK), Bit Error Rate
13 (BER) and Comparison with Binary Phase Shift Keying (BPSK) Modulation 149
14 Introduction to Signal Space Concept and Orthonormal Basis Signals 162
15 Introduction to Frequency Shift Keying (FSK) 177
Week 3
Optimal Decision Rule for FSK, Bit Error Rate (BER) and Comparison with
16 BPSK, ASK 192
17 Introduction to Quadrature Phase Shift Keying (QPSK) 204
18 Waveforms of Quadrature Phase Shift Keying (QPSK) 214
Matched Filtering, Bit Error Rate and Symbol Error Rate for Quadrature Phase
19 Shift Keying (QPSK) 222
Introduction to M-ary PAM (Pulse Amplitude Modulation), Average Symbol
20 Power and Decision rules 239
M-ary PAM (Pulse Amplitude Modulation) -Part-II, Optimal Decision Rule
21 and Probability of Error 252
Week 4
M-ary QAM (Quadrature Amplitude Modulation) Part-I, Introduction,
22 Transmitted Waveform and Average Symbol Energy 271
M-ary QAM (Quadrature Amplitude Modulation) - Part-II, Optimal Decision
23 Rule, Probability of Error and Constellation Diagram 281
M-ary PSK (Phase Shift Keying) Part-I, Introduction , Transmitted Waveform
24 and Constellation Diagram 299
M-ary PSK (Phase Shift Keying) - Part- II, Optimal Decision Rule, Nearest
25 Neighbor Criterion and Approximate Probability of Error 311
Introduction to Information Theory, Relevance of Information Theory and
26 Characterization of Information 321
Definition of Entropy, Average Information / Uncertainty of Source and
27 Properties of Entropy 333
Entropy Example- Binary Source Maximum and Minimum Entropy of Binary
28 Source 345
Week 5
Maximum Entropy of Source with M-ary Alphabet, Concave/Convex Functions
29 and Jensen's Inequality 358
Joint Entropy , Definition of Joint Entropy of Two Sources and Simple
30 Examples for Joint Entropy Computation 372
Properties of Joint Entropy and Relation between Joint Entropy and Marginal
31 Entropies 383
Conditional Entropy, Example of Conditional Entropy and Properties of
32 Conditional Entropy 397
Mutual Information, Diagrammatic Representation and Properties of Mutual
33 Information 410
Simple Example of Mutual Information and Practical Example of Mutual
34 Information-Binary Symmetric Channel 420
Channel Capacity, Implications of Channel Capacity, Claude E. Shannon-
Father of Information Theory and Example of Capacity of Binary Symmetric
35 Channel 432
Week 6
36 Differential Entropy and Example for Uniform Probability Density function 443
37 Differential Entropy of Gaussian Source and Insights 453
38 Joint Conditional/ Differential Entropies and Mutual Information 462
39 Capacity of Gaussian channel- Part I 471
Capacity of Gaussian Channel Part-II, Practical Implications and Maximum
40 Rate in bits/sec 483
Introduction to Source Coding and Data Compression, Variable Length codes
41 and Unique Decodability 500
Uniquely Decodable Codes, Prefix-free code, Instantaneous Code and Average
42 Code length 509
Week 7
43 Binary Tree Representation of Code, Example and Kraft Inequality 517
44 Lower Bound on Average Code Length and Kullback-Leibler Divergence 526
45 Optimal Code length, Constrained Optimization and Morse Code Example 536
46 Approaching Lower Bound on Average code length and Block Coding 545
47 Huffman Code, Algorithm, Example and Average Code Length 551
Introduction to channel coding, Rate of Code, Repetition Code and Hamming
48 Distance 560
Introduction to Convolutional Codes, Binary Field Arithmetic and Linear
49 Codes 568
Week 8
Example of Convolutional Code Output and Convolution Operation for Code
50 generation 576
Matrix Representation of Convolutional Codes, Generator Matrix, Transform
51 Domain Representation and Shift Register Architecture 584
State Diagram Representation of Convolutional Code, State transitions and
52 Example of Code Generation using State transitions 593
53 Trellis Representation of Convolutional Code and Valid Code Words 605
Decoding of the Convolutional Code, Minimum Hamming distance and
54 Maximum Likelihood Codeword Estimate 613
55 Principle of Decoding of Convolutional code 619
Viterbi Decoder for Maximum Likelihood Decoding of Convolutional Code
Using Trellis Representation, Branch Metric Calculation, State Metric
56 Calculation and Example 629
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 01
Introduction to Digital Communication Systems

Hello, welcome to another module in this massive open online course. So, this course will
introduce you to the typical structure and working of a digital communication system.

(Refer Slide Time: 00:30)

So, this module will focus on an introduction to digital communication systems. A schematic diagram of a typical digital communication system is as follows.

1
(Refer Slide Time: 01:31)

I have a message signal m(t) that I want to communicate. In a typical analog system, you simply modulate the message signal m(t) and transmit it. In a digital communication system, however, the process is more elaborate. First, the message has to pass through an encoder, which converts the message into a stream of bits.

So, the encoder takes your message and converts it into a stream of bits. As you might have seen in previous modules, this is achieved by first sampling, followed by quantization of the samples and representation of the quantized samples using an appropriate number of bits, as sketched below.
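The following is a minimal sketch of this encoding step, assuming a uniform B-bit quantizer over a fixed input range; the function name, bit width and signal range are illustrative choices, not taken from the lecture.

```python
import numpy as np

def encode_uniform(m, num_bits=3, m_max=1.0):
    """Uniformly quantize samples of m(t) in [-m_max, m_max] and map each
    quantized level to a num_bits binary word (a minimal encoder sketch)."""
    levels = 2 ** num_bits
    # Quantize each sample to the nearest of `levels` uniformly spaced levels.
    idx = np.clip(np.round((m + m_max) / (2 * m_max) * (levels - 1)),
                  0, levels - 1).astype(int)
    # Represent each level index with num_bits bits (MSB first).
    bits = ((idx[:, None] >> np.arange(num_bits - 1, -1, -1)) & 1).flatten()
    return bits

# Example: 8 samples of a sinusoidal message -> stream of bits
t = np.arange(8) / 8.0
m_samples = np.sin(2 * np.pi * t)
print(encode_uniform(m_samples, num_bits=3))
```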

Now, this stream of bits from the encoder is given as input to a digital modulation scheme. The modulated signal then passes through a channel. The channel can be of various types: it can be a wire-line channel such as a twisted pair or a coaxial cable, or it can be a wireless channel, that is, a radio channel. We will look at some of these different channel types. The channel output is then passed through a demodulator.

So, we have a modulation operation at the transmitter and a demodulator at the receiver. At this point you have the demodulated bits, for example 0, 1, 1, 0, 1, 1. Notice that I have purposely written a different set of bits: if you look at the input, the fifth bit is a 0, but the fifth bit at the output is a 1, and this is a very important point. The demodulated stream of bits need not be exactly equal to the transmitted (modulated) stream of bits. This arises from bit errors at the receiver.

2
Bit errors are a major challenge in digital communication systems. It is desirable to have a minimum number of bit errors, that is, to minimize the probability of bit error. The probability of bit error, that is, the probability with which a received bit is in error, is a very important metric to characterize the performance of a communication system. Bit errors arise due to various factors such as the noise at the receiver and the fading channel, if the channel is wireless.

The stream of bits from the demodulator is fed to a decoder, which gives you the estimated message signal $\hat{m}(t)$.

(Refer Slide Time: 07:07)

3
(Refer Slide Time: 07:50)

Now, the key parts of the theory of digital communication systems concern how to design the various modules: namely, what is the optimal transmit processing, what is the optimal filtering operation that can be employed at the receiver, and what is the optimal modulation that can be employed at the transmitter.

What are the modulation schemes? For instance, when you look at transmission, one thing you can look at is the transmit filter and the energy or power of the transmitted signal. How do we design a filter such that the signal power is maximized and the noise power is minimized, so that the signal-to-noise power ratio is maximized? What is the performance of the different modulation schemes, for instance, what is the probability of error associated with each modulation scheme, which characterizes the ultimate performance?

So, there are several issues which we are going to look at through the various modules.

Now, how do we transmit the signal? In a digital communication system, the transmitted signal corresponds to a series of symbols.

4
(Refer Slide Time: 10:11)

Let us take a simple example to start. We have two binary information symbols: information symbol 0 and information symbol 1. In order to transmit the signal, these have to be mapped to a physical signal, that is, to some voltage levels. Suppose 0 is mapped to +A and 1 is mapped to -A.

(Refer Slide Time: 11:48)

So, these are the 2 symbols belonging to the digital constellation.

So, there are 2 different voltage levels, -A and +A, and they are opposite in phase: one has phase 0 and the other has phase 180 degrees.

5
We have 2 different phases, so this is a binary phase constellation. This is known as the binary phase shift keying (BPSK) constellation, which is one of the simplest constellations we will look at. There will be several others as we proceed through the various modules. This mapping operation is termed digital modulation.

Although conventionally we can simply call it modulation, more specifically this is digital modulation, as we are mapping the information bits to points of a digital constellation.

(Refer Slide Time: 14:27)

Suppose we have a bit stream 0, 1, 1, 0, 0, 1, ...; this stream can be mapped to digitally modulated symbols by assigning +A to 0 and -A to 1. This gives us the stream of digitally modulated symbols A, -A, -A, A, A, -A.

Now, how do we transmit this? Obviously, we cannot simply transmit the stream of digitally modulated symbols; I cannot simply transmit a voltage level. I have to transmit it over some physical signal that can be sent over the channel. The template signal that is used to transmit these voltage levels of the digital modulation is termed a pulse.

6
(Refer Slide Time: 16:25)

This particular signal is termed a pulse. How do we transmit the digitally modulated symbols? We transmit them using a pulse $P_T(t)$. The pulse can be of height 1 from -T/2 to T/2, where T is the width of the pulse.

(Refer Slide Time: 17:25)

This pulse is defined by:

$$P_T(t) = \begin{cases} 1, & |t| < T/2 \\ 0, & \text{otherwise} \end{cases}$$

7
(Refer Slide Time: 18:15)

Suppose the first symbol is $a_0$. This is multiplied by the original pulse, resulting in $a_0 P_T(t)$. For the second symbol, the pulse is shifted by T and the symbol is multiplied by the shifted pulse, giving $a_1 P_T(t - T)$. For the third symbol, the pulse is shifted by 2T, giving $a_2 P_T(t - 2T)$, and the process continues for further symbols.

So, corresponding to the kth symbol, we shift the pulse by kT and multiply it by $a_k$. Each symbol $a_k$ is transmitted in the kth symbol duration.

(Refer Slide Time: 22:23)

8
The kth symbol is transmitted as 𝑎𝑘 𝑃𝑇 (𝑡 − 𝑘𝑇)

(Refer Slide Time: 23:22)

The net transmitted digital signal is given by:

$$x(t) = \sum_{k=-\infty}^{\infty} a_k P_T(t - kT)$$
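As an illustration of this construction, the sketch below builds the BPSK pulse train for a short bit stream, assuming a rectangular pulse of width T that, for simplicity, occupies the interval [kT, (k+1)T) for the kth symbol; the amplitude, symbol duration and sampling grid are assumed values, not specified in the lecture.

```python
import numpy as np

def bpsk_pulse_train(bits, A=1.0, T=1.0, samples_per_symbol=20):
    """Build x(t) = sum_k a_k p_T(t - kT) for a rectangular pulse p_T(t)
    of width T, mapping bit 0 -> +A and bit 1 -> -A (a sketch of the
    transmitted waveform described above)."""
    symbols = np.where(np.array(bits) == 0, A, -A)   # a_k in {+A, -A}
    pulse = np.ones(samples_per_symbol)              # rectangular p_T(t)
    # Each symbol scales one copy of the pulse over one symbol duration T.
    x = np.concatenate([a * pulse for a in symbols])
    t = np.arange(len(x)) * (T / samples_per_symbol)
    return t, x

t, x = bpsk_pulse_train([0, 1, 1, 0, 0, 1])
print(x[::20])   # one sample per symbol: [ 1. -1. -1.  1.  1. -1.]
```

Each symbol simply scales one shifted copy of the pulse, so the resulting waveform is piecewise constant at the levels ±A.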

(Refer Slide Time: 25:10)

So, in this module we have seen a brief introduction to a digital communication system: a typical schematic diagram illustrating the various components or modules at the transmitter and receiver, also showing the place and relevance of the channel in this typical schematic. We have also seen a brief overview of the process of mapping, or modulation, of the information bits to a digital modulation constellation, followed by the structure of a typical digitally modulated communication signal employing a pulse shaping filter.

So, this $P_T$, which is basically your pulse, is a filter; it can be implemented as a filter. It is also known as a pulse shaping filter. We will stop here and look at other aspects of the subject in the subsequent modules.

Thank you.

10
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 02
Spectrum of Transmitted Digital Communication Signal, Wide Sense Stationarity

(Refer Slide Time: 00:26)

Hello, welcome to another module in this massive open online course. In this module, let us start looking at the spectrum of the transmitted signal in a digital communication system. In the last module we saw that the transmitted digital signal is given by:

$$x(t) = \sum_{k=-\infty}^{\infty} a_k P_T(t - kT)$$

11
(Refer Slide Time: 01:53)

In general, PT(t) can be any pulse. In this example a rectangular pulse has been considered

(Refer Slide Time: 02:14)

The kth bit can randomly be 0 or 1. Thus ak can randomly be A or -A depending on the kth
bit. Now, we want to characterize the spectrum of the transmitted signal x(t). The PT(t) in
general can be any pulse but for this example we consider a rectangular pulse given by:

$$P_T(t) = \begin{cases} 1, & |t| \le T/2 \\ 0, & \text{otherwise} \end{cases}$$

12
(Refer Slide Time: 04:19)

Fourier transform of this pulse is given by

$$P_T(t) \leftrightarrow P_T(f) = T\,\mathrm{sinc}(fT) = \frac{T\sin(\pi f T)}{\pi f T}$$

(Refer Slide Time: 05:30)

Now, we will invoke our knowledge of probability. Since $a_k$ is random, we set the probability that $a_k$ equals $A$ to one half, and the probability that $a_k$ equals $-A$ to one half. So, we are assuming that $+A$ and $-A$ are equiprobable. Using our knowledge of probability, the expected value of $a_k$ is:

$$E\{a_k\} = 0.5 \times A + 0.5 \times (-A) = 0$$

(Refer Slide Time: 07:35)

Now, we will also assume that the symbols are independent and identically distributed, i.e., the symbols $a_k$ are IID. Identically distributed means each takes the value $-A$ or $+A$ with probability half. Independent means these random variables are independent, and it follows from independence that they are uncorrelated.

This can be written as:

𝐸{𝑎𝑘 𝑎𝑚 } = 𝐸{𝑎𝑘 }𝐸{𝑎𝑚 } 𝑖𝑓 (𝑘 ≠ 𝑚)

For the digital signals in our example E{ak} and E{am} are equal to zero thus,

𝐸{𝑎𝑘 𝑎𝑚 } = 0

14
(Refer Slide Time: 09:07)

Now we need to find the average value of x(t).

$$E\{x(t)\} = \sum_{k=-\infty}^{\infty} E\{a_k\}\, P_T(t - kT) = 0 \quad (\text{since } E\{a_k\} = 0)$$

(Refer Slide Time: 10:19)

Now, let us find the spectra. Let us employ a conventional approach. Whenever we want
to find the spectrum of a signal, we use the Fourier transform all right.

15
(Refer Slide Time: 12:13)

The Fourier transform of x(t) is given by:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j2\pi f t}\, dt$$

$$E\{X(f)\} = \int_{-\infty}^{\infty} E\{x(t)\}\, e^{-j2\pi f t}\, dt = 0 \quad (\text{since } E\{x(t)\} = 0)$$

(Refer Slide Time: 13:07)

16
Thus, the expectation, or average value, of the Fourier transform of the transmitted signal is zero. Taken at face value this would suggest that there is no signal to transmit, which is clearly not the case. The resolution is that the Fourier transform of the signal itself cannot be used to characterize the spectrum of a random signal.

To characterize the spectrum of a random signal we use a different quantity, the power spectral density.

(Refer Slide Time: 15:05)

(Refer Slide Time: 16:06)

17
Power spectral density (PSD) gives us the spectral distribution of power of a random signal

(Refer Slide Time: 17:42)

In order to compute PSD, we first need to compute the autocorrelation of the signal.

(Refer Slide Time: 18:15)

For a wide sense stationary (WSS) process the autocorrelation is defined as

$$R_{xx}(\tau) = E\{x(t)\,x(t+\tau)\}$$

Power spectral density will be the Fourier transform of autocorrelation.

18
We will continue with this discussion in the next module.

Thank you very much.

19
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 03
Spectrum of Transmitted Digital Communication Signal, Autocorrelation Function
and Power Spectral Density

(Refer Slide Time: 00:29)

Hello, welcome to another module in this massive open online course. We are looking at the autocorrelation function of the transmitted digital communication signal, that is,

$$E\{x(t)\,x(t+\tau)\} = E\left\{ \sum_{k=-\infty}^{\infty} a_k P_T(t - kT) \times \sum_{m=-\infty}^{\infty} a_m P_T(t + \tau - mT) \right\}$$

20
(Refer Slide Time: 01:35)

Now, we expand the summation that is,

$$= E\left\{ \sum_{k=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} a_k a_m\, P_T(t - kT)\, P_T(t + \tau - mT) \right\}$$

(Refer Slide Time: 02:41)

Now, if I take the expectation operator inside,

21
$$= \sum_{k=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} E\{a_k a_m\}\, P_T(t - kT)\, P_T(t + \tau - mT)$$

(Refer Slide Time: 03:56)

Since symbols are IID, we can write,

𝐸{𝑎𝑘 𝑎𝑚 } = 𝐸{𝑎𝑘 }𝐸{𝑎𝑚 } = 0 (𝑖𝑓 𝑘 ≠ 𝑚)

The only terms that remain are those where k=m.

22
(Refer Slide Time: 05:05)

So, therefore, the summation equivalently reduces to,

$$= \sum_{k=-\infty}^{\infty} E\{a_k^2\}\, P_T(t - kT)\, P_T(t + \tau - kT)$$

We know from the previous modules that

𝑃𝑟(𝑎𝑘 = 𝐴) = 0.5 𝑎𝑛𝑑 𝑃𝑟 (𝑎𝑘 = −𝐴) = 0.5

Therefore,

𝐸{𝑎𝑘2 } = 0.5𝐴2 + 0.5(−𝐴)2

= 𝐴2

23
(Refer Slide Time: 06:33)

We denote $A^2$ as $P_d$, the 'data symbol power'.

(Refer Slide Time: 06:53)

Therefore, the summation reduces to,

$$E\{x(t)\,x(t+\tau)\} = P_d \sum_{k=-\infty}^{\infty} P_T(t - kT)\, P_T(t + \tau - kT)$$

This is our autocorrelation function, and it depends on t as well, not only on τ. Thus, the resulting process is not wide sense stationary. We would now like to convert this process into a wide sense stationary process, because WSS signals are easier to study and we can compute their PSD to characterize the spectral distribution of power.

(Refer Slide Time: 09:00)

In order to achieve this, we introduce a random delay $t_o$. This gives us

$$x(t) = \sum_{k=-\infty}^{\infty} a_k P_T(t - kT - t_o)$$

where $t_o$ is a uniformly distributed random delay in [0, T].

(Refer Slide Time: 09:45)

25
(Refer Slide Time: 10:15)

$t_o$ is assumed to be uniformly distributed over [0, T], i.e.,

$$f_{T_o}(t_o) = \begin{cases} \dfrac{1}{T}, & 0 \le t_o \le T \\ 0, & \text{otherwise} \end{cases}$$

(Refer Slide Time: 11:14)

Now, with this shifted signal,

$$E\{x(t)\,x(t+\tau)\} = P_d \sum_{k=-\infty}^{\infty} E\{P_T(t - kT - t_o) \times P_T(t + \tau - kT - t_o)\}$$

We have to take the expected value of the summation with respect to $t_o$.

(Refer Slide Time: 12:32)

To take the average, we multiply the expression by the probability density of the random delay and integrate:

$$= P_d \sum_{k=-\infty}^{\infty} \int_{-\infty}^{\infty} f_{T_o}(t_o)\, P_T(t - kT - t_o)\, P_T(t + \tau - kT - t_o)\, dt_o$$

27
(Refer Slide Time: 13:47)

Substituting the probability distribution of the random delay we get,

$$= \frac{P_d}{T} \sum_{k=-\infty}^{\infty} \int_{0}^{T} P_T(t - kT - t_o)\, P_T(t + \tau - kT - t_o)\, dt_o$$

(Refer Slide Time: 15:15)

Now we employ the following substitution,

28
$$t_o + kT = \tilde{t}_o \quad \text{and} \quad dt_o = d\tilde{t}_o$$

(Refer Slide Time: 17:00)

The summation then transforms to,

$$= \frac{P_d}{T} \sum_{k=-\infty}^{\infty} \int_{kT}^{(k+1)T} P_T(t - \tilde{t}_o)\, P_T(t + \tau - \tilde{t}_o)\, d\tilde{t}_o$$

Each integral in the above sum goes from kT to (k+1)T, and k itself goes from negative infinity to infinity. Thus, the sum can be replaced by a single integral from $-\infty$ to $\infty$.

(Refer Slide Time: 17:44)

29
The summation now equals:

$$= \frac{P_d}{T} \int_{-\infty}^{\infty} P_T(t - \tilde{t}_o)\, P_T(t + \tau - \tilde{t}_o)\, d\tilde{t}_o$$

Then we substitute

$$t + \tau - \tilde{t}_o = t_o' \quad \text{and} \quad d\tilde{t}_o = -dt_o'$$

(Refer Slide Time: 19:01)

Note here that $t_o'$ goes from $\infty$ to $-\infty$ (reverse order), but there will be a minus sign in $dt_o'$, which reverses the integration range again.

(Refer Slide Time: 20:19)

30
Thus, the autocorrelation will now be,

$$= \frac{P_d}{T} \int_{-\infty}^{\infty} P_T(t_o')\, P_T(t_o' - \tau)\, dt_o'$$

(Refer Slide Time: 22:22)

The integral in the above expression is independent of t; it is in fact the autocorrelation of $P_T(t)$ and can be denoted $R_{P_T P_T}(\tau)$.

(Refer Slide Time: 23:46)

Thus, we have the relation,

31
$$R_{xx}(\tau) = \frac{P_d}{T}\, R_{P_T P_T}(\tau)$$

(Refer Slide Time: 25:19)

32
Fourier transform of the autocorrelation will give the power spectral density of the signal.
Thus,

$$S_{xx}(f) = \frac{P_d}{T}\, S_{P_T P_T}(f)$$

where $S_{xx}(f)$ is the power spectral density of the transmitted signal and $S_{P_T P_T}(f)$ is the energy spectral density of the pulse (since the pulse is deterministic).

(Refer Slide Time: 26:05)

Remember, if you take the Fourier transform of the autocorrelation function of a deterministic signal such as this pulse, you get the energy spectral density. For a random signal it gives the power spectral density.

In this module we have shown that the power spectral density of the transmitted signal
depends on the energy spectral density of the pulse, given by the expression above.
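As a rough numerical check of this relation (not part of the original lecture), the sketch below estimates the ensemble autocorrelation of the randomly delayed BPSK pulse train and compares it with $(P_d/T)\,R_{P_T P_T}(\tau)$, which for a rectangular pulse equals $P_d(1 - |\tau|/T)$ for $|\tau| \le T$. The values of A, T and the discretization are assumptions for illustration.

```python
import numpy as np

# Monte Carlo check of R_xx(tau) = (P_d/T) * R_PTPT(tau) for the random
# BPSK pulse train with a random delay t_o ~ Uniform[0, T).
rng = np.random.default_rng(0)
A, T, sps = 1.0, 1.0, 50            # sps = samples per symbol duration T
dt = T / sps
num_trials, num_syms = 2000, 40

lags = np.arange(-sps, sps + 1)      # tau in [-T, T]
acc = np.zeros(len(lags))
for _ in range(num_trials):
    a = rng.choice([A, -A], size=num_syms)          # IID symbols +/-A
    x = np.repeat(a, sps)                           # rectangular pulses
    shift = rng.integers(0, sps)                    # random delay t_o
    x = np.roll(x, shift)
    mid = len(x) // 2
    acc += np.array([x[mid] * x[mid + L] for L in lags])
estimate = acc / num_trials

theory = A**2 * np.maximum(0.0, 1.0 - np.abs(lags * dt) / T)   # P_d(1-|tau|/T)
print(np.max(np.abs(estimate - theory)))   # small; shrinks as num_trials grows
```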

So, we stop here and look at this further in the subsequent modules.

Thank you very much.

33
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 04
Spectrum of Transmitted Digital Communication Signal, Relation to Energy
Spectral Density, Introduction to AWGN Channel

(Refer Slide Time: 00:31)

Hello, welcome to another module in this massive open online course on communication

We are looking at the power spectral density of a transmitted signal. In the last module we
derived the following expression for the autocorrelation of a transmitted signal:

$$R_{xx}(\tau) = \frac{P_d}{T}\, R_{PP}(\tau)$$

where $R_{PP}(\tau)$ is the autocorrelation function of the pulse. In general $P(t)$ can be any pulse, not necessarily the rectangular pulse which has been taken as an example.

34
(Refer Slide Time: 01:13)

(Refer Slide Time: 02:13)

35

$$R_{PP}(\tau) = \int_{-\infty}^{\infty} P(t)\, P(t - \tau)\, dt$$

The energy spectral density of the pulse is the Fourier transform of the autocorrelation function of the pulse,

$$S_{PP}(f) = \int_{-\infty}^{\infty} R_{PP}(\tau)\, e^{-j2\pi f \tau}\, d\tau$$

(Refer Slide Time: 03:50)

It can be shown that,

𝑆𝑃𝑃 (𝑓) = |𝑃(𝑓)|2

where $P(f)$ is the Fourier transform of $P(t)$, i.e.,

$$P(f) = \int_{-\infty}^{\infty} P(t)\, e^{-j2\pi f t}\, dt$$

36
(Refer Slide Time: 05:18)

For a wide sense stationary process, the power spectral density is given by the Fourier transform of the autocorrelation function, i.e.,

$$S_{xx}(f) = \int_{-\infty}^{\infty} R_{xx}(\tau)\, e^{-j2\pi f \tau}\, d\tau$$

(Refer Slide Time: 06:02)

37
Thus we can say:

$$S_{xx}(f) = \frac{P_d}{T}\, S_{PP}(f) = \frac{P_d}{T}\, |P(f)|^2$$

(Refer Slide Time: 06:52)

So, basically we have a very interesting result: the power spectral density of the transmitted signal is proportional to the energy spectral density of the pulse. This is a very important result, since it helps us characterize the spectral distribution of power of the transmitted digital communication signal.

38
(Refer Slide Time: 08:54)

For a simple example consider the rectangular pulse 𝑃𝑇 (𝑡) defined by,

$$P_T(t) = \begin{cases} 1, & |t| < T/2 \\ 0, & \text{otherwise} \end{cases}$$

(Refer Slide Time: 10:03)

39
The Fourier transform of this pulse is given by:

$$P_T(f) = \int_{-\infty}^{\infty} P_T(t)\, e^{-j2\pi f t}\, dt$$

(Refer Slide Time: 10:54)

40
$$= T\,\mathrm{sinc}(fT) = \frac{T\sin(\pi f T)}{\pi f T}$$

$$|P_T(f)|^2 = T^2\,\mathrm{sinc}^2(fT)$$

$$\mathrm{PSD} = S_{xx}(f) = \frac{P_d}{T}\, |P_T(f)|^2$$

(Refer Slide Time: 11:37)

$$S_{xx}(f) = P_d\, T\,\mathrm{sinc}^2(fT)$$

This completes our analysis of power spectral density of the transmitted signal.
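As a hedged numerical illustration (with assumed values A = 1, T = 1), the averaged periodogram of a long randomly modulated pulse train can be compared against $S_{xx}(f) = P_d\,T\,\mathrm{sinc}^2(fT)$:

```python
import numpy as np

# Numerical sketch: the averaged periodogram of x(t) = sum_k a_k P_T(t - kT)
# should approach S_xx(f) = P_d * T * sinc^2(fT) for IID symbols a_k in {+A,-A}.
rng = np.random.default_rng(1)
A, T, sps = 1.0, 1.0, 16
dt = T / sps
num_syms, num_trials = 256, 400
nfft = num_syms * sps

psd_est = np.zeros(nfft)
for _ in range(num_trials):
    a = rng.choice([A, -A], size=num_syms)
    x = np.repeat(a, sps)                       # rectangular pulse train
    X = np.fft.fft(x) * dt                      # approximate continuous-time FT
    psd_est += np.abs(X) ** 2 / (nfft * dt)     # periodogram, averaged below
psd_est /= num_trials

f = np.fft.fftfreq(nfft, d=dt)
psd_theory = A**2 * T * np.sinc(f * T) ** 2     # np.sinc(x) = sin(pi x)/(pi x)
# Mismatch over |f| < 3/T is small; the statistical part shrinks with num_trials.
print(np.max(np.abs(psd_est - psd_theory)[np.abs(f) < 3]))
```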

41
(Refer Slide Time: 13:36)

And now let us look at the properties of the digital communication channel. How do we model a digital communication channel? Simply put, the channel is the medium through which the transmitted signal propagates, or traverses, from the transmitter to the receiver in a communication system.

(Refer Slide Time: 14:22)

There can be various kinds of channels.

42
(Refer Slide Time: 15:41)

Common examples are, for instance, telephone lines and the coaxial cables which connect your set-top boxes, that is, coaxial cables which are used for the transmission of TV signals. There is also the wireless channel, where there is no particular guided propagation medium for the electromagnetic waves. In fact, this is turning out to be the dominant mode of communication: for instance, broadcast services such as AM and FM, as well as cellular, Bluetooth and Wi-Fi, are all based on the wireless channel.

(Refer Slide Time: 17:21)

43
(Refer Slide Time: 17:47)

Suppose we have a transmitted signal 𝑥(𝑡) which passes through a channel and at the
receiver we have a signal 𝑦(𝑡).

(Refer Slide Time: 18:53)

44
One of the simplest ways to model a communication channel is the AWGN or Additive White Gaussian Noise channel.

(Refer Slide Time: 20:12)

This is a simple but very useful model that is used not only in digital communication but
also in analog communications.

(Refer Slide Time: 21:04)

So, in an AWGN channel the received signal is given by

𝑦(𝑡) = 𝑥(𝑡) + 𝑛(𝑡)

45
where $n(t)$ is the noise which is added to the transmitted signal (hence the name 'additive'). Of course, there are other aspects, namely 'white' and 'Gaussian', which need to be defined in order to complete the definition of the noise and of this channel. We will look at these aspects in the next module.

Thank you very much.

46
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 05
Additive White Gaussian Noise (AWGN) Properties, Gaussian Noise, White Noise

Hello, welcome to another module in this massive open online course.

In this module we will look at the properties of Additive white gaussian noise.

(Refer Slide Time: 00:37)

In an AWGN channel, noise is added to the transmitted signal 𝑥(𝑡) to get the received
signal 𝑦(𝑡).

𝑦(𝑡) = 𝑥(𝑡) + 𝑛(𝑡)

47
(Refer Slide Time: 00:59)

Noise $N(t)$ is random and is a function of time; thus, it is a random process.

(Refer Slide Time: 01:45)

A popular model for a random noise process is the Gaussian random process model.

48
(Refer Slide Time: 02:45)

𝑁(𝑡) is considered to be a gaussian random process if the statistics of 𝑁(𝑡) of all orders
are gaussian.

(Refer Slide Time: 03:52)

This means we look at random noise samples $N(t_1), N(t_2), \ldots, N(t_K)$.

49
(Refer Slide Time: 04:26)

These are $k$ noise samples taken at the time instants $t_1, t_2, \ldots, t_k$.

(Refer Slide Time: 05:33)

Now we look at the joint distribution of these samples given by,

$$f_{N(t_1), N(t_2), N(t_3), \ldots, N(t_k)}(n_1, n_2, n_3, \ldots, n_k)$$

50
(Refer Slide Time: 06:23)

If this joint distribution follows a multivariate Gaussian density for all values of $k$, then $N(t)$ is a Gaussian random process.

Noise which follows a Gaussian random process is termed Gaussian noise.

(Refer Slide Time: 08:25)

For a random process to be gaussian the above should be valid for all possible values of 𝑘
not just particular values of 𝑘.

51
Many naturally occurring random processes are Gaussian random processes. To model the thermal noise in circuits, we use a Gaussian random process. This is one of the most popular noise models for a digital communication system and is employed throughout a dominant fraction of the literature.

(Refer Slide Time: 11:04)

If a Gaussian random process is wide sense stationary then it is also strict sense stationary.
This property is not true in general for random processes, but it holds for Gaussian random
processes.

(Refer Slide Time: 12:38)

52
So, strict sense stationary processes are a strict subset of wide sense stationary processes. If you look at the set of random processes, there are several random processes that are wide sense stationary, but only a few of them are strict sense stationary. Strict sense stationarity implies wide sense stationarity, but wide sense stationarity does not in general imply strict sense stationarity, because wide sense stationarity is a much more relaxed condition.

(Refer Slide Time: 13:40)

Strict sense stationary processes will be a subset of wide sense stationary process. The
criteria for wide sense stationary process are as follows,

𝐸{𝑁(𝑡)} = 𝜇 ∀ 𝑡

𝐸{𝑁(𝑡)𝑁(𝑡 + 𝜏)} = 𝑅𝑁𝑁 (𝜏) ∀ 𝑡

Thus, a process is wide sense stationary if its expected value is independent of time and its autocorrelation depends only on the time shift between two samples.

53
(Refer Slide Time: 15:10)

(Refer Slide Time: 16:57)

Now let us look at the property of white noise. This property is motivated by the
observation that, noise generally has a very low temporal correlation. This is because noise
is in general very erratic, unlike a smooth signal.

54
(Refer Slide Time: 18:11)

(Refer Slide Time: 20:19)

This low temporal correlation is modelled as follows:

$$R_{NN}(\tau) = E\{N(t)\,N(t+\tau)\} = \frac{\eta}{2}\,\delta(\tau)$$

55
(Refer Slide Time: 21:09)

𝑅𝑁𝑁 (𝜏) is an impulse at time 𝜏 = 0 scaled by 𝜂/2. This means value of the autocorrelation
is zero for 𝜏 ≠ 0.

(Refer Slide Time: 23:23)

This means that noise samples at two different time instants (𝜏 ≠ 0) are uncorrelated.

56
(Refer Slide Time: 25:44)

Power spectral density of the noise signal is given by the Fourier transform of the
autocorrelation function. i.e.,

$$S_{NN}(f) = \frac{\eta}{2}$$

(Refer Slide Time: 26:58)

Fourier transform of the delta function gives unity and thus we obtain that PSD is
uniformly distributed over the entire frequency range.
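A small sketch of this idea in discrete time is shown below: band-limited white Gaussian noise is simulated by drawing samples spaced dt apart with variance $\eta/(2\,dt)$, which gives an approximately impulse-like autocorrelation and a flat PSD of $\eta/2$. The numerical values of $\eta$ and dt are assumptions for illustration.

```python
import numpy as np

# Sketch of discrete-time (band-limited) white Gaussian noise: samples spaced
# dt apart with variance eta/(2*dt) approximate a noise process with
# autocorrelation (eta/2)*delta(tau) and a flat PSD of eta/2.
rng = np.random.default_rng(2)
eta, dt, N = 2.0, 1e-3, 1 << 18
n = rng.normal(0.0, np.sqrt(eta / (2 * dt)), size=N)

# Samples at different time instants are (nearly) uncorrelated:
print([round(np.mean(n[:N - L] * n[L:]) * dt, 4) for L in range(4)])
# -> [~eta/2 = 1.0, ~0, ~0, ~0]   (lag-0 value scaled by dt gives eta/2)

# The periodogram is approximately flat at eta/2 across the band:
psd = np.abs(np.fft.fft(n) * dt) ** 2 / (N * dt)
print(round(psd.mean(), 3))   # ~ eta/2 = 1.0
```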

57
(Refer Slide Time: 29:01)

(Refer Slide Time: 29:25)

This is similar to white light which is a combination of radiation of all frequencies, hence
the name white noise.

58
(Refer Slide Time: 30:41)

Note that these three properties are independent of each other and do not follow from one another. For a channel to be called an AWGN channel it has to have all three properties: (i) additive noise, (ii) Gaussian distribution of the noise, and (iii) white noise, i.e., uniform distribution of power over the entire frequency range.

(Refer Slide Time: 31:16)

So, now, we go back to our digital communication channel model.

59
(Refer Slide Time: 32:39)

Let me just quickly draw a simple schematic. This is a very simple yet a very powerful
channel model which can be used to characterize a general digital communication channel.
AWGN channel implies that basically noise is additive.

(Refer Slide Time: 34:46)

So, the noise is additive; in addition, if the noise is white and Gaussian, this is known as an additive white Gaussian noise channel. It is one of the simplest, most frequently used and most practically applicable channel models for a typical digital communication system.

60
So, we will stop here; based on this model, we will look at optimal schemes for the digital communication system and analyse their performance in the subsequent modules.

Thank you very much.

61
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 06
Structure of Digital Communication Receiver, Receive
Filter, Signal-to-Noise Power Ratio (SNR)

Welcome to another module in this massive open online course, in this module we will
start looking at a digital communication receiver.

(Refer Slide Time: 00:42)

So, let us start looking at the operation and implementation of a digital communication
receiver.

At the receiver end, the received signal $y(t)$, which is the sum of the transmitted signal $x(t)$ and the noise signal $n(t)$, is passed through an LTI filter with impulse response $h(t)$. On the output of this filter (let us call it $r(t)$) a sampling operation is performed at the sampling time T (the symbol duration). The transmitted symbol is then extracted from the sample of the signal $r(t)$.

Let us start by discussing the filter ℎ(𝑡).

62
(Refer Slide Time: 05:41)

𝑟(𝑡) will be given by the convolution of the received signal with the impulse response of
the filter.

𝑟(𝑡) = 𝑦(𝑡) ⋆ ℎ(𝑡)

(Refer Slide Time: 06:36)

This can be written as

$$r(t) = \int_{-\infty}^{\infty} y(t - \tau)\, h(\tau)\, d\tau$$

63
(Refer Slide Time: 07:02)

In the above we substitute the value of y(t),

𝑦(𝑡) = 𝑥(𝑡) + 𝑛(𝑡)

= 𝑎0 𝑝(𝑡) + 𝑛(𝑡)

Here for the sake of simplicity we will consider the transmission of only a single symbol
𝑎0 transmitted by a single pulse 𝑝(𝑡).

(Refer Slide Time: 10:15)

64
Substituting the value of y(t) in the expression for r(t) we get

$$r(t) = \int_{-\infty}^{\infty} \big(a_0\, p(t - \tau) + n(t - \tau)\big)\, h(\tau)\, d\tau$$

(Refer Slide Time: 11:14)

To sample the above at the symbol time T, we substitute T in place of t:

$$r(T) = \int_{-\infty}^{\infty} \big(a_0\, p(T - \tau) + n(T - \tau)\big)\, h(\tau)\, d\tau$$

Next, we separate the above expression into signal and noise components

65
(Refer Slide Time: 11:59)

$$r(T) = a_0 \int_{-\infty}^{\infty} p(T - \tau)\, h(\tau)\, d\tau + \int_{-\infty}^{\infty} n(T - \tau)\, h(\tau)\, d\tau$$

The first term in the above expression is the signal component and the second term is the noise component. We want to design a filter $h(\tau)$ in order to minimize the noise component and maximize the signal component.

(Refer Slide Time: 14:46)

66
Thus, we have two fundamental objectives, i) minimizing the noise power and ii)
maximizing the signal power. Or we can say we want to maximize the ratio of signal power
to the noise power. This ratio is called the Signal to Noise Ratio or SNR.

(Refer Slide Time: 15:49)

$$\mathrm{SNR} = \frac{\text{signal power}}{\text{noise power}}$$

SNR is a fundamental quantity for the analysis of digital as well as analog communication systems, as noise will inevitably be a part of any communication system.

(Refer Slide Time: 17:30)

67
First let us calculate the signal power. The signal component of $r(T)$ is given by

$$a_0 \int_{-\infty}^{\infty} p(T - \tau)\, h(\tau)\, d\tau$$

The signal power is then

$$\text{signal power} = E\left\{\left|a_0 \int_{-\infty}^{\infty} p(T - \tau)\, h(\tau)\, d\tau\right|^2\right\}$$

(Refer Slide Time: 18:32)

$$= E\{|a_0|^2\}\left|\int_{-\infty}^{\infty} p(T - \tau)\, h(\tau)\, d\tau\right|^2$$

Here, the second factor is deterministic and real, so the modulus can be removed after squaring, and the first factor is the data symbol power $P_d$.

68
(Refer Slide Time: 20:06)

i.e.,

𝐸(|𝑎0 |2 ) = 𝑃𝑑

$p(t)$ and $h(t)$ are deterministic signals, not random signals; thus their expected value is the signal itself.

(Refer Slide Time: 21:38)

Thus, the expression for the signal power will be

$$\text{signal power} = P_d \left(\int_{-\infty}^{\infty} p(T - \tau)\, h(\tau)\, d\tau\right)^2$$

We have now computed the numerator term of the SNR. We will continue this discussion in the next module and see how we can design $h(\tau)$ in order to maximize the SNR.

Thank you very much.

70
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 07
Digital Communication Receiver, Noise Properties, Output
Noise Power

Hello. Welcome to another module in this massive open online course. We are looking at how to process the received signal at the receiver of the digital communication system.

(Refer Slide Time: 00:49)

So, in this module let us start looking at how to characterize the noise power.

$$\tilde{n} = \int_{-\infty}^{\infty} n(T - \tau)\, h(\tau)\, d\tau$$

𝑛̃ is the noise after passing through the receive filter sampled at time T.

71
(Refer Slide Time: 02:12)

Now we want to calculate the noise power.

(Refer Slide Time: 03:10)

Assuming that the noise is real, the noise power will be given by

𝐸{|𝑛|2 } = 𝐸{𝑛2 }

Till now we have used the assumption that the noise is additive in nature. Now we invoke the other two aspects of our model, namely the white and Gaussian nature of the noise.

72
(Refer Slide Time: 05:04)

The mean of the noise will be zero and the autocorrelation will be a delta function at zero, i.e.,

$$E\{n(t)\} = 0$$

$$E\{n(t)\,n(t+\tau)\} = R_{nn}(\tau) = \frac{\eta}{2}\,\delta(\tau)$$

We also assume that the noise process is a wide sense stationary process.

(Refer Slide Time: 07:09)

73
Note that when a Gaussian process is passed through a linear filter, the resulting signal is still Gaussian. Thus $\tilde{n}$ is also Gaussian.

(Refer Slide Time: 09:35)

$\tilde{n}$ is essentially $\tilde{n}(t)$ sampled at the symbol time T.

(Refer Slide Time: 11:59)

$$\tilde{n} = \int_{-\infty}^{\infty} n(T - \tau)\, h(\tau)\, d\tau$$

We now need to find the mean or average value of 𝑛̃.

74
(Refer Slide Time: 12:59)

$$E\{\tilde{n}\} = E\left\{\int_{-\infty}^{\infty} n(T - \tau)\, h(\tau)\, d\tau\right\}$$

Since the expectation operator is linear and the impulse response $h(\tau)$ is deterministic, we can write

$$E\{\tilde{n}\} = \int_{-\infty}^{\infty} E\{n(T - \tau)\}\, h(\tau)\, d\tau$$

(Refer Slide Time: 14:39)

75
Since 𝐸{𝑛(𝑡)} = 0 we get,

𝐸{𝑛̃} = 0

(Refer Slide Time: 16:06)

Now we want to find the power, or the variance, of the noise $\tilde{n}$.

(Refer Slide Time: 16:45)

$$E\{\tilde{n}^2\} = E\{\tilde{n}\cdot\tilde{n}\} = E\left\{\int_{-\infty}^{\infty} n(T - \tau)\, h(\tau)\, d\tau \times \int_{-\infty}^{\infty} n(T - \tilde{\tau})\, h(\tilde{\tau})\, d\tilde{\tau}\right\}$$

76
(Refer Slide Time: 17:49)

The above integral can be transformed into the double integral.

$$= E\left\{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} n(T - \tau)\, n(T - \tilde{\tau})\, h(\tau)\, h(\tilde{\tau})\, d\tau\, d\tilde{\tau}\right\}$$

Now we make use of the fact that the expectation operator is linear and move it inside. Also, the impulse responses are not random and thus can be taken outside the expectation operator.

(Refer Slide Time: 20:02)

77
Thus we have,

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} E\{n(T - \tau)\, n(T - \tilde{\tau})\}\, h(\tau)\, h(\tilde{\tau})\, d\tau\, d\tilde{\tau}$$

The expression inside the expectation operator is essentially the autocorrelation function of the noise with time shift $\tau - \tilde{\tau}$.

(Refer Slide Time: 22:37)

Thus, we can write

$$= \frac{\eta}{2} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(\tau)\, h(\tilde{\tau})\, \delta(\tilde{\tau} - \tau)\, d\tilde{\tau}\, d\tau$$

78
(Refer Slide Time: 23:47)

$$= \int_{-\infty}^{\infty} h(\tau)\left(\int_{-\infty}^{\infty} h(\tilde{\tau})\, \delta(\tilde{\tau} - \tau)\, d\tilde{\tau}\right) d\tau$$

(Refer Slide Time: 25:31)

Using the shifting property of the delta function we can write,

$$= \frac{\eta}{2} \int_{-\infty}^{\infty} |h(\tau)|^2\, d\tau$$

79
(Refer Slide Time: 26:31)

We can call the integral term the energy of the impulse response, $E_h$.

Thus,

$$E\{\tilde{n}^2\} = \frac{\eta}{2}\, E_h$$

(Refer Slide Time: 27:47)

This relation gives us a method to calculate the noise power at the output of the receiving
filter. We will continue our discussion on SNR in the next module.
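The following Monte Carlo sketch (with assumed values for $\eta$, the filter and the discretization) checks this relation by filtering white Gaussian noise with an arbitrary $h(t)$, sampling the output and comparing its variance to $(\eta/2)E_h$:

```python
import numpy as np

# Monte Carlo sketch: filter discrete white Gaussian noise with an arbitrary
# h(t), sample the output, and check Var(n_tilde) = (eta/2) * E_h,
# where E_h is the energy of the impulse response.
rng = np.random.default_rng(3)
eta, dt = 2.0, 1e-3
tt = np.arange(0, 1.0, dt)
h = np.exp(-3 * tt)                         # an arbitrary receive filter h(t)
E_h = np.sum(h**2) * dt                     # energy of the impulse response

num_trials, samples = 5000, []
for _ in range(num_trials):
    n = rng.normal(0.0, np.sqrt(eta / (2 * dt)), size=len(tt))
    # n_tilde = integral of n(T - tau) h(tau) d tau, as a discrete sum
    samples.append(np.sum(n[::-1] * h) * dt)
print(np.var(samples), eta / 2 * E_h)       # the two should be close
```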

80
Thank you very much.

81
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 08
Digital Communication Receiver, Optimal SNR, Matched Filter

Hello, welcome to another module in this massive open online course. So, we are looking
at the SNR at the output that is after the processing, after the filtering and the sampling at
the receiver in the digital communication system, alright.

(Refer Slide Time: 00:28)

82
(Refer Slide Time: 01:59)

The signal and noise components at the output of the sampler are given by

$$\text{signal} = \int_{-\infty}^{\infty} a_0\, p(T - \tau)\, h(\tau)\, d\tau$$

$$\text{noise} = \int_{-\infty}^{\infty} n(T - \tau)\, h(\tau)\, d\tau$$

(Refer Slide Time: 02:44)

83
SNR is given by the ratio of signal power to noise power,

$$\mathrm{SNR} = \frac{\text{signal power}}{\text{noise power}}$$

(Refer Slide Time: 03:48)

From the previous modules we substitute the value of the signal and noise powers.

$$\mathrm{SNR} = \frac{P_d \left(\int_{-\infty}^{\infty} p(T-\tau)\, h(\tau)\, d\tau\right)^2}{\dfrac{\eta}{2} \int_{-\infty}^{\infty} |h(\tau)|^2\, d\tau}$$

For optimal communication, we want to maximize the SNR. For this we will use the Cauchy-Schwarz inequality. First, for simplicity of notation, let us write $p(T - \tau) = u(\tau)$ and $h(\tau) = v(\tau)$.

84
(Refer Slide Time: 06:00)

Now we apply the Cauchy Schwarz inequality i.e.,

$$\left(\int_{-\infty}^{\infty} u(t)\, v(t)\, dt\right)^2 \le \int_{-\infty}^{\infty} u^2(t)\, dt \times \int_{-\infty}^{\infty} v^2(t)\, dt$$

(Refer Slide Time: 08:17)

Using the Cauchy-Schwarz inequality we can write

$$\mathrm{SNR} \le \frac{P_d \int_{-\infty}^{\infty} p^2(T-\tau)\, d\tau \cdot \int_{-\infty}^{\infty} h^2(\tau)\, d\tau}{\dfrac{\eta}{2} \int_{-\infty}^{\infty} h^2(\tau)\, d\tau}$$

85
(Refer Slide Time: 09:20)

Or,

$$\mathrm{SNR} \le \frac{P_d \int_{-\infty}^{\infty} p^2(T-\tau)\, d\tau}{\eta/2}$$

The SNR will be maximized when the equality condition of the Cauchy Schwarz
inequality is satisfied. i.e., when u(t) and v(t) are proportional to each other. Or,

𝑝(𝑇 − 𝜏) ∝ ℎ(𝜏)

86
(Refer Slide Time: 11:48)

Or

𝑝(𝑇 − 𝜏) = 𝑘ℎ(𝜏)

Since the above will be valid for any k, without loss of generality, we can assume k=1.

(Refer Slide Time: 12:21)

Thus, for maximum SNR we have to satisfy,

ℎ(𝜏) = 𝑝(𝑇 − 𝜏)

87
For maximum SNR, the receiver filter has to be “matched” to the pulse shaping filter.

(Refer Slide Time: 13:21)

This is termed the matched filter principle and is one of the fundamental concepts of digital communication.

(Refer Slide Time: 14:08)

88
(Refer Slide Time: 14:58)

(Refer Slide Time: 15:48)

89
(Refer Slide Time: 15:57)

Consider an exponential pulse shaping filter. The receive filter will simply be the pulse shaping filter, reversed in the time domain (see figure) and shifted by T (along the time axis).

(Refer Slide Time: 16:48)

The above two operations will give us the matched receive filter.
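A small sketch of these two operations in discrete time is given below, using an assumed exponential pulse $p(t) = e^{-t}$ on $[0, T)$ (not taken from the lecture); reversing the sampled pulse on this grid represents $p(T - \tau)$, up to a one-sample offset.

```python
import numpy as np

# Sketch of the matched filter h(tau) = p(T - tau) for an exponential pulse,
# built by time-reversing the pulse defined on [0, T).
T, dt = 1.0, 1e-3
tau = np.arange(0.0, T, dt)
p = np.exp(-tau)                 # pulse p(t), defined on [0, T)

h = p[::-1]                      # reversing p on this grid gives p(T - tau)
# Check: h(tau) equals p(T - dt - tau), i.e. p(T - tau) up to one grid step.
print(np.allclose(h, np.exp(-(T - dt - tau))))   # True
```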

90
(Refer Slide Time: 19:35)

At the matched filter,

$$\mathrm{SNR} = \frac{P_d \int_{-\infty}^{\infty} p^2(T-\tau)\, d\tau}{\eta/2}$$

(Refer Slide Time: 20:20)

$$\int_{-\infty}^{\infty} p^2(T - \tau)\, d\tau = \int_{-\infty}^{\infty} p^2(\tau)\, d\tau$$

(Refer Slide Time: 22:06)

91
Thus, we get,

$$\mathrm{SNR} = \frac{P_d}{\eta/2} \int_{-\infty}^{\infty} h^2(\tau)\, d\tau = \frac{P_d}{\eta/2} \int_{-\infty}^{\infty} p^2(\tau)\, d\tau$$

(Refer Slide Time: 23:19)

Thus, the maximum SNR is proportional to the total energy of the pulse shaping filter. More importantly, this value is achieved when the receive filter is matched to the pulse shaping filter, i.e., is the time-reversed and shifted pulse:

ℎ(𝜏) = 𝑝(𝑇 − 𝜏)

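As a hedged numerical check of this result (with assumed values for A, T, $\eta$ and the pulse), the sketch below compares the output SNR of the matched filter with that of an arbitrary mismatched filter and with the theoretical maximum $P_d E_p/(\eta/2)$:

```python
import numpy as np

# Monte Carlo sketch comparing output SNR for a matched filter h(tau)=p(T-tau)
# versus an arbitrary (mismatched) filter.
rng = np.random.default_rng(4)
A, T, dt, eta = 1.0, 1.0, 1e-3, 0.5
tau = np.arange(0.0, T, dt)
p = np.exp(-2 * tau)                          # pulse shaping filter p(t) on [0, T)
E_p = np.sum(p**2) * dt

def output_snr(h, trials=4000):
    sig = A * np.sum(p[::-1] * h) * dt        # a0 * integral p(T-tau) h(tau) dtau
    noise = np.array([np.sum(rng.normal(0, np.sqrt(eta/(2*dt)), len(h)) * h) * dt
                      for _ in range(trials)])
    return sig**2 / np.var(noise)

snr_matched = output_snr(p[::-1])             # h(tau) = p(T - tau)
snr_other = output_snr(np.ones_like(p))       # some arbitrary filter
# matched ~ theoretical maximum P_d*E_p/(eta/2); the mismatched filter is lower
print(snr_matched, snr_other, (A**2) * E_p / (eta / 2))
```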
Thank you.

92
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 09
Probability of Error in Digital Communication, Probability Density
Functions of Output

Hello. Welcome to another module in this massive open online course. We are looking at the receiver of a digital communication system; in this module let us start looking at the probability of error.

(Refer Slide Time: 00:43)

We would like to characterize the probability of error corresponding to a particular scheme at the digital communication receiver. Remember that in a digital communication system we are transmitting several symbols; the probability of error is the probability that a certain symbol is received in error, and this is a very important metric to characterize the performance of a digital communication system. Naturally, the performance is better if the probability of error is lower.

So, we always try to achieve the lowest possible probability of error. The aim is to design a reliable communication system, which means we want the symbols to be received accurately at the receiver; therefore, the probability that a received symbol is in error must be as low as possible. The probability of error is thus an important metric to characterize or quantify the performance of a digital communication system.

(Refer Slide Time: 03:45)

The aim is to minimize the probability of error. Typical values of the probability of error are approximately $10^{-6}$ to $10^{-8}$; these are also known as bit error rates or symbol error rates, etc.

94
(Refer Slide Time: 04:41)

Now, how do we characterize the probability of error for a given scheme? That is the next question. We have seen that, after filtering with h(t) and sampling at t = T, the received sample is

$$r(T) = a_0 \int_{-\infty}^{\infty} p(T - \tau)\, h(\tau)\, d\tau + \tilde{n}$$

where $\tilde{n}$ is the noise, which we have verified is Gaussian.

95
(Refer Slide Time: 06:39)

We also verified that this noise is Gaussian in nature, and further we have seen that to maximize the signal-to-noise power ratio we have to use the optimal filter, which is the matched filter. To maximize the SNR we have to choose $h(\tau) = p(T - \tau)$; this is termed the matched filter and is one of the most important principles in digital communication.

(Refer Slide Time: 07:28)

Therefore, employing the matched filter, that is, substituting $h(\tau) = p(T - \tau)$, we get

$$r(T) = a_0 \int_{-\infty}^{\infty} p^2(T - \tau)\, d\tau + \tilde{n}$$

Now, remember we have also seen that

$$\int_{-\infty}^{\infty} p^2(T - \tau)\, d\tau = \int_{-\infty}^{\infty} p^2(\tau)\, d\tau = E_P$$

which is the energy of the pulse shaping filter. So, the output is simply

$$r(T) = a_0 E_P + \tilde{n}$$

where $a_0$ is the transmitted symbol, $E_P$ is the energy of the pulse shaping filter, and $\tilde{n}$ is Gaussian noise.

(Refer Slide Time: 09:19)

Now, let us look at these aspects further. So, first, $a_0$ is the transmitted symbol, $E_P$ is the energy of the pulse shaping filter (the pulse energy), and $\tilde{n}$ is Gaussian noise.

97
(Refer Slide Time: 09:57)

And remember, we also proved in the previous modules that the mean of $\tilde{n}$ equals 0, i.e., it is zero mean, and further the variance of $\tilde{n}$, that is $E\{\tilde{n}^2\}$, for additive white Gaussian noise is

$$E\{\tilde{n}^2\} = \frac{\eta}{2} \int_{-\infty}^{\infty} |h(\tau)|^2\, d\tau$$

But for a matched filter the energy of the receive filter is nothing but the energy of the pulse shaping filter, so this is

$$E\{\tilde{n}^2\} = \frac{\eta}{2} \int_{-\infty}^{\infty} p^2(\tau)\, d\tau = \frac{\eta}{2} E_P$$

(Refer Slide Time: 10:59)

98
Therefore, we can write that $\tilde{n}$ is Gaussian; this can be summarized as

$$\tilde{n} \sim \mathcal{N}\left(0, \frac{\eta}{2} E_P\right)$$

Here $\tilde{n}$ is the noise after sampling, that is, the noise at the output of the sampler; $\mathcal{N}$ denotes the normal or Gaussian distribution, the first entry is the mean (equal to 0) and the second entry is the power of the noise, or the variance. We are using the standard notation in which a Gaussian of mean $\mu$ and variance $\sigma^2$ is denoted by $\mathcal{N}(\mu, \sigma^2)$.

The argument is as follows: the additive white Gaussian noise $n(t)$ is passed through a linear filter $h(t)$. Hence the output noise process $\tilde{n}(t)$ is also Gaussian, and when we sample it, the sample of a Gaussian random noise process is a Gaussian random variable. Therefore, $\tilde{n}$ is a Gaussian random variable; we have derived that its mean is 0 and its variance is $\frac{\eta}{2}\int_{-\infty}^{\infty} h^2(\tau)\, d\tau$, the energy of the filter. But when you use a matched filter, the matched filter is the same as the pulse, so the energy of the matched filter equals the energy of the pulse; therefore, the variance of the Gaussian noise at the output of the sampler is $\frac{\eta}{2} E_P$, where $E_P$ is the pulse energy.

That is the complete argument, and it is important that you understand it. If it is not clear, I urge you to go through the previous modules where we calculated the nature of the Gaussian noise at the output of the sampler, and also its mean and variance, once again.

99
(Refer Slide Time: 13:54)

As an aside, note that a Gaussian random variable with mean 0 and variance unity is known as a standard normal or standard Gaussian random variable.

(Refer Slide Time: 14:58)

100
Therefore, we have a very elegant model: the received sample $r(T) = a_0 E_P + \tilde{n}$. Of course, as we saw in the very first modules, in a digital communication system the symbols must belong to a digital constellation.

(Refer Slide Time: 15:56)

So, let us choose a very simple system with a very simple constellation, in which $a_0$ equals either $-A$ or $+A$. Remember how this arises: we have the information stream, which is basically the raw information bits. We can map bit 0 to the voltage level $+A$ and bit 1 to the voltage level $-A$. For instance, the bit stream 0, 1, 1, 0, 1, ... is mapped to the symbol stream $A, -A, -A, A, -A, \ldots$

So, we are using the mapping: information bit 0 maps to $+A$ and information bit 1 maps to $-A$. Therefore, at the output you will observe $r(T) = \pm A\, E_P + \tilde{n}$.

101
(Refer Slide Time: 17:28)

Since you are transmitting either $+A$ or $-A$, what you receive after matched filtering and sampling is either $+A\,E_P$ or $-A\,E_P$ (where $E_P$ is the energy of the pulse) in the presence of the noise $\tilde{n}$.

So, let us write these 2 scenarios clearly:

$$r(T) = \begin{cases} A\,E_P + \tilde{n}, & \text{if } a_0 = A \text{ (information bit 0)} \\ -A\,E_P + \tilde{n}, & \text{if } a_0 = -A \text{ (information bit 1)} \end{cases}$$

This is a complete characterization of the output.

102
(Refer Slide Time: 18:38)

This is the output after sampling, and now you will notice something interesting. If you transmit $a_0 = A$, what you get is $A\,E_P + \tilde{n}$. Here $A\,E_P$ is a constant which is added to the Gaussian noise $\tilde{n}$, which is zero mean. So, the mean of this Gaussian noise is going to shift to $A\,E_P$.

Similarly, when you transmit $a_0 = -A$, the output is $-A\,E_P + \tilde{n}$, which means in this case the mean of the noise shifts to $-A\,E_P$. So, the output $r(T)$, which is either $A\,E_P + \tilde{n}$ or $-A\,E_P + \tilde{n}$ corresponding to $+A$ or $-A$, will have its mean shifted to $A\,E_P$ or $-A\,E_P$ respectively, because $\tilde{n}$ itself is zero-mean Gaussian noise. The variance remains unchanged, because when you simply add a constant to a Gaussian random variable, the mean shifts while the variance, which is the spread of the Gaussian random variable, remains unchanged. So, the variance of the Gaussian random variable remains fixed at $\frac{\eta}{2} E_P$, and that is an important point.

Let us note this property, because it is fairly important and we might invoke it later: if $n$ is distributed as a Gaussian random variable with mean $\mu$ and variance $\sigma^2$, then $n + K$, where $K$ is a constant, is distributed as a Gaussian random variable with the mean shifted by $K$ and the variance unchanged, because the variance is related to the spread of the Gaussian random variable.

(Refer Slide Time: 21:33)

So, basically, if you have an initial Gaussian random variable whose probability density function is centered at the mean $\mu$, then the probability density function of $n + K$ is simply shifted, centered at $\mu + K$. The variance is related, in fact proportional, to the spread of the density, and this spread, or the shape of the Gaussian, remains unchanged; the density is simply translated. The spread is $\sigma^2$ before and after the shift.

104
(Refer Slide Time: 22:51)

Now, what is the relevance of this to the digital communication problem? Go back and look at $r(T)$, which equals either $A\,E_P + \tilde{n}$ or $-A\,E_P + \tilde{n}$; this is $\tilde{n}$ shifted by $\pm A\,E_P$. Since $\tilde{n}$ is zero mean, $A\,E_P + \tilde{n}$ is Gaussian with mean $A\,E_P$ and variance $\sigma^2 = \frac{\eta}{2} E_P$ (the variance is unchanged), while $-A\,E_P + \tilde{n}$ is Gaussian with mean $-A\,E_P$ and the same variance $\sigma^2$.

So, the output is always a random variable, because you are always observing the output in the presence of the noise $\tilde{n}$, which is random.

105
(Refer Slide Time: 24:35)

When the symbol $+A$ is transmitted, the output is a Gaussian random variable with mean $A\,E_P$; when the symbol $-A$ is transmitted, the output is a Gaussian random variable with mean $-A\,E_P$; and in both cases the variance is identical, $\sigma^2 = \frac{\eta}{2} E_P$. Let us note that.

(Refer Slide Time: 24:56)

So, in this case $r(T)$ is Gaussian with mean $-A\,E_P$, and in both cases the variance is the same; we will call it $\sigma^2$, which for the matched filter is $\frac{\eta}{2} E_P$, as we have shown above.

(Refer Slide Time: 25:52)

Let me draw this, since it helps to visualize it. We have two Gaussians with the same shape: one is centered at minus A times E_P and the other at A times E_P, and the variance is the same for both. By symmetry, the point of intersection of the two densities is at 0.

So, the point of intersection is 0 by symmetry, and you can see that one Gaussian density corresponds to a_0 equals minus A while the other corresponds to a_0 equals plus A. Naturally, when you transmit a_0 equal to A the Gaussian is shifted towards the right, that is towards A times E_P, and when you transmit a_0 equal to minus A the Gaussian is shifted towards the left, that is towards the mean minus A times E_P.

So, we have two Gaussians with the same variance, eta naught over 2 times E_P, but one is shifted to minus A times E_P and the other to A times E_P. The probability density function of the output therefore comprises two shifted Gaussians.

107
(Refer Slide Time: 28:23)

That is something very interesting, and I am repeating the point: both have the same variance, sigma square equals eta naught by 2 times E_P, and this is an important property. So, please go over this to understand it better. Starting from these different probability density functions of r(T) corresponding to plus A and minus A, we will next derive the decision rule at the receiver and the corresponding probability of error. So, let us stop here.

Thank you.

108
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 10
Probability of Error in Digital Communication, Optimal Decision
Rule, Gaussian Q function

Hello. Welcome to another module in this massive open online course. So, we are looking
at the performance of a digital communication system and we have said that the output
statistic after sampling can be described as follows.

(Refer Slide Time: 00:26)

$$r(T) = \begin{cases} A E_P + \tilde{n} \sim N\left(A E_P,\ \frac{\eta}{2} E_P\right) \\ -A E_P + \tilde{n} \sim N\left(-A E_P,\ \frac{\eta}{2} E_P\right) \end{cases}$$

This is a combination of two gaussian distributions, shifted by two opposite constants.

109
(Refer Slide Time: 01:46)

𝜂𝐸
For simplicity in writing symbols we assume𝐴𝐸𝑃 = 𝐴̃ and 2𝑃 = 𝜎 2

We can then write,

$$r(T) \sim \begin{cases} N(\tilde{A}, \sigma^2) & \text{for } a_o = A \\ N(-\tilde{A}, \sigma^2) & \text{for } a_o = -A \end{cases}$$

Also, from symmetry of the two functions we can say that intersection point will lie at
zero.

(Refer Slide Time: 06:57)

110
From the above figure (see pict. above) we can see that for r(T)>0 PDF of +A has higher
value and for r(T)<0 PDF of -A has higher value. Thus, we formulate the following
detection rule or the decision rule.

𝑟(𝑇) ≥ 0 ⇒ 𝑑𝑒𝑡𝑒𝑐𝑡 𝐴
{
𝑟(𝑇) < 0 ⇒ 𝑑𝑒𝑡𝑒𝑐𝑡 − 𝐴

Detection rule is simply a rule that helps us decide the received symbol from the sampled
output of the transmitted signal.

(Refer Slide Time: 09:17)

Now we try to calculate the probability of error.

111
(Refer Slide Time: 11:37)

We can intuitively say that an error would occur when ao =A is transmitted but ao=-A is
detected. Or when 𝑟(𝑇) < 0 for 𝑎𝑜 = 𝐴 or vice versa.

(Refer Slide Time: 13:41)

This error would occur when 𝑛̃ > 𝐴̃

112
(Refer Slide Time: 14:24)

(Refer Slide Time: 15:34)

Thus, the probability of error is given by

𝑃𝑒 = 𝑃𝑟 (𝑛̃ ≥ 𝐴̃)


= ∫ 𝑓𝑁 (𝑛̃)𝑑𝑛̃
𝐴̃

113
(Refer Slide Time: 16:55)

$$= \int_{\tilde{A}}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\tilde{n}^2}{2\sigma^2}}\, d\tilde{n}$$

In the above expression we substitute $\frac{\tilde{n}}{\sigma} = n'$ to get,

$$= \int_{\tilde{A}/\sigma}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(n')^2}{2}}\, dn'$$

(Refer Slide Time: 18:46)

114
The above expression is simply the integration of a gaussian PDF with mean 0 and variance
1.

(Refer Slide Time: 19:53)

We define the following Q function,

$$Q(u) = \int_{u}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$$

Q(u) is the probability that the standard gaussian random variable is greater than u.

(Refer Slide Time: 21:43)

115
Thus we have,

$$P_e = Q\left(\frac{\tilde{A}}{\sigma}\right) = Q\left(\sqrt{\frac{A^2 E_P}{\eta/2}}\right)$$
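As a quick numerical check (not part of the lecture itself), the Q function can be evaluated through the complementary error function as Q(u) = ½ erfc(u/√2). The following minimal Python sketch evaluates the error probability derived above; the values of A, E_P and η are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.special import erfc

def Q(u):
    # Tail probability of a standard Gaussian: Q(u) = 0.5 * erfc(u / sqrt(2))
    return 0.5 * erfc(u / np.sqrt(2))

# Illustrative values (assumed, not from the lecture)
A, E_P, eta = 1.0, 1.0, 0.5

A_tilde = A * E_P                  # mean of r(T) when +A is transmitted
sigma = np.sqrt(eta * E_P / 2.0)   # std. deviation of the filtered noise n~

Pe = Q(A_tilde / sigma)            # = Q( sqrt(A^2 E_P / (eta/2)) )
print(Pe)
```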

(Refer Slide Time: 22:56)

Here we have derived the probability of error corresponding to the transmission of the different symbols in our system. The probability of error is given by the Gaussian Q function.

Thank you.

116
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 11
Introduction to Binary Phase Shift Keying (BPSK) Modulation, Optimal Decision
Rule and Probability of Bit-Error or Bit-Error Rate (BER)

Hello, welcome to another module in this massive open online course in this module. Let
us start looking at binary phase shift keying. So, we want to look at this digital modulation
scheme which is known as binary phase shift keying.

(Refer Slide Time: 00:34)

This scheme also known as BPSK is one of the many modulation schemes.

117
(Refer Slide Time: 01:18)

(Refer Slide Time: 01:48)

In this scheme the pulse shaping filter or the pulse is given by:

2
𝑝(𝑡) = √ cos(2𝜋𝑓𝑐 𝑡) 𝑓𝑜𝑟 0 ≤ 𝑡 ≤ 𝑇
𝑇

Where T is the symbol duration.

118
(Refer Slide Time: 02:58)

Further, T will be an integer multiple of 1/fc, i.e.,

𝑘
𝑇= 𝑤ℎ𝑒𝑟𝑒 𝑘 𝑖𝑠 𝑎𝑛 𝑖𝑛𝑡𝑒𝑔𝑒𝑟
𝑓𝑐

(Refer Slide Time: 04:08)

For, k=2, the pulse will have two cycles in the symbol duration

119
(Refer Slide Time: 05:15)

(Refer Slide Time: 06:09)

For k=2, it can be shown that the pulse p(t) is normalized to unit energy.

120
(Refer Slide Time: 06:29)

What this means is the total pulse energy is 1. Or,


𝐸𝑃 = ∫ 𝑝2 (𝑡)𝑑𝑡
−∞

𝑇
2
=∫ cos 2 (2𝜋𝑓𝑐 𝑡)𝑑𝑡
0 𝑇

(Refer Slide Time: 07:06)

121
2 𝑇 1 + 𝑐𝑜𝑠4𝜋𝑓𝑐 𝑡
= ∫ 𝑑𝑡
𝑇 0 2

=1
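As a sanity check, this unit-energy property of the pulse can also be verified numerically. The sketch below assumes illustrative values f_c = 2 Hz and T = k/f_c with k = 2.

```python
import numpy as np

fc = 2.0          # carrier frequency (assumed for illustration)
k = 2             # integer number of carrier cycles per symbol
T = k / fc        # symbol duration

t = np.linspace(0.0, T, 100001)
p = np.sqrt(2.0 / T) * np.cos(2 * np.pi * fc * t)

E_P = np.trapz(p**2, t)   # numerical integral of p^2(t) over [0, T]
print(E_P)                # approximately 1.0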

(Refer Slide Time: 08:21)

Now the transmitted signal x(t) is,

𝑥(𝑡) = 𝑎𝑜 𝑝(𝑡) 𝑤ℎ𝑒𝑟𝑒 𝑎𝑜 𝜖{−𝐴, 𝐴}

Each a0 carries one bit of information. Throughout all schemes we use a constant energy
per bit.

122
(Refer Slide Time: 09:49)

Let, 𝐸𝑏 be the constant energy per bit.

(Refer Slide Time: 10:18)

Information bits 0 and 1 occur with probability half for each

123
(Refer Slide Time: 10:55)

Average energy per bit is given by,

1 1
𝑎𝑣𝑒𝑟𝑎𝑔𝑒 𝑒𝑛𝑒𝑟𝑔𝑦 = 𝐸𝑏 = 𝐴2 𝐸𝑃 + 𝐴2 𝐸𝑃
2 2

= 𝐴2 𝐸𝑃 = 𝐴2

(Refer Slide Time: 12:42)

Thus, we can write,

𝐴 = √𝐸𝑏

124
(Refer Slide Time: 13:48)

(Refer Slide Time: 14:08)

For bit 0 the transmitted wave is (since ao=+A)

2
𝑥(𝑡) = √𝐸𝑏 √ cos(2𝜋𝑓𝑐 𝑡) 0≤𝑡≤𝑇
𝑇

125
(Refer Slide Time: 15:05)

(Refer Slide Time: 16:08)

Corresponding to bit 1 ao=-A

2𝐸𝑏
𝑥(𝑡) = −√ 𝑐𝑜𝑠2𝜋𝑓𝑐 𝑡
𝑇

2𝐸𝑏
=√ cos(2𝜋𝑓𝑐 𝑡 + 𝜋)
𝑇

126
(Refer Slide Time: 17:22)

Note that the waveforms for 0 and one are similar but are phase shifted by 𝜋. Thus, if we
phase shift bit 0 by 𝜋 we get 1 and shifting 1 by 𝜋 we get 0.

(Refer Slide Time: 19:55)

127
(Refer Slide Time: 20:20)

This is what gives the scheme its name. Since waveforms for different bits are phase
shifted from each other.

(Refer Slide Time: 20:47)

128
(Refer Slide Time: 21:42)

(Refer Slide Time: 22:26)

129
(Refer Slide Time: 23:11)

At the receiver we match the filter thus we have

ℎ(𝜏) = 𝑝(𝑇 − 𝜏)

(Refer Slide Time: 24:06)

130
(Refer Slide Time: 24:23)

Here if the filter looses is not shifted by T, then it is no longer a causal filter. This makes
it important to shift the filter by T.

(Refer Slide Time: 27:15)

Now we remember the detection rule,

𝑟(𝑇) ≥ 0 ⇒ 𝑑𝑒𝑡𝑒𝑐𝑡 𝐴 𝑏𝑖𝑡 0


{
𝑟(𝑇) < 0 ⇒ 𝑑𝑒𝑡𝑒𝑐𝑡 − 𝐴 𝑏𝑖𝑡 1

131
(Refer Slide Time: 28:41)

And also the probability of error, which is a Q function,

$$P_e = Q\left(\sqrt{\frac{A^2 E_P}{\eta/2}}\right)$$

Substituting $A = \sqrt{E_b}$ and $E_P = 1$ we get,

(Refer Slide Time: 29:16)

132
2𝐸𝑏
𝑃𝑒 = 𝑄(√ )
𝜂

(Refer Slide Time: 29:43)

This expression gives the probability of error with BPSK scheme and average energy per
bit Eb
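A minimal Monte Carlo sketch of this BPSK result, using the sampled baseband model r(T) = a_0 E_P + n~ with E_P = 1, can be used to compare the simulated bit error rate with Q(√(2E_b/η)); the particular values of E_b, η and the number of bits are assumptions for illustration.

```python
import numpy as np
from scipy.special import erfc

def Q(u):
    return 0.5 * erfc(u / np.sqrt(2))

rng = np.random.default_rng(0)
Eb, eta, nbits = 1.0, 0.25, 200_000    # illustrative values

A = np.sqrt(Eb)                         # amplitude giving average bit energy Eb (E_P = 1)
bits = rng.integers(0, 2, nbits)        # bit 0 -> +A, bit 1 -> -A
a0 = np.where(bits == 0, A, -A)
r = a0 + rng.normal(0.0, np.sqrt(eta / 2.0), nbits)   # matched-filter output samples

detected = np.where(r >= 0, 0, 1)       # threshold at 0
ber_sim = np.mean(detected != bits)
ber_theory = Q(np.sqrt(2 * Eb / eta))
print(ber_sim, ber_theory)
```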

(Refer Slide Time: 30:36)

This ends this module

Thank you very much.

133
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 12
Introduction to Amplitude Shift Keying (ASK) Modulation

Hello, welcome to another module in this massive open online course. So, in this module
let us start looking at another digital modulation scheme that is amplitude shift key.

(Refer Slide Time: 00:39)

So, previously we have seen phase binary phase shift keying, now, we will look at a
different modulation scheme which as amplitude shift key or ASK

134
(Refer Slide Time: 01:31)

The carrier pulse for this scheme is the same as before, i.e.,

$$p(t) = \sqrt{\frac{2}{T}} \cos(2\pi f_c t) \quad \text{for } 0 \leq t \leq T$$

where $T = k/f_c$, with $k$ an integer.

(Refer Slide Time: 02:22)

This means there are an integer number of cycles in each duration.

135
(Refer Slide Time: 03:20)

And as we have seen earlier, energy of the pulse 𝐸𝑃 is unity.

(Refer Slide Time: 04:21)

In this scheme the symbols used are 0 and A, i.e.,

𝑎𝑜 𝜖{0, 𝐴}

Where 0 corresponds to transmitted bit 1 and A corresponds to bit 0.

136
(Refer Slide Time: 05:04)

(Refer Slide Time: 06:25)

If $a_o = 0$ then the transmitted signal for that symbol will be zero. Thus the signal will be "on" (equal to $A\,p(t)$) when $a_o = A$ and "off" (zero) when $a_o = 0$.

Because of this behaviour the scheme is also known as the on-off scheme (on-off keying).

137
(Refer Slide Time: 07:53)

Now we come back to the question of choosing A or the wave amplitude.

(Refer Slide Time: 08:41)

To choose A we fix the average bit energy Eb.

138
(Refer Slide Time: 09:09)

We have the following for the signal x(t),

$$x(t) = \begin{cases} A\, p(t) & \text{if } a_o = A \ \text{(bit 0)} \\ 0 & \text{if } a_o = 0 \ \text{(bit 1)} \end{cases}$$

(Refer Slide Time: 09:50)

Each value has the probability half. Thus, the average energy per bit is given by,

1 2 1
𝑎𝑣𝑒𝑟𝑎𝑔𝑒 𝑒𝑛𝑒𝑟𝑔𝑦 𝑝𝑒𝑟 𝑏𝑖𝑡 = 𝐴 𝐸𝑃 + 0 = 𝐴2 𝐸𝑃
2 2

139
(Refer Slide Time: 10:31)

Equating this to 𝐸𝑏 we get,

𝐴 = √2𝐸𝑏

(Refer Slide Time: 11:38)

So, this ensures energy per bit is constant across the schemes.

140
(Refer Slide Time: 12:07)

In BPSK both the waveforms differ in phase, in ASK, waveforms corresponding to


different bits, differ in magnitude. (zero and A)

(Refer Slide Time: 13:32)

Now, we want to calculate what is the probability of the resulting probability of error for
this scheme or the probability of bit error.

141
(Refer Slide Time: 13:51)

At the receiver we have,

𝑦(𝑡) = 𝑥(𝑡) + 𝑛(𝑡)

= 𝐴𝑝(𝑡) + 𝑛(𝑡)

Where n(t) is AWGN.

(Refer Slide Time: 14:48)

Assuming this is followed by a matched filter and sampling at time T

142
(Refer Slide Time: 15:53)

(Refer Slide Time: 17:22)

We have the following signal at the sampler output,

𝑟(𝑇) = 𝐴𝐸𝑝 + 𝑛̃

Where 𝑛̃ is the gaussian noise with mean zero and variance 𝜂/2 (for 𝐸𝑃 = 1)

143
(Refer Slide Time: 18:11)

For ao=0, no signal is transmitted, thus we have 𝑟(𝑇) = 𝑛̃

(Refer Slide Time: 19:12)

144
(Refer Slide Time: 20:05)

For ao=A we have 𝑟(𝑇) = 𝐴𝐸𝑃 + 𝑛̃. Thus, the overall expression for r(T) is,

𝐴𝐸𝑃 + 𝑛̃ 𝑖𝑓 𝑎𝑜 = 𝐴
𝑟(𝑇) = {
𝑛̃ 𝑖𝑓 𝑎𝑜 = 0

(Refer Slide Time: 21:43)

However in this r T simply n tilde which means it is a Gaussian with mean equal to 0. The
variance is the same in both case.

145
(Refer Slide Time: 22:17)

So, if you look at this, the two densities are not centered at plus and minus A E_P as in BPSK; the situation is somewhat different. We still have two Gaussians: in one case the mean is A E_P, which corresponds to bit 0, and in the other case the mean is 0, which corresponds to bit 1. Now you can see the midpoint of the two means is not 0; by symmetry, the midpoint is A E_P divided by 2.

(Refer Slide Time: 23:02)

So, one Gaussian corresponds to bit 1 and the other to bit 0, and by symmetry the midpoint A E_P divided by 2 is

146
the detection threshold. Therefore, the optimal decision rule is not to compare r(T) with 0.

(Refer Slide Time: 24:09)

So, the optimal decision rule is to compare r(T) with the threshold A E_P divided by 2, which for E_P = 1 is simply A by 2. If r(T) is greater than or equal to A E_P divided by 2, the decision is a_0 equal to A.

(Refer Slide Time: 24:47)

147
On the other hand, if r(T) is less than A E_P divided by 2, the decision is a_0 equal to 0. This is our optimal decision rule, obtained by exploiting the symmetry in the system.
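A small sketch of this threshold rule, using the sampled model r(T) = a_0 E_P + n~ with E_P = 1, is given below; the noise level and number of symbols are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Eb, eta, nsym = 1.0, 0.25, 100_000     # illustrative values

A = np.sqrt(2 * Eb)                     # ASK amplitude for average bit energy Eb
a0 = rng.choice([0.0, A], nsym)         # transmitted levels (0 or A)
r = a0 + rng.normal(0.0, np.sqrt(eta / 2.0), nsym)

decided = np.where(r >= A / 2.0, A, 0.0)    # compare r(T) with the threshold A*E_P/2
print(np.mean(decided != a0))               # empirical bit error rate
```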

So, basically what we have seen so far is we have started looking at different digital
modulation scheme that is amplitude shift keying in which the different waveforms shifted
in terms of amplitude. One of them is amplitude A, other has an amplitude 0. We have
seen; what is the amplitude level to keep the energy constant at E b or the energy per bit
constant at E b? Again we have seen what are the different received what is the received
symbol, what is the probability density function or what are the probability density
functions of the sample r T corresponding the transmission of the waveform for bit 0 and
the waveform corresponding to bit 1. We have seen that one is the Gaussian with mean A
E p where E A is amplitude level E p is the pulse energy other is an a Gaussian with mean
at 0. Therefore by symmetry the midpoint is at A E p divided by 2 which is the detection
threshold.

If r t is greater than equal to A E p divided by 2 then the decision is a naught equal to A.


If on the other hand r T is less than A E p divided by 2 the decision is a naught is equal to
0. So, that is the detection scheme. We will complete this in the next lecture and look at
what is the resulting probability of bit error and also how it compares with the binary phase
shift keying scheme that we have descried earlier.

Thank you very much.

148
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 13
Optimal Decision Rule for Amplitude Shift Keying (ASK), Bit-Error Rate (BER)
and Comparison with Binary Phase Shift Keying (BPSK) Modulation

Hello, welcome to another module in this massive open online course. In the last module
we looked at optimal decision rule for ASK. In this module we will look at probability of
error in ASK.

(Refer Slide Time: 00:44)

The output of the sampler in ASK is given by,

𝐴𝐸𝑃 + 𝑛̃ 𝑖𝑓 𝑎𝑜 = 𝐴
𝑟(𝑇) = {
𝑛̃ 𝑖𝑓 𝑎𝑜 = 0

149
(Refer Slide Time: 02:17)

𝑟(𝑇) is simply a combination of two gaussian distributions one with mean 0 and other with
mean 𝐴𝐸𝑝 and both with variance 𝜂𝐸𝑝 /2.

(Refer Slide Time: 02:27)

By symmetry we concluded that midpoint (or intersection point) of these two will lie on
𝐴𝐸𝑃 /2.

150
(Refer Slide Time: 03:22)

Thus our optimal decision rule was given by,

𝐴𝐸𝑃
𝐷𝑒𝑐𝑖𝑑𝑒 𝑎𝑜 = 𝐴 𝑖𝑓 𝑟(𝑇) ≥
{ 2
𝐴𝐸𝑃
𝐷𝑒𝑐𝑖𝑑𝑒 𝑎𝑜 = 0 𝑖𝑓 𝑟(𝑇) ≤
2

(Refer Slide Time: 04:03)

We define a function 𝑟̃ (𝑇) given by,

151
$$\tilde{r}(T) = r(T) - \frac{A E_P}{2} = \begin{cases} \frac{A E_P}{2} + \tilde{n} & \text{if } a_o = A \\ -\frac{A E_P}{2} + \tilde{n} & \text{if } a_o = 0 \end{cases}$$

This is similar to BPSK with 𝐴𝐸𝑃 replaced by 𝐴𝐸𝑃 /2.

(Refer Slide Time: 04:20)

Now you can see there is symmetry: $\tilde{r}(T)$ is symmetric about 0, and this is similar to BPSK with $A E_P$ replaced by $A E_P/2$. We have $A E_P/2$ plus n tilde and minus $A E_P/2$ plus n tilde, whereas for binary phase shift keying we had $A E_P$ plus n tilde and minus $A E_P$ plus n tilde. In that sense this is similar to binary phase shift keying with $A E_P$ replaced by $A E_P$ divided by 2.

152
(Refer Slide Time: 06:15)

(Refer Slide Time: 07:35)

Hence, the probability of error or bit error rate in ASK will be given by an expression
similar to BPSK with 𝐴𝐸𝑃 replaced by 𝐴𝐸𝑃 /2.

153
(Refer Slide Time: 08:38)

Thus, we will have,

𝐴2 𝐸𝑃
𝑃𝑒 = 𝑄 (√ )
2𝜂

(Refer Slide Time: 09:24)

Substituting 𝐴2 = 2𝐸𝑏 and 𝐸𝑃 = 1 we get,

154
𝐸𝑏
𝑃𝑒 = 𝑄 (√ )
𝜂

That is, substituting $E_P = 1$ (the pulse is normalized to unit energy) and, as we calculated for amplitude shift keying, $A^2 = 2E_b$ so that the average bit energy is $E_b$, we get $P_e = Q(\sqrt{2E_b/(2\eta)}) = Q(\sqrt{E_b/\eta})$, which is the probability of error for ASK.

(Refer Slide Time: 10:03)

Next we compare the BER’s of ASK and BPSK.

155
(Refer Slide Time: 11:04)

(Refer Slide Time: 11:27)

𝐸𝑏
𝑃𝑒,𝐴𝑆𝐾 = 𝑄 (√ )
𝜂

2𝐸𝑏
𝑃𝑒,𝐵𝑃𝑆𝐾 = 𝑄 (√ )
𝜂

156
Here, it is important to remember that Q function is a decreasing function as it is the
integration of the tail of a gaussian from its argument.

(Refer Slide Time: 12:32)

Thus higher the argument of the tail Q function, lower its value.

(Refer Slide Time: 14:23)

Thus for the same value of average energy per bit, we have,

𝑃𝑒,𝐵𝑃𝑆𝐾 ≤ 𝑃𝑒,𝐴𝑆𝐾

157
(Refer Slide Time: 16:36)

Now we see how much improvement does BPSK provide.

(Refer Slide Time: 17:48)

$$Q\left(\sqrt{\frac{2 \cdot \frac{1}{2} E_b}{\eta}}\right) = Q\left(\sqrt{\frac{E_b}{\eta}}\right)$$

158
(Refer Slide Time: 18:43)

For same BER, BPSK needs half the average bit energy compared to that of ASK.

(Refer Slide Time: 19:17)

For the same BER,

$$E_{b,BPSK} = \frac{1}{2} E_{b,ASK}$$

$$\Rightarrow 10\log_{10} E_{b,BPSK} = 10\log_{10}\frac{1}{2} + 10\log_{10} E_{b,ASK}$$

159
(Refer Slide Time: 20:04)

$$\Rightarrow 10\log_{10} E_{b,BPSK} - 10\log_{10} E_{b,ASK} = -10\log_{10} 2 \approx -3\ \text{dB}$$

(Refer Slide Time: 20:47)

Or,

𝐸𝑏,𝐵𝑃𝑆𝐾 (𝑑𝐵) − 𝐸𝑏,𝐴𝑆𝐾 (𝑑𝐵) = −3𝑑𝐵

160
(Refer Slide Time: 21:57)

i.e., for same BER BPSK requires 3dB lower average energy per bit.
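This 3 dB statement can be checked numerically: for a target BER, invert the two Q-function expressions and compare the required E_b in dB. The sketch below assumes a target BER of 1e-4 purely for illustration.

```python
import numpy as np
from scipy.stats import norm

eta = 1.0
target_ber = 1e-4                      # assumed target BER

u = norm.isf(target_ber)               # Q^{-1}(target_ber)
Eb_bpsk = (u**2) * eta / 2.0           # from target = Q(sqrt(2 Eb / eta))
Eb_ask  = (u**2) * eta                 # from target = Q(sqrt(Eb / eta))

gap_dB = 10 * np.log10(Eb_bpsk) - 10 * np.log10(Eb_ask)
print(gap_dB)                          # approximately -3 dB
```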

(Refer Slide Time: 22:42)

So, we will stop here and continue with other aspects in the subsequent modules.

161
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 14
Introduction to Signal Space Concept, Orthonormal Basis Signals

Hello, welcome to another module in this massive open online course. So, in today’s
module this module let us start looking at another concept in digital communication that
is a very important concept and in fact also one of the most fundamental frameworks which
is termed as a signal space. Slightly abstract but has several applications especially in
digital communications.

(Refer Slide Time: 00:38)

162
(Refer Slide Time: 01:06)

Let us call this as an important framework for to understand, framework to understand


digital communication. It is a framework to understand digital communication systems.
Now let us start with a simple introduction to the concept of signal space.

Now, consider 2 pulses.

(Refer Slide Time: 01:42)

Consider two pulses 𝑃1 (𝑡) and 𝑝2 (𝑡) such that they are orthogonal to each other and have
unit energy.

163
(Refer Slide Time: 02:04)

i.e.,

∞ ∞
∫ 𝑝12 (𝑡)𝑑𝑡 = ∫ 𝑝22 (𝑡)𝑑𝑡 = 𝐸𝑃 = 1
−∞ −∞

(Refer Slide Time: 03:07)


∫ 𝑝1 (𝑡)𝑝2 (𝑡)𝑑𝑡 = 0
−∞

i.e, both are orthonormal to each other.

164
(Refer Slide Time: 04:08)

These two constitute an orthonormal basis for a signal space.

(Refer Slide Time: 05:18)

165
(Refer Slide Time: 06:07)

A signal space is simply similar to a vector , a space that is formed by a linear combination.

(Refer Slide Time: 06:24)

The space of signals given by,

𝛼1 𝑝1 (𝑡) + 𝛼2 𝑝2 (𝑡)

Where, 𝛼1 and 𝛼2 are real numbers.

166
(Refer Slide Time: 07:02)

All signals which can be represented by above forms will form the signal space S of 𝑝1 (𝑡)
and 𝑝2 (𝑡). Let us consider an example.

(Refer Slide Time: 08:22)

2
𝑝1 (𝑡) = √ cos(2𝜋𝑓1 𝑡) 0≤𝑡≤𝑇
𝑇

2
𝑝2 (𝑡) = √ cos(2𝜋𝑓2 𝑡) 0 ≤ 𝑡 ≤ 𝑇
𝑇

167
(Refer Slide Time: 09:22)

Here we assume that T contains integer number of cycles of both 𝑓1 and 𝑓2 . i.e.,

𝑘1 𝑘2
𝑇= = (𝑤ℎ𝑒𝑟𝑒 𝑘1 ≠ 𝑘2 )
𝑓1 𝑓2

(Refer Slide Time: 10:46)

We can assume that 𝑓1 and 𝑓2 are the multiples of the same fundamental frequency 𝑓𝑜 .

𝑘1
𝑓1 = = 𝑘1 𝑓𝑜
𝑇

168
𝑘2
𝑓2 = = 𝑘2 𝑓𝑜
𝑇

Where,

1
𝑓𝑜 =
𝑇

(Refer Slide Time: 12:10)

We look at energy of the pulse 𝑝1

(Refer Slide Time: 12:41)

169

𝐸1 = ∫ 𝑝12 (𝑡)𝑑𝑡
−∞

𝑇
2
=∫ cos 2 (2𝜋𝑓1 𝑇)𝑑𝑡
0 𝑇

2 𝑇 1 + cos(4𝜋𝑓1 𝑡)
= ∫ 𝑑𝑡
𝑇 0 2

(Refer Slide Time: 13:14)

= 1 = 𝐸𝑃

(Refer Time: 13:47).

170
(Refer Slide Time: 13:49)

Similarly, we can show,


𝐸2 = ∫ 𝑝22 (𝑡)𝑑𝑡 = 1 = 𝐸𝑝
−∞

(Refer Slide Time: 14:14)

Now consider inner product of the two,

171
(Refer Slide Time: 14:38)

(Refer Slide Time: 15:17)


2 𝑇
∫ 𝑝1 (𝑡)𝑝2 (𝑡)𝑑𝑡 = ∫ cos(2𝜋𝑓1 𝑡) cos(2𝜋𝑓2 𝑡) 𝑑𝑡
−∞ 𝑇 0

1 𝑇
= ∫ {cos(2𝜋(𝑓1 + 𝑓2 ) 𝑡) + cos(2𝜋(𝑓1 − 𝑓2 )𝑡)}𝑑𝑡
𝑇 0

172
(Refer Slide Time: 16:07)

Substituting $f_1 = k_1 f_o$ and $f_2 = k_2 f_o$ and integrating from 0 to T, we get the following.

(Refer Slide Time: 16:45)

So, this is equal to

$$\frac{1}{T}\left.\frac{\sin(2\pi(f_1+f_2)t)}{2\pi(f_1+f_2)}\right|_0^T + \frac{1}{T}\left.\frac{\sin(2\pi(f_1-f_2)t)}{2\pi(f_1-f_2)}\right|_0^T$$

and we have already seen that $f_1 + f_2 = (k_1+k_2)f_o$ and $f_1 - f_2 = (k_1-k_2)f_o$.

173
(Refer Slide Time: 17:31)

Since $f_o T = 1$, both terms contain $\sin(2\pi(k_1+k_2))$ or $\sin(2\pi(k_1-k_2))$, which are zero for integer $k_1, k_2$; hence the inner product equals 0.

(Refer Slide Time: 18:32)

Thus, the functions are orthonormal.

174
(Refer Slide Time: 20:05)

(Refer Slide Time: 20:48)

𝑝1 and 𝑝2 are hence, orthogonal and have unit energy, thus they will form the basis of a
signal space.
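This orthonormality argument can also be verified numerically; the sketch below assumes illustrative values k1 = 2, k2 = 3 and T = 1.

```python
import numpy as np

T, k1, k2 = 1.0, 2, 3                  # illustrative values
f1, f2 = k1 / T, k2 / T

t = np.linspace(0.0, T, 200001)
p1 = np.sqrt(2.0 / T) * np.cos(2 * np.pi * f1 * t)
p2 = np.sqrt(2.0 / T) * np.cos(2 * np.pi * f2 * t)

print(np.trapz(p1 * p1, t))   # ~1  (unit energy)
print(np.trapz(p2 * p2, t))   # ~1  (unit energy)
print(np.trapz(p1 * p2, t))   # ~0  (orthogonal)
```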

175
(Refer Slide Time: 21:33)

Each signal in this signal space can be described as ,

2 2
𝛼1 𝑝1 (𝑡) + 𝛼2 𝑝2 (𝑡) = 𝛼1 √ cos(2𝜋𝑓1 𝑡) + 𝛼2 √ cos(2𝜋𝑓2 𝑡)
𝑇 𝑇

In the next modules we will try to build modulation scheme using this vector space.

Thank you very much.

176
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 15
Introduction to Frequency Shift Keying (FSK)

Hello, welcome to another module in this massive open online course. In this module, let
us look at another digital modulation technique that is frequency shift key.

(Refer Slide Time: 00:25)

So, we will be looking at a new digital modulation scheme which is termed as frequency
shift key. This is a digital modulation scheme as we have already said and it is based on
the signal space concept that we have previously described.

177
(Refer Slide Time: 01:14)

Suppose we have two orthonormal pulses as discussed in the previous lecture,𝑝1 (𝑡) and
𝑝2 (𝑡).

2
𝑝1 (𝑡) = √ cos(2𝜋𝑓1 𝑡) 0≤𝑡≤𝑇
𝑇

2
𝑝2 (𝑡) = √ cos(2𝜋𝑓2 𝑡) 0≤𝑡≤𝑇
𝑇

Where 𝑓1 = 𝑘1 /𝑇 and 𝑓2 = 𝑘2 /𝑇. 𝑘1 and 𝑘2 being integers.

178
(Refer Slide Time: 02:07)

(Refer Slide Time: 03:20)

Both 𝑝1 (𝑡) and 𝑝2 (𝑡) have unit energy and are orthogonal to each other.

179
(Refer Slide Time: 04:38)

Now we take the following digital modulation scheme.

2
𝐴𝑝1 (𝑡) = 𝐴√ cos(2𝜋𝑓1 𝑡) 𝑓𝑜𝑟 𝑏𝑖𝑡 0
𝑇
𝑥(𝑡) =
2
𝐴𝑝2 (𝑡) = 𝐴√ cos(2𝜋𝑓2 𝑡) 𝑓𝑜𝑟 𝑏𝑖𝑡 1
{ 𝑇

(Refer Slide Time: 05:49)

180
Note here that both 𝑝1 (𝑡) and 𝑝2 (𝑡) differ in frequency.

(Refer Slide Time: 08:08)

One waveform can be obtained from other by shifting the frequency. Hence, this scheme
is known as frequency shift keying or FSK.

(Refer Slide Time: 09:13)

Assuming both 0 and 1 appear with probability half each, we get the average energy per
bit,

181
1 2 1
𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑒𝑛𝑒𝑟𝑔𝑦 𝑝𝑒𝑟 𝑏𝑖𝑡 = 𝐴 𝐸𝑃 + 𝐴2 𝐸𝑃 = 𝐸𝑏
2 2

(Refer Slide Time: 09:57)

Assuming 𝐸𝑝 = 1 we get,

𝐴 = √𝐸𝑏

(Refer Slide Time: 10:40)

Thus, we have the transmitted signal,

182
$$x(t) = \begin{cases} \sqrt{\frac{2E_b}{T}}\cos(2\pi f_1 t) & \text{for bit } 0 \\ \sqrt{\frac{2E_b}{T}}\cos(2\pi f_2 t) & \text{for bit } 1 \end{cases}$$

(Refer Slide Time: 11:28)

At the receiver end we have the received signal,

𝑦(𝑡) = 𝑥(𝑡) + 𝑛(𝑡)

Where n(t) is the additive gaussian white noise

183
(Refer Slide Time: 11:52)

(Refer Slide Time: 12:48)

Thus, there will be two pulses

𝐴𝑝1 (𝑡) + 𝑛(𝑡)𝑓𝑜𝑟 𝑏𝑖𝑡 0


𝑦(𝑡) = {
𝐴𝑝2 (𝑡) + 𝑛(𝑡)𝑓𝑜𝑟 𝑏𝑖𝑡 1

184
(Refer Slide Time: 13:39)

Now that we have two types of signal, we need to find out how to match the receive filter.

(Refer Slide Time: 15:01)

For bit 0 we define,

𝑝1 (𝑡) + 𝑝2 (𝑡)
𝑦̃(𝑡) = 𝑦(𝑡) − 𝐴 ( )
2

185
(Refer Slide Time: 15:41)

𝑝1 (𝑡) − 𝑝2 (𝑡)
= 𝐴( ) + 𝑛(𝑡)
2

= 𝐴 𝑝̃(𝑡) + 𝑛(𝑡)

(Refer Slide Time: 16:26)

186
(Refer Slide Time: 17:10)

For bit 1 we have,

𝑝1 (𝑡) + 𝑝2 (𝑡)
𝑦̃(𝑡) = 𝑦(𝑡) − 𝐴 ( )
2

(Refer Slide Time: 18:02)

= −𝐴 𝑝̃(𝑡) + 𝑛(𝑡)

187
(Refer Slide Time: 18:33)

Thus,

$$\tilde{y}(t) = \begin{cases} A\,\tilde{p}(t) + n(t) & \text{for bit } 0 \\ -A\,\tilde{p}(t) + n(t) & \text{for bit } 1 \end{cases}$$

(Refer Slide Time: 18:58)

This is similar to BPSK, with 𝑝(𝑡) replaced by, 𝑝̃(𝑡)

188
(Refer Slide Time: 20:15)

(Refer Slide Time: 20:34)

Thus, the receive filter ℎ(𝜏) has to be matched to 𝑝̃(𝑇 − 𝜏),

189
(Refer Slide Time: 21:32)

The factor of two is simply a constant and will not affect the SNR.

(Refer Slide Time: 22:08)

Hence, we can write,

ℎ(𝜏) = 𝑝1 (𝑇 − 𝜏) − 𝑝2 (𝑇 − 𝜏)

190
(Refer Slide Time: 22:44)

2 2
= √ cos(2𝜋𝑓1 (𝑇 − 𝜏)) − √ cos(2𝜋𝑓2 (𝑇 − 𝜏))
𝑇 𝑇

Thus, we have derived the receive filter for optimal SNR; in the next module we will derive the SNR and bit error rate for this scheme.

Thank you very much.

191
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 16
Optimal Decision Rule for FSK, Bit-Error Rate (BER) and Comparison with
BPSK, ASK

Hello. Welcome to another module in this massive open online course. So, we are looking at frequency shift keying, in which we consider two different waveforms, shifted in frequency, to represent the information bits 0 and 1.

(Refer Slide Time: 00:45)

That is, we have,

𝐴𝑝1 (𝑡) + 𝑛(𝑡)𝑓𝑜𝑟 𝑏𝑖𝑡 0


𝑦(𝑡) = {
𝐴𝑝2 (𝑡) + 𝑛(𝑡)𝑓𝑜𝑟 𝑏𝑖𝑡 1

192
(Refer Slide Time: 01:23).

The SNR optimizing receive filter is given by,

ℎ(𝑡) = 𝑝1 (𝑇 − 𝑡) − 𝑝2 (𝑇 − 𝑡)

(Refer Slide Time: 02:27)

193
(Refer Slide Time: 03:02)

Sampler output for information bit 0 will be given by,


𝑟(𝑇) = ∫ 𝑦(𝜏)ℎ(𝑇 − 𝜏)𝑑𝜏
−∞

(Refer Slide Time: 04:05)


= ∫ (𝐴𝑝1 (𝜏) + 𝑛(𝜏)) × (𝑝1 (𝜏) − 𝑝2 (𝜏))𝑑𝜏
−∞

194
$$= \int_{-\infty}^{\infty} A\, p_1(\tau)\,(p_1(\tau) - p_2(\tau))\, d\tau + \int_{-\infty}^{\infty} n(\tau)\,(p_1(\tau) - p_2(\tau))\, d\tau$$

The first term in the above expression is the signal component and the second term are the
noise component.

(Refer Slide Time: 05:26)

(Refer Slide Time: 06:15)

Let us first simplify the signal component.

195

𝑠𝑖𝑔𝑛𝑎𝑙 = ∫ 𝐴𝑝1 (𝜏)(𝑝1 (𝜏) − 𝑝2 (𝜏))𝑑𝜏
−∞

∞ ∞
=∫ 𝐴𝑝12 (𝜏)𝑑𝜏 − ∫ 𝐴𝑝1 (𝜏)𝑝2 (𝜏)𝑑𝜏
−∞ −∞

= 𝐴𝐸𝑃

(Refer Slide Time: 07:32)

Substituting 𝐸𝑃 = 1 the signal power would simply be 𝐴.

The noise part is given by


𝑛̃ = ∫ 𝑛(𝜏)(𝑝1 (𝜏) − 𝑝2 (𝜏))𝑑𝜏
−∞

196
(Refer Slide Time: 08:15)

We know from our previous discussions the above expression will be a gaussian with mean
zero (since n(t) is a gaussian with mean zero).

Variance of the above expression will be given by,

𝜂 ∞
𝑣𝑎𝑟 = ∫ |ℎ(𝜏)|2 𝑑𝜏
2 −∞

(Refer Slide Time: 09:11)

197
$$= \frac{\eta}{2}\int_{-\infty}^{\infty} (p_1(\tau)-p_2(\tau))^2\, d\tau = \frac{\eta}{2}\int_{-\infty}^{\infty} \left(p_1^2(\tau) + p_2^2(\tau)\right) d\tau$$

since the pulses are orthogonal, the cross term $-2p_1(\tau)p_2(\tau)$ integrates to zero. Hence, using $E_P = 1$ for each pulse,

$$= \frac{\eta}{2}(1 + 1) = \eta$$

(Refer Slide Time: 10:13)

(Refer Slide Time: 11:54)

198
This is the noise power after sampling

(Refer Slide Time: 12:37)

Corresponding to bit 1,

𝑦(𝑡) = 𝐴𝑝2 (𝑡) + 𝑛(𝑡)

(Refer Slide Time: 13:45)

Signal after sampling will be given by,

𝑟(𝑇) = −𝐴𝐸𝑃 + 𝑛̃

199
= −𝐴 + 𝑛̃

(Refer Slide Time: 14:58)

Thus, overall, we will have,

𝐴 + 𝑛̃ 𝑓𝑜𝑟 𝑏𝑖𝑡 0
𝑟(𝑇) = {
−𝐴 + 𝑛̃ 𝑓𝑜𝑟 𝑏𝑖𝑡 1

The above expression is similar to BPSK, except that here the noise variance is $\sigma^2 = \eta$, whereas it is $\eta/2$ for BPSK.

(Refer Slide Time: 16:26)

By symmetry we can say that the optimal decision rule will be,

200
𝑟(𝑇) ≥ 0 ⇒ 𝑑𝑒𝑐𝑖𝑑𝑒 𝑏𝑖𝑡 0
{
𝑟(𝑇) < 0 ⇒ 𝑑𝑒𝑐𝑖𝑑𝑒 𝑏𝑖𝑡 1

(Refer Slide Time: 17:53)

This is the optimal threshold-based decision rule for FSK.

(Refer Slide Time: 18:48)

The probability of bit error or BER is given by,

𝐴
𝐵𝐸𝑅 = 𝑄 ( )
𝜎

201
𝐴2
= 𝑄 (√ )
𝜂

(Refer Slide Time: 19:54)

Or,

𝐸𝑏
𝑃𝑒 = 𝑄 (√ )
𝜂

(Refer Slide Time: 21:13)

202
Hence BER for FSK is equal to BER for ASK (for equal average energy per bit). And both
of these are greater than BER of BPSK.

We need twice the average bit energy for ASK and FSK to achieve the same BER as
BPSK.

(Refer Slide Time: 23:55)

Or we can say ASK and FSK need 3dB more power to achieve the same BER as BPSK
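This comparison can be visualized by evaluating the three BER expressions over a range of E_b/η; a minimal sketch follows, with an illustrative dB grid.

```python
import numpy as np
from scipy.special import erfc

def Q(u):
    return 0.5 * erfc(u / np.sqrt(2))

EbN0_dB = np.arange(0, 11)              # illustrative Eb/eta range in dB
EbN0 = 10 ** (EbN0_dB / 10.0)

ber_bpsk = Q(np.sqrt(2 * EbN0))         # BPSK
ber_ask  = Q(np.sqrt(EbN0))             # ASK
ber_fsk  = Q(np.sqrt(EbN0))             # FSK (same as ASK for equal Eb)

for d, b, a, f in zip(EbN0_dB, ber_bpsk, ber_ask, ber_fsk):
    print(f"{d:2d} dB  BPSK={b:.2e}  ASK={a:.2e}  FSK={f:.2e}")
```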

So, among these three schemes, BPSK (binary phase shift keying) is the most efficient modulation scheme. The reason, if you explore it, is that for the same average bit energy BPSK uses antipodal signalling, that is, the signals plus A and minus A, which maximises the distance between the constellation points.

You can note this informally, although we have not shown it rigorously: BPSK maximises the distance between the constellation points for the same average bit energy, while amplitude shift keying and frequency shift keying do not.

So, we will stop here.

203
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 17
Introduction to Quadrature Phase Shift Keying (QPSK)

Hello. Welcome to another module in this massive open online course. So, in the previous
module, we have looked at different digital modulation schemes. In this module we will
start looking at another digital modulation scheme that is quadrature phase shift keying or
QPSK.

(Refer Slide Time: 00:32)

204
(Refer Slide Time: 01:16)

(Refer Slide Time: 01:38)

Suppose we have two orthonormal basis functions 𝑝1 (𝑡) and 𝑝2 (𝑡).

2
𝑝1 (𝑡) = √ cos(2𝜋𝑓𝑐 𝑡) 𝑓𝑜𝑟 0 ≤ 𝑡 ≤ 𝑇
𝑇

205
2
𝑝2 (𝑡) = √ sin(2𝜋𝑓𝑐 𝑡) 𝑓𝑜𝑟 0 ≤ 𝑡 ≤ 𝑇
𝑇

(Refer Slide Time: 02:19)

Where $T = k/f_c$, with $k$ an integer.

(Refer Slide Time: 03:40)

Observe that $p_2(t)$ is a phase-shifted version of $p_1(t)$, with a phase shift of $\pi/2$.

206
Also both of these have unit energy (this we can say from previous modules).

(Refer Slide Time: 05:37)

(Refer Slide Time: 06:20)

i.e.,

𝐸𝑝 = 1 for p1 and p2

207
(Refer Slide Time: 07:35)

The inner product is given by,


∫ 𝑝1 (𝑡)𝑝2 (𝑡)𝑑𝑡
−∞

(Refer Slide Time: 07:57)

208
(Refer Slide Time: 08:19)

2 ∞
= ∫ sin(2𝜋𝑓𝑐 𝑡) cos(2𝜋𝑓𝑐 𝑡) 𝑑𝑡
𝑇 −∞

1 𝑇
= ∫ sin(4𝜋𝑓𝑐 𝑡) 𝑑𝑡
𝑇 0

=0

(Refer Slide Time: 09:18)

209
Thus, the inner product of the two pulses is zero. They are then orthonormal pulses (unit
energy + orthogonal)

(Refer Slide Time: 10:16)

(Refer Slide Time: 11:06)

Note that these two basis signals are similar to those in Quadrature carrier multiplexing,
in which the signals are modulated over 𝑚1 cos(2𝜋𝑓𝑐 𝑡) − 𝑚2 sin(2𝜋𝑓𝑐 𝑡)

210
(Refer Slide Time: 12:31)

(Refer Slide Time: 14:47)

We generate a signal x(t),

𝑥(𝑡) = 𝑎1 𝑝1 (𝑡) + 𝑎2 𝑝2 (𝑡)

Each of the two carries one symbol or one bit of information.

211
(Refer Slide Time: 16:11)

It is possible to have, 𝑎1 = ±𝐴 and 𝑎2 = ±𝐴, with average bit energy=𝐸𝑏 . Thus, 𝐴 = √𝐸𝑏 .

(Refer Slide Time: 17:07)

The signal x(t) carries two bits of information.

212
(Refer Slide Time: 18:22)

Hence, we can have a set of four possible signals.

𝐴𝑝1 (𝑡) + 𝐴𝑝2 (𝑡)


𝐴𝑝1 (𝑡) − 𝐴𝑝2 (𝑡)
𝑥(𝑡) =
−𝐴𝑝1 (𝑡) + 𝐴𝑝2 (𝑡)
{−𝐴𝑝1 (𝑡) − 𝐴𝑝2 (𝑡)

Here we see that the signal is built on two orthogonal pulses and can thus carry 2 bits of information per symbol duration, whereas in the previous schemes only one bit per symbol duration could be transmitted.

We will explore this scheme further in the subsequent modules.

Thank you very much.

213
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 18
Waveforms of Quadrature Phase Shift Keying (QPSK)

Hello. Welcome to another module in this massive open online course. So, we are looking
at a different digital modulation scheme that is Quadrature Phase Shift Keying.

(Refer Slide Time: 00:33)

In QPSK the signal is given by,

𝑥(𝑡) = 𝑎1 𝑝1 (𝑡) + 𝑎2 𝑝2 (𝑡)

Where 𝑎1 and 𝑎2 can both be ±𝐴.

214
(Refer Slide Time: 00:56)

Hence, there will be four possible waveforms,

𝐴𝑝1 (𝑡) + 𝐴𝑝2 (𝑡)


𝐴𝑝1 (𝑡) − 𝐴𝑝2 (𝑡)
𝑥(𝑡) =
−𝐴𝑝1 (𝑡) + 𝐴𝑝2 (𝑡)
{−𝐴𝑝1 (𝑡) − 𝐴𝑝2 (𝑡)

(Refer Slide Time: 01:52)

Consider the first waveform,

215
$$x(t) = A\sqrt{\frac{2}{T}}\cos(2\pi f_c t) + A\sqrt{\frac{2}{T}}\sin(2\pi f_c t) = \frac{2A}{\sqrt{T}}\left\{\frac{1}{\sqrt{2}}\cos(2\pi f_c t) + \frac{1}{\sqrt{2}}\sin(2\pi f_c t)\right\}$$

(Refer Slide Time: 02:56)

(Refer Slide Time: 03:29)

$$= \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t - \frac{\pi}{4}\right)$$

216
(Refer Slide Time: 04:40)

Similarly, for the second waveform ($A p_1(t) - A p_2(t)$) we can write,

$$= \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t + \frac{\pi}{4}\right)$$

(Refer Slide Time: 05:37)

The third waveform ($-A p_1(t) + A p_2(t)$) can be written as,

$$= \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t + \frac{5\pi}{4}\right)$$

217
And the fourth, ($-A p_1(t) - A p_2(t)$),

$$= \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t + \frac{3\pi}{4}\right)$$

(Refer Slide Time: 07:40)

Hence overall we can write,

$$x(t) \in \left\{ \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t - \frac{\pi}{4}\right),\ \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t + \frac{\pi}{4}\right),\ \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t + \frac{3\pi}{4}\right),\ \frac{2A}{\sqrt{T}}\cos\left(2\pi f_c t + \frac{5\pi}{4}\right) \right\}$$

218
(Refer Slide Time: 08:53)

Note that, listed in increasing order of phase, each waveform is shifted from the previous one by a phase difference of $\pi/2$, and the last is likewise separated from the first by $\pi/2$ (modulo $2\pi$). Each waveform is thus shifted from its neighbours by a phase of $\pi/2$, or $90^o$, that is, a quadrature. This is the reason why this scheme is known as quadrature phase shift keying.

(Refer Slide Time: 12:06)

219
(Refer Slide Time: 13:01)

(Refer Slide Time: 14:19)

These different waveforms can be represented using a phasor diagram. The four signals can be represented on a circle of radius $2A/\sqrt{T}$ (see figure above). Each is shifted from the next by a phase of $\pi/2$. The points lie at angles (measured from the positive x-axis) of $\pi/4$, $3\pi/4$, $5\pi/4$ (i.e., $-3\pi/4$) and $7\pi/4$ (i.e., $-\pi/4$).

We can represent these waveforms using a 2-dimensional constellation diagram.

220
(Refer Slide Time: 16:35)

The four points will be, (A,A), (-A,A), (-A,-A) and (A,-A).
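The constellation points and the corresponding carrier amplitudes and phases can be listed with a short sketch; the values of A and T below are assumptions for illustration.

```python
import numpy as np

A, T = 1.0, 1.0                                # illustrative values
points = [(A, A), (-A, A), (-A, -A), (A, -A)]  # (a1, a2) pairs

for a1, a2 in points:
    R = np.hypot(a1, a2) * np.sqrt(2.0 / T)    # envelope 2A/sqrt(T) when |a1| = |a2| = A
    theta = np.arctan2(a2, a1)                 # x(t) = R * cos(2*pi*fc*t - theta)
    print(f"(a1,a2)=({a1:+.0f},{a2:+.0f})  R={R:.3f}  theta={np.degrees(theta):6.1f} deg")
```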

(Refer Slide Time: 18:47)

In the subsequent week, we will look into matched filter design for QPSK and its BER

Thank you very much.

221
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 19
Matched Filtering, Bit-Error Rate and Symbol Error Rate for
Quadrature Phase Shift Keying (QPSK)

Hello. Welcome to another module in this massive open online course. So, we are looking at quadrature phase shift keying, and we have looked at several aspects: how to construct the modulated signal, what the different pulse waveforms in quadrature phase shift keying are, and what property they satisfy, namely that these different waveforms are phase shifted from each other by $\pi/2$. Now let us look at the detection and processing at the receiver and the resulting performance in terms of its bit error rate.

(Refer Slide Time: 00:45)

222
(Refer Slide Time: 01:13)

We have $x(t) = a_1 p_1(t) + a_2 p_2(t)$, where $a_1$ and $a_2$ are independent symbols carried on the orthonormal pulses $p_1(t)$ and $p_2(t)$. Thus, the optimal receive filters for the two symbols will be $p_1(T-\tau)$ and $p_2(T-\tau)$.

(Refer Slide Time: 02:17)

Matched filtering with the first filter will give us a_1 times E_P, where E_P is the pulse energy equal to 1, plus a noise term n tilde, and the second will give us a_2 times E_P plus noise. When we matched filter with respect to p_2, the component along the signal space direction p_1(t), that is, the component along the pulse p_1(t), vanishes, and we get a_2

223
times E_P, where a_2 is the component along the signal space pulse p_2(t), E_P is 1, plus the noise term n tilde.

So, what we are saying is that we matched filter with respect to both pulses: first with respect to p_1(t), the first signal space basis signal, to make a decision on a_1; and then, to make a decision on the transmitted symbol a_2, we matched filter with respect to p_2(t), which is the other orthogonal basis function of the signal space.

Now, let us look at how this operates. Say we matched filter with respect to h(t) = p_1(T - t).

(Refer Slide Time: 04:11)

The output of the AWGN channel will be,

$$y(t) = x(t) + n(t)$$

where n(t) is white Gaussian noise with zero mean and autocorrelation $\frac{\eta}{2}\delta(\tau)$.

224
(Refer Slide Time: 05:02)

(Refer Slide Time: 05:52)

Matched filter output,

𝑦(𝑡) ⋆ ℎ1 (𝑡)|𝑡=𝑇 = 𝑥(𝑡) ⋆ ℎ1 (𝑡)|𝑡=𝑇 + 𝑛(𝑡) ⋆ ℎ1 (𝑡)|𝑡=𝑇

The signal component of the above expression will be,

$$(a_1 p_1(t) + a_2 p_2(t)) \star h_1(t) = \int_{-\infty}^{\infty} \left(a_1 p_1(\tau) + a_2 p_2(\tau)\right) h_1(t-\tau)\, d\tau$$

225

= ∫ 𝑎1 𝑝1 (𝜏) + 𝑎2 𝑝2 (𝜏)𝑝1 (𝑡 + 𝜏 − 𝑇)𝑑𝜏
−∞

(Refer Slide Time: 07:24)

(Refer Slide Time: 08:49)

Sampling the above at t=T we get,


∫ (𝑎1 𝑝1 (𝜏) + 𝑎2 𝑝2 (𝜏))𝑝1 (𝜏)𝑑𝜏
−∞

226
(Refer Slide Time: 10:09)

The inner product of 𝑝1 (𝑡) and 𝑝2 (𝑡) will be zero, leaving us with inner product of p1 with
p1 which is energy of the pulse 𝐸𝑃 . Thus, we have,

= 𝑎1 𝐸𝑝 = 𝑎1 (𝑠𝑢𝑏𝑠𝑡𝑖𝑡𝑢𝑡𝑖𝑛𝑔 𝐸𝑝 = 1)

(Refer Slide Time: 12:10)

The noise component will be, as in the previous cases, $\tilde{n}_1$ with mean zero and variance $\frac{\eta E_P}{2} = \eta/2$.

227
(Refer Slide Time: 13:11)

The final output at the sampler will be given by,

𝑟1 (𝑇) = 𝑎1 + 𝑛
̃1

Where, 𝑎1 will be ±𝐴 (where 𝐴 = √𝐸𝑏 ). 𝐸𝑏 being the average energy per bit.

Note here that this output is similar to BPSK

(Refer Slide Time: 14:36)

228
By symmetry the optimal detection rule will be,

𝑟1 (𝑇) ≥ 0 𝑑𝑒𝑐𝑖𝑑𝑒 𝐴, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 − 𝐴

(Refer Slide Time: 15:15)

(Refer Slide Time: 15:57)

Similarly, for the second symbol, we use the matched filter $h_2(\tau) = p_2(T-\tau)$ on $y(t)$.

229
(Refer Slide Time: 16:59)

The output of the second sampler will be,

$$r_2(T) = a_2 + \tilde{n}_2$$

(Refer Slide Time: 17:40)

Schematically the QPSK scheme can be represented by two independent paths for 𝑝1 and
𝑝2 (see figure above). The paths have received filter ℎ1 (𝑡) and ℎ2 (𝑡), both are sampled at
t=T. and the output of the samplers are 𝑟1 (𝑇) and 𝑟2 (𝑇).

230
(Refer Slide Time: 21:00)

The output of each sampler is similar to BPSK. Both 𝑎1 and 𝑎2 can be ±𝐴.

(Refer Slide Time: 21:34)

The bit error rate (BER) of each component will be similar to BPSK.

231
(Refer Slide Time: 22:08)

BER for both the channels is given by,

2𝐸𝑏
𝑝𝑒1 = 𝑝𝑒2 = 𝑝𝑒 = 𝑄 (√ )
𝜂

(Refer Slide Time: 23:41)

232
(Refer Slide Time: 24:33)

Now we calculate the overall symbol error rate.

(Refer Slide Time: 25:35)

233
(Refer Slide Time: 26:37)

Probability that the symbol is received correctly in any channel is given by,

𝑝𝑐 = 1 − 𝑝𝑒

(Refer Slide Time: 27:25)

Probability that both the channels receive the correct bit will be given by,

(1 − 𝑝𝑒1 )(1 − 𝑝𝑒2 ) = (1 − 𝑝𝑒 )2

234
(Refer Slide Time: 27:54)

(Refer Slide Time: 28:41)

Here we have used the following rule from probability theory,

𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴)𝑃(𝐵) 𝑖𝑓 𝑏𝑜𝑡ℎ 𝐴 𝑎𝑛𝑑 𝐵 𝑎𝑟𝑒 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑎𝑛𝑡

235
(Refer Slide Time: 29:54)

(Refer Slide Time: 30:05)

Probability of symbol error is given by,

= 1 − 𝑝𝑟𝑜𝑏, 𝑏𝑜𝑡ℎ 𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦 𝑟𝑒𝑐𝑖𝑒𝑣𝑒𝑑

= 2𝑝𝑒 − 𝑝𝑒2

236
(Refer Slide Time: 30:53)

2𝐸𝑏 2𝐸𝑏
𝑃𝑒 = 2𝑄 (√ ) − 𝑄 2 (√ )
𝜂 𝜂

(Refer Slide Time: 31:54)

For high SNR we will have high 𝐸𝑏 /𝜂, thus,

√2𝐸𝑏
𝑄( )≪1
𝜂

237
2𝐸𝑏 2𝐸𝑏
⇒ 𝑄 2 (√ ) ≪ 𝑄 (√ )
𝜂 𝜂

(Refer Slide Time: 33:10)

Hence, for high SNR,

$$P_e \approx 2Q\left(\sqrt{\frac{2E_b}{\eta}}\right)$$

Roughly speaking, since there are 2 bits and each bit can be in error, the symbol error rate is approximately twice the bit error rate of BPSK, although this is not exactly true. Using the results from probability theory, the probability of A union B is the probability of A plus the probability of B minus the probability of A intersection B; the square term represents that intersection. Since the square term is much smaller, we can neglect it at high SNR and simply write the symbol error rate as twice Q of the square root of 2 E_b over eta. That is the overall probability of symbol error for the QPSK modulation scheme with average bit energy E_b.
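Numerically, the exact and approximate symbol error rates stay close at high SNR; a short check follows, with illustrative E_b/η values in dB.

```python
import numpy as np
from scipy.special import erfc

def Q(u):
    return 0.5 * erfc(u / np.sqrt(2))

for EbN0_dB in (2, 6, 10):                  # illustrative Eb/eta values in dB
    EbN0 = 10 ** (EbN0_dB / 10.0)
    pe = Q(np.sqrt(2 * EbN0))               # per-branch (BPSK-like) bit error rate
    ser_exact = 2 * pe - pe**2
    ser_approx = 2 * pe
    print(EbN0_dB, ser_exact, ser_approx)
```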

So, we will stop here and explore other schemes in subsequent modules.

Thank you very much.

238
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 20
Introduction to M-ary PAM (Pulse Amplitude Modulation), Average
Symbol Power and Decision Rules

Hello. Welcome to another module. So, in this module let us start looking at M-ary another
digital modulation scheme based on M levels termed as M-ary pulse amplitude
modulation.

(Refer Slide Time: 00:27)

M-ary Pulse Amplitude modulation or M-ary PAM is a modulation scheme based on M-


levels, in contrast with previous schemes which were based on two levels.

239
(Refer Slide Time: 01:39)

We have the transmitted waveform,

𝑥(𝑡) = 𝑎𝑜 𝑝(𝑡)

𝑎𝑜 can take one of M different levels.

(Refer Slide Time: 02:48)

𝑁𝑜. 𝑜𝑓 𝑏𝑖𝑡𝑠 𝑓𝑜𝑟 𝑀 𝑙𝑒𝑣𝑒𝑙𝑠 = log 2 𝑀

240
Example if we have 4 levels we need 2 bits for representation, each level can be
represented by one of (00, 01, 10 or 11).

(Refer Slide Time: 04:11)

For further discussion, let us take the example of an 8-level system, and let the 8 possible values be

$$a_o \in \{\pm A, \pm 3A, \pm 5A, \pm 7A\}$$

Here all adjacent levels in the constellation are separated by 2A. So, this is an M-ary, specifically 8-ary, PAM (pulse amplitude modulation) constellation.

241
(Refer Slide Time: 06:15)

For a general M-ary PAM we can write,

$$a_o \in \{\pm(2i+1)A\}, \quad i = 0, 1, 2, \ldots, \frac{M}{2}-1$$

Where 𝑀 is an even integer.

(Refer Slide Time: 06:58)

Let the average symbol energy be 𝐸𝑠

242
(Refer Slide Time: 07:51)

$$E_s = \frac{1}{M/2} \sum_{i=0}^{\frac{M}{2}-1} (2i+1)^2 A^2, \qquad \text{for } a_o = \pm(2i+1)A$$

Here, by symmetry, $+(2i+1)A$ and $-(2i+1)A$ have the same energy, so the average can be calculated over one side only.

(Refer Slide Time: 09:28)

This can now be simplified,

243
$$= \frac{2A^2}{M} \sum_{i=0}^{\frac{M}{2}-1} (4i^2 + 4i + 1)$$

Using the formula for sum of n integers and sum of square of n integers we can simplify
the above expression.

(Refer Slide Time: 11:36)

(Refer Slide Time: 11:56)

244
(Refer Slide Time: 13:12)

The simplified expression will be,

𝐴2
= (𝑀2 − 1)
3

(Refer Slide Time: 14:01)

Thus, the average symbol energy as a function of A and M is given by,

$$E_s = \frac{A^2}{3}(M^2 - 1)$$

245
(Refer Slide Time: 15:01)

(Refer Slide Time: 15:38)

For a fixed average symbol energy $E_s$ we can write,

$$A = \sqrt{\frac{3E_s}{M^2 - 1}}$$

where $2A$ is the separation between adjacent levels.
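For instance, the level spacing and the levels of an 8-ary PAM for a given average symbol energy can be computed as below; the value of E_s is an assumption for illustration.

```python
import numpy as np

M, Es = 8, 1.0                                    # illustrative values
A = np.sqrt(3 * Es / (M**2 - 1))                  # from Es = A^2 (M^2 - 1) / 3

levels = np.array([(2 * i + 1) * A for i in range(M // 2)])
levels = np.concatenate([-levels[::-1], levels])  # {-7A, ..., -A, A, ..., 7A}
print(levels)
print(np.mean(levels**2))                         # check: equals Es
```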

246
(Refer Slide Time: 16:37)

We choose a simple pulse 𝑝(𝑡) given by,

2 𝑘
𝑝(𝑡) = √ cos(2𝜋𝑓𝑐 𝑡) 𝑓𝑜𝑟 0 ≤ 𝑡 ≤
𝑇 𝑓𝑐

Where k is an integer.

(Refer Slide Time: 17:32)

Energy of the pulse is unity, i.e., 𝐸𝑝 = 1.

247
(Refer Slide Time: 18:04)

Matched receive filter will be given by,

ℎ(𝑡) = 𝑝(𝑇 − 𝑡)

We can say this because for the matched filter principle we did not assume any specific
modulation scheme or number of levels, only the pulse shape. Thus, for M-ary scheme the
matched filter will be given by the above expression.

(Refer Slide Time: 19:19)

248
Output after sampling from the matched filter will be given by,

𝑟(𝑇) = 𝑎𝑜 + 𝑛̃ (𝑎𝑠𝑠𝑢𝑚𝑖𝑛𝑔 𝐸𝑝 = 1)

where $\tilde{n}$ is additive white Gaussian noise with mean zero and variance $\frac{\eta E_p}{2} = \eta/2$.

(Refer Slide Time: 21:58)

Let us derive the decision rule now. For the symbol A the decision rule will be,

$$\text{if } 0 \leq r(T) < 2A \Rightarrow \text{decide } A$$

Similarly, for 3A we will have,

$$\text{if } 2A \leq r(T) < 4A \Rightarrow \text{decide } 3A$$

Thus, instead of a single decision boundary, each interior level has two boundaries, and these lie at the midpoints between adjacent levels.

249
(Refer Slide Time: 26:18)

The overall decision rule will be a combination of M different cases, one for each level; for instance, if $2A \leq r(T) < 4A$, the decision is 3A.

(Refer Slide Time: 28:34)

Such decision rule is also known as the nearest neighbor rule as the decision for a symbol
is based on the level nearest to the sampled or received signal mean.

250
The boundary points however will have a slightly different rule as they only have one
neighbor.
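The nearest neighbour rule can be written as a small helper function: it simply picks the constellation level closest to the sampled output, which automatically handles both the interior and the end points. The function and level set below are a sketch under the assumption that the levels array holds the M-ary PAM constellation.

```python
import numpy as np

def pam_detect(r, levels):
    """Nearest-neighbour decision: return the PAM level closest to each sample r."""
    r = np.atleast_1d(r)
    idx = np.argmin(np.abs(r[:, None] - levels[None, :]), axis=1)
    return levels[idx]

# Example with an 8-PAM constellation and A = 1 (illustrative)
A = 1.0
levels = np.arange(-7, 8, 2) * A              # {-7A, -5A, ..., 5A, 7A}
print(pam_detect([0.4, 2.7, -9.3], levels))   # -> [ 1.  3. -7.]
```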

We will continue the discussion in the next module.

Thank you.

251
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 21
M-ary PAM (Pulse Amplitude Modulation) - Part II, Optimal Decision Rule,
Probability of Error

Hello, welcome to another module in this massive open online course. So, we are looking
at M-ary PAM: that is M-ary pulse amplitude modulation.

(Refer Slide Time: 00:22)

We looked at the optimal decision rule for M-ary PAM scheme, and concluded that for
interior points, the decision is based on nearest neighbour rule.

252
(Refer Slide Time: 01:38)

Nearest neighbour rule means 𝑎𝑜 is decided to be the nearest neighbour of the detected
(output after the filter) value of 𝑟(𝑇).

(Refer Slide Time: 02:38)

At the two end points, however, the situation is slightly different. The end points only have one neighbour each and are given by $-(M-1)A$ and $(M-1)A$. Thus, we can form the decision rule for the end points,

$$\begin{cases} \text{decide } (M-1)A & \text{if } r(T) \geq (M-2)A \\ \text{decide } -(M-1)A & \text{if } r(T) < -(M-2)A \end{cases}$$

253
(Refer Slide Time: 03:14)

(Refer Slide Time: 05:38)

Together with the previously discussed decision rules for interior points these form the
complete set of decision rules.

254
(Refer Slide Time: 06:02)

Or we can simply write,

$$\begin{cases} \text{if } 0 \leq r(T) < 2A & \text{decide } A \\ \text{if } 2A \leq r(T) < 4A & \text{decide } 3A \\ \text{if } -2A \leq r(T) < 0 & \text{decide } -A \\ \qquad\vdots \\ \text{if } r(T) \geq (M-2)A & \text{decide } (M-1)A \\ \text{if } r(T) < -(M-2)A & \text{decide } -(M-1)A \end{cases}$$

(Refer Slide Time: 09:29)

255
We will now calculate the probability of error.

(Refer Slide Time: 10:02)

(Refer Slide Time: 11:47)

Let us take an example: suppose the transmitted symbol is $a_o = A$. A correct decision requires $0 \leq r(T) < 2A$, so an error occurs when $a_o = A$ is transmitted but $r(T) \geq 2A$ or $r(T) < 0$.

256
(Refer Slide Time: 12:13)

Probability of error in this case will be,

Pr(𝑒) = Pr(𝑟(𝑇) < 0 ∪ 𝑟(𝑇) ≥ 2𝐴)

Since the above two events are mutually exclusive, we can write,

𝑃𝑒 = 𝑃𝑟 (𝑟(𝑇 < 0)) + 𝑃𝑟 (𝑟(𝑇) ≥ 2𝐴)

(Refer Slide Time: 14:45)

𝑟(𝑇) = 𝑎𝑜 + 𝑛̃

257
= 𝐴 + 𝑛̃

𝑟(𝑇) ≥ 2𝐴

⇒ 𝑛̃ ≥ 𝐴

And

𝑟(𝑇) < 0 ⇒ 𝑛̃ < −𝐴

(Refer Slide Time: 15:52)

(Refer Slide Time: 16:46).

258
Thus, the probability of error is given by,

𝑃𝑒 = 𝑃𝑟 (𝑛̃ ≥ 𝐴) + 𝑃𝑟 (𝑛̃ < −𝐴)

(Refer Slide Time: 17:13)

Now, 𝑛̃ is a gaussian with zero mean thus is symmetric around zero (or it is an even
function). Thus, the two probabilities in the above sum will be equal. i.e.,

𝑃𝑟 (𝑛̃ ≥ 𝐴) = 𝑃𝑟 (𝑛̃ < −𝐴)

Here we have assumed that the probability of a single point is zero (𝑃𝑟 (𝑛̃ = 𝐴) = 0)

259
(Refer Slide Time: 18:30)

(Refer Slide Time: 20:27)

Hence, we can write,

$$P_e = 2\Pr(\tilde{n} > A) = 2Q\left(\sqrt{\frac{A^2}{\eta/2}}\right)$$

260
(Refer Slide Time: 21:32)

(Refer Slide Time: 22:34)

Q function is the tail integral of a gaussian or the probability that value of a standard
gaussian is greater than x,

𝑄(𝑥) = 𝑃𝑟𝑜𝑏(𝑠𝑡𝑑. 𝑔𝑎𝑢𝑠𝑠𝑖𝑎𝑛 ≥ 𝑥)

261
(Refer Slide Time: 23:22)

The same can be said for all the interior points. Now, considered the exterior points.

(Refer Slide Time: 23:55)

𝑎𝑜 = (𝑀 − 1)𝐴

𝑟(𝑇) = (𝑀 − 1)𝐴 + 𝑛̃

262
(Refer Slide Time: 25:05)

Error occurs if,

(𝑀 − 1)𝐴 + 𝑛̃ < (𝑀 − 2)𝐴

⇒ 𝑛̃ < −𝐴

Thus,

𝑃𝑒 = Pr(𝑛̃ < −𝐴)

= Pr(𝑛̃ > 𝐴) (𝑓𝑟𝑜𝑚 𝑠𝑦𝑚𝑚𝑒𝑡𝑟𝑦)

263
(Refer Slide Time: 26:26)

Thus,

𝐴
𝑃𝑒 = 𝑄 ( )
𝜎

(Refer Slide Time: 27:28)

Assuming all constellation points to be equiprobable,

264
(Refer Slide Time: 27:53)

$$\Pr(a_o = \pm(2i+1)A) = \frac{1}{M}, \qquad \Pr(\text{end point}) = \frac{2}{M}, \qquad \Pr(\text{interior point}) = \frac{M-2}{M}$$

(Refer Slide Time: 29:10)

The average probability of error will therefore be,

265
$$P_e = \Pr(e\,|\,\text{end point})\Pr(\text{end point}) + \Pr(e\,|\,\text{interior point})\Pr(\text{interior point}) = Q\left(\frac{A}{\sigma}\right)\frac{2}{M} + 2Q\left(\frac{A}{\sigma}\right)\frac{M-2}{M}$$

(Refer Slide Time: 30:50)

$$= 2\left(1 - \frac{1}{M}\right) Q\left(\frac{A}{\sigma}\right)$$

In the above expression we substitute,

$$\sigma^2 = \frac{\eta}{2} \quad \text{and} \quad A = \sqrt{\frac{3E_s}{M^2-1}}$$

where $E_s$ is the average symbol energy.

266
(Refer Slide Time: 32:01)

(Refer Slide Time: 32:38)

Thus,

$$P_e = 2\left(1 - \frac{1}{M}\right) Q\left(\sqrt{\frac{6E_s}{\eta(M^2-1)}}\right)$$

267
No. of bits per symbol is given by log 2 𝑀.

In this modulation scheme by loading a large number of bits per symbol, we can increase
the overall data transfer rate without increasing the symbol rate. This is extremely useful
in modern high speed communications like 4G and 5G.
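The final expression can be wrapped in a small helper to see how the symbol error rate grows with M at a fixed E_s/η; the numeric values below are assumptions for illustration.

```python
import numpy as np
from scipy.special import erfc

def Q(u):
    return 0.5 * erfc(u / np.sqrt(2))

def ser_mpam(M, Es_over_eta):
    # P_e = 2 (1 - 1/M) Q( sqrt( 6 Es / (eta (M^2 - 1)) ) )
    return 2 * (1 - 1.0 / M) * Q(np.sqrt(6 * Es_over_eta / (M**2 - 1)))

for M in (2, 4, 8):
    print(M, ser_mpam(M, Es_over_eta=20.0))   # Es/eta = 20 (about 13 dB), assumed
```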

So, substituting the expressions for both A and sigma square equal to eta by 2, the average probability of error is twice (1 minus 1 over M) times Q of the square root of A square, which is 3 E_s divided by (M square minus 1), divided by eta by 2; that gives 6 E_s divided by eta times (M square minus 1) inside the square root. This is the average probability of error for M-ary PAM. The derivation is slightly elaborate, but it is a very interesting one, and this is one of the fundamental expressions, because M-ary PAM is a very general digital modulation scheme.

Let me describe this a little more. If you substitute M equal to 2, you can see that the factor 2(1 minus 1/M) becomes 2 times one half, which is 1. Also, for M equal to 2, M square minus 1 is 3, so 6 divided by 3 gives 2, and the probability of error reduces to Q of the square root of 2 E_s over eta. This is the same as what we had before, namely Q of the square root of 2 E_b over eta, because for 2-PAM, that is when M equals 2, the average bit energy is the same as the average symbol energy, since the number of bits per symbol is 1.

So, the average bit energy is the same as the average symbol energy in that case. What we have derived, therefore, is the average probability of error for M-ary PAM. As I have said, it is a very general modulation scheme because with M-ary PAM you can transmit log M to the base 2 bits per symbol. Unlike the previous modulation schemes such as binary phase shift keying, you are not limited to 1 bit per symbol; you can load a larger number of bits on each symbol, and that is helpful especially in modern high speed wireless communication systems, because if you have a good channel where the SNR is high you can keep loading more bits on each symbol, using 4-PAM, 8-PAM, 128-PAM and so on.

In fact, we are going to see a much more general modulation scheme next, known as QAM, which is similar to QPSK in that we have 2 basis functions, and you can load any number of bits on each orthogonal basis function. Thereby, by increasing the number of bits per symbol for the same symbol rate, you can significantly increase the bit rate, and that is how one achieves

268
higher and higher data rates, for instance in 4G wireless communication systems, or in the 5G systems that people are now talking about. By using higher values of M, known as higher order modulations, and loading a large number of bits on each symbol, one is able to achieve higher data rates; that is why the performance of this general modulation scheme is important.

(Refer Slide Time: 36:03)

So, the final point I want to emphasize again is that the number of bits per symbol equals log M to the base 2. By loading a large number of bits per symbol, rather than simply one bit per symbol, one is able to achieve higher data rates, for instance in 4G and 5G, which are very important. This basically summarizes the probability of error for M-ary PAM.

So, that basically wraps up this module. In the previous module we introduced M-ary PAM and partly discussed the decision rules for the interior points. Now we have completed the discussion regarding the decision rules for the two end points, the probability of error for an interior point and for an end point, and therefore the average probability of error, for which we have derived an expression.

269
So, please go through this derivation again, because it involves several steps, a series of logical arguments and computations at every step, and we have used several different properties, such as properties of random variables, properties of mutually exclusive events, properties of mutually exclusive and exhaustive events, the total probability rule, and so on.

So, this is a very interesting derivation that tells you what an M-ary PAM constellation is, what the optimal decision rule is and what the resulting average probability of error is, and this practical example also helps you better understand several principles of probability and random processes.

So, we will stop here. We will look at other aspects in the subsequent modules.

Thank you very much.

270
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 22
M-ary QAM (Quadrature Amplitude Modulation) – Part I, Introduction,
Transmitted Waveform, Average Symbol Energy

Hello. Welcome to another module in this massive open online course. In this module we
will discuss the another digital modulation scheme known as Quadrature amplitude
modulation or QAM.

(Refer Slide Time: 00:58)

271
(Refer Slide Time: 01:38)

And to be a bit more precise this is known as M-ary QAM, M-ary quadrature amplitude
modulation.

(Refer Slide Time: 02:08)

This is similar to QPSK where we had two pulses, here too we have two pulses 𝑝1 (𝑡) and
𝑝2 (𝑡).

𝑥(𝑡) = 𝑎1 𝑝1 (𝑡) + 𝑎2 𝑝2 (𝑡)

272
(Refer Slide Time: 03:12)

Where, 𝑎1 and 𝑎2 are two separate and independant symbols from a PAM constellation.

(Refer Slide Time: 04:48)

𝑝1 (𝑡) and 𝑝2 (𝑡) are given by,

2
𝑝1 (𝑡) = √ cos⁡(2𝜋𝑓𝑐 𝑡)
𝑇

273
2
𝑝2 (𝑡) = √ sin⁡(2𝜋𝑓𝑐 𝑡)
𝑇

Where 𝑇 = 𝑘/𝑓𝑐 .

(Refer Slide Time: 05:38)

These pulses are identical to QPSK, they have unit pulse energy (𝐸𝑝 = 1) and are shifted
by a face difference of 𝜋/2.

(Refer Slide Time: 06:51)

274
For an M-ary QAM the total number of symbols will be M, and we have two sets for 𝑎1
and 𝑎2 . Thus both 𝑎1 and 𝑎2 will be derived from a constellation of √𝑀-ary PAM.

(Refer Slide Time: 08:25)

(Refer Slide Time: 10:06)

$$\text{Total number of symbols in QAM} = \sqrt{M} \times \sqrt{M} = M$$

Thus we are here using a $\sqrt{M}$-ary PAM (not an M-ary PAM) on each pulse.

275
(Refer Slide Time: 11:29)

$$a_1, a_2 \in \sqrt{M}\text{-ary PAM:} \qquad a_j = \pm(2i+1)A, \quad 0 \leq i \leq \frac{\sqrt{M}}{2}-1, \quad j \in \{1,2\}$$

This naturally means that √𝑀 has to be an even integer.

(Refer Slide Time: 12:38)

For example, for 64-QAM we have

276
𝑀 = 64

√𝑀 = 8

√𝑀
=4
2

Each PAM constellation comprises of 8 units.

(Refer Slide Time: 13:39)

Total number of bits in this scheme is given by,

𝑇𝑜𝑡𝑎𝑙⁡#⁡𝑏𝑖𝑡𝑠⁡/𝑠𝑦𝑚𝑏𝑜𝑙⁡ = log 2 𝑀 = log 2 64

=6

This means there are 3 bits for each PAM (23 = 8)

277
(Refer Slide Time: 15:59)

𝐸𝑠
𝐴𝑣𝑒𝑟𝑎𝑔𝑒⁡𝑠𝑦𝑚𝑏𝑜𝑙⁡𝑒𝑛𝑒𝑟𝑔𝑦⁡𝑝𝑒𝑟⁡𝑎1 =
2

𝐸𝑠
𝐴𝑣𝑒𝑟𝑎𝑔𝑒⁡𝑠𝑦𝑚𝑏𝑜𝑙⁡𝑒𝑛𝑒𝑟𝑔𝑦⁡𝑝𝑒𝑟⁡𝑎2 =
2

Each constituent PAM has half the symbol energy.

(Refer Slide Time: 17:25)

Each PAM is a √𝑀-ary PAM with distance 2A between levels.

278
$$\Rightarrow \frac{E_s}{2} = \frac{A^2}{3}\left(\left(\sqrt{M}\right)^2 - 1\right) = \frac{A^2}{3}(M-1)$$

(Refer Slide Time: 17:51)

(Refer Slide Time: 19:15)

$$\Rightarrow A = \sqrt{\frac{3E_s}{2(M-1)}}$$

Here we have E_s over 2 equal to A square by 3 times (M minus 1), which implies, the other way round, that A equals the square root of 3 E_s divided by 2(M minus 1), and this is important.

279
So, the amplitude A, or the level spacing 2A, for each constituent square root of M-ary PAM is chosen as the square root of 3 E_s by 2(M minus 1). This ensures, and remember this is the constraint here, an average energy of E_s for the M-ary QAM. So, what we have done here is introduce a different modulation scheme which is a generalized version of PAM, and in fact a generalization of several schemes: QPSK is a generalization of BPSK, PAM is also a generalization of BPSK, and QAM is essentially a generalization of all of these modulation schemes.

So, it is a combination of two constituent PAM (pulse amplitude modulation) constellations, each containing square root of M symbols; therefore the QAM is M-ary. M-ary QAM is formed from two constituent square root of M-ary PAMs, one transmitted on the cosine 2 pi f_c t pulse and the other on the sine 2 pi f_c t pulse, with both pulses in quadrature. So, it is quadrature amplitude modulation, whereas previously we had pulse amplitude modulation.

Now, we are using two pulses which are in quadrature, and each constituent PAM has square root of M symbols. What we have derived is that each constituent PAM must have an average symbol energy of E_s by 2, which ensures energy E_s for the M-ary QAM, and from this we have derived the value of the constant A for each constituent square root of M-ary PAM such that the average symbol energy of the overall QAM constellation is E_s.

So, we will stop here; in the next module we will look at the appropriate receiver processing, the resulting bit error rate, and the overall symbol error rate for the QAM constellation.

Thank you very much.

280
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 23
M-ary QAM (Quadrature Amplitude Modulation) - Part II, Optimal Decision Rule,
Probability of Error, Constellation Diagram

Hello, welcome to another module in this massive open online course. We are looking at M-ary QAM, that is, M-ary quadrature amplitude modulation, and we are about to start our discussion on the receiver, the receive processing and the bit error rate for M-ary QAM. But before we do that, recall that in the previous module we derived the value of the amplitude A.

(Refer Slide Time: 00:34)

3𝐸𝑠
𝐴=√
2(𝑀 − 1)

281
(Refer Slide Time: 00:51)

The transmitted signal is given by,

𝑥(𝑡) = 𝑎1 𝑝1 (𝑡) + 𝑎2 𝑝2 (𝑡)

Where, 𝑝1 (𝑡) and 𝑝2 (𝑡) are orthonormal functions (pulses)

(Refer Slide Time: 01:37)

𝑦(𝑡) = 𝑥(𝑡) + 𝑛(𝑡)

= 𝑎1 𝑝1 (𝑡) + 𝑎2 𝑝2 (𝑡) + 𝑛(𝑡)

282
Here, 𝑛(𝑡) is AWGN with autocorrelation 𝜂𝛿(𝜏)/2. The receive filter will be similar to
that of QPSK. i.e., we will have two filters ℎ1 (𝑡) and ℎ2 (𝑡) with ℎ1 (𝑡) = 𝑝1 (𝑇 − 𝑡) and
ℎ2 (𝑡) = 𝑝2 (𝑇 − 𝑡).

(Refer Slide Time: 03:54)

The output of the filter after sampling will be a combination of two functions,

r_1(T) = a_1 + ñ_1

r_2(T) = a_2 + ñ_2

The above is for the case of unit pulse energy (𝐸𝑝 = 1)

283
(Refer Slide Time: 06:57)

ñ_1 and ñ_2 are Gaussian noise samples with mean zero and variance η/2.

(Refer Slide Time: 07:28).

284
(Refer Slide Time: 08:06)

Now consider the decision rules. The decision for a_1 will be based on r_1(T) and the decision for a_2 on r_2(T). These are independent PAM symbols, each drawn from a √M-ary PAM (not an M-ary PAM).

(Refer Slide Time: 09:38)

Decision rule will be similar to that of PAM.

285
Interior points: if 0 ≤ r_i(T) < 2A, decide a_i = A, and similarly for the other interior intervals of width 2A.

End points: if r_i(T) ≥ (√M − 2)A, decide a_i = (√M − 1)A; if r_i(T) < −(√M − 2)A, decide a_i = −(√M − 1)A.

In the above expression i ∈ {1, 2}, i.e., the same rule is applied separately to each of the two sampled outputs.

(Refer Slide Time: 13:51)

286
(Refer Slide Time: 14:04)

Simply, we can state that we choose 𝑎𝑖 as the closest constellation point in the √𝑀-ary
PAM to 𝑟𝑖 (𝑇).
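
A minimal Python sketch of this nearest-point decision for one constituent PAM is given below; the function name and the example values (8 levels, A = 1) are illustrative assumptions, not from the lecture:

def pam_decide(r, L, A):
    # Return the L-level PAM amplitude closest to the sampled output r.
    # Levels are +/-A, +/-3A, ..., +/-(L-1)A, with L = sqrt(M).
    levels = [(2 * i + 1 - L) * A for i in range(L)]
    return min(levels, key=lambda a: abs(r - a))

# Example: 8-level PAM (constituent of 64-QAM), A = 1
print(pam_decide(2.4, 8, 1.0))   # -> 3.0  (nearest of -7, -5, ..., 7)
print(pam_decide(-9.1, 8, 1.0))  # -> -7.0 (end point)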

Now, the probability of error; remember the probability of the error for each PAM.

(Refer Slide Time: 16:12)

Probability of error for a √𝑀 −ary PAM will be,

287
P_e = 2 (1 − 1/√M) Q( A / √(η/2) )

This can be obtained by replacing M by √M in the expression for the probability of error of M-ary PAM.

(Refer Slide Time: 17:04)

(Refer Slide Time: 18:47)

Here we substitute the value of A.

288
A = √( 3Es / (2(M − 1)) )

(Refer Slide Time: 19:42)

The above value of A ensures average energy 𝐸𝑠 for the overall M-ary QAM.

(Refer Slide Time: 21:02)

289
Substituting the above gives us,

P_e = 2 (1 − 1/√M) Q( √( 3Es / (η(M − 1)) ) )

(Refer Slide Time: 22:01)

This is the probability of error for each constituent √𝑀-ary PAM.

(Refer Slide Time: 23:13)

290
The overall error rate for M-ary QAM is given by,

P_e,QAM = 1 − (1 − P_e,PAM)²

= 2 P_e,PAM − P_e,PAM²

For high SNRs, P_e,PAM ≪ 1, and we have,

P_e,QAM ≈ 2 P_e,PAM

(Refer Slide Time: 24:33)

(Refer Slide Time: 24:59)

291
Dropping the approximation sign for simplicity, we have,

P_e,QAM = 4 (1 − 1/√M) Q( √( 3Es / (η(M − 1)) ) )
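
As a rough numerical illustration, the following Python sketch evaluates the constituent-PAM error rate and combines it via P_e,QAM = 2P_e,PAM − P_e,PAM²; the value Es/η = 100 (20 dB) is illustrative, not from the lecture:

import math

def Q(x):
    # Gaussian Q-function via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2))

def qam_ser(M, Es_over_eta):
    # Symbol error rate of square M-ary QAM built from two sqrt(M)-ary PAMs
    pe_pam = 2 * (1 - 1 / math.sqrt(M)) * Q(math.sqrt(3 * Es_over_eta / (M - 1)))
    return 2 * pe_pam - pe_pam ** 2        # ~ 2 * pe_pam at high SNR

for M in (4, 16, 64):
    print(M, qam_ser(M, Es_over_eta=100))  # Es/eta = 100, illustrative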

(Refer Slide Time: 26:54)

We will now look at a simple constellation diagram for M-ary QAM. Assume M=16. This
means 𝑎1 and 𝑎2 belong to 4-PAM.

(Refer Slide Time: 27:40)

292
𝑎1 𝜖 {−3𝐴, −𝐴, 𝐴, 3𝐴}

𝑎2 𝜖 {−3𝐴, −𝐴, 𝐴, 3𝐴}

(Refer Slide Time: 28:20)

These can be represented in an x–y plane format, taking the x-axis as the a_1 axis and the y-axis as the a_2 axis. The points then form a constellation of 16 points (4 levels on each axis), as in the figure above. Note that such a QAM constellation always forms a square grid.

(Refer Slide Time: 29:36)

293
(Refer Slide Time: 31:02)

Now the number of bits for 16-QAM will be,

𝑀 = 16

#𝑏𝑖𝑡𝑠 = log 2 16

= 4 𝑏𝑖𝑡𝑠

Thus there will be 2 bits on each PAM component.

(Refer Slide Time: 32:17)

294
In the QAM scheme we can increase M in order to increase the bit-rate without increasing
the symbol transfer rate.

There are several different possibilities: M = 4 (which is QPSK), 16-QAM, 64-QAM, 256-QAM, 1024-QAM and so on. M = 4 means 4 symbols, i.e., 2 bits per symbol, one bit each on the in-phase and quadrature components; this is of course QPSK. The number of bits per symbol is log_2 M, giving 2, 4, 6, 8, 10 bits, and continuing up to 4096-QAM gives 12 bits. So the number of bits per symbol increases with M, as shown in the sketch below.
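
A small Python sketch of this bookkeeping (the symbol rate of 10 Msym/s anticipates the example discussed below and is illustrative):

import math

symbol_rate = 10e6  # symbols per second, illustrative
for M in (4, 16, 64, 256, 1024, 4096):
    bits_per_symbol = int(math.log2(M))
    print(M, bits_per_symbol, symbol_rate * bits_per_symbol / 1e6, "Mbps")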

(Refer Slide Time: 35:10)

We can choose higher order modulation adaptively in order to meet the rate requirements.

That is, we combine higher order modulation with adaptive modulation.

295
(Refer Slide Time: 35:53)

Higher order and adaptive modulation are important to achieve higher data rates in 3G, 4G
and 5G.

(Refer Slide Time: 36:56)

The use of higher order modulation is one of the main strategies that can be employed to multiply the bit rate several times at the same symbol rate. For instance, with a symbol rate of 10 mega-symbols per second, 4-QAM (i.e., QPSK) at 2 bits per symbol gives a data rate of 20

296
megabits per second (20 Mbps), whereas 1024-QAM at 10 bits per symbol gives a data rate of 100 Mbps. Going from 20 Mbps to 100 Mbps is a factor of 5 increase in data rate.

For example, 20 Mbps might be a 3G data rate and 100 Mbps a 4G data rate: 3G systems typically offer data rates of a few Mbps, or at most 10 Mbps, while 4G systems can reach 100 Mbps and in fact more, up to about 200 to 300 Mbps. The achievable rate depends on the modulation scheme used, along with several other technologies. The modulation is chosen adaptively depending on the rate required by the user and on the channel condition, i.e., whether the wireless channel is able to support such a data rate; adaptive modulation thus allows the system to deliver the maximum possible data rate to each user.

(Refer Slide Time: 39:24)

So, that increases the data rate significantly as we go from 3G to 4G to 5G wireless technologies, and QAM is a very general modulation format. To summarize, QAM is one of the simplest, most powerful and most flexible schemes, and it is one of the most widely used modulation schemes.

297
So, in this module we have comprehensively covered QAM: the constellation, the structure of the transmitted waveform, the processing at the receiver, the associated probability of error for M-ary QAM, and the importance of QAM in modern wireless technologies. We will stop here and continue with the other aspects in subsequent modules.

Thank you very much.

298
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 24
M-ary PSK (Phase Shift Keying) – Part I, Introduction, Transmitted
Waveform, Constellation Diagram

Hello. Welcome to another module in this massive open online course. Today in this module we are going to look at a different modulation scheme, M-ary PSK or M-ary phase shift keying. Similar to M-ary QAM (quadrature amplitude modulation), which has M symbols, M-ary PSK also has M symbols, and it is a phase shift keying based digital modulation scheme.

(Refer Slide Time: 00:41)

299
(Refer Slide Time: 01:30)

(Refer Slide Time: 01:57)

300
(Refer Slide Time: 02:06)

This is another digital modulation scheme and again this also generalizes the concept of
binary phase shift keying which has 2 phases.

(Refer Slide Time: 03:01)

M-ary PSK generalizes the concept of BPSK to M phases.

301
(Refer Slide Time: 04:48)

For M-ary PSK, the 360° circle is divided into M equal parts; the M possible phases are,

0, 2π/M, 4π/M, …, 2π(M − 1)/M

(Refer Slide Time: 06:24)

In short, we can write,

(i + 1)-th phase = 2πi/M,   0 ≤ i ≤ M − 1

302
(Refer Slide Time: 07:18)

The scheme reduces to BPSK for M=2, where phases are 0 and 𝜋.

(Refer Slide Time: 08:35)

The M phases can be represented on a circle of radius A. The first phase (i=0) will be at
the x-axis and the second will be at 2𝜋/𝑀, then 4𝜋/𝑀 and so on.

303
(Refer Slide Time: 11:10)

The i-th point has projection A cos(2πi/M) on the x-axis and A sin(2πi/M) on the y-axis.

(Refer Slide Time: 12:05)

So we can now have two separate symbols,

a_1 = A cos(2πi/M)

a_2 = A sin(2πi/M)

304
(Refer Slide Time: 13:09)

The transmitted signal is given by,

𝑥(𝑡) = 𝑎1 𝑝1 (𝑡) + 𝑎2 𝑝2 (𝑡)

Where 𝑝1 (𝑡) and 𝑝2 (𝑡) are orthonormal pulses given by,

p_1(t) = √(2/T) cos(2πf_c t),   0 ≤ t ≤ T

p_2(t) = √(2/T) sin(2πf_c t),   0 ≤ t ≤ T

Where 𝑇 = 𝑘/𝑓𝑐 .

𝑎1 and 𝑎2 belong to the above constellation,

a_1 ∈ { A cos(2πk/M),  k = 0, 1, 2, …, M − 1 }

a_2 ∈ { A sin(2πk/M),  k = 0, 1, 2, …, M − 1 }

305
(Refer Slide Time: 13:37)

(Refer Slide Time: 15:10)

306
(Refer Slide Time: 16:06)

The kth constellation point is given by,

S_k = ( A cos(2πk/M), A sin(2πk/M) )
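
A minimal Python sketch (the function name and the choice M = 8, A = 1 are illustrative) that generates these constellation points and checks that the average symbol energy is A²:

import math

def psk_constellation(M, A=1.0):
    # k-th point S_k = (A cos(2*pi*k/M), A sin(2*pi*k/M)), k = 0..M-1
    return [(A * math.cos(2 * math.pi * k / M),
             A * math.sin(2 * math.pi * k / M)) for k in range(M)]

pts = psk_constellation(8)                       # 8-PSK, A = 1 (illustrative)
print(pts[1])                                    # second phase, at 2*pi/8
print(sum(x * x + y * y for x, y in pts) / 8)    # average symbol energy = A^2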

(Refer Slide Time: 17:06)

Average symbol energy is given by,


Average symbol energy = E{ ∫₋∞^∞ x²(t) dt }

307

= E{ ∫₋∞^∞ ( a_1 p_1(t) + a_2 p_2(t) )² dt }

(Refer Slide Time: 18:39)

Since the inner product of p_1(t) and p_2(t) is zero and each pulse has unit energy (E_p = 1), this becomes

= 𝐸{𝑎12 𝐸𝑝 + 𝑎22 𝐸𝑝 }

= 𝐸{𝑎12 + 𝑎22 }

(Refer Slide Time: 19:57)

308
= E{ (A cos(2πk/M))² + (A sin(2πk/M))² }

= A²

(Refer Slide Time: 21:29)

(Refer Slide Time: 21:59)

If we set the average symbol energy to be Es, we will have,

A = √Es

309
(Refer Slide Time: 22:55)

Thus the radius of the circle for a fixed symbol energy Es will be √Es.

(Refer Slide Time: 23:51)

So, we will stop here and look at the receiver design, the detection, the ensuing probability of error and other aspects in subsequent modules.

Thank you very much.

310
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 25
M-ary PSK (Phase Shift Keying) – Part II, Optimal Decision Rule, Nearest
Neighbour Criterion, Approximate Probability of Error

Hello, welcome to another module in this massive open online course. We are looking at M-ary PSK modulation, and we have said that in M-ary PSK the constellation points are arranged on a circle of radius A = √Es.

(Refer Slide Time: 00:37)

Each symbol is given by,

S_k = ( √Es cos(2πk/M), √Es sin(2πk/M) )

311
(Refer Slide Time: 01:22)

Now we will look at receiver processing for M-ary PSK,

𝑥(𝑡) = 𝑎1 𝑝1 (𝑡) + 𝑎2 𝑝2 (𝑡)

𝑦(𝑡) = 𝑥(𝑡) + 𝑛(𝑡)

(Refer Slide Time: 02:21)

The sampled receiver statistics will be given by,

r_1(T) = a_1 E_p + ñ_1 = a_1 + ñ_1

312
r_2(T) = a_2 E_p + ñ_2 = a_2 + ñ_2

Here ñ_1 and ñ_2 are the filtered additive white Gaussian noise samples, and the last equality uses E_p = 1.

(Refer Slide Time: 05:20)

The pair (a_1, a_2) lies on a circle of radius A = √Es; similarly, we can represent the received pair as a point r(T) = (r_1(T), r_2(T)) in the same plane.

(Refer Slide Time: 06:53)

The decision rule will be to look for the nearest constellation point from 𝑟(𝑇)

313
(Refer Slide Time: 07:54)

i.e., we choose the constellation point S_k such that the distance ‖r(T) − S_k‖ is minimized, that is,

min over k of ‖r(T) − S_k‖

(Refer Slide Time: 09:15)

314
(Refer Slide Time: 09:55)

Or,

min over k of √( ( r_1(T) − √Es cos(2πk/M) )² + ( r_2(T) − √Es sin(2πk/M) )² )

(Refer Slide Time: 12:25)

This minimization does not lead to an exact closed-form analytical expression for the error rate, so we will employ an indirect, approximate approach.

315
(Refer Slide Time: 14:23)

Consider QPSK, each point in the constellation has 2 nearest neighbours and distance
between them is 2√𝐸𝑏 .

(Refer Slide Time: 16:14)

We can use an approximate (nearest-neighbour based) expression for the SER,

P_e ≈ α Q( d_min / √(2η) )

316
Here 𝛼 is the number of nearest neighbours and 𝑑𝑚𝑖𝑛 is the distance between such
neighbours.

(Refer Slide Time: 17:36)

(Refer Slide Time: 18:10)

In M-ary PSK a constellation point will have two nearest neighbours and the distance will
be,

d_min = 2√Es sin(π/M)

317
This follows from the simple expression for the length of a chord: adjacent constellation points subtend an angle of 2π/M at the centre of a circle of radius √Es, so the chord length is 2√Es sin(π/M).

(Refer Slide Time: 19:04)

(Refer Slide Time: 21:55)

318
(Refer Slide Time: 23:41)

Hence, we can write,

P_e ≈ 2 Q( 2√Es sin(π/M) / √(2η) )

= 2 Q( √( 2 Es sin²(π/M) / η ) )

This is an approximate expression for the SER but is a very tight approximation (or very
close to the actual value) for high values of 𝐸𝑠 /𝜂.
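
A short Python sketch of this approximation (the value Es/η = 100 is illustrative, not from the lecture):

import math

def Q(x):
    # Gaussian Q-function via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2))

def psk_ser(M, Es_over_eta):
    # Nearest-neighbour approximation 2*Q(sqrt(2*Es/eta)*sin(pi/M))
    return 2 * Q(math.sqrt(2 * Es_over_eta) * math.sin(math.pi / M))

for M in (4, 8, 16):
    print(M, psk_ser(M, Es_over_eta=100))   # Es/eta = 100, illustrative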

319
(Refer Slide Time: 24:42)

Thank you very much.

320
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 26
Introduction to Information Theory, Relevance of Information Theory,
Characterization of Information

Hello, welcome to another module in this massive open online course. In this module we will change tracks and start looking at a very different and extremely important topic for understanding the performance and behaviour of communication systems, which is termed information theory.

(Refer Slide Time: 00:35)

Information theory is a very fundamental milestone in the understanding of the behaviour


of communication systems and also in characterizing communication systems. There are
applications of information theory in multiple fields, including finance, economics, stocks
etc. However, in this course we will focus only on the fundamentals of information theory
and its applications in the field of communication systems.

321
(Refer Slide Time: 01:18)

(Refer Slide Time: 02:49)

Let us look at the basic schematic of a digital communication system. It consists of a source
which generates a message signal 𝑚(𝑡) which is then converted into a stream of bits using
an encoder.

These bits are now transmitted over a channel and require a certain bandwidth. Naturally,
we would like to minimize the usage of resources in the communication system, and thus minimize the bandwidth required for a particular message. We want to convey the maximum amount of information using the minimum amount of resources. So we want

322
to encode a certain amount of information in the minimum number of bits possible.
Information theory helps us systematically achieve this, by helping in maximizing the
information content per bit.

(Refer Slide Time: 05:56)

(Refer Slide Time: 07:02)

323
(Refer Slide Time: 08:18)

(Refer Slide Time: 09:45)

We know how to characterize physical quantities such as weight, height, even colour. But how does one characterize an abstract quantity such as information? Information theory helps us achieve exactly that: it helps us quantitatively and objectively characterize information, i.e., it provides a tangible measure of information and a quantitative, mathematical framework to characterize it.

324
(Refer Slide Time: 10:51)

(Refer Slide Time: 13:28)

325
(Refer Slide Time: 14:37)

A central notion in information theory is the concept of an information source. i.e., the
information to be encoded is generated by a source.

(Refer Slide Time: 16:41)

Let us assume we have a discrete source that generates symbols from the set S = {S_0, S_1, S_2, S_3, …, S_{M−1}}. The set S is known as the source alphabet.

326
(Refer Slide Time: 17:40)

(Refer Slide Time: 18:35)

327
(Refer Slide Time: 19:36)

Each symbol will have a certain probability,

Pr(𝑆𝑖 ) = 𝑃𝑖

For a continuous source, the output takes values in a continuum and is described by a continuous probability density function; but here we are considering a discrete source. From probability theory we have,

P_i ≥ 0

Σ_{i=0}^{M−1} P_i = 1

328
(Refer Slide Time: 20:31)

(Refer Slide Time: 21:26)

Let us now characterize the information content of a symbol. The information content is given by,

I(S_i) = log_2 (1 / P_i)
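
A one-line Python sketch of this definition (the probability values are illustrative):

import math

def information(p):
    # Information content (in bits) of a symbol of probability p
    return math.log2(1.0 / p)

for p in (0.5, 0.25, 0.125, 0.01):
    print(p, information(p))   # rarer symbols carry more bits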

329
(Refer Slide Time: 22:24)

(Refer Slide Time: 24:38)

330
(Refer Slide Time: 25:37)

This means higher the probability of occurrence of a symbol, lower is its information
content.

𝐼𝑓 𝑃𝑖 → 1 ⇒ 𝐼(𝑠𝑖 ) → 0
{
𝐼𝑓 𝑃𝑖 → 0 ⇒ 𝐼(𝑠𝑖 ) → ∞

(Refer Slide Time: 27:38)

331
This is intuitive as well. Think of news: rare events make news, regular events do not. Thus, rare events contain more information than regular ones. 1/P_i is a measure of the infrequency or rarity of an event; thus, the rarer an event, the higher the value of 1/P_i and the higher the amount of information it contains.

(Refer Slide Time: 30:23)

We will start with this definition and keep refining it progressively, looking at various other measures to characterize the information content of a source, and at the relevance of all these aspects, and indeed of the whole area of information theory, to communications in our future modules.

Thank you very much.

332
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 27
Definition of Entropy, Average Information/ Uncertainty of Source, Properties of
Entropy

Hello, welcome to another module in this massive open online course. In this module let us start looking at a new topic, entropy, which is one of the most fundamental quantities of information theory and which we are going to define shortly.

(Refer Slide Time: 00:42)

We have seen in our discussion that I(S_i), the information of the symbol S_i of the source alphabet, is log_2(1/P_i) bits, where P_i is the probability of the symbol S_i. We have also seen that I(S_i) tends to infinity as the probability P_i tends to 0, and that I(S_i) tends to 0 for certain events, i.e., when the event occurs with probability 1. So, frequently occurring events carry less information, while rare events carry a higher amount of information.

333
(Refer Slide Time: 02:14)

Now, the average information of the source S can be defined as the expected value of I(S), which is the summation over i from 0 to M − 1 of P_i times I(S_i), i.e., the probability of symbol i times the information associated with symbol i.

(Refer Slide Time: 03:33)

334
So, we are computing a weighted average,

H(S) = Σ_{i=0}^{M−1} P_i log_2 (1/P_i)

This quantity, the expected value of the information per symbol, is a fundamental quantity of information theory and is termed the entropy of the source.

(Refer Slide Time: 04:57)

So, H(S) is termed the entropy of the source S. It is obtained by weighting the information of each symbol, I(S_i), by its probability of occurrence P_i and summing over the entire source alphabet, i.e., it is the expected value of the information per symbol. This quantity is fundamental in information theory: it characterizes, or quantifies, the average information content of the source, that is, the average information per symbol emitted (or generated) by the source. Entropy can also be thought of as the average uncertainty associated with the source.

335
(Refer Slide Time: 07:16)

It is the average uncertainty associated with the source: more uncertainty implies more randomness, and more uncertainty implies more information. If there is greater uncertainty about the next symbol that the source is going to generate, this captures the fact that there is greater information in the symbols generated by the source. Conversely, if we know with great certainty what the next symbol is going to be, then on average there is less information content in the symbols generated by the source.

We will see this as we proceed through subsequent examples of entropy.

336
(Refer Slide Time: 09:10)

Now, let us start by looking at some simple properties of entropy. From the definition, since 0 ≤ P_i ≤ 1, we have 1/P_i ≥ 1, which implies log_2(1/P_i) ≥ 0. Thus both P_i and log_2(1/P_i) are non-negative quantities.

337
(Refer Slide Time: 10:33)

Therefore P_i log_2(1/P_i) ≥ 0 for each i, so the entropy is a sum of non-negative terms, which implies H(S) ≥ 0. Entropy is non-negative: it can be zero (in fact, we will see examples where the entropy is 0), but it cannot be negative. This also means that the information content of a source cannot be negative; it is either 0 or greater than 0.

(Refer Slide Time: 11:50)

338
That is, the information content of a source is greater than or equal to 0.

(Refer Slide Time: 12:16)

Further, observe that if P_i = 1, then P_i log_2(1/P_i) = 1 × log_2(1) = 0. Now, what happens if P_i = 0? Consider the quantity P_i log_2(1/P_i), which can be written as log_2(1/P_i) divided by 1/P_i.

339
(Refer Slide Time: 13:34)

Both the numerator and the denominator of this expression tend to infinity as P_i tends to 0, so the expression is indeterminate at P_i = 0. Let us therefore look at the limit as P_i tends to 0; since both the numerator and the denominator tend to infinity, we can use L'Hôpital's rule from calculus to evaluate the limit.

(Refer Slide Time: 14:29)

340
So, let us calculate the limit as P_i tends to 0 of log_2(1/P_i) / (1/P_i), writing the numerator as −log_2 P_i.

(Refer Slide Time: 15:06)

Since direct substitution of 0 gives an indeterminate form, we differentiate the numerator and the denominator with respect to P_i; this is L'Hôpital's rule from calculus for evaluating the limits of precisely such indeterminate expressions.

341
(Refer Slide Time: 16:18)

Differentiating, the numerator −log_2 P_i = −(log_2 e) ln P_i has derivative −(log_2 e)/P_i, while the denominator 1/P_i has derivative −1/P_i². Taking the ratio, the P_i² from the denominator moves to the numerator, and the limit becomes

lim_{P_i → 0} P_i log_2 e = 0

(Refer Slide Time: 18:03)

342
Since we cannot directly substitute P_i = 0 into P_i log_2(1/P_i), we use the limit: as P_i tends to 0, P_i log_2(1/P_i) also tends to 0.

Recall also that P_i log_2(1/P_i) = 0 for certain events, i.e., for P_i = 1. So, both for events that occur with probability one and for events whose probability tends to 0 (the rarest of rare events), the average information contribution P_i log_2(1/P_i) is 0. This can be understood as follows: events that occur very frequently, with probability close to one, naturally have information close to 0, so their average contribution is 0; and although the information associated with rare events (probability close to 0) is very high, they occur so infrequently that their average contribution is also 0. So P_i log_2(1/P_i) is 0 when P_i = 1 and also tends to 0 as P_i tends to 0; that is an important point.

(Refer Slide Time: 20:19)

343
So, what we have seen is that P_i log_2(1/P_i) is close to 0 both for very common symbols (P_i approximately equal to 1) and for very rare symbols (P_i approximately equal to 0).

So, in this module we have defined a very important quantity, the entropy, which is the average information associated with each symbol of the source, i.e., the expected value of the per-symbol information I(S_i). It characterizes the uncertainty associated with the source and is used to characterize or quantify the information content of a particular source. We have also seen that P_i log_2(1/P_i) equals 0 when P_i = 1 and tends to 0 when P_i is very close to 0.

So, we will stop here and continue with other aspects in the subsequent modules.

Thank you very much.

344
Principles of Communication Systems – Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 28
Entropy Example – Binary Source, Maximum and Minimum Entropy of Binary
Source

Hello, welcome to another module in this massive open online course. We are looking at various aspects of information theory: we have defined the concept of entropy and seen various properties of the entropy of a source, which characterizes the average information per symbol, or equivalently the average uncertainty, associated with the source. Let us now look at an example to understand this better.

(Refer Slide Time: 00:48)

Consider a binary source (two symbols) with the alphabet,

𝑆 = {𝑆𝑜 , 𝑆1 }

The probabilities be,

Pr(𝑆𝑜 ) = 𝑝

345
⇒ Pr(𝑆1 ) = 1 − 𝑝 (𝑠𝑖𝑛𝑐𝑒 𝑠𝑢𝑚 𝑜𝑓 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑖𝑒𝑠 𝑖𝑠 𝑜𝑛𝑒)

(Refer Slide Time: 01:51)

(Refer Slide Time: 02:20)

Entropy of the binary source can then be defined as,

1 1
𝐻(𝑠) = Pr(𝑆𝑜 ) log 2 + Pr(𝑆1 ) log 2
Pr(𝑆𝑜 ) Pr(𝑆1 )

1 1
𝐻(𝑠) = 𝑝 log 2 + (1 − 𝑝) log 2
𝑝 1−𝑝

346
(Refer Slide Time: 03:40)

(Refer Slide Time: 04:39)

If we replace p by 1-p,

The expression remains the same

1 1
𝐻(1 − 𝑝) = (1 − 𝑝) log 2 + (1 − (1 − 𝑝)) log 2
1−𝑝 1 − (1 − 𝑝)

1 1
= (1 − 𝑝) log 2 + 𝑝 log 2
1−𝑝 𝑝

347
Now let us find the entropy for p = 0,

1 1
lim 𝐻(𝑝) = lim 𝑝 log 2 + (1 − 𝑝) log 2
𝑝→0 𝑝→0 𝑝 1−𝑝

1
= lim 𝑝 log 2 ( ) = 0 (𝑓𝑟𝑜𝑚 𝑝𝑟𝑒𝑣𝑖𝑜𝑢𝑠 𝑚𝑜𝑑𝑢𝑙𝑒)
𝑝→0 𝑝

Thus we have,

lim 𝐻(𝑠) = 0
𝑝→0

(Refer Slide Time: 06:32)

348
(Refer Slide Time: 07:40)

(Refer Slide Time: 08:16)

Since, H(p)=H(1-p), we can also write,

lim 𝐻(𝑝) = lim 𝐻(1 − 𝑝)


𝑝→1 𝑝→1

= lim 𝐻(𝑝) = 0
𝑝→0

349
(Refer Slide Time: 10:07)

Now p = 0 means Pr(s_0) = 0 and Pr(s_1) = 1, i.e., the source always generates s_1. Similarly, for p = 1 the source always generates s_0.

In both cases there is no uncertainty, and hence the information content, and thereby the entropy, is zero.

(Refer Slide Time: 12:40)

350
(Refer Slide Time: 13:07)

(Refer Slide Time: 14:12)

Let’s look at the maximum value of entropy,

𝑑 𝐻(𝑝)
𝐻 ′ (𝑝) =
𝑑𝑝

𝑑
= (−𝑝 log 2 𝑝 − (1 − 𝑝) log 2 (1 − 𝑝))
𝑑𝑝

= − log 2 𝑝 + log 2 (1 − 𝑝)

351
(Refer Slide Time: 15:21)

(Refer Slide Time: 16:59)

In order to find the maxima, we equate the derivative to zero,

H'(p) = 0

log 2 𝑝 = log 2 (1 − 𝑝)

⇒ p = 1 − p

or, p = 1/2

352
The entropy is maximized when, both the symbols have equal probability, i.e.,

1
Pr(𝑠𝑜 ) = Pr(𝑠1 ) =
2

The second derivative of the entropy is given by,

H''(p) = d/dp H'(p)

= −(1/ln 2) ( 1/p + 1/(1 − p) ) ≤ 0

Hence the point is a maximum (since H'(p) = 0 and H''(p) < 0).

The maximum value of entropy will be given by,

H(p = 1/2) = (1/2) log_2 2 + (1/2) log_2 2 = 1
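
A minimal Python sketch of the binary entropy function (using the limit p log₂(1/p) → 0 at the end points), showing the maximum of 1 bit at p = 0.5:

import math

def H(p):
    # Binary entropy in bits; H(0) = H(1) = 0 by the limit p*log2(1/p) -> 0
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(p, H(p))   # maximum of 1 bit at p = 0.5, symmetric about 0.5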

(Refer Slide Time: 17:53)

353
(Refer Slide Time: 18:50)

(Refer Slide Time: 19:45)

354
(Refer Slide Time: 20:22)

(Refer Slide Time: 21:33)

355
(Refer Slide Time: 22:06)

We also have as we have seen before

𝐻(0) = 𝐻(1) = 0

Plotting H(p) as a function of p gives a concave function that is zero at p = 0 and p = 1 and achieves its maximum value of 1 at p = 0.5. Entropy is a measure of the number of bits required per symbol: for p = 0.5 we need one bit per symbol, while for other values of p, at least theoretically, we need less than 1 bit per symbol.

For instance, an entropy of less than 1 implies that, on average, a symbol can be represented using less than one bit. This is what the encoding process is expected to do: take the source symbols, here the two binary symbols S_0 and S_1, and come up with an information bit stream that represents them efficiently, using the least possible number of bits in an average sense, i.e., the smallest average number of bits per symbol. This process is termed coding or encoding, and the theory behind it is what we are going to look at in detail as we go through the rest of the modules of this course.

356
So, basically, we have seen an interesting example of a binary source with two symbols S_0 and S_1 having probabilities p and 1 − p. We looked at what happens at the extremes: at p = 0 and p = 1 there is no uncertainty and the information (entropy) is 0, while the maximum entropy occurs at p = 1/2. If you plot it, the entropy is a nice concave function that starts from 0 at p = 0, rises to 1 at p = 1/2 and falls back to 0 at p = 1.

So, we will stop here and explore other aspects and properties of entropy, and study its behaviour further, in subsequent modules.

Thank you very much.

357
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 29
Maximum Entropy of Source with M-ary Alphabets,
Concave/ Convex function, Jensen’s Inequality

Hello, welcome to another module in this massive open online course. We are looking at the entropy of a source and the various properties of entropy. Let us now look at the maximum entropy, that is, when the entropy is maximized.

(Refer Slide Time: 00:28)

So, let us look at entropy maximization, i.e., let us understand when the entropy is maximized.

358
(Refer Slide Time: 00:48)

Consider a source with an alphabet of size 4,

𝑆 = {𝑠𝑜 , 𝑠1 , 𝑠2 𝑠3 }

The probabilities of each symbol are given by, 𝑝𝑜 = 0.5,𝑝1 = 0.25, 𝑝2 = 𝑝3 = 0.125

(Refer Slide Time: 01:56)

The total entropy is given by,

H(S) = 0.5 log_2 (1/0.5) + 0.25 log_2 (1/0.25) + 2 × 0.125 log_2 (1/0.125) = 7/4

359
Thus, we have on average 1.75 bits per symbol.

(Refer Slide Time: 03:21)

We can assume a scheme of encoding,

𝑠𝑜 = 0; 𝑠1 = 10; 𝑠2 = 111, 𝑠3 = 110

The average number of bits per symbol,

= 1 × 0.5 + 2 × 0.25 + 3 × 2 × 0.125

= 1.75

The symbols that occur with smaller probability are assigned a larger number of bits (they carry a larger amount of information). There is a fundamental relation between the entropy and the average number of bits: in this example, the average number of bits per symbol equals the entropy, as illustrated in the sketch below. This relation will be explored more rigorously in future modules.
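
A minimal Python sketch of this example (the code table is the one given above), comparing the entropy with the average code length:

import math

probs = {"s0": 0.5, "s1": 0.25, "s2": 0.125, "s3": 0.125}
code  = {"s0": "0", "s1": "10", "s2": "111", "s3": "110"}  # code from the example

entropy = sum(p * math.log2(1 / p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(entropy, avg_len)   # both 1.75 bits per symbol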

360
(Refer Slide Time: 04:32)

(Refer Slide Time: 05:56)

(Refer Slide Time: 06:49)

361
Now suppose that each symbol in the alphabet is equiprobable, i.e.,

1
𝑝1 = 𝑝2 = 𝑝3 = 𝑝4 =
4

The average entropy is now given by,

H(S) = 4 × (1/4) log_2 4

= 2 bits per symbol

(Refer Slide Time: 07:18)

362
(Refer Slide Time: 08:42)

Let us now define a concave function. Consider a function F(x) (refer to the figure above), and let x_1 and x_2 be two points in the domain of the function. For a concave function, the chord joining the points (x_1, F(x_1)) and (x_2, F(x_2)) always lies below the function.

Equivalently, for a point x = θx_1 + (1 − θ)x_2 with 0 ≤ θ ≤ 1,

F(θx_1 + (1 − θ)x_2) ≥ θF(x_1) + (1 − θ)F(x_2)

Similarly, a function is convex when the reverse inequality holds (the chord lies above the function).

363
(Refer Slide Time: 11:17)

(Refer Slide Time: 12:35)

364
(Refer Slide Time: 13:29)

(Refer Slide Time: 14:05)

For a log function it can be easily observed that it is a concave function (see pict. above).

365
(Refer Slide Time: 15:31)

𝐹(𝑥) = log 2 𝑥

It can be shown that if the second derivative of a function is negative, then the function is concave.

For the log function,

F''(x) = (1/ln 2) ( −1/x² ) ≤ 0

(Refer Slide Time: 16:45)

366
For a concave function, suppose we substitute θ = p_1 and 1 − θ = p_2; then p_1 and p_2 can be interpreted as probabilities. Thus, we can write,

F(p_1 x_1 + p_2 x_2) ≥ p_1 F(x_1) + p_2 F(x_2)

(Refer Slide Time: 18:23)

Generalizing for M symbols we can write,

𝑀−1 𝑀−1

𝐹 ( ∑ 𝑝𝑖 𝑥𝑖 ) ≥ ∑ 𝑝𝑖 𝐹(𝑥𝑖 )
𝑖=0 𝑖=0

(Refer Slide Time: 20:21)

367
Now,

𝑀−1

∑ 𝑝𝑖 𝑥𝑖 = 𝐸{𝑥}
𝑖=0

(Refer Slide Time: 21:03)

Thus, we have,

𝐹(𝐸{𝑥}) ≥ 𝐸{𝐹(𝑥)}

This is known as Jensen’s inequality for concave function.
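
A quick numerical spot-check of Jensen's inequality for the concave function log₂, in Python (the random points and weights are illustrative):

import math
import random

# Spot-check: log2(E{x}) >= E{log2(x)} for probability weights p_i and x_i > 0
random.seed(0)
x = [random.uniform(0.5, 10.0) for _ in range(5)]
w = [random.random() for _ in range(5)]
p = [wi / sum(w) for wi in w]                            # normalize to probabilities

lhs = math.log2(sum(pi * xi for pi, xi in zip(p, x)))    # F(E{x})
rhs = sum(pi * math.log2(xi) for pi, xi in zip(p, x))    # E{F(x)}
print(lhs, rhs, lhs >= rhs)                              # True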

368
(Refer Slide Time: 22:20)

The inequality is reversed for a convex function.

Now, coming back to our problem, let us apply this to the entropy.

(Refer Slide Time: 22:47)

H(S) = Σ_{i=0}^{M−1} p_i log_2 (1/p_i) ≤ log_2 ( Σ_{i=0}^{M−1} p_i · (1/p_i) ) = log_2 M

369
(Refer Slide Time: 23:47)

Hence, we have,

𝐻(𝑠) ≤ log 2 𝑀

(Refer Slide Time: 24:35)

The maximum entropy occurs when,

Pr(s_i) = 1/M   ∀ i ∈ {0, 1, 2, …, M − 1}

i.e., all the symbols are equiprobable.

370
(Refer Slide Time: 26:01)

Thank you very much.

371
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 30
Joint Entropy - Definition of Joint Entropy of Two Sources, Simple Example for
Joint Entropy Computation

Hello, welcome to another module in this massive open online course. So far we have looked at different properties of entropy and at when the entropy is maximized; in this module let us start looking at another concept, that of joint entropy.

(Refer Slide Time: 00:30)

So, we will now start looking at a new concept, the joint entropy. You can think of this as the joint, or combined, information when you have multiple sources; we will talk more about this later.

372
(Refer Slide Time: 01:09)

For example, consider a communication channel with input X and output Y, having alphabets of sizes M and N respectively:

X ∈ {s_0, s_1, s_2, …, s_{M−1}},   Y ∈ {r_0, r_1, r_2, …, r_{N−1}}

It is important to study the properties of the information jointly contained in X and Y in order to understand the fundamental limits of a communication system.

(Refer Slide Time: 04:59)

373
Arranging both the input and the output in a tabular form we can have a 𝑀 × 𝑁 table, each
cell of the table will have an input symbol 𝑠𝑖 and output symbol 𝑟𝑗 and suppose it has a
probability 𝑝𝑖𝑗 .

(Refer Slide Time: 06:29)

𝑝𝑖𝑗 = Pr(𝑋 = 𝑠𝑖 ∩ 𝑌 = 𝑟𝑗 ) 0 ≤ 𝑖 ≤ 𝑀 − 1; 0 ≤ 𝑗 ≤ 𝑁 − 1

(Refer Slide Time: 07:50)

One can think of (𝑋, 𝑌) as a 2- dimensional source of symbols with total number of
symbols=MN.

374
(Refer Slide Time: 09:35)

The joint entropy can then be defined as,

H(X, Y) = Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} p_ij log_2 (1/p_ij)

Joint entropy can be defined as the information conveyed jointly by X and Y.

(Refer Slide Time: 10:24)

375
(Refer Slide Time: 11:57)

(Refer Slide Time: 12:46)

Assume the above 4x4 table for distribution of probabilities.

376
(Refer Slide Time: 15:05)

Total number of 2d symbols=4x4=16

(Refer Slide Time: 15:36)

For the distribution to be valid distribution,

Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} p_ij = 1

The total probability

377
= 1/4 + 2 × 1/8 + 6 × 1/16 + 4 × 1/32 = 1

(Refer Slide Time: 17:59)

Secondly,

Σ_{i=0}^{M−1} Pr(s_i, r_j) = Pr(r_j)

This is because 𝑠𝑜 , 𝑠1 , 𝑠2 , … … 𝑠𝑀−1 are mutually exclusive and exhaustive events.

(Refer Slide Time: 19:10)

378
(Refer Slide Time: 20:05)

(Refer Slide Time: 21:08)

379
(Refer Slide Time: 22:25)

Similarly,

Σ_{j=0}^{N−1} Pr(s_i, r_j) = Pr(s_i)

(Refer Slide Time: 24:45)

The joint entropy is given by,

H(X, Y) = Σ_i Σ_j p_ij log_2 (1/p_ij)

380
= 2 × (1/8) log_2 8 + 6 × (1/16) log_2 16 + (1/4) log_2 4 + 4 × (1/32) log_2 32

= 27/8 = 3.375 bits
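
The explicit 4×4 table is shown only on the slide; the arrangement used in the Python sketch below is an assumed one that is consistent with the stated probability counts (one 1/4, two 1/8, six 1/16, four 1/32) and with the conditional probabilities used in the later modules. It reproduces H(X, Y) = 3.375 bits:

import math

def Hbits(ps):
    # Entropy in bits of a list of probabilities (zeros contribute nothing)
    return sum(p * math.log2(1 / p) for p in ps if p > 0)

# One 4x4 joint distribution consistent with the stated counts;
# rows index X = s_i, columns index Y = r_j (assumed arrangement).
P = [[1/8,  1/16, 1/32, 1/32],
     [1/16, 1/8,  1/32, 1/32],
     [1/16, 1/16, 1/16, 1/16],
     [1/4,  0,    0,    0   ]]

H_XY = Hbits([p for row in P for p in row])
print(H_XY)   # 3.375 bits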

(Refer Slide Time: 26:10)

Adding up these contributions term by term indeed gives 27/8, which is equal to 3.375 bits.

381
(Refer Slide Time: 27:11)

𝐻(𝑋, 𝑌) = 3.375 𝑏𝑖𝑡𝑠

So, what we have done now is compute the joint entropy of X and Y, denoted by H(X, Y). In this module we motivated the notion of joint entropy by considering a simple example of a communication system in which we transmit symbols X and receive symbols Y; we would like to better understand the interplay between the information corresponding to the input X and the output Y. We defined the notion of joint entropy and computed it in a simple example. We will further explore the properties of joint entropy and their relevance in the context of communication in the subsequent modules.

Thank you very much.

382
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 31
Properties of Joint Entropy, Relation between Joint Entropy and Marginal
Entropies

Hello, welcome to another module in this massive open online course.

(Refer Slide Time: 00:27)

The joint entropy is given by,

1
𝐻(𝑋, 𝑌) = ∑ ∑ 𝑝𝑖𝑗 log 2
𝑝𝑖𝑗
𝑖 𝑗

383
(Refer Slide Time: 01:23)

(Refer Slide Time: 01:54)

In the previous module we saw an example with joint entropy of 3.375 with total 16
symbols.

384
(Refer Slide Time: 02:42)

The marginal probability is the probability of a particular symbol. The entropy of X is given by,

H(X) = H(1/4, 1/4, 1/4, 1/4)

= 4 × (1/4) log_2 4 = 2 bits

(Refer Slide Time: 03:53)

385
(Refer Slide Time: 04:14)

H(Y) = H(1/2, 1/4, 1/8, 1/8)

= (1/2) log_2 2 + (1/4) log_2 4 + 2 × (1/8) log_2 8

= 1.75 bits

(Refer Slide Time: 05:04)

386
(Refer Slide Time: 06:41)

H(X, Y) = 3.375 ≥ H(X) = 2

H(X, Y) = 3.375 ≥ H(Y) = 1.75

(Refer Slide Time: 07:48)

This property is intuitive as well, as the joint information or the combined information will
be greater than or equal to the individual information or,

𝐻(𝑋, 𝑌) ≥ 𝐻(𝑋)

387
(Refer Slide Time: 09:32)

More importantly, in the example we can see that,

H(X) + H(Y) = 2 + 1.75 = 3.75 > 3.375 = H(X, Y)

We will now attempt to show the following property in general,

𝐻(𝑋, 𝑌) ≤ 𝐻(𝑋) + 𝐻(𝑌)

388
H(X) = Σ_i Pr(s_i) log_2 (1/Pr(s_i))

= Σ_i Σ_j p_ij log_2 (1/p_i)   (by the total probability rule)

Similarly, we can write,

H(Y) = Σ_j p_j log_2 (1/p_j)

= Σ_j Σ_i p_ij log_2 (1/p_j)

(Refer Slide Time: 12:16)

389
(Refer Slide Time: 14:51)

(Refer Slide Time: 15:40)

Consider the expression,

𝐻(𝑋, 𝑌) − 𝐻(𝑋) − 𝐻(𝑌)

= Σ_i Σ_j p_ij log_2 (1/p_ij) − Σ_i Σ_j p_ij log_2 (1/p_i) − Σ_i Σ_j p_ij log_2 (1/p_j)

= Σ_i Σ_j p_ij log_2 ( p_i p_j / p_ij )

390
(Refer Slide Time: 15:55)

(Refer Slide Time: 17:28)

391
(Refer Slide Time: 18:22)

392
From Jensen's inequality we know that,

Σ_i p_i log_2 x_i ≤ log_2 ( Σ_i p_i x_i )

⇒ Σ_{i,j} p_ij log_2 ( p_i p_j / p_ij ) ≤ log_2 ( Σ_{i,j} p_ij ( p_i p_j / p_ij ) )

= log_2 ( Σ_{i,j} p_i p_j )

= log_2 ( Σ_i p_i Σ_j p_j )

= log_2 1 = 0

(Refer Slide Time: 20:00)

393
(Refer Slide Time: 21:23)

(Refer Slide Time: 22:18)

394
(Refer Slide Time: 24:15)

Hence, we have,

𝐻(𝑋, 𝑌) − 𝐻(𝑋) − 𝐻(𝑌) ≤ 0

⇒ 𝐻(𝑋, 𝑌) ≤ 𝐻(𝑋) + 𝐻(𝑌)

(Refer Slide Time: 25:15)

That is, the joint entropy is less than or equal to the sum of the entropies of the individual sources.

395
We will stop this module here and look at other aspects in the subsequent modules.

Thank you very much.

396
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 32
Conditional Entropy, Example of Conditional Entropy, Properties of Conditional
Entropy

Hello, welcome to another module in this massive open online course. So far we have seen
several concepts in information theory; we have seen the definition of entropy, joint
entropy.

(Refer Slide Time: 00:30)

397
(Refer Slide Time: 01:12)

The entropy of Y given X will be given by,

H(Y|X) = Σ_{i=0}^{M−1} Pr(X = s_i) H(Y|X = s_i)

= Σ_{i=0}^{M−1} Pr(X = s_i) Σ_{j=0}^{N−1} Pr(Y = r_j | X = s_i) log_2 ( 1 / Pr(Y = r_j | X = s_i) )

= Σ_i Σ_j Pr(s_i) Pr(r_j | s_i) log_2 ( 1 / Pr(r_j | s_i) )

= Σ_i Σ_j Pr(s_i, r_j) log_2 ( 1 / Pr(r_j | s_i) )

398
(Refer Slide Time: 02:09)

(Refer Slide Time: 03:35)

399
(Refer Slide Time: 05:14)

(Refer Slide Time: 06:49)

Now consider the expression,

𝐻(𝑌|𝑋) + 𝐻(𝑋)

= Σ_i Σ_j Pr(s_i, r_j) log_2 (1/Pr(r_j | s_i)) + Σ_i Pr(s_i) log_2 (1/Pr(s_i))

= Σ_i Σ_j Pr(s_i, r_j) log_2 (1/Pr(r_j | s_i)) + Σ_i Σ_j Pr(s_i, r_j) log_2 (1/Pr(s_i))

400
= Σ_i Σ_j Pr(s_i, r_j) log_2 ( 1 / (Pr(r_j | s_i) Pr(s_i)) )

= Σ_i Σ_j Pr(s_i, r_j) log_2 ( 1 / Pr(s_i, r_j) )

= 𝐻(𝑋, 𝑌)

Hence, we prove the relation,

𝐻(𝑋, 𝑌) = 𝐻(𝑌|𝑋) + 𝐻(𝑋)

By symmetry we can also write,

𝐻(𝑋, 𝑌) = 𝐻(𝑋|𝑌) + 𝐻(𝑌)

(Refer Slide Time: 07:40)

401
(Refer Slide Time: 08:24)

(Refer Slide Time: 09:03)

402
(Refer Slide Time: 10:18)

(Refer Slide Time: 11:06)

403
(Refer Slide Time: 13:15)

𝐻(𝑌|𝑋) represents the conditional uncertainty in Y after X has been observed.

(Refer Slide Time: 15:16)

Consider the 4×4 table of X and Y probabilities from the previous example (see pict. above).

404
(Refer Slide Time: 17:37)

P(Y|X = s_0) = P(Y = r_j | X = s_0)

= Pr(Y = r_j, X = s_0) / Pr(X = s_0)

(Refer Slide Time: 19:59)

Now, Pr(Y = r_0 | X = s_0) = (1/8)/(1/4) = 1/2. Similarly, Pr(Y = r_1 | X = s_0) = 1/4 and Pr(Y = r_2 | X = s_0) = 1/8 = Pr(Y = r_3 | X = s_0).

405
(Refer Slide Time: 21:10)

Hence, we will now have,

H(Y|X = s_0) = H(1/2, 1/4, 1/8, 1/8) = 1.75 bits

(Refer Slide Time: 22:09)

H(Y|X = s_1) = H(1/4, 1/2, 1/8, 1/8) = 1.75 bits

H(Y|X = s_2) = H(1/4, 1/4, 1/4, 1/4) = 2 bits

406
𝐻(𝑌|𝑋 = 𝑠3 ) = 𝐻(1,0,0,0) = 0

(Refer Slide Time: 22:50)

(Refer Slide Time: 24:42).

H(Y|X) = Σ_i Pr(X = s_i) H(Y|X = s_i)

= (1/4)(1.75 + 1.75 + 2 + 0) = 1.375 bits

407
(Refer Slide Time: 25:48)

(Refer Slide Time: 26:14)

𝐻(𝑋) = 2⁡𝑏𝑖𝑡𝑠

𝐻(𝑋, 𝑌) = 3.375⁡𝑏𝑖𝑡𝑠⁡(𝑝𝑟𝑒𝑣𝑖𝑜𝑢𝑠⁡𝑚𝑜𝑑𝑢𝑙𝑒)

We can now easily verify that 𝐻(𝑌|𝑋) + 𝐻(𝑋) = 𝐻(𝑋, 𝑌). Similarly we can also verify
that 𝐻(𝑋, 𝑌) = 𝐻(𝑋|𝑌) + 𝐻(𝑌).

408
(Refer Slide Time: 27:59)

So, we will stop here and continue with other aspects in the subsequent modules.

Thank you very much.

409
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 33
Mutual Information, Diagrammatic Representation, Properties of Mutual
Information

Hello, welcome to another module in this massive open online course. Today let us look at another new concept in information theory, the concept of mutual information.

(Refer Slide Time: 00:26)

410
(Refer Slide Time: 01:15).

Suppose we have two sources of information, 𝑋 and 𝑌. The mutual information of these
sources can be defined as,

𝐼(𝑋; 𝑌) = 𝐻(𝑌) − 𝐻(𝑌|𝑋)

= 𝐻(𝑋) − 𝐻(𝑋|𝑌)

(Refer Slide Time: 02:31)

Suppose we have communication channel with transmitted signal 𝑋 and received signal 𝑌.

411
(Refer Slide Time: 03:17)

𝐼(𝑋; 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌)

= 𝐻(𝑋) − 𝐻(𝑋, 𝑌) + 𝐻(𝑌)

= 𝐻(𝑌) − 𝐻(𝑌|𝑋)

(Refer Slide Time: 05:27)

412
(Refer Slide Time: 05:49).

H(X) is the uncertainty in the transmitted signal X and H(X|Y) is the uncertainty remaining in X after observing Y. Thus the expression for I(X; Y) denotes the uncertainty in X that is resolved after observing Y.

(Refer Slide Time: 07:08).

Naturally, for effective communication, we want to be able to predict X on observing Y; thus we want the remaining uncertainty H(X|Y) to be low. The mutual information is a measure of the uncertainty in X resolved on observing Y. We want to be able to reliably

413
tell X on observing Y. This quantity plays an important role in characterizing the fundamental limits of a communication channel.

(Refer Slide Time: 10:47)

The mutual information therefore has a very fundamental role to play: as we shall see later, it plays a fundamental role in characterizing the performance of communication across the channel.

(Refer Slide Time: 11:31).

414
(Refer Slide Time: 13:06).

Let us now look at some properties of 𝐼(𝑋, 𝑌).

𝐼(𝑋; 𝑌) = 𝐼(𝑌; 𝑋)

This is clear from the definition itself as,

𝐼(𝑋; 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌) = 𝐻(𝑌) − 𝐻(𝑌|𝑋) = 𝐼(𝑌; 𝑋)

Also note that,

𝐼(𝑋; 𝑌) = 𝐻(𝑋) + 𝐻(𝑌) − 𝐻(𝑋, 𝑌)

Since H(X) + H(Y) ≥ H(X, Y), we can write,

I(X; Y) ≥ 0

415
(Refer Slide Time: 14:07).

(Refer Slide Time: 15:36)

We can represent the above in the form of a Venn diagram.

416
(Refer Slide Time: 16:53)

Now, if you look at these components individually: the portion in X but not in Y is the uncertainty remaining in X on observing Y, i.e., H(X|Y). Similarly, the shaded region in Y but not in X is the uncertainty remaining in Y on observing X, i.e., H(Y|X), the information remaining in Y conditioned on X.

The region of overlap, i.e., the portion lying in both X and Y, is the mutual information: the uncertainty about X that is resolved on observing Y (equivalently, the information about X resolved by observing Y). Finally, the total region covering both X and Y together is the joint entropy H(X, Y), the information in X and Y. Observing the figure, you can see that the joint entropy H(X, Y) is the sum of three mutually disjoint components: H(X|Y), the mutual information I(X; Y), and H(Y|X).

417
(Refer Slide Time: 20:33).

So, from the figure you can write H(X, Y) = H(X|Y) + I(X; Y) + H(Y|X). This can also be verified directly: for instance, I(X; Y) = H(Y) − H(Y|X).

(Refer Slide Time: 21:18)

Substituting, we have H(X|Y) + I(X; Y) + H(Y|X) = H(X|Y) + H(Y) − H(Y|X) + H(Y|X); the H(Y|X) terms cancel, and if

418
you look at what remains, H(X|Y) + H(Y), this is nothing but the joint entropy H(X, Y).

So, we can derive such useful results using the mutual information, and the diagram shows the three disjoint components: H(X|Y), the mutual information I(X; Y), and H(Y|X), the uncertainty in Y on observing X. That brings us to the end of this module. What we have seen is the definition of a new, rather critical measure needed for the theory we are going to develop to characterize the fundamental rate at which information can be transmitted across a channel: the mutual information.

Between two sources it is defined as I(X; Y) = H(X) − H(X|Y), or equivalently H(Y) − H(Y|X); the mutual information is always greater than or equal to 0. Finally, we have seen the relevance of mutual information: the mutual information between X and Y is the uncertainty about X that is resolved on observing Y, or equivalently the uncertainty about Y that is resolved on observing X. It is symmetric, i.e., the mutual information between X and Y is the same as the mutual information between Y and X, and it has a fundamental role to play in characterizing the rate at which information can be transmitted across a channel.

Further, we have seen a pictorial way to comprehensively represent the various quantities seen so far: the entropies, the joint entropy, the conditional entropy, and now also the mutual information between two sources.

So, we will stop here and continue with the other aspects in the subsequent modules.

419
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 34
Simple Example of Mutual Information, Practical Example of Mutual Information
- Binary Symmetric Channel

Hello, welcome to another module in this massive open online course. So, we are looking
at mutual information, correct.

(Refer Slide Time: 00:19)

The mutual information is defined as,

𝐼(𝑋; 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌)

= 𝐻(𝑌) − 𝐻(𝑌|𝑋) = 𝐼(𝑌; 𝑋)

420
(Refer Slide Time: 01:29)

Consider the 4 × 4 table from the previous example.

(Refer Slide Time: 02:37)

H(X) = 2;   H(Y) = 1.75;   H(X, Y) = 3.375;   H(Y|X) = 1.375

I(X; Y) = H(Y) − H(Y|X) = 1.75 − 1.375 = 3/8

421
(Refer Slide Time: 04:00)

(Refer Slide Time: 04:46)

H(X|Y) = 13/8

422
(Refer Slide Time: 05:58)

I(X; Y) = H(X) − H(X|Y)

= 2 − 13/8 = 3/8

(Refer Slide Time: 06:55)

Here, we see that both 𝐼(𝑋; 𝑌) and 𝐼(𝑌; 𝑋) are equal.
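
Using the same assumed 4×4 joint distribution as before (an arrangement consistent with the slide, not reproduced in the text), a Python sketch can verify that both forms of the mutual information give 3/8:

import math

def Hbits(ps):
    return sum(p * math.log2(1 / p) for p in ps if p > 0)

# Assumed joint distribution (rows X, columns Y), as in the earlier sketch
P = [[1/8,  1/16, 1/32, 1/32],
     [1/16, 1/8,  1/32, 1/32],
     [1/16, 1/16, 1/16, 1/16],
     [1/4,  0,    0,    0   ]]

H_XY = Hbits([p for row in P for p in row])
H_X  = Hbits([sum(row) for row in P])                        # marginal of X
H_Y  = Hbits([sum(row[j] for row in P) for j in range(4)])   # marginal of Y
print(H_X + H_Y - H_XY)        # I(X;Y) = 3/8 = 0.375 bits
print(H_X - (H_XY - H_Y))      # H(X) - H(X|Y), the same value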

423
(Refer Slide Time: 08:02)

(Refer Slide Time: 08:58)

Consider another example, a binary symmetric channel. Binary means the input is a binary 0 or 1; symmetric because the flip (crossover) probability is the same, equal to p, regardless of whether a 0 or a 1 is transmitted (see pict. above).

424
(Refer Slide Time: 11:26)

p is the flip probability and 1 − p is the probability of correct transmission. Let s_0 and s_1 denote the transmitted symbols 0 and 1, and let r_0 and r_1 denote the received symbols 0 and 1 respectively.

(Refer Slide Time: 12:06)

We will then have,

Pr(𝑟0 |𝑠0 ) = Pr(𝑟1|𝑠1 ) = 1 − 𝑝

Pr(𝑟1 |𝑠0 ) = Pr(𝑟0|𝑠1 ) = 𝑝

425
Suppose,

Pr(𝑋 = 𝑠0 ) = 𝛼

Then,

Pr(𝑋 = 𝑠1 ) = 1 − 𝛼

(Refer Slide Time: 13:58)

We have the conditional entropy,

𝐻(𝑌|𝑋) = Pr(𝑋 = 𝑠0 ) 𝐻(𝑌|𝑋 = 𝑠0 ) + Pr⁡(𝑋 = 𝑠1 )𝐻(𝑌|𝑋 = 𝑠1 )

426
(Refer Slide Time: 14:46)

Suppose it is given that X = s_0; we then have,

Pr(Y = r_0 | s_0) = 1 − p

Pr(Y = r_1 | s_0) = p

We can then say,

𝐻(𝑌|𝑋 = 𝑠0 ) = 𝐻(𝑝, 1 − 𝑝) = 𝐻(𝑝)

= p log_2 (1/p) + (1 − p) log_2 (1/(1 − p))

For 𝑋 = 𝑠1 we will have,

𝑃(𝑌 = 𝑟0|𝑠1 ) = 𝑝

𝑃(𝑌 = 𝑟1|𝑠1 ) = 1 − 𝑝

The conditional entropy can then be written as,

H(Y|X = s_1) = H(1 − p, p) = H(p)

= p log_2 (1/p) + (1 − p) log_2 (1/(1 − p))

427
(Refer Slide Time: 15:53)

(Refer Slide Time: 16:55)

Overall, we can write,

𝐻(𝑌|𝑋) = Pr(𝑋 = 𝑠0 ) 𝐻(𝑌|𝑋 = 𝑠0 ) + Pr(𝑋 = 𝑠1 ) 𝐻(𝑌|𝑋 = 𝑠1 )

= 𝛼𝐻(𝑝) + (1 − 𝛼)𝐻(𝑝)

= 𝐻(𝑝)

428
(Refer Slide Time: 18:06)

(Refer Slide Time: 19:09)

Let us now look at the probability of 𝑌,

Pr(𝑌 = 𝑟0 ) = Pr(𝑟0 ∩ 𝑠0 ) + Pr(𝑟0 ∩ 𝑠1 )

= Pr(𝑟0 |𝑠0 ) Pr(𝑠0 ) + Pr(𝑟0 |𝑠1) Pr(𝑠1 )

= (1 − 𝑝)𝛼 + 𝑝(1 − 𝛼)

= 𝛼 + 𝑝 − 2𝛼𝑝

429
For receiving the symbol 𝑟1 we can write,

𝑃(𝑌 = 𝑟1 ) = 1 − Pr(𝑌 = 𝑟0 )

= 1 + 2𝛼𝑝 − 𝛼 − 𝑝

(Refer Slide Time: 20:37)

(Refer Slide Time: 21:51)

𝐻(𝑌) = 𝐻(𝛼 + 𝑝 − 2𝛼𝑝)

𝐼(𝑋; 𝑌) = 𝐻(𝑌) − 𝐻(𝑌|𝑋)

430
= 𝐻(𝛼 + 𝑝 − 2𝛼𝑝) − 𝐻(𝑝)

(Refer Slide Time: 22:58)

This expression gives us the mutual information of a binary symmetric channel with flip
probability p and source probabilities 𝛼 and 1 − 𝛼.
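
A minimal Python sketch of this expression (the flip probability p = 0.1 is illustrative), showing that the mutual information is largest at α = 1/2:

import math

def H(p):
    # Binary entropy in bits
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def bsc_mutual_information(alpha, p):
    # I(X;Y) = H(alpha + p - 2*alpha*p) - H(p) for a BSC with flip probability p
    return H(alpha + p - 2 * alpha * p) - H(p)

p = 0.1                                               # illustrative flip probability
for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(alpha, bsc_mutual_information(alpha, p))    # largest at alpha = 0.5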

Thank you very much.

431
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 35
Channel Capacity, Implications of Channel Capacity, Claude E. Shannon- Father
of Information Theory, Example of Capacity of Binary Symmetric Channel

Welcome to another module in this massive open online course. We are looking at the mutual information: we have looked at its definition, its properties and its relevance. Let us now look at one of the most important landmark results in information theory, which relates to channel capacity and which depends on the mutual information.

(Refer Slide Time: 00:36)

432
(Refer Slide Time: 01:28)

Suppose we have a communication channel with input X and output Y, described by the transition probabilities p(y|x).

(Refer Slide Time: 02:27)

𝐼(𝑋, 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌) = 𝐻(𝑌) − 𝐻(𝑌|𝑋)

433
(Refer Slide Time: 05:43)

A fundamental result in information theory knows as the Shannon theorem is given as


follows,

𝐶 = max 𝐼(𝑋; 𝑌)
Pr(X)
C gives the capacity of the channel. i.e., the maximum of mutual information (over all
probabilities of input distribution) between input X and Y gives the capacity of the
communication channel. The above result was first published in a paper titled a
mathematical theory of communication, by Shannon.

(Refer Slide Time: 07:10)

434
(Refer Slide Time: 10:11)

(Refer Slide Time: 11:21)

435
(Refer Slide Time: 13:13)

This is the most fundamental result in information theory. However, this does give a
method of achieving this maximum capacity. However, this result has led to a number of
techniques and error correction codes for maximum capacity in wireline as well as wireless
communication channels.

C gives the maximum rate at which information can transmitted over the channel with
arbitrarily low probabilities.

(Refer Slide Time: 16:04)

436
(Refer Slide Time: 17:32)

(Refer Slide Time: 19:33)

Let just take an example for channel capacity, let us go back to our Binary Symmetric
Channel and quickly look at what we had yesterday.

437
(Refer Slide Time: 20:03)

For a BSC channel with the mutual information was given by,

𝐼(𝑋; 𝑌) = 𝐻(𝛼 + 𝑝 − 2𝛼𝑝) − 𝐻(𝑝)

Here 𝛼 is the probability of 0 at the source and 𝑝 is the flip probability of the channel.

(Refer Slide Time: 22:02)

438
(Refer Slide Time: 22:09)

The capacity will be given by,

𝐶 = max 𝐼(𝑋; 𝑌) = max(𝐻(𝛼 + 𝑝 − 2𝛼𝑝) − 𝐻(𝑝))


𝛼 𝛼

Maxima of the first term occurs for 𝛼 + 𝑝 − 2𝛼𝑝 = 1/2. Considering symmetry at the
1 1 1
source assume 𝛼 = 1/2. Then, 𝛼 + 𝑝 − 2𝛼𝑝 = + 𝑝 − 2 × × 𝑝 =
2 2 2

Since, 𝐻(1/2) = 1 we get,


𝐶 = 1 − 𝐻(𝑝)

(Refer Slide Time: 24:19)

439
(Refer Slide Time: 25:55)

(Refer Slide Time: 26:51)

The unit of this capacity will be bits per channel use or bits per symbol (which can later
be translated to bits per second).

440
(Refer Slide Time: 29:36)

𝐶 = 1 − 𝐻(𝑝) = 1 − 𝐻(1 − 𝑝)

𝑝 = 0 𝐶 = 1 − 𝐻(0) = 1
𝑝 = 1 𝐶 = 1 − 𝐻(1) = 1

Thus, for both p=0 and p=1, the capacity is maximum and equal to 1. This means with no
flip and 100% flip the capacity is maximized. This is because with 100% flip also we can
know for sure what was the source symbol. The minimum will be for p=0.5

1 1
𝑝= 𝐶 = 1−𝐻( ) = 0
2 2
i.e., with 50% flip probability the capacity is effectively zero.

441
(Refer Slide Time: 30:13)

(Refer Slide Time: 31:29)

IN the next set of modules, we will discuss the capacity of more channels, and chiefly the
AWGN channel.

Thank you very much.

442
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 36
Differential Entropy, Example for Uniform Probability Density function

Hello, welcome to another module in this massive open online course. So, we looking at
information theory we have looked at Shannon's landmark result that is the channel coding
theorem alright which is basically gives result for the fundamental rate at which the
information can be transmitted across a channel, alright.

(Refer Slide Time: 00:38)

443
(Refer Slide Time: 01:05)

So far we have considered sources with discreet alphabet only. However, a source can be
continuous source as well, with output 𝑥(𝑡) being a continuous signal.

(Refer Slide Time: 02:38)

This can take values from a continuous domain instead of a finite-number symbols.

444
(Refer Slide Time: 03:58)

(Refer Slide Time: 05:29)

The output distribution will be characterized by a probability density function 𝑓𝑋 (𝑥).

445
(Refer Slide Time: 06:06)

For example, 𝑓𝑋 (𝑥) can be a gaussian function 𝑁(𝜇, 𝜎) with mean 𝜇 and variance 𝜎.

1 −
𝑥−𝜇
𝑓𝑋 (𝑥) = 𝑒 2𝜎2
√2𝜋𝜎 2

For an exponential source,

1 𝑥
𝑓𝑋 (𝑥) = 𝑒 −𝜆
𝜆

(Refer Slide Time: 06:53)

446
(Refer Slide Time: 07:41)

(Refer Slide Time: 07:51)

447
(Refer Slide Time: 08:50)

Differential entropy for a continuous source is defined as,


1
ℎ(𝑋) = ∫ 𝑓𝑋 (𝑥) log 2 𝑑𝑥
−∞ 𝑓𝑋 (𝑥)

This is represented using small ℎ in place of 𝐻, which is used for discreet sources.

(Refer Slide Time: 10:20)

448
(Refer Slide Time: 11:41).

(Refer Slide Time: 12:30)

Suppose the output the output probability is uniformly distributed between (0, 𝑎). Then we
have,

1 1
𝑓𝑋 (𝑥) = {𝑎 𝑓𝑜𝑟 0 ≤ ≤𝑎
𝑎
0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

449
(Refer Slide Time: 13:20)

(Refer Slide Time: 13:44)

450
(Refer Slide Time: 14:53)

𝑎
1
ℎ(𝑋) = ∫ log 2 𝑎 𝑑𝑥
0 𝑎

𝑎
1
= log 2 𝑎 ∫ 𝑑𝑥
𝑎 0

= log 2 𝑎

(Refer Slide Time: 16:06)

451
Not that unlike entropy for discreet systems, the battery can be negative too. For example,
in this case for 𝑎 < 1 the entropy will be negative.

We will stop with this and we look at other aspects in the subsequent modules.

Thank you very much.

452
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 37
Differential Entropy of Gaussian Source, Insights

Hello, welcome to another module in this massive open online course. So, we are looking
at the concept of differential entropy for a source which can take continuous set of values
alright. So, source alphabet which can take a continuous set of values right now let us look
at the differential entropy of a Gaussian source that is a source output which follows the
Gaussian which has the Gaussian probability density function and this is very important
because this is as we are going to see later this is one of the; I mean this is one of the most
commonly occurring such source or one can say that the source output at the output one
of the most commonly occurring probability density functions for the source output and
has a fundamental role to play in the context of information theory as we are going to see
later.

(Refer Slide Time: 01:00)

So, we are going to look at the differential entropy differential entropy of a; this is the
differential entropy of a of a Gaussian source. This is one of the most commonly occurring
and practically relevant sources in communication systems.

453
(Refer Slide Time: 01:42)

(Refer Slide Time: 02:39)

A gaussian source with mean 𝜇 and variance 𝜎 is given by,

1 (𝑥−𝜇)2

𝑓𝑋 (𝑥) = 𝑒 2𝜎2
√2𝜋𝜎 2

The differential entropy is given by,


1
ℎ(𝑋) = ∫ 𝑓𝑋 (𝑥) log 2 𝑑𝑥
−∞ 𝑓𝑋 (𝑥)

454
For the gaussian function,

1 1
log 2 = ln log 𝑒
𝑓𝑋 (𝑥) 𝑓𝑋 (𝑥) 2

1 1
= log 2 𝑒 { ln 2𝜋𝜎 2 + 2 (𝑥 − 𝜇)2 }
2 2𝜎

(Refer Slide Time: 03:20)

(Refer Slide Time: 03:56)

455
(Refer Slide Time: 04:41)

(Refer Slide Time: 06:04)

456
(Refer Slide Time: 07:08)

(Refer Slide Time: 07:41)

Substituting the above in the expression for differential entropy we will have,

∞ (𝑥 − 𝜇)2
1 2
ℎ(𝑋) = ∫ 𝑓𝑋 (𝑥) log 2 𝑒 { ln 2𝜋𝜎 + } 𝑑𝑥
−∞ 2 2𝜎 2

∞ ∞ (𝑥 − 𝜇)2
1
= log 2 𝑒 ∫ 𝑓𝑋 (𝑥) ln 2𝜋𝜎 2 𝑑𝑥 + log 2 𝑒 ∫ 𝑓𝑋 (𝑥) 𝑑𝑥
−∞ 2 −∞ 2𝜎 2

457

1 1
= log 2 2𝜋𝜎 2 + 2 log 2 𝑒 ∫ 𝑓𝑋 (𝑥)(𝑥 − 𝜇)2 𝑑𝑥
2 2𝜎 −∞

1 1
= log 2 2𝜋𝜎 2 + log 2 𝑒
2 2

1
= log 2 2𝜋𝜎 2 𝑒
2

(Refer Slide Time: 08:58)

(Refer Slide Time: 10:01)

458
(Refer Slide Time: 12:23)

(Refer Slide Time: 13:36)

Hence,

1
ℎ(𝑋) = log 2𝜋𝜎 2 𝑒
2 2

Note that the differential entropy depends only on the standard deviation or the width of
the gaussian distribution, not the mean value. Thus shifting the mean, without changing
the variance will not change the differential entropy of the source.

459
(Refer Slide Time: 14:00).

(Refer Slide Time: 14:28)

460
(Refer Slide Time: 17:02)

So, let us stop here and we look at other aspects in the subsequent modules.

Thank you very much.

461
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 38
Joint Conditional/ Differential Entropies, Mutual Information

Hello. Welcome to another module in this massive open online course. So, we are looking
at the differential entropy of a continuous source and in that on the same lines, let us extend
also the concept similar to the entropy. We had the concept of joint entropy and conditional
entropy. So, let us extend the differential entropy to also the other aspects. So, there is the
joint and the conditional entropies, ok.

(Refer Slide Time: 00:42)

In this lecture we will define the concepts of joint and conditional differential entropies
(i.e., joint and conditional entropy for continuous sources).

462
(Refer Slide Time: 01:21)

(Refer Slide Time: 01:44)

Definitions of these will be similar to that for entropy of discreet sources.

463
(Refer Slide Time: 02:56)

Let us say we have two continuous sources X and Y with marginal pdf’s 𝑓𝑋 (𝑥), 𝑓𝑌 (𝑦).
These sources can be input and output of a communication channel respectively.

The joint pdf’s of these two sources is given by 𝑓𝑋 (𝑥).

(Refer Slide Time: 04:13)

464
(Refer Slide Time: 04:59)

Joint entropy is given by,

∞ ∞
1
ℎ(𝑋, 𝑌) = ∫ ∫ 𝑓𝑋,𝑌 (𝑥, 𝑦) log 2 𝑑𝑥𝑑𝑦
−∞ −∞ 𝑓𝑋,𝑌 (𝑥, 𝑦)

(Refer Slide Time: 05:50)

465

ℎ(𝑋|𝑌) = ∫ 𝑓𝑌 (𝑦)ℎ(𝑋|𝑌 = 𝑦)𝑑𝑦
−∞

∞ ∞
1
= ∫ 𝑓𝑌 (𝑦) ∫ 𝑓𝑋|𝑌 (𝑥|𝑌 = 𝑦) log 2 𝑑𝑥 𝑑𝑦
−∞ −∞ 𝑓𝑋|𝑌 (𝑥|𝑌 = 𝑦)
In the above expression we substitute,
𝑓𝑋,𝑌 (𝑥, 𝑦)
𝑓(𝑋|𝑌 = 𝑦) =
𝑓𝑌 (𝑦)

(Refer Slide Time: 06:33)

(Refer Slide Time: 06:47)

466
(Refer Slide Time: 08:42)

(Refer Slide Time: 10:07)

The conditional probability density function can be defined as follows that is F of X given
Y of X given Y. We just write the definition for your convenience F of X given Y X given
Y equals Y equals F of X, Y divided by F of F of X. So, this is the conditional probability
density function and now if you look at, you can write this again as the following thing.
You can basically write this as follows. Of course, there has to be an integral, sorry there
has to be an integral with respect to here the integral with respect to X followed by another
integral with respect to Y.

467
Now, if we look at this F of Y, the probability density function of Y into F of X, Y given
F of X given Y is nothing, but the joint probability density function F of X, Y.

(Refer Slide Time: 11:48)

So, therefore, you will have combining these two that is if we use the property that where
we use the property that F of X F of X F of X given Y times, this is equal to the joint
probability density function. So, F of X is the conditional probability density function, F
of X given y times the probability density function equal to the joint probability density
function.

Therefore if you look at this joint therefore, if you look at this conditional entropy, the
definition will boil down to minus infinity to infinity minus infinity to infinity F of X, Y
correct X, Y times log to the base two 2 over F of X, Y. I am sorry F of X, this will still be
F of X given Y X given Y times dx dy, ok.

468
(Refer Slide Time: 13:48)

We can then write,

𝑓𝑋|𝑌(𝑋|𝑌 = 𝑦)𝑓𝑌(𝑦) = 𝑓𝑋,𝑌(𝑥, 𝑦)


∞ ∞
1
⇒ ℎ(𝑋|𝑌) = ∫ ∫ 𝑓𝑋,𝑌 (𝑥, 𝑦) log 2 𝑑𝑥 𝑑𝑦
−∞ −∞ 𝑓𝑋|𝑌 (𝑥|𝑦)

(Refer Slide Time: 14:22)

Mutual information is similarly defined as,

𝐼(𝑋; 𝑌) = ℎ(𝑋) − ℎ(𝑋|𝑌)


= ℎ(𝑌) − ℎ(𝑌|𝑋)

469
The definitions for joint and mutual information for both the cases is similar to those for
the case of discreet sources.

So, we will stop here and look at other aspects in subsequent modules.

Thank you very much.

470
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 39
Capacity of the Gaussian Channel – Part I

Hello, welcome to another module in this massive open online course. So, in this module
let us look at another fundamental property or fundamental result for the Gaussian channel
which is the capacity characterized the capacity of the Gaussian channel.

(Refer Slide Time: 00:34)

Gaussian channels are the most commonly occurring and practically relevant models for
digital communication channels. And capacity is a fundamental characterization of such
channels.

471
(Refer Slide Time: 01:10)

Suppose the channel has input X and output Y. Assuming an AWGN channel with noise
N,

𝑋 =𝑌+𝑁

Where N is the additive white gaussian noise.

(Refer Slide Time: 02:39)

472
(Refer Slide Time: 03:44)

We want to find the capacity of the AWGN channel. It is the maximum rate of information
transfer at arbitrarily low probability of error. The capacity is given by,

𝐶= max 𝐼(𝑋; 𝑌)
𝑓𝑋 (𝑥):𝐸{𝑋 2 }≤𝑃

The capacity is given by the maximum of the mutual information of X and Y over the pdf
of X. One constraint that has been introduced is the power constraint at input. A
communication system will be powered by a power source like a battery, this source will
limit the power available for the communication channel. i.e., the power source limits the
power consumption by the input communication channel.

473
(Refer Slide Time: 05:06)

(Refer Slide Time: 07:33)

474
(Refer Slide Time: 09:37)

Mutual information us given by,

𝐼(𝑋; 𝑌) = ℎ(𝑋) − ℎ(𝑋|𝑌)

= ℎ(𝑌) − ℎ(𝑌|𝑋)

From the above discussion we know, 𝑌 = 𝑋 + 𝑁. The noise 𝑁 is gaussian noise with mean
0 and variance, 𝜎 2 . Given X, 𝑋 + 𝑁 simply shifts the gaussian distribution by X without
changing the variance. Thus, Y|X is a random variable with mean 𝑋 and variance 𝜎 2 .

𝑌|𝑋~𝑁(𝑋, 𝜎 2 )

ℎ(𝑌|𝑋) = ℎ(𝑁(𝑋, 𝜎 2 ))

1
= log 2 2𝜋𝜎 2 𝑒
2

Now, since 𝑌 = 𝑋 + 𝑁

𝐸(𝑌) = 𝐸(𝑋) + 𝐸(𝑁)

= 𝐸(𝑋) = 0

The power at output is given by,

475
𝐸{𝑌 2 } = 𝐸(𝑋 2 + 𝑁 2 + 2𝑋𝑁)

= 𝐸(𝑋 2 ) + 𝐸(𝑁 2 ) + 2𝐸(𝑋𝑁)

Assuming that X and N are independent random variables 𝐸{𝑋𝑁} = 𝐸{𝑋}𝐸{𝑁} = 0\

Hence we can write,

𝐸{𝑌 2 } = 𝐸{𝑋 2 } + 𝐸{𝑁 2 }

= 𝐸{𝑋 2 } + 𝜎 2

We also know that 𝐸{𝑋 2 } ≤ 𝑃 hence,

𝐸{𝑌 2 } ≤ 𝑃 + 𝜎 2 = 𝜎̃ 2

The mutual information is given by,

𝐼(𝑋; 𝑌) = ℎ(𝑌) − ℎ(𝑋|𝑌)

1
= ℎ(𝑌) − log 2 2𝜋𝜎 2 𝑒
2

The capacity is given by,

1
𝐶= max 2 ℎ(𝑌) − log 2 2𝜋𝜎 2 𝑒
𝑓𝑋 (𝑥):𝐸{𝑋 }≤𝑝 2

1
= max2 ℎ(𝑌) − log 2𝜋𝜎 2 𝑒
𝑓𝑌 (𝑦):𝐸{𝑌 ̃2
}≤𝜎 2 2

The second term in the above expression is a constant, thus we only have to maximize the
entropy of 𝑌 over it’s pdf’s bounded by the power constraint 𝐸{𝑌 2 } ≤ 𝜎̃ 2 .

476
(Refer Slide Time: 10:46)

(Refer Slide Time: 12:04)

477
(Refer Slide Time: 12:34)

(Refer Slide Time: 13:51)

478
(Refer Slide Time: 13:58)

(Refer Slide Time: 14:27)

479
(Refer Slide Time: 15:18)

(Refer Slide Time: 16:34)

(Refer Slide Time: 17:46)

480
(Refer Slide Time: 18:39)

(Refer Slide Time: 19:41)

481
(Refer Slide Time: 21:40)

So, we will stop here basically where we are trying to characterize the capacity the
maximum rate at which information can be transmitted across in Gaussian channel that is
the channel with Gaussian noise. So, we will stop here and continue in the subsequent
modules.

Thank you very much.

482
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 40
Capacity of the Gaussian Channel – Part II, Practical Implications, Maximum Rate
in Bits/Sec

Hello, welcome to another module in this massive open online course.

(Refer Slide Time: 00:35)

We are mid way in calculating the capacity of a gaussian noise channel. In the previous
module we have shown that, the capacity will be given by,

1
( max2 ℎ(𝑦)) − log 2 2𝜋𝜎 2 𝑒
𝑓𝑌 (𝑦):𝐸{𝑌 ̃ 2}
≤𝜎 2

The second term in the above expression is constant, hence, we will focus on maximizing the first
term. How for this we will first define a quantity called Kullback Leibler divergence.

483
(Refer Slide Time: 01:10)

(Refer Slide Time: 03:29)

Consider,


𝑓𝑌 (𝑦)
∫ 𝑔𝑌 (𝑦) log 2 ( ) 𝑑𝑦
−∞ 𝑔𝑌 (𝑦)

Where, 𝑔𝑌 (𝑦) and 𝑓𝑌 (𝑦) are two different probability density functions. To the above
quantity, we can apply Jensen’s inequality for the log function.

484
Jensen’s inequality is given by (for a log function),

𝑀−1 𝑀−1

∑ 𝑝𝑋 (𝑥𝑖 ) log 2 𝑥𝑖 ≤ log 2 ∑ 𝑝𝑖 𝑥𝑖


𝑖=0 𝑖=0

In the continuous domain this can be extended to,

∞ ∞
∫ 𝑝𝑋 (𝑥) log 2 𝑔(𝑥)𝑑𝑥 ≤ log 2 ∫ 𝑝𝑋 (𝑥)𝑔(𝑥)𝑑𝑥
−∞ −∞

(Refer Slide Time: 04:56)

(Refer Slide Time: 05:28)

485
(Refer Slide Time: 08:59)

Applying the inequality to the expression we get,

∞ ∞
𝑓𝑌 (𝑦) 𝑔𝑌 (𝑦)𝑓𝑌 (𝑦)
∫ 𝑔(𝑦) log 2 ( ) 𝑑𝑦 ≤ log 2 (∫ 𝑑𝑦)
−∞ 𝑔𝑌 (𝑦) −∞ 𝑔𝑌 (𝑦)


= log 2 (∫ 𝑓𝑌 (𝑦)𝑑𝑦) = log 2 1 = 0

(Refer Slide Time: 09:57)

486
(Refer Slide Time: 10:52)

Thus we have,


𝑓𝑌 (𝑦)
∫ g Y (y)log 2 ( ) 𝑑𝑦 ≤ 0
−∞ 𝑔𝑌 (𝑦)


𝑔𝑌 (𝑦)
⇒ ∫ g Y (y)log 2 ( ) 𝑑𝑦 ≥ 0
−∞ 𝑓𝑌 (𝑦)

(Refer Slide Time: 11:56)

487
The above quantity is defined as Kullback Leibler Divergence represented by 𝐷(𝑔𝑌 ||𝑓𝑌 ).

(Refer Slide Time: 14:15)

Let 𝑔𝑌 (𝑦) be any arbitrary pdf of a random variable Y with 𝐸{𝑌 2 } = 𝜎̃ 2 and 𝑓𝑌 (𝑦) be the
gaussian pdf given by,

1 𝑦2

𝑓𝑌 (𝑦) = 𝑒 2𝜎̃2
√2𝜋 𝜎̃ 2

(Refer Slide Time: 15:18)

488
(Refer Slide Time: 16:03)

(Refer Slide Time: 16:55)

The KL divergence of g and f is,

∞ ∞ ∞
𝑔𝑌 (𝑦) 1
∫ 𝑔𝑌 (𝑦) log 2 𝑑𝑦 = ∫ g Y (y)log 2 𝑔𝑌 (𝑦)𝑑𝑦 − ∫ 𝑔𝑌 (𝑦) log 2 𝑑𝑦
−∞ 𝑓𝑌 (𝑦) −∞ −∞ 𝑓𝑌 (𝑦)


1
= −ℎ(𝑔𝑌 ) − ∫ 𝑔𝑌 (𝑦) log 2 𝑑𝑦
−∞ 𝑓𝑌 (𝑦)

489
(Refer Slide Time: 18:39)

Now,

1 𝑦2
√ 2
= 2𝜋 𝜎̃ 𝑒 ̃ 2
2𝜎
𝑓𝑌 (𝑦)

1 1 log 2 𝑒 𝑦 2
⇒ log 2 = log 2 2𝜋𝜎̃ 2 +
𝑓𝑌 (𝑦) 2 2 𝜎̃ 2

(Refer Slide Time: 19:18)

490
(Refer Slide Time: 20:09)

So, we are focusing on this now the first integral is there focus now on this integral.

(Refer Slide Time: 20:21)

We can then write the second part of the integral as,

∞ ∞
1 1 2
𝑦2
∫ 𝑔𝑌 (𝑦) log 2 𝑑𝑦 = ∫ 𝑔𝑌 (𝑦) { log 2 2𝜋𝜎̃ 𝑒 + log 2 𝑒 2 } 𝑑𝑦
−∞ 𝑓𝑌 (𝑦) −∞ 2 2𝜎̃

1 log 2 𝑒 ∞
= log 2 2𝜋𝜎̃ 2 𝑒 + ∫ 𝑔 (𝑦)𝑦 2 𝑑𝑦
2 2𝜎̃ 2 −∞ 𝑌

491

Here we have used that ∫−∞ 𝑔𝑌 (𝑦) = 1, since 𝑔𝑌 (𝑦) is a pdf.

The second integral is the expectation value of 𝑦 2 for pdf 𝑔𝑌 (𝑦) i.e., 𝐸{𝑦 2 } = 𝜎̃ 2 . The
integral then is,

1 log 2 𝑒
= log 2 2𝜋𝜎̃ 2 𝑒 +
2 2

1
= log 2 2𝜋𝜎̃ 2 𝑒
2

(Refer Slide Time: 21:24)

492
(Refer Slide Time: 23:16)

(Refer Slide Time: 24:39)

493
(Refer Slide Time: 25:46)

Hence, we have,

1
0≤ log 2𝜋𝜎̃ 2 𝑒 − ℎ(𝑔𝑌 )
2 2

1
⇒ ℎ(𝑔𝑌 ) ≤ log 2𝜋𝜎̃ 2 𝑒
2 2

(Refer Slide Time: 27:08)

In the above expression, LHS is the entropy if a gaussian source with zero mean and RHS
is the entropy of a arbitrary source with zero mean.

494
(Refer Slide Time: 28:58)

The maximum capacity is given by,

1
𝐶=( max2 ℎ(𝑦)) − log 2 2𝜋𝜎 2 𝑒
𝑓𝑌 (𝑦):𝐸{𝑌 ̃2
}≤𝜎 2

1
From the above derivation we know that max ℎ(𝑦) = 2 log 2 2𝜋𝜎̃ 2 𝑒

Thus we have,

1 1
𝐶= log 2 2𝜋𝜎̃ 2 𝑒 − log 2 2𝜋𝜎 2 𝑒
2 2

1 𝜎̃ 2
= log 2 2
2 𝜎

1 𝑃 + 𝜎̃ 2
log 2
2 𝜎2

1 𝑃
= log 2 (1 + 2 )
2 𝜎

1
𝐶 = log 2 (1 + 𝑆𝑁𝑅)
2

495
(Refer Slide Time: 30:30)

(Refer Slide Time: 31:20)

The final result above is a very fundamental and popular result in information theory and
theory of communication systems.

496
(Refer Slide Time: 32:42)

The above result characterizes a channel in terms of bits per channel use or bits per symbol.
Suppose we a channel with bandwidth 2W (from -W and +W), band limited to W. Then
by Nyquist criteria the signal is characterized by a sampling rate of 2W.

(Refer Slide Time: 33:58)

497
(Refer Slide Time: 35:44)

Thus the maximum rate of information transfer is,

𝑃
𝐶̃ = 𝑊 log 2 (1 + )
𝜎2

= 𝑊 log 2 (1 + 𝑆𝑁𝑅)

(Refer Slide Time: 36:45)

498
(Refer Slide Time: 38:22)

This is also knows as the Shannon Hartley law and is a very fundamental and practically
relevant result for communication systems.

So, we will stop here and look at another aspect in the subsequent modules.

Thank you very much.

499
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 41
Introduction to Source Coding and Data Compression, Variable Length Codes,
Unique Decodability

Hello. Welcome to another module in this massive open online course. In this module we
will start another important topic in information theory and communication systems which
is source coding.

(Refer Slide Time: 00:39)

Source coding is used for data compression and is thus also useful in storage of large
volumes of multimedia content, along with their transmission.

500
(Refer Slide Time: 03:00)

(Refer Slide Time: 04:14)

Suppose we have a source X and it generates symbols 𝑥(0), 𝑥(1), 𝑥(2) … … …. with time.
The symbols belong to a finite alphabet,

𝑥(𝑘)𝜖𝑆 = {𝑠0 , 𝑠1 , 𝑠2 , 𝑠3 , … … . , 𝑠𝑀−1 }

501
(Refer Slide Time: 05:33)

A basic assumption that we have made throughout this course till now is that the source is
a discreet memory less source. This means that the probability of a symbol is independent
of the symbol produced before a symbol. i.e.,

Pr(𝑥(𝑘) = 𝑠𝑖 |𝑥(𝑘 − 1) = 𝑠𝑗 ) = Pr(𝑥(𝑘) = 𝑠𝑖 ) = 𝑝𝑖


This is one of the most commonly employed source models in communication systems.

(Refer Slide Time: 08:33)

502
(Refer Slide Time: 11:37)

Source coding is the process of converting each symbol into a sequence of binary symbols
which are known as codewords. For efficient coding we would like to minimize average
number of bits per symbol. The mapping from symbols to words is known as code.

(Refer Slide Time: 14:51)

503
(Refer Slide Time: 15:47)

For example, consider a simple code, for an alphabet of M=4 symbols, {𝑠0 , 𝑠1 , 𝑠2 , 𝑠3 }

𝑠0 → 11
𝑠 → 10
{ 1
𝑠2 → 01
𝑠3 → 00

504
Here each symbol is represented by 2 bits such a code is known as a fixed length code.
Also observe that the code can be mapped to the alphabet one to one. One of the benefits
of such a code is the low complexity of the code, which allows for an easy decoding.

(Refer Slide Time: 16:44)

(Refer Slide Time: 20:24)

Now consider the following code,

𝑠0 → 0
𝑠 →1
{ 1
𝑠2 → 00
𝑠3 → 11

505
Here, each symbol is represented by a variable number of bits. Such a code is known as a
variable length code. Such a code reduces the average number of bits per symbol. But then
consider the following sequence of bits, (000). This sequence can be decoded as 𝑠0 𝑠0 𝑠0 or
as 𝑠2 𝑠0 . Thus, the coding scheme does not provide a one to one mapping. This will be an
issue for a communication system. i.e., we cannot uniquely decode the symbol from the
received codes.

In the next module we will see how to get a variable length code from which we can uniquely
decode the symbols.

Now, in a variable length code for instance, again let us take an example of M equal to 4.
For instance, we have s0. Let us take an example s0 represented by 0, s1 represented by 1,
s2 represented by 0 0, s 3 represented by 1 1, ok.

Now, the variable length code, obviously you can see different symbols are represented
using different numbers of bits and further observed in this case, there is a problem. It is a
variable length code. For instance, here we have for 0, we have 1 bit. This is 1 bit, 2 bits
and 3 bits. Now, there is a problem for instance, what is the problem?

(Refer Slide Time: 22:28)

Let us look at the problem and the problem should be apparent to you. Here if you look at
the transmitted sequence 0, either correspond to s0 followed by s2 or it can correspond to
s0 s0 s0 or it can correspond to s2 followed by s0, ok.

506
So, the same sequence 0 0 0, you can correspond to either s0 s2 s0 s0 s0 or s2 s0. Note this
problem. So, same sequence what is the problem here in this example, the same sequence
can correspond either to s0 s2 s2 s0 or s0 s0. So, these three sequences, given the received
bits sequence 0 0 0, either the stored bit sequence or the bit sequence 0 0 0 at the receiver,
one cannot uniquely decode the transmitted symbol sequence or the transmitted source
coded sequence. It can either correspond to s2 s0 s0 s2 or s0 s0 s0.

(Refer Slide Time: 24:26)

So, this code is not uniquely decodable. So, this code, hence code is not uniquely
decodable. So, original sequence what do we mean by the original sequence? What do we
mean by uniquely decodable? Let us now formally define what we mean by, we see a code
is uniquely decodable. If the original symbol sequence is uniquely decodable if the original
if the original source sequence or the original source symbol sequence, original source
sequence can be reconstructed perfectly from the coded sequence. In this case, we have
seen that there exist at least 1 that is 0 0 0 at corresponding to which the symbol sequence,
the corresponding symbol sequence cannot be decoded uniquely. There are multiple
symbol sequence which give raise to the same code sequence, right so many to one
mapping.

So, not one to one mapping and therefore, this code is not uniquely decodable which
creates problems for us because either when the symbols have been received at the receiver
or if the symbols have been stored and you try or they have the bits have been stored and

507
you are trying to reconstruct the original symbol sequence, then it cannot be the symbol
sequence is unknown. It cannot be reconstructed uniquely, all right. So, there is an
information loss that happens in the source coding process and naturally this is not
preferable. So, therefore, what we are going to do? So, what we have seen here is, we have
seen now that does not mean that all variable length codes are not uniquely decodable here.

We have considered a variable for instance, therefore symbols are different numbers of
bits. So, this is a variable length code or what is also termed as VLC. Sometimes now of
course, it is clear that for a fixed length code is always going to be uniquely decodable,
correct as long as many symbols are not mapped to the same fix length codeword, the fix
length code because its nature is uniquely decodable, correct. How are the variable length
code? This is not the case. Now, what are the conditions to be satisfied by a code, variable
length code? For that matter, any code for it to be uniquely decodable and how do we
construct such codes. So, these are some of the aspects that we look at in the subsequent
modules.

Thank you very much.

508
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 42
Uniquely Decodable Codes, Prefix-free Code, Instantaneous Code, Average Code
length

Hello. Welcome to another module in this massive open online course.

(Refer Slide Time: 00:57)

In the previous module we took an example of a variable length code given by,

𝑠0 → 0
𝑠 →1
{ 1
𝑠2 → 00
𝑠3 → 11

509
(Refer Slide Time: 01:40)

We also observed that this code is not uniquely decodable as we can the sequence 000 can
be decoded as 𝑠0 𝑠2 as well as 𝑠2 𝑠0 . This will be a problem for a practical communication
system as the original message will be lost in decoding.

(Refer Slide Time: 02:31)

510
(Refer Slide Time: 04:55)

Now consider the following code,

𝑠0 → 1
𝑠1 → 01
{
𝑠2 → 001
𝑠3 → 000

The above code will be uniquely decodable. For e.g., 1001 can only be 𝑠0 𝑠2 . How do we
actually get a uniquely decodable variable length code? Observe in this example that no
code is a prefix to another codeword. Such a code is known as a prefix free code or a prefix
code.

511
(Refer Slide Time: 06:00)

(Refer Slide Time: 07:36)

In the previous example from the last module code word for 𝑠0 was 0 which was a prefix
for codeword 00 for 𝑠2 .

512
(Refer Slide Time: 08:43)

Prefix free codes are also known as instantaneous codes as a codeword as each codeword
is decodable by itself, it is not required to wait for the next codeword.

(Refer Slide Time: 10:15)

513
(Refer Slide Time: 12:28)

(Refer Slide Time: 13:48)

It can be shown that prefix free codes and instantaneous codes are same and are uniquely
decodable.

514
(Refer Slide Time: 16:09)

Let the bit length of 𝑖 𝑡ℎ code be 𝑙𝑖 and the probability of occurrence be 𝑝𝑖 . Then the average
length of a codeword is given by,

𝑀−1

𝐿̅ = ∑ 𝑝𝑖 𝑙𝑖
𝑖=0

(Refer Slide Time: 17:25)

515
(Refer Slide Time: 18:29)

(Refer Slide Time: 19:54)

We would now like to minimize the average codelength as well as develop fundamental
insights into the process of code design.

So, with this we will stop this module here, and look at other aspects in the sub segment
modules.

Thank you very much.

516
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 43
Binary Tree Representation of code, Example Kraft Inequality

Hello, welcome to another module in this massive open online course. We are looking at
source coding and are trying to define different kinds of codes.

(Refer Slide Time: 00:50)

In this module we will look at binary tree representation of a code. This representation is
useful for designing as well decoding of codes.

517
(Refer Slide Time: 02:10)

Consider the code,

𝑠0 − 10
𝑠 − 00
{ 1
𝑠2 − 11
𝑠3 − 110

(Refer Slide Time: 02:49)

518
The above code can be represented as binary tree. Suppose starting from the root, and we
move above for bit 1 and bellow for bit 2. Then the tree will look as follows (see pict
above).

In the above tree observe that the code for symbol 𝑠2 is an “ancestor” of code for symbol
𝑠3 , i.e., 𝑠2 occurs on the path of 𝑠3 . Also, not the code for 𝑠2 is a prefix to the code for 𝑠3 .
In general we can say that if a codeword lies along the path to another codeword, then the
code is not a prefix free code.

(Refer Slide Time: 06:55)

Now take the other example we took,

𝑠0 − 1
𝑠 − 01
{ 1
𝑠2 − 001
𝑠3 − 000

519
(Refer Slide Time: 08:44)

Now if we look at the binary tree for the above code, we will observe that (see pict above),
All the symbols lie on the final ends of the tree or the “leaves” of the tree. In this case no
codeword lies on the path to another codeword.

(Refer Slide Time: 10:14)

Therefore, this code is a prefix free code. No codeword is an ‘ancestor’ of another


codeword.

520
(Refer Slide Time: 13:55)

The above binary tree has three levels. Suppose we ‘complete’ the above tree, then we can
see that the maximum number of leaves or the maximum number of symbols that can be
represented using the above code is equal to 8. (𝑙𝑚𝑎𝑥 = 3). In general, is the depth if the
tree is 𝑙𝑚𝑎𝑥 the maximum number of leaves will be 2𝑙𝑚𝑎𝑥 .

The number of nodes eliminated at each leaf can be written as 2𝑙𝑚𝑎𝑥 −𝑙𝑖 where i is the level
(depth) of an actual leaf of the tree.

(Refer Slide Time: 16:03)

521
If we look at lengths of each leaf, it can be seen that the depth corresponds to the number
of bits in the codeword,

𝑠0 − 1 𝑙0 = 1
𝑠 − 01 𝑙1 = 2
{ 1
𝑠2 − 001 𝑙2 = 3
𝑠3 − 000 𝑙3 = 3

(Refer Slide Time: 21:03)

In a prefix free code each codeword for 𝑠𝑖 corresponds to 2𝑙𝑚𝑎𝑥 −𝑙𝑖 at depth 𝑙𝑚𝑎𝑥 of the
binary tree. From this we observe that

𝑀−1

∑ 2𝑙𝑚𝑎𝑥 −𝑙𝑖 ≤ 2𝑙𝑚𝑎𝑥 = 𝑚𝑎𝑥. 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑛𝑜𝑑𝑒𝑠 𝑎𝑡 𝑙𝑚𝑎𝑥


𝑖=0

522
(Refer Slide Time: 23:05)

(Refer Slide Time: 23:35)

523
(Refer Slide Time: 24:33)

𝑀−1

⇒ ∑ 2𝑙𝑚𝑎𝑥 −𝑙𝑖 ≤ 2𝑙𝑚𝑎𝑥


𝑖=0

𝑀−1

⇒ ∑ 2−𝑙𝑖 ≤ 1
𝑖=0

(Refer Slide Time: 25:14)

This result is known as Kraft’s inequality. And can be used to check prefix free ness of a
code.

524
(Refer Slide Time: 26:42)

The converse also holds, i.e.,

If a set of lengths 𝑙𝑖 satisfies the kraft’s inequality, then there exists a codeword with
lengths 𝑙𝑖

Thank you very much.

525
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 44
Lower Bound on Average Code Length, Kullback-Leibler Divergence

Hello. Welcome to another module in this massive open online course.

(Refer Slide Time: 00:50)

(Refer Slide Time: 01:54)

526
We now want to minimize the average codelength. Consider a symbols set,

𝑋 𝜖𝑆 = {𝑠0 , 𝑠1 , 𝑠2 , 𝑠3 , … . 𝑠𝑀−1 }

Pr(𝑠𝑖 ) = 𝑝𝑖

(Refer Slide Time: 02:53)

The average codelength is given by,

𝑀−1

𝐿̅ = ∑ 𝑝𝑖 𝑙𝑖
𝑖=0
The lengths 𝑙𝑖 will be bound by craft’s inequality,
𝑀−1

∑ 2−𝑙𝑖 ≤ 1
𝑖=0

527
(Refer Slide Time: 4:20)

We define a quantity

2−𝑙𝑖
𝑞𝑖 = −𝑙𝑖
∑𝑀−1
𝑗=0 2
Since denominator and numerator both are positive thus, 𝑞𝑖 ≥ 0. Further,
𝑀−1

2−𝑙𝑖 ≤ ∑ 2−𝑙𝑖
𝑗=0
Thus,
2−𝑙𝑖
−𝑙𝑖
≤1
∑𝑀−1
𝑗=0 2
⇒ 0 ≤ 𝑞𝑖 ≤ 1
Furthermore,
𝑀−1 −𝑙𝑖
2−𝑙𝑖 ∑𝑀−1
𝑗=0 2
∑ −𝑙𝑗 = −𝑙𝑗 =1
∑𝑀−1
𝑗=0 2 ∑𝑀−1
𝑗=0 2
𝑖=0
𝑀−1

⇒ ∑ 𝑞𝑖 = 1
𝑖=0
Thus, we have that 𝑞𝑖 ′𝑠 can be treated as set of probabilities

528
(Refer Slide Time: 05:43)

(Refer Slide Time: 07:22)

529
(Refer Slide Time: 07:59)

(Refer Slide Time: 09:17)

We can now apply, Kullback Leibler divergence on 𝑝𝑖 and 𝑞𝑖 .

530
(Refer Slide Time: 10:07)

𝑀−1
𝑝𝑖
𝐷(𝑝||𝑞) = ∑ 𝑝𝑖 log 2 ≥0
𝑞𝑖
𝑖=0

(Refer Slide Time: 10:47)

531
(Refer Slide Time: 13:12)

𝑀−1
𝑝𝑖
∑ 𝑝𝑖 log 2 ≥0
𝑞𝑖
𝑖=0

𝑀−1 𝑀−1
1
⇒ ∑ 𝑝𝑖 log 2 𝑝𝑖 + ∑ 𝑝𝑖 log 2 ≥0
𝑞𝑖
𝑖=0 𝑖=0
−𝑙𝑗
1 ∑𝑀−1
𝑗=0 2
log2 = log2
𝑞𝑖 2−𝑙𝑖
𝑀−1

= 𝑙𝑖 + log 2 ∑ 2−𝑙𝑗
𝑗=0
= 𝑙𝑖 − 𝜖

532
(Refer Slide Time: 15:06)

(Refer Slide Time: 17:46)

𝜖 = − log 2 ∑ 2−𝑙𝑗
𝑗

533
(Refer Slide Time: 18:35)

Substituting the above in the equation for divergence we have,

𝑀−1

−𝐻(𝑋) + ∑ 𝑝𝑖 (𝑙𝑖 − 𝜖) ≥ 0
𝑖=0
𝑀−1 𝑀−1

⇒ ∑ 𝑝𝑖 𝑙𝑖 ≥ 𝐻(𝑋) + ∑ 𝑝𝑖 𝜖
𝑖=0 𝑖=0
⇒ 𝐿̅ = 𝐻(𝑋) + 𝜖
Assuming that 𝜖 can take arbitrary low values and as it approaches zero, we will have,
̅ ≥ 𝐻(𝑋)
𝐿

(Refer Slide Time: 20:29)

534
Thus, the entropy is a fundamental lower bound of the average codeword length for any
prefix free code.

(Refer Slide Time: 22:17)

Thank you very much.

535
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 45
Optimal Code Length, Constrained Optimization, Morse Code Example

Hello, welcome to another module in this massive open online course. So we are looking
at the average code length and we saw a fundamental bound for average codeword length

(Refer Slide Time: 00:55)

In this module we will look at how to get optimal code-lengths in order to approach the
lower bound H(x).

536
(Refer Slide Time: 01:54)

We know that any optimal set of lengths will have to satisfy kraft’s inequality,

𝑀−1

∑ 2−𝑙𝑖 ≤ 1
𝑖=0

And the average length is given by,

𝑀−1

𝑙𝑎𝑣𝑔 = ∑ 𝑝𝑖 𝑙𝑖
𝑖=0

537
(Refer Slide Time: 03:39).

We can now formulate a constrained optimization problem, where we want to

𝑀−1

𝑚𝑖𝑛. ∑ 𝑝𝑖 𝑙𝑖
𝑖=0

𝑀−1

𝑠. 𝑡. ∑ 2−𝑙𝑖 ≤ 1
𝑖=0

538
(Refer Slide Time: 05:09).

We can use the lagrange multiplier framework in order to achieve this above. Define

𝑀−1 𝑀−1

𝑓 = ∑ 𝑝𝑖 𝑙𝑖 + 𝜆 ( ∑ 2−𝑙𝑖 − 1)
𝑖=0 𝑖=0

In the above expression 𝜆 is known as the Lagrange multiplier. We differentiate the above
w.r.t 𝑝𝑖 and set them to 0 for each 𝑝𝑖 .

𝜕𝑓
=0
𝜕𝑝𝑖

⇒ 𝑝𝑖 + 𝜆2−𝑙𝑖 (−1) ln 2 = 0

𝜆𝑙𝑛2
⇒ 𝑙𝑖 = log 2
𝑝𝑖

The value of lambda will be given by equality on the constraint,

𝑀−1

∑ 2−𝑙𝑖 = 1
𝑖=0

𝑀−1
𝑝𝑖
⇒ ∑ =1
𝜆 ln 2
𝑖=0

539
𝑀−1
1
⇒𝜆= ∑ 𝑝𝑖
ln 2
𝑖=0

1
⇒𝜆=
ln 2

(Refer Slide Time: 08:17)

(Refer Slide Time: 10:40)

540
(Refer Slide Time: 11:33)

(Refer Slide Time: 13:29)

The optimal codeword length which minimizes the average codeword length is given by,

1
𝑙𝑖∗ = log 2
𝑝𝑖

541
(Refer Slide Time: 14:29)

i.e., the codelength is inversely proportional to the probability of occurrence of the symbol.
This leads to an effective and efficient representation.

(Refer Slide Time: 16:19)

Consider an example of the Morse-code, which formed the backbone of the telegraph
system. Look at the representation of the following alphabets.

𝑇→−

𝑄 → −−. −

542
𝑍 → −−..

T is a much more frequently occurring alphabet than Q or Z. Thus, it’s length of code is
shorter. It is also intuitive that if a frequently occurring alphabet is shorter in code, the
overall statement will be more compact.

(Refer Slide Time: 17:32)

(Refer Slide Time: 18:14)

Of course, we have shown that using a constrained optimization frame work all, right?
Using the KKT frame work formulating the lagrangian using the lagrangian multiplier the
Lagrange multiplier solving this, right? Solving the lagrangian to obtained the optimal

543
codeword lengths l i which is given by log to the base 2 1 over p i all right. So, will stop
here and in the subsequent models we look at schemes to design efficient codes in order
to achieve this lower bound all right.

Thank you very much.

544
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 46
Approaching Lower Bound on Average code length, Block Coding

Hello, welcome to another module in this massive open online course. We are looking at
achieving the minimum possible codelengths.

(Refer Slide Time: 00:45)

(Refer Slide Time: 01:55)

545
In the last module we saw that minimum code length can be achieved (for which the
average codelength is equal to the entropy of the source),

1
𝑙𝑖 = log 2
𝑝𝑖

The length of a code has to be an integer, but the quantity on the right side might not be
an integer. And since this the minimum we can approach, we select for 𝑙𝑖 the greatest
integer function or the ceiling function, i.e.,

1
𝑙𝑖 = ⌈log 2 ⌉
𝑝𝑖

The ceiling function is the closest integer to the input, hence, we can write,

𝑥 ≤ ⌈𝑥⌉ ≤ 𝑥 + 1

(Refer Slide Time: 00:59)

So, I can choose l i equals the ceiling function because if you choose the lower integer then
you can violet because you cannot decrease we have a lower bound I cannot decrease it
beyond that is I cannot make it in the lower bound I can always choose an average length
I can always choose a length which is higher.

546
(Refer Slide Time: 04:56)

So, this is remember this is the integer ceiling function, this is the integer ceiling function
correct gives the next highest integer example for examples ceiling of 4.7 is equal to 5
etcetera so on and it is a very function alright, which you take the next highest integer
because it is not an integer you approximated by the next highest integer.

(Refer Slide Time: 05:38)

So, therefore, what I have is basically now if you look at this log 2 to the base 1 over P i
which is your l i which your setting your l i, it naturally follows that this is less than or
equal to log two to the base log to the base 2 1 over P i plus 1 and log to the base 2 1 over

547
P i, and that can be easily seen for example, ceiling of 4.7 is 5 and this is less than or equal
to 5 plus 4.7 plus 1 this is less than or equal to 4.7 plus 1 equals 5.7 and this is greater than
or equal to 4 and this is greater than and equal to I am sorry 4.7 this is greater than. So, we
have inequality that x is less than or equal to ceiling of x, less than or equal to x plus 1,
naturally because the ceiling of x is the next highest integer. So, it is greater than equal to
x and the next highest integer lies within 1 within a distance of 1 from the given quantity.
So, it is obvious less than equal to x by ceiling of x is less than equal to x plus 1 ok.

(Refer Slide Time: 07:25)

Using this relation, we can write,

𝑀−1 𝑀−1 𝑀−1


1 1 1
∑ 𝑝𝑖 log 2 ≤ ∑ 𝑝𝑖 ⌈log 2 ⌉ ≤ ∑ 𝑝𝑖 (1 + log 2 )
𝑝𝑖 𝑝𝑖 𝑝𝑖
𝑖=0 𝑖=0 𝑖=0

⇒ 𝐻(𝑥) ≤ 𝐿̅ ≤ 𝐻(𝑥) + 1

Hence, we have shown that in a practical scenario we can chose the average code length
within one bit of the entropy of the source.

548
(Refer Slide Time: 08:37)

(Refer Slide Time: 09:20)

549
Now let us see how to achieve this code length.

Consider the sequence

𝑋̅ = 𝑋1 , 𝑋2 , 𝑋3 , 𝑋4 , … … . , 𝑋𝑛

That is a group of n symbols. Suppose we encode this block of symbols (instead of each
symbols). Assume that these are IID symbols from a memoryless discreet source.

Let the average length be 𝐿̅𝑛 .Since the symbols are IID each has the same entropy,

𝐻(𝑋̅) = 𝑛𝐻(𝑋𝑖 )

Applying the above limits on average length of codeword,

𝐻(𝑋̅) ≤ ̅̅̅
𝐿𝑛 ≤ 𝐻(𝑋̅) + 1

⇒ 𝐻(𝑋) ≤ ̅̅̅
𝐿𝑛 ≤ 𝑛𝐻(𝑋) + 1

𝐻(𝑥) ̅̅̅
𝐿𝑛 𝐻(𝑥) 1
⇒ ≤ ≤ +
𝑛 𝑛 𝑛 𝑛

̅̅̅
𝐿𝑛 /𝑛 is the average number of bits per symbol for a set of n symbols. Now if we increase
the block length n, we can make the average length of a codeword, arbitrarily closer to the
lower bound (or entropy). Thus encoding larger and larger blocks of symbols, we can
achieve arbitrarily low lengths of code.

550
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 47
Huffman Code, Algorithm, Example, Average Code Length

Hello, welcome to another module in this massive open online course. So, we were looking
at various aspects of source coding the fundamental bound. On the lowest possible average
length of the source code for a given source and also we have seen the mechanism by
which we can approach this lower bound, that is by coding symbols across larger and
larger block lengths all right.

So, in this module we look at a practical scheme when algorithm to construct an efficient
source code for a given source which is a very simple and elegant scheme termed as the
Huffman code. So, that is the algorithm we look at in this module.

(Refer Slide Time: 00:54)

551
(Refer Slide Time: 01:16).

Huffman code is a popular and simple algorithm, to construct an efficient prefix free code
for a source with given probability distribution for the source alphabet.

(Refer Slide Time: 02:58)

552
(Refer Slide Time: 04:12)

Consider the following source

𝑋𝜖𝑆 = {𝑠0 , 𝑠1 , 𝑠2 , 𝑠3 , 𝑠4 }

With the probability distribution

Pr(𝑠0 ) = 0.15
Pr(𝑠1 ) = 0.2
Pr(𝑠2 ) = 0.15
Pr(𝑠3 ) = 0.25
{Pr(𝑠4 ) = 0.25

553
(Refer Slide Time: 06:14)

(Refer Slide Time: 06:42)

We now demonstrate the procedure of generating Hoffman codes.

Step 1: Write the probabilities in decreasing order,

𝑠3 = 0.25
𝑠4 = 0.25
𝑠1 = 0.2
𝑠2 = 0.15
{𝑠𝑜 = 0.15

554
Now we add the probabilities of the last two (lowest probabilities). One branch assigned
0 and another branch 1.

Then we continue the process with these two combined and the next lowest probability
symbol and continue above to form a Hoffman tree.

(Refer Slide Time: 09:20)

(Refer Slide Time: 12:36)

In order to get the Huffman code for each symbol, start from the end point (the root), and
reach the symbol node.

555
(Refer Slide Time: 19:58)

The code we get is,

𝑠0 − 001
𝑠1 − 11
𝑠2 − 000
𝑠3 − 01
{ 𝑠4 − 10

(Refer Slide Time: 20:57)

556
Note that the two lowest probability symbols will have a code of equal length (since they
are combined first).

(Refer Slide Time: 22:21).

In the binary tree representation we can see that the code is a prefix free code.

(Refer Slide Time: 23:30)

(Refer Slide Time: 24:35)

Now we calculate the average code length of the code,

557
𝑀−1

𝐿̅ = ∑ 𝑝𝑖 𝑙𝑖
𝑖=0

= 3 × 0.15 + 2 × 0.2 + 3 × 0.15

= 2.3 𝑏𝑖𝑡𝑠 𝑝𝑒𝑟 𝑠𝑦𝑚𝑏𝑜𝑙

(Refer Slide Time: 26:24).

(Refer Slide Time: 27:53)

The lower bound of average codeword length is given by the entropy,

558
1 1 1
𝐻(𝑋) = 2 × 0.15 × log 2 + 0.2 × log 2 + 2 × 0.25 × log 2
0.15 0.2 0.25

= 2.2855

(Refer Slide Time: 29:19).

The difference between the lower bound and the actual average length is 0.0145
bits/symbol. By coding over large blocks we can go arbitrarily close to the entropy.

Thank you very much.

559
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 48
Introduction to Channel Coding, Rate of Code, Repetition Code, Hamming
Distance

Hello, welcome to another module in this massive open online course.

(Refer Slide Time: 00:40)

(Refer Slide Time: 01:09)

560
Above is a schematic diagram of a digital communication system. In any digital
communication system, there is a probability of error due to a noise, resulting in a bit flip.
Eg, we send the bits, 011001 and the received signal is 011101.

(Refer Slide Time: 02:23)

(Refer Slide Time: 06:08)

We need to be able to correct these errors to the best possible extent. But how do we
achieve this?

561
(Refer Slide Time: 08:24)

(Refer Slide Time: 08:56)

Consider a simple example we need to transmit 011001. Now in the transmission we send
each bit three times, that is we transmit, 000 111 111 000 000 111. Suppose the received
bit stream is 000 011 011 000 000 110. Notice that some bits have been inverted.

Now we reconstruct the original bit sequence using maximum bit rule. That is the bit read
from a group of 3 bits is the bit that has maximum number of occurrences in that group.
Thus the reconstructed message is, 011001. Hence, in this case we are able to receive a
message and reconstruct it correctly even with the presence of noise.

562
(Refer Slide Time: 12:47)

(Refer Slide Time: 14:08)

One observation to draw here is that we need redundancy of bits in order to be able to
correct the errors.

563
(Refer Slide Time: 16:17)

This process of introducing systematic redundancy in the code for the purpose of error
correction is known as channel coding. A particular code is termed as a error control code.

(Refer Slide Time: 19:28)

In the above scheme for each 3 bits we have 1 info bits which means,

1 #𝑖𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛 𝑏𝑖𝑡𝑠
𝑅𝑎𝑡𝑒 𝑜𝑓 𝑐𝑜𝑑𝑒 = =
3 # 𝑜𝑓 𝑡𝑜𝑡𝑎𝑙 𝑏𝑖𝑡𝑠

564
(Refer Slide Time: 20:20)

As redundancy increases the rate decreases which decreases the bandwidth efficiency.

(Refer Slide Time: 23:31)

Moreover, suppose to transit 1, we send 111 but, in the channel, two bits get flipped, and
we get 010. This group will be reconstructed as 0 and we will be unable to get the correct
code.

565
(Refer Slide Time: 25:51)

(Refer Slide Time: 28:02)

Thus the maximum number of bit errors in a group that can be handled is 1 for this case.
This is known as the hamming distance (𝑑𝐻 ). The hamming distance is the number of
dissimilar bits in the received message.

566
(Refer Slide Time: 29:10)

What we will see next is there are much better and much more efficient codes.

Thank you very much.

567
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 49
Introduction to Convolutional Code, Binary Field Arithmetic, Linear Codes

Hello welcome to another module in this massive open online course. So, we are looking
at channel coding, and in this module, we start looking at a different class of channel codes
termed as convolutional codes.

(Refer Slide Time: 00:26)

568
(Refer Slide Time: 01:31)

Consider a bit sequence 𝑥(0), 𝑥(1), 𝑥(2), 𝑥(3), … … The sequence can be possibly infinite.

(Refer Slide Time: 02:31)

Consider the following convolution code,

𝐶0 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 2)

𝐶1 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 1) + 𝑥(𝑗 − 2)

569
𝐶0 (𝑗) and 𝐶1 (𝑗) are code bits at time j. Since two code bits are generated at each time
instant the rate is given by

1
𝑅𝑎𝑡𝑒 =
2

(Refer Slide Time: 05:26)

The addition of bits above will be according to binary field arithmetic. In binary field
arithmetic the addition operation is similar to an XOR operation and a multiplication is
similar to a AND logic operation.

0+0=0
0+1=1+0=0
1+1=0

0.0 = 1.0 = 0.1 = 0


{ 1.1 = 1

570
(Refer Slide Time: 06:53)

The binary field operations described above are commutative, associative and distributive,
i.e., if a, b, c are three bits, then,

𝑎+𝑏 =𝑏+𝑎
𝑐𝑜𝑚𝑚𝑢𝑡𝑎𝑡𝑖𝑣𝑒 − {
𝑎. 𝑏 = 𝑏. 𝑎

𝑎 + (𝑏 + 𝑐) = (𝑎 + 𝑏) + 𝑐
𝑎𝑠𝑠𝑜𝑐𝑖𝑡𝑖𝑣𝑒 {
𝑎. (𝑏. 𝑐) = (𝑎. 𝑏). 𝑐

𝑎. (𝑏 + 𝑐) = 𝑎. 𝑏 + 𝑎. 𝑐
𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑣𝑒 {
(𝑎 + 𝑏). 𝑐 = 𝑎. 𝑐 + 𝑏. 𝑐

571
(Refer Slide Time: 08:04)

(Refer Slide Time: 09:08)

Now consider two information bit sequences, 𝑥(0), 𝑥(1), 𝑥(2), … . .. and
𝑦(0), 𝑦(1), 𝑦(2) … ….

𝐶0𝑥 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 2)

𝑦
𝐶0 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 2)

572
(Refer Slide Time: 10:08)

(Refer Slide Time: 11:05)

Now,

𝑦
𝐶0𝑥 (𝑗) + 𝐶0 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 2) + 𝑦(𝑗) + 𝑦(𝑗 − 2)

= (𝑥(𝑗) + 𝑦(𝑗)) + (𝑥(𝑗 − 2) + 𝑦(𝑗 − 2))

𝑥+𝑦
= 𝐶0 (𝑗)

573
(Refer Slide Time: 11:50)

(Refer Slide Time: 13:45)

The coded bit for sum of two bits is same as sum of two coded bits. Thus, the scheme
follows additive law and hence is linear.

574
(Refer Slide Time: 14:47)

Thank you very much.

575
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 50
Example of Convolutional Code Output, Convolution Operation for Code
Generation

Hello. Welcome to another module in this massive open online course. We are looking at
convolutional codes correct which are class of error control codes to recover or correct
errors that occur over the channel in a communication system.

(Refer Slide Time: 00:38)

576
(Refer Slide Time: 01:31)

We will consider an example of a convolutional code. Consider the following bit sequence
from 𝑥(0) to 𝑥(9). 1 0 1 1 0 0 1 0 1 1. We take x (-1) =x (-2) =0. The coded bit 𝐶0 will be
given by binary field addition,

𝐶0 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 2)


𝐶0 (0) = 𝑥(0) + 𝑥(−2) = 1 + 0
=1
𝐶1 (0) = 𝑥(0) + 𝑥(−1) + 𝑥(−2)
=1

(Refer Slide Time: 03:13)

577
(Refer Slide Time: 04:04)

Similarly, 𝑐0 (1) = 0 and 𝐶1 (1) = 1. 𝐶0 (2) = 1 + 1 = 0 and 𝐶1 (2) = 0

(Refer Slide Time: 05:06)

578
(Refer Slide Time: 05:52)

(Refer Slide Time: 07:15)

Repeating for further time instants we get,

𝐶0 = 1001111001
𝐶1 = 1100011100

579
(Refer Slide Time: 08:42)

We had 10 information bits and have 20 coded bits thus

10 1
𝑟𝑎𝑡𝑒 = =
20 2

(Refer Slide Time: 10:36)

Let us now look at why these codes are called convolution codes.

𝐶0 = 𝑥(𝑗) + 𝑥(𝑗 − 2)
= 1. 𝑥(𝑗) + 0. 𝑥(𝑗 − 1) + 1. 𝑥(𝑗 − 2)

580

= ∑ 𝑥(𝑗 − 𝑘)ℎ(𝑘)
−∞
Where h(k) is given by,
1 𝑘 = 0,2
ℎ(𝑘) = {
0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
the coded sequence is given by convolution of 𝑥(𝑗) with a filter ℎ(𝑗) whose impulse response is
given as above.

(Refer Slide Time: 11:38)

(Refer Slide Time: 14:05)

For 𝐶1 (𝑗) we can write,

581
𝐶1 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 1) + 𝑥(𝑗 − 2)

= ∑ 𝑥(𝑗 − 𝑘)ℎ1 (𝑘)


−∞
ℎ1 (k) is given by,
1 𝑖𝑓 𝑘 = 0,1,2
ℎ1 (𝑘) = {
0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

(Refer Slide Time: 15:06)

(Refer Slide Time: 15:51)

The Codes are termed convolution codes since they are generated by convolution operation
of the input bitstream by convolution operation.

582
So, we will stop here and continue in the subsequent modules.

Thank you.

583
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 51
Matrix Representation of Convolutional Code, Generator Matrix,
Transform Domain Representation, Shift Register Architecture

Hello. Welcome to another module in this massive open online course. We are looking at
convolutional codes. Let us look at the matrix representation of a convolutional code.

(Refer Slide Time: 00:24)

In this module we will look at the concept of a generator matrix. Consider the input bit
sequence, 𝑥(0), 𝑥(1), 𝑥(2), 𝑥(3), … … … . We can represent the sequence using a row
vector, [𝑥(0) 𝑥(1) 𝑥(2) 𝑥(3) 𝑥(4) … … … … ].

584
(Refer Slide Time: 01:19)

(Refer Slide Time: 01:59)

Now let us write a matrix, with 𝑥 as it’s column, 𝐶0 𝐶1 are consecutive rows (see pict
above). The cell of a matrix corresponds to the coefficient of the row bit for the column of
the coded bit. For e.g.,

𝐶0 (2) = 1. 𝑥(0) + 0. 𝑥(1) + 1. 𝑥(2) + 0. 𝑥( ) … ..


Thus the column for bit 𝐶1 (2) will be [1 0 1 0 0 0 ……..]. Other cells can be similarly
filled.

585
(Refer Slide Time: 06:56)

Now observe the following block at the top in 𝐶0 (2) and 𝐶1 (2) columns

11
{0 1
11

If we observe the previous and the next column, we can see that the this block is slowly
shifting downwards as we move forward in time (in the previous columns, the coded bits
are hidden for earlier (negative) times).

(Refer Slide Time: 08:40)

586
IF we call the matrix so formed as the generation matrix, we can write

̅=𝑥
𝐶 ̅𝐺
̅
Where 𝐺 is the generation matrix? 𝐶 is the code output and 𝑥̅ is the input vector (bit stream)

(Refer Slide Time: 10:08)

Now we will look at the transform domain representation of codes.

(Refer Slide Time: 11:05)

We define a D transform as,

587

𝑗
∑ 𝑥(𝑗)𝐷
𝑗=0
(From a D transform we can obtain a z transform by substituting 𝐷 = 𝑧 −1 )

(Refer Slide Time: 13:14)

D transform 𝑋(𝐷) of 𝑥(𝑗) is given by

𝑋(𝐷) = ∑ 𝑥(𝑗 − 𝑘)𝐷 𝑗


𝑗=0

= ∑ 𝑥(𝑗)𝐷 𝑗
𝑗=0
(since x(j)=0 for j<0)

588
(Refer Slide Time: 14:28)

If we substitute 𝑗 − 𝑘 = 𝑗̃ then 𝑗 = 𝑗̃ + 𝑘.

𝑋(𝐷) = ∑ 𝑥(𝑗̃)𝐷 𝑗̃ +𝑘
𝑗̃ =0

= 𝐷 𝑘 ∑ 𝑥(𝑗̃)𝐷 𝑗̃
𝑗̃ =0
⇒ 𝑥(𝑗 − 𝑘) ←→ 𝐷 𝑘 𝑋(𝐷)

(Refer Slide Time: 15:52)

589
⇒ 𝑥(𝑗 − 1) → 𝐷𝑋(𝐷)

(Refer Slide Time: 16:55)

We now look at the D transform of the convolution codes.

𝐶0 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 2)


𝐶0 (𝐷 = 𝑋(𝐷) + 𝐷2 𝑋(𝐷)
= (1 + 𝐷 2 )𝑋(𝐷)
𝐶1 (𝐷) = 𝑋(𝐷) + 𝐷𝑋(𝐷) + 𝐷2 𝑋(𝐷)
= 𝑋(𝐷)(1 + 𝐷 + 𝐷 2 )

(Refer Slide Time: 17:48)

590
(Refer Slide Time: 19:02)

(Refer Slide Time: 19:28)

Convolution codes can be represented as shift register architecture. In this module we look
at a feed forward shift register architecture.

591
(Refer Slide Time: 21:28)

𝑥(𝑗 − 1) can simply be obtained by passing 𝑥(𝑗) through a delay filter and passing 𝑥(𝑗 −
1) through another delay block will give us 𝑥(𝑗 − 2). Then the required outputs of the
delay blocks and the input can be added to get the coded output. (see pict above)

(Refer Slide Time: 24:23)

So, we will stop here and look at other aspects in the subsequent modules.

Thank you very much.

592
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 52
State Diagram Representation of Convolutional Code, State Transitions, Example
of Code Generation using State Diagram

Hello, welcome to another module in this massive open online course. We are looking at
convolutional codes and we are look at various forms of representation convolutional
codes such as the generator matrix.

(Refer Slide Time: 00:55)

IN this module we will look at state diagram representation of a convolution code.

Consider the example that we have taken,

𝐶0 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 2)

𝐶1 (𝑗) = 𝑥(𝑗) + 𝑥(𝑗 − 1) + 𝑥(𝑗 − 2)

At any time instant we need knowledge of only 𝑥(𝑗 − 1) and 𝑥(𝑗 − 2), that is the previous
two bits (along with the current bit). Knowledge of earlier bits is not required.

593
(Refer Slide Time: 01:51)

(Refer Slide Time: 03:55)

The current state of the convolution code can therefore be taken as 𝑥(𝑗 − 1), 𝑥(𝑗 − 2).We
thus have four possible states.

𝑥(𝑗 − 1) 𝑥(𝑗 − 2)

0 0

0 1

594
1 0

1 1

(Refer Slide Time: 06:04)

(Refer Slide Time: 07:04)

595
(Refer Slide Time: 07:55)

(Refer Slide Time: 08:49)

To generate the output and the transition of the encoder, we need knowledge of the state
and the current bit at any given time instant.

596
(Refer Slide Time: 11:05)

The convolution code can be represented by a state transition diagram. The code, at any
time instant, receives the current bit and transitions into the next state. (the next state
depends on the current input bits). The mechanism of state transition can be represented
by a state transition diagram.

(Refer Slide Time: 12:35)

597
(Refer Slide Time: 14:27)

Now we will illustrate the state transition diagram of the example code that we have taken.

There will be four states, 00,01,10,11. If the current bit is one, then 00 transitions to 10, if
it is 0, it transitions to itself. Similarly, all the transitions can be sketched. (each will state
will have two arrows going out of it). The output is also represented with a slash (along
the current bit)

(Refer Slide Time: 17:00)

598
(Refer Slide Time: 22:52)

The above pict. Shows the complete state transition diagram, with the output.

(Refer Slide Time: 23:20)

599
(Refer Slide Time: 24:39)

(Refer Slide Time: 27:44)

600
(Refer Slide Time: 28:36)

Now we consider what happens for the input sequence, 1011001011. The coded bits and
the states are shown in bellow table,

Time Input State Output

0 1 00 11

1 0 10 01

2 1 01 00

3 1 10 10

4 0 11 10

5 0 01 11

6 1 00 11

601
7 0 10 01

8 1 01 00

9 1 10 10

11

(Refer Slide Time: 32:56)

602
(Refer Slide Time: 34:21)

The coded output is given by,

𝐶0 : 1 0 0 1 1 1 1 0 0 1

𝐶1 : 1 1 0 0 0 1 1 1 0 0

The output can be verified using individual binary field addition operation

(Refer Slide Time: 35:30)

603
(Refer Slide Time: 35:48)

Thank you very much.

604
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 53
Trellis Representation of Convolutional Code and Valid Code Words

Hello, welcome to another module in this massive open online course.

(Refer Slide Time: 00:57)

In this module we will introduce the concept of a trellis diagram, or a trellis-based


representation of a convolutional code. The trellis representation is derived from the state
diagram representation of a convolution code. A trellis based diagram explicitly shows the
time evolution of a convolution code.

605
(Refer Slide Time: 02:27)

(Refer Slide Time: 03:12)

For a trellis diagram (see pict above), the following all the states are written in a single
column and these columns are repeated in a line. Each repetition (column) represents a
time instant. Arrows are drawn from one column to the next using the state representation.
These arrows too, will be repeated (since the convolution code is same for all instants).

606
(Refer Slide Time: 13:15)

(Refer Slide Time: 14:32)

607
(Refer Slide Time: 17:43)

Note that any valid code sequence corresponds to a path on the trellis. For example, the state
transition 00 → 10 → 11 corresponds to a path, with codeword output 11 10 ….

Observe, however, that some sequences cannot be represented by a path. For example,
00 → 11 → 10 does not correspond to any path on the diagram; thus, it is not a legitimate
sequence of states, since no sequence of input bits can produce it. Similarly, 11 00 10 is not a
legitimate codeword, since no sequence of input bits will result in this codeword. A small check
of path legitimacy is sketched below.
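One way to check whether a given sequence of states is legitimate is to test every consecutive pair against the state transition diagram. A minimal sketch, again assuming the encoder inferred earlier; the function name is illustrative.

def step(state, bit):
    s1, s2 = state
    return (bit ^ s2, bit ^ s1 ^ s2), (bit, s1)

def is_legitimate(states):
    """True if every consecutive pair of states is connected by some input bit."""
    return all(any(step(s, b)[1] == nxt for b in (0, 1))
               for s, nxt in zip(states, states[1:]))

print(is_legitimate([(0, 0), (1, 0), (1, 1)]))  # 00 -> 10 -> 11: True, a valid path
print(is_legitimate([(0, 0), (1, 1), (1, 0)]))  # 00 -> 11 -> 10: False, not reachable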

608
(Refer Slide Time: 19:24)

(Refer Slide Time: 20:53)

609
(Refer Slide Time: 22:42)

(Refer Slide Time: 25:24)

610
(Refer Slide Time: 27:29)

(Refer Slide Time: 30:13)

611
(Refer Slide Time: 30:46)

At the receiver, we need to recover the input information bit sequence; this sequence
corresponds to a particular path on the trellis diagram. Thus, decoding a convolutional code
involves finding a path on the trellis diagram from the received code words.

Thank you very much.

612
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 54
Decoding of Convolutional Code, Minimum Hamming Distance, Maximum
Likelihood Codeword Estimate

Hello. Welcome to another module in this massive open online course

(Refer Slide Time: 00:24)

Suppose we receive the following code sequence,

r̅ = 11 11 10 01 10 10 11 = r̅_0 r̅_1 r̅_2 … r̅_6,

corresponding to the transmitted codeword sequence c̅_0 c̅_1 … c̅_6.

613
(Refer Slide Time: 02:20)

What we want to do now is decode the received word. However, the received word might not
be a valid codeword, due to bit flips caused by channel noise. In such a scenario, we want to
find the most probable codeword that was transmitted.

(Refer Slide Time: 05:26)

614
(Refer Slide Time: 07:11)

(Refer Slide Time: 07:47)

In order to find the most probable transmitted codeword, we use the concept of Hamming
distance introduced earlier. The Hamming distance between two codewords of equal length is
the number of bit positions in which they differ. For example, the Hamming distance between
001001 and 101000 is 2,

𝑑𝐻 = 2
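A minimal sketch of this definition (the function name hamming is just an illustrative choice):

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("001001", "101000"))  # prints 2, matching the example above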

615
(Refer Slide Time: 09:07)

(Refer Slide Time: 10:36)

The decoding principle of a maximum likelihood decoder is to find the codeword c̅ that
minimizes the Hamming distance to the received word, i.e.,

min_{c̅} d_H(r̅, c̅)

616
(Refer Slide Time: 13:08)

Further,

d_H(c̅, r̅) = d(c̅_0, r̅_0) + d(c̅_1, r̅_1) + d(c̅_2, r̅_2) + ⋯

= Σ_{i=0}^{N−1} d(c̅_i, r̅_i)

(Refer Slide Time: 15:08)

617
(Refer Slide Time: 17:32)

Observe that c̅_i corresponds to a branch in the trellis at the i-th stage. Therefore, finding the
minimum-distance codeword amounts to finding a path through the trellis for which the sum of
these branch metrics is minimum; a small numerical illustration follows.
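As a rough illustration, the snippet below splits the received sequence from this lecture and a candidate codeword (the one that will be decoded in a later module) into pairs, computes the per-branch Hamming distances, and sums them; the helper names are illustrative.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

r_pairs = "11 11 10 01 10 10 11".split()  # received pairs r0 ... r6 from this lecture
c_pairs = "11 10 10 00 10 10 11".split()  # a candidate codeword, one pair per trellis branch

branch_metrics = [hamming(c, r) for c, r in zip(c_pairs, r_pairs)]
print(branch_metrics)       # per-branch Hamming distances
print(sum(branch_metrics))  # total d_H(c, r) is the sum of the branch metrics (2 here)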

(Refer Slide Time: 18:08)

Thank you very much.

618
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 55
Principles of Decoding of Convolutional Code

Hello, welcome to another module in this massive open online course. In the previous few
modules, we discussed the state representation and the trellis representation of convolutional
codes. We also discussed that decoding the received sequence involves finding the most
probable codeword that was transmitted. In this module we will discuss two basic principles
underlying convolutional decoding.

(Refer Slide Time: 01:33)

619
(Refer Slide Time: 02:24)

(Refer Slide Time: 03:19)

A maximum likelihood decoder finds the minimum Hamming distance codeword, i.e.,
d_H(r̅, c̅) has to be minimized, where r̅ is the received word and c̅ is the decoded codeword.
Decoding involves finding a path through the trellis diagram, i.e., finding the sequence of
states of the encoder.

620
(Refer Slide Time: 04:15)

(Refer Slide Time: 06:38)

The first principle is as stated below.

Suppose the minimum-distance path from state s_0 to state s_j is given by the branch sequence
e_1 e_2 e_3 e_4 … e_(j−1) e_j. Then e_1 e_2 … e_(j−1) is the minimum-distance path between
states s_0 and s_(j−1).

621
(Refer Slide Time: 08:31)

The above claim can be proved by contradiction.

Suppose there exists an alternate path e'_1 e'_2 e'_3 … e'_(j−1) from s_0 to s_(j−1) whose total
distance is strictly smaller than that of e_1 e_2 … e_(j−1), while the minimum-distance path from
s_0 to s_j is e_1 e_2 e_3 … e_j. Then

d(e_1, e_2, …, e_j) = Σ_{i=1}^{j−1} d(e_i) + d(e_j) > Σ_{i=1}^{j−1} d(e'_i) + d(e_j),

i.e., the path e'_1 … e'_(j−1) followed by the branch e_j reaches s_j with a smaller total distance.
This contradicts the assumption that e_1 e_2 … e_j is the minimum-distance path to s_j. Hence,
no alternate path e'_1 e'_2 … with a lower distance can exist.

The consequence is that, for each node (state), we only need to store the information regarding
the minimum-distance path to it, not all possible paths. This is helpful because the total number
of paths grows exponentially with time.

622
(Refer Slide Time: 09:55)

(Refer Slide Time: 10:58)

623
(Refer Slide Time: 13:01)

(Refer Slide Time: 14:20)

624
(Refer Slide Time: 16:50)

(Refer Slide Time: 18:54)

625
(Refer Slide Time: 19:18)

(Refer Slide Time: 23:48)

The second principle can be stated as follows:

Suppose a node s_k can be reached from two nodes s_i and s_j, with branch distances d_ik and
d_jk respectively, and let d_i and d_j be the minimum distances from the starting state to s_i
and s_j. Then the minimum distance to s_k is given by

min{d_i + d_ik, d_j + d_jk}
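A minimal sketch of the add-compare-select step implied by these two principles (the function name and the numerical values are illustrative, not from the lecture):

def acs(candidates):
    """Add-compare-select: each candidate is (accumulated distance, branch distance, predecessor)."""
    acc, branch, pred = min(candidates, key=lambda t: t[0] + t[1])
    return acc + branch, pred

# Node s_k reachable from s_i (d_i = 2, branch d_ik = 1) and from s_j (d_j = 1, branch d_jk = 3).
d_k, survivor = acs([(2, 1, "s_i"), (1, 3, "s_j")])
print(d_k, survivor)  # 3 s_i, i.e. min{d_i + d_ik, d_j + d_jk}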

626
(Refer Slide Time: 25:18)

(Refer Slide Time: 26:34)

627
These principles, which may seem abstract here, are central to the implementation of the
Viterbi algorithm for decoding: they are used to find the minimum distance to each state at
time m, given the minimum distances to the states at time m − 1.

Thank you very much.

628
Principles of Communication Systems - Part II
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 56
Viterbi Decoder for Maximum Likelihood Decoding of Convolutional
Code Using Trellis Representation, Branch Metric Calculation,
State Metric Calculation, Example

Hello. Welcome to another module in this massive open online course. So, we have looked
at the principles of decoding convolutional codes.

(Refer Slide Time: 00:54)

In order to demonstrate the Viterbi decoder, we will consider an example. Suppose we have a
received message,

𝑟̅ = 11 11 10 01 10 10 11

629
(Refer Slide Time: 02:16)

We draw the trellis diagram corresponding to the above message and the convolutional code.

For the first time instant (j = 0), we set

𝑑𝑚𝑖𝑛 (00) = 0

𝑑min (𝑠) = ∞ 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

Then, for j = 1 we find the paths with minimum Hamming distance. For example, for the 00 to
00 path the branch output is 00 while the received pair is 11, so the Hamming distance for this
branch is 2 (the Hamming distance of each branch is written above it). For state (00) at j = 1
there are two paths leading into it; the shortest one comes from (00) at j = 0 (the other has a
distance of infinity). The shortest path into each node is marked with curly lines. The process
is repeated till j = 7, using the two principles discussed in the previous module: the minimum
distance to a node is obtained by adding each incoming branch distance to the minimum
distance of its predecessor node and taking the smaller result. A sketch of this procedure is
given below.
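Putting the pieces together, here is a hedged Python sketch of the forward pass and trace-back just described, assuming the generator taps inferred earlier in these notes; if those assumptions hold, it should reproduce the minimum distance d = 2 at state 00 and the decoded sequences reported on the following slides.

def step(state, bit):
    """Assumed encoder: c0 = x(j) XOR x(j-2), c1 = x(j) XOR x(j-1) XOR x(j-2)."""
    s1, s2 = state
    return (bit ^ s2, bit ^ s1 ^ s2), (bit, s1)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def viterbi(received_pairs):
    """Hard-decision Viterbi decoding over the trellis of the assumed encoder."""
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    INF = float("inf")
    metric = {s: (0 if s == (0, 0) else INF) for s in states}  # d_min(00) = 0, infinity otherwise
    history = []                                               # per stage: state -> (pred, bit, output)

    for r in received_pairs:
        new_metric, survivors = {}, {}
        for s in states:                                       # add-compare-select at each state
            best_d, best_edge = INF, None
            for prev in states:
                for b in (0, 1):
                    out, nxt = step(prev, b)
                    d = metric[prev] + hamming(out, r)
                    if nxt == s and d < best_d:
                        best_d, best_edge = d, (prev, b, out)
            new_metric[s], survivors[s] = best_d, best_edge
        metric = new_metric
        history.append(survivors)

    # Trace back from the minimum-metric final state (state 00 on the slides).
    state = min(metric, key=metric.get)
    best_d = metric[state]
    bits, code = [], []
    for survivors in reversed(history):
        prev, b, out = survivors[state]
        bits.append(b)
        code.append(out)
        state = prev
    return best_d, bits[::-1], code[::-1]

r = [(1, 1), (1, 1), (1, 0), (0, 1), (1, 0), (1, 0), (1, 1)]   # 11 11 10 01 10 10 11
d, info_bits, codeword = viterbi(r)
print(d)          # expected: 2
print(info_bits)  # expected: [1, 1, 0, 1, 1, 0, 0]
print(codeword)   # expected: [(1, 1), (1, 0), (1, 0), (0, 0), (1, 0), (1, 0), (1, 1)]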

630
(Refer Slide Time: 07:38)

(Refer Slide Time: 10:14)

631
(Refer Slide Time: 11:35)

(Refer Slide Time: 19:24)

632
(Refer Slide Time: 26:15)

(Refer Slide Time: 30:58)

633
(Refer Slide Time: 33:46)

At the final instant, the minimum distance is d = 2, attained at state (00). We therefore trace
back our path from this node. The rectangles in the figure mark the minimum-distance path:
from the end point we follow the surviving (minimum-distance) branch at each state, and the
corresponding branch outputs give the decoded codeword, while the branch input bits give the
decoded information sequence.

(Refer Slide Time: 34:31)

The decoded codeword we then obtain is,

𝑐̅ = 11 10 10 00 10 10 11

634
Comparing this to the received sequence,

𝑟̅ = 11 11 10 01 10 10 11

we can see that two bits are mismatched (marked in the figure). Thus, the Hamming distance
between the received sequence and the decoded codeword is 2,

𝑑𝐻 (𝑐̅, 𝑟̅ ) = 2

(Refer Slide Time: 41:04)

The maximum likelihood information bit sequence decoded is then

1 1 0 1 1 0 0

Thank you very much.

635
THIS BOOK IS NOT FOR SALE
NOR COMMERCIAL USE

(044) 2257 5905/08 nptel.ac.in swayam.gov.in
