Speech Coding
7.1 Introduction
In mobile communication systems, the design and subjective testing of speech
coders has been extremely difficult. Without low data rate speech coding, digital
modulation schemes offer little in the way of spectral efficiency for voice traffic.
To make speech coding practical, implementations must consume little power
and provide tolerable, if not excellent, speech quality.
The goal of all speech coding systems is to transmit speech with the highest
possible quality using the least possible channel capacity. This has to be accom-
plished while maintaining certain required levels of complexity of implementa-
tion and communication delay. In general, there is a positive correlation between
coder bit-rate efficiency and the algorithmic complexity required to achieve it.
The more complex an algorithm is, the greater its processing delay and cost of
implementation. A balance needs to be struck between these conflicting factors,
and it is the aim of all speech processing developments to shift the point at which
this balance is made towards ever lower bit rates [Jay92].
The hierarchy of speech coders is shown in Figure 7.1. The principles used
to design and implement the speech coding techniques in Figure 7.1 are
described throughout this chapter.
Figure 7.1
Hierarchy of speech coders (courtesy of R.Z. Zaputowycz).
Vocoders, in contrast to waveform coders, are in general more complex. They are
based on using a priori knowledge about the signal to be coded, and for this
reason, they are, in general, signal specific.
7.2 Characteristics of Speech Signals

Probability Density Function (PDF) — The long-term pdf of speech
amplitudes is typically approximated by a two-sided (Laplacian) exponential
density:

p(x) = \frac{1}{\sqrt{2}\,\sigma_x} \exp\left(-\frac{\sqrt{2}\,|x|}{\sigma_x}\right)    (7.1)
Note that this pdf shows a distinct peak at zero which is due to the exist-
ence of frequent pauses and low level speech segments. Short-time pdfs of speech
segments are also single-peaked functions and are usually approximated as a
Gaussian distribution.
Nonuniform quantizers, including vector quantizers, attempt to match
the distribution of quantization levels to the pdf of the input speech signal
by allocating more quantization levels in regions of high probability and
fewer levels in regions where the probability is low.
Autocorrelation Function (ACF) — Another very useful property of
speech signals is that there exists much correlation between adjacent samples of
a segment of speech. This implies that in every sample of speech, there is a large
component that is easily predicted from the value of the previous samples with a
small random error. All differential and predictive coding schemes are based on
exploiting this property.
The autocorrelation function (ACF) gives a quantitative measure of the
closeness or similarity between samples of a speech signal as a function of their
time separation. This function is mathematically defined as [Jay84]
C(k) = \frac{1}{N} \sum_{n=0}^{N-|k|-1} x(n)\, x(n+|k|)    (7.2)
A related measure is the spectral flatness measure (SFM), defined as the ratio
of the arithmetic mean to the geometric mean of the samples of the power
spectral density (PSD):

\mathrm{SFM} = \frac{\frac{1}{N}\sum_{k=1}^{N} S_k}{\left(\prod_{k=1}^{N} S_k\right)^{1/N}}

where S_k is the k th frequency sample of the PSD of the speech signal. Typically,
speech signals have a long-term SFM value of 8 and a short-time SFM value
varying widely between 2 and 500.
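As a concrete illustration (not from the original text), the short-time ACF of
equation (7.2) and a periodogram-based SFM can be computed in a few lines of
Python; the frame length and test signal here are arbitrary choices:

import numpy as np

def autocorrelation(x, max_lag):
    """Short-time ACF of eq. (7.2): C(k) = (1/N) * sum x(n) x(n+|k|)."""
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) / N for k in range(max_lag + 1)])

def spectral_flatness(x):
    """SFM: arithmetic mean of the PSD samples S_k over their geometric mean."""
    S = np.abs(np.fft.rfft(x)) ** 2      # periodogram estimate of the PSD
    S = S[S > 0]                         # guard the logarithm against zeros
    return np.mean(S) / np.exp(np.mean(np.log(S)))

frame = np.random.randn(160)             # one 20 ms frame at 8 kHz sampling
print(autocorrelation(frame, 3))
print(spectral_flatness(frame))          # small for noise-like frames, large for tonal frames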
7.3 Quantization Techniques

The distortion introduced by a quantizer is often modeled as additive quantization
noise, and the performance of a quantizer is measured as the output
signal-to-quantization noise ratio,

\mathrm{SQNR} = \frac{E[x^2(t)]}{E\left[(x(t) - \hat{x}(t))^2\right]}

where x(t) represents the original speech signal, and \hat{x}(t) represents the
quantized speech signal. A pulse code modulation (PCM) coder is basically a
quantizer of sampled speech amplitudes.
PCM coding, using 8 bits per sample at a sampling frequency of 8 kHz, was the
first digital coding standard adopted for commercial telephony. The SQNR of a
PCM encoder is related to the number of bits used for encoding through the fol-
lowing relation:
(SQNR)dB = 6.02n + a (7.5)
where a = 4.77 for peak SQNR and a = 0 for the average SQNR. The above equa-
tion indicates that with every additional bit used for encoding, the output SQNR
improves by 6 dB.
For a nonuniform quantizer, the mean square distortion is given by

D = E\left[(x - f_Q(x))^2\right] = \int_{-\infty}^{\infty} (x - f_Q(x))^2\, p(x)\, dx    (7.6)
where f_Q(x) is the output of the quantizer. From the above equation, it is clear
that the total distortion can be reduced by decreasing the quantization noise,
[x - f_Q(x)]^2, where the pdf p(x) is large. This means that quantization levels
need to be concentrated in amplitude regions of high probability.
To design an optimal nonuniform quantizer, we need to determine the
quantization levels which will minimize the distortion of a signal with a given
pdf. The Lloyd-Max algorithm provides a method to determine the opti-
mum quantization levels by iteratively changing the quantization levels in a
manner that minimizes the mean square distortion.
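A minimal sketch of the Lloyd-Max iteration, operating on an empirical set of
samples standing in for the pdf; the Laplacian test distribution and all names
here are illustrative assumptions, not part of the original text:

import numpy as np

def lloyd_max(samples, n_levels, iters=50):
    """Iteratively refine quantizer levels to minimize mean square distortion."""
    # start from uniformly spaced levels over the sample range
    levels = np.linspace(samples.min(), samples.max(), n_levels)
    for _ in range(iters):
        # decision boundaries: midpoints between adjacent output levels
        bounds = (levels[:-1] + levels[1:]) / 2
        region = np.digitize(samples, bounds)      # assign samples to regions
        # output levels: centroid (conditional mean) of each region
        for i in range(n_levels):
            if np.any(region == i):
                levels[i] = samples[region == i].mean()
    return levels

speech_like = np.random.laplace(scale=1.0, size=10000)  # Laplacian amplitudes
print(lloyd_max(speech_like, 4))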
A simple and robust implementation of a nonuniform quantizer used in
commercial telephony is the logarithmic quantizer. This quantizer uses fine
quantization steps for the frequently occurring low amplitudes in speech and
uses much coarser steps for the less frequent, large amplitude excursions. Differ-
ent companding techniques known as μ-law and A-law companding are used in
the U.S. and Europe, respectively.
Nonuniform quantization is obtained by first passing the analog speech sig-
nal through a compression (logarithmic) amplifier, and then passing the com-
pressed speech into a standard uniform quantizer. In U.S. μ-law companding,
weak speech signals are amplified whereas strong speech signals are compressed.
Let the speech voltage level into the compander be w(t) and the speech output
voltage be v_o(t). Following [Smi57],
|v_o(t)| = \frac{\ln(1 + \mu |w(t)|)}{\ln(1 + \mu)}    (7.7)

where μ is a positive constant and has a value typically between 50 and 300. The
peak value of w(t) is normalized to 1.
In Europe, A-law companding is used [Cat69], and is defined by

|v_o(t)| = \frac{A |w(t)|}{1 + \ln A}, \qquad 0 \le |w(t)| \le \frac{1}{A}    (7.8)

|v_o(t)| = \frac{1 + \ln(A |w(t)|)}{1 + \ln A}, \qquad \frac{1}{A} \le |w(t)| \le 1
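Equations (7.7) and (7.8) translate directly into code for a normalized input
|w(t)| ≤ 1. A sketch; μ = 255 and A = 87.6 are the values standardized for
telephony, though the text above only constrains μ to lie between 50 and 300:

import numpy as np

def mu_law(w, mu=255.0):
    """mu-law compression, eq. (7.7): |v| = ln(1 + mu|w|) / ln(1 + mu)."""
    return np.sign(w) * np.log1p(mu * np.abs(w)) / np.log1p(mu)

def a_law(w, A=87.6):
    """A-law compression, eq. (7.8), defined piecewise on |w|."""
    aw = A * np.abs(w)
    small = np.abs(w) < 1.0 / A
    out = np.where(small, aw / (1 + np.log(A)),
                   (1 + np.log(np.maximum(aw, 1.0))) / (1 + np.log(A)))
    # np.maximum guards log() in the branch that is discarded by np.where
    return np.sign(w) * out

w = np.linspace(-1, 1, 5)
print(mu_law(w))
print(a_law(w))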
Example 7.1
Let the input signal to a quantizer have a probability density function (pdf) as
shown in Figure E7.1. Assume the quantization levels to be {1, 3, 5, 7}. Compute
the mean square error distortion at the quantizer output and the output
signal-to-distortion ratio. How would you change the distribution of quantization
levels to decrease the distortion? For what input pdf would this quantizer
be optimal?
Figure E7.1
PDF of the input signal (rising linearly from zero at x = 0 to a peak of 1/4 at x = 8).

Solution to Example 7.1
From Figure E7.1, the pdf of the input signal can be recognized as:
p(x) = x/32, 0 ≤ x ≤ 8
p(x) = 0 elsewhere
Given the quantization levels {1, 3, 5, 7}, we can define the quantization
boundaries as {0, 2, 4, 6, 8}. The mean square distortion is

D = \int_0^2 (x-1)^2 p(x)\,dx + \int_2^4 (x-3)^2 p(x)\,dx + \int_4^6 (x-5)^2 p(x)\,dx + \int_6^8 (x-7)^2 p(x)\,dx = \frac{1}{3}

The signal power is E[x^2] = \int_0^8 x^2 (x/32)\,dx = 32, so the output
signal-to-distortion ratio is

10\log_{10}\left(\frac{32}{1/3}\right) = 10\log_{10} 96 = 19.82\ \text{dB}
To minimize the distortion we need to concentrate the quantization levels in
regions of higher probability. Since the input signal has a greater probability of
higher amplitude levels than lower amplitudes, we need to place the quantization
levels closer together (i.e., more quantization levels) at amplitudes close to 8 and
farther apart (i.e., fewer quantization levels) at amplitudes close to zero.
Since this quantizer has quantization levels uniformly distributed, this would
be optimal for an input signal with a uniform pdf.
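A quick numerical cross-check of this example, integrating the pdf
p(x) = x/32 read from Figure E7.1 (the grid resolution is arbitrary):

import numpy as np

x = np.linspace(0, 8, 800001)
p = x / 32.0                                  # pdf of Figure E7.1
levels = np.array([1, 3, 5, 7])
q = levels[np.digitize(x, [2, 4, 6])]         # nearest quantizer level
D = np.trapz((x - q) ** 2 * p, x)             # mean square distortion
Sx = np.trapz(x ** 2 * p, x)                  # signal power E[x^2]
print(D, 10 * np.log10(Sx / D))               # ~0.333 and ~19.82 dB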
Figure 7.2
Adaptive quantizer characteristics (a) when the input signal has a low amplitude swing and (b) when
the input signal has a large amplitude swing.
7.4 Adaptive Differential Pulse Code Modulation (ADPCM)

Adaptive differential pulse code modulation allows speech to be encoded at a bit
rate of 32 kbps, half the standard 64 kbps PCM rate, while providing comparable
voice quality. Efficient algorithms for ADPCM have been developed and standardized.
The CCITT standard G.721 ADPCM algorithm for 32 kbps speech coding
is used in cordless telephone systems like CT2 and DECT.
In a differential PCM scheme, the encoder quantizes a succession of adja-
cent sample differences, and the decoder recovers an approximation to the origi-
nal speech signal by essentially integrating quantized adjacent sample
differences. Since the quantization error variance for a given number of bits/
sample R, is directly proportional to the input variance, the reduction obtained
in the quantizer input variance leads directly to a reduction of reconstruction
error variance for a given value of R.
In practice, ADPCM encoders are implemented using signal prediction
techniques. Instead of encoding the difference between adjacent samples, a lin-
ear predictor is used to predict the current sample. The difference between the
predicted and actual sample, called the prediction error, is then encoded for
transmission. Prediction is based on the knowledge of the autocorrelation properties
of speech.
Figure 7.3 shows a simplified block diagram of an ADPCM encoder used in
the CT2 cordless telephone system [Det89]. This encoder consists of a quantizer
that maps the input signal sample onto a 4-bit output sample. The ADPCM
encoder makes best use of the available dynamic range of 4 bits by varying its
step size in an adaptive manner. The step size of the quantizer depends on the
dynamic range of the input which is speaker dependent and varies with time.
The adaptation is, in practice, achieved by normalizing the input signals via a
scaling factor derived from a prediction of the dynamic range of the current
input. This prediction is obtained from two components: a fast component for sig-
nals with rapid amplitude fluctuations and a slow component for signals that
vary more slowly. The two components are weighted to give a single quantization
scaling factor. It should be noted that the two feedback signals that drive the
algorithm, ŝ(k), the estimate of the input signal, and y(k), the quantization
scaling factor, are ultimately derived solely from I(k), the transmitted 4-bit
ADPCM signal. The ADPCM encoder at the transmitter and the ADPCM
decoder, at the receiver, are thus driven by the same control signals, with decod-
ing simply the reverse of encoding.
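The adaptive scale-factor idea can be sketched in a toy encoder. This is
illustrative only: the first-order predictor and the fast/slow adaptation constants
below are invented for the example and are not those of the G.721 standard
or the CT2 codec:

import numpy as np

def adpcm_encode(x, levels=16, beta=0.9, fast=0.9, slow=0.999):
    """Toy ADPCM: quantize the prediction error with an adaptive scale factor."""
    s_hat, y_fast, y_slow = 0.0, 1.0, 1.0
    codes = []
    for sample in x:
        y = 0.5 * y_fast + 0.5 * y_slow             # combined scaling factor
        e = sample - s_hat                          # prediction error
        code = int(np.clip(round(e / y), -levels // 2, levels // 2 - 1))
        codes.append(code)                          # 4-bit transmitted symbol
        eq = code * y                               # quantized error
        s_hat = beta * (s_hat + eq)                 # first-order predictor update
        mag = abs(eq) + 1e-6
        y_fast = fast * y_fast + (1 - fast) * mag   # tracks rapid fluctuations
        y_slow = slow * y_slow + (1 - slow) * mag   # tracks slow variations
    return codes                                    # decoder can mirror all state

print(adpcm_encode(np.sin(np.arange(80) / 5.0))[:10])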
Example 7.2
In an adaptive PCM system for speech coding, the input speech signal is sam-
pled at 8 kHz, and each sample is represented by 8 bits. The quantizer step
size is recomputed every 10 ms, and it is encoded for transmission using 5 bits.
Compute the transmission bit rate of such a speech coder. What would be the
average and peak SQNR of this system?
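Solution to Example 7.2
The speech samples generate 8000 samples/s x 8 bits = 64,000 bps. The quantizer
step size adds 5 bits every 10 ms, i.e., 500 bps, so the transmission bit rate is
64.5 kbps. From equation (7.5) with n = 8, the average SQNR is
6.02 x 8 = 48.16 dB, and the peak SQNR is 48.16 + 4.77 = 52.93 dB.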
Figure 7.3
Block diagram of ADPCM encoder.
7.5 Frequency Domain Coding of Speech

In frequency domain coding, the speech signal is divided into a set of frequency
components which are quantized and encoded separately according to some
perceptual criteria for each band; hence the quantization noise can be contained
within bands and prevented from creating harmonic distortions outside the band.
These schemes have the advantage that the number of bits used to encode each
frequency component can be dynamically varied and shared among the different
bands.
Many frequency domain coding algorithms, ranging from simple to complex,
are available. The most common types of frequency domain coding include sub-
band coding (SBC) and block transform coding. While a sub-band coder divides
the speech signal into many smaller sub-bands and encodes each sub-band sepa-
rately according to some perceptual criterion, a transform coder codes the short-
time transform of a windowed sequence of samples and encodes them with num-
ber of bits proportional to its perceptual significance.
A typical partition into four contiguous sub-bands is:

Sub-band 1: 200 - 700 Hz
Sub-band 2: 700 - 1310 Hz
Sub-band 3: 1310 - 2020 Hz
Sub-band 4: 2020 - 3200 Hz
Another way to split the speech band would be to divide it into equal width
sub-bands and assign to each sub-band a number of bits proportional to its
perceptual significance while encoding it. Instead of partitioning into equal width
bands, octave band splitting is often employed. As the human ear has an
exponentially decreasing sensitivity to frequency, this kind of splitting is more in
tune with the perception process.
There are various methods for processing the sub-band signals. One obvious
way is to make a low pass translation of the sub-band signal to zero frequency
by a modulation process equivalent to single sideband modulation. This kind of
translation facilitates sampling rate reduction and possesses other benefits that
accrue from coding low-pass signals. Figure 7.4 shows a simple means of
achieving this low pass translation. The input signal is filtered with a bandpass
filter of width W_n for the n th band, where f_n is the lower edge of the band and
f_n + W_n is the upper edge. The resulting signal s_n(t) is modulated by a cosine
wave \cos(2\pi f_n t) and filtered using a low pass filter h_n(t) with bandwidth
(0 - W_n). The resulting signal r_n(t) corresponds to the low pass translated
version of s_n(t) and can be expressed as

r_n(t) = \left[s_n(t)\cos(2\pi f_n t)\right] \otimes h_n(t)    (7.10)

where \otimes denotes a convolution operation. The signal r_n(t) is sampled at a rate
of 2W_n. This signal is then digitally encoded and multiplexed with encoded signals
from other channels as shown in Figure 7.4. At the receiver the data is demultiplexed
into separate channels, decoded, and bandpass translated to give the estimate
of s_n(t) for the n th channel.
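A sketch of the low pass translation of equation (7.10) for one sub-band in
Python; the band edges, filter length, and test tone are arbitrary illustrative
choices:

import numpy as np
from scipy.signal import firwin, lfilter

fs = 8000.0                      # sampling rate (illustrative)
fn, Wn = 1000.0, 500.0           # lower band edge and sub-band width (assumed)
t = np.arange(2048) / fs
s_n = np.cos(2 * np.pi * 1200.0 * t)         # a tone inside the band [fn, fn + Wn]

h_n = firwin(129, Wn, fs=fs)                 # low pass filter with passband (0 - Wn)
r_n = lfilter(h_n, 1.0, s_n * np.cos(2 * np.pi * fn * t))  # modulate, then filter

decim = int(fs // (2 * Wn))                  # resample r_n(t) at the reduced rate 2*Wn
r_sampled = r_n[::decim]
print(len(r_sampled))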
The low pass translation technique is straightforward and takes advantage
of a bank of nonoverlapping bandpass filters. Unfortunately, unless we use
sophisticated bandpass filters, this approach will lead to perceptible aliasing
effects. Esteban and Galand proposed [Est77] a scheme which avoids this inconvenience
even with quasi-perfect sub-band splitting. Filter banks known as
quadrature mirror filters (QMF) are used to achieve this. By designing a set of
mirror filters which satisfy certain symmetry conditions, it is possible to obtain
perfect alias cancellation. This facilitates the implementation of sub-band coding
without the use of very high order filters. This is particularly attractive for real
time implementation as a reduced filter order means a reduced computational
load and also a reduced latency.
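A minimal two-band QMF sketch using the length-2 Haar pair, which satisfies
the mirror symmetry condition exactly. Real sub-band coders use much longer
filters; this only demonstrates the alias-cancelling perfect reconstruction idea:

import numpy as np

x = np.random.randn(64)                      # even-length input frame

# analysis: length-2 Haar QMF pair, filter and decimate by 2
lo = (x[0::2] + x[1::2]) / np.sqrt(2)        # low band, half rate
hi = (x[0::2] - x[1::2]) / np.sqrt(2)        # mirrored high band, half rate

# synthesis: the mirror symmetry makes the alias terms cancel exactly
y = np.empty_like(x)
y[0::2] = (lo + hi) / np.sqrt(2)
y[1::2] = (lo - hi) / np.sqrt(2)

print(np.allclose(x, y))                     # True: perfect reconstruction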
Sub-band coding can be used for coding speech at bit rates in the range 9.6
to 32 kbps. In this range, speech quality is roughly equivalent to that of
ADPCM at an equivalent bit rate. In addition, its complexity and relative speech
quality at low bit rates make it particularly advantageous for coding below about
16 kbps. However, the increased complexity of sub-band coding when compared
to other higher bit rate techniques does not warrant its use at bit rates greater
than about 20 kbps. The CD-900 cellular telephone system uses sub-band coding
for speech compression.
Example 7.3
Consider a sub-band coding scheme where the speech bandwidth is partitioned
into four bands. The table below gives the corner frequencies of each band along
with the number of bits used to encode each band. Assuming that no side information
need be transmitted, compute the minimum encoding rate of this SBC
coder.
Figure 7.4
Block diagram of a sub-band coder and decoder.
1000 - 1500 Hz: 2 bits
1800 - 2700 Hz: 1 bit
Band 4: 1 bit
Solution to Example 7.3
Given:
Number of sub-bands = N =4
In adaptive transform coding, the discrete cosine transform (DCT) of each block
of N speech samples is computed as

X_c(k) = g(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right]    (7.11)

where g(0) = 1 and g(k) = \sqrt{2} for k = 1, 2, \ldots, N-1. The inverse DCT is
defined as:

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} g(k)\, X_c(k) \cos\left[\frac{(2n+1)k\pi}{2N}\right]    (7.12)
In practical situations the DCT and IDCT are not evaluated directly using
the above equations. Fast algorithms developed for computing the DCT in a
computationally efficient manner are used.
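In code, a library's fast DCT is called rather than evaluating equations (7.11)
and (7.12) term by term. A sketch using scipy, whose orthonormal scaling differs
from the g(k) convention above only by constant factors:

import numpy as np
from scipy.fft import dct, idct

frame = np.random.randn(256)                   # one block of N speech samples
X = dct(frame, type=2, norm='ortho')           # forward DCT (fast, O(N log N))
rec = idct(X, type=2, norm='ortho')            # inverse DCT
print(np.allclose(frame, rec))                 # True

# adaptive bit allocation would now quantize X, spending more bits
# on the perceptually significant (high energy) coefficients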
Most of the practical transform coding schemes vary the bit allocation
among different coefficients adaptively from frame to frame while keeping the
total number of bits constant. This dynamic bit allocation is controlled by time-varying
statistics which have to be transmitted as side information. This constitutes
an overhead of about 2 kbps. The frame of N samples to be transformed or
inverse-transformed is accumulated in the buffer in the transmitter and receiver,
respectively. The side information is also used to determine the step size of the
various coefficient quantizers. In a practical system, the side information transmitted
is a coarse representation of the log-energy spectrum. This typically consists
of L frequency points, where L is in the range 15-20, computed from the
transform coefficients.
7.6 Vocoders
Vocoders are a class of speech coding systems that analyze the voice signal
at the transmitter, transmit parameters derived from the analysis, and then syn-
thesize the voice at the receiver using those parameters. All vocoder systems
attempt to model the speech generation process as a dynamic system and try to
quantify certain physical constraints of the system. These physical constraints
are used to provide a parsimonious description of the speech signal. Vocoders are
in general much more complex than the waveform coders and achieve very high
economy in transmission bit rate. However, they are less robust, and their per-
formance tends to be talker dependent. The most popular among the vocoding
systems is the linear predictive coder (LPC). The other vocoding schemes include
the channel vocoder, formant vocoder, cepstrum vocoder and voice excited
vocoder.
Figure 7.5 shows the traditional speech generation model that is the basis
of all vocoding systems [F1a791. The sound generating mechanism forms the
source and is linearly separated from the intelligence modulating vocal tract fil-
ter which forms the system. The speech signal is assumed to be of two types:
voiced and unvoiced. Voiced sound (5n", pronunciations) are a result of
quasiperiodic vibrations of the vocal chord and unvoiced sounds (7', "s", "sh" pro-
nunciations) are fricatives produced by turbulent air flow through a constriction.
The parameters associated with this model are the voice pitch, the pole frequen-
cies of the modulating filter, and the corresponding amplitude parameters. The
pitch frequency for most speakers is below 300 Hz, and extracting this informa-
tion from the signal is very difficult. The pole frequencies correspond to the reso-
nant frequencies of the vocal tract and are often called the formants of the
speech signal. For adult speakers, the formants are centered around 500 Hz,
1500 Hz, 2500 Hz, and 3500 Hz. By meticulously adjusting the parameters of the
speech generation model, good quality speech can be synthesized.
Figure 7.5
Speech generation model: pulse and noise sources excite a vocal tract filter to produce the speech output.
7.6.1 The Channel Vocoder

The channel vocoder determines the energy of the speech signal in a number of
frequency bands, updating the measurements typically every 10 to 30 ms. Along
with the energy information about each band, the voiced/unvoiced decision and
the pitch frequency for voiced speech are also transmitted.
7.6.3 The Cepstrum Vocoder

The cepstrum vocoder separates the excitation and vocal tract spectrum by
inverse Fourier transforming the log magnitude spectrum to produce the cepstrum
of the signal. The low frequency coefficients in the cepstrum correspond to
the vocal tract spectral envelope, with the high frequency excitation coefficients
forming a periodic pulse train at multiples of the sampling period. Linear filtering
is performed to separate the vocal tract cepstral coefficients from the excitation
coefficients. In the receiver, the vocal tract cepstral coefficients are Fourier
transformed to produce the vocal tract impulse response. By convolving this
impulse response with a synthetic excitation signal (random noise or periodic
pulse train), the original speech is reconstructed.
7.7 Linear Predictive Coders

Linear predictive coders model the vocal tract as an all-pole filter with transfer
function

H(z) = \frac{G}{1 + \sum_{i=1}^{p} b_i z^{-i}}    (7.13)

where G is the gain of the filter and z^{-1} represents a unit delay operation. The
excitation to this filter is either a pulse at the pitch frequency or random white
noise depending on whether the speech segment is voiced or unvoiced. The coeffi-
cients of the all pole filter are obtained in the time domain using linear predic-
tion techniques [Mak75]. The prediction principles used are similar to those in
ADPCM coders. However, instead of transmitting quantized values of the error
signal representing the difference between the predicted and actual waveform,
the LPC system transmits only selected characteristics of the error signal. The
parameters include the gain factor, pitch information, and the voiced/unvoiced
decision information, which allow approximation of the correct error signal. At
the receiver, the received information about the error signal is used to determine
the appropriate excitation for the synthesis filter. That is, the error signal is the
excitation to the decoder. The synthesis filter is designed at the receiver using
the received predictor coefficients. In practice, many LPC coders transmit the fil-
ter coefficients which already represent the error signal and can be directly syn-
thesized by the receiver. Figure 7.6 shows a block diagram of an LPC system
[Jay86].
Determination of Predictor Coefficients — The linear predictive coder
uses a weighted sum of p past samples to estimate the present sample, where p
is typically in the range of 10-15. Using this technique, the current sample s_n
can be written as a linear sum of the immediately preceding samples s_{n-k}:

s_n = \sum_{k=1}^{p} a_k s_{n-k} + e_n    (7.14)
where e_n is the prediction error (residual). The predictor coefficients are calcu-
lated to minimize the average energy E in the error signal that represents the
difference between the predicted and actual speech amplitude.
E = \sum_{n} e_n^2 = \sum_{n}\left[s_n - \sum_{k=1}^{p} a_k s_{n-k}\right]^2    (7.15)

Setting the partial derivatives to zero,

\frac{\partial E}{\partial a_m} = 0, \qquad m = 1, 2, \ldots, p    (7.16)

leads to

\sum_{k=0}^{p} a_k \sum_{n} s_{n-k}\, s_{n-m} = 0, \qquad a_0 = -1    (7.17)

The inner summation can be recognized as the correlation coefficient C_{km},
and hence the above equation can be rewritten as

\sum_{k=0}^{p} a_k C_{km} = 0    (7.18)
After determining the correlation coefficients C_{km}, equation (7.18) can be
used to determine the predictor coefficients. Equation (7.18) is often expressed in
matrix notation, and the predictor coefficients are calculated using matrix inversion.
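Because the matrix form of equation (7.18) is Toeplitz, the Levinson-Durbin
recursion solves for the predictor coefficients in O(p²) operations and produces
the reflection coefficients discussed below as a by-product. A sketch, assuming
the autocorrelation method with a rectangular window (variable names are mine):

import numpy as np

def levinson_durbin(r, p):
    """Solve the LPC normal equations given the autocorrelation sequence r[0..p].

    Returns (a, R): predictor coefficients a[0..p] (a[0] = 1 convention)
    and the reflection coefficients R[0..p-1].
    """
    a = np.zeros(p + 1)
    a[0] = 1.0
    R = np.zeros(p)                        # reflection coefficients
    err = r[0]                             # prediction error energy
    for m in range(1, p + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                     # new reflection coefficient
        R[m - 1] = k
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        err *= (1 - k * k)                 # error shrinks each order
    return a, R

x = np.random.randn(240)                   # one analysis frame
p = 10
r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
a, R = levinson_durbin(r, p)
print(np.all(np.abs(R) < 1))               # stable model: |R(k)| < 1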
A number of algorithms, such as the Levinson-Durbin recursion sketched above,
have been developed to speed up the calculation of predictor coefficients.

Figure 7.6
Block diagram of an LPC coding system.

Normally, the predictor coefficients are not coded directly, as they would require
8 bits to 10 bits per coefficient for accurate representation [Del93]. The accuracy
requirements are lessened by transmitting the reflection
coefficients (a closely related parameter) which have a smaller dynamic range.
These reflection coefficients can be adequately represented by 6 bits per coefficient.
Thus, for a 10th order predictor, the total number of bits assigned to the
model parameters per frame is 72, which includes 5 bits for a gain parameter
and 6 bits for the pitch period. If the parameters are estimated every 15 ms to 30
ms, the resulting bit rate is in the range of 2400 bps to 4800 bps. The coding of
the reflection coefficient can be further improved by performing a nonlinear
transformation of the coefficients prior to coding. This nonlinear transformation
reduces the sensitivity of the reflection coefficients to quantization errors. This is
normally done through a log-area ratio (LAR) transform which performs an
inverse hyperbolic tangent mapping of the reflection coefficients, R(k):

\mathrm{LAR}(k) = \tanh^{-1}[R(k)] = \log_{10}\left[\frac{1 + R(k)}{1 - R(k)}\right]    (7.19)
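Equation (7.19) is a single transcendental mapping; a minimal sketch:

import numpy as np

def lar(R):
    """Log-area ratio of reflection coefficients, eq. (7.19)."""
    return np.log10((1 + R) / (1 - R))    # equivalently 2*arctanh(R)/ln(10)

print(lar(np.array([-0.9, 0.0, 0.9])))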
Various LPC schemes differ in the way they recreate the error signal (exci-
tation) at the receiver. Three alternatives are shown in Figure 7.7 [Luc89]. The
first one shows the most popular means. It uses two sources at the receiver, one
of white noise and the other with a series of pulses at the current pitch rate. The
selection of either of these excitation methods is based on the voiced/unvoiced
decision made at the transmitter and communicated to the receiver along with
the other information. This technique requires that the transmitter extract pitch
frequency information which is often very difficult. Moreover, the phase coher-
ence between the harmonic components of the excitation pulse tends to produce
a buzzy twang in the synthesized speech. These problems are mitigated in the
other two approaches: Multi-pulse excited LPC and stochastic or code excited
LPC.
Figure 7.7
LPC excitation methods: vocoder excitation (pulse/noise sources), multipulse excitation, and stochastic excitation.
In both multipulse and code excited LPC, the excitation signal is passed through
two cascaded synthesis filters, one to model the long-term voice periodicity and
the other to adjust the spectral envelope. The regenerated speech samples
at the output of the second filter are compared with samples of the original
speech signal to form a difference signal. The difference signal represents the
objective error in the regenerated speech signal. This is further processed
through a linear filter which amplifies the perceptually more important frequencies
and attenuates the perceptually less important frequencies.
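In code excited LPC, the encoder performs exactly this loop: each stochastic
codevector is passed through the synthesis filter, the error against the original
speech is perceptually weighted, and the index of the best vector is transmitted.
A toy sketch in which the codebook size, the filters, and the omission of a gain
search are all illustrative assumptions:

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
codebook = rng.standard_normal((128, 40))     # 128 stochastic excitation vectors
a = [1.0, -0.9]                               # toy LPC synthesis filter 1/A(z)
w = [1.0, -0.6]                               # toy perceptual weighting filter

target = rng.standard_normal(40)              # one subframe of input speech

def weighted_error(excitation):
    synth = lfilter([1.0], a, excitation)     # pass codevector through 1/A(z)
    return np.sum(lfilter(w, [1.0], target - synth) ** 2)

errors = [weighted_error(cv) for cv in codebook]
best = int(np.argmin(errors))                 # index transmitted to the decoder
print(best, errors[best])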
Figure 7.8
Block diagram illustrating the CELP code book search.
Figure 7.9
Block diagram of a CELP encoder.
7.8 Choice of Speech Codecs for Mobile Communications

For a candidate speech codec, it is instructive to determine the perceptual
importance of each bit and group the bits according to their sensitivity to errors.
Depending on their perceptual significance, bits in each group are provided with
different levels of error protection through the use of different forward error
correction (FEC) codes.
The choice of the speech coder will also depend on the cell size used. When
the cell size is sufficiently small such that high spectral efficiency is achieved
through frequency reuse, it may be sufficient to use a simple high rate speech
codec. In cordless telephone systems like CT2 and DECT, which use very
small cells (microcells), 32 kbps ADPCM coders are used to achieve acceptable
performance even without channel coding and equalization. Cellular systems
operating with much larger cells and poorer channel conditions need to use error
correction coding, thereby requiring the speech codecs to operate at lower bit
rates. In mobile satellite communications, the cell sizes are very large and the
available bandwidth is very small. In order to accommodate a realistic number of
users, the speech rate must be of the order of 3 kbps, requiring the use of vocoder
techniques [Ste93].
The type of multiple access technique used, being an important factor in
determining the spectral efficiency of the system, strongly influences the choice
of speech codec. The US digital TDMA cellular system (IS-54) increased the
capacity of the existing analog system (AMPS) threefold by using an 8 kbps
VSELP speech codec. CDMA systems, due to their innate interference rejection
capabilities and broader bandwidth availability, allow the use of a low bit rate
speech codec without regard to its robustness to transmission errors. Transmis-
sion errors can be corrected with powerful FEC codes, the use of which, in CDMA
systems, does not affect bandwidth efficiency very significantly.
The type of modulation employed also has considerable impact on the
choice of speech codec. For example, using bandwidth-efficient modulation
schemes can lower the bit rate reduction requirements on the speech codec, and
vice versa. Table 7.1 shows a listing of the types of speech codecs used in various
digital mobile communication systems.
Example 7.4
A digital mobile communication system has a forward channel frequency band
ranging between 810 MHz and 826 MHz and a reverse channel band between
940 MHz and 956 MHz. Assume that 90 percent of the bandwidth is used by
traffic channels. It is required to support at least 1150 simultaneous calls using
FDMA. The modulation scheme employed has a spectral efficiency of 1.68 bps/Hz.
Assuming that the channel impairments necessitate the use of rate 1/2
FEC codes, find the upper bound on the transmission bit rate that a speech
coder used in this system should provide.
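Solution to Example 7.4
Total bandwidth available in each direction = 956 - 940 = 16 MHz, of which
90 percent, i.e., 14.4 MHz, is available for traffic. Supporting 1150 simultaneous
FDMA calls gives a channel bandwidth of 14.4 MHz / 1150 = 12.52 kHz. At a
spectral efficiency of 1.68 bps/Hz, each channel carries a gross rate of
12.52 kHz x 1.68 = 21.03 kbps, and with rate 1/2 FEC coding only half of these
bits carry speech. The speech coder must therefore operate at no more than
about 10.5 kbps.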
Example 7.5
The output of a speech coder has bits which contribute to signal quality with
varying degrees of importance. Encoding is done on blocks of samples of 20 ms
duration (260 bits of coder output). The first 50 of the encoded speech bits (say
type 1) in each block are considered to be the most significant and hence, to protect
them from channel errors, are appended with 10 CRC bits and convolutionally
encoded with a rate 1/2 FEC coder. The next 132 bits (say type 2) are
appended with 5 CRC bits, and the last 78 bits (say type 3) are not error protected.
Compute the gross channel data rate achievable.

Solution to Example 7.5
Type 1: (50 + 10) bits passed through the rate 1/2 coder give 2 x 60 = 120 bits.
Type 2: 132 + 5 CRC bits give 137 bits.
Type 3: 78 bits, unprotected.
Total number of bits per 20 ms block = 120 + 137 + 78 = 335 bits.
Therefore, gross channel bit rate = 335 / (20 x 10^-3) = 16.75 kbps.
7.9 The GSM Codec

Figure 7.10
Block diagram of the GSM speech encoder.
The GSM decoder reverses the operations of the encoder. The received excitation
parameters are RPE decoded and passed to the LTP synthesis filter, which uses
the pitch and gain parameters to synthesize the long-term signal. Short-term
synthesis is carried out using the received reflection coefficients to recreate the
original speech signal.
Figure 7.11
Block diagram of the GSM speech decoder.
Every 260 bits of the coder output (i.e., 20 ms blocks of speech) are ordered,
depending on their importance, into groups of 50, 132, and 78 bits each. The bits
in the first group are very important bits called type Ia bits. The next 132 bits
are important bits called type Ib bits, and the last 78 bits are called type II bits.
Since type Ia bits are the ones which affect speech quality the most, they have
error detection CRC bits added. Both Ia and Ib bits are convolutionally encoded
for forward error correction. The least significant type II bits have no error
correction or detection.
7.10 The USDC Codec

Figure 7.12
Block diagram of the USDC speech encoder.
7.11 Performance Evaluation of Speech Coders

Objective measures alone cannot capture speech quality as it is perceived by the
human ear. Since the listener is the ultimate judge of the signal quality, subjective
listening tests constitute an integral part of speech coder evaluation.
Subjective listening tests are conducted by playing the sample to a number
of listeners and asking them to judge the quality of the speech. Speech coders are
highly speaker dependent in that the quality varies with the age and gender of
the speaker, the speed at which the speaker speaks and other factors. The sub-
jective tests are carried out in different environments, such as noisy backgrounds
and multiple simultaneous speakers, to simulate real life conditions. These tests
provide results in terms of overall quality, listening effort, intelligibility, and
naturalness. The intelligibility
tests measure the listener's ability to identify the spoken word. The diagnostic
rhyme test (DRT) is the most popular and widely used intelligibility test. In this
test, a word from a pair of rhyming words such as "those-dose" is presented to the
listener, and the listener is asked to identify which word was spoken. Typical
percentage-correct scores on DRT tests range from 75 to 90. The diagnostic
acceptability measure (DAM) is another test that evaluates the acceptability of
speech coding systems. The results of all these tests are difficult to rank, and
hence a reference system is required. The most popular ranking system is known
as the mean opinion score or MOS ranking. This is a five point quality ranking
scale, with each point associated with a standardized description: bad, poor, fair,
good, excellent. Table 7.2 gives a listing of the mean opinion score ranking system.
Figure 7.13
Block diagram of the USDC speech decoder.
One of the most difficult conditions for speech coders to perform well in is
the case where a digital speech-coded signal is transmitted from the mobile to
the base station, and then demodulated into an analog signal which is then
speech coded for retransmission as a digital signal over a landline or wireless
link. This situation, called tandem signaling, tends to exaggerate the bit errors
originally received at the base station. Tandem signaling is difficult to protect
against, but is an important evaluation criterion in the evaluation of speech cod-
ers. As wireless systems proliferate, there will be a greater demand for mobile-
to-mobile communications, and such links will, by definition, involve at least two
independent, noisy tandems.