
International Journal of Computer Applications (0975 - 8887)

Volume 51-No. 15, August 2012

Comparative Analysis of Speech Compression Algorithms with Perceptual and LP Based Quality Evaluations

Nasir Saleem, Institute of Engineering and Technology, Gomal University, D.I.Khan, KPK, Pakistan (Nasir_Salemm@yahoo.com)
Sunniya Nasir, Institute of Engineering and Technology, Gomal University, D.I.Khan, KPK, Pakistan (Sanakhan_132@yahoo.com)
Sher Ali, City University of Science and Information Technology, Peshawar, KPK, Pakistan (malaksherkhan@yahoo.com)

ABSTRACT
Speech compression is one of the leading areas of digital signal processing; it focuses on reducing the bit rate of speech signals for transmission and storage without significant loss of quality. Over the past decades many speech coding techniques have been proposed for speech analysis. This paper assesses and compares two compression techniques on speech signals. For this purpose we chose two low-bit-rate and widely used speech analysis methods, VELP and MELP. The performance of both is evaluated with objective quality tests including PESQ, IS and CEP, with the same speech files tested on both coders. The objective assessments show that at low bit rates MELP performs better than VELP.

Keywords
MELP, VELP, PESQ, IS, Cepstrum distance, MOS.

1. INTRODUCTION
The aim of a speech compression system (SCS) is to transform speech signals into a more compact representation that can be transmitted across a channel or stored with comparatively less memory. In practice no user gets full access to the entire bandwidth of a network; consequently, networks require the speech signal to be compressed. In general, speech compression techniques [1],[3],[4] are used in long-distance communication, high-quality storage and message encryption. For example, in digital cellular systems (DCS) and VoIP networks a number of users share the same bandwidth. Since network bandwidth is limited, speech compression allows additional users to gain access to the available network. Another example of an SCS is the digital storage of voice: for a fixed amount of memory, speech compression makes it possible to store a message for a longer time. Speech coding is fundamentally a lossy category of coding, in which the reconstructed speech is not a precise replica of the original and may not even sound identical to it. Numerous speech coding techniques are available, such as waveform coding, vocoding, including Mixed Excitation Linear Predictive coding (MELP) [2],[5], Linear Predictive Coding (LPC) and Voice Excited Linear Predictive coding (VELP) [6],[11], and hybrid coding. The speech to be compressed is wideband in nature, covering the frequency range 0 to 8 kHz. According to the sampling theorem, the sampling frequency for these signals must therefore be 16 kHz, with an end-to-end delay limit of 100 ms. Time delay limitations vary from application to application: in a telephony network only about 1 msec of delay is tolerable, while a 500 msec delay is acceptable in video telephony. A further limitation is that the overall bit rate should not exceed 16 kbps. With all these constraints, the specific system ought to require less than 20 MOPS. Speech coders are normally categorized into waveform, parametric and hybrid coders. Waveform coders, including PCM and ADPCM [12], try to maintain the original shape of the speech waveform and work at 32 kbps and higher. In parametric coders [13] the parameters of the speech are extracted and the same parameters are used in the synthesis process; the bit rate range for parametric coders is 2 to 5 kbps, and MELP and LPC are examples of this class. Hybrid coders combine waveform and parametric coding, with a bit rate range of 5 to 32 kbps. In this paper the developed speech coders (MELP and VELP) are measured for quality with objective quality measurements. The objective investigation is undertaken in order to assess the speech quality while minimizing human bias. The objective quality is obtained by computing the PESQ, IS and DCEP measures for the original and reconstructed speech signals.

2. HUMAN SPEECH GENERATION
This section explains the human anatomy of speech generation, which is sketched in figure 1.

Fig 1: Human Speech Generation

On the basis of the above human speech production model, the following observations can be made:
• For voiced sounds, the vocal cords vibrate. The rate at which the vocal cords vibrate is known as the pitch period of the voice.

• For unvoiced sounds, there is no vocal cord vibration and consequently no pitch.
• For fricatives or plosive sounds, pressure is built up in the lungs.
• For nasal sounds, the vocal tract is acoustically coupled with the nasal cavity.
• The first and last types of sound are classified as voiced, while in the remaining two the vocal cords stay open and do not vibrate, so those sounds are called unvoiced.
• The nature of the vocal tract determines the type of sound.
• Throughout the speech generation process the vocal tract changes shape, which results in different sounds.
• The vocal tract is regarded as a non-uniform, time-varying tube.

3. MELP CODING
Figure 2 represents the speech generation model for the MELP coder, which is principally developed on the basis of the LPC model [14]. The MELP decoder employs a complicated interpolation to smooth out inter-frame transitions. A random pitch jitter is generated to perturb the value of the pitch period, producing an aperiodic impulse train. In the MELP coder the sound classification is expanded to three groups: voiced, unvoiced and an additional class, jittery voiced. The jittery voiced state in MELP is managed by the pitch jitter parameter and a random number. It corresponds to the condition where the excitation is aperiodic but not wholly random, as occurs in voicing transitions.

3.1 Pulse Shaping Filters
In MELP, each pulse shaping filter is composed of five filters known as synthesis filters. Each synthesis filter manages one specific band of frequencies, with pass bands classified as:
Filter 1: 0-500 Hz
Filter 2: 500-1000 Hz
Filter 3: 1000-2000 Hz
Filter 4: 2000-3000 Hz
Filter 5: 3000-4000 Hz
These synthesis filters are connected in parallel and determine the frequency responses of the shaping filters. Figure 3 expresses this idea.

Fig 3: Pulse shaping filters (the five synthesis filters, each with gain G, connected in parallel between the input and output of the shaping filter)

3.2 Bit Allocation
The MELP coder splits speech sampled at 8 kHz into 22.5 msec frames for analysis. Depending on the nature of the speech, inter-frame redundancy can be exploited to quantize the parameters efficiently. The bit provision is summarized in table 1. In total, 54 bits are transmitted per frame with a frame length of 22.5 msec, so a bit rate of 2.4 kbps is needed to transmit the 54 bits/frame.
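The frame arithmetic above can be checked directly (a small illustration, not part of the coder itself):

```python
# MELP frame arithmetic: 54 bits every 22.5 ms analysis frame.
bits_per_frame = 54
frame_length_s = 0.0225            # 22.5 msec

frames_per_second = 1.0 / frame_length_s      # about 44.44 frames/s
bit_rate_bps = bits_per_frame * frames_per_second

print(frames_per_second)
print(bit_rate_bps)                # 2400.0 bits/s, i.e. 2.4 kbps
```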
Table 1 Bits allocation for MELP coder

Parameters                                   Voiced   Unvoiced
LPC                                          25       25
Pitch period / low-band voicing strength     7        7
Band-pass voicing strength                   4        -
First gain                                   3        3
Second gain                                  5        5
Aperiodic flag                               1        -
Fourier magnitudes                           8        -
Synchronization                              1        1
Error protection                             -        13
Total                                        54       54

Fig 2: MELP model of speech generation (pitch period and jitter drive an impulse train generator; the pulse generation and pulse shaping filters, together with a white noise generator and noise shaping filter, are combined through the voicing strengths and gain into a synthesis filter that produces the speech)
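The jittered (aperiodic) impulse train of the model in figure 2 can be illustrated with a rough sketch; the 25% jitter fraction and unit pulses here are illustrative assumptions, not the coder's actual parameters:

```python
import random

def jittered_impulse_train(n_samples, pitch_period, jitter=0.25, seed=0):
    """Build an excitation of unit impulses whose spacing is the nominal
    pitch period perturbed by up to +/- jitter * pitch_period, giving an
    aperiodic pulse train as used for the jittery voiced class."""
    rng = random.Random(seed)
    train = [0.0] * n_samples
    pos = 0
    while pos < n_samples:
        train[pos] = 1.0
        # perturb the nominal period by a random fraction of itself
        offset = int(pitch_period * jitter * (2.0 * rng.random() - 1.0))
        pos += max(1, pitch_period + offset)
    return train

excitation = jittered_impulse_train(n_samples=400, pitch_period=80)
```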
The excitation pulse shape in the case of periodic excitation is extracted from the speech, and the extracted parameters are sent to the decoder for synthesis. The pulse shapes hold significant information about the speech and are computed via the Fourier magnitudes of the prediction error. The computed parameters are used to produce the impulse response of the pulse generation filter (figure 2), which is responsible for the creation of the periodic excitation. The periodic and noise excitations are filtered by the pulse and noise shaping filters respectively, and the filtered excitations are then combined to obtain the full mixed excitation; hence the name mixed excitation LPC. The frequency responses of the shaping filters are controlled by parameters called voicing strengths, which quantify the degree of voicing. These filter responses vary over time.

4. VELP CODING
LPC analysis of speech frames is used to estimate parameters such as pitch [15], formants, intensity, loudness and the spectrum of the speech. The idea behind LPC is to minimize the sum of squared differences between the original and predicted speech over a finite interval. This process yields a distinctive set of predictor coefficients, which are estimated for each frame of 20 msec and denoted ak. An additional essential parameter is the gain G. The transfer function of the synthesis filter is:

H(z) = G / (1 - Σ ak z^-k)                (4.1)
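As a sketch of how the predictor coefficients ak in (4.1) can be obtained from a frame's autocorrelation values, here is a minimal pure-Python Levinson-Durbin recursion (an illustration, not the paper's implementation; it uses the convention A(z) = 1 + Σ a[j] z^-j, i.e. the negatives of the ak in (4.1)):

```python
def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values
    r[0..order] via the Levinson-Durbin recursion. Returns (a, e):
    coefficients a with a[0] == 1.0 and the final prediction error e."""
    a = [1.0] + [0.0] * order
    e = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for stage i
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / e
        # symmetric coefficient update
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        e *= (1.0 - k * k)
    return a, e

# Toy autocorrelation sequence of a decaying signal (illustrative only)
r = [1.0, 0.9, 0.7, 0.5]
a, err = levinson_durbin(r, order=2)
```

Because the method works on autocorrelation values, the reflection coefficients stay below one in magnitude and the resulting poles lie inside the unit circle, matching the stability argument made in the text below.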


The summation runs from k = 1 to k = p; for LPC-10, p is 10, meaning ten coefficients are sent for synthesis. Two main methods are used to compute the coefficients: autocorrelation and covariance. Here the autocorrelation method is used, since the roots of the denominator polynomial then place the poles inside the unit circle, which indicates a stable system. The Levinson-Durbin algorithm is adopted to compute the coefficients from the autocorrelation values. The LPC analysis of a frame is also used to answer whether the frame is voiced or unvoiced. For a voiced segment an impulse train is used, and a precise pitch detection algorithm determines the pitch; we deployed the autocorrelation function to decide the pitch of the segment. For an unvoiced segment, white noise is used instead. Hence either an impulse train or white noise is used as the excitation for the synthesis filter. The mathematical model for speech production is given in figure 4.

Fig 4: Mathematical Modeling of Speech Generation (an impulse train generator with voiced gain Av, or a random noise generator with unvoiced gain Auv, is selected by a switch driven by the pitch period and vocal tract parameters, and passed through a glottal pulse model G(z), vocal tract model V(z) and radiation model R(z) to produce the speech S(z))

Generally the predictor coefficients are not quantized directly, as direct quantization does not guarantee stability; reasonably high accuracy would be mandatory, because a minute variation in the predictor coefficients can cause a large change in the pole locations. The Levinson-Durbin algorithm creates intermediate values, and quantization of these intermediate values is less challenging. The core idea behind voice excitation is to avoid inaccurate pitch detection and the use of an impulse train for speech synthesis. Instead, the incoming speech frames are filtered with the estimated transfer function of the LPC analyzer; the filtered outcome is known as the residual signal. By transmitting the residual signal to the receiver, we can obtain better quality speech. The block diagram for VELP is given in figure 5.

Fig 5: VELP vocoder block diagram (S(t) enters the LPC analyzer and excitation detector at the coder; after the channel, the decoder's LPC synthesizer produces S'(t))

For rebuilding the excitation, only the low frequencies of the residual signal are needed. We applied the discrete cosine transform (DCT), as it concentrates the majority of the energy in the first few coefficients; therefore only these coefficients, which hold nearly all the energy, are transmitted. The receiving side employs the inverse DCT to recreate the signal intended as the voice excitation.

4.1 Bits Allocation
A speech segment in VELP encloses 88 bits, as an additional 49 bits are needed for the DCT. With a sampling rate of 8000 samples/second, the speech is broken down into segments of 180 samples. This gives approximately 44.44 frames/second, resulting in a bit rate of approximately 4 kbps. The bit allocation is summarized in table 2.

Table 2 Bits allocation for VELP coder

Parameters               Bits
DCT                      40
K1 and K2                10
K3 and K4                10
K5, K6, K7 and K8        16
K9                       03
K10                      02
Gain                     05
Synchronization          01
Total                    88

5. WAVEFORM ANALYSIS
The assessment has been completed with the help of original speech sentences spoken by different speakers, compared against the VELP- and MELP-regenerated speech. In both schemes the reconstructed speech has poorer quality than the original, but the MELP reconstructed speech waveform is superior to that of VELP; the waveforms are plotted in figures 6 and 7 respectively.

Fig 6: MELP original (blue) and rebuilt (red) speech

Fig 7: VELP original (blue) and rebuilt (red) speech
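The voice-excitation idea of Section 4 (keep only the low-frequency DCT coefficients of the residual and invert at the receiver) can be sketched in pure Python; the naive O(N^2) transform, the toy residual and the choice of 8 kept coefficients are illustrative assumptions, not the paper's code:

```python
import math

def dct2(x):
    """Naive (unnormalized) DCT-II of a sequence x."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n)) for k in range(n)]

def idct2(c):
    """Inverse transform (DCT-III based) matching dct2 above."""
    n = len(c)
    return [(c[0] / n) + (2.0 / n) * sum(
                c[k] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for k in range(1, n))
            for i in range(n)]

# Toy "residual": one slow sine cycle over a 32-sample segment.
residual = [math.sin(2 * math.pi * i / 32) for i in range(32)]
coeffs = dct2(residual)

# Transmit only the first few (low-frequency) coefficients,
# zero the rest, and rebuild the excitation at the receiver.
kept = 8
truncated = coeffs[:kept] + [0.0] * (len(coeffs) - kept)
excitation = idct2(truncated)

err = max(abs(a - b) for a, b in zip(residual, excitation))
```

Since most of the residual's energy sits in the leading coefficients, the reconstruction error stays small even though three quarters of the coefficients were discarded.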


It can be clearly observed from figures 6 and 7 that the reconstructed speech waveforms are closer to the original in MELP than in VELP.

6. PERCEPTUAL MEASUREMENT

6.1 PESQ
The aim of the PESQ [6],[7],[8] algorithm is to measure the quality of speech. The quality evaluation is carried out by comparing the original speech with the speech degraded by compression. For this comparison PESQ uses stochastic and cognitive models and is associated with the MOS [16]. The PESQ score is calculated as a linear combination of the average disturbance value Dave and the average asymmetrical disturbance value Aave:

PESQ = a0 + a1*Dave + a2*Aave                (6.1)

The PESQ results for MELP are better than those for VELP, as shown in chart 1.

Chart 1: PESQ computation for coders (MELP: 3.1033 male speaker, 3.1068 female speaker; VELP: 2.4216 male, 2.1147 female)

7. LPC BASED OBJECTIVE MEASURES

7.1 Cepstrum Distance (CEP)
The CEP distance [17],[8] is an estimate of the log-spectral distance between the original and reconstructed speech; it is computed by taking the log of the spectrum and transforming it back to the time domain. With this process the excitation signal of the speech is separated from the convolved vocal tract characteristics. The CEP distance can be computed from:

DCEP = (10 / ln 10) * sqrt( 2 * Σ |Cc(k) - Cp(k)|^2 )                (7.1)

The CEP computation shows better performance for the MELP coder than for VELP, as shown in chart 2; the smaller the distance, the higher the quality, and vice versa.

Chart 2: CEP distance for coders (MELP: 2.8012 male speaker, 2.9792 female speaker; VELP: 5.755 male, 5.5365 female)

7.2 Itakura-Saito Distance (IS)
The IS distance [18] evaluates the perceptual distance between an original spectrum and an approximation of that spectrum. The estimation is based on the divergence between the power spectra of the original and reconstructed speech. If the distance is high for a specific speech signal, the algorithm is performing badly; if the distance is low, the algorithm performs well for that speech. Here the IS distance is low for MELP, while the higher VELP distance indicates lower performance. Chart 3 shows the IS computations for both coders.

Chart 3: IS distance for speech coders (MELP: 1.1221 male speaker, 1.0444 female speaker; VELP: 13.2305 male, 13.5837 female)

8. CONCLUSION AND FUTURE WORK
Speech compression has been carried out with the help of two algorithms having quite different bit rates. Although the bit rate for VELP is much higher (almost twice) than for MELP, the quality of compressed speech obtained from MELP is still better than that from voice excited LPC. When the PESQ, CEP and IS measures for both methods were compared, it was observed that the speech degradation is greater in voice excited LPC. Mixed excitation LPC performed better for both kinds of speakers, male and female. The signal level is weak for voice excited LPC, while a strong signal level is observed for mixed excitation LPC. Whenever the reconstructed speech has a signal level that is relatively weak compared to a predefined threshold, we can conclude that the system is not performing well for those speech signals. In this work we tested clean speech signals and the English language only. In future this work can be extended to compute the distortion introduced by the coders in noisy backgrounds and the behavior of the coders towards other languages.

9. REFERENCES
[1] Milan Z. Markovic, "Speech Compression – Recent Advances and Standardization", 19-21 September 2001, Nis, Yugoslavia.
[2] John S. Collura, Diane F. Brandt, Douglas J. Rahikka, "The 1.2 Kbps/2.4 Kbps MELP Speech Coding Suite with Integrated Noise Pre-Processing", IEEE, National Security Agency, 2002.
[3] Chin-Chen Chang, Richard Char-Tung Lee, Guang-Xue Xiao, Tung-Shou Chen, "A New Speech Hiding Scheme Based upon Sub-Band Coding", ICICS-PCM, 15-18 December 2003, Singapore.
[4] G. Rajesh, A. Kumar, K. Ranjeet, "Speech Compression using Different Transform Techniques", International Conference on Computer & Communication Technology (ICCCT), 2011.
[5] Lynn M. Supplee, Ronald P. Cohn, John S. Collura (US DoD), Alan V. McCree (Corporate R&D, TI, Dallas), "MELP: The New Federal Standard at 2400 bps".
[6] ITU-T Recommendation P.862: Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for

End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs, 2001.
[7] Ma, J., Hu, Y., Loizou, P.C., "Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions", J. Acoust. Soc. Am. 125(5), 3387-3405, 2009.
[8] Hu, Y., Loizou, P.C., "Evaluation of objective quality measures for speech enhancement", IEEE Trans. Audio Speech Lang. Process. 16(1), 229-238, 2008.
[9] B.S. Atal, M.R. Schroeder, V. Stover, "Voice-Excited Predictive Coding System for Low Bit-Rate Transmission of Speech", Proc. ICC, pp. 30-37 to 30-40, 1975.
[10] C.J. Weinstein, "A Linear Predictive Vocoder with Voice Excitation", Proc. Eascon, September 1975.
[11] Raza, M.A., "Implementation of Voice Excited Linear Predictive Coding (VELP) on TMS320C6711 DSP Kit", 9th International Multitopic Conference (IEEE INMIC), 2005.
[12] McPherson, T., Jr., "PCM speech compression via ADPCM/TASI", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '77), 1977.
[13] Madrid, K.M., Tan, E.C., Guevara, R.C.L., "Low bit-rate wideband LP and wideband sinusoidal parametric speech coders", TENCON 2004, IEEE Region 10 Conference, 2004.
[14] Haagen, J., Nielsen, H., Hansen, S.D., "A 2.4 kbps high-quality speech coder", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-91), 14-17 April 1991.
[15] Barbara Resch, Mattias Nilsson, Anders Ekman, W. Bastiaan Kleijn, "Estimation of the Instantaneous Pitch of Speech", IEEE Transactions on Audio, Speech, and Language Processing, March 2007.
[16] Itoh, Y., Tajima, K., Kuwabara, N., "Measurement of subjective communication quality for optical mobile communication systems by using mean opinion score", The 11th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC 2000), 2000.
[17] Kitawaki, N., Nagabuchi, H., Itoh, K., "Objective quality evaluation for low-bit-rate speech coding systems", IEEE Journal on Selected Areas in Communications, February 1988.
[18] Enqvist, P., Karlsson, J., "Minimal Itakura-Saito distance and covariance interpolation", 47th IEEE Conference on Decision and Control (CDC 2008), 9-11 December 2008.
