15th ICPhS Barcelona
Improved ASR in noise using harmonic decomposition
David M. Moreno∗, Philip J.B. Jackson†, Javier Hernando∗ and Martin J. Russell‡
∗ TALP Research Centre, Universitat Politècnica de Catalunya, Barcelona, Spain.
† CVSSP, Electronic Engineering, University of Surrey, Guildford, UK. [p.jackson@surrey.ac.uk]
‡ Electronic, Electrical & Computer Engineering, University of Birmingham, Birmingham, UK.
ABSTRACT
Application of the pitch-scaled harmonic filter (PSHF)
to automatic speech recognition in noise was investigated using the Aurora 2.0 database. The PSHF decomposed the original speech into periodic and aperiodic streams. Digit-recognition tests with the extended
features compared the noise robustness of various parameterisations against standard 39 MFCCs. Separately, each stream reduced word accuracy by less than
1 % absolute; together, the combined streams gave substantial increases under noisy conditions. Applying
PCA to concatenated features proved better than to
separate streams, and to static coefficients better than
after calculation of deltas. With multi-condition training, accuracy improved by 7.8 % at 5 dB SNR, thus
providing resilience from corruption by noise.
1 INTRODUCTION
In a conventional front end for automatic speech recognition (ASR), incoming speech signals are converted
into Mel-frequency cepstral coefficients (MFCCs), before any analysis or interpretation is carried out (e.g.,
by Viterbi decoding). In the present study, we have
sought first to separate the voiced and unvoiced contributions to the speech signal (as periodic and aperiodic components respectively), which are then converted into MFCCs. Thus, the acoustic models may
be considered as learning distinct characteristics of the
voiced and unvoiced parts for any given phoneme.
The acoustic cues of speech come from a variety of different mechanisms, such as phonation, frication and
plosion. Many ASR front ends treat them equally,
although human speech production and speech coding studies have shown the characteristics of radiated
speech signals to depend greatly on vibration of the vocal cords. Standard front ends try to extract features
that are not strongly influenced by the source characteristics. Here, we attempt to segregate harmonic
and noise-like cues before describing their characteristics, by extracting the contribution from voicing (with
large relative amplitude) from those of other acoustic sources, hence improving the feature extraction for both kinds of cue: voiced and unvoiced.

Although researchers have experimented with a plethora of ways to extract a single set of features from speech, methods of decomposing the acoustic signal from the speaker into parallel streams of information are not so well studied. Some have shown benefit in sub-band processing [1] and MFCCs mixed with formants [2], while others have used a single set of features with parallel models [3]. Since we know from personal experience that whispered, breathy or creaky speech is more difficult to understand in a noisy environment than normally-phonated speech, it seems logical that a feature-extraction technique for ASR that also exploits the signal's harmonicity should offer gains in recognition accuracy and robustness to noise.

Therefore, to separate the quasi-periodic voiced component from the noise-like residual, the pitch-scaled harmonic filter (PSHF) was used. It was designed to split an input speech signal into two synchronous streams, periodic and aperiodic, which act respectively as estimates of the voiced and unvoiced components of the signal at any time [4]. After decomposition, features extracted from each of the streams may be concatenated or further manipulated into an extended feature vector, as required. The feature-extraction processes are described below, with experimental details and a brief discussion of the results, which demonstrate the capability of the PSHF for enhancing the digit-recognition accuracy of an ASR system in tests on the Aurora 2.0 database.
2 METHOD
Preparation of the acoustic features from Aurora had
three main stages: (i) estimating the fundamental frequency for voiced sections of the speech corpus, (ii) decomposing the speech files into periodic and aperiodic
components, and (iii) calculating the feature vectors.
All training and test utterances were processed alike, as in figure 1.
[Figure 1: Front-end overview. The waveform is split by the PSHF into periodic and aperiodic components, from which features are extracted.]

[Figure 2: The PSHF (from top): the optimal pitch and period, f0opt and Nopt, are calculated; harmonic decomposition estimates the periodic contribution, which is subtracted from the original signal to give the aperiodic estimate.]

[Figure 3: MFCC-derived spectrograms of the utterance "zero-two-six-zero": (a) original speech s, (b) periodic estimate v̂, (c) aperiodic estimate û; frequency (kHz) against time (s).]
2.1 Pitch extraction
An initial estimate of each file’s fundamental frequency f0raw was made, then optimised by the PSHF,
which scales the window size to the pitch period as
part of the decomposition. After robust pitch extraction by the Entropic utility get_f0, our own pitch-correction script was applied to resolve glitches in voice activity and pitch discontinuities, e.g., octave errors. The parameters of both steps were determined
empirically (minimum voiced/unvoiced segment durations of 30 ms/10 ms, respectively). The clean files
were processed automatically to produce f0raw values
for the entire database, which the PSHF optimised
with a matched cost function to yield f0opt (4 periods,
8 harmonics, 4 ms shift, [5]).
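For illustration, here is a minimal sketch in Python of the duration-based part of such a pitch correction (our reconstruction, not the original script: the function name, the 10 ms frame shift and the linear-interpolation bridging of short gaps are assumptions; octave-error repair is omitted):

```python
import numpy as np

def enforce_min_durations(f0, shift=0.010, min_voiced=0.030, min_unvoiced=0.010):
    """Clean a raw f0 track (Hz; 0 = unvoiced): voiced runs shorter than
    min_voiced are deleted, and unvoiced gaps shorter than min_unvoiced
    are bridged by linear interpolation between their neighbours."""
    voiced = f0 > 0
    edges = np.flatnonzero(np.diff(voiced.astype(int))) + 1   # run boundaries
    starts = np.concatenate(([0], edges))
    ends = np.concatenate((edges, [len(f0)]))
    out = f0.astype(float).copy()
    for s, e in zip(starts, ends):
        dur = (e - s) * shift                                 # run length in s
        if voiced[s] and dur < min_voiced:
            out[s:e] = 0.0                                    # spurious voicing
        elif not voiced[s] and dur < min_unvoiced and 0 < s and e < len(f0):
            out[s:e] = np.linspace(out[s - 1], out[e], e - s + 2)[1:-1]
    return out
```

An octave-error check would additionally compare each voiced run's median f0 against its neighbours; that step is not shown here.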
2.2 Periodic-aperiodic decomposition
The harmonic decomposition was performed using the
optimised clean pitch estimates, giving a pair of periodic and aperiodic files for every file in the database.
Figure 2 shows the windowing and decomposition of a
frame of speech, which is performed by selection of harmonics in the frequency domain. Successive shifting and resplicing of the outputs yields complete periodic and aperiodic signals, synchronised with the input. The algorithm
and an assessment of its performance are described
elsewhere [4, 6]; software and examples are online, [7].
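As a rough illustration of the harmonic selection (a deliberately simplified sketch: the published PSHF [4, 6] also applies a Hann window, optimises f0 for each frame and compensates for the noise bias in the harmonic bins), consider a frame spanning exactly b = 4 pitch periods, so that the harmonics of f0 fall exactly on every b-th DFT bin:

```python
import numpy as np

def pshf_frame(s, n0, N0, b=4):
    """Split the frame s[n0:n0+b*N0] (b pitch periods of N0 samples each)
    into periodic and aperiodic estimates. Because the frame spans an
    integer number of periods, the harmonics of f0 sit exactly on the
    DFT bins k = b, 2b, 3b, ..."""
    frame = s[n0:n0 + b * N0].astype(float)
    S = np.fft.fft(frame)
    V = np.zeros_like(S)
    V[::b] = S[::b]              # keep only the harmonic bins
    V[0] = 0.0                   # discard the DC term
    v = np.fft.ifft(V).real      # periodic (voiced) estimate
    u = frame - v                # aperiodic (noise-like) residual
    return v, u
```

Shifting the analysis frame (every 4 ms in our experiments) and resplicing the per-frame outputs then reconstructs the two complete signals, as described above.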
2.3 Feature extraction
Standard MFCC features (0–12, plus deltas and delta-deltas) were extracted from the original signal and from the pair of decomposed signals, using HTK [8]. A small amount of Gaussian white noise, or dither, was added to the periodic features during voiceless sections.¹ As well as concatenation, the technique of principal component analysis (PCA) was employed, to offer six parameterisations of the data:

parm.   front-end processing
base:   MFCC → +∆,+∆∆
split:  PSHF → MFCC → +∆,+∆∆ → cat
pca26:  PSHF → MFCC → cat → PCA → +∆,+∆∆
pca78:  PSHF → MFCC → +∆,+∆∆ → cat → PCA
pca13:  PSHF → MFCC → PCA → +∆,+∆∆ → cat
pca39:  PSHF → MFCC → +∆,+∆∆ → PCA → cat

where "+∆,+∆∆" denotes calculation of 1st- and 2nd-order differences (i.e., velocities and accelerations), and "cat" implies concatenation of the periodic and aperiodic feature streams. The PCA parameterisations are distinguished by the size of the matrix in the analysis, which depends on the order of the operations. Thus, all feature vectors had 78 coefficients, except base with 39.

¹ Adding dither avoided numerical instability in training probability distributions that may be induced by total silence.
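To make the pipelines concrete, here is a sketch of the pca26 row above in Python (an illustrative reconstruction under our assumptions: the function names are hypothetical, the 13 static MFCCs per stream arrive as (T × 13) arrays, PCA is done via SVD, and a simplified regression window stands in for HTK's exact delta formula):

```python
import numpy as np

def deltas(X, width=2):
    """Simplified regression deltas over +/-width frames (edges wrap)."""
    norm = 2 * sum(t * t for t in range(1, width + 1))
    return sum(t * (np.roll(X, -t, axis=0) - np.roll(X, t, axis=0))
               for t in range(1, width + 1)) / norm

def pca26(mfcc_per, mfcc_ape):
    """pca26: cat the 13 static MFCCs of each stream, PCA-rotate the
    26-dim vectors, then append velocities and accelerations -> 78."""
    X = np.hstack([mfcc_per, mfcc_ape])        # cat -> (T, 26)
    Xc = X - X.mean(axis=0)                    # centre before PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt.T                              # rotate onto the PCs
    d = deltas(Y)                              # velocities
    return np.hstack([Y, d, deltas(d)])        # static + delta + delta-delta
```

The other pca variants differ only in where the cat and PCA steps sit relative to the delta calculation, which changes the size of the matrix analysed (13, 39 or 78 dimensions).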
2.4 Recognition experiments
The Aurora 2.0 database comprises clean 8 kHz speech recordings of connected digits with noise added at seven signal-to-noise ratios (SNRs): ∞, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB and −5 dB. There are matched and unmatched noise conditions in the test data, for both additive and convolutional noise (i.e., channel distortion). Hence, a recognizer may be trained using only clean data or multiple SNR conditions, and the results viewed according to test SNR.

Training scripts instructed HTK to generate a set of 16-state word models for each of the digit prototypes (and a 3-state silence model). After flat initialisation and 16 iterations of the Baum-Welch algorithm, the models were tested and word accuracy recorded. In the split tests, the likelihoods of the two streams could be weighted independently. For all results reported here, the same weighting was used during training, thanks to a minor adjustment of HTK [9].
3 RESULTS
Figure 3 gives a spectrographic example of the features used in the recognizer, showing the effect of the
standard front end on the original signal and on the
periodic and aperiodic components. Although no new
information was introduced by the decomposition, it is
interesting to observe the prominence of voicing transitions and the distribution of spectral details during
voiced segments. From listening, the aperiodic estimate sounds similar to whispered speech (as expected
for an absent voice source); though the periodic estimate contains only voiced segments, it is perfectly
recognizable, due to language and coarticulation cues
that remain. Under noisy conditions, the incoherent
aperiodic contribution accrues most distortion and is
much more easily masked than the periodic one.
3.1 Effect of decomposition
Points for equally-weighted streams, γp = γa = 1.0
(centre of each graph in figure 4), correspond to direct
concatenation of the periodic and aperiodic features.
The improvement in word-recognition accuracy is remarkable, especially under noisy test conditions, suggesting that useful information had been masked in the
features extracted from the original speech.
3.2 Influence of stream weights
Changing the balance of the stream weights defines
three scenarios: (i) under clean test conditions, best
performance was achieved when the aperiodic stream
carried much more weight than the periodic one; (ii) in
very noisy conditions, the best results arose with all the
weight given to the periodic stream; (iii) at intermediate noise levels, a combination of both streams gave
best results. This behaviour is due to the PSHF mainly
ascribing corrupting noise to the aperiodic component.
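For reference, in HTK-style multiple-stream HMMs such weights act as exponents on the per-stream output probabilities, so the combined state likelihood is (notation ours):

$$ b_j(\mathbf{o}_t) \;=\; \Big[ b_j^{(p)}\big(\mathbf{o}_t^{(p)}\big) \Big]^{\gamma_p} \Big[ b_j^{(a)}\big(\mathbf{o}_t^{(a)}\big) \Big]^{\gamma_a} $$

Hence γp = 0 discounts the periodic stream entirely, while equal weights, γp = γa = 1.0, are equivalent to plain concatenation of the two streams.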
3.3 Principal component analysis
PCA was used to decorrelate the dimensions of the feature data, and to sort them by the proportion of variance each dimension explains. It enabled us to determine which part of the variation in the data was useful to the recognizer. How complementary, or redundant, the periodic and aperiodic streams were can be measured by the number of dimensions, or PCs, beneficial to the recognition task.

Typically there were only 13 dominant dimensions in the data (including the deltas and delta-deltas), but the detection of voiced segments introduced one extra for the periodic component, so we would expect it to contain one more useful PC. With a threshold at 1 % of the total variance, the numbers of selected PCs for the original, periodic and aperiodic streams were 13, 10 and 13 respectively, and 15 for the recombined streams (split). If the periodic and aperiodic streams were completely redundant, the number of PCs after recombination would equal that of the original stream (viz. 13); if totally independent, it would be their sum (i.e., 23). As the number of recombined PCs fell between 13 and 23, complementary information was extracted through the decomposition.
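The counting criterion is simple to state in code (a sketch under our assumptions; the singular values would come from the PCA step sketched in § 2.3, their squares being the per-component variances):

```python
import numpy as np

def count_useful_pcs(singular_values, threshold=0.01):
    """Count the PCs whose share of the total variance exceeds the
    threshold (1 % in our experiments)."""
    var = np.asarray(singular_values, dtype=float) ** 2   # variance per PC
    share = var / var.sum()                               # fraction of total
    return int((share > threshold).sum())
```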
4 CONCLUSION
The PSHF was used to split each speech waveform in the Aurora 2.0 database into two synchronous streams, periodic and aperiodic, acting respectively as estimates of the voiced and unvoiced components. Features were extracted from each stream and combined (by some sequence of concatenation, PCA and calculation of delta coefficients) to form an extended feature vector. Experiments yielded accuracy scores for connected-digit recognition, and tested the noise robustness of our parameterisations against a conventional one (39 MFCCs, +∆, +∆∆). Used separately, each of the streams gave recognition accuracy that was only slightly degraded (by less than 1 % absolute, compared to the baseline using the original speech); together, accuracy was increased under noisy conditions, demonstrating not only redundancy between the streams but also complementary information. Tests applying PCA to the concatenated feature set tended to perform better than applying PCA to the streams separately, and PCA of the static coefficients, before calculation of the deltas, was better than afterwards. With multi-condition training, the accuracy improved by 7.8 % at 5 dB SNR using concatenated streams (78 MFCCs), whereas with PCA of the combined static MFCCs and their derivatives (48 coefficients), the improvement was 5.6 %. Thus, voiced regions of a speech utterance appear to provide resilience of the message to corruption by noise. However, no significant improvement on the 99.0 % baseline accuracy was achieved under clean test conditions. Complete details of this research to date are reported in Moreno's thesis [9].

In the future, we propose to explore the influence of the voicing information on different classes of speech sound, for instance in a phoneme-recognition task using the TIMIT corpus, whose 16 kHz speech provides more turbulence-noise information. It would also be interesting to apply different forms of front-end processing to the two streams, and to consider other forms of model combination.
[Figure 4: Split test results of word accuracy (%) versus periodic-stream weight γp, averaged across each noise level: (a) clean and (b) multi-condition training. Solid lines are (from top) the ∞, 20 dB, 15 dB, 10 dB, 5 dB, 0 dB and −5 dB SNR test conditions. Thick horizontal lines indicate baseline average scores, and the dashed line (with ⊙) indicates the best result at each noise level.]
Table 1: Best word accuracy (%) achieved by each front end in § 2.3. The split and pca results depend respectively on the stream weights and the number of principal components used.

Clean training, by test SNR (dB):
         ∞     20    15    10     5     0    −5   Ave.
base   99.0  91.9  77.7  54.0  28.4  11.4   5.8  52.6
split  99.2  96.8  94.1  88.4  77.6  56.0  33.1  77.9
pca26  99.0  95.8  92.0  82.6  64.2  40.8  23.8  71.2
pca78  98.9  94.2  87.5  70.9  44.4  23.2  14.3  61.9
pca13  98.5  96.5  93.3  85.5  68.0  43.3  23.0  72.6
pca39  98.4  95.9  91.9  83.1  64.3  39.7  23.2  70.9

Multi-condition training, by test SNR (dB):
         ∞     20    15    10     5     0    −5   Ave.
base   98.5  97.4  96.5  93.7  84.2  55.2  22.4  78.3
split  98.5  97.5  96.9  95.8  92.8  83.2  59.4  89.1
pca26  98.4  97.7  97.1  95.7  92.1  81.7  59.2  88.8
pca78  98.3  97.4  96.6  95.1  91.0  80.4  57.8  88.1
pca13  98.0  97.0  96.3  94.4  90.5  79.4  57.7  87.6
pca39  97.8  97.0  96.3  94.8  90.9  79.3  56.5  87.5
ACKNOWLEDGEMENTS
The authors would like to acknowledge the support of their respective organizations, and Climent Nadeu, Jaume Padrell, Nick Wilkinson, Matt Stuttle and Dušan Macho for helpful discussions.

REFERENCES
[1] H. Bourlard and S. Dupont, "Sub-band based speech recognition," in Proc. IEEE-ICASSP, Munich, 1997, pp. 1251–1254.
[2] N. Wilkinson and M. J. Russell, "Improved phone recognition on TIMIT using formant frequency data and confidence measures," in Proc. Int. Conf. on Spoken Lang., Denver, CO, 2002, pp. 2121–2124.
[3] M. J. F. Gales and S. J. Young, "Robust speech recognition in additive and convolutional noise using parallel model combination," Comp. Speech & Lang., vol. 9, pp. 289–308, 1995.
[4] P. J. B. Jackson, Characterisation of plosive, fricative and aspiration components in speech production, Ph.D. thesis, Dept. Electronics & Comp. Sci., Univ. of Southampton, UK, 2000.
[5] H. Muta, T. Baer, K. Wagatsuma, T. Muraoka, and H. Fukuda, "A pitch-synchronous analysis of hoarseness in running speech," J. Acoust. Soc. Am., vol. 84, no. 4, pp. 1292–1301, 1988.
[6] P. J. B. Jackson and C. H. Shadle, "Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech," IEEE Trans. on Speech & Audio Proc., vol. 9, no. 7, pp. 713–726, 2001.
[7] P. J. B. Jackson, D. M. Moreno, J. Hernando, and M. J. Russell, Columbo project, CVSSP, Univ. of Surrey, Guildford, UK, 2001. [http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/]
[8] S. J. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book, Entropic Camb. Res. Lab., Cambridge, UK, v2.1 edition, 1997.
[9] D. M. Moreno, "Harmonic decomposition applied to automatic speech recognition," M.S. thesis, Universitat Politècnica de Catalunya, Barcelona, 2002.