
Extracting Room Reverberation Time from Speech Using

Artificial Neural Networks*

T. J. COX, AES Member, F. LI, AND P. DARLINGTON

School of Acoustics and Electronic Engineering, University of Salford, Salford M5 4WT, UK

A novel method to extract the reverberation time from reverberated speech utterances is
presented. In this study, speech utterances are restricted to pronounced digits; uncontrolled
discourse is not considered. The reverberation times considered are wide band, within the
frequency range of speech utterances. A multilayer feedforward neural network is trained
on speech examples with known reverberation times generated by a room simulator. The
speech signals are preprocessed by calculating short-term rms values. A second decision-
based neural network is added to improve the reliability of the predictions. In the retrieve
phase, the trained neural networks extract room reverberation times from speech signals
picked up in the rooms to an accuracy of 0.1 s. This provides an alternative to traditional
measurement methods and facilitates the occupied measurement of room reverberation times.

0 INTRODUCTION

Reverberation time (RT) is an important room acoustic parameter, which has a significant impact on the sound quality of an auditory space. Because humans are significant sound absorbers, the occupied and unoccupied reverberation times of a space normally exhibit significant differences. To obtain realistic results, and consequently give insightful information and better design guidelines, it is desirable that the measurement of the reverberation time be performed when the space is in use. However, for logistical reasons, in-use measurements using traditional methods are rarely carried out.

Traditional measurements utilize artificial test signals as the excitation, such as white noise, an impulsive excitation, or a more sophisticated maximum-length sequence (MLS). These test signals are annoying at high sound pressure levels and are not generally acceptable to listeners. Moreover, measurements need to be performed repeatedly at a number of positions in a room and therefore are very time consuming. These constraints make occupied measurements very difficult to obtain. Even when volunteers are available, the noise produced by the audience may affect the precision of the measurements. As a consequence, most of the reverberation time data available are obtained under unoccupied conditions.

Cremer and Müller [1] discussed the feasibility of extracting reverberation times from live music performances. They pointed out that it is possible to extract reverberation times from music signal segments containing significant instantaneous energy differences (that is, a high-level signal followed by a pause long enough for the reverberant decay to be measured). Due to their pulsating nature, speech utterance sequences are a rich source of instantaneous energy differences and therefore could also be useful natural sound sources for reverberation time estimation.

The measured speech signal in a room is the convolution of the speech utterances and the impulse response of the room. In theory a deconvolution algorithm or autocorrelation analysis can be used to separate the impulse response from reverberated speech and subsequently calculate the reverberation time. However, this approach needs complete information about the source signals and is sensitive to the various noises involved in the measurement. From an occupied measurement perspective, source independence and noise tolerance are of importance. Blind deconvolution and signal separation techniques have made notable advances in the past few years [2], but they have not yet developed sufficiently for this problem. The method of extracting reverberation times presented in this paper takes a statistical machine-learning approach and is considered slightly easier than an accurate separation of convolved signals, in the sense that the reverberation time is only a low-frequency subspace of room impulse responses.

* Manuscript received 2000 March 23; revised 2001 January 26.

The fact that experienced

J Audio Eng Soc, Vol 49, No 4, 2001 April 219


acousticians can make rather precise judgments concerning room reverberation times through listening to speech or music in a room suggests that the reverberation time might be obtained using cognitive models, provided that both the anechoic speech signals and the impulse responses of the room are statistically obtainable by means of machine learning through presenting a large number of training samples.

Inspired by the study of human brains, artificial neural networks (ANNs), or neural networks for short, are networks of a large number of primitive neurons. Neural networks have the capability of being trained to store and retrieve information and to generalize problems from examples by altering internal connection weights. It has been proven that such networks can be trained to map complicated nonlinear functions to an arbitrarily predefined precision, provided that adequate computing power is used [3]. Moreover, it has been demonstrated that some neural network models are powerful signal processors or filters and can extract useful information from noisy data. As a result, artificial neural networks have been widely used as artificial intelligence tools to solve classification, approximation, feature extraction, generalization, and signal processing problems in various application areas [4]. In particular, nonlinear multilayer feedforward network architectures are the most popular neural network model, used widely to solve various engineering problems. When certain short-term memory mechanisms such as tapped delay lines are used, they are also powerful in handling temporal signals such as speech [5], [6]. Bringing these thoughts together, this paper proposes using artificial neural networks to extract reverberation times from speech utterances. This novel method utilizes "natural" test signals, that is, speech, and hence facilitates occupied measurement of reverberation times. In addition, examining whether artificial neural networks are capable of modeling cognitive behavior relating to room characteristics is of considerable academic interest.

1 OVERVIEW

Fig. 1 shows a block diagram of the training process of the proposed neural network system for reverberation time extraction. Reverberated speech utterances with known reverberation times are used as training examples. The training examples are preprocessed and conditioned to yield suitable input vectors for the neural network. The reverberation times estimated by the neural network and the corresponding true reverberation times (teacher) are compared to obtain the errors. The training process consists in updating the internal synaptic weights of the neural network iteratively so that the errors between the true and the estimated reverberation times over all the training examples are minimized. In the retrieve phase, speech utterances as received by a microphone in the space are sent to the trained neural network via the preprocessor. The trained neural network then gives the reverberation time, as shown in Fig. 2.

2 REVERBERATED SPEECH UTTERANCES AND THEIR ENVELOPES

2.1 Reverberation Time, Impulse Response, and Perceived Sound Pressure in a Room

The transmission characteristics of a room can be described by its impulse response (Fig. 3). The reverberation time is defined as the period of time it takes for the sound pressure level in the room to be attenuated by 60 dB from the moment when the excitation stops. Because a signal-to-noise ratio higher than 60 dB is difficult to obtain, it is common practice to take the -5-dB to -35-dB segment of the decay and calculate the reverberation time accordingly. The reverberation time can be calculated from the impulse response using the well-known Schroeder backward integration method.

The sound m(t) perceived at a listening or measurement position can be calculated through the convolution of the source signal s(t) and the impulse response h(t) of the room,

    m(t) = s(t) * h(t)    (1)

where the room impulse response is assumed linear and possibly time variant. This convolution relationship indicates that information on the room impulse response is contained in the perceived sound signals. Research into speech recognition and enhancement suggests that

Fig. 1. Block diagram of training the neural network. Reverberated speech examples are used as input vectors; known RTs are used as teachers.

Fig. 2. Block diagram of the retrieve phase. A reverberated speech sample as received by a microphone in the space is the input vector; the trained ANN gives the estimated RT.
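The Schroeder backward integration mentioned in Section 2.1 can be sketched as follows. This is an illustrative implementation, not code from the paper: the synthetic exponential impulse response, the sampling rate, and the least-squares fit over the -5-dB to -35-dB segment are assumptions made for the sketch.

```python
import numpy as np

def schroeder_rt(h, fs):
    """Estimate RT from an impulse response via Schroeder backward
    integration and a line fit over the -5 dB to -35 dB decay segment."""
    # Backward-integrated energy decay curve (EDC), normalized to 0 dB
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    t = np.arange(len(h)) / fs
    # Fit a straight line (dB per second) to the -5..-35 dB portion
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
    return -60.0 / slope  # time to decay by 60 dB

fs = 16000
rt_true = 1.2  # s (assumed test value)
t = np.arange(int(2 * rt_true * fs)) / fs
h = np.exp(-6.9 * t / rt_true)  # idealized exponential decay envelope
print(round(schroeder_rt(h, fs), 2))  # -> 1.2
```

On this noise-free exponential decay the fitted slope recovers the true reverberation time almost exactly; on measured responses the fit range matters, which is why the -5-dB to -35-dB convention is used.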



speech signals can be treated as stochastic processes, and it is possible to utilize machine learning or artificial intelligence to identify the statistical features of speech [7], [8]. Furthermore, the reverberation time is, in fact, a statistical feature of the impulse response. Extraction of the reverberation time from speech can therefore be regarded as finding a particular statistical feature of reverberated speech, which comprises both original speech and room impulse response information. This can be achieved by machine learning, provided a large number of training samples are available and the neural network algorithm can generalize the training samples sufficiently to learn these statistical features. Consequently, to extract reverberation times from speech, the neural network must learn statistical features of both the impulse responses and the speech.

2.2 Buildup and Decay of Sound Pressure and Reverberation Time

In reverberation time extraction, only a low-frequency subspace of the reverberated speech is of interest. Envelopes of speech utterances have been used successfully to assess reverberation and other effects on speech transmission quality and speech intelligibility [9], [10]. Short-term rms values of speech signals can be used to detect the envelope of speech utterances from an energy perspective. When excited by a burst of a stationary test signal such as a tone burst or noise burst, the measured short-term rms values of the sound pressure buildup and decay caused by the switch-on and switch-off of the test signal are roughly exponential [11] (Fig. 4). The exponential rise and fall edges give information on the reverberation time.

One can simply use a ruler to measure the slopes from Fig. 4 and estimate the reverberation time through a brief calculation. Unfortunately this is an idealized situation. In reality the short-term rms sound pressure or energy as received by a measurement microphone in a room is much noisier:

1) Speech signals are stochastic processes, which do not often have constant short-term average energy such as tone bursts or band-limited white noise. Moreover, each individual word has different buildup and decay phases of its own.

2) The impulse responses of most real rooms are not exactly exponential; two or more decay rates are common.

3) Natural speech utterances happen at random instants in time and last for different periods.

4) The energy in a room may not have decreased to an insignificantly low value when new excitations start.

5) Ambient and other noises caused by the measurement system and produced by the audience may be significant, particularly in occupied measurements.

Fig. 5 shows an anechoic and two reverberated versions of the speech utterances "one, two, three, four," read by a narrator, and their rms envelopes. From these signatures the following can be seen:

1) The slopes of the rise and fall edges are related to the reverberation time. The longer the reverberation time, the slower the rise and decay.

2) Different speech utterances (dry speech) may have slightly different rise and fall slopes.

Fig. 4. Idealized buildup and decay of sound pressure in a room. Near-exponential edges are obtained under idealized conditions.

Fig. 3. Example of a simulated impulse response. Plot generated by the stochastic model described in Section 4.1.



3) The slopes, though indicating reverberation times, are very noisy.

An estimation of the reverberation time through simple examination of the slopes of the buildup and decay edges using a ruler is impossible in the case of real speech. Neural networks, however, are inherently capable of extracting features from noisy signals. The tasks for the neural networks are twofold. First, the neural networks are expected to generalize impulse responses so that, once trained on a closed set of examples, they can correctly identify the reverberation times of rooms having any possible impulse responses in the retrieve phase. Second, the neural networks are also expected to learn the short-term energy variation of speech utterances through examples and to perform averaging and generalization, so that the trained neural networks have some robustness to different utterances, leading eventually to source-independent measurement.

3 NEURAL NETWORK ARCHITECTURE AND DATA PREPROCESSOR

3.1 Architecture of the Neural Network

The neural network architecture for extracting reverberation times is illustrated in Fig. 6. It is a multilayered feedforward network with two hidden nonlinear layers. The input layer is simply an interface between the external data and the internal nonlinear neurons. In this particular case, the input layer bridges and distributes the output from the preprocessor to the first nonlinear hidden layer without additional processing of the data. In theory, the more nonlinear neurons are used, the more powerful the network is. However, research results indicate that in most cases a network with two nonlinear layers is a wise choice, in the sense that excessive nonlinear layers may degrade the learning speed and make training difficult due to the presence of too many local minima and the reduced speed of error propagation [4]. In this project, empirical comparison has found suitable a neural network that has 40 neurons on the input layer, 20 nonlinear neurons on the first hidden layer, 5 nonlinear neurons on the second nonlinear layer, and one linear summation output neuron. All the neurons in the nonlinear hidden layers have the same structure, as depicted in Fig. 7 (the ith neuron). Each of these neurons has two parts, a linear summation basis function u(.) and a nonlinear activation function f(.). The nonlinear activation function f(.) is a sigmoid:

    a_i = 1 / (1 + exp(-u_i))    (2)

and the basis function used is a linear combination of all the inputs to the neuron and the bias of the neuron b_i,

    u_i(w, x) = sum_{j=1..n} w_ij x_j + b_i.    (3)

The notation w_ij represents the connection weight from the jth neuron to the ith neuron. The output layer has only one node and is a linear combination processor, that is, a neuron without a nonlinear activation function. The dynamic equations of such a network can therefore be expressed as

    u_i(l) = sum_{j=1..N_(l-1)} w_ij(l) a_j(l-1) + b_i(l)    (4)

    a_i(l) = f(u_i(l)),   1 <= i <= N_l,   1 <= l <= L    (5)

where l represents the layer number (the input layer is denoted as layer 0), and the input to the neural network is uniformly expressed as a_i(0). The training process of such a network can be carried out using the so-called back-propagation method [12], [13]. Back-propagation is a gradient-based optimization method and shows reliable convergence behavior, a tolerable convergence speed, and good generalization capability. The back-propagation method of training a neural network is to apply different input data (examples) continuously onto

Fig. 5. Signatures of anechoic and reverberated speech utterances and their rms envelopes (amplitudes normalized to their maximum values for visual comparison). (a) RT = 0. (b) RT = 1 s. (c) RT = 4 s. Much noisier edges are found in the envelopes of reverberated speech.



the input layer of the network and compare the output of the neural network with the value of the teacher (true value) to obtain the error. The connection weights inside the neural network are then adjusted so as to minimize the overall errors between the outputs of the neural network and the teachers over all the training examples. The particular rule of training the network is to update the connection weights according to a gradient-type learning formula,

    w_ij^(m+1)(l) = w_ij^(m)(l) + dw_ij^(m)(l)    (6)

with the mth training sample a^(m)(0) and corresponding teacher t^(m) pairs, so as to minimize the energy function E defined by

    E = (1/2) sum_{m=1..M} [t^(m) - out^(m)]^2.    (7)

The back-propagation training process follows a chain rule:

    dw_ij^(m)(l) = -eta dE/dw_ij^(m)(l)
                 = -eta [dE/da_i^(m)(l)] [da_i^(m)(l)/dw_ij^(m)(l)]
                 = eta delta_i^(m)(l) f'(u_i^(m)(l)) a_j^(m)(l-1)    (8)

where the error signal delta_i^(m)(l) is defined as

    delta_i^(m)(l) = -dE/da_i^(m)(l)    (9)

and the scalar eta is the learning rate.

For simplicity, the bias value b_i is included in the weight vector by fixing the corresponding input at 1 and using an extra weight w_i0 to adjust it as appropriate. Thus the connection weights and biases can be treated identically in training.

3.2 Input Data and Preprocessing

The data preprocessor is designed to perform four functions:

1) Normalization of the signal energy, so that the signal presentation is independent of the input level.
2) Implementation of a short-term memory mechanism.
3) Detection of the short-term average rms value, or envelope, of the speech.
4) Conversion of the input vector to a format suitable for the neural network.

Speech signals are temporal sequences. A multitapped delay line with short-term rms value detectors is used to detect the short-term average energy changes of the perceived sounds, as illustrated in Fig. 8. In this application, 2000-point overlapped rms detectors, providing 10 rms values per second, are found adequate for speech signals sampled at 16 kHz. This is equivalent to monitoring fluctuations lower than 5 Hz in a speech envelope. For a better resolution in differentiating early reflections, more rms detectors may be considered. As the problem is to extract the rate of change, it is natural to place a differential preprocessor between the rms detectors and the input layer. This can be written as

    y(i) = x(i) - x(i-1).    (10)

In this study unfiltered speech utterances are sent to this preprocessor. This leads to the wide-band reverberation time, which covers the frequency range of speech utterances. For narrow-band reverberation time extraction, say, one-octave-band reverberation time, appropri-

Fig. 6. Neural network architecture. A multilayer feedforward network structure with two hidden nonlinear layers is used.

Fig. 7. Nonlinear neuron model. A linear summation basis function u(.) and a sigmoid activation function f(.) form a nonlinear neuron in the hidden layers.

Fig. 8. Block diagram of the preprocessor. A tapped delay line followed by rms detectors is used to preprocess the received speech signals. The preprocessor output is fed into the ANN.



ate band-pass filters may be applied to the speech signal, and a related narrow-band reverberation time used as the teacher in the training phase. However, the limitation on extracting the reverberation time from speech is that only reverberation parameters within the frequency range of speech utterances can be extracted.

3.3 Capability of the Neural Network

Neural networks can be used as classifiers or approximators. Neural network-based classifiers have found wide application in pattern recognition. In these applications, a common approach is to use one decision-based subnet for each particular pattern, known as the one-class-in-one-network (OCON) model [4]. If reverberation time extraction is treated as a kind of classification problem (that is, a different subnet for a particular reverberation time), a large number of subnets will be needed. Another disadvantage is that the network will not be able to output continuous values of reverberation time. Therefore an approximation-based network and a corresponding training strategy are favored for the reverberation time extraction problem.

To identify the capability of the proposed neural network model, preliminary tests are carried out prior to training the network tediously on real speech signals and impulse responses. Extracting the rate of change from randomly positioned exponential segments is first studied. The training and test examples are simply generated by the convolution of pulse trains and exponential decay functions, as illustrated in Fig. 9. This is in analogy to the idealized noise-free situation described in Section 2.1. Equivalent reverberation times according to the equation [14]

    h(t) = exp(-6.9t/RT)    (11)

are used as sample-teacher pairs to train the neural network.

Various comparisons are made. Investigation showed that such a network architecture is capable of extracting exponential rates of change. However, the neural model appears to be sensitive to the positions of the input pulses and not completely tolerant of misalignment. This finding is similar to the misalignment-sensitive problem identified in electrocardiogram recognition, as reported in [15], [16]. As a solution to this problem, an alignment algorithm was used in the application of ECG recognition [4]. Similarly, a speech utterance alignment algorithm that is used widely in speech recognition can be adopted in the application of reverberation time extraction. As an alternative to the alignment algorithm, an additional decision-based network is proposed here as a "watchdog." The watchdog network has an architecture similar to the main network, but a reduced network size and a modified output neuron. It has 40, 10, 5, and 1 neurons on the input, first nonlinear, second nonlinear, and output layers, respectively. The output neuron is nonlinear, having only two output states, 0 and 1.

Training of the additional neural network watchdog consists of two steps. The first step is to train the main network to be partially misalignment tolerant. Examples with certain amounts of position and width changes are included in the training set. Pulse widths and spaces between two pulses, as shown in Fig. 9, uniformly distributed from 0.35 to 0.65 s, are included in the training set. Up to this stage, when tested, the main network can respond correctly to examples that appear in the training set. However, when pulse widths and positions are significantly different from the cases included in the training set, large errors occur (more than 5 s in the worst case). The second step is to train the watchdog to identify input examples that exceed the misalignment tolerance of the main network. Fig. 10 is a block diagram showing the training method of the watchdog. Training examples are generated by the convolution of pulse trains with random positions and pulse widths and exponential decay functions. Each input datum from the data set is sent simultaneously to the trained main network and the watchdog. The output of the trained main network and the expected output are compared. If an error is smaller

Fig. 10. Training of the watchdog decision-based neural network. The watchdog is trained to identify the input cases from which the main network can satisfactorily extract RTs.

Fig. 9. Patterns of (a) pulse train and (b) exponential decay function.
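The watchdog's teacher-generation rule can be sketched as follows. The pulse widths and spacings (0.35 to 0.65 s) and the 0.25-s tolerance follow the text; `main_net` is a hypothetical stand-in for the trained approximation network, and the reduced sampling rate is an assumption made to keep the sketch small.

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 1000  # envelope-domain sampling rate for this sketch (assumption)

def make_example(rt):
    """Convolve a random pulse train with exp(-6.9 t / RT), cf. Eq. (11), Fig. 9."""
    pulses = np.zeros(4 * fs)
    pos = 0
    while pos < len(pulses) - fs:
        width = int(rng.uniform(0.35, 0.65) * fs)         # widths as in training set
        pulses[pos:pos + width] = 1.0
        pos += width + int(rng.uniform(0.35, 0.65) * fs)  # spacing between pulses
    t = np.arange(fs) / fs
    return np.convolve(pulses, np.exp(-6.9 * t / rt))[:len(pulses)]

def main_net(example):
    """Hypothetical stand-in for the trained approximation network."""
    return 1.0 + 0.1 * rng.standard_normal()

def watchdog_teacher(example, true_rt, tol=0.25):
    """Teacher value: 1 if the main network's RT error is within tol, else 0."""
    return 1 if abs(main_net(example) - true_rt) <= tol else 0

labels = [watchdog_teacher(make_example(1.0), 1.0) for _ in range(20)]
print(sum(labels) >= 15)  # most stand-in outputs fall inside the 0.25-s tolerance
```

In the retrieve phase the trained watchdog plays the same role with its own learned decision: a 0 filters the input out, a 1 enables the main network's output.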



than a predefined value, say 0.25 s in this example, a teacher value of 1 for training the watchdog network is generated; otherwise a teacher value of 0 is generated. In this way the watchdog is trained to detect those input patterns that the main network cannot process correctly. In the retrieve phase, the watchdog monitors the input and decides whether the main network can process it correctly. For an input that may cause intolerable error, the watchdog network will output a 0, indicating that this input case should be filtered out; otherwise it produces a 1 to enable the main network's output. It is worth noting that the watchdog approach can filter out not only badly misaligned input examples but also other input cases to which the main network cannot respond correctly, as long as those cases are included in the training of the watchdog. As a result, the effect of applying such a watchdog to the main network is an improved overall reliability of the neural network-based reverberation extraction method.

Tests show that the watchdog is effective in dealing with misaligned inputs. Fig. 11 shows the error distribution of 1000 outputs of the trained network when the watchdog is applied. Test examples are generated by the convolution of pulse trains with random positions and widths and exponential decay functions. All the output values that can pass through the watchdog are within the predefined error tolerance range (in this test, 0.25 s).

4 TRAINING NEURAL NETWORKS ON CONTROLLED SPEECH

In the preceding section the capability of the proposed neural networks was provisionally tested by training the networks on idealized training and validation examples. This is based on tremendously simplified models. In reality, due to the complexity of room impulses and speech utterances, reverberated speech utterances as measured in rooms are much "noisier." This section focuses on training the neural network on real speech examples.

4.1 Generation of Training and Testing Samples

To apply supervised neural network models to any practical problem, it is essential that the required sets of training and testing data be available. Real room sampling and measurement is a convincing way to obtain the required data. However, it is almost impossible to obtain the vast amount of training data by real room measurements. Moreover, the objective is to enable in-use measurement, which virtually rules out using existing measurement methods to obtain training samples. Therefore artificially generated examples are considered as alternatives. Important concerns of most existing artificial reverberation algorithms are natural sound effects and real-time implementation capability [17]-[20]. The physical processes and the mechanism of the room response are simplified to ensure real-time implementation. These algorithms may be used to produce a reverberation effect that sounds good, but the output they produce can be quite different from what is measured in real rooms. For example, the density of the reflections in an impulse response generated by an artificial reverberator is usually much lower than that appearing in real measurements. Another possible approach to generating examples is to use room simulation techniques such as image source and ray tracing methods. However, they still have various limitations. To meet the needs of training the neural network, it is required that the room acoustic simulation algorithm incorporate the related physical processes and parameters, such as the frequency features of reflective surfaces, and the algorithm should be able to generate a superset of all possible kinds of impulse responses in reality. Hence, a stochastic diffuse-field model is considered.

In a diffuse field, the temporal density of reflections in a room is dominated by [21]

    dN/dt = 4 pi c^3 t^2 / V.    (12)

Stochastic sequences with probability density functions shaped to fit Eq. (12) are used to decide the time instants when reflections occur. The amplitude of each reflection is dominated by three elements: 1) attenuation due to spherical spreading, 2) attenuation depending on the number of surface reflections and the impedance of these surfaces, and 3) air attenuation. Based on the known frequency-dependent absorption coefficients, reflections from surfaces are modeled as different filters. This simulator has two advantages in this particular application. First, it incorporates the frequency-dependent features of room impulse responses. Second, it can generate a superset of realistic impulse responses found in any normal space, provided that it is run a sufficiently large number of times. The simulated impulse responses are strictly examined and are found sufficiently realistic to meet the requirements of this study. However, due to the limitations of the diffuse-field model, spaces with special shapes such as coupled rooms may not be well reflected in the data set.

Fig. 11. Test result when the additional decision-based network is used. All outputs, when the watchdog is applied, are within the region of the predefined error tolerance.
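Eq. (12) implies that the expected number of reflections up to time t grows as N(t) = 4 pi c^3 t^3 / (3V), so reflection instants can be drawn by inverse-transform sampling. The sketch below illustrates only this one aspect of the simulator: the room volume, time window, and Poisson-distributed count are assumptions, and the amplitude and filter modeling described above is omitted.

```python
import numpy as np

def reflection_times(V, T, c=343.0, rng=np.random.default_rng(3)):
    """Sample reflection instants on [0, T] with density dN/dt = 4*pi*c^3*t^2 / V."""
    expected = 4.0 * np.pi * c**3 * T**3 / (3.0 * V)   # integral of Eq. (12) over [0, T]
    n = rng.poisson(expected)                          # assumed Poisson reflection count
    # CDF of the normalized density is (t/T)^3, so invert: t = T * u^(1/3)
    return np.sort(T * rng.random(n) ** (1.0 / 3.0))

times = reflection_times(V=200.0, T=0.05)  # 200-m^3 room, first 50 ms (assumed values)
early = int(np.sum(times < 0.025))
late = int(np.sum(times >= 0.025))
print(late > early)  # quadratic density: most reflections arrive in the later half
```

The quadratic growth of the density is what makes late reflection fields dense enough to be treated statistically, which is the premise of the diffuse-field model.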



COX ET AL PAPERS

Anechoic speech utterances "one," "two," and "three" read by a female narrator are used. The convolution of the anechoic speech utterances and 10 000 simulated room impulse responses with reverberation times from 0.1 to 5 s is performed to generate a large data set. Half of the data set is used as a training set, and the second half is used as a validation set. As generalization is an important concern in this application, validation is performed to test the trained network. The test patterns used for validation purposes are strictly selected from those patterns that have not been encountered in the training phase. When testing the robustness of the trained neural networks to speech utterances, two additional words, "four" and "five," are used in the experiment. A sufficiently large data set (10 000 examples) is used to ensure more rigorous validation tests.

4.2 Training

Training on speech utterances and simulated impulse responses is performed using a quasi-block-adaptive method. This is a balance between an aggressive data-adaptive method and a prudent block-adaptive method. The data-adaptive method updates the connection weights in a neural network based on one example-teacher pair, and therefore the weights are updated in each iteration. In some applications this method exhibits faster convergence than the block-adaptive method. However, it lacks consistent numerical robustness. The block-adaptive method does not update the weights until the entire block of example-teacher pairs has been presented. In this study the so-called quasi-block-adaptive method works in such a way that weight updating happens after 50 training example pairs of different reverberation times have been presented. In particular, one quasi-block comprises examples with all the reverberation time values ranging from 0.1 to 5 s at steps of 0.1 s. The example for each individual reverberation value in a quasi-block is chosen randomly from the training set. This is, in fact, a random selection of utterance sequences and impulse response types having identical reverberation times. The training algorithm follows Eqs. (6)-(9) in Section 3.1.

The learning rate η in Eq. (8) is an important concern in training a neural network. In a machine-learning context, the concept of learning rate is similar to the step size in many numerical algorithms. Training a neural network using back-propagation is, in fact, a search for a global minimum using a gradient-based numerical algorithm. In general, a larger learning rate leads to a faster training process, but an excessively large learning rate may cause divergence of the algorithm and is likely to skip important minimum points in the error space. In contrast, a smaller learning rate results in a slow training process. Another danger of using too small a learning rate is that the training is much more likely to terminate at certain local minimum points. The suitable learning rate for this application was explored empirically; a learning rate of 0.05 to 0.08 was found suitable.

The algorithms are first coded in Matlab for an initial trial and then recoded in C language to gain the full potential of the computing power. Training the neural networks is quite a tedious job: it takes about one week on a 333-MHz Pentium II desktop PC to train the proposed neural network. To identify a suitable learning rate and the initial values for the weights, several trials and restarts are often inevitable. Fortunately, training is a one-off process in developing the neural network models for this application. Once the neural networks are trained and validated, the connection weights become constant values in the networks, so in the retrieve phase the neural networks respond very quickly to any input cases and extract room acoustic parameters in real time.

4.3 Robustness to Different Impulse Responses

In the experiment, the single-syllable utterances "one," "two," and "three," read by a female narrator, are recorded in an anechoic chamber at a sampling frequency of 16 kHz. The average length of each utterance is about 0.4 s. The source used to carry out the test is the 27 possible combinations of these three utterances, that is, "one ... two ... three," "one ... three ... two," "two ... one ... three," ..., "three ... three ... three." The time intervals between these single utterances are on average 0.5 s. To tackle the time misalignment problem mentioned before, time-shifted patterns are included in the training set. The time shifting for the spaces between two words is randomly arranged from 0.4 to 0.6 s. The learning process is illustrated in Fig. 12, where the y axis, according to Eq. (7), represents the summed squared errors between the teachers and the outputs of the neural network over 50 randomly chosen samples from the validation set. It can be seen that the mean squared error, when tested using data from the validation set, decreases continuously as the number of training iterations increases. After 1 million iterations, for any arbitrary test data (data from the validation set), the error between the outputs of the neural network and the actual reverberation times is less than ±0.048 s. Fig. 13 shows the worst case of the test results, when the trained neural networks are tested using the data from the validation set.

Fig. 12. Learning process versus training iteration. As the number of training iterations increases, the squared errors between outputs and teachers decrease gradually.

Fig. 13. Worst cases of neural network estimation for 5000 different validation tests. (a) RTs. (b) EDTs.

The training and testing process has already incorporated the time-shifted samples, and therefore the trained network shows a certain tolerance to the time shifting of the input signals. However, the time shift included in the training examples is not adequate to deal with uncontrolled speech; a preprocessor that aligns the measured speech utterances is still needed. The approach of using a decision-based neural network as a watchdog, discussed in the preceding section, is adopted. After the main network is trained, speech utterance sequences with arbitrary mute interval lengths (in this study 0.2-0.8 s) are used to test the trained main network. If the output error exceeds a predefined threshold of 0.1 s, a logic 0 is presented as the teacher of the watchdog network; otherwise a logic 1 appears. In this way the watchdog network is trained to make the decision as to whether the current reverberation time is identifiable by the main neural network. In the retrieve phase the watchdog network is used as a prefilter: it prevents those inputs that the main network cannot identify from being sent to the main network. Tests are performed using randomly spaced utterances (spaces between utterances vary from 0.2 to 0.8 s) as inputs to produce 200 output data. Six output values exceed the predefined 0.1-s error tolerance, giving a correctness of 97%. Ten real room measurement samples are used to further test the trained neural network. These samples are obtained by the convolution of the speech utterances used to train the neural network and real room impulse responses. The errors of the neural network estimation over the 10 samples (reverberation times from 0.5 to 2.2 s) are all within ±0.1 s.

4.4 Source Independence

It is desirable that the extraction of reverberation times be source independent, so that reverberation times can be extracted from any speech utterance rather than a fixed set of utterances included in the training set. New utterances "four" and "five" are added to the validation set to test the independence of the trained neural networks on utterances, and the results of a preliminary study of source independence are presented in Table 1.

Table 1. Test results when new words are used.

                  With One New Utterance           With Two New Utterances
Expected       NN Predicted     Percentage      NN Predicted     Percentage
Value (s)      Value (s)        Error (%)       Value (s)        Error (%)
0.50           0.50             --              0.50             --
1.00           1.03             3.0             1.00             --
1.50           1.58             5.3             1.46             2.7
2.00           2.59             30              2.95             48
2.50           3.26             30              4.29             72
3.00           4.07             36              4.64             55
3.50           5.38             53              4.10             17
4.00           3.76             6               5.04             26
4.50           4.24             6               5.52             23
5.00           6.23             25              6.05             21

For each reverberation time within the range of 0.5 to
5 s, with a step size of 0.5 s, 100 different tests are performed, and the errors listed in Table 1 are the worst cases. It can be seen that the trained neural network does show a certain independence from the utterances used, but in the middle of the reverberation time range unacceptably large errors occur. It is anticipated that when more utterance samples are presented in the training set, the neural network should learn the characteristics of speech utterances better and average them out to achieve better source-independent features.

4.5 Training on Early Decay Time

Early decay time (EDT) is another important objective acoustic parameter. It focuses on the initial 10-dB decay period and has been shown to correlate better with subjective responses in reverberant fields with continuous signals. The proposed method is also applied to extract the early decay time from speech utterances in an identical way, but the input layer of the approximation network is extended to have 60 neurons. This is to ensure that the early decay part of the examples can be sufficiently sensed by the neural networks. Similar results are obtained. The maximum error of the early decay time extraction, when tested using the words that appeared in the training set, is 0.056 s [see Fig. 13(b)].

5 DISCUSSION AND CONCLUSIVE REMARKS

This paper has presented a novel method to extract room reverberation times from single-syllable speech utterances (in this case, pronounced digits) using artificial neural networks. The limitation of the approximation-based multilayer feedforward network in this application can be circumvented by using a second, decision-based watchdog neural network. The early decay time of rooms can be extracted from speech utterances in a similar fashion. In this study wide-band reverberation parameters within the frequency range of speech utterances are considered. Although frequency-dependent reverberation times can be extracted with added band-pass filters, the method is restricted to the estimation of reverberation parameters within the frequency range of speech utterances.

Two important concerns of the proposed method, robustness to different impulse responses and source independence, have been addressed. Test results show that the proposed neural networks can sufficiently generalize over different impulse responses and therefore are robust to rooms with different types of impulse responses. When using utterances included in the training phase as excitations, the trained neural networks can achieve better than 0.1-s resolution in reverberation time measurements. This meets the precision requirement of practical use and can be implemented by playing back prerecorded anechoic speech utterances as excitations. An initial investigation also reveals that the proposed method has certain source-independent features. When two out of three words in the excitations are new words that have never appeared in the training set, the trained neural networks can still give coarse estimations of reverberation times. It is known that the envelope spectrum of anechoic speech taken over a long period (40 s) is relatively constant [9]. Reverberation modifies the envelopes of speech signals according to different reverberation times. The difference between the envelopes of anechoic speech and reverberated speech provides a modulation transfer function (MTF) from which reverberation times and a speech transmission index (STI) can be extracted. This indicates that the source-independence feature of the proposed method could be improved provided that longer speech sequences and more words are used in both the training and the retrieve phases. Feeding long-term envelope spectra of reverberated speech into the neural network could be a practical solution, as this enables averaged information about the speech envelopes over a sufficiently long period to be sent to neural networks of reasonable size.

The simulator used to generate training and validation examples is based on a stochastic model of diffuse fields. Although it is expected to generate any possible impulse response (with various early reflection patterns) to form a superset of realistic impulse responses using the stochastic model, some acoustic conditions, such as coupled rooms, may not be well reflected in the data set. It is believed that this system would be more useful when more complicated training examples are included.

As a first step toward extracting room acoustic parameters from speech using artificial neural networks, this study has demonstrated the feasibility and potential of this new approach. As an alternative to traditional measurement methods, this new method can facilitate occupied measurements of reverberation times and early decay times. Moreover, once the networks are trained and the connection weights obtained, the implementation of such a neural network-based measurement system is straightforward and can be realized on a hardware platform at low cost. This may lead to the development of a new type of instrumentation. The method presented here may also be useful in robust speech recognition: the estimated reverberation times provide important information for dereverberation. As a long-term objective, the authors are investigating novel approaches to realize source-independent room acoustic parameter identification. This may enable reverberation times to be extracted from nonrestricted speech.

6 ACKNOWLEDGMENT

The authors would like to acknowledge the support of the Engineering and Physical Sciences Research Council, UK (EPSRC grant reference GR/L89280) in funding this project. Inspiring discussions with Y. W. Lam, W. J. Davies, and M. West are gratefully appreciated.

7 REFERENCES

[1] L. Cremer and H. Muller, Principles and Applications of Room Acoustics, T. Schultz, Transl., vol. 1 (Applied Science, London, 1978), p. 194.

[2] H. Attias and C. E. Schreiner, "Blind Source Separation and Deconvolution: The Dynamic Component Analysis Algorithm," Neural Computation, vol. 10, pp. 1373-1424 (1998).
[3] G. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314 (1989).
[4] S. Y. Kung, Digital Neural Network, Information and System Science ser. (Prentice-Hall, Englewood Cliffs, NJ, 1993).
[5] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-37, pp. 328-339 (1989).
[6] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. (Prentice-Hall, Englewood Cliffs, NJ, 1999).
[7] Y. Ephraim, D. Malah, et al., "On the Application of Hidden Markov Models for Enhancing Noisy Speech," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-37, pp. 1846-1856 (1989 Dec.).
[8] B. Juang and L. R. Rabiner, "Mixture Autoregressive Hidden Markov Model for Speech Signals," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, pp. 1404-1413 (1985 Dec.).
[9] T. Houtgast and H. J. M. Steeneken, "Envelope Spectrum and Intelligibility of Speech in Enclosures," in Proc. IEEE-AFCRL 1972 Speech Conf., pp. 392-395.
[10] H. J. M. Steeneken and T. Houtgast, "A Physical Method for Measuring Speech Transmission Quality," J. Acoust. Soc. Am., vol. 67, pp. 318-326 (1980 Jan.).
[11] M. Barron, Auditorium Acoustics and Architectural Design (E & FN Spon, Chapman & Hall, London, 1998).
[12] D. E. Rumelhart, G. Hinton, et al., "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1 (MIT Press, Cambridge, MA, 1986), pp. 318-362.
[13] M. Riedmiller, "Advanced Supervised Learning in Multi-layer Perceptrons--From Back Propagation to Adaptive Algorithms," Int. J. Computer Standards and Interfaces, special issue on neural networks, vol. 16, pp. 265-278 (1994).
[14] M. Schroeder, "Modulation Transfer Functions: Definition and Measurement," Acustica, vol. 49, pp. 179-182 (1981).
[15] Y. H. Hu, W. J. Thompkins, and Q. Xue, "Artificial Neural Networks for ECG Arrhythmia Monitoring," in Neural Networks for Signal Processing, vol. 2, S. Y. Kung, F. Fallside, J. A. Sorensen, and C. A. Kamm, Eds., Proc. 1992 IEEE Workshop (Helsingoer, Denmark, 1992), pp. 350-359.
[16] J. S. Taur and S. Y. Kung, "Prediction-Based Networks with ECG Application," in Proc. IEEE Int. Conf. on Neural Networks (San Francisco, CA, 1993), pp. 1920-1925.
[17] M. R. Schroeder and B. F. Logan, "'Colorless' Artificial Reverberation," J. Audio Eng. Soc., vol. 9, pp. 192-197 (1961 July).
[18] M. R. Schroeder, "Natural-Sounding Artificial Reverberation," J. Audio Eng. Soc., vol. 10, pp. 219-223 (1962 July).
[19] J. Dattorro, "Effect Design, Part 1: Reverberator and Other Filters," J. Audio Eng. Soc., vol. 45, pp. 660-684 (1997 Sept.).
[20] K. H. Kuttruff, "Auralization of Impulse Responses Modeled on the Basis of Ray-Tracing Results," J. Audio Eng. Soc., vol. 41, pp. 876-880 (1993 Nov.).
[21] H. Kuttruff, Room Acoustics (Elsevier Science Publishers, England, 1991).

THE AUTHORS

T. J. Cox        F. Li        P. Darlington

Trevor Cox is a senior lecturer at the School of Acoustics and Electronic Engineering at the University of Salford. His research and teaching interests center on room acoustics and digital signal processing. He is also tutor of the M.Sc. course in audio acoustics. The majority of Dr. Cox's research work is in the area of measurement, prediction, design, and characterization of room acoustic diffusers. He has numerous journal publications on diffusers, and his designs have been used by RPG Diffusor Systems, Inc. in a variety of spaces worldwide. He currently sits on two standards working groups concerning the characterization of diffusion, including AES SC-04-02, of which he is vice-chair.

Dr. Cox's interest in neural networks began with his work on modeling nonlinear blast noise propagation. This paper represents some of his interests in applying neural networks to problems in room acoustics. His research work also includes more conventional numerical

techniques, such as boundary element methods, and the subjective evaluation of performance space acoustics using quantitative and qualitative methods.

Dr. Cox is a member of the Audio Engineering Society and the Institute of Acoustics in the UK. He is an associate editor of Acustica united with Acta Acustica, and is also on the editorial board of the Institute of Acoustics Bulletin.

Francis Feng Li was born in 1963 in Shanghai, China, and holds a bachelor's degree in electronics and computer engineering and a master of philosophy in computer modeling of EMC. Before he went to the UK, he spent ten years on the academic staff at Shanghai University, teaching a variety of courses and conducting research in his field. He is currently a research assistant to Dr. Trevor Cox and a Ph.D. candidate near completion. Mr. Li has a broad teaching and research interest in the areas of computational intelligence, digital signal processing, acoustics, audio engineering, numerical methods, and instrumentation.

Paul Darlington is a director of Apply Dynamics Ltd., a company that provides consultancy, design, and training services in audio, acoustics, and electroacoustics. After receiving an undergraduate degree from the Institute of Sound and Vibration Research, University of Southampton, Dr. Darlington studied for a master's degree in electronics at the same university. He was then awarded a Ph.D. for his research into adaptive filters, conducted at ISVR. He lectured at the University of Wyoming for two years and for the next ten years was lecturer and senior lecturer in the Department of Acoustics and Audio Engineering, University of Salford.

Dr. Darlington's work is motivated by those engineering applications that involve both electronics and acoustics. He has worked in the areas of control, communications, instrumentation, electroacoustics, and signal processing.
